Using YQL to grab HN links (yahooapis.com)
54 points by chaosmachine on May 2, 2009 | 20 comments


The query is:

  select * from html where url="https://hackernews.hn/" and
  xpath='//tr/td/a[substring(@href,1,4)="http"][@href!="http://ycombinator.com"]'
You can play with it yourself here (needs a Yahoo login):

http://developer.yahoo.com/yql/console/
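
If you'd rather script it than use the console, YQL also exposes a public REST endpoint on query.yahooapis.com. Here's a minimal Python 2 sketch (to match the BeautifulSoup example further down); the response shape, in particular `data['query']['results']['a']`, is my assumption about how YQL serializes the matched anchors:

  import json
  import urllib
  import urllib2

  # The same query as in the parent comment.
  query = ('select * from html where url="https://hackernews.hn/" and '
           'xpath=\'//tr/td/a[substring(@href,1,4)="http"]'
           '[@href!="http://ycombinator.com"]\'')

  params = urllib.urlencode({'q': query, 'format': 'json'})
  url = 'http://query.yahooapis.com/v1/public/yql?' + params

  # Assumed response shape: query -> results -> a (a list of anchor dicts).
  data = json.load(urllib2.urlopen(url))
  for link in data['query']['results']['a']:
      print link['href']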


You know, honestly, I wasn't grokking much about how cool YQL could be until I saw this example. This helps a lot. Thanks.


I don't get it. Don't queries like that suffer from the same problems as normal screen scraping?


Yeah, they're obviously brittle, but it's baby steps, y'know?

I think it's pretty crazy that you can now scrape well-marked pages with a SQL-like syntax.


YQL is only really usable server-side until they add an `Access-Control-Allow-Origin: *` header. I'm pretty disappointed that they effectively disallow JavaScript (XHR) access from the browser.


It offers JSONP instead, so you can access any YQL content from JavaScript in the browser via script nodes.
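
To see what JSONP actually returns (sticking with Python, as elsewhere in this thread): adding a `callback` parameter makes YQL wrap the JSON payload in a function call, which is exactly what a dynamically inserted script node executes. The callback name `handle` below is just a placeholder:

  import urllib
  import urllib2

  # 'handle' is a made-up callback name; the endpoint is YQL's public one.
  params = urllib.urlencode({
      'q': 'select * from html where url="https://hackernews.hn/"',
      'format': 'json',
      'callback': 'handle',
  })
  body = urllib2.urlopen('http://query.yahooapis.com/v1/public/yql?' + params).read()
  print body[:30]  # something like: handle({"query":{"count":...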


You still have to give Yahoo! complete control with that method. With XHR, Yahoo! has no control over your webpage.


It only works in a few browsers, but it doesn't seem like it would hurt anything to add. Thanks for the suggestion!


I love how HN is the only site where "[x] would be cool" turns into a feature request implemented by the vendor.


Nah, we do the same thing for things suggested on Twitter :)


Just spent 20 minutes trying to grab all article links from a newspaper's website based on a URL pattern. Failed. The docs could use some work.

Anyone know how to do this?


For another approach, with a structure editor, check out http://parselets.com


Which newspaper website was it?


In Python it would be like this (get BeautifulSoup first):

  # Python 2; requires the BeautifulSoup package.
  import urllib2
  from BeautifulSoup import BeautifulSoup

  ychtml = urllib2.urlopen('https://hackernews.hn/').read()
  # Every story link on the front page sits in a <td class="title">.
  for tdtitle in BeautifulSoup(ychtml).findAll("td", "title"):
      print tdtitle.a
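
Adapting that to the newspaper question above, a rough sketch: filter the anchors with a URL regex. The newspaper domain and the date-based article pattern below are made-up placeholders:

  import re
  import urllib2
  from BeautifulSoup import BeautifulSoup

  # Placeholder site and article-URL pattern; substitute the real ones.
  html = urllib2.urlopen('http://newspaper.example.com/').read()
  article = re.compile(r'^http://newspaper\.example\.com/2009/')
  for a in BeautifulSoup(html).findAll('a', href=article):
      print a['href']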


If you use Linux or some other Unix, you can also do it with standard Unix tools:

  wget -O- https://hackernews.hn/ | grep -o 'http[^"]*'

Personally, I prefer curl because it writes to stdout by default:

  curl https://hackernews.hn/ | grep -o 'http[^"]*'

(After posting this, I noticed that HN cuts * signs at the end of a message. So I have to add this text here, or the last * would not be displayed.)


This shows all URLs on HN, including images etc., whereas the original post demonstrates retrieving only the linked articles.


You can filter that with another grep, for example:

  wget -O- https://hackernews.hn/ | grep -o 'title"><a href="[^"]*' | grep -o 'http.*'


Does it first convert the page to valid XHTML and then apply the XPath, or does it rely on the website being well-formed?


It works against markup that isn't well-formed.
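
For what it's worth, that mirrors how tag-soup parsers behave. As a rough Python parallel (an analogy only, not YQL's actual implementation), BeautifulSoup also repairs unclosed tags:

  from BeautifulSoup import BeautifulSoup

  # Unclosed <tr>, <td>, and <a> tags; the parser still builds a usable tree.
  soup = BeautifulSoup('<tr><td class="title"><a href="http://example.com">hi')
  print soup.find('a')['href']  # -> http://example.com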


I actually attended the talk on YQL at Barcamp Portland and it seems really powerful.

You are basically able to use Yahoo's cloud servers and huge internet pipe for free with this service.



