Using YQL to grab HN links (yahooapis.com)
54 points by chaosmachine on May 2, 2009 | 20 comments


The query is:

  select * from html where url="https://hackernews.hn/" and
  xpath='//tr/td/a[substring(@href,1,4)="http"][@href!="http://ycombinator.com"]'
You can play with it yourself here (needs a Yahoo login):

http://developer.yahoo.com/yql/console/
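
If you'd rather script it than use the console, YQL also exposes a public REST endpoint on query.yahooapis.com. Here's a minimal Python 2 sketch (to match the BeautifulSoup example further down); the response shape, in particular `data['query']['results']['a']`, is my assumption about how YQL serializes the matched anchors:

  import json
  import urllib
  import urllib2

  # The same query as in the parent comment.
  query = ('select * from html where url="https://hackernews.hn/" and '
           'xpath=\'//tr/td/a[substring(@href,1,4)="http"]'
           '[@href!="http://ycombinator.com"]\'')

  params = urllib.urlencode({'q': query, 'format': 'json'})
  url = 'http://query.yahooapis.com/v1/public/yql?' + params

  # Assumed response shape: query -> results -> a (a list of anchor dicts).
  data = json.load(urllib2.urlopen(url))
  for link in data['query']['results']['a']:
      print link['href']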


You know, honestly, I wasn't grokking much about how cool YQL could be until I saw this example. This helps a lot. Thanks.


I don't get it. Don't queries like that suffer from the same problems as normal screen scraping?


Yeah, they're obviously brittle, but it's baby steps, y'know?

I think it's pretty crazy that you can now scrape well-marked pages with a SQL-like syntax.


YQL is only really usable server-side until they add an `Access-Control-Allow-Origin: *` header. I'm pretty disappointed that they effectively disallow JavaScript (XHR) access from the browser.


It offers JSONP instead, so you can access any YQL content from JavaScript in the browser via script nodes.
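
To see what JSONP actually returns (sticking with Python, as elsewhere in this thread): adding a `callback` parameter makes YQL wrap the JSON payload in a function call, which is exactly what a dynamically inserted script node executes. The callback name `handle` below is just a placeholder:

  import urllib
  import urllib2

  # 'handle' is a made-up callback name; the endpoint is YQL's public one.
  params = urllib.urlencode({
      'q': 'select * from html where url="https://hackernews.hn/"',
      'format': 'json',
      'callback': 'handle',
  })
  body = urllib2.urlopen('http://query.yahooapis.com/v1/public/yql?' + params).read()
  print body[:30]  # something like: handle({"query":{"count":...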


You still have to give Yahoo! complete control with that method. With XHR, Yahoo! has no control over your webpage.


It only works in a few browsers, but it doesn't seem like it would hurt anything to add. Thanks for the suggestion!


I love how HN is the only site where "[x] would be cool" turns into a feature request implemented by the vendor.


Nah, we do the same thing for things suggested on Twitter :)


Just spent 20 minutes trying to grab all article links from a newspaper's website based on a URL pattern. Failed. The docs could use some work.

Anyone know how to do this?


For another approach, with a structure editor, check out http://parselets.com


Which newspaper website was it?


In Python it would be like this (get BeautifulSoup first):

  # Python 2; requires the BeautifulSoup package.
  import urllib2
  from BeautifulSoup import BeautifulSoup

  ychtml = urllib2.urlopen('https://hackernews.hn/').read()
  # Every story link on the front page sits in a <td class="title">.
  for tdtitle in BeautifulSoup(ychtml).findAll("td", "title"):
      print tdtitle.a
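
Adapting that to the newspaper question above, a rough sketch: filter the anchors with a URL regex. The newspaper domain and the date-based article pattern below are made-up placeholders:

  import re
  import urllib2
  from BeautifulSoup import BeautifulSoup

  # Placeholder site and article-URL pattern; substitute the real ones.
  html = urllib2.urlopen('http://newspaper.example.com/').read()
  article = re.compile(r'^http://newspaper\.example\.com/2009/')
  for a in BeautifulSoup(html).findAll('a', href=article):
      print a['href']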


If you use Linux or some other Unix, you can also do it with standard Unix tools:

  wget -O- https://hackernews.hn/ | grep -o 'http[^"]*'

Personally, I prefer curl because it writes to stdout by default:

  curl https://hackernews.hn/ | grep -o 'http[^"]*'

(After posting this, I noticed that HN cuts * signs at the end of a message. So I have to add this text here, or the last * would not be displayed.)


This shows all URLs on HN, including images etc., whereas the original post demonstrates retrieving only the linked articles.


You can filter that with another grep, for example:

  wget -O- https://hackernews.hn/ | grep -o 'title"><a href="[^"]*' | grep -o 'http.*'


Does it first convert the page to valid XHTML and then apply the XPath, or does it rely on the website being well-formed?


It works against markup that isn't well-formed.
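
For what it's worth, that mirrors how tag-soup parsers behave. As a rough Python parallel (an analogy only, not YQL's actual implementation), BeautifulSoup also repairs unclosed tags:

  from BeautifulSoup import BeautifulSoup

  # Unclosed <tr>, <td>, and <a> tags; the parser still builds a usable tree.
  soup = BeautifulSoup('<tr><td class="title"><a href="http://example.com">hi')
  print soup.find('a')['href']  # -> http://example.com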


I actually attended the talk on YQL at Barcamp Portland and it seems really powerful.

You are basically able to use Yahoo's cloud servers and huge internet pipe for free with this service.



