select * from html where url="https://hackernews.hn/" and xpath='//tr/td/a[substring(@href,1,4)="http"][@href!="http://ycombinator.com"]'
http://developer.yahoo.com/yql/console/
I think it's pretty crazy that you can now scrape well-marked pages with a SQL-like syntax.
Anyone know how to do this?
--
# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

ychtml = urllib2.urlopen('https://hackernews.hn/').read()
# Each story title lives in a <td class="title"> cell
for tdtitle in BeautifulSoup(ychtml).findAll("td", "title"):
    print tdtitle.a
wget -O- https://hackernews.hn/ | grep -o 'http[^"]*'
curl https://hackernews.hn/ | grep -o 'http[^"]*'
wget -O- https://hackernews.hn/ | grep -o 'title"><a href="[^"]*' | grep -o 'http.*'
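For comparison with the one-liners above, here is a rough Python 3 sketch of the two href tests in the YQL query's XPath: the href starts with "http" and is not http://ycombinator.com. It uses only the standard-library html.parser and runs on a made-up SAMPLE snippet rather than the live page. The names HttpLinkCollector, extract_links, and SAMPLE are mine, not from any library, and the sketch ignores the //tr/td part of the XPath, so it checks every a tag, not just those directly inside a table cell.

```python
from html.parser import HTMLParser

# Link the XPath explicitly excludes: [@href!="http://ycombinator.com"]
EXCLUDED = "http://ycombinator.com"


class HttpLinkCollector(HTMLParser):
    """Collects hrefs from <a> tags that pass the two XPath href predicates."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same idea as substring(@href,1,4)="http" plus the exclusion test
        if href and href[:4] == "http" and href != EXCLUDED:
            self.links.append(href)


def extract_links(html_text):
    """Return the matching hrefs in document order."""
    parser = HttpLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical markup, loosely shaped like a table of story links
SAMPLE = """
<table>
<tr><td><a href="http://ycombinator.com">logo</a></td></tr>
<tr><td class="title"><a href="http://example.com/story">A story</a></td></tr>
<tr><td><a href="item?id=1">comments</a></td></tr>
</table>
"""

print(extract_links(SAMPLE))  # ['http://example.com/story']
```

To try it on a real page, the fetched HTML string would go into extract_links in place of SAMPLE. Unlike the plain grep versions, this only looks at href attributes, so URLs that appear in scripts or text are skipped.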
You are basically able to use Yahoo's cloud servers and huge internet pipe for free with this service.
http://developer.yahoo.com/yql/console/