If you're interested in playing with Hacker News data and don't want to download the entire dataset (or don't have the CPU/memory to perform large JOINs on stories/comments), you can use the Google BigQuery HN dataset, which is now up-to-date: https://cloud.google.com/bigquery/public-data/hacker-news (specifically, the .full table, which combines both stories and comments; the dedicated tables are not up-to-date)
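For anyone who wants a starting point, here's a minimal sketch of querying that table from Python. It assumes the public table path `bigquery-public-data.hacker_news.full` and common column names (`id`, `type`, `by`, `score`, `title`); double-check the schema in the BigQuery console before relying on it.

```python
# Sketch: top stories by score from the combined stories+comments table.
# Assumes the table path and column names below; verify against the
# schema shown in the BigQuery console.
query = """
SELECT id, `by` AS author, score, title
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND score IS NOT NULL
ORDER BY score DESC
LIMIT 10
"""

# Actually running it needs the google-cloud-bigquery client and a
# GCP project with billing set up (queries against public data stay
# within the free monthly quota for small scans):
# from google.cloud import bigquery
# for row in bigquery.Client(project="your-project").query(query).result():
#     print(row.id, row.author, row.score, row.title)
```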
I see this link mentioned all the time, but every time I try it I can't get it to work.
Specifically, the "GO TO THE HACKER NEWS DATASET" big blue button on that page. It kicks me over to a Google Cloud console link, which spins for a few seconds and then brings up a "Welcome to BigQuery!" modal. The only thing I can do at that point is click "Create a Project", which dumps me into the generic console with a listing of all APIs.
Ah, OK, forgot there's the item id for comments, ensuring 100% comment coverage.
(I read "the story vote count is inaccurate for certain stories because it is only scraped once and not updated" and thought some comments might be left out too.)
So, 145M comments at a minimum of 10 sec. per comment comes to at least ~400k hours of writing time, probably an order of magnitude more in practice. And that's just writing; reading time is maybe three orders of magnitude more again.
Kudos to archive.org for hosting torrents. It would be helpful to know the size of the download up front. Nice clean web page design; would love to see that one bit of information added.
Thanks. My point to the web designer stands, though: the information should be on the first page. Before seeing your message I looked and didn't find it; I had to determine it by loading the torrent in a client. And as sillysaurus3 pointed out, the expanded size would be useful too.