If you're interested in playing with Hacker News data and don't want to download the entire dataset (or don't have the CPU/memory to perform large JOINs on stories/comments), you can use the Google BigQuery HN dataset, which is now up-to-date: https://cloud.google.com/bigquery/public-data/hacker-news (specifically, the .full table, which combines both stories and comments; the dedicated tables are not up-to-date)
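For anyone who wants a starting point, here's a minimal sketch of querying that table from Python. It assumes the public table path `bigquery-public-data.hacker_news.full` and common column names (`id`, `type`, `by`, `score`, `title`); double-check the schema in the BigQuery console before relying on it.

```python
# Sketch: top stories by score from the combined stories+comments table.
# Assumes the table path and column names below; verify against the
# schema shown in the BigQuery console.
query = """
SELECT id, `by` AS author, score, title
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story' AND score IS NOT NULL
ORDER BY score DESC
LIMIT 10
"""

# Actually running it needs the google-cloud-bigquery client and a
# GCP project with billing set up (queries against public data stay
# within the free monthly quota for small scans):
# from google.cloud import bigquery
# for row in bigquery.Client(project="your-project").query(query).result():
#     print(row.id, row.author, row.score, row.title)
```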
I see this link mentioned all the time, but every time I try it I can't get it to work.
Specifically, the "GO TO THE HACKER NEWS DATASET" big blue button on that page. It kicks me over to a Google Cloud console link, which spins for a few seconds and then brings up a "Welcome to BigQuery!" modal. The only thing I can do at that point is click "Create a Project", which dumps me into the generic console with a listing of all APIs.
Ah, OK, forgot there's the item id for comments, ensuring 100% comment coverage.
(I read "the story vote count is inaccurate for certain stories because it is only scraped once and not updated" and thought some comments might be left out too.)
So, 145M comments at a minimum of 10 sec. per comment comes to at least ~400k hours of writing time, probably an order of magnitude more in practice. And that's just writing; reading time is maybe three orders of magnitude more again.
Kudos to archive.org for hosting torrents. It would be helpful to know the size of the download up front. Nice clean web page design; would love to see that one bit of information added.
Thanks. My point to the web designer stands, though: the information should be on the first page. Before seeing your message I looked and didn't find it; I had to determine it by loading the torrent in a client. And as sillysaurus3 pointed out, the expanded size would be useful too.