mauriceweber's comments

Yes, a small clarification: the 1TB per dump refers to the head+middle partition of the dataset and includes both the text documents and the quality signals. On top of that, there is another ~700GB for the minhash signatures and 1-1.5TB for the documents in the tail split.

