https://github.com/LAION-AI/Open-Assistant/issues/1110

> https://www.gutenberg.org/ has an extensive collection of ebooks in multiple languages and formats that would make great training data

…

> There is detailed legal information on which books are in the public domain and which ones are copyrighted. It would be great if someone would go through these and decide which books are okay to crawl and use as training data (my understanding is that it is okay to scrape the contents, as they are publicly available in a browser, but just to be sure)

Yup, sure are the same folk who put together that dataset they used to train Stable Diffusion.

Data? Yeah, just take everything. It’s all good.