> Our large language models are trained on a broad corpus of text that includes ...

celestialcheese · on April 5, 2023

Being Google or Microsoft (or Microsoft-affiliated) has its perks.

The laws around scraping content and using that data for derivative works are incredibly nuanced. This article is the best up-to-date overview of the state of the industry [1].

TL;DR - IANYL. If you have enough money for a legal defense and you are scraping publicly available content that is not behind a login gate, it's probably fine and defensible, but it will cost an unbelievable amount of time and money to defend.

1 - https://blog.ericgoldman.org/archives/2022/12/hello-youve-be...