
> Our large language models are trained on a broad corpus of text that includes publicly available, licensed content

I wonder how they licensed all those websites that had no license information, making them by default copyrighted.



Being Google or Microsoft (or Microsoft-affiliated) has its perks.

The law around scraping content and using that data for derivative works is incredibly nuanced. This article is the best up-to-date overview of the state of the industry [1].

TL;DR - IANYL. If you have enough money for a legal defense, and you are scraping publicly available content that is not behind a login gate, it's probably fine and defensible, but it will cost an unbelievable amount of time and money to defend.

1 - https://blog.ericgoldman.org/archives/2022/12/hello-youve-be...



