OpenAI and Anthropic are ignoring robots.txt

joshstrange · on June 22, 2024

arthurcolle · on June 22, 2024

robots.txt is a suggestion, not a rule.

hifromwork · on June 22, 2024

But terms of service are a rule, and robots.txt is usually a machine-readable representation of the terms of service.
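To make "machine readable" concrete, here is a minimal sketch of how a well-behaved crawler could consult robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` agent name, and the `may_fetch` helper are all made up for illustration. Nothing forces a crawler to run a check like this, which is the point of the parent comment.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent: str, url: str) -> bool:
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

A crawler that honors the file would call `may_fetch` before each request and skip any URL for which it returns False.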

nradov · on June 22, 2024

What do you mean by "rule"? Depending on the exact circumstances, violating terms of service or ignoring robots.txt may not be a violation of criminal law or create any civil liability. In particular, scraping public data is generally legal under the CFAA regardless of robots.txt content.

https://newmedialaw.proskauer.com/2022/05/24/doj-revises-pol...

As a practical matter, if web site owners don't like particular HTTP requests then they can just ignore them or return errors or junk responses.
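A rough sketch of what "return errors" could look like on the server side, assuming the site picks its response from the request's User-Agent header. The substrings listed and the `choose_response` helper are assumptions for illustration, not any site's actual configuration.

```python
# Hypothetical list of crawler User-Agent substrings a site owner wants to refuse.
BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "ClaudeBot")


def choose_response(user_agent: str) -> tuple[int, str]:
    """Return an (HTTP status, reason) pair based on the User-Agent header."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in BLOCKED_AGENT_SUBSTRINGS):
        return 403, "Forbidden"
    return 200, "OK"
```

This only works against crawlers that identify themselves honestly, since the User-Agent header is set by the client and can be changed at will.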

halJordan · on June 22, 2024

That's a cool article to read. It explicitly wonders whether a robots.txt is enough to revoke authorization. And it seems like the DoJ does allow itself to consider the different blocking mechanisms used (including robots.txt) when deciding whether to prosecute.

The DoJ is explicit in saying that something like a cease and desist is enough, so if, for example, the NYT found OpenAI's bot and sent one, then continued scraping would likely be prosecutable.

Handy-Man · on June 21, 2024

Title editorialized due to being too long