
A lot of news websites restrict any crawler other than Google. And this does not happen only via robots.txt.


Indeed, years ago I had scripts that automatically fetched URLs from IRC, and I quickly realized that if I didn't spoof the user agent of a proper web browser, many websites would reject the request. Googlebot's UA worked just fine however.
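For illustration, here is a minimal sketch of what sending a browser-like user agent could look like with Python's standard library. This is not the original script: the `build_request` helper and the `BROWSER_UA` string are hypothetical, picked only to show where the header goes.

```python
import urllib.request

# Hypothetical browser-like UA string, used only for illustration.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that carries an explicit User-Agent header.

    Without this, urllib sends its default "Python-urllib/x.y" UA,
    which some sites reject outright.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_request("https://example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

Passing the resulting request to `urllib.request.urlopen` performs the actual fetch.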


> Googlebot's UA worked just fine however

They obviously don't care enough, then. Google says you should use reverse DNS (rDNS) to verify that Googlebot crawls are real[0]. Cloudflare now does this automatically as well for customers with WAF (Pro plan).

0: https://developers.google.com/search/docs/advanced/crawling/...
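A rough sketch of that check, assuming the usual two-step procedure: reverse-resolve the client IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. The function name `is_verified_googlebot` and the injectable lookup parameters are my own, added so the logic can be exercised without network access. The default forward lookup here handles IPv4 only.

```python
import socket

# Domains accepted for Googlebot hostnames (leading dot avoids
# matching look-alikes such as "fakegooglebot.com").
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    reverse_lookup and forward_lookup are hypothetical hooks; by
    default they use the system resolver (IPv4 only for the forward step).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record or resolver failure
        return False

    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        addresses = forward_lookup(host)
    except OSError:
        return False

    # The forward lookup must map back to the original IP; otherwise
    # anyone controlling their own PTR record could fake the hostname.
    return ip in addresses
```

The user agent string never enters into it, which is the point: a spoofed Googlebot UA coming from an arbitrary IP fails at the reverse lookup or at the forward confirmation.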



