dmn001's comments

I find there is really no need to hide or mask your IP address when web scraping. Using proxies or Tor to do so is completely unnecessary and can even be prohibitive, e.g. try using Google through Tor.


When you are hitting sites thousands of times you have to make yourself as human-like and anonymous as possible, which isn't even as hard as it sounds. Your IP address, user agent and random timers are the three most important things in botting.
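Those last two points can be sketched with the requests library; the header values and delay range below are illustrative assumptions, not recommended values:

```python
import random
import time

import requests

# Illustrative browser-like headers; the exact values are assumptions.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def polite_get(url):
    """GET a page with browser-like headers and a randomised delay."""
    # A random pause between requests makes the traffic look less bot-like.
    time.sleep(random.uniform(1.0, 4.0))
    return requests.get(url, headers=HEADERS, timeout=30)
```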

http://www.blackhatunderground.net/forum/the-deep-web/9-blac...


There is no issue with parsing and scraping in the same loop as long as there is caching in there as well. You don't want to be hitting the server repeatedly whilst you're debugging.

A project like Scrapy should have caching on by default, but it seems to be an afterthought. Repeatable and reproducible parsing of cached websites is necessary, e.g. if you find additional data fields that you want to parse without downloading the entire site over again.
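To be fair, Scrapy does ship an HTTP cache middleware, it is just disabled by default. A settings.py fragment like this turns it on (values shown are one reasonable configuration, not the only one):

```python
# settings.py -- enable Scrapy's built-in HTTP cache (off by default)
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"   # stored under the project's .scrapy data dir
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```

With this in place, re-running a spider replays cached responses, so new parsing code can be tested without re-downloading the site.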


I think the bigger point is the benefit of storing pulled data as is for the future, not so much about hitting the server multiple times. If so, I agree with this 100% -- being able to re-run your algorithms later on a local dataset is a powerful capability. Later time, different computer, new software version -- no problem, you have a local copy of the data.

With caching, you are at the mercy of whatever third party caching scheme is used under the hood and raw pulled data can disappear any time without your explicit command (e.g., if some library gets updated and decides that this invalidates the caching scheme).


By caching, I just mean storing data locally so you don't have to request it again within a certain timeframe. I use my own caching scripts written in Python. If you use a 3rd party library, data deletion doesn't matter too much either, provided you configure it properly and back up the data: html/json data compresses really well using LZMA2 in 7-Zip.
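A minimal sketch of such a caching script, using Python's stdlib lzma (the same LZMA family of compression 7-Zip uses); the cache directory and hashing scheme are illustrative assumptions:

```python
import hashlib
import lzma
from pathlib import Path

import requests

CACHE_DIR = Path("cache")  # assumed location for the local cache
CACHE_DIR.mkdir(exist_ok=True)


def cached_get(url):
    """Fetch a URL once, storing the body locally as a compressed .xz file.

    Subsequent calls read from disk instead of hitting the server again.
    """
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.xz"
    if path.exists():
        return lzma.decompress(path.read_bytes()).decode("utf-8")
    html = requests.get(url, timeout=30).text
    path.write_bytes(lzma.compress(html.encode("utf-8")))
    return html
```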


On the contrary, I have found lxml suitable for all of my scraping projects where the objective is to write some XPath to parse or extract some data from some element.
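A typical example of that workflow; the markup and class names are made up for illustration:

```python
from lxml import html

# A hypothetical business-listing snippet.
doc = html.fromstring("""
<html><body>
  <div class="listing">
    <span class="name">Acme Ltd</span>
    <span class="phone">555-0100</span>
  </div>
</body></html>
""")

# One XPath expression pulls out exactly the field you want.
names = doc.xpath('//div[@class="listing"]/span[@class="name"]/text()')
```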


lxml itself is; the problem is that its HTML parser (really libxml2's) is an ad-hoc "HTML4" parser, which means the tree it builds routinely diverges from a proper HTML5 tree as you'd find in e.g. your browser's developer tools, and the way it fixes markup (or whether it fixes it at all) is completely ad hoc and hard to predict.


That may be fine for a JavaScript-heavy site with a few pages, but for anything with more than, say, 1,000 pages it is much more efficient to scrape using requests with lxml. The requests can be made concurrently, the approach scales, and there is no browser overhead from page rendering.
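A sketch of the concurrent requests + lxml approach using a stdlib thread pool; the worker count and the idea of extracting page titles are just illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html


def fetch_title(url):
    """Fetch one page and parse a field out of it -- no browser involved."""
    page = requests.get(url, timeout=30)
    tree = html.fromstring(page.content)
    return url, tree.findtext(".//title")


def scrape_all(urls, workers=8):
    # Threads are enough here: the work is I/O-bound, not CPU-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_title, urls))


# Usage (hypothetical URLs):
# results = scrape_all(["https://example.com/page/%d" % i for i in range(1000)])
```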


I've done a lot of scraping in my day, and I've found that lxml/requests is 2-3 orders of magnitude more resource-efficient than a Selenium-based browser. That JS/rendering engine is HEAVY!


SVG flowcharts of many gamebooks including FF and Lone Wolf: http://outspaced.fightingfantasy.net/SVG_Flowcharts/main.htm...


I made .svg diagrams for the Fighting Fantasy books Warlock of Firetop Mountain and Deathtrap dungeon a while back on my old blog: https://daveman.wordpress.com/2010/01/08/how-to-create-svg-l...


You have it marked as private


Should be fixed now.


No. Most websites don't do this.


It's extremely rare to be IP-blocked by any website just for using Google's user agent from an IP range that isn't Google's. IP's get re-used and you can switch to a new one easily, so it's really not common or good practice for this to happen.


> IP's get re-used and you can switch to a new one easily, so it's really not common or good practice for this to happen.

On the flip side, some people can't change their IP addresses easily, and getting IP banned (even if rare because of the reasons you stated) is actually a major hassle when it actually happens for those people. :/


SEEKING WORK - England, UK / Remote

Over 7 years' experience with Python software development, cloud services, data mining, web crawling and databases. Want to extract or crawl data from a website, such as business listings, sports data, government data or site directories, either one-off, periodically, or in real-time? Contact me via email: dmn001(at)gmail


The first part seems like a very long-winded way to say "don't use the default user agent".

The captcha was unusually simple to solve; in most cases the best strategy is to avoid seeing it in the first place.

