My Master's degree project was a webcrawler. If this sort of thing interests you, the thesis [0] might be a somewhat entertaining read.
I had somewhat different constraints (only hitting the frontpage, cms/webserver/... fingerprinting, and a backend that has to support ad-hoc queries for site features), but it's nice to see that the process is always roughly the same.
One of the most interesting things I experienced was that link crawling works pretty well for a while, but once you have visited a large number of URLs, bloom filters are pretty much the only memory-efficient way to protect against duplication.
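A minimal sketch of the idea in Python (the bit-array size and hash count here are made up for illustration; a real crawler would use a tuned library):

    import hashlib

    class BloomFilter:
        """Fixed-size bit array; k hashes derived from md5 with different salts."""
        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # May return false positives, never false negatives.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    url = "http://example.com/"
    if url not in seen:
        seen.add(url)
        # fetch(url) ...

The trade-off is that a small, constant amount of memory buys you a tiny false-positive rate, i.e. you occasionally skip a URL you have never actually visited.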
I switched to a hybrid model where I still follow links, but to limit the required crawl depth I also use pinboard/twitter/reddit to discover new domains. For bootstrapping, you can get your hands on zonefiles from places on the internet (e.g. premiumdrops.com), which quickly saves you from having to crawl very deep; a seeding sketch follows below.
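A rough sketch of seeding the frontier from a zonefile, assuming one record per line with the domain in the first column (roughly what TLD zonefiles look like); the filename is made up:

    # Turn a zonefile into a list of frontpage URLs to seed the crawl frontier.
    def seed_from_zonefile(path):
        domains = set()
        with open(path) as f:
            for line in f:
                if not line.strip() or line.startswith(";"):
                    continue  # skip blank lines and comments
                domains.add(line.split()[0].rstrip(".").lower())
        return ["http://" + d + "/" for d in sorted(domains)]

    frontier = seed_from_zonefile("domains.zone")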
These days, I run a combination of a worker approach with Redis as a queue/cache and plain Elasticsearch as the backend. I'm pretty happy with the easy scalability.
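A rough sketch of what such a worker loop could look like in Python (the queue/set/index names and the document layout are invented here, and the es.index keyword arguments depend on the client version):

    import redis
    import requests
    from elasticsearch import Elasticsearch

    r = redis.Redis()                              # queue + dedup cache
    es = Elasticsearch("http://localhost:9200")

    while True:
        _, raw = r.brpop("crawl:queue")            # block until a URL is available
        url = raw.decode()
        if r.sadd("crawl:seen", url) == 0:         # already handled by another worker
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        doc = {"url": url, "status": resp.status_code, "body": resp.text[:100_000]}
        es.index(index="sites", id=url, document=doc)  # document= assumes the 8.x client

Scaling out is then mostly a matter of starting more copies of this process; Redis hands each URL to exactly one worker.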
Web crawlers are a great weekend project: they let you fiddle with evented architectures (github sample [1]), scale a database, and watch the bottlenecks jump from place to place within your architecture. I can only recommend writing one :)
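The linked sample [1] is Ruby/EventMachine; the same evented idea in Python would look roughly like this (aiohttp assumed, URLs made up):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return url, resp.status, await resp.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None, ""

    async def crawl(urls):
        # One event loop, many in-flight requests: that's the "evented" part.
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    results = asyncio.run(crawl(["http://example.com/", "http://example.org/"]))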
[0] http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo...
[1] https://github.com/rb2k/em-crawler-sample