Disclaimer: I’m a creator of the Frontera crawl frontier framework, so the vision below may be biased accordingly.
It’s a tough question to answer in general. The answer differs significantly depending on the business goals of the system. But I’ll assume the goal is to acquire content at least once and re-crawl it from time to time, similar to what Common Crawl does. I’ll start by defining the problem and then proceed to a possible solution.
My estimate of the world’s total count of web hosts is ~500M, and registered domain names number at least 900M. There could be big websites with millions of pages, or small ones with only a few. Let’s assume we want to create a bird’s-eye-view collection and will crawl no more than 100 pages per website. Therefore the target is to crawl at most ~50B pages. Some registrars keep their zone files private, and the web is not perfectly interlinked, so there is no way to discover all hosts in general. But let’s limit the problem to 50B pages, just to make further design easier.
The data volume can be estimated using an average uncompressed page size of 60 KB. Overall, it turns out to be ~3 PB uncompressed and ~1 PB compressed with Snappy.
The next thing to think about is the time we would like to spend acquiring this content. Let’s say we’re poor and fine with just 6 months. That means crawling ~280M pages daily, ~12M per hour and ~3240 per second. The overall throughput is expected to be ~190 MB per second, so a 2-4 Gbit/s network connection should be enough for our cluster.
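The sizing numbers above are easy to sanity-check with a back-of-envelope script. All figures here are the rough assumptions from the text (100 pages × ~500M hosts, 60 KB average page, 6 months), not measurements:

```python
# Back-of-envelope check of the crawl-sizing estimates above.
TOTAL_PAGES = 50e9          # ~100 pages x ~500M hosts
AVG_PAGE_KB = 60            # assumed average uncompressed page size
MONTHS = 6

uncompressed_pb = TOTAL_PAGES * AVG_PAGE_KB * 1024 / 1024**5
pages_per_day = TOTAL_PAGES / (MONTHS * 30)
pages_per_sec = pages_per_day / 86400
throughput_mb_s = pages_per_sec * AVG_PAGE_KB / 1024

print(f"{uncompressed_pb:.1f} PB uncompressed")   # ~2.7 PB
print(f"{pages_per_day / 1e6:.0f}M pages/day")    # ~278M
print(f"{pages_per_sec:.0f} pages/sec")           # ~3215
print(f"{throughput_mb_s:.0f} MB/sec")            # ~188
```

So ~190 MB/s (roughly 1.5 Gbit/s) is the sustained rate, which is where the 2-4 Gbit/s link estimate comes from.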
Alright, so this is what our engineering problem looks like. Let’s move on to the possible solution and answer the questions stated above.
There are not that many open-source crawlers capable of large-scale crawls; to name a few: Apache Nutch, Heritrix and Frontera/Scrapy. In this case I’ll stick with the latter. It will require ~160 fetchers running in parallel and ~80 strategy workers running the crawling strategy code. At 1 core per process, plus some overhead for monitoring, links DB storage and queue operation, that results in ~280 cores, which is seven modern 48-core machines. This is only the crawler part. Frontera requires Apache Kafka and HBase to operate, so it’s going to take around 10-12 more machines to run these services.
Q: How would you schedule and manage the crawlers?
I would say there are two common ways to do that. Either your crawler operates in batches: crawl a batch, stop, parse, extract links, crawl the next batch, and so on; or the crawler is online: it crawls, parses, extracts and schedules links continuously, without stopping. The latter is usually faster. Batch operation is implemented in Nutch, using command-line calls; online operation is done in Heritrix, StormCrawler and Frontera/Scrapy.
The modern way to do that is to run the crawler processes in a container farm.
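The batch vs. online distinction can be sketched as two loops. The `fetch`/`parse`/`extract_links` stubs below are made-up stand-ins for illustration, not any real crawler's API:

```python
# Stand-in stubs: a "page" is a string carrying one outgoing link.
def fetch(url):          return f"<a href='{url}/next'>x</a>"
def parse(page):         return page
def extract_links(doc):  return [doc[doc.find("'") + 1:doc.rfind("'")]]

def batch_crawl(seeds, batches=3):
    """Nutch-style: fetch a whole batch, stop, then parse and schedule."""
    frontier = list(seeds)
    for _ in range(batches):
        pages = [fetch(u) for u in frontier]                     # fetch phase
        docs = [parse(p) for p in pages]                         # parse phase
        frontier = [l for d in docs for l in extract_links(d)]   # schedule phase
    return frontier

def online_crawl(seeds, limit=3):
    """Heritrix/Frontera-style: fetch, parse and schedule continuously."""
    frontier, seen = list(seeds), 0
    while frontier and seen < limit:
        doc = parse(fetch(frontier.pop(0)))
        frontier.extend(extract_links(doc))   # new links flow in immediately
        seen += 1
    return frontier
```

The online loop never sits idle between phases, which is why it tends to achieve higher sustained throughput.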
Q: How would you collect and parse incoming data?
It depends on the purpose. Online processing is getting popular these days; have a look at Apache Kafka and Storm. Many open-source parsers are available for HTML and other document types.
Q: What would you use to store the data?
The main issue here is the way you’re going to access the data. If it’s ad-hoc random access to documents plus an occasional full scan, then Apache HBase or another column-oriented store is the way to go. If you’re going to perform full scans only (for example, to collect some stats or build an index), then Apache Kafka or HDFS can be used as storage.
Q: How would you process and draw meaning from the data?
In the general case, it’s Solr or Elasticsearch. If something more specific is needed, you could sample via a linear scan and apply Apache Spark to the sample. Again, these days it’s very popular to run workers in Docker containers that process the content online.
Q: What information would be important to build a pagerank system for searching?
PageRank requires only the link graph. You would need to extract it from the HTML content after crawling. You could use Apache Giraph or Spark GraphX to do the computations. These days PageRank is abused for many commercial purposes, so antispam filtering will be needed to clean up the graph.
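For intuition, here is a minimal PageRank via power iteration over an adjacency list; the tiny example graph is made up for illustration (real crawls would run this on Giraph or GraphX):

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power iteration over a dict of node -> list of outlinked nodes."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}   # teleport term
        for u, outlinks in graph.items():
            if not outlinks:                          # dangling node: spread evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                share = damping * rank[u] / len(outlinks)
                for v in outlinks:
                    new[v] += share
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))   # "c": it collects rank from both a and b
```

This is also why link spam works: manufacturing inlinks inflates a node's rank, hence the need for antispam cleanup of the graph first.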
Q: What other major components are missing?
Rendering engine. Many websites nowadays require JavaScript execution and limit link discovery for non-browsers.
Deduplication. Sometimes a website serves the same content under different URLs, and content theft is quite common on the web, so deduplication is needed if you’re going to build a search engine.
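The exact-duplicate part can be as simple as canonicalizing the URL and hashing the body, keeping only the first document seen per hash. A minimal sketch (the canonicalization rules here are illustrative, not exhaustive):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize case, trailing slashes and fragments so URL variants collapse."""
    parts = urlsplit(url.lower())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))

seen_hashes = {}   # content digest -> first canonical URL seen with it

def is_duplicate(url, body):
    digest = hashlib.sha1(body.encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes[digest] = canonicalize(url)
    return False

print(is_duplicate("http://Example.com/a/", "same text"))   # False: first sighting
print(is_duplicate("http://example.com/a", "same text"))    # True: same body again
```

Exact hashing only catches byte-identical bodies; near-duplicates (reworded or re-templated copies) need shingling or SimHash-style techniques.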
Monitoring. The crawl will take a long time, so it’s important to monitor the system’s performance (the aggregated crawl speed, queue contents, etc.) and health.
Orchestration. Your crawler is going to be a multi-process application, and it’s important to be able to quickly and reliably start, stop and check the status of all the processes.
Having worked with/on two large-scale web crawling systems at FAST Search & Transfer (the C crawler that powered alltheweb.com, now owned by Yahoo, and a Python-based crawler that was a workhorse for FAST ESP/Scirus.com/etc., now owned by Microsoft), I'd say this run-down is pretty good. Some things I'd add:
- Link traps. If you limit yourself to 100 pages per site this is not as big an issue, but if you want to go deeper you need a way to detect when a site is generating garbage.
- Near-duplicate detection. There are lots of sites, as you mention, that republish others' content, but some just present it in different ways with different headers, timestamps, etc.
- Content/metadata detection and extraction. Once crawled, you want to do something with the content, and detecting the actual content of pages is non-trivial if you don't want headers/ads/etc.
- How do you handle non-HTML content (PDF, Word docs, etc.)?
- How do you handle large content (sample, truncate, ignore)?
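On link traps: a cheap first-line heuristic is to flag URLs whose paths are suspiciously deep or repeat the same segment, a common symptom of calendar or session-ID traps. The thresholds below are illustrative guesses, not tuned values:

```python
from urllib.parse import urlsplit

def looks_like_trap(url, max_depth=8, max_repeats=2):
    """Flag URLs with very deep paths or a path segment repeated many times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True   # e.g. endless pagination or recursive directory loops
    return any(segments.count(s) > max_repeats for s in set(segments))

print(looks_like_trap("http://x.com/a/b/c"))                           # False
print(looks_like_trap("http://x.com/cal/2016/cal/2016/cal/2016/cal"))  # True
```

Real trap detection would also watch per-site page counts and content entropy, but URL-shape checks catch the worst offenders cheaply at scheduling time.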
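And on near-duplicates: the naive form of the idea is word shingling plus Jaccard similarity; production systems typically use SimHash or MinHash to make this scale, but the comparison itself looks like this (example documents are made up):

```python
def shingles(text, k=3):
    """The set of k-word windows in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of shingle sets: 1.0 means identical, 0.0 means disjoint."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc1 = "breaking news the market rallied strongly on friday afternoon"
doc2 = "the market rallied strongly on friday afternoon analysts said"
print(jaccard(doc1, doc2) > 0.5)   # True: mostly the same article body
```

This catches republished articles wrapped in different headers and timestamps, which exact content hashing misses entirely.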
I used to work at a vertically-focused web search engine and ran the operational side of the crawler.
Also missing from this discussion would be a mechanism to rate-limit the crawl (and to determine adequate rate limits, based on your error rates).
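A common shape for this is a per-host token bucket: each host gets `rate` requests per second with short bursts up to `burst`. A minimal sketch (timestamps are injectable so the behavior is easy to test):

```python
import time

class HostRateLimiter:
    def __init__(self, rate=1.0, burst=2.0):
        self.rate, self.burst = rate, burst
        self.state = {}   # host -> (tokens, last_timestamp)

    def allow(self, host, now=None):
        """Return True if a request to `host` may go out right now."""
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(host, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)   # refill
        if tokens >= 1.0:
            self.state[host] = (tokens - 1.0, now)   # spend one token
            return True
        self.state[host] = (tokens, now)
        return False

rl = HostRateLimiter(rate=1.0, burst=2.0)
print(rl.allow("example.com", now=0))   # True  (burst token 1)
print(rl.allow("example.com", now=0))   # True  (burst token 2)
print(rl.allow("example.com", now=0))   # False (bucket empty)
print(rl.allow("example.com", now=1))   # True  (refilled after 1s)
```

Tying `rate` to observed per-host error rates (lowering it when 5xx/429s climb) gives you the adaptive limiting described above.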
Also, detecting that you've been blocked and backing off so as not to further hammer the site you're crawling with requests.
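A sketch of that idea: treat a run of 403/429 responses as a block signal and back off exponentially with jitter. The window size and delay cap here are illustrative choices:

```python
import random

def is_probably_blocked(status_history, window=5):
    """Treat a solid run of 403/429 responses as a block signal."""
    recent = status_history[-window:]
    return len(recent) == window and all(s in (403, 429) for s in recent)

def backoff_delay(consecutive_errors, base=1.0, cap=3600.0):
    """Full-jitter exponential backoff in seconds, capped at one hour."""
    return random.uniform(0, min(cap, base * 2 ** consecutive_errors))

print(is_probably_blocked([200, 403, 403, 403, 403, 403]))   # True
print(is_probably_blocked([200, 200, 403, 200, 403]))        # False
```

The jitter matters: without it, every fetcher that got blocked at the same moment retries at the same moment too.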
IP management is an issue here as well: lots of places just block whole IP ranges from crawling activity outright. And will you be honoring robots.txt or not?
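If you do honor robots.txt, the Python standard library already handles the parsing. A self-contained example (the robots content is inlined here rather than fetched):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())   # normally fed from GET /robots.txt

print(rp.can_fetch("mybot", "http://example.com/index.html"))   # True
print(rp.can_fetch("mybot", "http://example.com/private/x"))    # False
print(rp.crawl_delay("mybot"))                                  # 5
```

At 50B-page scale you'd cache the parsed rules per host with a TTL, since re-fetching robots.txt before every request would double your request volume.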
Be prepared for people to block you in new and stupid ways: we once got blocked from even hitting a site's name servers to do lookups against them. They blackholed our packets. So what should have been a ~500ms DNS query at each HTTP request turned into a 15s pause while the DNS request timed out ... eventually this stacked up across all threads, backing the overall crawling infrastructure up into deadlock.
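One mitigation for that failure mode is a TTL-bound DNS cache in front of the resolver, so one slow or blackholed nameserver doesn't stall every fetcher thread on every request. A minimal sketch (the `resolve` function is injectable; a real deployment would also add a hard timeout and negative caching):

```python
import socket
import time

class DnsCache:
    def __init__(self, ttl=300, resolve=socket.gethostbyname):
        self.ttl, self.resolve = ttl, resolve
        self.cache = {}   # host -> (ip, expiry)

    def lookup(self, host, now=None):
        now = time.monotonic() if now is None else now
        hit = self.cache.get(host)
        if hit and hit[1] > now:
            return hit[0]                # served from cache: no network at all
        ip = self.resolve(host)          # may be slow or raise on timeout
        self.cache[host] = (ip, now + self.ttl)
        return ip

calls = []
dns = DnsCache(ttl=300, resolve=lambda h: calls.append(h) or "10.0.0.1")
dns.lookup("example.com", now=0)
dns.lookup("example.com", now=100)   # cache hit: resolver not called again
print(len(calls))                    # 1
```

With a cache like this, a blackholed nameserver costs you one timeout per host per TTL instead of one per request.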
The Wayback Machine architecture is probably a good, public implementation of a large scale crawling mechanism. This post[1] about it may be a bit dated, but it's probably still accurate.
As far as open-source crawlers go, I've heard of SpiderLing (http://corpus.tools/wiki/SpiderLing), which is somewhat specialized for building text corpora.
Hi, mryan! I'm the core Frontera developer. The precise answer heavily depends on your use case (what the task is, scalability requirements, data flow), so please ask your question in the Frontera Google group, and we will try to address it directly.
First, you could try making use of Frontera; here are the different distribution models it provides out of the box: http://frontera.readthedocs.org/en/latest/topics/run-modes.h.... Frontera is a web crawling framework made at Scrapinghub, providing a crawl frontier and scaling/distribution capabilities. Along with a flexible queue and partitioning design, you also get document metadata storage (HBase or an RDBMS of your choice) with a simple revisiting mechanism.
Second, we have a simple Redis-based solution for scaling spiders: https://github.com/rolando/scrapy-redis. Its only dependency is Redis, so it's easy to get started with, but it has only one queue shared between spiders, hard-coded partitioning, and Redis limits its scalability.
You mentioned scrapy-cluster (not scrapyd-cluster, probably). It provides a more sophisticated distribution model, allowing you to separate crawls into jobs within the same service; it maintains separate per-spider queues (allowing you to crawl politely; I hope you plan to do so? :), while forcing you to use its prioritization model. It also allows spiders of different types to share the same Redis instance, and to prioritize requests at the cluster level. BTW, I haven't found any dependencies on Zookeeper.
None of these solutions provides process supervision out of the box. If a spider is killed by OOM, or consumes too many resources (open file descriptors, memory), you have to take care of it yourself. You could use supervisord, upstart or some custom process management solution. It all depends on your monitoring requirements.
Sounds like I have some more research to do - thanks for the detailed response.
Btw, you were right: I meant scrapy-cluster and not scrapyd-cluster: https://github.com/istresearch/scrapy-cluster. There is a requirement on Zookeeper/Kafka, which is the main showstopper for me.