Design Web Crawler | System Design
Problem Requirements
Back of the Envelope Estimation
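A rough sketch of what the estimation could look like. Every input here (1 billion pages, ~100 KB per page, a 30-day crawl window) is an assumed figure for illustration, not a number from the original notes.

```python
# Back-of-the-envelope crawl sizing. All inputs are illustrative assumptions.

PAGES = 1_000_000_000           # assume ~1 billion pages to crawl
AVG_PAGE_BYTES = 100 * 1024     # assume ~100 KB of HTML per page
CRAWL_SECONDS = 30 * 24 * 3600  # assume we want to finish in ~30 days

total_storage_tb = PAGES * AVG_PAGE_BYTES / 1e12
pages_per_second = PAGES / CRAWL_SECONDS
bandwidth_mb_s = pages_per_second * AVG_PAGE_BYTES / 1e6

print(f"storage    ~{total_storage_tb:,.0f} TB")        # ~102 TB of raw HTML
print(f"fetch rate ~{pages_per_second:,.0f} pages/s")   # ~386 pages per second
print(f"bandwidth  ~{bandwidth_mb_s:,.0f} MB/s")        # ~40 MB/s sustained ingest
```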
Process Overview
Finding URLs to Crawl
Can we reduce network calls to the frontier?
We could keep each node's URL frontier entirely local.
Local Frontier Evaluation
While a local frontier lets us avoid the extra network call that a centralized frontier would require, it has major downsides.
We would still need an extra network call to a centralized service to check whether we have already processed a URL.
Distributed Frontier
Idea: to make sure that each node gets an equal amount of work, we can send URLs from one node to another.
Avoid Duplicate Fetches
We want to avoid fetching and storing the same website twice!
One solution: a database that stores fetched URLs.
This requires an extra network call on every read!
Avoid Duplicate Fetching, Optimized
Route URLs x and y to the same node if x == y
We can just partition our nodes by a hash range of the URL
This keeps the load per node balanced.
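A minimal sketch of this routing, assuming a fixed cluster size and simple hash partitioning (the node count and helper names are made up for illustration): every occurrence of the same URL hashes to the same owner node, so the "have we fetched this?" check stays local.

```python
import hashlib

NUM_NODES = 16  # assumed cluster size for illustration

def owner_node(url: str) -> int:
    """Hash the URL into a stable partition so duplicates always route to one node."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

class CrawlerNode:
    """Each node only tracks the URLs it owns, so dedup needs no network call."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.seen: set[str] = set()

    def should_fetch(self, url: str) -> bool:
        # Only called for URLs routed to this node by owner_node(url).
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

# Usage: URLs discovered anywhere in the cluster are forwarded to owner_node(url),
# and that node's local `seen` set decides whether to fetch.
node = CrawlerNode(owner_node("https://example.com/a"))
assert node.should_fetch("https://example.com/a") is True
assert node.should_fetch("https://example.com/a") is False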
Avoid Duplicate Content on Different Sites
Can we do something smart here to avoid extra network calls while still preventing ourselves from processing duplicate HTML?
Content Hash Checking
Since duplicate hashes can show up on any node, we need some sort of centralized set of hashes.
Option 1: a centralized Redis set of hashes.
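One way Option 1 might look, using the redis-py client (the client setup, key name, and hash function are assumptions): SADD returns 1 only the first time a member is added, so it doubles as the "is this content new?" check.

```python
import hashlib
import redis  # assumes the redis-py client is installed

r = redis.Redis(host="dedup-redis.internal", port=6379)  # hypothetical host

def is_new_content(html: str) -> bool:
    """Return True only the first time this exact page content is seen cluster-wide."""
    content_hash = hashlib.sha256(html.encode("utf-8")).hexdigest()
    # SADD returns the number of members actually added: 1 if new, 0 if already present.
    return r.sadd("crawler:content-hashes", content_hash) == 1
```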
Content Hash Low Latency
If our nodes are in multiple data centers, then reads and writes to Redis could be slow.
Idea: keep a set CRDT (Conflict-free Replicated Data Type) on each node and perform anti-entropy in the background.
Issue: we may now process the same content multiple times.
This is tolerable as long as our operations are idempotent.
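A sketch of the CRDT idea, assuming a grow-only set (G-Set) is enough since content hashes are only ever added. Merging two replicas is just a set union, which is what the background anti-entropy pass would exchange.

```python
class GSetCRDT:
    """Grow-only set of content hashes: adds are local, replicas converge via union."""
    def __init__(self):
        self.hashes: set[str] = set()

    def add(self, content_hash: str) -> bool:
        """Record a hash locally; returns False if this replica already had it."""
        if content_hash in self.hashes:
            return False
        self.hashes.add(content_hash)
        return True

    def merge(self, other: "GSetCRDT") -> None:
        """Anti-entropy step: union in another replica's state. Union is commutative,
        associative, and idempotent, so replicas converge regardless of merge order."""
        self.hashes |= other.hashes

# Two nodes process content independently, then gossip in the background.
a, b = GSetCRDT(), GSetCRDT()
a.add("hash-of-page-1")
b.add("hash-of-page-1")   # both nodes may accept the same content before merging
b.add("hash-of-page-2")
a.merge(b)
b.merge(a)
assert a.hashes == b.hashes  # converged; duplicate processing was possible but bounded
```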
Domain Name System (DNS)
Provides us a mapping from hostname (e.g. wikipedia.com) to IP address!
Previously, we said we would partition our URLs by a hash range of the full URL.
What if, instead, we partition them by a hash range of the host?
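A sketch of host-based partitioning: extract the hostname and hash that instead of the full URL, so one node owns all of a host's URLs and can cache its DNS lookup locally. The node count and the use of `socket.gethostbyname` are illustrative choices.

```python
import hashlib
import socket
from urllib.parse import urlparse

NUM_NODES = 16                    # assumed cluster size
dns_cache: dict[str, str] = {}    # host -> IP, local to the owning node

def owner_node_for_host(url: str) -> int:
    """All URLs on the same host hash to the same node."""
    host = urlparse(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

def resolve(url: str) -> str:
    """Resolve the host once and reuse the answer for every URL on that host."""
    host = urlparse(url).hostname or ""
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]
```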
Handling Robots.txt
robots.txt tells us whether we can crawl a certain host, and if so, how frequently.
Since we are already partitioning our nodes by hostname, we can also keep a local cache of each host's crawling policy.
Ex 1: Do not crawl Twitter.
-> Ignore the URL and get the next one from the frontier.
Ex 2: You can only crawl Twitter once per minute.
-> Keep a table with the last time you crawled each host.
-> Put the message back on your frontier and take the next one (no network call), as sketched below.
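A sketch covering both examples using the standard library's `urllib.robotparser` (the user-agent string and cache layout are assumptions): a cached `RobotFileParser` per host answers "can we crawl this at all?", and a last-crawled table enforces the per-host delay without any extra network call.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyCrawler"                       # assumed user-agent string
robots_cache: dict[str, RobotFileParser] = {}  # host -> parsed robots.txt policy
last_crawled: dict[str, float] = {}            # host -> unix time of last fetch

def robots_for(host: str) -> RobotFileParser:
    """Fetch and cache robots.txt once per host (we own the whole host anyway)."""
    if host not in robots_cache:
        rp = RobotFileParser(f"https://{host}/robots.txt")
        rp.read()
        robots_cache[host] = rp
    return robots_cache[host]

def may_fetch_now(url: str) -> bool:
    """False means: skip the URL, or push it back on the local frontier for later."""
    host = urlparse(url).hostname or ""
    rp = robots_for(host)
    if not rp.can_fetch(USER_AGENT, url):      # Ex 1: host forbids crawling this path
        return False
    delay = rp.crawl_delay(USER_AGENT) or 0    # Ex 2: host asks for N seconds between fetches
    if time.time() - last_crawled.get(host, 0.0) < delay:
        return False                           # too soon: requeue locally, no network call
    last_crawled[host] = time.time()
    return True
```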
Fetching the URL
Can we partition our nodes so they are geographically close to the URLs they fetch?
This isn't really possible: the domain name doesn't tell us where the URL is hosted.
So we simply fetch the URL from whichever node owns it.
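A minimal fetch step using the standard library (the timeout and user-agent values are assumptions); a real crawler would add retries, redirects limits, and error handling.

```python
import urllib.request

def fetch(url: str, timeout_s: float = 10.0) -> bytes:
    """Download the raw HTML for a URL; raises on HTTP or network errors."""
    request = urllib.request.Request(url, headers={"User-Agent": "MyCrawler"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()

html = fetch("https://example.com/")
```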
Storing Results
We want to keep our write latency minimal.
-> Ideally, we want data locality.
Pushing to the Frontier
We now have a bunch of URLs to push!
How should we model our frontier?
What data structure should we use?
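One way to model the frontier, assuming plain FIFO ordering per host so the rate-limit check above can cheaply move on to a different host: a round-robin queue of hosts plus a queue of URLs per host. The class and field names are illustrative.

```python
from collections import deque

class Frontier:
    """Per-host FIFO queues: pop rotates across hosts so one slow host
    doesn't block the rest, and requeued URLs go back on their host's queue."""
    def __init__(self):
        self.host_order: deque[str] = deque()      # round-robin over hosts
        self.per_host: dict[str, deque[str]] = {}  # host -> pending URLs

    def push(self, host: str, url: str) -> None:
        if host not in self.per_host:
            self.per_host[host] = deque()
            self.host_order.append(host)
        self.per_host[host].append(url)

    def pop(self) -> str | None:
        for _ in range(len(self.host_order)):
            host = self.host_order[0]
            self.host_order.rotate(-1)             # move this host to the back
            if self.per_host[host]:
                return self.per_host[host].popleft()
        return None  # frontier is empty

frontier = Frontier()
frontier.push("example.com", "https://example.com/a")
frontier.push("example.org", "https://example.org/b")
assert frontier.pop() == "https://example.com/a"
```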
Architecture Choices
We want reliability; we don't want to lose any websites!
-> It would be great if we don't have to build our own partitioning solution.
We get strong delivery guarantees (no lost messages) and built-in partitioning.
All writes to S3 can be made idempotent, and we can use internal state for caching.
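A sketch of the idempotent-write idea using boto3 (the bucket name and key scheme are assumptions): keying each object by its content hash makes repeated writes of the same page harmless, and a small local cache skips writes we already know happened.

```python
import hashlib
import boto3  # assumes the AWS SDK for Python is available

s3 = boto3.client("s3")
BUCKET = "my-crawl-results"   # hypothetical bucket name
written: set[str] = set()     # internal state: keys this node already wrote

def store_page(html: str) -> str:
    """Write the page keyed by its content hash; rewriting the same key has no new effect."""
    key = "pages/" + hashlib.sha256(html.encode("utf-8")).hexdigest()
    if key not in written:    # cheap local cache to avoid repeat writes
        s3.put_object(Bucket=BUCKET, Key=key, Body=html.encode("utf-8"))
        written.add(key)
    return key
```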
Final Design