Design Web Crawler | System Design

Problem Requirements

  1. Scrape all content currently accessible via the web.
  2. Store (and possibly index) all of that content.
  3. Respect website crawling policies (robots.txt files).
  4. Complete this process within one week.

Back of the Envelope Estimation

  1. Around 1 billion web pages.
  2. Around 1 MB to store the contents of a web page.
  3. Therefore, we need to make about 1 billion requests and store roughly 1 PB.
  4. 1 week is about 600k seconds, so 1 billion / 600k ≈ 1,700 requests per second.
  5. With perhaps 500 threads per server, that is on the order of 100 servers (see the sketch below).
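
A quick sanity check of these numbers (a rough sketch; the 1 MB page size and the thread/server counts above are assumptions, not measurements):

```python
# Back-of-envelope check of the numbers above.
PAGES = 1_000_000_000            # ~1 billion web pages
PAGE_SIZE_BYTES = 1_000_000      # ~1 MB per page (assumed average)
WEEK_SECONDS = 7 * 24 * 3600     # 604,800, rounded to ~600k above

total_storage_pb = PAGES * PAGE_SIZE_BYTES / 1e15   # bytes -> petabytes
requests_per_sec = PAGES / WEEK_SECONDS

print(f"storage: ~{total_storage_pb:.0f} PB")        # ~1 PB
print(f"throughput: ~{requests_per_sec:.0f} req/s")  # ~1,650 requests per second
```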

Process Overview

  1. Pull a URL from our "to-crawl" list.
  2. Check whether we have already crawled it.
  3. Check whether crawling it complies with its host's robots.txt file.
  4. Get the IP address of the host via DNS.
  5. Make an HTTP request to load the contents of the site.
  6. Check whether we have already processed a different URL with identical content.
  7. Parse the content.
  8. Store the results somewhere.
  9. Add any referenced URLs to our "to-crawl" list (the full loop is sketched below).
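
A minimal single-node sketch of this loop in Python; it is a toy version of the design (no politeness delays, naive regex link extraction), but each step above appears in order:

```python
import hashlib
import re
from collections import deque
from urllib import request, robotparser
from urllib.parse import urljoin, urlparse

def extract_links(html: str, base_url: str) -> list[str]:
    """Very naive link extraction; a real crawler would use an HTML parser."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls: list[str], max_pages: int = 10) -> None:
    frontier = deque(seed_urls)                 # 1. the "to-crawl" list
    seen_urls: set[str] = set()
    seen_hashes: set[str] = set()
    while frontier and len(seen_urls) < max_pages:
        url = frontier.popleft()                # 1. pull the next URL
        if url in seen_urls:                    # 2. skip URLs we already crawled
            continue
        seen_urls.add(url)
        host = urlparse(url).netloc
        rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        try:
            rp.read()                           # 3. robots.txt check (DNS from step 4 happens inside urlopen)
        except OSError:
            continue
        if not rp.can_fetch("*", url):
            continue
        try:
            html = request.urlopen(url, timeout=5).read().decode("utf-8", "replace")  # 5. fetch
        except OSError:
            continue
        digest = hashlib.sha256(html.encode()).hexdigest()
        if digest in seen_hashes:               # 6. skip content already seen under another URL
            continue
        seen_hashes.add(digest)
        print(f"storing {len(html)} bytes for {url}")  # 7./8. parse + store (stubbed out here)
        frontier.extend(extract_links(html, url))      # 9. enqueue referenced URLs

if __name__ == "__main__":
    crawl(["https://example.com"])
```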


Finding URLs to Crawl

Can we reduce networking calls to the frontier?

We could keep each node's URL frontier entirely local.

Local Frontier Evaluation

While a local frontier avoids the extra network call that a centralized frontier requires, it has major downsides:

  1. No load balancing.
  2. We may process duplicate URLs.

We would then require an extra network call to a centralized service just to check whether we have already processed a URL.

Distributed Frontier

Idea: to make sure that each node gets an equal amount of work, we can send URLs from one node to another.

Avoid Duplicate Fetches

We want to avoid fetching and storing the same website twice!

One solution: a database that stores the URLs we have already fetched.

This requires an extra network call every time we read from it!

Avoid Duplicate Fetches, Optimized

Route URLs x and y to the same node if x == y.

We can simply partition our nodes by a hash range of the URL.

This keeps the load per node balanced.
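
A sketch of this routing rule (the node count and hash choice are illustrative):

```python
import hashlib

NUM_NODES = 64  # illustrative cluster size

def node_for_url(url: str) -> int:
    """Identical URLs always hash to the same node, so the 'have we crawled
    this URL yet?' check can stay entirely local to that node."""
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# The same URL discovered by two different workers routes to the same node:
assert node_for_url("https://example.com/a") == node_for_url("https://example.com/a")
```

Plain modulo hashing reshuffles most URLs whenever the cluster is resized; contiguous hash ranges or consistent hashing avoid that, but the routing invariant (same URL, same node) is unchanged.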

Avoid Duplicate Content on Different Sites

Can we do something smart here to avoid extra network calls when checking whether we have already processed identical HTML?

  • We have to actually fetch the content first (expensive).
  • We can then hash it.
  • Since we don't know these hashes before fetching the URLs, there's no way to pre-route URLs onto the same node based on their contents.

Content Hash Checking

Since duplicate hashes can show up on any node, we need some sort of centralized set of hashes.

  • A 64-bit hash for 1 billion sites is up to 8 GB of hashes to store.
  • 8 GB is pretty low (if we have some beefy servers), so we could keep it in memory.

Option 1: a centralized Redis set of hashes.

  • If we always read from and write to the leader, we can maintain strong consistency (sketch below).
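
A sketch of that check with the redis-py client, assuming a single Redis leader reachable at a hypothetical hostname:

```python
import hashlib
import redis  # assumes the redis-py client and a reachable Redis leader

r = redis.Redis(host="redis-leader", port=6379)  # hypothetical leader address

def is_new_content(html: bytes) -> bool:
    """SADD returns 1 only if the member was not already in the set, so the
    membership check and the insert happen in one atomic round trip."""
    digest = hashlib.sha256(html).digest()[:8]   # 64-bit fingerprint -> ~8 GB for 1B pages
    return r.sadd("content_hashes", digest) == 1
```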


Content Hash Low Latency

If our nodes are spread across multiple data centers, reads from and writes to a single Redis leader could be slow.

Idea: keep a set CRDT (Conflict-free Replicated Data Type) on each node and perform anti-entropy in the background.

Issue: we may now process the same content multiple times.

This is tolerable as long as our downstream operations are idempotent.
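
A grow-only set (G-Set) is the simplest set CRDT that fits here; a minimal sketch of the per-node set and its anti-entropy merge:

```python
class GSet:
    """Grow-only set CRDT: local adds are instant; replicas converge by merging (set union)."""

    def __init__(self) -> None:
        self.items: set[bytes] = set()

    def add(self, content_hash: bytes) -> bool:
        """Returns True if this replica had not seen the hash before."""
        new = content_hash not in self.items
        self.items.add(content_hash)
        return new

    def merge(self, other: "GSet") -> None:
        """Anti-entropy step: union is commutative, associative, and idempotent."""
        self.items |= other.items

# Two replicas converge after exchanging state in the background:
a, b = GSet(), GSet()
a.add(b"hash-1"); b.add(b"hash-2")
a.merge(b); b.merge(a)
assert a.items == b.items
```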

Domain Name System (DNS)

DNS provides us a mapping from hostname (e.g. wikipedia.com) to IP address!

  • The full mapping is likely too large to cache on just one node.
  • Each lookup requires a network call.
  • Recall that we only need one IP address per host, so a single lookup can serve every URL on that host.

Previously, we said we would partition our URLs by a hash range of the URL.

What if, instead, we partition them by a hash range of the host?
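
A sketch of host-based routing plus the local DNS cache it enables (the node count and cache size are illustrative, and the sketch ignores DNS TTLs):

```python
import hashlib
import socket
from functools import lru_cache
from urllib.parse import urlparse

NUM_NODES = 64  # illustrative cluster size

def node_for_host(url: str) -> int:
    """Every URL of a given host lands on the same node."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

@lru_cache(maxsize=100_000)
def resolve(host: str) -> str:
    """Local DNS cache: a node only ever resolves the hosts it owns,
    so the hit rate is high and the cache stays small."""
    return socket.gethostbyname(host)
```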

Handling Robots.txt

Robots.txt tells us whether we can crawl a certain host and, if so, how frequently.

Since we are already partitioning our nodes by hostname, each node can also keep a local cache of its hosts' crawling policies.

Ex 1: Do not crawl Twitter.

-> Ignore the URL and get the next one from the frontier.

Ex 2: You can only crawl Twitter once per minute.

-> Keep a table with the last time you crawled each host.

-> Put the message back on your frontier and take the next one (no network call); see the sketch below.
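
A sketch of both cases using the standard library's robots.txt parser; the per-host policy cache and "last crawled" table are plain dicts here, and the user-agent string is made up:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "my-crawler"                                    # hypothetical user agent
robots_cache: dict[str, robotparser.RobotFileParser] = {}    # local per-host policy cache
last_crawl: dict[str, float] = {}                            # host -> last crawl timestamp

def policy_for(host: str) -> robotparser.RobotFileParser:
    if host not in robots_cache:
        rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        rp.read()
        robots_cache[host] = rp
    return robots_cache[host]

def next_action(url: str) -> str:
    """'skip'  -> Ex 1: the host forbids crawling this URL, drop it.
       'wait'  -> Ex 2: too soon, put the URL back on the local frontier.
       'fetch' -> allowed right now."""
    host = urlparse(url).netloc
    rp = policy_for(host)
    if not rp.can_fetch(USER_AGENT, url):
        return "skip"
    delay = rp.crawl_delay(USER_AGENT) or 0
    if time.time() - last_crawl.get(host, 0.0) < delay:
        return "wait"
    last_crawl[host] = time.time()
    return "fetch"
```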

Fetching the URL

Can we partition our nodes to be geographically close to the URLs they fetch?

Not really: the domain name doesn't tell us where the URL is hosted.

So we simply have to fetch the URL over the network.

Storing Results

We want to keep our write latency minimal.

-> Ideally we want data locality.

  1. We could run our nodes in the Hadoop cluster and export to HDFS.
  2. If using S3, we could try to put our nodes in the same data center as the bucket.

Pushing to the Frontier

We now have a bunch of new URLs to push!

How should we model our frontier?

What data structure should we use?

  • Stack (depth-first search): not really feasible; with many processes writing to the stack, what is "the top"?
  • Queue (breadth-first search): this works.
  • Priority queue: may be harder to implement, but useful if we prefer certain websites (sketched below).
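
If we did want to prefer certain sites, a priority frontier is only a small step up from a plain queue; a sketch using a binary heap, with an assumed per-host priority table:

```python
import heapq
import itertools
from urllib.parse import urlparse

PREFERRED_HOSTS = {"en.wikipedia.org": 0}   # assumed priority table: lower value = crawl sooner
DEFAULT_PRIORITY = 10

class PriorityFrontier:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()    # tie-breaker keeps FIFO order within one priority

    def push(self, url: str) -> None:
        priority = PREFERRED_HOSTS.get(urlparse(url).netloc, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[-1]
```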

Architecture Choices

We want reliability; we don't want to drop any websites!

-> It would be great if we didn't have to build our own partitioning solution.

  1. Frontier -> Kafka
  2. Processing nodes -> Flink

This gives us fault tolerance, no lost messages, and partitioning out of the box.
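
A sketch of pushing newly discovered URLs to a Kafka "frontier" topic with the kafka-python client; the topic name and broker address are assumptions. Keying by host lets Kafka's default partitioner keep every URL of a host in the same partition, which matches the host-based routing described earlier:

```python
from urllib.parse import urlparse
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical broker

def push_to_frontier(url: str) -> None:
    """All URLs of a host share a key, hence a partition, hence a consumer task."""
    host = urlparse(url).netloc
    producer.send("frontier", key=host.encode(), value=url.encode())
```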

All writes to S3 can be made idempotent, and we can use the nodes' internal state for caching.
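
A sketch of an idempotent result write with boto3 (the bucket name is an assumption): deriving the object key deterministically from the URL means retries and duplicate processing overwrite the same object instead of creating new ones:

```python
import hashlib
import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

def store_page(url: str, parsed_content: bytes) -> None:
    """Deterministic key: writing the same URL twice is a harmless overwrite."""
    key = hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket="crawler-results", Key=key, Body=parsed_content)
```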

Final Design

