Design Web Crawler | System Design

Problem Requirements

  1. Scrape all content currently accessible via the web.
  2. Store (and possibly index) all of that content.
  3. Respect website crawling policies (robots.txt files).
  4. Complete this process within one week.

Back of the Envelope Estimation

  1. Around 1 billion web pages.
  2. Around 1 MB to store the contents of a web page.
  3. Therefore, we need to make about 1 billion requests and store roughly 1 PB.
  4. 1 week is about 600k seconds, so 1 billion / 600k ≈ 1,700 requests per second.
  5. With perhaps 500 threads per server, that is on the order of 100 servers (see the sketch below).
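
A quick sanity check of these numbers (a rough sketch; the 1 MB page size and the thread/server counts above are assumptions, not measurements):

```python
# Back-of-envelope check of the numbers above.
PAGES = 1_000_000_000            # ~1 billion web pages
PAGE_SIZE_BYTES = 1_000_000      # ~1 MB per page (assumed average)
WEEK_SECONDS = 7 * 24 * 3600     # 604,800, rounded to ~600k above

total_storage_pb = PAGES * PAGE_SIZE_BYTES / 1e15   # bytes -> petabytes
requests_per_sec = PAGES / WEEK_SECONDS

print(f"storage: ~{total_storage_pb:.0f} PB")        # ~1 PB
print(f"throughput: ~{requests_per_sec:.0f} req/s")  # ~1,650 requests per second
```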

Process Overview

  1. Pull a URL from our "to-crawl" list.
  2. Check whether we have already crawled it.
  3. Check whether crawling it complies with its host's robots.txt file.
  4. Get the IP address of the host via DNS.
  5. Make an HTTP request to load the contents of the site.
  6. Check whether we have already processed a different URL with identical content.
  7. Parse the content.
  8. Store the results somewhere.
  9. Add any referenced URLs to our "to-crawl" list (the full loop is sketched below).
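
A minimal single-node sketch of this loop in Python; it is a toy version of the design (no politeness delays, naive regex link extraction), but each step above appears in order:

```python
import hashlib
import re
from collections import deque
from urllib import request, robotparser
from urllib.parse import urljoin, urlparse

def extract_links(html: str, base_url: str) -> list[str]:
    """Very naive link extraction; a real crawler would use an HTML parser."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls: list[str], max_pages: int = 10) -> None:
    frontier = deque(seed_urls)                 # 1. the "to-crawl" list
    seen_urls: set[str] = set()
    seen_hashes: set[str] = set()
    while frontier and len(seen_urls) < max_pages:
        url = frontier.popleft()                # 1. pull the next URL
        if url in seen_urls:                    # 2. skip URLs we already crawled
            continue
        seen_urls.add(url)
        host = urlparse(url).netloc
        rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        try:
            rp.read()                           # 3. robots.txt check (DNS from step 4 happens inside urlopen)
        except OSError:
            continue
        if not rp.can_fetch("*", url):
            continue
        try:
            html = request.urlopen(url, timeout=5).read().decode("utf-8", "replace")  # 5. fetch
        except OSError:
            continue
        digest = hashlib.sha256(html.encode()).hexdigest()
        if digest in seen_hashes:               # 6. skip content already seen under another URL
            continue
        seen_hashes.add(digest)
        print(f"storing {len(html)} bytes for {url}")  # 7./8. parse + store (stubbed out here)
        frontier.extend(extract_links(html, url))      # 9. enqueue referenced URLs

if __name__ == "__main__":
    crawl(["https://example.com"])
```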


Finding URLs to Crawl

Can we reduce networking calls to the frontier?

We could keep each node's URL frontier entirely local.

Local Frontier Evaluation

While a local frontier avoids the extra network call that a centralized frontier requires, it has major downsides:

  1. No load balancing.
  2. We may process duplicate URLs.

We would then require an extra network call to a centralized service just to check whether we have already processed a URL.

Distributed Frontier

Idea: to make sure that each node gets an equal amount of work, we can send URLs from one node to another.

Avoid Duplicate Fetches

We want to avoid fetching and storing the same website twice!

One solution: a database that stores the URLs we have already fetched.

This requires an extra network call every time we read from it!

Avoid Duplicate Fetches, Optimized

Route URLs x and y to the same node if x == y.

We can simply partition our nodes by a hash range of the URL.

This keeps the load per node balanced.
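
A sketch of this routing rule (the node count and hash choice are illustrative):

```python
import hashlib

NUM_NODES = 64  # illustrative cluster size

def node_for_url(url: str) -> int:
    """Identical URLs always hash to the same node, so the 'have we crawled
    this URL yet?' check can stay entirely local to that node."""
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# The same URL discovered by two different workers routes to the same node:
assert node_for_url("https://example.com/a") == node_for_url("https://example.com/a")
```

Plain modulo hashing reshuffles most URLs whenever the cluster is resized; contiguous hash ranges or consistent hashing avoid that, but the routing invariant (same URL, same node) is unchanged.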

Avoid Duplicate Content on Different Sites

Can we do something smart here to avoid extra network calls when checking whether we have already processed identical HTML?

  • We have to actually fetch the content first (expensive).
  • We can then hash it.
  • Since we don't know these hashes before fetching the URLs, there's no way to pre-route URLs onto the same node based on their contents.

Content Hash Checking

Since duplicate hashes can show up on any node, we need some sort of centralized set of hashes.

  • A 64-bit hash for 1 billion sites is up to 8 GB of hashes to store.
  • 8 GB is pretty low (if we have some beefy servers), so we could keep it in memory.

Option 1: a centralized Redis set of hashes.

  • If we always read from and write to the leader, we can maintain strong consistency (sketch below).
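
A sketch of that check with the redis-py client, assuming a single Redis leader reachable at a hypothetical hostname:

```python
import hashlib
import redis  # assumes the redis-py client and a reachable Redis leader

r = redis.Redis(host="redis-leader", port=6379)  # hypothetical leader address

def is_new_content(html: bytes) -> bool:
    """SADD returns 1 only if the member was not already in the set, so the
    membership check and the insert happen in one atomic round trip."""
    digest = hashlib.sha256(html).digest()[:8]   # 64-bit fingerprint -> ~8 GB for 1B pages
    return r.sadd("content_hashes", digest) == 1
```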


Content Hash Low Latency

If our nodes are spread across multiple data centers, reads from and writes to a single Redis leader could be slow.

Idea: keep a set CRDT (Conflict-free Replicated Data Type) on each node and perform anti-entropy in the background.

Issue: we may now process the same content multiple times.

This is tolerable as long as our downstream operations are idempotent.
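
A grow-only set (G-Set) is the simplest set CRDT that fits here; a minimal sketch of the per-node set and its anti-entropy merge:

```python
class GSet:
    """Grow-only set CRDT: local adds are instant; replicas converge by merging (set union)."""

    def __init__(self) -> None:
        self.items: set[bytes] = set()

    def add(self, content_hash: bytes) -> bool:
        """Returns True if this replica had not seen the hash before."""
        new = content_hash not in self.items
        self.items.add(content_hash)
        return new

    def merge(self, other: "GSet") -> None:
        """Anti-entropy step: union is commutative, associative, and idempotent."""
        self.items |= other.items

# Two replicas converge after exchanging state in the background:
a, b = GSet(), GSet()
a.add(b"hash-1"); b.add(b"hash-2")
a.merge(b); b.merge(a)
assert a.items == b.items
```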

Domain Name System (DNS)

DNS provides us a mapping from hostname (e.g. wikipedia.com) to IP address!

  • The full mapping is likely too large to cache on just one node.
  • Each lookup requires a network call.
  • Recall that we only need one IP address per host, so a single lookup can serve every URL on that host.

Previously, we said we would partition our URLs by a hash range of the URL.

What if, instead, we partition them by a hash range of the host?
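
A sketch of host-based routing plus the local DNS cache it enables (the node count and cache size are illustrative, and the sketch ignores DNS TTLs):

```python
import hashlib
import socket
from functools import lru_cache
from urllib.parse import urlparse

NUM_NODES = 64  # illustrative cluster size

def node_for_host(url: str) -> int:
    """Every URL of a given host lands on the same node."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

@lru_cache(maxsize=100_000)
def resolve(host: str) -> str:
    """Local DNS cache: a node only ever resolves the hosts it owns,
    so the hit rate is high and the cache stays small."""
    return socket.gethostbyname(host)
```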

Handling Robots.txt

Robots.txt tells us whether we can crawl a certain host and, if so, how frequently.

Since we are already partitioning our nodes by hostname, each node can also keep a local cache of its hosts' crawling policies.

Ex 1: Do not crawl Twitter.

-> Ignore the URL and get the next one from the frontier.

Ex 2: You can only crawl Twitter once per minute.

-> Keep a table with the last time you crawled each host.

-> Put the message back on your frontier and take the next one (no network call); see the sketch below.
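
A sketch of both cases using the standard library's robots.txt parser; the per-host policy cache and "last crawled" table are plain dicts here, and the user-agent string is made up:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

USER_AGENT = "my-crawler"                                    # hypothetical user agent
robots_cache: dict[str, robotparser.RobotFileParser] = {}    # local per-host policy cache
last_crawl: dict[str, float] = {}                            # host -> last crawl timestamp

def policy_for(host: str) -> robotparser.RobotFileParser:
    if host not in robots_cache:
        rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
        rp.read()
        robots_cache[host] = rp
    return robots_cache[host]

def next_action(url: str) -> str:
    """'skip'  -> Ex 1: the host forbids crawling this URL, drop it.
       'wait'  -> Ex 2: too soon, put the URL back on the local frontier.
       'fetch' -> allowed right now."""
    host = urlparse(url).netloc
    rp = policy_for(host)
    if not rp.can_fetch(USER_AGENT, url):
        return "skip"
    delay = rp.crawl_delay(USER_AGENT) or 0
    if time.time() - last_crawl.get(host, 0.0) < delay:
        return "wait"
    last_crawl[host] = time.time()
    return "fetch"
```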

Fetching the URL

Can we partition our nodes to be geographically close to the URLs they fetch?

Not really: the domain name doesn't tell us where the URL is hosted.

So we simply have to fetch the URL over the network.

Storing Results

We want to keep our write latency minimal.

-> Ideally we want data locality.

  1. We could run our nodes in the Hadoop cluster and export to HDFS.
  2. If using S3, we could try to put our nodes in the same data center as the bucket.

Pushing to the Frontier

We now have a bunch of new URLs to push!

How should we model our frontier?

What data structure should we use?

  • Stack (depth-first search): not really feasible; with many processes writing to the stack, what is "the top"?
  • Queue (breadth-first search): this works.
  • Priority queue: may be harder to implement, but useful if we prefer certain websites (sketched below).
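
If we did want to prefer certain sites, a priority frontier is only a small step up from a plain queue; a sketch using a binary heap, with an assumed per-host priority table:

```python
import heapq
import itertools
from urllib.parse import urlparse

PREFERRED_HOSTS = {"en.wikipedia.org": 0}   # assumed priority table: lower value = crawl sooner
DEFAULT_PRIORITY = 10

class PriorityFrontier:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()    # tie-breaker keeps FIFO order within one priority

    def push(self, url: str) -> None:
        priority = PREFERRED_HOSTS.get(urlparse(url).netloc, DEFAULT_PRIORITY)
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[-1]
```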

Architecture Choices

We want reliability; we don't want to drop any websites!

-> It would be great if we didn't have to build our own partitioning solution.

  1. Frontier -> Kafka
  2. Processing nodes -> Flink

This gives us fault tolerance, no lost messages, and partitioning out of the box.
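
A sketch of pushing newly discovered URLs to a Kafka "frontier" topic with the kafka-python client; the topic name and broker address are assumptions. Keying by host lets Kafka's default partitioner keep every URL of a host in the same partition, which matches the host-based routing described earlier:

```python
from urllib.parse import urlparse
from kafka import KafkaProducer  # assumes the kafka-python client

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical broker

def push_to_frontier(url: str) -> None:
    """All URLs of a host share a key, hence a partition, hence a consumer task."""
    host = urlparse(url).netloc
    producer.send("frontier", key=host.encode(), value=url.encode())
```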

All writes to S3 can be made idempotent, and we can use the nodes' internal state for caching.
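
A sketch of an idempotent result write with boto3 (the bucket name is an assumption): deriving the object key deterministically from the URL means retries and duplicate processing overwrite the same object instead of creating new ones:

```python
import hashlib
import boto3  # assumes AWS credentials are configured

s3 = boto3.client("s3")

def store_page(url: str, parsed_content: bytes) -> None:
    """Deterministic key: writing the same URL twice is a harmless overwrite."""
    key = hashlib.sha256(url.encode()).hexdigest()
    s3.put_object(Bucket="crawler-results", Key=key, Body=parsed_content)
```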

Final Design

