Design a large-scale web crawler similar to Googlebot that systematically browses the web, downloading and storing page content for later indexing by a search engine. The crawler starts with seed URLs, fetches pages, extracts hyperlinks, and recursively follows them — all while respecting politeness policies, avoiding traps, and scaling to billions of pages.
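At its core the crawl is a loop over a URL frontier. The deliberately naive single-machine sketch below (standard library only; the `crawl` function and `LinkExtractor` class are illustrative names, not part of the problem statement) shows that cycle; it ignores politeness, robots.txt, and dedup at scale, which the requirements below add.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_urls: list[str], max_pages: int = 100) -> dict[str, str]:
    """Single-machine BFS crawl: fetch, store, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    store: dict[str, str] = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip pages that fail to download
        store[url] = html                 # real system: compress and write to blob storage
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)     # resolve relative links against the current page
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

# Example: crawl(["https://example.com"], max_pages=10)
```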
| Metric | Value |
|---|---|
| Total web pages to crawl | 5 billion (initial crawl cycle) |
| Crawl rate (target) | 1 billion pages / day (~12,000 pages/sec) |
| Average page size | 100 KB (HTML) |
| Data downloaded per day | 1B × 100KB = 100 TB / day |
| Stored content (compressed 5×) | ~20 TB / day → 600 TB / month |
| Unique domains | ~200 million |
| URLs in frontier | billions (disk-backed queue) |
| Bloom filter for URL dedup | 5B URLs × 10 bits ≈ 6 GB (1% FPR) |
| Crawler workers | 500–1,000 machines |
| DNS lookups | ~12,000/sec (one per page fetch); a per-domain DNS cache keeps queries to upstream resolvers far lower |
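The table's figures follow from simple arithmetic; the short Python sketch below (constants copied straight from the table) just makes the calculation explicit.

```python
# Back-of-envelope sizing; constants copied from the estimates table above.
PAGES_PER_DAY = 1_000_000_000        # target crawl rate
AVG_PAGE_BYTES = 100_000             # ~100 KB of HTML per page
COMPRESSION_RATIO = 5                # stored content compressed ~5x
TOTAL_URLS = 5_000_000_000           # URLs seen in one crawl cycle
BLOOM_BITS_PER_URL = 10              # ~1% false-positive rate

pages_per_sec = PAGES_PER_DAY / 86_400
downloaded_tb_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12
stored_tb_per_day = downloaded_tb_per_day / COMPRESSION_RATIO
bloom_filter_gb = TOTAL_URLS * BLOOM_BITS_PER_URL / 8 / 1e9

print(f"crawl rate:      {pages_per_sec:,.0f} pages/sec")    # ~11,574
print(f"downloaded/day:  {downloaded_tb_per_day:.0f} TB")    # 100 TB
print(f"stored/day:      {stored_tb_per_day:.0f} TB")        # 20 TB
print(f"bloom filter:    {bloom_filter_gb:.2f} GB")          # 6.25 GB
```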
- Given a set of seed URLs, crawl the web by recursively following hyperlinks discovered on each page
- Download and store the HTML content of each crawled page for later indexing/processing
- Avoid crawling the same URL twice within a crawl cycle (URL deduplication; see the dedup and partitioning sketch after this list)
- Respect robots.txt: parse each domain's robots.txt to honour Disallow rules and Crawl-delay directives (see the politeness sketch below)
- Politeness: rate-limit requests per domain to avoid overloading web servers (e.g., max 1 request per second per domain)
- Prioritise crawling: important/popular pages should be crawled before less important ones (priority queue based on PageRank, domain authority, or freshness; see the frontier sketch below)
- Handle incremental re-crawling: detect content changes since the last crawl and re-crawl pages whose content may have changed (freshness-based scheduling)
- Resolve and normalise URLs: handle redirects (301/302), relative URLs, URL fragments, query parameter ordering, and canonicalisation (see the canonicalisation sketch below)
- Detect and avoid crawler traps: infinite URL spaces such as calendar pages, session IDs in URLs, and query parameter permutations
- Support distributed crawling: scale to billions of pages by distributing crawl work across hundreds of worker nodes
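The sketches that follow are minimal, in-memory Python illustrations of how a few of these requirements could be approached, not production designs; every class, constant, and helper name in them is an assumption for illustration. First, URL canonicalisation with some crude trap avoidance: relative links are resolved, fragments dropped, query parameters sorted, and a hypothetical `TRAP_PARAMS` blocklist strips session/tracking parameters that tend to create infinite URL spaces.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that commonly explode the URL space (session IDs, tracking tags).
# Illustrative list; a real crawler would curate/learn this per site.
TRAP_PARAMS = {"sessionid", "phpsessid", "jsessionid", "sid",
               "utm_source", "utm_medium", "utm_campaign"}

def canonicalise(base_url: str, href: str) -> str | None:
    """Resolve a link found on base_url into a canonical absolute URL, or None to skip it."""
    absolute = urljoin(base_url, href)           # handle relative URLs
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    if scheme not in ("http", "https"):
        return None                              # skip mailto:, javascript:, etc.
    netloc = netloc.lower()
    if netloc.endswith(":80") or netloc.endswith(":443"):
        netloc = netloc.rsplit(":", 1)[0]        # drop default ports
    path = path or "/"
    # Sort params and drop session/tracking params to collapse duplicates and avoid traps.
    params = sorted((k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                    if k.lower() not in TRAP_PARAMS)
    if len(params) > 10:
        return None                              # crude trap heuristic: too many query params
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# Example:
# canonicalise("https://Example.com/a/", "../b?z=2&a=1&sessionid=xyz#top")
# -> "https://example.com/b?a=1&z=2"
```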
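For deduplication and distribution, one common combination (assumed here, not mandated by the requirements) is a Bloom filter at roughly 10 bits per URL plus hashing each domain to a fixed worker, so that a domain's politeness state lives on a single machine. The `BloomFilter` below is a toy in-memory version; at 5 billion URLs it would be sharded or disk-backed.

```python
import hashlib
from urllib.parse import urlsplit

class BloomFilter:
    """Toy in-memory Bloom filter; a real crawler would use a library or a disk-backed variant."""
    def __init__(self, num_bits: int, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Double hashing: h1 + i*h2 yields k reasonably independent bit positions.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def worker_for(url: str, num_workers: int) -> int:
    """Assign each *domain* to a fixed worker so per-domain politeness state stays local."""
    domain = urlsplit(url).netloc.lower()
    return int(hashlib.md5(domain.encode()).hexdigest(), 16) % num_workers

# Usage sketch: 10 bits/URL gives ~1% FPR; sized for 1M URLs here to keep the toy small.
seen = BloomFilter(num_bits=10 * 1_000_000)
url = "https://example.com/page"
if not seen.might_contain(url):
    seen.add(url)
    print("enqueue on worker", worker_for(url, num_workers=512))
```

A modulo hash is the simplest choice; consistent hashing would reduce reshuffling when workers are added or removed.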
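A per-domain politeness gate can combine robots.txt rules with a crawl delay. The sketch uses Python's standard `urllib.robotparser`; the 1-second default delay and the `PolitenessGate` class itself are illustrative assumptions.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0   # assumed default: at most 1 request/sec per domain

class PolitenessGate:
    """Per-domain gate: robots.txt rules plus the earliest time the domain may be hit again."""

    def __init__(self, user_agent: str = "ExampleCrawlerBot"):
        self.user_agent = user_agent
        self.robots: dict[str, RobotFileParser] = {}
        self.next_allowed_at: dict[str, float] = {}

    def _robots_for(self, scheme: str, domain: str) -> RobotFileParser:
        if domain not in self.robots:
            parser = RobotFileParser(f"{scheme}://{domain}/robots.txt")
            try:
                parser.read()          # fetch and parse robots.txt once per domain
            except OSError:
                pass                   # robots.txt unreachable: parser stays unread, can_fetch() denies
            self.robots[domain] = parser
        return self.robots[domain]

    def can_fetch_now(self, url: str) -> bool:
        parts = urlsplit(url)
        domain = parts.netloc.lower()
        robots = self._robots_for(parts.scheme, domain)
        if not robots.can_fetch(self.user_agent, url):
            return False               # blocked by a Disallow rule (or robots.txt not yet readable)
        if time.monotonic() < self.next_allowed_at.get(domain, 0.0):
            return False               # too soon for this domain; the caller should re-enqueue the URL
        delay = robots.crawl_delay(self.user_agent) or DEFAULT_DELAY_SECONDS
        self.next_allowed_at[domain] = time.monotonic() + delay
        return True
```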
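Prioritisation and freshness-based re-crawling can share one frontier ordering, for example a min-heap keyed on (next crawl time, negative priority). The `Frontier` class and the interval doubling/halving heuristic below are assumptions sketched for illustration; a real frontier would be disk-backed and partitioned per worker.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class FrontierEntry:
    next_crawl_at: float                 # earliest time this URL should be (re)crawled
    neg_priority: float                  # negated so higher-priority URLs sort first at ties
    url: str = field(compare=False)

class Frontier:
    """In-memory sketch; the real frontier is a disk-backed, partitioned queue."""
    def __init__(self):
        self.heap: list[FrontierEntry] = []

    def add(self, url: str, priority: float, next_crawl_at: float | None = None) -> None:
        when = next_crawl_at if next_crawl_at is not None else time.time()
        heapq.heappush(self.heap, FrontierEntry(when, -priority, url))

    def pop_ready(self) -> str | None:
        """Return the highest-priority URL whose crawl time has arrived, if any."""
        if self.heap and self.heap[0].next_crawl_at <= time.time():
            return heapq.heappop(self.heap).url
        return None

def next_interval(old_interval: float, content_changed: bool) -> float:
    """Freshness heuristic: re-crawl changing pages more often, stable pages less often."""
    return max(3600, old_interval / 2) if content_changed else min(30 * 86400, old_interval * 2)

# After each fetch, reschedule the page based on whether its content hash changed:
# frontier.add(url, priority=pagerank, next_crawl_at=time.time() + next_interval(old, changed))
```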
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements; these will guide your deep dives later.

- Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
- Frame NFRs for this specific system: 'low latency search under 100ms' is far more valuable than just 'low latency'.
- Add concrete numbers ('P99 response time < 500ms', '99.9% availability', '10M DAU'); this drives architectural decisions.
- Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?