Design a large-scale web crawler similar to Googlebot that systematically browses the web, downloading and storing page content for later indexing by a search engine. The crawler starts with seed URLs, fetches pages, extracts hyperlinks, and recursively follows them — all while respecting politeness policies, avoiding traps, and scaling to billions of pages.
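At its core the crawl is a loop over a URL frontier. The deliberately naive single-machine sketch below (standard library only; the `crawl` function and `LinkExtractor` class are illustrative names, not part of the problem statement) shows that cycle; it ignores politeness, robots.txt, and dedup at scale, which the requirements below add.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_urls: list[str], max_pages: int = 100) -> dict[str, str]:
    """Single-machine BFS crawl: fetch, store, extract links, enqueue unseen URLs."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    store: dict[str, str] = {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip pages that fail to download
        store[url] = html                 # real system: compress and write to blob storage
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)     # resolve relative links against the current page
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

# Example: crawl(["https://example.com"], max_pages=10)
```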
| Metric | Value |
|---|---|
| Total web pages to crawl | 5 billion (initial crawl cycle) |
| Crawl rate (target) | 1 billion pages / day (~12,000 pages/sec) |
| Average page size | 100 KB (HTML) |
| Data downloaded per day | 1B × 100KB = 100 TB / day |
| Stored content (compressed 5×) | ~20 TB / day → 600 TB / month |
| Unique domains | ~200 million |
| URLs in frontier | billions (disk-backed queue) |
| Bloom filter for URL dedup | 5B URLs × 10 bits ≈ 6 GB (1% FPR) |
| Crawler workers | 500–1,000 machines |
| DNS lookups | ~12,000/sec (one per page fetch); a per-domain DNS cache keeps queries to upstream resolvers far lower |
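The table's figures follow from simple arithmetic; the short Python sketch below (constants copied straight from the table) just makes the calculation explicit.

```python
# Back-of-envelope sizing; constants copied from the estimates table above.
PAGES_PER_DAY = 1_000_000_000        # target crawl rate
AVG_PAGE_BYTES = 100_000             # ~100 KB of HTML per page
COMPRESSION_RATIO = 5                # stored content compressed ~5x
TOTAL_URLS = 5_000_000_000           # URLs seen in one crawl cycle
BLOOM_BITS_PER_URL = 10              # ~1% false-positive rate

pages_per_sec = PAGES_PER_DAY / 86_400
downloaded_tb_per_day = PAGES_PER_DAY * AVG_PAGE_BYTES / 1e12
stored_tb_per_day = downloaded_tb_per_day / COMPRESSION_RATIO
bloom_filter_gb = TOTAL_URLS * BLOOM_BITS_PER_URL / 8 / 1e9

print(f"crawl rate:      {pages_per_sec:,.0f} pages/sec")    # ~11,574
print(f"downloaded/day:  {downloaded_tb_per_day:.0f} TB")    # 100 TB
print(f"stored/day:      {stored_tb_per_day:.0f} TB")        # 20 TB
print(f"bloom filter:    {bloom_filter_gb:.2f} GB")          # 6.25 GB
```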
- Given a set of seed URLs, crawl the web by recursively following hyperlinks discovered on each page
- Download and store the HTML content of each crawled page for later indexing/processing
- Avoid crawling the same URL twice within a crawl cycle (URL deduplication; see the dedup and partitioning sketch after this list)
- Respect robots.txt: parse each domain's robots.txt to honour Disallow rules and Crawl-delay directives (see the politeness sketch below)
- Politeness: rate-limit requests per domain to avoid overloading web servers (e.g., max 1 request per second per domain)
- Prioritise crawling: important/popular pages should be crawled before less important ones (priority queue based on PageRank, domain authority, or freshness; see the frontier sketch below)
- Handle incremental re-crawling: detect content changes since the last crawl and re-crawl pages whose content may have changed (freshness-based scheduling)
- Resolve and normalise URLs: handle redirects (301/302), relative URLs, URL fragments, query parameter ordering, and canonicalisation (see the canonicalisation sketch below)
- Detect and avoid crawler traps: infinite URL spaces such as calendar pages, session IDs in URLs, and query parameter permutations
- Support distributed crawling: scale to billions of pages by distributing crawl work across hundreds of worker nodes
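The sketches that follow are minimal, in-memory Python illustrations of how a few of these requirements could be approached, not production designs; every class, constant, and helper name in them is an assumption for illustration. First, URL canonicalisation with some crude trap avoidance: relative links are resolved, fragments dropped, query parameters sorted, and a hypothetical `TRAP_PARAMS` blocklist strips session/tracking parameters that tend to create infinite URL spaces.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that commonly explode the URL space (session IDs, tracking tags).
# Illustrative list; a real crawler would curate/learn this per site.
TRAP_PARAMS = {"sessionid", "phpsessid", "jsessionid", "sid",
               "utm_source", "utm_medium", "utm_campaign"}

def canonicalise(base_url: str, href: str) -> str | None:
    """Resolve a link found on base_url into a canonical absolute URL, or None to skip it."""
    absolute = urljoin(base_url, href)           # handle relative URLs
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    if scheme not in ("http", "https"):
        return None                              # skip mailto:, javascript:, etc.
    netloc = netloc.lower()
    if netloc.endswith(":80") or netloc.endswith(":443"):
        netloc = netloc.rsplit(":", 1)[0]        # drop default ports
    path = path or "/"
    # Sort params and drop session/tracking params to collapse duplicates and avoid traps.
    params = sorted((k, v) for k, v in parse_qsl(query, keep_blank_values=True)
                    if k.lower() not in TRAP_PARAMS)
    if len(params) > 10:
        return None                              # crude trap heuristic: too many query params
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

# Example:
# canonicalise("https://Example.com/a/", "../b?z=2&a=1&sessionid=xyz#top")
# -> "https://example.com/b?a=1&z=2"
```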
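For deduplication and distribution, one common combination (assumed here, not mandated by the requirements) is a Bloom filter at roughly 10 bits per URL plus hashing each domain to a fixed worker, so that a domain's politeness state lives on a single machine. The `BloomFilter` below is a toy in-memory version; at 5 billion URLs it would be sharded or disk-backed.

```python
import hashlib
from urllib.parse import urlsplit

class BloomFilter:
    """Toy in-memory Bloom filter; a real crawler would use a library or a disk-backed variant."""
    def __init__(self, num_bits: int, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        # Double hashing: h1 + i*h2 yields k reasonably independent bit positions.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def worker_for(url: str, num_workers: int) -> int:
    """Assign each *domain* to a fixed worker so per-domain politeness state stays local."""
    domain = urlsplit(url).netloc.lower()
    return int(hashlib.md5(domain.encode()).hexdigest(), 16) % num_workers

# Usage sketch: 10 bits/URL gives ~1% FPR; sized for 1M URLs here to keep the toy small.
seen = BloomFilter(num_bits=10 * 1_000_000)
url = "https://example.com/page"
if not seen.might_contain(url):
    seen.add(url)
    print("enqueue on worker", worker_for(url, num_workers=512))
```

A modulo hash is the simplest choice; consistent hashing would reduce reshuffling when workers are added or removed.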
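A per-domain politeness gate can combine robots.txt rules with a crawl delay. The sketch uses Python's standard `urllib.robotparser`; the 1-second default delay and the `PolitenessGate` class itself are illustrative assumptions.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY_SECONDS = 1.0   # assumed default: at most 1 request/sec per domain

class PolitenessGate:
    """Per-domain gate: robots.txt rules plus the earliest time the domain may be hit again."""

    def __init__(self, user_agent: str = "ExampleCrawlerBot"):
        self.user_agent = user_agent
        self.robots: dict[str, RobotFileParser] = {}
        self.next_allowed_at: dict[str, float] = {}

    def _robots_for(self, scheme: str, domain: str) -> RobotFileParser:
        if domain not in self.robots:
            parser = RobotFileParser(f"{scheme}://{domain}/robots.txt")
            try:
                parser.read()          # fetch and parse robots.txt once per domain
            except OSError:
                pass                   # robots.txt unreachable: parser stays unread, can_fetch() denies
            self.robots[domain] = parser
        return self.robots[domain]

    def can_fetch_now(self, url: str) -> bool:
        parts = urlsplit(url)
        domain = parts.netloc.lower()
        robots = self._robots_for(parts.scheme, domain)
        if not robots.can_fetch(self.user_agent, url):
            return False               # blocked by a Disallow rule (or robots.txt not yet readable)
        if time.monotonic() < self.next_allowed_at.get(domain, 0.0):
            return False               # too soon for this domain; the caller should re-enqueue the URL
        delay = robots.crawl_delay(self.user_agent) or DEFAULT_DELAY_SECONDS
        self.next_allowed_at[domain] = time.monotonic() + delay
        return True
```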
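Prioritisation and freshness-based re-crawling can share one frontier ordering, for example a min-heap keyed on (next crawl time, negative priority). The `Frontier` class and the interval doubling/halving heuristic below are assumptions sketched for illustration; a real frontier would be disk-backed and partitioned per worker.

```python
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class FrontierEntry:
    next_crawl_at: float                 # earliest time this URL should be (re)crawled
    neg_priority: float                  # negated so higher-priority URLs sort first at ties
    url: str = field(compare=False)

class Frontier:
    """In-memory sketch; the real frontier is a disk-backed, partitioned queue."""
    def __init__(self):
        self.heap: list[FrontierEntry] = []

    def add(self, url: str, priority: float, next_crawl_at: float | None = None) -> None:
        when = next_crawl_at if next_crawl_at is not None else time.time()
        heapq.heappush(self.heap, FrontierEntry(when, -priority, url))

    def pop_ready(self) -> str | None:
        """Return the highest-priority URL whose crawl time has arrived, if any."""
        if self.heap and self.heap[0].next_crawl_at <= time.time():
            return heapq.heappop(self.heap).url
        return None

def next_interval(old_interval: float, content_changed: bool) -> float:
    """Freshness heuristic: re-crawl changing pages more often, stable pages less often."""
    return max(3600, old_interval / 2) if content_changed else min(30 * 86400, old_interval * 2)

# After each fetch, reschedule the page based on whether its content hash changed:
# frontier.add(url, priority=pagerank, next_crawl_at=time.time() + next_interval(old, changed))
```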
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements; these will guide your deep dives later.

- Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
- Frame NFRs for this specific system: 'low latency search under 100ms' is far more valuable than just 'low latency'.
- Add concrete numbers ('P99 response time < 500ms', '99.9% availability', '10M DAU'); this drives architectural decisions.
- Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?