A single machine, no matter how powerful, cannot crawl the web. With 50+ billion pages, petabytes of content, and the need to process thousands of pages per second, web crawling is inherently a distributed systems problem.
But distributing a crawler isn't simply a matter of running multiple instances. The challenges compound: politeness must now be enforced across machines, work must be partitioned without duplication, and component failures become routine rather than exceptional.
This page explores the architectural patterns that make distributed crawling possible at scale.
By the end of this page, you will understand: (1) Distributed crawler architecture patterns, (2) Work distribution and partitioning strategies, (3) Worker design and lifecycle management, (4) Coordination mechanisms for distributed politeness, (5) Fault tolerance and failure recovery, and (6) Operational concerns including monitoring and scaling.
A production distributed crawler consists of several interacting components, each scaled independently to meet demand.
Core components: a coordinator service, a partitioned URL frontier, stateless crawler workers, a processing pipeline (link extraction, normalization, deduplication), and shared services for DNS, robots.txt, and content storage.
Complete distributed architecture:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED WEB CRAWLER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ COORDINATOR SERVICE │ │
│ │ │ │
│ │ • Campaign management (start/stop/pause crawls) │ │
│ │ • Worker registration and health monitoring │ │
│ │ • Global rate limiting and quota management │ │
│ │ • Metrics aggregation and alerting │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Coordination │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ URL FRONTIER (Distributed) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Partition 0 │ │ Partition 1 │ │ Partition N │ │ │
│ │ │ (domains │ │ (domains │ │ (domains │ │ │
│ │ │ hash % N=0) │ │ hash % N=1) │ │ hash%N=N-1) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ │ Fetch tasks │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CRAWLER WORKERS │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ Fetch loop │ │ Fetch loop │ │ Fetch loop │ │ Fetch loop │ │ │
│ │ │ DNS cache │ │ DNS cache │ │ DNS cache │ │ DNS cache │ │ │
│ │ │ HTTP client │ │ HTTP client │ │ HTTP client │ │ HTTP client │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Fetched content │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ PROCESSING PIPELINE │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Message │───▶│ Link │───▶│ URL │───▶│ Content │ │ │
│ │ │ Queue │ │ Extraction │ │ Normalizer │ │ Store │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ │ │ │ │
│ │ │ Discovered URLs │ │
│ │ ▼ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Deduplication │──────▶ URL Frontier │ │
│ │ └────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SHARED SERVICES │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐ │ │
│ │ │ DNS Resolver │ │ Robots.txt │ │ Content Store │ │ │
│ │ │ Service │ │ Service │ │ (Object Storage/HDFS) │ │ │
│ │ │ (cached, │ │ (cached, │ │ │ │ │
│ │ │ high-perf) │ │ per-domain) │ │ │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Notice how the architecture separates I/O-bound work (fetching) from CPU-bound work (parsing, extraction). This allows independent scaling: add more workers for higher fetch throughput, more processing nodes for extraction bottlenecks. Each component is stateless where possible, with state pushed to shared services (frontier, content store).
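The frontier diagram's "hash % N" partition boxes can be made concrete with a few lines. This is a sketch (the function name is illustrative); it uses MD5 for hashing, matching the consistent-hashing code later on this page:

```typescript
import crypto from 'crypto';

// Map a domain to a frontier partition, as in the "hash % N" boxes above.
// All URLs for one domain land in one partition, preserving domain affinity.
function frontierPartition(domain: string, numPartitions: number): number {
  const digest = crypto.createHash('md5').update(domain).digest();
  // Interpret the first 4 bytes as an unsigned 32-bit integer
  return digest.readUInt32BE(0) % numPartitions;
}

console.log(frontierPartition('example.com', 16)); // Deterministic value in [0, 16)
```

Because the partition depends only on the domain, every URL discovered for `example.com` routes to the same partition regardless of which worker discovered it.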
How should work be divided among crawler workers? The distribution strategy directly impacts politeness, efficiency, and fault tolerance.
Key requirements: all URLs for a domain should route to the same worker (preserving politeness state and robots.txt cache locality), load should stay roughly balanced across workers, and adding or removing a worker should reassign as little work as possible. Consistent hashing with virtual nodes meets all three:
```typescript
import crypto from 'crypto';

class ConsistentHashRing {
  private ring: Map<number, string> = new Map(); // position -> workerId
  private sortedPositions: number[] = [];
  private virtualNodes: number;

  constructor(virtualNodes: number = 150) {
    this.virtualNodes = virtualNodes;
  }

  /**
   * Add a worker to the ring
   */
  public addWorker(workerId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const position = this.hash(`${workerId}:${i}`);
      this.ring.set(position, workerId);
      this.sortedPositions.push(position);
    }
    this.sortedPositions.sort((a, b) => a - b);
  }

  /**
   * Remove a worker from the ring
   */
  public removeWorker(workerId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const position = this.hash(`${workerId}:${i}`);
      this.ring.delete(position);
    }
    this.sortedPositions = this.sortedPositions.filter(
      pos => this.ring.has(pos)
    );
  }

  /**
   * Get the worker responsible for a domain
   */
  public getWorker(domain: string): string | null {
    if (this.sortedPositions.length === 0) {
      return null;
    }

    const domainPosition = this.hash(domain);

    // Find first position >= domainPosition
    let idx = this.binarySearch(domainPosition);

    // Wrap around if necessary
    if (idx >= this.sortedPositions.length) {
      idx = 0;
    }

    return this.ring.get(this.sortedPositions[idx]) || null;
  }

  /**
   * Get all domains that would be reassigned if a worker is added
   * Useful for planning capacity changes
   */
  public getReassignedDomains(
    newWorkerId: string,
    allDomains: string[]
  ): string[] {
    const beforeAssignments = new Map<string, string>();
    for (const domain of allDomains) {
      beforeAssignments.set(domain, this.getWorker(domain) || '');
    }

    this.addWorker(newWorkerId);

    const reassigned: string[] = [];
    for (const domain of allDomains) {
      const afterWorker = this.getWorker(domain);
      if (afterWorker !== beforeAssignments.get(domain)) {
        reassigned.push(domain);
      }
    }

    this.removeWorker(newWorkerId); // Rollback
    return reassigned;
  }

  private hash(key: string): number {
    const hash = crypto.createHash('md5').update(key).digest();
    // Use first 4 bytes as a 32-bit integer
    return hash.readUInt32BE(0);
  }

  private binarySearch(target: number): number {
    let left = 0;
    let right = this.sortedPositions.length;
    while (left < right) {
      const mid = Math.floor((left + right) / 2);
      if (this.sortedPositions[mid] < target) {
        left = mid + 1;
      } else {
        right = mid;
      }
    }
    return left;
  }
}

// Usage example
const ring = new ConsistentHashRing(100);

ring.addWorker('worker-1');
ring.addWorker('worker-2');
ring.addWorker('worker-3');

console.log('example.com ->', ring.getWorker('example.com'));
console.log('google.com ->', ring.getWorker('google.com'));
console.log('amazon.com ->', ring.getWorker('amazon.com'));

// When adding a new worker, only ~1/4 of domains get reassigned
ring.addWorker('worker-4');
console.log('After adding worker-4:');
console.log('example.com ->', ring.getWorker('example.com')); // May or may not change
```

For most production crawlers, consistent hashing provides the foundation (domain affinity for politeness), with limited work stealing for load balancing. Workers primarily process their assigned partitions but can steal from overloaded partitions when idle. This hybrid approach combines the benefits of both strategies.
Crawler workers are the workhorses of the system—they perform the actual HTTP requests. Designing workers for high throughput while minimizing resource usage is critical.
Worker design principles: keep workers stateless (all durable state lives in the frontier and content store), drive many concurrent fetches per worker with asynchronous I/O, bound concurrency to protect memory and sockets, batch frontier requests to amortize coordination cost, and let the frontier own retry logic so a crashed worker loses nothing permanently.
```typescript
interface CrawlTask {
  url: string;
  domain: string;
  priority: number;
  attempt: number;
  maxAttempts: number;
}

interface CrawlResult {
  url: string;
  statusCode: number;
  content: Buffer | null;
  headers: Record<string, string>;
  error: string | null;
  fetchTimeMs: number;
  contentLength: number;
}

class CrawlerWorker {
  private workerId: string;
  private frontierClient: FrontierClient;
  private robotsClient: RobotsClient;
  private dnsClient: DNSClient;
  private contentStore: ContentStore;
  private httpClient: HttpClient;
  private politenessScheduler: PolitenessScheduler;
  private running: boolean = false;
  private activeTasks: Map<string, CrawlTask> = new Map();
  private maxConcurrentTasks: number = 100;
  private metrics: WorkerMetrics;

  constructor(config: WorkerConfig) {
    this.workerId = config.workerId;
    this.frontierClient = new FrontierClient(config.frontierEndpoints);
    this.robotsClient = new RobotsClient(config.robotsEndpoint);
    this.dnsClient = new DNSClient(config.dnsEndpoint);
    this.contentStore = new ContentStore(config.contentStoreEndpoint);
    this.httpClient = new HttpClient({
      timeout: 30000,
      maxConnections: 200,
      keepAlive: true
    });
    this.politenessScheduler = new PolitenessScheduler();
    this.metrics = new WorkerMetrics(this.workerId);
  }

  /**
   * Main worker loop
   */
  public async start(): Promise<void> {
    this.running = true;
    console.log(`Worker ${this.workerId} started`);

    // Start multiple fetch loops to maintain concurrency
    const numFetchers = 10;
    const fetchers = [];
    for (let i = 0; i < numFetchers; i++) {
      fetchers.push(this.fetchLoop(i));
    }

    // Wait for all fetchers (they run until shutdown)
    await Promise.all(fetchers);
    console.log(`Worker ${this.workerId} stopped`);
  }

  public async stop(): Promise<void> {
    this.running = false;
    // Wait for active tasks to complete
    while (this.activeTasks.size > 0) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }

  private async fetchLoop(fetcherId: number): Promise<void> {
    while (this.running) {
      try {
        // Check if we have capacity
        if (this.activeTasks.size >= this.maxConcurrentTasks) {
          await new Promise(resolve => setTimeout(resolve, 50));
          continue;
        }

        // Get next task from frontier
        const task = await this.getNextTask();
        if (!task) {
          // No work available, back off
          await new Promise(resolve => setTimeout(resolve, 100));
          continue;
        }

        // Process task (don't await - run concurrently)
        this.processTask(task).catch(err => {
          console.error(`Error processing ${task.url}:`, err);
          this.metrics.recordError(task.domain, 'processing_error');
        });
      } catch (err) {
        console.error(`Fetcher ${fetcherId} error:`, err);
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }
  }

  private async getNextTask(): Promise<CrawlTask | null> {
    // Request batch of tasks from frontier for efficiency
    const tasks = await this.frontierClient.getNextURLs(
      this.workerId,
      10 // Batch size
    );

    // Find first task we can crawl now (respecting politeness)
    for (const task of tasks) {
      const canCrawl = await this.politenessScheduler.canCrawl(task.domain);
      if (canCrawl) {
        // Return remaining tasks to frontier
        const remaining = tasks.filter(t => t !== task);
        if (remaining.length > 0) {
          await this.frontierClient.returnTasks(remaining);
        }
        return task;
      }
    }

    // No tasks are ready now (all domains on cooldown)
    if (tasks.length > 0) {
      await this.frontierClient.returnTasks(tasks);
    }
    return null;
  }

  private async processTask(task: CrawlTask): Promise<void> {
    const taskId = `${task.url}-${Date.now()}`;
    this.activeTasks.set(taskId, task);
    const startTime = Date.now();

    try {
      // 1. Check robots.txt
      const robotsAllowed = await this.robotsClient.isAllowed(
        task.domain,
        new URL(task.url).pathname
      );
      if (!robotsAllowed) {
        this.metrics.recordRobotsBlocked(task.domain);
        await this.frontierClient.markComplete(task.url, 'robots_blocked');
        return;
      }

      // 2. Resolve DNS
      const ip = await this.dnsClient.resolve(task.domain);
      if (!ip) {
        this.metrics.recordError(task.domain, 'dns_failure');
        await this.handleRetry(task, 'dns_failure');
        return;
      }

      // 3. Wait for politeness (if needed)
      const waitTime = await this.politenessScheduler.getWaitTime(task.domain);
      if (waitTime > 0) {
        await new Promise(resolve => setTimeout(resolve, waitTime));
      }

      // 4. Fetch the page
      const result = await this.fetch(task, ip);

      // 5. Record politeness (must happen after fetch)
      await this.politenessScheduler.recordRequest(task.domain);

      // 6. Process result
      await this.handleResult(task, result);

      // 7. Record metrics
      this.metrics.recordFetch(task.domain, result.statusCode, Date.now() - startTime);
    } finally {
      this.activeTasks.delete(taskId);
    }
  }

  private async fetch(task: CrawlTask, ip: string): Promise<CrawlResult> {
    const startTime = Date.now();
    try {
      const response = await this.httpClient.get(task.url, {
        headers: {
          'User-Agent': 'AcmeCrawler/1.0 (+https://acme.com/crawler)',
          'Accept': 'text/html,application/xhtml+xml',
          'Accept-Encoding': 'gzip, deflate, br',
        },
        resolvedIP: ip,
        timeout: 30000,
        maxSize: 10 * 1024 * 1024, // 10MB limit
      });

      return {
        url: task.url,
        statusCode: response.statusCode,
        content: response.body,
        headers: response.headers,
        error: null,
        fetchTimeMs: Date.now() - startTime,
        contentLength: response.body?.length || 0
      };
    } catch (err: any) {
      return {
        url: task.url,
        statusCode: 0,
        content: null,
        headers: {},
        error: err.message,
        fetchTimeMs: Date.now() - startTime,
        contentLength: 0
      };
    }
  }

  private async handleResult(task: CrawlTask, result: CrawlResult): Promise<void> {
    if (result.error) {
      await this.handleRetry(task, result.error);
      return;
    }

    switch (true) {
      case result.statusCode >= 200 && result.statusCode < 300:
        // Success - store content and extract links
        await this.contentStore.store(task.url, result.content!, result.headers);
        await this.frontierClient.markComplete(task.url, 'success');
        break;

      case result.statusCode >= 300 && result.statusCode < 400:
        // Redirect - add redirect target to frontier
        const location = result.headers['location'];
        if (location) {
          await this.frontierClient.addURL(location, task.priority, task.url);
        }
        await this.frontierClient.markComplete(task.url, 'redirect');
        break;

      case result.statusCode === 429:
        // Rate limited - increase backoff
        this.politenessScheduler.increaseBackoff(task.domain);
        await this.handleRetry(task, 'rate_limited');
        break;

      case result.statusCode >= 500:
        // Server error - retry with backoff
        await this.handleRetry(task, `server_error_${result.statusCode}`);
        break;

      default:
        // Client error (4xx except 429) - don't retry
        await this.frontierClient.markComplete(
          task.url,
          `client_error_${result.statusCode}`
        );
    }
  }

  private async handleRetry(task: CrawlTask, reason: string): Promise<void> {
    if (task.attempt >= task.maxAttempts) {
      await this.frontierClient.markComplete(task.url, `failed_${reason}`);
      return;
    }

    // Exponential backoff for retry
    const delay = Math.pow(2, task.attempt) * 1000; // 1s, 2s, 4s, 8s...
    await this.frontierClient.requeueForRetry(task.url, {
      ...task,
      attempt: task.attempt + 1
    }, delay);
  }
}
```

HTTP connection pooling is critical for performance. Creating new TCP connections for each request adds significant latency (~100-300ms for handshake). Use keep-alive connections and pool them by domain/IP. At high throughput, connection limits to individual IPs also become important—don't exhaust the server's ability to accept connections.
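In Node.js, the connection pooling described above is handled by the built-in `Agent`. A minimal sketch (the limits shown are illustrative and should be tuned per deployment):

```typescript
import https from 'https';

// A keep-alive agent that pools and reuses TCP connections.
// maxSockets and maxFreeSockets are enforced per origin (host:port),
// which doubles as a crude per-server connection limit.
const crawlAgent = new https.Agent({
  keepAlive: true,   // Reuse sockets instead of closing after each request
  maxSockets: 50,    // Cap concurrent sockets per origin
  maxFreeSockets: 10 // Idle sockets kept warm per origin
});

// Pass the agent on each request so the pool is shared:
// https.get('https://example.com/', { agent: crawlAgent }, res => { /* ... */ });
console.log(crawlAgent.maxSockets); // 50
```

Sharing one agent across all fetch loops in a worker keeps the socket count bounded even when thousands of tasks are in flight.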
Distributed workers need coordination for several purposes: rate limiting, work assignment, health monitoring, and configuration updates. The challenge is providing necessary coordination without creating bottlenecks.
Coordination requirements:
| Need | Frequency | Consistency | Solution |
|---|---|---|---|
| Domain → worker mapping | On demand | Eventual | Consistent hash ring (local) |
| Per-domain rate limiting | Per request | Best effort | Domain affinity (no coordination) |
| Worker health | Every 10-30s | Eventual | Heartbeats to coordinator |
| Work queue state | Per request | Strong | Distributed queue (Redis/Kafka) |
| Configuration updates | Rare | Eventual | Config service with polling |
| Metrics aggregation | Every 10-60s | Eventual | Push to metrics service |
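The "domain affinity (no coordination)" row is worth making concrete: because all URLs for a domain route to one worker, per-domain rate limiting needs only worker-local state. Below is a sketch of such a scheduler; the class name and the 5-minute backoff cap are illustrative assumptions, but the interface mirrors what the worker code on this page expects (`canCrawl`, `getWaitTime`, `recordRequest`, `increaseBackoff`):

```typescript
// Per-domain politeness state, kept entirely worker-local.
// No cross-worker coordination is needed because a domain
// maps to exactly one worker.
class LocalPolitenessScheduler {
  private lastRequest = new Map<string, number>(); // domain -> timestamp (ms)
  private delayMs = new Map<string, number>();     // domain -> current delay
  private baseDelayMs: number;

  constructor(baseDelayMs: number = 1000) {
    this.baseDelayMs = baseDelayMs;
  }

  /** Milliseconds to wait before the next request to this domain (0 = go now). */
  getWaitTime(domain: string): number {
    const last = this.lastRequest.get(domain);
    if (last === undefined) return 0;
    const delay = this.delayMs.get(domain) ?? this.baseDelayMs;
    return Math.max(0, last + delay - Date.now());
  }

  canCrawl(domain: string): boolean {
    return this.getWaitTime(domain) === 0;
  }

  /** Record a completed request; restarts the cooldown clock. */
  recordRequest(domain: string): void {
    this.lastRequest.set(domain, Date.now());
  }

  /** Double the delay for a domain (e.g. after HTTP 429), capped at 5 minutes. */
  increaseBackoff(domain: string): void {
    const current = this.delayMs.get(domain) ?? this.baseDelayMs;
    this.delayMs.set(domain, Math.min(current * 2, 5 * 60 * 1000));
  }
}
```

If domain affinity ever breaks (e.g. during work stealing), this local scheme no longer suffices and the delay state must move to a shared store.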
```typescript
interface WorkerRegistration {
  workerId: string;
  host: string;
  port: number;
  assignedPartitions: number[];
  registeredAt: Date;
  lastHeartbeat: Date;
  status: 'active' | 'draining' | 'dead';
  metrics: WorkerMetrics;
}

class CrawlCoordinator {
  private workers: Map<string, WorkerRegistration> = new Map();
  private partitionAssignment: Map<number, string> = new Map();
  private hashRing: ConsistentHashRing;
  private numPartitions: number;
  private heartbeatTimeoutMs: number = 30000;

  constructor(numPartitions: number = 256) {
    this.numPartitions = numPartitions;
    this.hashRing = new ConsistentHashRing();

    // Start background tasks
    this.startHealthChecker();
    this.startMetricsAggregator();
  }

  /**
   * Handle worker registration
   */
  public registerWorker(
    workerId: string,
    host: string,
    port: number
  ): { partitions: number[] } {
    // Add to hash ring
    this.hashRing.addWorker(workerId);

    // Compute partition assignments
    const partitions = this.computePartitions(workerId);

    const registration: WorkerRegistration = {
      workerId,
      host,
      port,
      assignedPartitions: partitions,
      registeredAt: new Date(),
      lastHeartbeat: new Date(),
      status: 'active',
      metrics: { fetched: 0, errors: 0, bytesDownloaded: 0 }
    };

    this.workers.set(workerId, registration);

    // Update partition assignment
    for (const partition of partitions) {
      this.partitionAssignment.set(partition, workerId);
    }

    console.log(`Registered worker ${workerId} with partitions ${partitions.join(',')}`);
    return { partitions };
  }

  /**
   * Handle worker heartbeat
   */
  public heartbeat(workerId: string, metrics: WorkerMetrics): void {
    const worker = this.workers.get(workerId);
    if (worker) {
      worker.lastHeartbeat = new Date();
      worker.metrics = metrics;
    }
  }

  /**
   * Handle worker deregistration (graceful shutdown)
   */
  public deregisterWorker(workerId: string): void {
    const worker = this.workers.get(workerId);
    if (!worker) return;

    worker.status = 'draining';
    // Wait for in-flight work to complete,
    // then remove from ring and redistribute
    setTimeout(() => this.removeWorker(workerId), 30000);
  }

  private removeWorker(workerId: string): void {
    const worker = this.workers.get(workerId);
    if (!worker) return;

    this.hashRing.removeWorker(workerId);
    this.workers.delete(workerId);

    // Redistribute partitions to remaining workers
    this.redistributePartitions(worker.assignedPartitions);
    console.log(`Removed worker ${workerId}`);
  }

  private computePartitions(workerId: string): number[] {
    // For consistent hashing, each worker gets a set of partitions
    // based on where their virtual nodes land on the ring
    const partitions: number[] = [];
    for (let p = 0; p < this.numPartitions; p++) {
      // Simulate a domain in this partition
      const representative = `partition-${p}-representative`;
      if (this.hashRing.getWorker(representative) === workerId) {
        partitions.push(p);
      }
    }
    return partitions;
  }

  private redistributePartitions(partitions: number[]): void {
    // Assign orphaned partitions to remaining workers
    for (const partition of partitions) {
      const representative = `partition-${partition}-representative`;
      const newOwner = this.hashRing.getWorker(representative);
      if (newOwner) {
        this.partitionAssignment.set(partition, newOwner);
        const worker = this.workers.get(newOwner);
        if (worker) {
          worker.assignedPartitions.push(partition);
        }
      }
    }
  }

  private startHealthChecker(): void {
    setInterval(() => {
      const now = Date.now();
      for (const [workerId, worker] of this.workers) {
        if (worker.status !== 'active') continue;

        const timeSinceHeartbeat = now - worker.lastHeartbeat.getTime();
        if (timeSinceHeartbeat > this.heartbeatTimeoutMs) {
          console.warn(`Worker ${workerId} missed heartbeat, marking dead`);
          worker.status = 'dead';
          this.removeWorker(workerId);
        }
      }
    }, 10000);
  }

  private startMetricsAggregator(): void {
    setInterval(() => {
      let totalFetched = 0;
      let totalErrors = 0;
      let totalBytes = 0;

      for (const worker of this.workers.values()) {
        if (worker.status === 'active') {
          totalFetched += worker.metrics.fetched;
          totalErrors += worker.metrics.errors;
          totalBytes += worker.metrics.bytesDownloaded;
        }
      }

      console.log(`Cluster metrics: ${totalFetched} fetched, ${totalErrors} errors, ${(totalBytes / 1024 / 1024).toFixed(2)} MB`);
    }, 60000);
  }
}
```

In a distributed system with hundreds of nodes, failures are not exceptional—they're expected. Crawlers must handle failures gracefully without losing work or degrading quality.
Types of failures: worker crashes (in-flight tasks are lost), hung or slow tasks, network partitions between workers and shared services, and transient fetch failures. The frontier must guarantee that claimed-but-unfinished work is eventually retried:
```typescript
interface InFlightTask {
  task: CrawlTask;
  workerId: string;
  assignedAt: Date;
  deadline: Date;
}

class FaultTolerantFrontier {
  private pending: PriorityQueue<CrawlTask>;   // Tasks waiting to be claimed
  private inFlight: Map<string, InFlightTask>; // Tasks currently being processed
  private completed: Set<string>;              // URLs that have been completed
  private failed: Map<string, number>;         // URL -> failure count
  private visibilityTimeout: number = 60000;   // 60 seconds to complete
  private maxRetries: number = 3;

  constructor() {
    this.pending = new PriorityQueue();
    this.inFlight = new Map();
    this.completed = new Set();
    this.failed = new Map();

    // Start timeout checker
    this.startTimeoutChecker();
  }

  /**
   * Add a URL to the frontier
   */
  public addURL(url: string, priority: number): boolean {
    // Skip if already completed
    if (this.completed.has(url)) {
      return false;
    }
    // Skip if max retries exceeded
    if ((this.failed.get(url) || 0) >= this.maxRetries) {
      return false;
    }
    // Skip if already in flight
    if (this.inFlight.has(url)) {
      return false;
    }

    const task: CrawlTask = {
      url,
      domain: new URL(url).hostname,
      priority,
      attempt: this.failed.get(url) || 0,
      maxAttempts: this.maxRetries
    };
    this.pending.enqueue(task, priority);
    return true;
  }

  /**
   * Get next task for a worker
   * Task becomes invisible to other workers until completed or timeout
   */
  public getNext(workerId: string): CrawlTask | null {
    const task = this.pending.dequeue();
    if (!task) return null;

    const inFlightTask: InFlightTask = {
      task,
      workerId,
      assignedAt: new Date(),
      deadline: new Date(Date.now() + this.visibilityTimeout)
    };
    this.inFlight.set(task.url, inFlightTask);
    return task;
  }

  /**
   * Mark task as completed
   */
  public complete(url: string, status: 'success' | 'skip'): void {
    this.inFlight.delete(url);
    this.completed.add(url);
    this.failed.delete(url);
  }

  /**
   * Mark task as failed (will be retried)
   */
  public fail(url: string): void {
    const inFlightTask = this.inFlight.get(url);
    if (!inFlightTask) return;

    this.inFlight.delete(url);
    const failCount = (this.failed.get(url) || 0) + 1;
    this.failed.set(url, failCount);

    if (failCount < this.maxRetries) {
      // Requeue with exponential backoff delay
      const delay = Math.pow(2, failCount) * 1000;
      setTimeout(() => {
        this.addURL(url, inFlightTask.task.priority);
      }, delay);
    }
  }

  /**
   * Extend visibility timeout (for long-running tasks)
   */
  public extendTimeout(url: string, extensionMs: number): boolean {
    const inFlightTask = this.inFlight.get(url);
    if (!inFlightTask) return false;

    inFlightTask.deadline = new Date(
      inFlightTask.deadline.getTime() + extensionMs
    );
    return true;
  }

  /**
   * Check for timed-out tasks and requeue them
   */
  private startTimeoutChecker(): void {
    setInterval(() => {
      const now = Date.now();
      for (const [url, inFlightTask] of this.inFlight) {
        if (now > inFlightTask.deadline.getTime()) {
          console.warn(
            `Task ${url} timed out (worker: ${inFlightTask.workerId})`
          );
          // Treat timeout as failure
          this.fail(url);
        }
      }
    }, 10000);
  }

  /**
   * Handle worker failure - requeue all their in-flight tasks
   */
  public handleWorkerFailure(workerId: string): void {
    for (const [url, inFlightTask] of this.inFlight) {
      if (inFlightTask.workerId === workerId) {
        console.log(`Requeuing task ${url} from failed worker ${workerId}`);
        this.fail(url);
      }
    }
  }
}
```

Crawlers typically use at-least-once semantics for URL processing. It's acceptable (if wasteful) to crawl a URL twice; it's unacceptable to miss a URL. The visibility timeout pattern ensures failed workers' tasks are eventually processed, at the cost of occasional duplicates. Content deduplication catches these.
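The content deduplication backstop mentioned above can be as simple as hashing each fetched body before storage. A minimal sketch (the class name is illustrative; production systems would persist the digest set rather than keep it in memory):

```typescript
import crypto from 'crypto';

// Exact-duplicate detection by content hash: a second fetch of the same
// bytes (e.g. after an at-least-once redelivery) is discarded before it
// reaches the content store.
class ContentDeduplicator {
  private seen = new Set<string>(); // SHA-256 digests of stored content

  /** Returns true if this content is new and should be stored. */
  shouldStore(content: Buffer): boolean {
    const digest = crypto.createHash('sha256').update(content).digest('hex');
    if (this.seen.has(digest)) return false;
    this.seen.add(digest);
    return true;
  }
}

const dedup = new ContentDeduplicator();
console.log(dedup.shouldStore(Buffer.from('<html>hello</html>'))); // true
console.log(dedup.shouldStore(Buffer.from('<html>hello</html>'))); // false
```

This catches only byte-identical duplicates; near-duplicate detection (e.g. SimHash) is a separate mechanism.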
As crawl requirements grow, the system must scale horizontally. But scaling introduces new challenges:
Scaling dimensions:
| Component | Scaling Strategy | Bottleneck | Solution |
|---|---|---|---|
| Crawler Workers | Add more instances | Network bandwidth | Distribute across multiple DCs |
| URL Frontier | Partition by domain hash | Write throughput | More partitions; faster storage |
| DNS Service | Replicas with local cache | Upstream DNS limits | Larger caches; multiple upstreams |
| Robots.txt Service | Replicas with cache | Per-domain freshness | Longer TTLs; lazy refresh |
| Content Store | Object storage (S3/GCS) | Write IOPS | Batch writes; concurrent uploads |
| Link Extractor | Parallel consumers | CPU for parsing | More instances; faster parsers |
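The "batch writes" remedy in the Content Store row can be sketched as a small write buffer that flushes when either an item count or a byte threshold is reached. `BatchingWriter` and `uploadBatch` are illustrative names; `uploadBatch` stands in for the real object-store client:

```typescript
type Page = { url: string; body: Buffer };

// Buffer individual page writes and flush them as one batched upload,
// trading a little latency for far fewer write operations (IOPS).
class BatchingWriter {
  private buffer: Page[] = [];
  private bufferedBytes = 0;

  constructor(
    private maxItems: number,
    private maxBytes: number,
    private uploadBatch: (pages: Page[]) => Promise<void>
  ) {}

  async write(page: Page): Promise<void> {
    this.buffer.push(page);
    this.bufferedBytes += page.body.length;
    if (this.buffer.length >= this.maxItems || this.bufferedBytes >= this.maxBytes) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.bufferedBytes = 0;
    await this.uploadBatch(batch);
  }
}
```

A production version would also flush on a timer so a slow trickle of pages never sits in the buffer indefinitely.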
Multi-region architecture for global crawling:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ MULTI-REGION CRAWLER DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────┐ ┌───────────────────────────┐ │
│ │ US-WEST REGION │ │ EU-WEST REGION │ │
│ │ │ │ │ │
│ │ Workers (US-assigned │ │ Workers (EU-assigned │ │
│ │ domains: *.com, *.us) │ │ domains: *.eu, *.co.uk) │ │
│ │ │ │ │ │
│ │ Local DNS resolvers │ │ Local DNS resolvers │ │
│ │ Local robots.txt cache │ │ Local robots.txt cache │ │
│ │ │ │ │
│ └─────────────┬─────────────┘ └─────────────┬─────────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ GLOBAL SERVICES │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Global URL │ │ Global Content │ │ Global │ │ │
│ │ │ Frontier │ │ Store (S3) │ │ Coordinator │ │ │
│ │ │ (cross-region │ │ │ │ │ │ │
│ │ │ replication) │ │ │ │ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Lower latency to target sites (geographically close) │
│ • Higher aggregate bandwidth (multiple data center pipes) │
│ • Better fault isolation (region failure doesn't halt crawl) │
│ • Compliance with data residency requirements (crawl EU from EU) │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
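The region assignment shown in the diagram (`*.com`, `*.us` to US-WEST; `*.eu`, `*.co.uk` to EU-WEST) can be implemented as a suffix lookup. A sketch, with an illustrative mapping (real deployments would use a public-suffix-aware list and likely GeoDNS signals):

```typescript
// Route a domain to a crawl region by its suffix, with a default fallback.
const REGION_BY_SUFFIX: Record<string, string> = {
  '.eu': 'eu-west',
  '.co.uk': 'eu-west',
  '.de': 'eu-west',
  '.us': 'us-west',
  '.com': 'us-west',
};

function regionForDomain(domain: string, fallback: string = 'us-west'): string {
  // Check longer suffixes first so '.co.uk' wins over shorter matches
  const suffixes = Object.keys(REGION_BY_SUFFIX).sort((a, b) => b.length - a.length);
  for (const suffix of suffixes) {
    if (domain.endsWith(suffix)) return REGION_BY_SUFFIX[suffix];
  }
  return fallback;
}

console.log(regionForDomain('news.bbc.co.uk')); // eu-west
console.log(regionForDomain('example.com'));    // us-west
```

Region routing composes with consistent hashing: first pick the region by suffix, then pick the worker within that region by domain hash.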
Running a distributed crawler in production requires robust operational practices. These concerns often get less attention than the algorithms but are equally critical for reliable operation.
```typescript
// Key metrics to expose from every component

interface CrawlerMetrics {
  // Worker metrics
  worker: {
    pages_fetched_total: Counter;
    bytes_downloaded_total: Counter;
    fetch_duration_seconds: Histogram;
    active_tasks: Gauge;
    errors_by_type: Counter<{ error_type: string }>;
    http_responses: Counter<{ status_code: number }>;
  };

  // Frontier metrics
  frontier: {
    queue_size: Gauge<{ partition: number }>;
    enqueue_rate: Counter;
    dequeue_rate: Counter;
    duplicate_urls_rejected: Counter;
    urls_in_flight: Gauge;
    retry_queue_size: Gauge;
  };

  // Politeness metrics
  politeness: {
    domains_in_backoff: Gauge;
    robots_blocked_total: Counter;
    rate_limit_delays_total: Counter;
    crawl_delay_seconds: Histogram<{ domain: string }>;
  };

  // Processing metrics
  processing: {
    links_extracted_total: Counter;
    content_stored_total: Counter;
    content_store_bytes: Counter;
    near_duplicates_detected: Counter;
  };

  // System metrics
  system: {
    active_workers: Gauge;
    partition_assignments: Gauge<{ worker: string }>;
    coordinator_heartbeat_lag: Gauge;
  };
}

// Example Prometheus query for throughput
// rate(pages_fetched_total[5m])

// Example alert rules
const alerts = {
  ThroughputDrop: {
    expr: 'rate(pages_fetched_total[5m]) < 500',
    for: '5m',
    labels: { severity: 'warning' },
    annotations: { summary: 'Crawler throughput has dropped below 500 pages/sec' }
  },
  WorkerDown: {
    expr: 'active_workers < 10',
    for: '2m',
    labels: { severity: 'critical' },
    annotations: { summary: 'Fewer than 10 crawler workers are active' }
  },
  HighErrorRate: {
    expr: 'rate(errors_by_type[5m]) / rate(pages_fetched_total[5m]) > 0.05',
    for: '5m',
    labels: { severity: 'warning' },
    annotations: { summary: 'Error rate exceeds 5%' }
  }
};
```

Design the crawler to degrade gracefully under stress. If the frontier is overloaded, slow down URL discovery rather than losing URLs. If content storage is slow, buffer and batch writes. If a region is having network issues, reduce its crawl rate. The goal is continuous, steady progress rather than all-or-nothing operation.
We've explored the architectural patterns and operational practices that enable distributed crawling at scale—from work distribution to fault tolerance.
What's Next:
The final page explores Content Extraction—how to parse fetched pages, extract meaningful content, identify links for further crawling, and handle the diversity of content types and formats on the modern web.
You now understand how to architect a distributed web crawler that scales to thousands of pages per second across hundreds of machines. The key insight: distribution is not just about parallelism—it's about careful partitioning that maintains correctness (politeness, deduplication) while enabling horizontal scaling.