A single machine, no matter how powerful, cannot crawl the web. With 50+ billion pages, petabytes of content, and the need to process thousands of pages per second, web crawling is inherently a distributed systems problem.
But distributing a crawler isn't simply a matter of running multiple instances. The challenges compound: politeness must now be enforced across machines, work must be partitioned without duplication, and component failures become routine rather than exceptional.
This page explores the architectural patterns that make distributed crawling possible at scale.
By the end of this page, you will understand: (1) Distributed crawler architecture patterns, (2) Work distribution and partitioning strategies, (3) Worker design and lifecycle management, (4) Coordination mechanisms for distributed politeness, (5) Fault tolerance and failure recovery, and (6) Operational concerns including monitoring and scaling.
A production distributed crawler consists of several interacting components, each scaled independently to meet demand.
Core components: a coordinator service, a partitioned URL frontier, stateless crawler workers, a processing pipeline (link extraction, normalization, deduplication), and shared services for DNS, robots.txt, and content storage.
Complete distributed architecture:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED WEB CRAWLER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ COORDINATOR SERVICE │ │
│ │ │ │
│ │ • Campaign management (start/stop/pause crawls) │ │
│ │ • Worker registration and health monitoring │ │
│ │ • Global rate limiting and quota management │ │
│ │ • Metrics aggregation and alerting │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Coordination │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ URL FRONTIER (Distributed) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Partition 0 │ │ Partition 1 │ │ Partition N │ │ │
│ │ │ (domains │ │ (domains │ │ (domains │ │ │
│ │ │ hash % N=0) │ │ hash % N=1) │ │ hash%N=N-1) │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ │ Fetch tasks │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CRAWLER WORKERS │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ Fetch loop │ │ Fetch loop │ │ Fetch loop │ │ Fetch loop │ │ │
│ │ │ DNS cache │ │ DNS cache │ │ DNS cache │ │ DNS cache │ │ │
│ │ │ HTTP client │ │ HTTP client │ │ HTTP client │ │ HTTP client │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Fetched content │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ PROCESSING PIPELINE │ │
│ │ │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Message │───▶│ Link │───▶│ URL │───▶│ Content │ │ │
│ │ │ Queue │ │ Extraction │ │ Normalizer │ │ Store │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │ │
│ │ │ │ │
│ │ │ Discovered URLs │ │
│ │ ▼ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Deduplication │──────▶ URL Frontier │ │
│ │ └────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SHARED SERVICES │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐ │ │
│ │ │ DNS Resolver │ │ Robots.txt │ │ Content Store │ │ │
│ │ │ Service │ │ Service │ │ (Object Storage/HDFS) │ │ │
│ │ │ (cached, │ │ (cached, │ │ │ │ │
│ │ │ high-perf) │ │ per-domain) │ │ │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Notice how the architecture separates I/O-bound work (fetching) from CPU-bound work (parsing, extraction). This allows independent scaling: add more workers for higher fetch throughput, more processing nodes for extraction bottlenecks. Each component is stateless where possible, with state pushed to shared services (frontier, content store).
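The frontier diagram's "hash % N" partition boxes can be made concrete with a few lines. This is a sketch (the function name is illustrative); it uses MD5 for hashing, matching the consistent-hashing code later on this page:

```typescript
import crypto from 'crypto';

// Map a domain to a frontier partition, as in the "hash % N" boxes above.
// All URLs for one domain land in one partition, preserving domain affinity.
function frontierPartition(domain: string, numPartitions: number): number {
  const digest = crypto.createHash('md5').update(domain).digest();
  // Interpret the first 4 bytes as an unsigned 32-bit integer
  return digest.readUInt32BE(0) % numPartitions;
}

console.log(frontierPartition('example.com', 16)); // Deterministic value in [0, 16)
```

Because the partition depends only on the domain, every URL discovered for `example.com` routes to the same partition regardless of which worker discovered it.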
How should work be divided among crawler workers? The distribution strategy directly impacts politeness, efficiency, and fault tolerance.
Key requirements: all URLs for a domain should route to the same worker (preserving politeness state and robots.txt cache locality), load should stay roughly balanced across workers, and adding or removing a worker should reassign as little work as possible. Consistent hashing with virtual nodes meets all three:
```typescript
import crypto from 'crypto';

class ConsistentHashRing {
  private ring: Map<number, string> = new Map(); // position -> workerId
  private sortedPositions: number[] = [];
  private virtualNodes: number;

  constructor(virtualNodes: number = 150) {
    this.virtualNodes = virtualNodes;
  }

  /**
   * Add a worker to the ring
   */
  public addWorker(workerId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const position = this.hash(`${workerId}:${i}`);
      this.ring.set(position, workerId);
      this.sortedPositions.push(position);
    }
    this.sortedPositions.sort((a, b) => a - b);
  }

  /**
   * Remove a worker from the ring
   */
  public removeWorker(workerId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const position = this.hash(`${workerId}:${i}`);
      this.ring.delete(position);
    }
    this.sortedPositions = this.sortedPositions.filter(
      pos => this.ring.has(pos)
    );
  }

  /**
   * Get the worker responsible for a domain
   */
  public getWorker(domain: string): string | null {
    if (this.sortedPositions.length === 0) {
      return null;
    }

    const domainPosition = this.hash(domain);

    // Find first position >= domainPosition
    let idx = this.binarySearch(domainPosition);

    // Wrap around if necessary
    if (idx >= this.sortedPositions.length) {
      idx = 0;
    }

    return this.ring.get(this.sortedPositions[idx]) || null;
  }

  /**
   * Get all domains that would be reassigned if a worker is added
   * Useful for planning capacity changes
   */
  public getReassignedDomains(
    newWorkerId: string,
    allDomains: string[]
  ): string[] {
    const beforeAssignments = new Map<string, string>();
    for (const domain of allDomains) {
      beforeAssignments.set(domain, this.getWorker(domain) || '');
    }

    this.addWorker(newWorkerId);

    const reassigned: string[] = [];
    for (const domain of allDomains) {
      const afterWorker = this.getWorker(domain);
      if (afterWorker !== beforeAssignments.get(domain)) {
        reassigned.push(domain);
      }
    }

    this.removeWorker(newWorkerId); // Rollback
    return reassigned;
  }

  private hash(key: string): number {
    const hash = crypto.createHash('md5').update(key).digest();
    // Use first 4 bytes as a 32-bit integer
    return hash.readUInt32BE(0);
  }

  private binarySearch(target: number): number {
    let left = 0;
    let right = this.sortedPositions.length;
    while (left < right) {
      const mid = Math.floor((left + right) / 2);
      if (this.sortedPositions[mid] < target) {
        left = mid + 1;
      } else {
        right = mid;
      }
    }
    return left;
  }
}

// Usage example
const ring = new ConsistentHashRing(100);

ring.addWorker('worker-1');
ring.addWorker('worker-2');
ring.addWorker('worker-3');

console.log('example.com ->', ring.getWorker('example.com'));
console.log('google.com ->', ring.getWorker('google.com'));
console.log('amazon.com ->', ring.getWorker('amazon.com'));

// When adding a new worker, only ~1/4 of domains get reassigned
ring.addWorker('worker-4');
console.log('After adding worker-4:');
console.log('example.com ->', ring.getWorker('example.com')); // May or may not change
```

For most production crawlers, consistent hashing provides the foundation (domain affinity for politeness), with limited work stealing for load balancing. Workers primarily process their assigned partitions but can steal from overloaded partitions when idle. This hybrid approach combines the benefits of both strategies.
Crawler workers are the workhorses of the system—they perform the actual HTTP requests. Designing workers for high throughput while minimizing resource usage is critical.
Worker design principles: keep workers stateless (all durable state lives in the frontier and content store), drive many concurrent fetches per worker with asynchronous I/O, bound concurrency to protect memory and sockets, batch frontier requests to amortize coordination cost, and let the frontier own retry logic so a crashed worker loses nothing permanently.
```typescript
interface CrawlTask {
  url: string;
  domain: string;
  priority: number;
  attempt: number;
  maxAttempts: number;
}

interface CrawlResult {
  url: string;
  statusCode: number;
  content: Buffer | null;
  headers: Record<string, string>;
  error: string | null;
  fetchTimeMs: number;
  contentLength: number;
}

class CrawlerWorker {
  private workerId: string;
  private frontierClient: FrontierClient;
  private robotsClient: RobotsClient;
  private dnsClient: DNSClient;
  private contentStore: ContentStore;
  private httpClient: HttpClient;
  private politenessScheduler: PolitenessScheduler;
  private running: boolean = false;
  private activeTasks: Map<string, CrawlTask> = new Map();
  private maxConcurrentTasks: number = 100;
  private metrics: WorkerMetrics;

  constructor(config: WorkerConfig) {
    this.workerId = config.workerId;
    this.frontierClient = new FrontierClient(config.frontierEndpoints);
    this.robotsClient = new RobotsClient(config.robotsEndpoint);
    this.dnsClient = new DNSClient(config.dnsEndpoint);
    this.contentStore = new ContentStore(config.contentStoreEndpoint);
    this.httpClient = new HttpClient({
      timeout: 30000,
      maxConnections: 200,
      keepAlive: true
    });
    this.politenessScheduler = new PolitenessScheduler();
    this.metrics = new WorkerMetrics(this.workerId);
  }

  /**
   * Main worker loop
   */
  public async start(): Promise<void> {
    this.running = true;
    console.log(`Worker ${this.workerId} started`);

    // Start multiple fetch loops to maintain concurrency
    const numFetchers = 10;
    const fetchers = [];
    for (let i = 0; i < numFetchers; i++) {
      fetchers.push(this.fetchLoop(i));
    }

    // Wait for all fetchers (they run until shutdown)
    await Promise.all(fetchers);
    console.log(`Worker ${this.workerId} stopped`);
  }

  public async stop(): Promise<void> {
    this.running = false;
    // Wait for active tasks to complete
    while (this.activeTasks.size > 0) {
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }

  private async fetchLoop(fetcherId: number): Promise<void> {
    while (this.running) {
      try {
        // Check if we have capacity
        if (this.activeTasks.size >= this.maxConcurrentTasks) {
          await new Promise(resolve => setTimeout(resolve, 50));
          continue;
        }

        // Get next task from frontier
        const task = await this.getNextTask();
        if (!task) {
          // No work available, back off
          await new Promise(resolve => setTimeout(resolve, 100));
          continue;
        }

        // Process task (don't await - run concurrently)
        this.processTask(task).catch(err => {
          console.error(`Error processing ${task.url}:`, err);
          this.metrics.recordError(task.domain, 'processing_error');
        });
      } catch (err) {
        console.error(`Fetcher ${fetcherId} error:`, err);
        await new Promise(resolve => setTimeout(resolve, 1000));
      }
    }
  }

  private async getNextTask(): Promise<CrawlTask | null> {
    // Request batch of tasks from frontier for efficiency
    const tasks = await this.frontierClient.getNextURLs(
      this.workerId,
      10 // Batch size
    );

    // Find first task we can crawl now (respecting politeness)
    for (const task of tasks) {
      const canCrawl = await this.politenessScheduler.canCrawl(task.domain);
      if (canCrawl) {
        // Return remaining tasks to frontier
        const remaining = tasks.filter(t => t !== task);
        if (remaining.length > 0) {
          await this.frontierClient.returnTasks(remaining);
        }
        return task;
      }
    }

    // No tasks are ready now (all domains on cooldown)
    if (tasks.length > 0) {
      await this.frontierClient.returnTasks(tasks);
    }
    return null;
  }

  private async processTask(task: CrawlTask): Promise<void> {
    const taskId = `${task.url}-${Date.now()}`;
    this.activeTasks.set(taskId, task);
    const startTime = Date.now();

    try {
      // 1. Check robots.txt
      const robotsAllowed = await this.robotsClient.isAllowed(
        task.domain,
        new URL(task.url).pathname
      );
      if (!robotsAllowed) {
        this.metrics.recordRobotsBlocked(task.domain);
        await this.frontierClient.markComplete(task.url, 'robots_blocked');
        return;
      }

      // 2. Resolve DNS
      const ip = await this.dnsClient.resolve(task.domain);
      if (!ip) {
        this.metrics.recordError(task.domain, 'dns_failure');
        await this.handleRetry(task, 'dns_failure');
        return;
      }

      // 3. Wait for politeness (if needed)
      const waitTime = await this.politenessScheduler.getWaitTime(task.domain);
      if (waitTime > 0) {
        await new Promise(resolve => setTimeout(resolve, waitTime));
      }

      // 4. Fetch the page
      const result = await this.fetch(task, ip);

      // 5. Record politeness (must happen after fetch)
      await this.politenessScheduler.recordRequest(task.domain);

      // 6. Process result
      await this.handleResult(task, result);

      // 7. Record metrics
      this.metrics.recordFetch(task.domain, result.statusCode, Date.now() - startTime);
    } finally {
      this.activeTasks.delete(taskId);
    }
  }

  private async fetch(task: CrawlTask, ip: string): Promise<CrawlResult> {
    const startTime = Date.now();
    try {
      const response = await this.httpClient.get(task.url, {
        headers: {
          'User-Agent': 'AcmeCrawler/1.0 (+https://acme.com/crawler)',
          'Accept': 'text/html,application/xhtml+xml',
          'Accept-Encoding': 'gzip, deflate, br',
        },
        resolvedIP: ip,
        timeout: 30000,
        maxSize: 10 * 1024 * 1024, // 10MB limit
      });

      return {
        url: task.url,
        statusCode: response.statusCode,
        content: response.body,
        headers: response.headers,
        error: null,
        fetchTimeMs: Date.now() - startTime,
        contentLength: response.body?.length || 0
      };
    } catch (err: any) {
      return {
        url: task.url,
        statusCode: 0,
        content: null,
        headers: {},
        error: err.message,
        fetchTimeMs: Date.now() - startTime,
        contentLength: 0
      };
    }
  }

  private async handleResult(task: CrawlTask, result: CrawlResult): Promise<void> {
    if (result.error) {
      await this.handleRetry(task, result.error);
      return;
    }

    switch (true) {
      case result.statusCode >= 200 && result.statusCode < 300:
        // Success - store content and extract links
        await this.contentStore.store(task.url, result.content!, result.headers);
        await this.frontierClient.markComplete(task.url, 'success');
        break;

      case result.statusCode >= 300 && result.statusCode < 400:
        // Redirect - add redirect target to frontier
        const location = result.headers['location'];
        if (location) {
          await this.frontierClient.addURL(location, task.priority, task.url);
        }
        await this.frontierClient.markComplete(task.url, 'redirect');
        break;

      case result.statusCode === 429:
        // Rate limited - increase backoff
        this.politenessScheduler.increaseBackoff(task.domain);
        await this.handleRetry(task, 'rate_limited');
        break;

      case result.statusCode >= 500:
        // Server error - retry with backoff
        await this.handleRetry(task, `server_error_${result.statusCode}`);
        break;

      default:
        // Client error (4xx except 429) - don't retry
        await this.frontierClient.markComplete(
          task.url,
          `client_error_${result.statusCode}`
        );
    }
  }

  private async handleRetry(task: CrawlTask, reason: string): Promise<void> {
    if (task.attempt >= task.maxAttempts) {
      await this.frontierClient.markComplete(task.url, `failed_${reason}`);
      return;
    }

    // Exponential backoff for retry
    const delay = Math.pow(2, task.attempt) * 1000; // 1s, 2s, 4s, 8s...
    await this.frontierClient.requeueForRetry(task.url, {
      ...task,
      attempt: task.attempt + 1
    }, delay);
  }
}
```

HTTP connection pooling is critical for performance. Creating new TCP connections for each request adds significant latency (~100-300ms for handshake). Use keep-alive connections and pool them by domain/IP. At high throughput, connection limits to individual IPs also become important—don't exhaust the server's ability to accept connections.
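In Node.js, the connection pooling described above is handled by the built-in `Agent`. A minimal sketch (the limits shown are illustrative and should be tuned per deployment):

```typescript
import https from 'https';

// A keep-alive agent that pools and reuses TCP connections.
// maxSockets and maxFreeSockets are enforced per origin (host:port),
// which doubles as a crude per-server connection limit.
const crawlAgent = new https.Agent({
  keepAlive: true,   // Reuse sockets instead of closing after each request
  maxSockets: 50,    // Cap concurrent sockets per origin
  maxFreeSockets: 10 // Idle sockets kept warm per origin
});

// Pass the agent on each request so the pool is shared:
// https.get('https://example.com/', { agent: crawlAgent }, res => { /* ... */ });
console.log(crawlAgent.maxSockets); // 50
```

Sharing one agent across all fetch loops in a worker keeps the socket count bounded even when thousands of tasks are in flight.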
Distributed workers need coordination for several purposes: rate limiting, work assignment, health monitoring, and configuration updates. The challenge is providing necessary coordination without creating bottlenecks.
Coordination requirements:
| Need | Frequency | Consistency | Solution |
|---|---|---|---|
| Domain → worker mapping | On demand | Eventual | Consistent hash ring (local) |
| Per-domain rate limiting | Per request | Best effort | Domain affinity (no coordination) |
| Worker health | Every 10-30s | Eventual | Heartbeats to coordinator |
| Work queue state | Per request | Strong | Distributed queue (Redis/Kafka) |
| Configuration updates | Rare | Eventual | Config service with polling |
| Metrics aggregation | Every 10-60s | Eventual | Push to metrics service |
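The "domain affinity (no coordination)" row is worth making concrete: because all URLs for a domain route to one worker, per-domain rate limiting needs only worker-local state. Below is a sketch of such a scheduler; the class name and the 5-minute backoff cap are illustrative assumptions, but the interface mirrors what the worker code on this page expects (`canCrawl`, `getWaitTime`, `recordRequest`, `increaseBackoff`):

```typescript
// Per-domain politeness state, kept entirely worker-local.
// No cross-worker coordination is needed because a domain
// maps to exactly one worker.
class LocalPolitenessScheduler {
  private lastRequest = new Map<string, number>(); // domain -> timestamp (ms)
  private delayMs = new Map<string, number>();     // domain -> current delay
  private baseDelayMs: number;

  constructor(baseDelayMs: number = 1000) {
    this.baseDelayMs = baseDelayMs;
  }

  /** Milliseconds to wait before the next request to this domain (0 = go now). */
  getWaitTime(domain: string): number {
    const last = this.lastRequest.get(domain);
    if (last === undefined) return 0;
    const delay = this.delayMs.get(domain) ?? this.baseDelayMs;
    return Math.max(0, last + delay - Date.now());
  }

  canCrawl(domain: string): boolean {
    return this.getWaitTime(domain) === 0;
  }

  /** Record a completed request; restarts the cooldown clock. */
  recordRequest(domain: string): void {
    this.lastRequest.set(domain, Date.now());
  }

  /** Double the delay for a domain (e.g. after HTTP 429), capped at 5 minutes. */
  increaseBackoff(domain: string): void {
    const current = this.delayMs.get(domain) ?? this.baseDelayMs;
    this.delayMs.set(domain, Math.min(current * 2, 5 * 60 * 1000));
  }
}
```

If domain affinity ever breaks (e.g. during work stealing), this local scheme no longer suffices and the delay state must move to a shared store.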
```typescript
interface WorkerRegistration {
  workerId: string;
  host: string;
  port: number;
  assignedPartitions: number[];
  registeredAt: Date;
  lastHeartbeat: Date;
  status: 'active' | 'draining' | 'dead';
  metrics: WorkerMetrics;
}

class CrawlCoordinator {
  private workers: Map<string, WorkerRegistration> = new Map();
  private partitionAssignment: Map<number, string> = new Map();
  private hashRing: ConsistentHashRing;
  private numPartitions: number;
  private heartbeatTimeoutMs: number = 30000;

  constructor(numPartitions: number = 256) {
    this.numPartitions = numPartitions;
    this.hashRing = new ConsistentHashRing();

    // Start background tasks
    this.startHealthChecker();
    this.startMetricsAggregator();
  }

  /**
   * Handle worker registration
   */
  public registerWorker(
    workerId: string,
    host: string,
    port: number
  ): { partitions: number[] } {
    // Add to hash ring
    this.hashRing.addWorker(workerId);

    // Compute partition assignments
    const partitions = this.computePartitions(workerId);

    const registration: WorkerRegistration = {
      workerId,
      host,
      port,
      assignedPartitions: partitions,
      registeredAt: new Date(),
      lastHeartbeat: new Date(),
      status: 'active',
      metrics: { fetched: 0, errors: 0, bytesDownloaded: 0 }
    };

    this.workers.set(workerId, registration);

    // Update partition assignment
    for (const partition of partitions) {
      this.partitionAssignment.set(partition, workerId);
    }

    console.log(`Registered worker ${workerId} with partitions ${partitions.join(',')}`);
    return { partitions };
  }

  /**
   * Handle worker heartbeat
   */
  public heartbeat(workerId: string, metrics: WorkerMetrics): void {
    const worker = this.workers.get(workerId);
    if (worker) {
      worker.lastHeartbeat = new Date();
      worker.metrics = metrics;
    }
  }

  /**
   * Handle worker deregistration (graceful shutdown)
   */
  public deregisterWorker(workerId: string): void {
    const worker = this.workers.get(workerId);
    if (!worker) return;

    worker.status = 'draining';
    // Wait for in-flight work to complete,
    // then remove from ring and redistribute
    setTimeout(() => this.removeWorker(workerId), 30000);
  }

  private removeWorker(workerId: string): void {
    const worker = this.workers.get(workerId);
    if (!worker) return;

    this.hashRing.removeWorker(workerId);
    this.workers.delete(workerId);

    // Redistribute partitions to remaining workers
    this.redistributePartitions(worker.assignedPartitions);
    console.log(`Removed worker ${workerId}`);
  }

  private computePartitions(workerId: string): number[] {
    // For consistent hashing, each worker gets a set of partitions
    // based on where their virtual nodes land on the ring
    const partitions: number[] = [];
    for (let p = 0; p < this.numPartitions; p++) {
      // Simulate a domain in this partition
      const representative = `partition-${p}-representative`;
      if (this.hashRing.getWorker(representative) === workerId) {
        partitions.push(p);
      }
    }
    return partitions;
  }

  private redistributePartitions(partitions: number[]): void {
    // Assign orphaned partitions to remaining workers
    for (const partition of partitions) {
      const representative = `partition-${partition}-representative`;
      const newOwner = this.hashRing.getWorker(representative);
      if (newOwner) {
        this.partitionAssignment.set(partition, newOwner);
        const worker = this.workers.get(newOwner);
        if (worker) {
          worker.assignedPartitions.push(partition);
        }
      }
    }
  }

  private startHealthChecker(): void {
    setInterval(() => {
      const now = Date.now();
      for (const [workerId, worker] of this.workers) {
        if (worker.status !== 'active') continue;

        const timeSinceHeartbeat = now - worker.lastHeartbeat.getTime();
        if (timeSinceHeartbeat > this.heartbeatTimeoutMs) {
          console.warn(`Worker ${workerId} missed heartbeat, marking dead`);
          worker.status = 'dead';
          this.removeWorker(workerId);
        }
      }
    }, 10000);
  }

  private startMetricsAggregator(): void {
    setInterval(() => {
      let totalFetched = 0;
      let totalErrors = 0;
      let totalBytes = 0;

      for (const worker of this.workers.values()) {
        if (worker.status === 'active') {
          totalFetched += worker.metrics.fetched;
          totalErrors += worker.metrics.errors;
          totalBytes += worker.metrics.bytesDownloaded;
        }
      }

      console.log(`Cluster metrics: ${totalFetched} fetched, ${totalErrors} errors, ${(totalBytes / 1024 / 1024).toFixed(2)} MB`);
    }, 60000);
  }
}
```

In a distributed system with hundreds of nodes, failures are not exceptional—they're expected. Crawlers must handle failures gracefully without losing work or degrading quality.
Types of failures: worker crashes (in-flight tasks are lost), hung or slow tasks, network partitions between workers and shared services, and transient fetch failures. The frontier must guarantee that claimed-but-unfinished work is eventually retried:
```typescript
interface InFlightTask {
  task: CrawlTask;
  workerId: string;
  assignedAt: Date;
  deadline: Date;
}

class FaultTolerantFrontier {
  private pending: PriorityQueue<CrawlTask>;   // Tasks waiting to be claimed
  private inFlight: Map<string, InFlightTask>; // Tasks currently being processed
  private completed: Set<string>;              // URLs that have been completed
  private failed: Map<string, number>;         // URL -> failure count
  private visibilityTimeout: number = 60000;   // 60 seconds to complete
  private maxRetries: number = 3;

  constructor() {
    this.pending = new PriorityQueue();
    this.inFlight = new Map();
    this.completed = new Set();
    this.failed = new Map();

    // Start timeout checker
    this.startTimeoutChecker();
  }

  /**
   * Add a URL to the frontier
   */
  public addURL(url: string, priority: number): boolean {
    // Skip if already completed
    if (this.completed.has(url)) {
      return false;
    }
    // Skip if max retries exceeded
    if ((this.failed.get(url) || 0) >= this.maxRetries) {
      return false;
    }
    // Skip if already in flight
    if (this.inFlight.has(url)) {
      return false;
    }

    const task: CrawlTask = {
      url,
      domain: new URL(url).hostname,
      priority,
      attempt: this.failed.get(url) || 0,
      maxAttempts: this.maxRetries
    };
    this.pending.enqueue(task, priority);
    return true;
  }

  /**
   * Get next task for a worker
   * Task becomes invisible to other workers until completed or timeout
   */
  public getNext(workerId: string): CrawlTask | null {
    const task = this.pending.dequeue();
    if (!task) return null;

    const inFlightTask: InFlightTask = {
      task,
      workerId,
      assignedAt: new Date(),
      deadline: new Date(Date.now() + this.visibilityTimeout)
    };
    this.inFlight.set(task.url, inFlightTask);
    return task;
  }

  /**
   * Mark task as completed
   */
  public complete(url: string, status: 'success' | 'skip'): void {
    this.inFlight.delete(url);
    this.completed.add(url);
    this.failed.delete(url);
  }

  /**
   * Mark task as failed (will be retried)
   */
  public fail(url: string): void {
    const inFlightTask = this.inFlight.get(url);
    if (!inFlightTask) return;

    this.inFlight.delete(url);
    const failCount = (this.failed.get(url) || 0) + 1;
    this.failed.set(url, failCount);

    if (failCount < this.maxRetries) {
      // Requeue with exponential backoff delay
      const delay = Math.pow(2, failCount) * 1000;
      setTimeout(() => {
        this.addURL(url, inFlightTask.task.priority);
      }, delay);
    }
  }

  /**
   * Extend visibility timeout (for long-running tasks)
   */
  public extendTimeout(url: string, extensionMs: number): boolean {
    const inFlightTask = this.inFlight.get(url);
    if (!inFlightTask) return false;

    inFlightTask.deadline = new Date(
      inFlightTask.deadline.getTime() + extensionMs
    );
    return true;
  }

  /**
   * Check for timed-out tasks and requeue them
   */
  private startTimeoutChecker(): void {
    setInterval(() => {
      const now = Date.now();
      for (const [url, inFlightTask] of this.inFlight) {
        if (now > inFlightTask.deadline.getTime()) {
          console.warn(
            `Task ${url} timed out (worker: ${inFlightTask.workerId})`
          );
          // Treat timeout as failure
          this.fail(url);
        }
      }
    }, 10000);
  }

  /**
   * Handle worker failure - requeue all their in-flight tasks
   */
  public handleWorkerFailure(workerId: string): void {
    for (const [url, inFlightTask] of this.inFlight) {
      if (inFlightTask.workerId === workerId) {
        console.log(`Requeuing task ${url} from failed worker ${workerId}`);
        this.fail(url);
      }
    }
  }
}
```

Crawlers typically use at-least-once semantics for URL processing. It's acceptable (if wasteful) to crawl a URL twice; it's unacceptable to miss a URL. The visibility timeout pattern ensures failed workers' tasks are eventually processed, at the cost of occasional duplicates. Content deduplication catches these.
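The content deduplication backstop mentioned above can be as simple as hashing each fetched body before storage. A minimal sketch (the class name is illustrative; production systems would persist the digest set rather than keep it in memory):

```typescript
import crypto from 'crypto';

// Exact-duplicate detection by content hash: a second fetch of the same
// bytes (e.g. after an at-least-once redelivery) is discarded before it
// reaches the content store.
class ContentDeduplicator {
  private seen = new Set<string>(); // SHA-256 digests of stored content

  /** Returns true if this content is new and should be stored. */
  shouldStore(content: Buffer): boolean {
    const digest = crypto.createHash('sha256').update(content).digest('hex');
    if (this.seen.has(digest)) return false;
    this.seen.add(digest);
    return true;
  }
}

const dedup = new ContentDeduplicator();
console.log(dedup.shouldStore(Buffer.from('<html>hello</html>'))); // true
console.log(dedup.shouldStore(Buffer.from('<html>hello</html>'))); // false
```

This catches only byte-identical duplicates; near-duplicate detection (e.g. SimHash) is a separate mechanism.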
As crawl requirements grow, the system must scale horizontally. But scaling introduces new challenges:
Scaling dimensions:
| Component | Scaling Strategy | Bottleneck | Solution |
|---|---|---|---|
| Crawler Workers | Add more instances | Network bandwidth | Distribute across multiple DCs |
| URL Frontier | Partition by domain hash | Write throughput | More partitions; faster storage |
| DNS Service | Replicas with local cache | Upstream DNS limits | Larger caches; multiple upstreams |
| Robots.txt Service | Replicas with cache | Per-domain freshness | Longer TTLs; lazy refresh |
| Content Store | Object storage (S3/GCS) | Write IOPS | Batch writes; concurrent uploads |
| Link Extractor | Parallel consumers | CPU for parsing | More instances; faster parsers |
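The "batch writes" remedy in the Content Store row can be sketched as a small write buffer that flushes when either an item count or a byte threshold is reached. `BatchingWriter` and `uploadBatch` are illustrative names; `uploadBatch` stands in for the real object-store client:

```typescript
type Page = { url: string; body: Buffer };

// Buffer individual page writes and flush them as one batched upload,
// trading a little latency for far fewer write operations (IOPS).
class BatchingWriter {
  private buffer: Page[] = [];
  private bufferedBytes = 0;

  constructor(
    private maxItems: number,
    private maxBytes: number,
    private uploadBatch: (pages: Page[]) => Promise<void>
  ) {}

  async write(page: Page): Promise<void> {
    this.buffer.push(page);
    this.bufferedBytes += page.body.length;
    if (this.buffer.length >= this.maxItems || this.bufferedBytes >= this.maxBytes) {
      await this.flush();
    }
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.bufferedBytes = 0;
    await this.uploadBatch(batch);
  }
}
```

A production version would also flush on a timer so a slow trickle of pages never sits in the buffer indefinitely.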
Multi-region architecture for global crawling:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ MULTI-REGION CRAWLER DEPLOYMENT │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────┐ ┌───────────────────────────┐ │
│ │ US-WEST REGION │ │ EU-WEST REGION │ │
│ │ │ │ │ │
│ │ Workers (US-assigned │ │ Workers (EU-assigned │ │
│ │ domains: *.com, *.us) │ │ domains: *.eu, *.co.uk) │ │
│ │ │ │ │ │
│ │ Local DNS resolvers │ │ Local DNS resolvers │ │
│ │ Local robots.txt cache │ │ Local robots.txt cache │ │
│ │ │ │ │
│ └─────────────┬─────────────┘ └─────────────┬─────────────┘ │
│ │ │ │
│ └───────────────┬───────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ GLOBAL SERVICES │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Global URL │ │ Global Content │ │ Global │ │ │
│ │ │ Frontier │ │ Store (S3) │ │ Coordinator │ │ │
│ │ │ (cross-region │ │ │ │ │ │ │
│ │ │ replication) │ │ │ │ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Benefits: │
│ • Lower latency to target sites (geographically close) │
│ • Higher aggregate bandwidth (multiple data center pipes) │
│ • Better fault isolation (region failure doesn't halt crawl) │
│ • Compliance with data residency requirements (crawl EU from EU) │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
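The region assignment shown in the diagram (`*.com`, `*.us` to US-WEST; `*.eu`, `*.co.uk` to EU-WEST) can be implemented as a suffix lookup. A sketch, with an illustrative mapping (real deployments would use a public-suffix-aware list and likely GeoDNS signals):

```typescript
// Route a domain to a crawl region by its suffix, with a default fallback.
const REGION_BY_SUFFIX: Record<string, string> = {
  '.eu': 'eu-west',
  '.co.uk': 'eu-west',
  '.de': 'eu-west',
  '.us': 'us-west',
  '.com': 'us-west',
};

function regionForDomain(domain: string, fallback: string = 'us-west'): string {
  // Check longer suffixes first so '.co.uk' wins over shorter matches
  const suffixes = Object.keys(REGION_BY_SUFFIX).sort((a, b) => b.length - a.length);
  for (const suffix of suffixes) {
    if (domain.endsWith(suffix)) return REGION_BY_SUFFIX[suffix];
  }
  return fallback;
}

console.log(regionForDomain('news.bbc.co.uk')); // eu-west
console.log(regionForDomain('example.com'));    // us-west
```

Region routing composes with consistent hashing: first pick the region by suffix, then pick the worker within that region by domain hash.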
Running a distributed crawler in production requires robust operational practices. These concerns often get less attention than the algorithms but are equally critical for reliable operation.
```typescript
// Key metrics to expose from every component

interface CrawlerMetrics {
  // Worker metrics
  worker: {
    pages_fetched_total: Counter;
    bytes_downloaded_total: Counter;
    fetch_duration_seconds: Histogram;
    active_tasks: Gauge;
    errors_by_type: Counter<{ error_type: string }>;
    http_responses: Counter<{ status_code: number }>;
  };

  // Frontier metrics
  frontier: {
    queue_size: Gauge<{ partition: number }>;
    enqueue_rate: Counter;
    dequeue_rate: Counter;
    duplicate_urls_rejected: Counter;
    urls_in_flight: Gauge;
    retry_queue_size: Gauge;
  };

  // Politeness metrics
  politeness: {
    domains_in_backoff: Gauge;
    robots_blocked_total: Counter;
    rate_limit_delays_total: Counter;
    crawl_delay_seconds: Histogram<{ domain: string }>;
  };

  // Processing metrics
  processing: {
    links_extracted_total: Counter;
    content_stored_total: Counter;
    content_store_bytes: Counter;
    near_duplicates_detected: Counter;
  };

  // System metrics
  system: {
    active_workers: Gauge;
    partition_assignments: Gauge<{ worker: string }>;
    coordinator_heartbeat_lag: Gauge;
  };
}

// Example Prometheus query for throughput
// rate(pages_fetched_total[5m])

// Example alert rules
const alerts = {
  ThroughputDrop: {
    expr: 'rate(pages_fetched_total[5m]) < 500',
    for: '5m',
    labels: { severity: 'warning' },
    annotations: { summary: 'Crawler throughput has dropped below 500 pages/sec' }
  },
  WorkerDown: {
    expr: 'active_workers < 10',
    for: '2m',
    labels: { severity: 'critical' },
    annotations: { summary: 'Fewer than 10 crawler workers are active' }
  },
  HighErrorRate: {
    expr: 'rate(errors_by_type[5m]) / rate(pages_fetched_total[5m]) > 0.05',
    for: '5m',
    labels: { severity: 'warning' },
    annotations: { summary: 'Error rate exceeds 5%' }
  }
};
```

Design the crawler to degrade gracefully under stress. If the frontier is overloaded, slow down URL discovery rather than losing URLs. If content storage is slow, buffer and batch writes. If a region is having network issues, reduce its crawl rate. The goal is continuous, steady progress rather than all-or-nothing operation.
We've explored the architectural patterns and operational practices that enable distributed crawling at scale—from work distribution to fault tolerance.
What's Next:
The final page explores Content Extraction—how to parse fetched pages, extract meaningful content, identify links for further crawling, and handle the diversity of content types and formats on the modern web.
You now understand how to architect a distributed web crawler that scales to thousands of pages per second across hundreds of machines. The key insight: distribution is not just about parallelism—it's about careful partitioning that maintains correctness (politeness, deduplication) while enabling horizontal scaling.