The web operates on a foundation of implicit trust. When you publish a website, you implicitly invite visitors—both human and automated. But this invitation comes with expectations: don't consume excessive resources, respect declared preferences, and don't interfere with legitimate operations.
For web crawlers, these expectations manifest as politeness policies—the rules and mechanisms that ensure your crawler is a good citizen of the web. Violating politeness isn't just bad etiquette; it has real consequences: throttling, IP bans, and even legal exposure.
Politeness is not a constraint on your crawler's effectiveness—it's a prerequisite for sustainable operation.
By the end of this page, you will understand: (1) The robots.txt standard and how to implement compliant parsing, (2) Rate limiting strategies at domain, IP, and global levels, (3) Crawl delay mechanisms and their implementation, (4) How to detect and respond to server stress signals, and (5) Best practices for identifying your crawler and handling blocked scenarios.
The Robots Exclusion Protocol is the de facto standard for communicating crawler permissions. Introduced in 1994 and formalized as RFC 9309 in 2022, robots.txt files tell crawlers which parts of a site they may access.
Key principles: robots.txt is advisory rather than an access-control mechanism (compliance is voluntary, but legitimate crawlers treat it as binding), its rules are scoped to a single origin, and when multiple rules match, the most specific one wins.
robots.txt location:
https://example.com/robots.txt
The file MUST be at the root of the domain. /subdirectory/robots.txt is not valid.
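Because the file always lives at the origin root, deriving its location from any page URL is mechanical. A minimal sketch (the helper name is our own) using the standard URL API:

```typescript
// Hypothetical helper: derive the robots.txt URL for any page URL.
// robots.txt lives at the root of the origin (scheme + host + port),
// regardless of the page's path or query string.
function robotsTxtUrl(pageUrl: string): string {
  const { origin } = new URL(pageUrl);
  return `${origin}/robots.txt`;
}

console.log(robotsTxtUrl('https://example.com/blog/post?id=1'));
// https://example.com/robots.txt
```

Note that the origin includes any non-default port, so `https://example.com:8443/` has its own robots.txt, distinct from `https://example.com/`.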
```text
# ============================================================
# BASIC robots.txt
# ============================================================

# Block all crawlers from /private directory
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /api/

# Allow everything else (implicit)

# ============================================================
# COMPLEX robots.txt (like a major site)
# ============================================================

# Google-specific rules
User-agent: Googlebot
Disallow: /search
Disallow: /sdch
Allow: /search/about
Crawl-delay: 1

# Bing-specific rules
User-agent: Bingbot
Disallow: /search
Crawl-delay: 2

# Block bad bots entirely
User-agent: MJ12bot
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

# Default for other crawlers
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Disallow: /*.json$
Disallow: /*?sessionid=
Crawl-delay: 5

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

# ============================================================
# RESTRICTIVE robots.txt (block everything)
# ============================================================

User-agent: *
Disallow: /

# ============================================================
# PERMISSIVE robots.txt (allow everything explicitly)
# ============================================================

User-agent: *
Allow: /
```

| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot |
| Disallow | Blocks access to specified path or pattern | Disallow: /private/ |
| Allow | Explicitly permits access (overrides broader Disallow) | Allow: /private/public-file.html |
| Crawl-delay | Seconds to wait between requests (non-standard but widely used) | Crawl-delay: 10 |
| Sitemap | Location of XML sitemap for URL discovery | Sitemap: https://example.com/sitemap.xml |
| Host | Preferred domain for canonicalization (deprecated) | Host: www.example.com |
While Crawl-delay is not part of the formal standard (RFC 9309), it's widely used and MUST be respected. A Crawl-delay: 60 means wait 60 seconds between requests to that domain. Ignoring this directive is one of the fastest ways to get your crawler blocked. Some sites specify aggressive delays (300+ seconds) specifically to deter unwanted crawlers—respect their wishes.
Implementing a compliant robots.txt parser is surprisingly nuanced. Edge cases abound, and incorrect parsing can lead to crawling pages you shouldn't (risking bans) or missing pages you could (wasting opportunity).
Parsing algorithm overview:
```typescript
interface RobotsRule {
  path: string;
  isAllow: boolean;
}

interface RobotsDirectives {
  rules: RobotsRule[];
  crawlDelay: number | null;
  sitemaps: string[];
}

class RobotsParser {
  private userAgentRules: Map<string, RobotsDirectives> = new Map();
  private globalSitemaps: string[] = [];
  private fetchedAt: Date | null = null;
  private ttlSeconds: number = 86400; // Cache for 24 hours

  /**
   * Parse robots.txt content
   */
  public parse(content: string): void {
    this.userAgentRules.clear();
    this.globalSitemaps = [];

    const lines = content.split(/\r?\n/);
    let currentAgents: string[] = [];
    let currentRules: RobotsRule[] = [];
    let currentCrawlDelay: number | null = null;

    const commitGroup = () => {
      if (currentAgents.length > 0) {
        const directives: RobotsDirectives = {
          rules: [...currentRules],
          crawlDelay: currentCrawlDelay,
          sitemaps: []
        };
        for (const agent of currentAgents) {
          this.userAgentRules.set(agent.toLowerCase(), directives);
        }
      }
      currentAgents = [];
      currentRules = [];
      currentCrawlDelay = null;
    };

    for (const rawLine of lines) {
      // Remove comments and trim
      const line = rawLine.split('#')[0].trim();
      if (!line) continue;

      const colonIndex = line.indexOf(':');
      if (colonIndex === -1) continue;

      const directive = line.substring(0, colonIndex).trim().toLowerCase();
      const value = line.substring(colonIndex + 1).trim();

      switch (directive) {
        case 'user-agent':
          // If we were building a group, commit it
          if (currentRules.length > 0 || currentCrawlDelay !== null) {
            commitGroup();
          }
          currentAgents.push(value);
          break;

        case 'disallow':
          if (value) { // Empty Disallow means allow all
            currentRules.push({ path: value, isAllow: false });
          }
          break;

        case 'allow':
          currentRules.push({ path: value, isAllow: true });
          break;

        case 'crawl-delay': {
          const delay = parseFloat(value);
          if (!isNaN(delay) && delay >= 0) {
            currentCrawlDelay = delay;
          }
          break;
        }

        case 'sitemap':
          this.globalSitemaps.push(value);
          break;
      }
    }

    // Commit final group
    commitGroup();
    this.fetchedAt = new Date();
  }

  /**
   * Check if a URL path is allowed for a given user-agent
   */
  public isAllowed(userAgent: string, path: string): boolean {
    const directives = this.getDirectivesForAgent(userAgent);
    if (!directives || directives.rules.length === 0) {
      return true; // No matching rules = allowed
    }

    // Find the longest matching rule
    let matchedRule: RobotsRule | null = null;
    let matchLength = 0;

    for (const rule of directives.rules) {
      if (this.pathMatches(path, rule.path)) {
        // More specific (longer) patterns take precedence
        if (rule.path.length > matchLength) {
          matchLength = rule.path.length;
          matchedRule = rule;
        }
        // If same length, Allow takes precedence over Disallow
        else if (rule.path.length === matchLength && rule.isAllow) {
          matchedRule = rule;
        }
      }
    }

    return matchedRule ? matchedRule.isAllow : true;
  }

  /**
   * Get crawl delay for a user-agent
   */
  public getCrawlDelay(userAgent: string): number | null {
    const directives = this.getDirectivesForAgent(userAgent);
    return directives?.crawlDelay ?? null;
  }

  /**
   * Check if robots.txt cache is expired
   */
  public isExpired(): boolean {
    if (!this.fetchedAt) return true;
    const age = (Date.now() - this.fetchedAt.getTime()) / 1000;
    return age > this.ttlSeconds;
  }

  private getDirectivesForAgent(userAgent: string): RobotsDirectives | null {
    const normalizedAgent = userAgent.toLowerCase();

    // Try exact match first
    if (this.userAgentRules.has(normalizedAgent)) {
      return this.userAgentRules.get(normalizedAgent)!;
    }

    // Try partial match (e.g., "Googlebot/2.1" matches "Googlebot")
    for (const [agent, directives] of this.userAgentRules) {
      if (normalizedAgent.includes(agent) || agent.includes(normalizedAgent)) {
        return directives;
      }
    }

    // Fall back to wildcard
    return this.userAgentRules.get('*') ?? null;
  }

  private pathMatches(path: string, pattern: string): boolean {
    // Handle wildcards (*) and end-of-path ($)
    // Convert pattern to regex
    const regexPattern = pattern
      .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // Escape special regex chars
      .replace(/\*/g, '.*')                  // * = any characters
      .replace(/\\\$$/, '$');                // trailing $ = end of string

    // Pattern must match from the start of the path
    const regex = new RegExp('^' + regexPattern);
    return regex.test(path);
  }
}

// Example usage
const parser = new RobotsParser();
parser.parse(`User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /api/public/
Crawl-delay: 2

User-agent: Googlebot
Disallow: /search
Allow: /search/about
Crawl-delay: 1`);

console.log(parser.isAllowed('MyBot', '/public/page'));     // true
console.log(parser.isAllowed('MyBot', '/private/secret'));  // false
console.log(parser.isAllowed('MyBot', '/api/internal'));    // false
console.log(parser.isAllowed('MyBot', '/api/public/data')); // true (Allow override)
console.log(parser.getCrawlDelay('MyBot'));                 // 2
console.log(parser.getCrawlDelay('Googlebot'));             // 1
```

In production, use well-tested robots.txt parsing libraries rather than rolling your own. Examples include Google's robotstxt library (C++), rep-cpp, or language-specific packages. These handle edge cases, performance optimizations, and RFC compliance that are easy to get wrong.
Beyond robots.txt, crawlers must implement their own rate limiting to avoid overwhelming servers—even when robots.txt is permissive. Rate limiting operates at multiple levels:
Levels of rate limiting: per-domain (honor each site's crawl delay), per-IP (protect shared hosting, where many domains resolve to one server), and global (cap your crawler's total outbound request rate).
```typescript
class DomainRateLimiter {
  private lastRequestTime: Map<string, number> = new Map();
  private domainDelays: Map<string, number> = new Map();
  private defaultDelay: number = 1000; // 1 second default

  /**
   * Set delay for a specific domain (from robots.txt or config)
   */
  public setDomainDelay(domain: string, delayMs: number): void {
    this.domainDelays.set(domain, delayMs);
  }

  /**
   * Get required delay before next request to domain.
   * Returns 0 if request can proceed immediately.
   */
  public getWaitTime(domain: string): number {
    const now = Date.now();
    const lastRequest = this.lastRequestTime.get(domain) || 0;
    const delay = this.domainDelays.get(domain) ?? this.defaultDelay;

    const elapsed = now - lastRequest;
    return Math.max(0, delay - elapsed);
  }

  /**
   * Check if we can make a request to domain now
   */
  public canRequest(domain: string): boolean {
    return this.getWaitTime(domain) === 0;
  }

  /**
   * Record that a request was made to domain
   */
  public recordRequest(domain: string): void {
    this.lastRequestTime.set(domain, Date.now());
  }

  /**
   * Get next available time to request from domain
   */
  public getNextAvailableTime(domain: string): Date {
    const waitTime = this.getWaitTime(domain);
    return new Date(Date.now() + waitTime);
  }
}

class IPRateLimiter {
  // Track requests per IP to handle shared hosting
  private ipRequestCounts: Map<string, { count: number; windowStart: number }> = new Map();
  private maxRequestsPerWindow: number = 10;
  private windowSizeMs: number = 60000; // 1 minute

  /**
   * Check if we can make a request to this IP
   */
  public canRequest(ip: string): boolean {
    const now = Date.now();
    const record = this.ipRequestCounts.get(ip);

    if (!record || now - record.windowStart >= this.windowSizeMs) {
      return true; // New window
    }
    return record.count < this.maxRequestsPerWindow;
  }

  /**
   * Record a request to an IP
   */
  public recordRequest(ip: string): void {
    const now = Date.now();
    const record = this.ipRequestCounts.get(ip);

    if (!record || now - record.windowStart >= this.windowSizeMs) {
      // Start new window
      this.ipRequestCounts.set(ip, { count: 1, windowStart: now });
    } else {
      // Increment in current window
      record.count++;
    }
  }
}

class PolitenessScheduler {
  private domainLimiter: DomainRateLimiter = new DomainRateLimiter();
  private ipLimiter: IPRateLimiter = new IPRateLimiter();
  private dnsCache: Map<string, string> = new Map(); // domain -> IP

  /**
   * Check if crawling a URL is currently allowed
   */
  public async canCrawl(url: string): Promise<{ allowed: boolean; waitTime: number }> {
    const { hostname } = new URL(url);

    // Check domain rate limit
    const domainWait = this.domainLimiter.getWaitTime(hostname);
    if (domainWait > 0) {
      return { allowed: false, waitTime: domainWait };
    }

    // Check IP rate limit (for shared hosting protection)
    const ip = await this.resolveIP(hostname);
    if (!this.ipLimiter.canRequest(ip)) {
      // Need to wait for IP window to reset
      return { allowed: false, waitTime: 1000 }; // Rough estimate
    }

    return { allowed: true, waitTime: 0 };
  }

  /**
   * Record that a request was made
   */
  public async recordRequest(url: string): Promise<void> {
    const { hostname } = new URL(url);
    const ip = await this.resolveIP(hostname);
    this.domainLimiter.recordRequest(hostname);
    this.ipLimiter.recordRequest(ip);
  }

  /**
   * Configure domain delay from robots.txt
   */
  public setDomainCrawlDelay(domain: string, delaySec: number): void {
    this.domainLimiter.setDomainDelay(domain, delaySec * 1000);
  }

  private async resolveIP(hostname: string): Promise<string> {
    if (this.dnsCache.has(hostname)) {
      return this.dnsCache.get(hostname)!;
    }
    // In a real implementation, use a DNS resolver; this is simplified
    const ip = `resolved-ip-for-${hostname}`;
    this.dnsCache.set(hostname, ip);
    return ip;
  }
}
```

Many small websites share the same IP address through shared hosting. Without IP-level rate limiting, a crawler could overwhelm a shared server by crawling many domains simultaneously—each within its domain rate limit, but collectively exceeding what the server can handle. Always consider the IP layer, especially for smaller sites.
Even with careful rate limiting, servers may become stressed. A polite crawler monitors for stress signals and adjusts behavior dynamically.
Stress indicators to monitor: climbing response times, HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses, and streaks of consecutive 5xx errors.
```typescript
interface DomainHealthMetrics {
  domain: string;
  recentResponseTimes: number[]; // Last N response times in ms
  recentErrorCodes: number[];    // Last N HTTP status codes
  consecutiveErrors: number;     // Current error streak
  currentBackoffLevel: number;   // 0 = normal, higher = more backed off
  lastRequestTime: number;
  blockedUntil: number;          // Timestamp when block expires
}

class AdaptivePolitenessController {
  private domainMetrics: Map<string, DomainHealthMetrics> = new Map();

  // Backoff configuration
  private baseDelay: number = 1000;        // 1 second base
  private maxBackoffLevel: number = 8;     // Max 2^8 = 256x backoff
  private errorThreshold: number = 5;      // Errors before backoff
  private latencyThreshold: number = 2000; // 2 second latency = slow
  private metricsWindowSize: number = 20;

  /**
   * Record the result of a crawl attempt
   */
  public recordResult(domain: string, statusCode: number, responseTimeMs: number): void {
    const metrics = this.getOrCreateMetrics(domain);

    // Update response times
    metrics.recentResponseTimes.push(responseTimeMs);
    if (metrics.recentResponseTimes.length > this.metricsWindowSize) {
      metrics.recentResponseTimes.shift();
    }

    // Update error codes
    metrics.recentErrorCodes.push(statusCode);
    if (metrics.recentErrorCodes.length > this.metricsWindowSize) {
      metrics.recentErrorCodes.shift();
    }

    // Handle specific status codes
    if (statusCode === 429) {
      // Explicit rate limiting - significant backoff
      this.increaseBackoff(domain, 3); // Jump 3 levels
    } else if (statusCode === 503) {
      // Server overloaded - moderate backoff
      this.increaseBackoff(domain, 2);
    } else if (statusCode >= 500) {
      // Server error - slight backoff
      metrics.consecutiveErrors++;
      if (metrics.consecutiveErrors >= this.errorThreshold) {
        this.increaseBackoff(domain, 1);
      }
    } else if (statusCode >= 200 && statusCode < 300) {
      // Success - potentially reduce backoff
      metrics.consecutiveErrors = 0;
      if (this.isDomainHealthy(domain)) {
        this.decreaseBackoff(domain);
      }
    }

    // Check for latency-based stress
    if (this.isLatencyElevated(domain)) {
      this.increaseBackoff(domain, 1);
    }

    metrics.lastRequestTime = Date.now();
  }

  /**
   * Get the current delay for a domain
   */
  public getCurrentDelay(domain: string): number {
    const metrics = this.getOrCreateMetrics(domain);
    const backoffMultiplier = Math.pow(2, metrics.currentBackoffLevel);
    return this.baseDelay * backoffMultiplier;
  }

  /**
   * Check if domain is blocked (e.g., after receiving block signal)
   */
  public isBlocked(domain: string): boolean {
    const metrics = this.getOrCreateMetrics(domain);
    return Date.now() < metrics.blockedUntil;
  }

  /**
   * Block domain for a specified duration (e.g., from Retry-After header)
   */
  public blockDomain(domain: string, durationMs: number): void {
    const metrics = this.getOrCreateMetrics(domain);
    metrics.blockedUntil = Date.now() + durationMs;
  }

  private increaseBackoff(domain: string, levels: number = 1): void {
    const metrics = this.getOrCreateMetrics(domain);
    metrics.currentBackoffLevel = Math.min(
      this.maxBackoffLevel,
      metrics.currentBackoffLevel + levels
    );
    console.log(
      `Increased backoff for ${domain} to level ${metrics.currentBackoffLevel} ` +
      `(delay: ${this.getCurrentDelay(domain)}ms)`
    );
  }

  private decreaseBackoff(domain: string): void {
    const metrics = this.getOrCreateMetrics(domain);
    if (metrics.currentBackoffLevel > 0) {
      metrics.currentBackoffLevel--;
    }
  }

  private isDomainHealthy(domain: string): boolean {
    const metrics = this.getOrCreateMetrics(domain);

    // Check error rate
    const recentErrors = metrics.recentErrorCodes.filter(c => c >= 400).length;
    const errorRate = recentErrors / metrics.recentErrorCodes.length;

    // Check average latency
    const avgLatency = this.getAverageLatency(domain);

    return errorRate < 0.1 && avgLatency < this.latencyThreshold;
  }

  private isLatencyElevated(domain: string): boolean {
    return this.getAverageLatency(domain) > this.latencyThreshold;
  }

  private getAverageLatency(domain: string): number {
    const metrics = this.getOrCreateMetrics(domain);
    if (metrics.recentResponseTimes.length === 0) return 0;
    const sum = metrics.recentResponseTimes.reduce((a, b) => a + b, 0);
    return sum / metrics.recentResponseTimes.length;
  }

  private getOrCreateMetrics(domain: string): DomainHealthMetrics {
    if (!this.domainMetrics.has(domain)) {
      this.domainMetrics.set(domain, {
        domain,
        recentResponseTimes: [],
        recentErrorCodes: [],
        consecutiveErrors: 0,
        currentBackoffLevel: 0,
        lastRequestTime: 0,
        blockedUntil: 0
      });
    }
    return this.domainMetrics.get(domain)!;
  }
}
```

When you receive a 429 or 503 response with a Retry-After header, ALWAYS respect it. This header tells you exactly how long to wait (either in seconds or as an HTTP-date). Ignoring it is hostile behavior that will get your crawler permanently blocked.
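Since Retry-After comes in two forms (delta-seconds or an HTTP-date, per RFC 9110), a crawler needs to handle both. A minimal sketch (the function name is our own):

```typescript
// Sketch: parse a Retry-After header value, which may be either a number
// of seconds ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
// Returns milliseconds to wait, or null if the value is unparseable.
function parseRetryAfter(headerValue: string, nowMs: number = Date.now()): number | null {
  const seconds = Number(headerValue);
  if (!Number.isNaN(seconds) && seconds >= 0) {
    return seconds * 1000; // delta-seconds form
  }
  const dateMs = Date.parse(headerValue);
  if (!Number.isNaN(dateMs)) {
    return Math.max(0, dateMs - nowMs); // HTTP-date form, clamped at zero
  }
  return null;
}

console.log(parseRetryAfter('120')); // 120000
```

The result can be fed straight into something like the blockDomain method above: a null return (malformed header) is best treated as a signal to apply your own conservative backoff rather than to ignore the response.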
Legitimate crawlers identify themselves clearly. This transparency serves multiple purposes: site operators can recognize your traffic in their logs, contact you when something goes wrong, and choose to allow or restrict your crawler deliberately instead of blocking it as an unknown bot.
User-Agent best practices:
```typescript
// ============================================================
// GOOD User-Agent Examples
// ============================================================

// Googlebot (the gold standard)
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

// Bingbot
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

// Your custom crawler (recommended format)
"MyCrawler/1.0 (+https://mycrawler.example.com/about; contact@example.com)"

// With more detail
"AcmeSearchBot/2.5.1 (Linux; +https://acme.com/searchbot; crawl-admin@acme.com)"

// ============================================================
// BAD User-Agent Examples (DON'T DO THIS)
// ============================================================

// Spoofing a browser (hostile, possibly illegal)
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."

// No identification
"curl/7.68.0"

// Generic/meaningless
"python-requests/2.25.1"
"Java/1.8.0_201"

// No contact info
"SomeCrawler/1.0"

// ============================================================
// HTTP Request Headers for Identification
// ============================================================

const crawlerHeaders = {
  'User-Agent': 'AcmeCrawler/1.0 (+https://acme.com/crawler; crawler@acme.com)',
  'From': 'crawler@acme.com', // RFC 7231 header for bot contact
  'Accept': 'text/html,application/xhtml+xml',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Optional: indicate crawl purpose
  'X-Crawler-Purpose': 'search-indexing'
};
```

Making your crawler appear to be a browser (User-Agent spoofing) is considered hostile behavior. It's often used to bypass bot detection, which makes it look like your crawler is trying to evade scrutiny. Depending on jurisdiction and terms of service, this may be illegal. Always identify your crawler honestly.
Even with perfect politeness, your crawler may be blocked. Blocks can be intentional (site policy) or mistaken (you're collateral damage from anti-bot measures). How you respond matters.
Types of blocks and appropriate responses:
| Block Type | How to Detect | Appropriate Response |
|---|---|---|
| robots.txt Disallow | Path matches a Disallow rule for your User-agent | Respect it completely. Do not crawl. |
| HTTP 403 Forbidden | Status code 403 on crawl attempt | Stop crawling this URL. Log for analysis. |
| HTTP 401 Unauthorized | Status code 401, requires authentication | Skip. Crawler shouldn't access protected content. |
| CAPTCHA Challenge | Response contains CAPTCHA HTML patterns | Stop crawling domain temporarily. Do NOT solve CAPTCHAs. |
| IP Block (connection refused) | Connection timeout or TCP RST | Pause all crawling to that IP. May need IP rotation. |
| Soft Block (fake content) | Valid response but with bot-trap content | Detect via content analysis. Reduce rate significantly. |
| Honeypot Links | Links visible only to crawlers, trap URLs | Detect patterns. Avoid following suspicious links. |
The best response to blocks is avoiding them in the first place. Reasonable rate limits, robots.txt compliance, and clear identification prevent most blocks. If you're being frequently blocked, audit your crawler's behavior before looking for workarounds.
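The table above can be approximated in code as a small response classifier. This is an illustrative sketch (the type and function names are our own, and the CAPTCHA regex is a deliberately crude placeholder; real detection needs more signals):

```typescript
// Map a crawl outcome to one of the block responses in the table above.
type BlockAction = 'stop-url' | 'skip' | 'pause-domain' | 'continue';

function classifyResponse(statusCode: number, body: string): BlockAction {
  if (statusCode === 403) return 'stop-url';   // Forbidden: stop this URL, log it
  if (statusCode === 401) return 'skip';       // Auth required: skip protected content
  // Crude CAPTCHA heuristic -- a stand-in for real content analysis
  if (/captcha|are you a robot/i.test(body)) return 'pause-domain';
  return 'continue';
}

console.log(classifyResponse(403, ''));                       // stop-url
console.log(classifyResponse(200, '<div class="captcha">'));  // pause-domain
```

Connection-level blocks (timeouts, TCP RST) never reach this function; they have to be caught in the fetch layer and handled per-IP rather than per-URL.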
In a distributed crawler with many worker nodes, politeness becomes a coordination challenge. You must ensure that the aggregate rate across all workers respects limits, not just individual worker rates.
The distributed politeness problem: each worker may respect per-domain limits locally, yet if several workers crawl the same domain, the aggregate rate multiplies. N workers each waiting one second between requests can still hit a single domain N times per second.
Solutions: a centralized rate-limiting service (for example, token buckets in a shared store such as Redis), per-domain distributed locks, or, simplest of all, domain affinity: partition domains so that each one is crawled by exactly one worker.
Recommended architecture for distributed politeness:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED POLITENESS WITH DOMAIN AFFINITY │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ URL FRONTIER (Partitioned by Domain) │ │
│ │ │ │
│ │ Partition 0 Partition 1 Partition 2 │ │
│ │ ───────────── ───────────── ───────────── │ │
│ │ example.com ──┐ google.com ───┐ amazon.com ───┐ │ │
│ │ github.com ───┼──▶ facebook.com ─┼──▶ netflix.com ──┼──▶ │ │
│ │ ... │ ... │ ... │ │ │
│ │ │ │ │ │ │
│ └─────────────────┼────────────────────┼────────────────────┼──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER NODES │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Owns: │ │ Owns: │ │ Owns: │ │ │
│ │ │ - Partition 0 │ │ - Partition 1 │ │ - Partition 2 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Local Rate │ │ Local Rate │ │ Local Rate │ │ │
│ │ │ Limiter per │ │ Limiter per │ │ Limiter per │ │ │
│ │ │ owned domain │ │ owned domain │ │ owned domain │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ │ │ │
│ │ Key Insight: Each domain is crawled by exactly ONE worker. │ │
│ │ No cross-worker coordination needed for rate limiting! │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SHARED SERVICES │ │
│ │ │ │
│ │ ┌─────────────────────────┐ ┌────────────────────────────────────┐ │ │
│ │ │ robots.txt Cache │ │ DNS Cache │ │ │
│ │ │ (shared to avoid │ │ (shared to reduce DNS load) │ │ │
│ │ │ redundant fetches) │ │ │ │ │
│ │ └─────────────────────────┘ └────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Domain affinity is the recommended approach for most crawlers because it eliminates distributed coordination overhead while naturally achieving politeness. The key insight: if only Worker 0 ever crawls example.com, then Worker 0's local rate limiter is sufficient.
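The partitioning step itself can be as simple as a stable hash of the domain name. A minimal sketch (the function name is ours; FNV-1a is one arbitrary choice of hash, and any stable hash works):

```typescript
// Domain affinity: hash each domain to exactly one worker index, so that
// worker's local rate limiter is sufficient. Uses FNV-1a over the domain
// string, folded into [0, numWorkers).
function ownerWorker(domain: string, numWorkers: number): number {
  let hash = 2166136261; // FNV offset basis
  for (let i = 0; i < domain.length; i++) {
    hash ^= domain.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % numWorkers; // force unsigned before modulo
}
```

Every call with the same domain and worker count returns the same index, so `example.com` is always routed to one worker and that worker's local limiter sees all traffic to it. The trade-off: if the worker pool is resized, domains reshuffle, so production systems often layer consistent hashing on top to limit movement.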
We've covered the comprehensive framework for crawler politeness—the principles and mechanisms that enable sustainable web crawling at scale.
What's Next:
The next page explores Duplicate Detection—how to identify and avoid wasting resources on duplicate content that appears under different URLs. This includes content fingerprinting, near-duplicate detection, and the probabilistic data structures that make it feasible at scale.
You now understand how to build a polite crawler that can operate sustainably at scale. Politeness isn't a constraint on effectiveness—it's the foundation that enables long-term operation. A polite crawler has access to more of the web, runs more efficiently, and avoids legal and reputational trouble.