The web operates on a foundation of implicit trust. When you publish a website, you implicitly invite visitors—both human and automated. But this invitation comes with expectations: don't consume excessive resources, respect declared preferences, and don't interfere with legitimate operations.
For web crawlers, these expectations manifest as politeness policies—the rules and mechanisms that ensure your crawler is a good citizen of the web. Violating politeness isn't just bad etiquette; it has real consequences: throttling, IP bans, and even legal exposure.
Politeness is not a constraint on your crawler's effectiveness—it's a prerequisite for sustainable operation.
By the end of this page, you will understand: (1) The robots.txt standard and how to implement compliant parsing, (2) Rate limiting strategies at domain, IP, and global levels, (3) Crawl delay mechanisms and their implementation, (4) How to detect and respond to server stress signals, and (5) Best practices for identifying your crawler and handling blocked scenarios.
The Robots Exclusion Protocol is the de facto standard for communicating crawler permissions. Introduced in 1994 and formalized as RFC 9309 in 2022, robots.txt files tell crawlers which parts of a site they may access.
Key principles: robots.txt is advisory rather than an access-control mechanism (compliance is voluntary, but legitimate crawlers treat it as binding), its rules are scoped to a single origin, and when multiple rules match, the most specific one wins.
robots.txt location:
https://example.com/robots.txt
The file MUST be at the root of the domain. /subdirectory/robots.txt is not valid.
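Because the file always lives at the origin root, deriving its location from any page URL is mechanical. A minimal sketch (the helper name is our own) using the standard URL API:

```typescript
// Hypothetical helper: derive the robots.txt URL for any page URL.
// robots.txt lives at the root of the origin (scheme + host + port),
// regardless of the page's path or query string.
function robotsTxtUrl(pageUrl: string): string {
  const { origin } = new URL(pageUrl);
  return `${origin}/robots.txt`;
}

console.log(robotsTxtUrl('https://example.com/blog/post?id=1'));
// https://example.com/robots.txt
```

Note that the origin includes any non-default port, so `https://example.com:8443/` has its own robots.txt, distinct from `https://example.com/`.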
```text
# ============================================================
# BASIC robots.txt
# ============================================================

# Block all crawlers from /private directory
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /api/

# Allow everything else (implicit)

# ============================================================
# COMPLEX robots.txt (like a major site)
# ============================================================

# Google-specific rules
User-agent: Googlebot
Disallow: /search
Disallow: /sdch
Allow: /search/about
Crawl-delay: 1

# Bing-specific rules
User-agent: Bingbot
Disallow: /search
Crawl-delay: 2

# Block bad bots entirely
User-agent: MJ12bot
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

# Default for other crawlers
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
Disallow: /*.json$
Disallow: /*?sessionid=
Crawl-delay: 5

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

# ============================================================
# RESTRICTIVE robots.txt (block everything)
# ============================================================

User-agent: *
Disallow: /

# ============================================================
# PERMISSIVE robots.txt (allow everything explicitly)
# ============================================================

User-agent: *
Allow: /
```

| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot |
| Disallow | Blocks access to specified path or pattern | Disallow: /private/ |
| Allow | Explicitly permits access (overrides broader Disallow) | Allow: /private/public-file.html |
| Crawl-delay | Seconds to wait between requests (non-standard but widely used) | Crawl-delay: 10 |
| Sitemap | Location of XML sitemap for URL discovery | Sitemap: https://example.com/sitemap.xml |
| Host | Preferred domain for canonicalization (deprecated) | Host: www.example.com |
While Crawl-delay is not part of the formal standard (RFC 9309), it's widely used and MUST be respected. A Crawl-delay: 60 means wait 60 seconds between requests to that domain. Ignoring this directive is one of the fastest ways to get your crawler blocked. Some sites specify aggressive delays (300+ seconds) specifically to deter unwanted crawlers—respect their wishes.
Implementing a compliant robots.txt parser is surprisingly nuanced. Edge cases abound, and incorrect parsing can lead to crawling pages you shouldn't (risking bans) or missing pages you could (wasting opportunity).
Parsing algorithm overview:
```typescript
interface RobotsRule {
  path: string;
  isAllow: boolean;
}

interface RobotsDirectives {
  rules: RobotsRule[];
  crawlDelay: number | null;
  sitemaps: string[];
}

class RobotsParser {
  private userAgentRules: Map<string, RobotsDirectives> = new Map();
  private globalSitemaps: string[] = [];
  private fetchedAt: Date | null = null;
  private ttlSeconds: number = 86400; // Cache for 24 hours

  /**
   * Parse robots.txt content
   */
  public parse(content: string): void {
    this.userAgentRules.clear();
    this.globalSitemaps = [];

    const lines = content.split(/\r?\n/);
    let currentAgents: string[] = [];
    let currentRules: RobotsRule[] = [];
    let currentCrawlDelay: number | null = null;

    const commitGroup = () => {
      if (currentAgents.length > 0) {
        const directives: RobotsDirectives = {
          rules: [...currentRules],
          crawlDelay: currentCrawlDelay,
          sitemaps: []
        };
        for (const agent of currentAgents) {
          this.userAgentRules.set(agent.toLowerCase(), directives);
        }
      }
      currentAgents = [];
      currentRules = [];
      currentCrawlDelay = null;
    };

    for (const rawLine of lines) {
      // Remove comments and trim
      const line = rawLine.split('#')[0].trim();
      if (!line) continue;

      const colonIndex = line.indexOf(':');
      if (colonIndex === -1) continue;

      const directive = line.substring(0, colonIndex).trim().toLowerCase();
      const value = line.substring(colonIndex + 1).trim();

      switch (directive) {
        case 'user-agent':
          // If we were building a group, commit it
          if (currentRules.length > 0 || currentCrawlDelay !== null) {
            commitGroup();
          }
          currentAgents.push(value);
          break;

        case 'disallow':
          if (value) { // Empty Disallow means allow all
            currentRules.push({ path: value, isAllow: false });
          }
          break;

        case 'allow':
          currentRules.push({ path: value, isAllow: true });
          break;

        case 'crawl-delay': {
          const delay = parseFloat(value);
          if (!isNaN(delay) && delay >= 0) {
            currentCrawlDelay = delay;
          }
          break;
        }

        case 'sitemap':
          this.globalSitemaps.push(value);
          break;
      }
    }

    // Commit final group
    commitGroup();
    this.fetchedAt = new Date();
  }

  /**
   * Check if a URL path is allowed for a given user-agent
   */
  public isAllowed(userAgent: string, path: string): boolean {
    const directives = this.getDirectivesForAgent(userAgent);
    if (!directives || directives.rules.length === 0) {
      return true; // No matching rules = allowed
    }

    // Find the longest matching rule
    let matchedRule: RobotsRule | null = null;
    let matchLength = 0;

    for (const rule of directives.rules) {
      if (this.pathMatches(path, rule.path)) {
        // More specific (longer) patterns take precedence
        if (rule.path.length > matchLength) {
          matchLength = rule.path.length;
          matchedRule = rule;
        }
        // If same length, Allow takes precedence over Disallow
        else if (rule.path.length === matchLength && rule.isAllow) {
          matchedRule = rule;
        }
      }
    }

    return matchedRule ? matchedRule.isAllow : true;
  }

  /**
   * Get crawl delay for a user-agent
   */
  public getCrawlDelay(userAgent: string): number | null {
    const directives = this.getDirectivesForAgent(userAgent);
    return directives?.crawlDelay ?? null;
  }

  /**
   * Check if robots.txt cache is expired
   */
  public isExpired(): boolean {
    if (!this.fetchedAt) return true;
    const age = (Date.now() - this.fetchedAt.getTime()) / 1000;
    return age > this.ttlSeconds;
  }

  private getDirectivesForAgent(userAgent: string): RobotsDirectives | null {
    const normalizedAgent = userAgent.toLowerCase();

    // Try exact match first
    if (this.userAgentRules.has(normalizedAgent)) {
      return this.userAgentRules.get(normalizedAgent)!;
    }

    // Try partial match (e.g., "Googlebot/2.1" matches "Googlebot")
    for (const [agent, directives] of this.userAgentRules) {
      if (normalizedAgent.includes(agent) || agent.includes(normalizedAgent)) {
        return directives;
      }
    }

    // Fall back to wildcard
    return this.userAgentRules.get('*') ?? null;
  }

  private pathMatches(path: string, pattern: string): boolean {
    // Handle wildcards (*) and end-of-path ($)
    // Convert pattern to regex
    const regexPattern = pattern
      .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // Escape special regex chars
      .replace(/\*/g, '.*')                  // * = any characters
      .replace(/\\\$$/, '$');                // trailing $ = end of string

    // Pattern must match from the start of the path
    const regex = new RegExp('^' + regexPattern);
    return regex.test(path);
  }
}

// Example usage
const parser = new RobotsParser();
parser.parse(`User-agent: *
Disallow: /private/
Disallow: /api/
Allow: /api/public/
Crawl-delay: 2

User-agent: Googlebot
Disallow: /search
Allow: /search/about
Crawl-delay: 1`);

console.log(parser.isAllowed('MyBot', '/public/page'));     // true
console.log(parser.isAllowed('MyBot', '/private/secret'));  // false
console.log(parser.isAllowed('MyBot', '/api/internal'));    // false
console.log(parser.isAllowed('MyBot', '/api/public/data')); // true (Allow override)
console.log(parser.getCrawlDelay('MyBot'));                 // 2
console.log(parser.getCrawlDelay('Googlebot'));             // 1
```

In production, use well-tested robots.txt parsing libraries rather than rolling your own. Examples include Google's robotstxt library (C++), rep-cpp, or language-specific packages. These handle edge cases, performance optimizations, and RFC compliance that are easy to get wrong.
Beyond robots.txt, crawlers must implement their own rate limiting to avoid overwhelming servers—even when robots.txt is permissive. Rate limiting operates at multiple levels:
Levels of rate limiting: per-domain (honor each site's crawl delay), per-IP (protect shared hosting, where many domains resolve to one server), and global (cap your crawler's total outbound request rate).
```typescript
class DomainRateLimiter {
  private lastRequestTime: Map<string, number> = new Map();
  private domainDelays: Map<string, number> = new Map();
  private defaultDelay: number = 1000; // 1 second default

  /**
   * Set delay for a specific domain (from robots.txt or config)
   */
  public setDomainDelay(domain: string, delayMs: number): void {
    this.domainDelays.set(domain, delayMs);
  }

  /**
   * Get required delay before next request to domain.
   * Returns 0 if request can proceed immediately.
   */
  public getWaitTime(domain: string): number {
    const now = Date.now();
    const lastRequest = this.lastRequestTime.get(domain) || 0;
    const delay = this.domainDelays.get(domain) ?? this.defaultDelay;

    const elapsed = now - lastRequest;
    return Math.max(0, delay - elapsed);
  }

  /**
   * Check if we can make a request to domain now
   */
  public canRequest(domain: string): boolean {
    return this.getWaitTime(domain) === 0;
  }

  /**
   * Record that a request was made to domain
   */
  public recordRequest(domain: string): void {
    this.lastRequestTime.set(domain, Date.now());
  }

  /**
   * Get next available time to request from domain
   */
  public getNextAvailableTime(domain: string): Date {
    const waitTime = this.getWaitTime(domain);
    return new Date(Date.now() + waitTime);
  }
}

class IPRateLimiter {
  // Track requests per IP to handle shared hosting
  private ipRequestCounts: Map<string, { count: number; windowStart: number }> = new Map();
  private maxRequestsPerWindow: number = 10;
  private windowSizeMs: number = 60000; // 1 minute

  /**
   * Check if we can make a request to this IP
   */
  public canRequest(ip: string): boolean {
    const now = Date.now();
    const record = this.ipRequestCounts.get(ip);

    if (!record || now - record.windowStart >= this.windowSizeMs) {
      return true; // New window
    }
    return record.count < this.maxRequestsPerWindow;
  }

  /**
   * Record a request to an IP
   */
  public recordRequest(ip: string): void {
    const now = Date.now();
    const record = this.ipRequestCounts.get(ip);

    if (!record || now - record.windowStart >= this.windowSizeMs) {
      // Start new window
      this.ipRequestCounts.set(ip, { count: 1, windowStart: now });
    } else {
      // Increment in current window
      record.count++;
    }
  }
}

class PolitenessScheduler {
  private domainLimiter: DomainRateLimiter = new DomainRateLimiter();
  private ipLimiter: IPRateLimiter = new IPRateLimiter();
  private dnsCache: Map<string, string> = new Map(); // domain -> IP

  /**
   * Check if crawling a URL is currently allowed
   */
  public async canCrawl(url: string): Promise<{ allowed: boolean; waitTime: number }> {
    const { hostname } = new URL(url);

    // Check domain rate limit
    const domainWait = this.domainLimiter.getWaitTime(hostname);
    if (domainWait > 0) {
      return { allowed: false, waitTime: domainWait };
    }

    // Check IP rate limit (for shared hosting protection)
    const ip = await this.resolveIP(hostname);
    if (!this.ipLimiter.canRequest(ip)) {
      // Need to wait for IP window to reset
      return { allowed: false, waitTime: 1000 }; // Rough estimate
    }

    return { allowed: true, waitTime: 0 };
  }

  /**
   * Record that a request was made
   */
  public async recordRequest(url: string): Promise<void> {
    const { hostname } = new URL(url);
    const ip = await this.resolveIP(hostname);
    this.domainLimiter.recordRequest(hostname);
    this.ipLimiter.recordRequest(ip);
  }

  /**
   * Configure domain delay from robots.txt
   */
  public setDomainCrawlDelay(domain: string, delaySec: number): void {
    this.domainLimiter.setDomainDelay(domain, delaySec * 1000);
  }

  private async resolveIP(hostname: string): Promise<string> {
    if (this.dnsCache.has(hostname)) {
      return this.dnsCache.get(hostname)!;
    }
    // In a real implementation, use a DNS resolver; this is simplified
    const ip = `resolved-ip-for-${hostname}`;
    this.dnsCache.set(hostname, ip);
    return ip;
  }
}
```

Many small websites share the same IP address through shared hosting. Without IP-level rate limiting, a crawler could overwhelm a shared server by crawling many domains simultaneously—each within its domain rate limit, but collectively exceeding what the server can handle. Always consider the IP layer, especially for smaller sites.
Even with careful rate limiting, servers may become stressed. A polite crawler monitors for stress signals and adjusts behavior dynamically.
Stress indicators to monitor: climbing response times, HTTP 429 (Too Many Requests) and 503 (Service Unavailable) responses, and streaks of consecutive 5xx errors.
```typescript
interface DomainHealthMetrics {
  domain: string;
  recentResponseTimes: number[]; // Last N response times in ms
  recentErrorCodes: number[];    // Last N HTTP status codes
  consecutiveErrors: number;     // Current error streak
  currentBackoffLevel: number;   // 0 = normal, higher = more backed off
  lastRequestTime: number;
  blockedUntil: number;          // Timestamp when block expires
}

class AdaptivePolitenessController {
  private domainMetrics: Map<string, DomainHealthMetrics> = new Map();

  // Backoff configuration
  private baseDelay: number = 1000;        // 1 second base
  private maxBackoffLevel: number = 8;     // Max 2^8 = 256x backoff
  private errorThreshold: number = 5;      // Errors before backoff
  private latencyThreshold: number = 2000; // 2 second latency = slow
  private metricsWindowSize: number = 20;

  /**
   * Record the result of a crawl attempt
   */
  public recordResult(domain: string, statusCode: number, responseTimeMs: number): void {
    const metrics = this.getOrCreateMetrics(domain);

    // Update response times
    metrics.recentResponseTimes.push(responseTimeMs);
    if (metrics.recentResponseTimes.length > this.metricsWindowSize) {
      metrics.recentResponseTimes.shift();
    }

    // Update error codes
    metrics.recentErrorCodes.push(statusCode);
    if (metrics.recentErrorCodes.length > this.metricsWindowSize) {
      metrics.recentErrorCodes.shift();
    }

    // Handle specific status codes
    if (statusCode === 429) {
      // Explicit rate limiting - significant backoff
      this.increaseBackoff(domain, 3); // Jump 3 levels
    } else if (statusCode === 503) {
      // Server overloaded - moderate backoff
      this.increaseBackoff(domain, 2);
    } else if (statusCode >= 500) {
      // Server error - slight backoff
      metrics.consecutiveErrors++;
      if (metrics.consecutiveErrors >= this.errorThreshold) {
        this.increaseBackoff(domain, 1);
      }
    } else if (statusCode >= 200 && statusCode < 300) {
      // Success - potentially reduce backoff
      metrics.consecutiveErrors = 0;
      if (this.isDomainHealthy(domain)) {
        this.decreaseBackoff(domain);
      }
    }

    // Check for latency-based stress
    if (this.isLatencyElevated(domain)) {
      this.increaseBackoff(domain, 1);
    }

    metrics.lastRequestTime = Date.now();
  }

  /**
   * Get the current delay for a domain
   */
  public getCurrentDelay(domain: string): number {
    const metrics = this.getOrCreateMetrics(domain);
    const backoffMultiplier = Math.pow(2, metrics.currentBackoffLevel);
    return this.baseDelay * backoffMultiplier;
  }

  /**
   * Check if domain is blocked (e.g., after receiving block signal)
   */
  public isBlocked(domain: string): boolean {
    const metrics = this.getOrCreateMetrics(domain);
    return Date.now() < metrics.blockedUntil;
  }

  /**
   * Block domain for a specified duration (e.g., from Retry-After header)
   */
  public blockDomain(domain: string, durationMs: number): void {
    const metrics = this.getOrCreateMetrics(domain);
    metrics.blockedUntil = Date.now() + durationMs;
  }

  private increaseBackoff(domain: string, levels: number = 1): void {
    const metrics = this.getOrCreateMetrics(domain);
    metrics.currentBackoffLevel = Math.min(
      this.maxBackoffLevel,
      metrics.currentBackoffLevel + levels
    );
    console.log(
      `Increased backoff for ${domain} to level ${metrics.currentBackoffLevel} ` +
      `(delay: ${this.getCurrentDelay(domain)}ms)`
    );
  }

  private decreaseBackoff(domain: string): void {
    const metrics = this.getOrCreateMetrics(domain);
    if (metrics.currentBackoffLevel > 0) {
      metrics.currentBackoffLevel--;
    }
  }

  private isDomainHealthy(domain: string): boolean {
    const metrics = this.getOrCreateMetrics(domain);

    // Check error rate
    const recentErrors = metrics.recentErrorCodes.filter(c => c >= 400).length;
    const errorRate = recentErrors / metrics.recentErrorCodes.length;

    // Check average latency
    const avgLatency = this.getAverageLatency(domain);

    return errorRate < 0.1 && avgLatency < this.latencyThreshold;
  }

  private isLatencyElevated(domain: string): boolean {
    return this.getAverageLatency(domain) > this.latencyThreshold;
  }

  private getAverageLatency(domain: string): number {
    const metrics = this.getOrCreateMetrics(domain);
    if (metrics.recentResponseTimes.length === 0) return 0;
    const sum = metrics.recentResponseTimes.reduce((a, b) => a + b, 0);
    return sum / metrics.recentResponseTimes.length;
  }

  private getOrCreateMetrics(domain: string): DomainHealthMetrics {
    if (!this.domainMetrics.has(domain)) {
      this.domainMetrics.set(domain, {
        domain,
        recentResponseTimes: [],
        recentErrorCodes: [],
        consecutiveErrors: 0,
        currentBackoffLevel: 0,
        lastRequestTime: 0,
        blockedUntil: 0
      });
    }
    return this.domainMetrics.get(domain)!;
  }
}
```

When you receive a 429 or 503 response with a Retry-After header, ALWAYS respect it. This header tells you exactly how long to wait (either in seconds or as an HTTP-date). Ignoring it is hostile behavior that will get your crawler permanently blocked.
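Since Retry-After comes in two forms (delta-seconds or an HTTP-date, per RFC 9110), a crawler needs to handle both. A minimal sketch (the function name is our own):

```typescript
// Sketch: parse a Retry-After header value, which may be either a number
// of seconds ("120") or an HTTP-date ("Wed, 21 Oct 2015 07:28:00 GMT").
// Returns milliseconds to wait, or null if the value is unparseable.
function parseRetryAfter(headerValue: string, nowMs: number = Date.now()): number | null {
  const seconds = Number(headerValue);
  if (!Number.isNaN(seconds) && seconds >= 0) {
    return seconds * 1000; // delta-seconds form
  }
  const dateMs = Date.parse(headerValue);
  if (!Number.isNaN(dateMs)) {
    return Math.max(0, dateMs - nowMs); // HTTP-date form, clamped at zero
  }
  return null;
}

console.log(parseRetryAfter('120')); // 120000
```

The result can be fed straight into something like the blockDomain method above: a null return (malformed header) is best treated as a signal to apply your own conservative backoff rather than to ignore the response.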
Legitimate crawlers identify themselves clearly. This transparency serves multiple purposes: site operators can recognize your traffic in their logs, contact you when something goes wrong, and choose to allow or restrict your crawler deliberately instead of blocking it as an unknown bot.
User-Agent best practices:
```typescript
// ============================================================
// GOOD User-Agent Examples
// ============================================================

// Googlebot (the gold standard)
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

// Bingbot
"Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

// Your custom crawler (recommended format)
"MyCrawler/1.0 (+https://mycrawler.example.com/about; contact@example.com)"

// With more detail
"AcmeSearchBot/2.5.1 (Linux; +https://acme.com/searchbot; crawl-admin@acme.com)"

// ============================================================
// BAD User-Agent Examples (DON'T DO THIS)
// ============================================================

// Spoofing a browser (hostile, possibly illegal)
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."

// No identification
"curl/7.68.0"

// Generic/meaningless
"python-requests/2.25.1"
"Java/1.8.0_201"

// No contact info
"SomeCrawler/1.0"

// ============================================================
// HTTP Request Headers for Identification
// ============================================================

const crawlerHeaders = {
  'User-Agent': 'AcmeCrawler/1.0 (+https://acme.com/crawler; crawler@acme.com)',
  'From': 'crawler@acme.com', // RFC 7231 header for bot contact
  'Accept': 'text/html,application/xhtml+xml',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  // Optional: indicate crawl purpose
  'X-Crawler-Purpose': 'search-indexing'
};
```

Making your crawler appear to be a browser (User-Agent spoofing) is considered hostile behavior. It's often used to bypass bot detection, which makes it look like your crawler is trying to evade scrutiny. Depending on jurisdiction and terms of service, this may be illegal. Always identify your crawler honestly.
Even with perfect politeness, your crawler may be blocked. Blocks can be intentional (site policy) or mistaken (you're collateral damage from anti-bot measures). How you respond matters.
Types of blocks and appropriate responses:
| Block Type | How to Detect | Appropriate Response |
|---|---|---|
| robots.txt Disallow | Path matches a Disallow rule for your User-agent | Respect it completely. Do not crawl. |
| HTTP 403 Forbidden | Status code 403 on crawl attempt | Stop crawling this URL. Log for analysis. |
| HTTP 401 Unauthorized | Status code 401, requires authentication | Skip. Crawler shouldn't access protected content. |
| CAPTCHA Challenge | Response contains CAPTCHA HTML patterns | Stop crawling domain temporarily. Do NOT solve CAPTCHAs. |
| IP Block (connection refused) | Connection timeout or TCP RST | Pause all crawling to that IP. May need IP rotation. |
| Soft Block (fake content) | Valid response but with bot-trap content | Detect via content analysis. Reduce rate significantly. |
| Honeypot Links | Links visible only to crawlers, trap URLs | Detect patterns. Avoid following suspicious links. |
The best response to blocks is avoiding them in the first place. Reasonable rate limits, robots.txt compliance, and clear identification prevent most blocks. If you're being frequently blocked, audit your crawler's behavior before looking for workarounds.
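The table above can be approximated in code as a small response classifier. This is an illustrative sketch (the type and function names are our own, and the CAPTCHA regex is a deliberately crude placeholder; real detection needs more signals):

```typescript
// Map a crawl outcome to one of the block responses in the table above.
type BlockAction = 'stop-url' | 'skip' | 'pause-domain' | 'continue';

function classifyResponse(statusCode: number, body: string): BlockAction {
  if (statusCode === 403) return 'stop-url';   // Forbidden: stop this URL, log it
  if (statusCode === 401) return 'skip';       // Auth required: skip protected content
  // Crude CAPTCHA heuristic -- a stand-in for real content analysis
  if (/captcha|are you a robot/i.test(body)) return 'pause-domain';
  return 'continue';
}

console.log(classifyResponse(403, ''));                       // stop-url
console.log(classifyResponse(200, '<div class="captcha">'));  // pause-domain
```

Connection-level blocks (timeouts, TCP RST) never reach this function; they have to be caught in the fetch layer and handled per-IP rather than per-URL.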
In a distributed crawler with many worker nodes, politeness becomes a coordination challenge. You must ensure that the aggregate rate across all workers respects limits, not just individual worker rates.
The distributed politeness problem: each worker may respect per-domain limits locally, yet if several workers crawl the same domain, the aggregate rate multiplies. N workers each waiting one second between requests can still hit a single domain N times per second.
Solutions: a centralized rate-limiting service (for example, token buckets in a shared store such as Redis), per-domain distributed locks, or, simplest of all, domain affinity: partition domains so that each one is crawled by exactly one worker.
Recommended architecture for distributed politeness:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED POLITENESS WITH DOMAIN AFFINITY │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ URL FRONTIER (Partitioned by Domain) │ │
│ │ │ │
│ │ Partition 0 Partition 1 Partition 2 │ │
│ │ ───────────── ───────────── ───────────── │ │
│ │ example.com ──┐ google.com ───┐ amazon.com ───┐ │ │
│ │ github.com ───┼──▶ facebook.com ─┼──▶ netflix.com ──┼──▶ │ │
│ │ ... │ ... │ ... │ │ │
│ │ │ │ │ │ │
│ └─────────────────┼────────────────────┼────────────────────┼──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ WORKER NODES │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ Worker 0 │ │ Worker 1 │ │ Worker 2 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Owns: │ │ Owns: │ │ Owns: │ │ │
│ │ │ - Partition 0 │ │ - Partition 1 │ │ - Partition 2 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ Local Rate │ │ Local Rate │ │ Local Rate │ │ │
│ │ │ Limiter per │ │ Limiter per │ │ Limiter per │ │ │
│ │ │ owned domain │ │ owned domain │ │ owned domain │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └────────────────┘ └────────────────┘ └────────────────┘ │ │
│ │ │ │
│ │ Key Insight: Each domain is crawled by exactly ONE worker. │ │
│ │ No cross-worker coordination needed for rate limiting! │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SHARED SERVICES │ │
│ │ │ │
│ │ ┌─────────────────────────┐ ┌────────────────────────────────────┐ │ │
│ │ │ robots.txt Cache │ │ DNS Cache │ │ │
│ │ │ (shared to avoid │ │ (shared to reduce DNS load) │ │ │
│ │ │ redundant fetches) │ │ │ │ │
│ │ └─────────────────────────┘ └────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Domain affinity is the recommended approach for most crawlers because it eliminates distributed coordination overhead while naturally achieving politeness. The key insight: if only Worker 0 ever crawls example.com, then Worker 0's local rate limiter is sufficient.
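The partitioning step itself can be as simple as a stable hash of the domain name. A minimal sketch (the function name is ours; FNV-1a is one arbitrary choice of hash, and any stable hash works):

```typescript
// Domain affinity: hash each domain to exactly one worker index, so that
// worker's local rate limiter is sufficient. Uses FNV-1a over the domain
// string, folded into [0, numWorkers).
function ownerWorker(domain: string, numWorkers: number): number {
  let hash = 2166136261; // FNV offset basis
  for (let i = 0; i < domain.length; i++) {
    hash ^= domain.charCodeAt(i);
    hash = Math.imul(hash, 16777619); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % numWorkers; // force unsigned before modulo
}
```

Every call with the same domain and worker count returns the same index, so `example.com` is always routed to one worker and that worker's local limiter sees all traffic to it. The trade-off: if the worker pool is resized, domains reshuffle, so production systems often layer consistent hashing on top to limit movement.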
We've covered the comprehensive framework for crawler politeness—the principles and mechanisms that enable sustainable web crawling at scale.
What's Next:
The next page explores Duplicate Detection—how to identify and avoid wasting resources on duplicate content that appears under different URLs. This includes content fingerprinting, near-duplicate detection, and the probabilistic data structures that make it feasible at scale.
You now understand how to build a polite crawler that can operate sustainably at scale. Politeness isn't a constraint on effectiveness—it's the foundation that enables long-term operation. A polite crawler has access to more of the web, runs more efficiently, and avoids legal and reputational trouble.