If a web crawler were a human explorer, the URL frontier would be their to-do list—an ever-growing, constantly shuffled inventory of places yet to visit. But unlike a human's notepad, this list contains billions of entries, requires intelligent prioritization, must avoid duplicates, and needs to coordinate access across hundreds of distributed workers.
The URL frontier is arguably the most critical component of a web crawler's architecture. Get it wrong, and your crawler will waste resources on low-value pages while high-priority content grows stale. Get it right, and your crawler becomes a precision instrument—systematically covering the web while respecting resource constraints.
This page explores the data structures, algorithms, and architectural patterns that make large-scale URL frontier management possible.
By the end of this page, you will understand: (1) The functional requirements of a URL frontier, (2) Data structures for efficient URL storage and retrieval, (3) Priority scheduling algorithms, (4) URL normalization and deduplication techniques, (5) Distributed frontier architectures, and (6) How to balance politeness with throughput.
The URL frontier (also called the crawl frontier or URL queue) is the data structure that stores URLs discovered by the crawler but not yet visited. It serves as the central coordination point between URL discovery and URL fetching.
Core responsibilities of the URL frontier:
| Operation | Description | Target Complexity | Scale |
|---|---|---|---|
| Insert URL | Add a newly discovered URL to the frontier | O(log n) amortized | Billions of inserts/day |
| Get Next URL | Retrieve the highest-priority URL for crawling | O(log n) | Thousands/second |
| Check Duplicate | Verify if URL already exists in frontier or has been crawled | O(1) expected | Billions of lookups/day |
| Mark Complete | Record that a URL has been crawled | O(1) | Thousands/second |
| Get Domain URLs | Retrieve all pending URLs for a specific domain | O(k) where k = domain URLs | Per-domain operations |
| Update Priority | Modify the priority of an existing URL | O(log n) | Rare but needed |
The frontier is not a simple queue. With 50+ billion known URLs on the web, the frontier must handle billions of entries while supporting high-throughput concurrent access from hundreds of workers. In-memory solutions cannot scale; disk-based solutions must be carefully designed for I/O efficiency.
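To make these responsibilities concrete, the sketch below expresses the table's operations as a TypeScript interface. The names (FrontierStore, FrontierEntry) and method shapes are illustrative assumptions, not a standard API.

```typescript
// A minimal sketch of the frontier's public interface, mirroring the
// operations table above. Names and shapes are illustrative only.
interface FrontierEntry {
  url: string;          // Normalized URL
  domain: string;       // Used for per-domain (back-queue) grouping
  priority: number;     // 0..1, higher = crawl sooner
  discoveredAt: Date;
}

interface FrontierStore {
  // Insert URL: add a newly discovered URL (returns false if duplicate)
  insert(entry: FrontierEntry): Promise<boolean>;

  // Get Next URL: highest-priority URL whose domain is ready to be crawled
  next(): Promise<FrontierEntry | null>;

  // Check Duplicate: has this URL been queued or crawled before?
  hasSeen(url: string): Promise<boolean>;

  // Mark Complete: record a finished crawl so recrawl scheduling can begin
  markComplete(url: string, fetchedAt: Date): Promise<void>;

  // Get Domain URLs: all pending URLs for one domain (back-queue contents)
  pendingForDomain(domain: string): Promise<FrontierEntry[]>;

  // Update Priority: rare, but needed when new signals arrive
  updatePriority(url: string, priority: number): Promise<void>;
}
```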
The most effective approach to URL frontier design is a two-level architecture that separates concerns:
Front Queue (Priority-Based Selector): a compact structure with one entry per domain that decides which domain should be crawled next, based on the domain's priority and whether it is ready (its politeness delay has elapsed).
Back Queues (Per-Domain FIFO Queues): one queue per domain holding that domain's pending URLs in discovery order, so requests to a single host are naturally serialized and politeness can be enforced locally.
Visual representation of the two-level architecture:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ TWO-LEVEL FRONTIER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ FRONT QUEUE (Priority Selector) │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Priority Heap / Sorted Set │ │ │
│ │ │ │ │ │
│ │ │ ┌──────────────────────────────────────────────────────────┐ │ │ │
│ │ │ │ Domain: wikipedia.org │ Priority: 0.95 │ Ready: Yes │ │ │ │
│ │ │ ├──────────────────────────────────────────────────────────┤ │ │ │
│ │ │ │ Domain: nytimes.com │ Priority: 0.92 │ Ready: Yes │ │ │ │
│ │ │ ├──────────────────────────────────────────────────────────┤ │ │ │
│ │ │ │ Domain: amazon.com │ Priority: 0.88 │ Ready: No │ │ │ │
│ │ │ ├──────────────────────────────────────────────────────────┤ │ │ │
│ │ │ │ Domain: random-blog.net │ Priority: 0.23 │ Ready: Yes │ │ │ │
│ │ │ └──────────────────────────────────────────────────────────┘ │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Select highest-priority READY domain │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ BACK QUEUES (Per-Domain FIFO) │ │
│ │ │ │
│ │ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ │
│ │ │ wikipedia.org │ │ nytimes.com │ │ amazon.com │ ... │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ /wiki/PageA │ │ /2024/news/1 │ │ /product/123 │ │ │
│ │ │ /wiki/PageB │ │ /2024/news/2 │ │ /product/456 │ │ │
│ │ │ /wiki/PageC │ │ /2024/news/3 │ │ /product/789 │ │ │
│ │ │ /wiki/PageD │ │ /sports/game │ │ /category/abc │ │ │
│ │ │ ... │ │ ... │ │ ... │ │ │
│ │ │ (1.2M URLs) │ │ (450K URLs) │ │ (2.1M URLs) │ │ │
│ │ └───────┬────────┘ └───────┬────────┘ └───────┬────────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ Pop one URL Pop one URL Pop one URL │ │
│ │ when selected when selected when selected │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
How the two levels interact: the front queue selects the highest-priority domain that is ready (its politeness delay has elapsed), one URL is popped from that domain's back queue and handed to a worker, and the domain is marked not-ready until the delay passes again. Newly discovered URLs are appended to their domain's back queue, and a domain stays in the front queue as long as it has pending URLs.
A single global priority queue of all URLs would be prohibitively expensive (billions of entries) and wouldn't naturally support per-domain politeness. The two-level design separates concerns: the front queue handles cross-domain scheduling (relatively small—millions of domains), while back queues handle intra-domain ordering (large but sequential access). This separation enables practical implementation at scale.
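The sketch below shows one way the two levels might fit together. It assumes in-memory maps, a fixed per-domain politeness delay, and a linear scan instead of a real priority heap; a production frontier would be disk-backed and far more careful about concurrency.

```typescript
// Simplified in-memory sketch of the two-level frontier.
interface DomainEntry {
  domain: string;
  priority: number;   // Domain-level priority, 0..1
  notBefore: number;  // Earliest timestamp (ms) this domain may be crawled again
}

class TwoLevelFrontier {
  private frontQueue = new Map<string, DomainEntry>(); // One entry per domain
  private backQueues = new Map<string, string[]>();    // domain -> FIFO of URLs

  constructor(private politenessDelayMs: number = 2000) {}

  addURL(url: string, domain: string, priority: number): void {
    if (!this.backQueues.has(domain)) {
      this.backQueues.set(domain, []);
      this.frontQueue.set(domain, { domain, priority, notBefore: 0 });
    }
    this.backQueues.get(domain)!.push(url); // Intra-domain order is FIFO
  }

  // Pick the highest-priority READY domain, then pop one URL from its back queue.
  nextURL(now: number = Date.now()): string | null {
    let best: DomainEntry | null = null;
    for (const entry of this.frontQueue.values()) {
      const queue = this.backQueues.get(entry.domain);
      const ready = entry.notBefore <= now && queue !== undefined && queue.length > 0;
      if (ready && (best === null || entry.priority > best.priority)) {
        best = entry;
      }
    }
    if (!best) return null; // Nothing ready right now

    const url = this.backQueues.get(best.domain)!.shift()!;
    best.notBefore = now + this.politenessDelayMs; // Domain not ready until delay passes
    return url;
  }
}
```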
With more URLs than you can ever crawl, prioritization is everything. The goal is to maximize the value extracted from each crawl request. Priority can be calculated at multiple levels:
Domain-Level Priority: how valuable is the site as a whole? Signals include domain authority (link-based rank), traffic rank, and the historical quality of content crawled from the domain.
Page-Level Priority: how valuable is this specific URL? Signals include inbound link count, URL depth, URL pattern, and how the URL was discovered (sitemap vs. site navigation vs. a random link).
Freshness Priority: how urgently does this URL need to be (re)crawled? Signals include time since the last crawl, the predicted change rate, and the observed volatility of the content.
URL structure itself is a useful proxy: /products is likely more important than /products/subcategory/item/variant/color, and path segments like /article/, /blog/, or /news/ suggest content pages, while /login, /cart, and /checkout suggest transactional pages with little indexable value.
```typescript
// Simplified priority calculation for a URL
interface URLPriorityFactors {
  // Domain factors (0-1 scale)
  domainAuthority: number;      // From external ranking or PageRank
  domainTrafficRank: number;    // Inverse of traffic rank (higher = more traffic)
  domainQualityScore: number;   // Historical content quality

  // Page factors (0-1 scale)
  inboundLinkScore: number;     // Normalized inlink count
  urlDepthScore: number;        // Inverse of path depth
  contentTypeScore: number;     // Based on URL pattern analysis
  discoverySourceScore: number; // Sitemap > navigation > random link

  // Freshness factors (0-1 scale)
  timeSinceLastCrawl: number;   // Normalized time since last crawl
  predictedChangeRate: number;  // Learned from historical changes
  contentVolatility: number;    // How often content changes
}

function calculatePriority(factors: URLPriorityFactors): number {
  // Weights for each factor (tunable hyperparameters)
  const weights = { domain: 0.3, page: 0.4, freshness: 0.3 };

  // Domain score
  const domainScore = (
    factors.domainAuthority * 0.5 +
    factors.domainTrafficRank * 0.3 +
    factors.domainQualityScore * 0.2
  );

  // Page score
  const pageScore = (
    factors.inboundLinkScore * 0.3 +
    factors.urlDepthScore * 0.25 +
    factors.contentTypeScore * 0.25 +
    factors.discoverySourceScore * 0.2
  );

  // Freshness score (urgency of recrawl)
  const freshnessScore = (
    factors.timeSinceLastCrawl * 0.4 +
    factors.predictedChangeRate * 0.35 +
    factors.contentVolatility * 0.25
  );

  // Combined priority
  const priority = (
    weights.domain * domainScore +
    weights.page * pageScore +
    weights.freshness * freshnessScore
  );

  return Math.max(0, Math.min(1, priority)); // Clamp to [0, 1]
}

// Example usage
const examplePriority = calculatePriority({
  domainAuthority: 0.92,       // High authority domain (e.g., wikipedia.org)
  domainTrafficRank: 0.95,     // Very high traffic
  domainQualityScore: 0.88,    // Good historical quality
  inboundLinkScore: 0.75,      // Many internal links
  urlDepthScore: 0.90,         // Close to root (/wiki/PageName)
  contentTypeScore: 0.85,      // Looks like content page
  discoverySourceScore: 0.80,  // Found in navigation
  timeSinceLastCrawl: 0.60,    // Moderate time since last crawl
  predictedChangeRate: 0.40,   // Wikipedia pages change occasionally
  contentVolatility: 0.35      // Content relatively stable
});
// Result: ~0.75 (high priority URL)
```

Priority calculation is necessarily heuristic. You're predicting page value before visiting it, using signals that are proxies for true importance. The weights above are illustrative—production systems tune these based on downstream metrics (search quality, content freshness, user engagement). Some crawlers use machine learning models trained on historical crawl outcomes to predict priority.
The same web page can be referenced by many different URL representations. Without URL normalization, the crawler will waste resources fetching duplicate content and bloat the frontier with redundant entries.
Examples of URLs that reference the same content:
https://example.com/page (canonical)
https://example.com/page/ (trailing slash)
https://EXAMPLE.COM/page (uppercase host)
https://example.com/page? (empty query string)
https://example.com/page?a=1&b=2 (query parameters)
https://example.com/page?b=2&a=1 (reordered parameters)
https://example.com/page#section (fragment)
https://example.com/./page (dot segment)
https://example.com/foo/../page (parent reference)
http://example.com:80/page (default port)
https://example.com:443/page (default port for HTTPS)
https://example.com/page?utm_source=x (tracking parameters)
Normalization transforms all variations to a canonical form:
- Lowercase the scheme and host: HTTPS://Example.COM/ → https://example.com/
- Remove default ports: https://example.com:443/ → https://example.com/
- Remove fragments: https://example.com/page#section → https://example.com/page
- Decode percent-encoded unreserved characters: https://example.com/%7Euser → https://example.com/~user
- Resolve dot segments: https://example.com/a/b/../c → https://example.com/a/c
- Normalize empty paths: https://example.com → https://example.com/
- Remove tracking parameters: utm_source, utm_medium, fbclid, gclid, etc.
- Remove session parameters: PHPSESSID, JSESSIONID, sid, etc.
- Sort query parameters: ?b=2&a=1 → ?a=1&b=2 (consistent ordering)
- Remove empty query strings: https://example.com/? → https://example.com/
- If the page declares a canonical URL via <link rel="canonical">, use that URL instead
```typescript
class URLNormalizer {
  // Known tracking parameters to remove
  private static TRACKING_PARAMS = new Set([
    'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
    'fbclid', 'gclid', 'dclid', 'msclkid', 'mc_eid', 'ref', 'source'
  ]);

  // Session parameters to remove
  private static SESSION_PARAMS = new Set([
    'PHPSESSID', 'JSESSIONID', 'ASPSESSIONID', 'sid', 'session_id', 'sessionid'
  ]);

  public static normalize(urlString: string): string | null {
    try {
      const url = new URL(urlString);

      // 1. Lowercase scheme and host
      url.protocol = url.protocol.toLowerCase();
      url.hostname = url.hostname.toLowerCase();

      // 2. Remove default ports
      if (
        (url.protocol === 'http:' && url.port === '80') ||
        (url.protocol === 'https:' && url.port === '443')
      ) {
        url.port = '';
      }

      // 3. Remove fragment
      url.hash = '';

      // 4. Remove dot segments (handled by URL constructor)
      // The URL constructor already resolves . and .. segments

      // 5. Process path
      let path = url.pathname;

      // Decode percent-encoded unreserved characters
      path = this.decodeUnreserved(path);

      // Remove trailing slash for non-root paths (choose one convention)
      if (path.length > 1 && path.endsWith('/')) {
        path = path.slice(0, -1);
      }

      url.pathname = path;

      // 6. Process query parameters
      if (url.search) {
        const params = new URLSearchParams(url.search);
        const cleanParams = new URLSearchParams();

        // Filter and sort parameters
        const sortedKeys = Array.from(params.keys())
          .filter(key => !this.TRACKING_PARAMS.has(key.toLowerCase()))
          .filter(key => !this.SESSION_PARAMS.has(key))
          .sort();

        for (const key of sortedKeys) {
          const value = params.get(key);
          if (value !== null && value !== '') {
            cleanParams.set(key, value);
          }
        }

        url.search = cleanParams.toString() ? '?' + cleanParams.toString() : '';
      }

      return url.toString();
    } catch (e) {
      // Invalid URL
      return null;
    }
  }

  private static decodeUnreserved(str: string): string {
    // Unreserved characters per RFC 3986: A-Z a-z 0-9 - . _ ~
    return str.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
      const char = String.fromCharCode(parseInt(hex, 16));
      if (/[A-Za-z0-9\-._~]/.test(char)) {
        return char;
      }
      return match.toUpperCase(); // Normalize encoding to uppercase
    });
  }

  // Normalize relative URL against a base URL
  public static resolveAndNormalize(base: string, relative: string): string | null {
    try {
      const resolved = new URL(relative, base);
      return this.normalize(resolved.toString());
    } catch (e) {
      return null;
    }
  }
}

// Example usage
console.log(URLNormalizer.normalize('HTTPS://Example.COM:443/Page?b=2&a=1&utm_source=google#section'));
// Output: https://example.com/Page?a=1&b=2 (scheme and host lowercased; path case is preserved)
```

Aggressive normalization assumes certain parameters don't affect content. This is usually true for tracking parameters but may not be for all sites. Some sites use query parameters for content selection (e.g., ?page=2). Removing these would break pagination. Production normalizers often maintain domain-specific rules for edge cases.
After normalization, we still need to check whether a URL has already been seen—either currently in the frontier or previously crawled. With billions of URLs, this deduplication check must be extremely efficient.
The three states of a URL: never seen (unknown to the crawler), in the frontier (discovered and queued but not yet fetched), and crawled (fetched at least once and possibly awaiting a future recrawl).
We need data structures that can check membership across billions of entries with minimal false positives and reasonable memory usage.
Bloom Filter Deep Dive:
A Bloom filter is a probabilistic data structure that answers the question "have I seen this element before?" It can give one of two answers: "definitely not" (no false negatives) or "possibly yes" (with some probability of a false positive).
The false positive rate is tunable: more bits per element = lower false positive rate.
Bloom Filter Sizing:
- n = number of elements (URLs)
- m = number of bits in filter
- k = number of hash functions
- p = false positive probability
Optimal k = (m/n) × ln(2) ≈ 0.693 × (m/n)
For p = 1%: m/n ≈ 9.6 bits per URL
For p = 0.1%: m/n ≈ 14.4 bits per URL
Example for 50 billion URLs with 1% FP rate:
m = 50B × 10 bits = 500 billion bits = 62.5 GB
k = 7 hash functions
This fits in a single high-memory server or can be partitioned.
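As a quick check of those numbers, the snippet below plugs n = 50 billion and p = 1% into the standard sizing formulas (m = -n · ln p / (ln 2)², k = (m/n) · ln 2). The exact result is about 9.6 bits per URL, roughly 60 GB; the 62.5 GB figure above comes from rounding up to 10 bits per URL.

```typescript
// Bloom filter sizing from the standard formulas.
function bloomSizing(n: number, p: number) {
  const m = Math.ceil(-n * Math.log(p) / (Math.log(2) ** 2)); // Total bits
  const bitsPerItem = m / n;
  const k = Math.round(bitsPerItem * Math.log(2));            // Hash functions
  const gigabytes = m / 8 / 1e9;
  return { m, bitsPerItem, k, gigabytes };
}

const sizing = bloomSizing(50e9, 0.01);
console.log(sizing.bitsPerItem.toFixed(1)); // ~9.6 bits per URL
console.log(sizing.k);                      // 7 hash functions
console.log(sizing.gigabytes.toFixed(0));   // ~60 GB (62.5 GB when rounded to 10 bits/URL)
```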
```typescript
class BloomFilter {
  private bits: BitArray;     // BitArray is an assumed bit-vector helper (not shown)
  private numHashes: number;
  private size: number;

  constructor(expectedItems: number, falsePositiveRate: number) {
    // Calculate optimal size
    this.size = Math.ceil(
      -expectedItems * Math.log(falsePositiveRate) / (Math.log(2) ** 2)
    );
    this.numHashes = Math.ceil(
      (this.size / expectedItems) * Math.log(2)
    );
    this.bits = new BitArray(this.size);
  }

  // Add a URL to the filter
  add(url: string): void {
    const hashes = this.getHashes(url);
    for (const hash of hashes) {
      this.bits.set(hash % this.size);
    }
  }

  // Check if URL might be in the filter
  // Returns: false = definitely not seen, true = possibly seen
  mightContain(url: string): boolean {
    const hashes = this.getHashes(url);
    for (const hash of hashes) {
      if (!this.bits.get(hash % this.size)) {
        return false; // Definitely not in set
      }
    }
    return true; // Possibly in set (might be false positive)
  }

  // Generate k hash values for a URL
  // Using double hashing: h(i) = h1 + i*h2
  private getHashes(url: string): number[] {
    const h1 = this.hash1(url);
    const h2 = this.hash2(url);
    const hashes: number[] = [];
    for (let i = 0; i < this.numHashes; i++) {
      hashes.push(Math.abs(h1 + i * h2));
    }
    return hashes;
  }

  private hash1(str: string): number {
    // MurmurHash3 or similar (assumed helper, not shown)
    return murmurHash3(str, 0);
  }

  private hash2(str: string): number {
    // Different seed for independence
    return murmurHash3(str, 0x9747b28c);
  }
}

// Usage in frontier
class URLFrontier {
  private bloomFilter: BloomFilter;
  private exactStore: URLDatabase; // Exact store for recrawl tracking (assumed, not shown)

  constructor() {
    // 50 billion URLs, 1% false positive rate
    this.bloomFilter = new BloomFilter(50_000_000_000, 0.01);
  }

  addURL(url: string, priority: number): boolean {
    const normalized = URLNormalizer.normalize(url);
    if (!normalized) return false;

    // Quick check with Bloom filter
    if (this.bloomFilter.mightContain(normalized)) {
      // Might be duplicate, do exact check for recrawl scheduling
      if (this.exactStore.exists(normalized)) {
        return false; // True duplicate
      }
    }

    // Not seen before, add to filter and queue
    this.bloomFilter.add(normalized);
    this.enqueue(normalized, priority); // Hand off to the two-level queues (not shown)
    return true;
  }
}
```

Production crawlers often use layered deduplication: (1) a fast in-memory Bloom filter for initial screening, (2) a distributed hash table or database for exact checking when the Bloom filter returns "possibly seen", and (3) a separate tracking store for crawled URLs that need recrawl scheduling. This balances speed (Bloom filter), correctness (exact store), and functionality (recrawl tracking).
A single-machine frontier cannot scale to billions of URLs. Production web crawlers distribute the frontier across multiple nodes, with careful partitioning to maintain politeness guarantees.
Partitioning Strategies: partition by domain. Each incoming URL is routed by hashing its domain (hash(domain) % N), so every URL for a given domain lands on the same partition. This keeps politeness enforcement and deduplication local to one node; partitioning by full URL hash would spread a domain across nodes and force cross-node coordination.
Distributed Frontier Architecture Diagram:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED URL FRONTIER │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ URL ROUTER │ │
│ │ │ │
│ │ Incoming URL → hash(domain) % N → Route to Partition │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 │ │ Partition N │ │
│ │ │ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Front Q │ │ │ │Front Q │ │ │ │Front Q │ │ │ │Front Q │ │ │
│ │ │(domains) │ │ │ │(domains) │ │ │ │(domains) │ │ │ │(domains) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │ │
│ │ ▼ │ │ ▼ │ │ ▼ │ │ ▼ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Back Qs │ │ │ │Back Qs │ │ │ │Back Qs │ │ │ │Back Qs │ │ │
│ │ │(per-dom) │ │ │ │(per-dom) │ │ │ │(per-dom) │ │ │ │(per-dom) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ │ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │Bloom Flt │ │ │ │Bloom Flt │ │ │ │Bloom Flt │ │ │ │Bloom Flt │ │ │
│ │ │(local) │ │ │ │(local) │ │ │ │(local) │ │ │ │(local) │ │ │
│ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ │ │ │ │ │ │ │ │
│ │ Domains: │ │ Domains: │ │ Domains: │ │ Domains: │ │
│ │ example.com │ │ google.com │ │ amazon.com │ │ news.ycombinator │ │
│ │ github.com │ │ facebook.com │ │ netflix.com │ │ reddit.com │ │
│ │ ... │ │ ... │ │ ... │ │ ... │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CRAWLER WORKERS │ │
│ │ │ │
│ │ Workers pull from partitions; each worker can access any partition │ │
│ │ Work stealing: idle workers take from busy partitions │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
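Below is a minimal sketch of the router's hashing step from the diagram. The FNV-1a hash and the use of the hostname as the domain key are illustrative choices; any stable hash works, and consistent hashing helps when partitions are added or removed.

```typescript
// Route a URL to a frontier partition by hashing its domain.
// FNV-1a is used only because it is easy to inline; the hostname stands in
// for the domain (a production router might reduce to the registrable domain).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // Unsigned 32-bit
}

function partitionFor(url: string, numPartitions: number): number {
  const domain = new URL(url).hostname.toLowerCase();
  return fnv1a(domain) % numPartitions;
}

// All URLs from one domain land on the same partition:
console.log(partitionFor('https://en.wikipedia.org/wiki/A', 16));
console.log(partitionFor('https://en.wikipedia.org/wiki/B', 16)); // Same partition as above
```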
Key design considerations:
Partition-Local Politeness: Since all URLs for a domain are in one partition, that partition enforces politeness for that domain without cross-node coordination.
Load Balancing: Workers can pull from any partition, naturally balancing load. Work-stealing algorithms help idle workers take work from busy partitions.
Partition-Local Bloom Filters: Each partition maintains its own Bloom filter for its assigned domains, reducing memory pressure and eliminating cross-node dedup checks.
Fault Tolerance: Partitions can be replicated for durability. If a partition node fails, its replica takes over. URLs in flight are requeued.
Large domains like Wikipedia or Amazon have millions of URLs. A single partition might be overwhelmed. Solutions include: (1) Sub-partitioning large domains by URL hash, (2) Dedicated workers for large domains, (3) Rate limiting within partitions to prevent starvation of other domains. The key is identifying hot spots through monitoring and applying targeted solutions.
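To make option (1) concrete, here is a sketch of sub-partitioning by URL hash, reusing the fnv1a helper from the routing sketch above. The hot-domain list and shard counts are assumptions for illustration.

```typescript
// Sub-partitioning hot domains: the routing key for a normal domain is the
// domain itself; for a known hot domain it is "domain#shard", where the shard
// comes from hashing the full URL. hotDomains and the shard counts are illustrative.
const hotDomains = new Map<string, number>([
  ['en.wikipedia.org', 8],  // Spread over 8 sub-queues
  ['www.amazon.com', 16],
]);

function routingKey(url: string): string {
  const domain = new URL(url).hostname.toLowerCase();
  const shards = hotDomains.get(domain);
  if (!shards) return domain;           // Normal domain: single key
  const shard = fnv1a(url) % shards;    // Hot domain: spread by URL hash
  return `${domain}#${shard}`;
}
```

Note that politeness must then be coordinated across a hot domain's shards (for example, by dividing the domain's rate budget among them); otherwise sharding would simply multiply the load on the origin server.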
The web is not static. Pages are created, updated, and deleted constantly. A production crawler must revisit pages to maintain fresh content. But with billions of pages and limited resources, we can't recrawl everything at the same frequency.
The recrawl problem: pages change at wildly different rates (a news homepage changes hourly, an old blog post may never change), crawl capacity is finite, and recrawling an unchanged page costs as much as crawling a new one. The scheduler must decide, per URL, when the next visit is worth the cost.
Recrawl scheduling strategies: a fixed global interval is simple but wasteful; grouping URLs into a few frequency tiers (hourly, daily, weekly) is better; a common and effective approach is adaptive scheduling, shown below, which adjusts each URL's interval based on its observed change history.
```typescript
interface URLCrawlHistory {
  url: string;
  lastCrawledAt: Date;
  lastModifiedAt: Date | null;
  contentHash: string;
  currentInterval: number;   // seconds
  changeHistory: boolean[];  // last N crawls: true = changed, false = unchanged
}

class AdaptiveRecrawlScheduler {
  // Interval bounds
  private static MIN_INTERVAL = 60 * 60;            // 1 hour
  private static MAX_INTERVAL = 30 * 24 * 60 * 60;  // 30 days

  // Adjustment factors
  private static INCREASE_FACTOR = 1.5;  // Slow down if unchanged
  private static DECREASE_FACTOR = 0.5;  // Speed up if changed

  // Change history window
  private static HISTORY_WINDOW = 10;

  /**
   * Calculate next recrawl time based on whether content changed
   */
  public updateSchedule(history: URLCrawlHistory, contentChanged: boolean): URLCrawlHistory {
    // Update change history
    history.changeHistory.push(contentChanged);
    if (history.changeHistory.length > AdaptiveRecrawlScheduler.HISTORY_WINDOW) {
      history.changeHistory.shift();
    }

    // Calculate change rate
    const changeRate = this.calculateChangeRate(history.changeHistory);

    // Adjust interval based on change rate
    let newInterval: number;

    if (contentChanged) {
      // Content changed: decrease interval (crawl more often)
      newInterval = history.currentInterval * AdaptiveRecrawlScheduler.DECREASE_FACTOR;
    } else {
      // Content unchanged: increase interval (crawl less often)
      newInterval = history.currentInterval * AdaptiveRecrawlScheduler.INCREASE_FACTOR;
    }

    // Apply additional adjustment based on recent change rate
    // High change rate → even shorter intervals
    // Low change rate → even longer intervals
    const rateAdjustment = 1 - (changeRate - 0.5); // Range: 0.5 to 1.5
    newInterval *= rateAdjustment;

    // Clamp to bounds
    newInterval = Math.max(
      AdaptiveRecrawlScheduler.MIN_INTERVAL,
      Math.min(AdaptiveRecrawlScheduler.MAX_INTERVAL, newInterval)
    );

    history.currentInterval = Math.round(newInterval);
    history.lastCrawledAt = new Date();

    return history;
  }

  /**
   * Calculate change rate from history
   * Returns value between 0 (never changes) and 1 (always changes)
   */
  private calculateChangeRate(history: boolean[]): number {
    if (history.length === 0) return 0.5; // Unknown, assume moderate
    const changes = history.filter(h => h).length;
    return changes / history.length;
  }

  /**
   * Get next recrawl time for a URL
   */
  public getNextCrawlTime(history: URLCrawlHistory): Date {
    return new Date(history.lastCrawledAt.getTime() + history.currentInterval * 1000);
  }

  /**
   * Check if URL is due for recrawl
   */
  public isDueForRecrawl(history: URLCrawlHistory): boolean {
    return new Date() >= this.getNextCrawlTime(history);
  }
}

// Example usage
const scheduler = new AdaptiveRecrawlScheduler();

// Simulate a news page that changes frequently
let newsPage: URLCrawlHistory = {
  url: 'https://news.site/homepage',
  lastCrawledAt: new Date(Date.now() - 3600000), // 1 hour ago
  lastModifiedAt: new Date(),
  contentHash: 'abc123',
  currentInterval: 86400, // Start at 1 day
  changeHistory: []
};

// Simulate multiple crawls with changes
for (let i = 0; i < 5; i++) {
  newsPage = scheduler.updateSchedule(newsPage, true); // Changed
  console.log(`After change ${i + 1}: interval = ${newsPage.currentInterval / 3600} hours`);
}
// Output: Interval converges toward MIN_INTERVAL (1 hour)
```

Use HTTP conditional requests (If-Modified-Since, If-None-Match) to check if a page has changed without downloading the full content. If the server returns 304 Not Modified, you know the page is unchanged without transferring bytes. This dramatically reduces bandwidth for recrawl checks—especially for static pages.
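A small sketch of that technique using the Fetch API: send the validators saved from the previous crawl and treat a 304 response as unchanged. The CrawlRecord shape is a hypothetical stand-in for whatever the crawler stored after the previous fetch.

```typescript
// Check whether a page changed without downloading the full body,
// using HTTP conditional request headers.
interface CrawlRecord {
  url: string;
  etag: string | null;          // From the previous response's ETag header
  lastModified: string | null;  // From the previous response's Last-Modified header
}

async function hasChanged(record: CrawlRecord): Promise<boolean> {
  const headers: Record<string, string> = {};
  if (record.etag) headers['If-None-Match'] = record.etag;
  if (record.lastModified) headers['If-Modified-Since'] = record.lastModified;

  const response = await fetch(record.url, { headers });

  if (response.status === 304) {
    return false; // Not Modified: content unchanged, no body transferred
  }
  // Any other success status means the server sent fresh content;
  // the caller would re-parse the body and update the stored validators here.
  return true;
}
```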
We've explored the URL frontier in depth: the component that determines what your crawler visits and when. The key insights: a two-level (front/back queue) design separates prioritization from politeness; priority is a heuristic blend of domain, page, and freshness signals; normalization and Bloom-filter deduplication keep billions of entries manageable; partitioning by domain distributes the frontier without breaking politeness guarantees; and adaptive recrawl scheduling keeps content fresh within a fixed crawl budget.
What's Next:
The next page explores Politeness Policies—how to respect the servers you crawl, implement robots.txt compliance, and avoid becoming an unwelcome presence on the web. Politeness is not just about being nice; it's a hard requirement that determines whether your crawler can operate sustainably.
You now understand how to design a URL frontier that can manage billions of URLs efficiently while supporting priority scheduling, deduplication, and distributed operation. The frontier is the brain of your crawler—everything else depends on getting this right.