The web is rife with duplicates. The same content appears under multiple URLs due to URL normalization failures, mirrors, syndication, content management systems, session IDs, tracking parameters, and countless other reasons. Studies estimate that 25-30% of web content is duplicated.
For a web crawler, duplicates are a serious problem:

- Wasted bandwidth and crawl budget fetching content the crawler already has
- Wasted storage and processing throughout the indexing pipeline
- Degraded search quality when the same result surfaces multiple times
Effective duplicate detection is essential for crawler efficiency. This page explores the techniques—from simple URL matching to sophisticated content fingerprinting—that enable detection at scale.
By the end of this page, you will understand: (1) The spectrum of duplicate types from URL to near-duplicate content, (2) Content fingerprinting using cryptographic and locality-sensitive hashing, (3) Shingling and MinHash for near-duplicate detection, (4) SimHash for similarity search at scale, and (5) Probabilistic data structures that make duplicate detection feasible for billions of pages.
Duplication exists on a spectrum from trivial URL variations to near-identical content. Understanding this spectrum is essential for choosing appropriate detection strategies.
The duplication spectrum:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DUPLICATION SPECTRUM │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXACT NEAR │
│ ◄──────────────────────────────────────────────────────────────────────────► │
│ │
│ URL Variants Same Content Boilerplate + Similar Slight │
│ (trivial) Different URLs Same Core Content Variations │
│ │
│ /page?a=1 Mirror sites Templates with Syndicated Minor edits │
│ /page?a=1& CDN copies same article articles Translations │
│ Canonical issues Different ads Scraped Reformatted │
│ │
│ ───────────────────────────────────────────────────────────────────────────── │
│ Detection Detection Detection Detection Detection │
│ Method: Method: Method: Method: Method: │
│ URL Normal- Content Hash Content Hash MinHash / MinHash / │
│ ization (MD5/SHA) (after clean) Shingling SimHash │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Duplication Type | Examples | Detection Approach | Complexity |
|---|---|---|---|
| URL Variants | Query parameter ordering, trailing slashes, case differences | URL normalization | Low |
| Exact Content Duplicates | Mirrors, CDN copies, www vs non-www | Cryptographic hash (MD5/SHA-256) | Low |
| Template Duplicates | Same article with different ads/headers | Content extraction + hashing | Medium |
| Near-Duplicates | Syndicated articles, minor edits | MinHash, SimHash, Shingling | High |
| Semantic Duplicates | Same information, different wording | ML embeddings, semantic similarity | Very High |
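URL variants, the cheapest type to catch, are handled before a page is ever fetched. The sketch below shows a minimal normalization routine; the tracking-parameter list is illustrative only, and production crawlers apply many more rules (canonical tags, known mirror mappings, per-site parameter policies):

```typescript
// Hypothetical list of tracking parameters to strip; real crawlers maintain far longer lists.
const TRACKING_PARAMS = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'fbclid']);

function normalizeUrl(raw: string): string {
  // The WHATWG URL parser already lowercases scheme/host and drops default ports
  const url = new URL(raw);
  url.hash = ''; // Fragments never reach the server

  // Drop tracking parameters, then sort the rest into a canonical order
  const params = [...url.searchParams.entries()]
    .filter(([key]) => !TRACKING_PARAMS.has(key))
    .sort(([a], [b]) => a.localeCompare(b));
  url.search = '';
  for (const [key, value] of params) url.searchParams.append(key, value);

  // Remove the trailing slash on non-root paths
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

console.log(normalizeUrl('HTTP://Example.com/page/?b=2&a=1&utm_source=x#top'));
// → http://example.com/page?a=1&b=2
```

Normalization like this collapses the "trivial" end of the spectrum; everything further right requires looking at content, not URLs.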
Most production crawlers focus on the first four types. Semantic duplicate detection (same meaning, different words) requires deep learning models and is typically handled in the search ranking layer, not the crawler. Our focus here is on practical, scalable detection for the crawl pipeline.
The simplest approach to detecting duplicate content is fingerprinting: computing a hash that uniquely identifies a page's content. If two pages have the same fingerprint, they're duplicates.
Fingerprinting workflow:

1. Fetch the page and extract its main content
2. Normalize the text (whitespace, case, volatile elements such as dates)
3. Hash the normalized text to a fixed-size fingerprint
4. Check the fingerprint against the set of previously seen fingerprints

Hash function requirements:

- Deterministic: identical content must always produce the identical fingerprint
- Collision-resistant: distinct content should essentially never share a fingerprint
- Fast: the hash is computed once per fetched page, so throughput matters
```typescript
import crypto from 'crypto';

class ContentFingerprinter {
  /**
   * Compute fingerprint for raw HTML content.
   * This catches exact duplicates but is sensitive to any change.
   */
  public hashRaw(content: string): string {
    return crypto.createHash('sha256').update(content).digest('hex');
  }

  /**
   * Compute fingerprint after basic normalization.
   * More robust to whitespace/encoding differences.
   */
  public hashNormalized(content: string): string {
    const normalized = this.normalizeContent(content);
    return crypto.createHash('sha256').update(normalized).digest('hex');
  }

  /**
   * Compute fingerprint of extracted main content.
   * Ignores headers, footers, ads, navigation - focuses on article body.
   */
  public hashMainContent(html: string): string {
    const mainContent = this.extractMainContent(html);
    const normalized = this.normalizeContent(mainContent);
    return crypto.createHash('sha256').update(normalized).digest('hex');
  }

  private normalizeContent(content: string): string {
    return content
      // Normalize whitespace
      .replace(/\s+/g, ' ')
      // Remove leading/trailing whitespace
      .trim()
      // Normalize case (optional, may be too aggressive)
      .toLowerCase()
      // Remove common variable elements
      .replace(/\d{4}-\d{2}-\d{2}/g, 'DATE') // Dates
      .replace(/\d{1,2}:\d{2}/g, 'TIME')     // Times
      .replace(/\b\d+\b/g, 'NUM');           // Numbers
  }

  private extractMainContent(html: string): string {
    // In production, use a library like readability-lxml or mercury-parser.
    // This is a simplified example using regex.

    // Remove script and style elements
    let content = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<!--[\s\S]*?-->/g, '');

    // Remove header, footer, nav, aside elements (common boilerplate)
    content = content
      .replace(/<header[^>]*>[\s\S]*?<\/header>/gi, '')
      .replace(/<footer[^>]*>[\s\S]*?<\/footer>/gi, '')
      .replace(/<nav[^>]*>[\s\S]*?<\/nav>/gi, '')
      .replace(/<aside[^>]*>[\s\S]*?<\/aside>/gi, '');

    // Extract text content
    content = content.replace(/<[^>]+>/g, ' ');
    return content;
  }
}

class DuplicateDetector {
  private seenHashes: Set<string> = new Set();
  private totalChecked = 0;
  private fingerprinter: ContentFingerprinter = new ContentFingerprinter();

  /**
   * Check whether the content's fingerprint has been seen before.
   * The url parameter identifies the page being checked; only the
   * content determines the verdict.
   */
  public checkDuplicate(
    url: string,
    content: string
  ): { isDuplicate: boolean; fingerprint: string } {
    this.totalChecked++;
    const fingerprint = this.fingerprinter.hashMainContent(content);
    if (this.seenHashes.has(fingerprint)) {
      return { isDuplicate: true, fingerprint };
    }
    this.seenHashes.add(fingerprint);
    return { isDuplicate: false, fingerprint };
  }

  /**
   * Get statistics
   */
  public getStats(): { totalSeen: number; uniqueCount: number } {
    return { totalSeen: this.totalChecked, uniqueCount: this.seenHashes.size };
  }
}

// Usage example
const detector = new DuplicateDetector();

const page1 = '<html><body><article>Breaking news about event X</article></body></html>';
const page2 = '<html><body><article>Breaking news about event X</article></body></html>'; // Same
const page3 = '<html><body><article>Different story about event Y</article></body></html>';

console.log(detector.checkDuplicate('url1', page1)); // { isDuplicate: false, ... }
console.log(detector.checkDuplicate('url2', page2)); // { isDuplicate: true, ... }
console.log(detector.checkDuplicate('url3', page3)); // { isDuplicate: false, ... }
```

The quality of duplicate detection depends heavily on content extraction. Poor extraction that includes boilerplate (headers, footers, ads) will cause false negatives: pages with the same article but different ads will appear as different. Use a robust extraction library such as Mozilla Readability or trafilatura for production systems.
Exact hashing fails for near-duplicates—pages that share most of their content but differ in minor ways (different ads, slight edits, reformatting). For these, we need similarity-preserving techniques.
Shingling (n-grams):
A shingle (or k-shingle) is a contiguous sequence of k tokens. By representing a document as the set of its shingles, we create a signature that captures its content in a way that's robust to minor changes.
Document: "The quick brown fox jumps"
3-shingles: {"The quick brown", "quick brown fox", "brown fox jumps"}
Jaccard Similarity:
Given two documents A and B represented as shingle sets:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
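A direct computation of this definition over two shingle sets (a small illustrative helper, reusing the shingles from the example above plus a one-word variant):

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|
function jaccard(a: Set<string>, b: Set<string>): number {
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}

const shinglesA = new Set(['The quick brown', 'quick brown fox', 'brown fox jumps']);
const shinglesB = new Set(['The quick brown', 'quick brown fox', 'brown fox leaps']);
console.log(jaccard(shinglesA, shinglesB)); // 2 shared / 4 total = 0.5
```

Note how one changed word alters every shingle that covers it, so small edits lower the similarity smoothly rather than breaking it entirely, which is exactly what exact hashing cannot do.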
The Problem with Direct Comparison:
Comparing shingle sets directly is expensive:

- Storage: a typical page yields thousands of shingles, each tens of bytes
- Comparison: computing |A ∩ B| means touching every shingle in both sets
- Scale: with n documents there are O(n²) pairs to examine
MinHash to the rescue:
MinHash is a technique that estimates Jaccard similarity using compact signatures. The key insight: if we apply a random hash function to each shingle and take the minimum hash value, the probability that two documents have the same minimum hash equals their Jaccard similarity.
Pr[min(h(A)) = min(h(B))] = Jaccard(A, B)
By using multiple hash functions and creating a signature of minimum hashes, we can estimate Jaccard with high accuracy using much less space.
```typescript
class MinHash {
  private numHashes: number;
  private hashFunctions: Array<(x: number) => number>;

  constructor(numHashes: number = 128) {
    this.numHashes = numHashes;
    this.hashFunctions = this.generateHashFunctions(numHashes);
  }

  /**
   * Generate word shingles from text
   */
  public shingle(text: string, k: number = 5): Set<string> {
    const words = text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
    const shingles = new Set<string>();
    for (let i = 0; i <= words.length - k; i++) {
      shingles.add(words.slice(i, i + k).join(' '));
    }
    return shingles;
  }

  /**
   * Compute MinHash signature for a set of shingles
   */
  public computeSignature(shingles: Set<string>): number[] {
    // Initialize signature with infinity
    const signature = new Array(this.numHashes).fill(Infinity);

    // Convert shingles to numeric hashes
    const shingleHashes = Array.from(shingles).map(s => this.stringHash(s));

    // For each hash function, find the minimum across all shingles
    for (let i = 0; i < this.numHashes; i++) {
      for (const shingleHash of shingleHashes) {
        const hashValue = this.hashFunctions[i](shingleHash);
        if (hashValue < signature[i]) {
          signature[i] = hashValue;
        }
      }
    }
    return signature;
  }

  /**
   * Estimate Jaccard similarity from two MinHash signatures
   */
  public estimateSimilarity(sig1: number[], sig2: number[]): number {
    if (sig1.length !== sig2.length) {
      throw new Error('Signatures must have same length');
    }
    let matches = 0;
    for (let i = 0; i < sig1.length; i++) {
      if (sig1[i] === sig2[i]) {
        matches++;
      }
    }
    return matches / sig1.length;
  }

  /**
   * Generate a family of hash functions of the form h(x) = (a*x + b) mod p.
   * Note: a*x can exceed Number.MAX_SAFE_INTEGER, so this only approximates
   * a universal hash family - fine for a sketch, but production code should
   * use integer-safe arithmetic.
   */
  private generateHashFunctions(n: number): Array<(x: number) => number> {
    const p = 2147483647; // Large prime (2^31 - 1)
    const functions: Array<(x: number) => number> = [];
    for (let i = 0; i < n; i++) {
      const a = Math.floor(Math.random() * (p - 1)) + 1;
      const b = Math.floor(Math.random() * p);
      functions.push((x: number) => (a * x + b) % p);
    }
    return functions;
  }

  /**
   * Simple string to number hash
   */
  private stringHash(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
}

// Usage example
const minhash = new MinHash(128);

const doc1 = "The quick brown fox jumps over the lazy dog";
const doc2 = "The quick brown fox leaps over the lazy dog"; // One word different
const doc3 = "A completely different document about cats";

const shingles1 = minhash.shingle(doc1, 3);
const shingles2 = minhash.shingle(doc2, 3);
const shingles3 = minhash.shingle(doc3, 3);

const sig1 = minhash.computeSignature(shingles1);
const sig2 = minhash.computeSignature(shingles2);
const sig3 = minhash.computeSignature(shingles3);

console.log('doc1 vs doc2:', minhash.estimateSimilarity(sig1, sig2)); // ~0.4 (true Jaccard is 4/10)
console.log('doc1 vs doc3:', minhash.estimateSimilarity(sig1, sig3)); // ~0.0
console.log('doc2 vs doc3:', minhash.estimateSimilarity(sig2, sig3)); // ~0.0
```

The number of hash functions (signature length) trades off space against accuracy. With 128 hashes, the standard error of the similarity estimate stays below ~5% while using just 128 integers (~512 bytes) per document. For billions of documents, this is vastly more efficient than storing all shingles.
MinHash gives us compact signatures that estimate similarity, but we still face the all-pairs comparison problem: with n documents, comparing every pair is O(n²)—infeasible for billions of pages.
Locality-Sensitive Hashing (LSH) solves this by hashing documents into buckets such that similar documents are likely to hash to the same bucket.
LSH for MinHash (Banding Technique):

1. Split each signature of length n into b bands of r rows each (n = b × r)
2. Hash each band into a bucket, using a separate hash table per band
3. Two documents whose signatures agree on all r rows of any band land in the same bucket and become a candidate pair
4. Verify candidate pairs by comparing their full signatures
Why it works:

For two documents with Jaccard similarity s:

Pr[all r rows of one band match] = s^r
Pr[a given band does not match] = 1 - s^r
Pr[no band matches] = (1 - s^r)^b
Pr[candidate pair] = 1 - (1 - s^r)^b
By tuning b and r, we create an S-curve that sharply separates similar from dissimilar pairs.
Example parameter choices for a 128-value signature (threshold ~(1/b)^(1/r)):

| Bands (b) | Rows (r) | Threshold (~) | P(Candidate) at 0.8 similarity | P(Candidate) at 0.3 similarity |
|---|---|---|---|---|
| 32 | 4 | ~0.42 | ~100% | ~23% |
| 16 | 8 | ~0.71 | ~95% | ~0.1% |
| 64 | 2 | ~0.13 | ~100% | ~99.8% |
| 8 | 16 | ~0.88 | ~20% | ~0% |
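For two documents with Jaccard similarity s, the probability of becoming a candidate pair under b bands of r rows is 1 - (1 - s^r)^b; a few lines make it easy to explore (b, r) settings:

```typescript
// P(two documents with Jaccard similarity s become an LSH candidate pair)
function candidateProbability(s: number, bands: number, rows: number): number {
  return 1 - Math.pow(1 - Math.pow(s, rows), bands);
}

// b = 16, r = 8 (a 128-hash signature)
console.log(candidateProbability(0.8, 16, 8).toFixed(3)); // ≈ 0.947
console.log(candidateProbability(0.3, 16, 8).toFixed(4)); // ≈ 0.0010
```

Plotting this function against s produces the S-curve: increasing r steepens the curve and pushes the threshold right, while increasing b pushes it left.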
```typescript
class LSHIndex {
  private bands: number;
  private rows: number;
  private buckets: Map<string, Set<string>>[]; // One bucket map per band

  constructor(signatureLength: number, bands: number) {
    if (signatureLength % bands !== 0) {
      throw new Error('Signature length must be divisible by bands');
    }
    this.bands = bands;
    this.rows = signatureLength / bands;

    // Initialize bucket maps for each band
    this.buckets = [];
    for (let i = 0; i < bands; i++) {
      this.buckets.push(new Map());
    }
  }

  /**
   * Add a document (by its signature) to the index
   */
  public add(docId: string, signature: number[]): void {
    for (let band = 0; band < this.bands; band++) {
      const bandSignature = this.getBandSignature(signature, band);
      const bucketKey = this.hashBand(bandSignature);
      if (!this.buckets[band].has(bucketKey)) {
        this.buckets[band].set(bucketKey, new Set());
      }
      this.buckets[band].get(bucketKey)!.add(docId);
    }
  }

  /**
   * Find candidate documents that might be similar to the query
   */
  public findCandidates(signature: number[]): Set<string> {
    const candidates = new Set<string>();
    for (let band = 0; band < this.bands; band++) {
      const bandSignature = this.getBandSignature(signature, band);
      const bucketKey = this.hashBand(bandSignature);
      const bucket = this.buckets[band].get(bucketKey);
      if (bucket) {
        for (const docId of bucket) {
          candidates.add(docId);
        }
      }
    }
    return candidates;
  }

  private getBandSignature(signature: number[], band: number): number[] {
    const start = band * this.rows;
    return signature.slice(start, start + this.rows);
  }

  private hashBand(bandSignature: number[]): string {
    // Simple hash: join values
    return bandSignature.join('-');
  }
}

class NearDuplicateDetector {
  private minhash: MinHash;
  private lshIndex: LSHIndex;
  private signatures: Map<string, number[]> = new Map();
  private similarityThreshold: number;

  constructor(
    numHashes: number = 128,
    bands: number = 16,
    similarityThreshold: number = 0.8
  ) {
    this.minhash = new MinHash(numHashes);
    this.lshIndex = new LSHIndex(numHashes, bands);
    this.similarityThreshold = similarityThreshold;
  }

  /**
   * Check if document is a near-duplicate of existing documents.
   * Returns the list of similar document IDs if duplicates are found.
   */
  public checkNearDuplicate(docId: string, content: string): {
    isDuplicate: boolean;
    similarDocs: Array<{ docId: string; similarity: number }>;
  } {
    // Compute signature
    const shingles = this.minhash.shingle(content, 5);
    const signature = this.minhash.computeSignature(shingles);

    // Find candidates via LSH
    const candidates = this.lshIndex.findCandidates(signature);

    // Verify candidates by computing actual similarity
    const similarDocs: Array<{ docId: string; similarity: number }> = [];
    for (const candidateId of candidates) {
      if (candidateId === docId) continue;
      const candidateSig = this.signatures.get(candidateId);
      if (!candidateSig) continue;
      const similarity = this.minhash.estimateSimilarity(signature, candidateSig);
      if (similarity >= this.similarityThreshold) {
        similarDocs.push({ docId: candidateId, similarity });
      }
    }

    // Sort by similarity descending
    similarDocs.sort((a, b) => b.similarity - a.similarity);

    // Add new document to index
    this.lshIndex.add(docId, signature);
    this.signatures.set(docId, signature);

    return { isDuplicate: similarDocs.length > 0, similarDocs };
  }

  /**
   * Get index statistics
   */
  public getStats(): { documentsIndexed: number } {
    return { documentsIndexed: this.signatures.size };
  }
}

// Usage example
const nearDupDetector = new NearDuplicateDetector(128, 16, 0.7);

const articles = [
  { id: 'article1', content: 'Breaking news: Major event happens in city. Officials respond to situation.' },
  { id: 'article2', content: 'Breaking news: Major event happens in city. Officials respond to the situation.' }, // Near-dup
  { id: 'article3', content: 'Sports update: Team wins championship after long season of competition.' },
  { id: 'article4', content: 'Breaking news: Major event occurs in city. Officials respond to situation.' }, // Near-dup of 1
];

for (const article of articles) {
  const result = nearDupDetector.checkNearDuplicate(article.id, article.content);
  console.log(`${article.id}: isDuplicate=${result.isDuplicate}`);
  if (result.similarDocs.length > 0) {
    console.log(`  Similar to: ${result.similarDocs.map(d => `${d.docId}(${(d.similarity * 100).toFixed(1)}%)`).join(', ')}`);
  }
}
```

LSH trades accuracy for speed. Some similar documents might not become candidates (false negatives), and some dissimilar documents might become candidates (false positives, caught in verification). The banding parameters let you tune this tradeoff. For crawlers, it's usually better to err on the side of more candidates (lower threshold) since verification is cheap compared to re-crawling duplicates.
SimHash is an alternative to MinHash, popularized by Google for web-scale duplicate detection. While MinHash estimates Jaccard similarity, SimHash produces a single hash where similar documents have similar hashes (differing in few bits).
SimHash Algorithm:

1. Extract weighted features from the document (words or n-grams, weighted by importance, e.g. TF-IDF)
2. Hash each feature to a b-bit value
3. Keep a vector v of b running totals; for each feature, add its weight to v[i] when bit i of its hash is 1, subtract it when the bit is 0
4. Set bit i of the fingerprint to 1 if v[i] > 0, else 0
The result is a b-bit fingerprint where similar documents differ in few bits.
Why SimHash?

- One compact b-bit fingerprint per document (64 bits is typical), versus 128+ integers for a MinHash signature
- Similarity is measured by Hamming distance, computable in a few machine instructions
- Fingerprints can be indexed for fast near-neighbor lookup by partitioning their bits
```typescript
class SimHash {
  private hashBits: number;

  constructor(hashBits: number = 64) {
    this.hashBits = hashBits;
  }

  /**
   * Compute SimHash fingerprint for a document
   */
  public compute(text: string): bigint {
    const features = this.extractFeatures(text);
    const weights = this.computeWeights(features);

    // Initialize vector with zeros
    const v = new Array(this.hashBits).fill(0);

    // For each feature
    for (const [feature, weight] of weights) {
      const featureHash = this.hashFeature(feature);
      // For each bit position
      for (let i = 0; i < this.hashBits; i++) {
        const bit = (featureHash >> BigInt(i)) & 1n;
        if (bit === 1n) {
          v[i] += weight;
        } else {
          v[i] -= weight;
        }
      }
    }

    // Create fingerprint: 1 if v[i] > 0, else 0
    let fingerprint = 0n;
    for (let i = 0; i < this.hashBits; i++) {
      if (v[i] > 0) {
        fingerprint |= (1n << BigInt(i));
      }
    }
    return fingerprint;
  }

  /**
   * Compute Hamming distance between two SimHash fingerprints
   */
  public hammingDistance(hash1: bigint, hash2: bigint): number {
    let xor = hash1 ^ hash2;
    let distance = 0;
    while (xor > 0n) {
      distance += Number(xor & 1n);
      xor >>= 1n;
    }
    return distance;
  }

  /**
   * Check if two documents are near-duplicates.
   * A threshold of 3 bits is common for 64-bit SimHash.
   */
  public areNearDuplicates(
    hash1: bigint,
    hash2: bigint,
    threshold: number = 3
  ): boolean {
    return this.hammingDistance(hash1, hash2) <= threshold;
  }

  /**
   * Extract features (word 3-grams) from text
   */
  private extractFeatures(text: string): string[] {
    const words = text.toLowerCase()
      .replace(/[^a-z0-9\s]/g, '')
      .split(/\s+/)
      .filter(w => w.length > 0);
    const features: string[] = [];
    for (let i = 0; i <= words.length - 3; i++) {
      features.push(words.slice(i, i + 3).join(' '));
    }
    return features;
  }

  /**
   * Compute weights for features (simplified: raw counts).
   * In production, use TF-IDF or similar.
   */
  private computeWeights(features: string[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const feature of features) {
      counts.set(feature, (counts.get(feature) || 0) + 1);
    }
    return counts;
  }

  /**
   * Hash a feature string to a 64-bit value
   * using FNV-1a (in production, use xxHash or MurmurHash)
   */
  private hashFeature(feature: string): bigint {
    let hash = 14695981039346656037n; // FNV offset basis
    const prime = 1099511628211n;     // FNV prime
    for (let i = 0; i < feature.length; i++) {
      hash ^= BigInt(feature.charCodeAt(i));
      hash = (hash * prime) & ((1n << 64n) - 1n); // Keep 64 bits
    }
    return hash;
  }
}

// Efficient lookup by splitting the hash into blocks
class SimHashIndex {
  private hashBits: number;
  private threshold: number;
  private tables: Map<bigint, string[]>[]; // One table per block of bits
  private simhash: SimHash;
  private hashes: Map<string, bigint> = new Map();

  constructor(hashBits: number = 64, threshold: number = 3) {
    this.hashBits = hashBits;
    this.threshold = threshold;
    this.simhash = new SimHash(hashBits);

    // Split hash into (threshold + 1) blocks: two hashes within the
    // threshold must agree exactly on at least one block (pigeonhole)
    const numTables = threshold + 1;
    this.tables = [];
    for (let i = 0; i < numTables; i++) {
      this.tables.push(new Map());
    }
  }

  /**
   * Add document to index
   */
  public add(docId: string, hash: bigint): void {
    this.hashes.set(docId, hash);
    // Add to each table using the corresponding block as key
    for (let t = 0; t < this.tables.length; t++) {
      const blockKey = this.getBlockKey(hash, t);
      if (!this.tables[t].has(blockKey)) {
        this.tables[t].set(blockKey, []);
      }
      this.tables[t].get(blockKey)!.push(docId);
    }
  }

  /**
   * Find near-duplicates of a query hash
   */
  public findNearDuplicates(queryHash: bigint): string[] {
    const candidates = new Set<string>();

    // Check all tables
    for (let t = 0; t < this.tables.length; t++) {
      const blockKey = this.getBlockKey(queryHash, t);
      const bucket = this.tables[t].get(blockKey);
      if (bucket) {
        for (const docId of bucket) {
          candidates.add(docId);
        }
      }
    }

    // Verify candidates
    const duplicates: string[] = [];
    for (const docId of candidates) {
      const docHash = this.hashes.get(docId)!;
      if (this.simhash.areNearDuplicates(queryHash, docHash, this.threshold)) {
        duplicates.push(docId);
      }
    }
    return duplicates;
  }

  /**
   * Get the block of bits used as the key for a given table
   */
  private getBlockKey(hash: bigint, tableIndex: number): bigint {
    const bitsPerBlock = Math.floor(this.hashBits / this.tables.length);
    const shift = BigInt(tableIndex * bitsPerBlock);
    const mask = (1n << BigInt(bitsPerBlock)) - 1n;
    return (hash >> shift) & mask;
  }
}

// Usage example
const simhash = new SimHash(64);

const docA = "The quick brown fox jumps over the lazy dog near the river";
const docB = "The quick brown fox leaps over the lazy dog near the river"; // Near-duplicate
const docC = "A completely different document about programming in Python";

const hashA = simhash.compute(docA);
const hashB = simhash.compute(docB);
const hashC = simhash.compute(docC);

console.log('docA hash:', hashA.toString(16));
console.log('docB hash:', hashB.toString(16));
console.log('docC hash:', hashC.toString(16));

console.log('docA vs docB distance:', simhash.hammingDistance(hashA, hashB)); // small
console.log('docA vs docC distance:', simhash.hammingDistance(hashA, hashC)); // large (typically 20+)
console.log('Near-duplicates?', simhash.areNearDuplicates(hashA, hashB));
```

SimHash produces a single compact hash (64-128 bits), making storage and comparison very efficient. MinHash requires longer signatures (128+ values) but provides better accuracy for highly similar documents. SimHash is typically preferred when storage is critical and you need exact near-duplicate detection (< ~5% difference). MinHash is better for detecting a range of similarity levels.
At web scale, duplicate detection must be distributed across many machines. This introduces coordination challenges: how do you efficiently check a new document against billions of existing fingerprints spread across a cluster?
Distributed architecture options:

- Centralized service: a single fingerprint store queried by every worker. Simple and consistent, but adds a network round-trip per page and becomes a bottleneck.
- Partitioned index: fingerprints sharded across machines (for example by hash prefix), so each lookup touches exactly one shard. Scales horizontally.
- Local-only structures: each worker keeps its own filter. No coordination cost, but duplicates crawled by different workers slip through.
Hybrid architecture for production:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED DUPLICATE DETECTION ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CRAWLER WORKERS │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │ │
│ │ │ │Local │ │ │ │Local │ │ │ │Local │ │ │ │
│ │ │ │Bloom │ │ │ │Bloom │ │ │ │Bloom │ │ │ │
│ │ │ │Filter │ │ │ │Filter │ │ │ │Filter │ │ │ │
│ │ │ └───┬────┘ │ │ └───┬────┘ │ │ └───┬────┘ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └──────┼──────┘ └──────┼──────┘ └──────┼──────┘ │ │
│ │ │ │ │ │ │
│ │ │ Bloom says │ │ │ │
│ │ │ 'not seen' │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Bloom says 'possibly seen' │
│ │ → Query SimHash service │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SIMHASH DEDUP SERVICE │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ PARTITIONED INDEX │ │ │
│ │ │ │ │ │
│ │ │ Partition 0 Partition 1 Partition N │ │ │
│ │ │ ──────────── ──────────── ──────────── │ │ │
│ │ │ Hash prefix 0x0 Hash prefix 0x1 Hash prefix 0xF │ │ │
│ │ │ SimHash tables SimHash tables SimHash tables │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Query Flow: │ │
│ │ 1. Receive (docId, content) │ │
│ │ 2. Compute SimHash │ │
│ │ 3. Query partition for hash prefix + neighbors │ │
│ │ 4. Return: duplicates found? (docId list) or unique │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Two-tier design: │
│ - Tier 1 (Local Bloom): Fast, in-memory, catches most 'definitely new' docs │
│ - Tier 2 (SimHash Service): Accurate, catches near-duplicates │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Why two tiers?
Local Bloom Filter (per worker): Catches 70-90% of lookups as "definitely not seen." Avoids network round-trip for most new pages.
SimHash Service: Handles 'possibly seen' from Bloom filter and detects near-duplicates that Bloom filter can't catch.
This architecture minimizes network traffic (most checks are local) while providing comprehensive duplicate and near-duplicate detection.
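The local Bloom filter in tier 1 is the probabilistic structure mentioned at the start of this page: it answers "definitely not seen" with certainty and "possibly seen" with a tunable false-positive rate, at a cost of a few bits per entry. A minimal sketch follows; the FNV-style mixing constants are illustrative, while the double-hashing trick for deriving k probe positions and the sizing formulas are standard:

```typescript
class BloomFilter {
  private bits: Uint8Array;
  private numBits: number;
  private numHashes: number;

  constructor(expectedItems: number, falsePositiveRate: number = 0.01) {
    // Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions
    this.numBits = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (Math.LN2 * Math.LN2));
    this.numHashes = Math.max(1, Math.round((this.numBits / expectedItems) * Math.LN2));
    this.bits = new Uint8Array(Math.ceil(this.numBits / 8));
  }

  add(item: string): void {
    for (const pos of this.positions(item)) {
      this.bits[pos >> 3] |= 1 << (pos & 7);
    }
  }

  /** false = definitely not seen; true = possibly seen */
  mightContain(item: string): boolean {
    for (const pos of this.positions(item)) {
      if ((this.bits[pos >> 3] & (1 << (pos & 7))) === 0) return false;
    }
    return true;
  }

  // Derive k bit positions from two base hashes (double hashing)
  private positions(item: string): number[] {
    const h1 = this.hash(item, 0x811c9dc5);
    const h2 = this.hash(item, 0x01000193);
    const out: number[] = [];
    for (let i = 0; i < this.numHashes; i++) {
      out.push((h1 + i * h2) % this.numBits);
    }
    return out;
  }

  // FNV-style 32-bit string hash with a seed
  private hash(str: string, seed: number): number {
    let h = seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return h >>> 0; // Unsigned 32-bit
  }
}

const bloom = new BloomFilter(10000, 0.01);
bloom.add('sha256-fingerprint-abc');
console.log(bloom.mightContain('sha256-fingerprint-abc')); // true
console.log(bloom.mightContain('never-added'));            // false (with high probability)
```

A Bloom filter never produces false negatives, which is exactly the property tier 1 relies on: a "not seen" answer lets the worker skip the network round-trip safely.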
Implementing duplicate detection in a production crawler involves several practical decisions:

- Where to check: at URL discovery (cheapest), after fetch (catches exact duplicates), or after content extraction (catches template duplicates)
- What to store: raw content hashes, MinHash signatures, SimHash fingerprints, or a combination
- Thresholds: how similar two pages must be before one is skipped, balancing missed duplicates against discarded unique content
- Retention: fingerprints for billions of pages must eventually be compacted or aged out
Content changes over time. A page that was a duplicate yesterday may have unique updates today. Production systems must balance deduplication against freshness—periodically rechecking 'known duplicates' to detect divergence, and updating fingerprints when content changes significantly.
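A sketch of that recheck policy (the record shape, field names, and threshold here are illustrative, not from any particular system):

```typescript
interface DupRecord {
  fingerprint: string;   // Last known content fingerprint
  lastVerified: number;  // Epoch ms when the duplicate verdict was last confirmed
}

// Re-fetch a "known duplicate" once its verification is older than maxAgeMs,
// so a page that has diverged from its original gets promoted back to unique.
function shouldRecheck(record: DupRecord, nowMs: number, maxAgeMs: number): boolean {
  return nowMs - record.lastVerified > maxAgeMs;
}

const weekMs = 7 * 24 * 3600 * 1000;
const record: DupRecord = { fingerprint: 'abc123', lastVerified: 0 };
console.log(shouldRecheck(record, weekMs + 1, weekMs)); // true: stale, verify again
```

In practice the recheck interval would itself adapt, for example growing for pages that stay duplicates across many verifications.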
We've explored the complete landscape of duplicate detection—from simple URL matching to sophisticated near-duplicate detection at scale.
What's Next:
The next page explores Distributed Crawling—how to scale a crawler across hundreds of machines while maintaining coordination, consistency, and efficiency. We'll cover worker architecture, work distribution, failure handling, and the coordination mechanisms that tie distributed crawler components together.
You now understand how to detect duplicates at web scale using a combination of exact matching and similarity-preserving hashing techniques. Effective deduplication can reduce storage costs by 25-30% and improve crawl efficiency proportionally—a significant optimization for any large-scale crawler.