The web is rife with duplicates. The same content appears under multiple URLs due to URL normalization failures, mirrors, syndication, content management systems, session IDs, tracking parameters, and countless other reasons. Studies estimate that 25-30% of web content is duplicated.
For a web crawler, duplicates are a serious problem:

- Wasted bandwidth and crawl budget fetching content the crawler already has
- Wasted storage and processing throughout the indexing pipeline
- Degraded search quality when the same result surfaces multiple times
Effective duplicate detection is essential for crawler efficiency. This page explores the techniques—from simple URL matching to sophisticated content fingerprinting—that enable detection at scale.
By the end of this page, you will understand: (1) The spectrum of duplicate types from URL to near-duplicate content, (2) Content fingerprinting using cryptographic and locality-sensitive hashing, (3) Shingling and MinHash for near-duplicate detection, (4) SimHash for similarity search at scale, and (5) Probabilistic data structures that make duplicate detection feasible for billions of pages.
Duplication exists on a spectrum from trivial URL variations to near-identical content. Understanding this spectrum is essential for choosing appropriate detection strategies.
The duplication spectrum:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DUPLICATION SPECTRUM │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXACT NEAR │
│ ◄──────────────────────────────────────────────────────────────────────────► │
│ │
│ URL Variants Same Content Boilerplate + Similar Slight │
│ (trivial) Different URLs Same Core Content Variations │
│ │
│ /page?a=1 Mirror sites Templates with Syndicated Minor edits │
│ /page?a=1& CDN copies same article articles Translations │
│ Canonical issues Different ads Scraped Reformatted │
│ │
│ ───────────────────────────────────────────────────────────────────────────── │
│ Detection Detection Detection Detection Detection │
│ Method: Method: Method: Method: Method: │
│ URL Normal- Content Hash Content Hash MinHash / MinHash / │
│ ization (MD5/SHA) (after clean) Shingling SimHash │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
| Duplication Type | Examples | Detection Approach | Complexity |
|---|---|---|---|
| URL Variants | Query parameter ordering, trailing slashes, case differences | URL normalization | Low |
| Exact Content Duplicates | Mirrors, CDN copies, www vs non-www | Cryptographic hash (MD5/SHA-256) | Low |
| Template Duplicates | Same article with different ads/headers | Content extraction + hashing | Medium |
| Near-Duplicates | Syndicated articles, minor edits | MinHash, SimHash, Shingling | High |
| Semantic Duplicates | Same information, different wording | ML embeddings, semantic similarity | Very High |
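URL variants, the cheapest type to catch, are handled before a page is ever fetched. The sketch below shows a minimal normalization routine; the tracking-parameter list is illustrative only, and production crawlers apply many more rules (canonical tags, known mirror mappings, per-site parameter policies):

```typescript
// Hypothetical list of tracking parameters to strip; real crawlers maintain far longer lists.
const TRACKING_PARAMS = new Set(['utm_source', 'utm_medium', 'utm_campaign', 'fbclid']);

function normalizeUrl(raw: string): string {
  // The WHATWG URL parser already lowercases scheme/host and drops default ports
  const url = new URL(raw);
  url.hash = ''; // Fragments never reach the server

  // Drop tracking parameters, then sort the rest into a canonical order
  const params = [...url.searchParams.entries()]
    .filter(([key]) => !TRACKING_PARAMS.has(key))
    .sort(([a], [b]) => a.localeCompare(b));
  url.search = '';
  for (const [key, value] of params) url.searchParams.append(key, value);

  // Remove the trailing slash on non-root paths
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

console.log(normalizeUrl('HTTP://Example.com/page/?b=2&a=1&utm_source=x#top'));
// → http://example.com/page?a=1&b=2
```

Normalization like this collapses the "trivial" end of the spectrum; everything further right requires looking at content, not URLs.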
Most production crawlers focus on the first four types. Semantic duplicate detection (same meaning, different words) requires deep learning models and is typically handled in the search ranking layer, not the crawler. Our focus here is on practical, scalable detection for the crawl pipeline.
The simplest approach to detecting duplicate content is fingerprinting: computing a hash that uniquely identifies a page's content. If two pages have the same fingerprint, they're duplicates.
Fingerprinting workflow:

1. Fetch the page and extract its main content
2. Normalize the text (whitespace, case, volatile elements such as dates)
3. Hash the normalized text to a fixed-size fingerprint
4. Check the fingerprint against the set of previously seen fingerprints

Hash function requirements:

- Deterministic: identical content must always produce the identical fingerprint
- Collision-resistant: distinct content should essentially never share a fingerprint
- Fast: the hash is computed once per fetched page, so throughput matters
```typescript
import crypto from 'crypto';

class ContentFingerprinter {
  /**
   * Compute fingerprint for raw HTML content.
   * This catches exact duplicates but is sensitive to any change.
   */
  public hashRaw(content: string): string {
    return crypto.createHash('sha256').update(content).digest('hex');
  }

  /**
   * Compute fingerprint after basic normalization.
   * More robust to whitespace/encoding differences.
   */
  public hashNormalized(content: string): string {
    const normalized = this.normalizeContent(content);
    return crypto.createHash('sha256').update(normalized).digest('hex');
  }

  /**
   * Compute fingerprint of extracted main content.
   * Ignores headers, footers, ads, navigation - focuses on article body.
   */
  public hashMainContent(html: string): string {
    const mainContent = this.extractMainContent(html);
    const normalized = this.normalizeContent(mainContent);
    return crypto.createHash('sha256').update(normalized).digest('hex');
  }

  private normalizeContent(content: string): string {
    return content
      // Normalize whitespace
      .replace(/\s+/g, ' ')
      // Remove leading/trailing whitespace
      .trim()
      // Normalize case (optional, may be too aggressive)
      .toLowerCase()
      // Remove common variable elements
      .replace(/\d{4}-\d{2}-\d{2}/g, 'DATE') // Dates
      .replace(/\d{1,2}:\d{2}/g, 'TIME')     // Times
      .replace(/\b\d+\b/g, 'NUM');           // Numbers
  }

  private extractMainContent(html: string): string {
    // In production, use a library like readability-lxml or mercury-parser.
    // This is a simplified example using regex.

    // Remove script and style elements
    let content = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<!--[\s\S]*?-->/g, '');

    // Remove header, footer, nav, aside elements (common boilerplate)
    content = content
      .replace(/<header[^>]*>[\s\S]*?<\/header>/gi, '')
      .replace(/<footer[^>]*>[\s\S]*?<\/footer>/gi, '')
      .replace(/<nav[^>]*>[\s\S]*?<\/nav>/gi, '')
      .replace(/<aside[^>]*>[\s\S]*?<\/aside>/gi, '');

    // Extract text content
    content = content.replace(/<[^>]+>/g, ' ');
    return content;
  }
}

class DuplicateDetector {
  private seenHashes: Set<string> = new Set();
  private totalChecked = 0;
  private fingerprinter: ContentFingerprinter = new ContentFingerprinter();

  /**
   * Check whether the content's fingerprint has been seen before.
   * The url parameter identifies the page being checked; only the
   * content determines the verdict.
   */
  public checkDuplicate(
    url: string,
    content: string
  ): { isDuplicate: boolean; fingerprint: string } {
    this.totalChecked++;
    const fingerprint = this.fingerprinter.hashMainContent(content);
    if (this.seenHashes.has(fingerprint)) {
      return { isDuplicate: true, fingerprint };
    }
    this.seenHashes.add(fingerprint);
    return { isDuplicate: false, fingerprint };
  }

  /**
   * Get statistics
   */
  public getStats(): { totalSeen: number; uniqueCount: number } {
    return { totalSeen: this.totalChecked, uniqueCount: this.seenHashes.size };
  }
}

// Usage example
const detector = new DuplicateDetector();

const page1 = '<html><body><article>Breaking news about event X</article></body></html>';
const page2 = '<html><body><article>Breaking news about event X</article></body></html>'; // Same
const page3 = '<html><body><article>Different story about event Y</article></body></html>';

console.log(detector.checkDuplicate('url1', page1)); // { isDuplicate: false, ... }
console.log(detector.checkDuplicate('url2', page2)); // { isDuplicate: true, ... }
console.log(detector.checkDuplicate('url3', page3)); // { isDuplicate: false, ... }
```

The quality of duplicate detection depends heavily on content extraction. Poor extraction that includes boilerplate (headers, footers, ads) will cause false negatives: pages with the same article but different ads will appear as different. Use a robust extraction library such as Mozilla Readability or trafilatura for production systems.
Exact hashing fails for near-duplicates—pages that share most of their content but differ in minor ways (different ads, slight edits, reformatting). For these, we need similarity-preserving techniques.
Shingling (n-grams):
A shingle (or k-shingle) is a contiguous sequence of k tokens. By representing a document as the set of its shingles, we create a signature that captures its content in a way that's robust to minor changes.
Document: "The quick brown fox jumps"
3-shingles: {"The quick brown", "quick brown fox", "brown fox jumps"}
Jaccard Similarity:
Given two documents A and B represented as shingle sets:
Jaccard(A, B) = |A ∩ B| / |A ∪ B|
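A direct computation of this definition over two shingle sets (a small illustrative helper, reusing the shingles from the example above plus a one-word variant):

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|
function jaccard(a: Set<string>, b: Set<string>): number {
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection++;
  }
  const union = a.size + b.size - intersection;
  return union === 0 ? 1 : intersection / union;
}

const shinglesA = new Set(['The quick brown', 'quick brown fox', 'brown fox jumps']);
const shinglesB = new Set(['The quick brown', 'quick brown fox', 'brown fox leaps']);
console.log(jaccard(shinglesA, shinglesB)); // 2 shared / 4 total = 0.5
```

Note how one changed word alters every shingle that covers it, so small edits lower the similarity smoothly rather than breaking it entirely, which is exactly what exact hashing cannot do.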
The Problem with Direct Comparison:
Comparing shingle sets directly is expensive:

- Storage: a typical page yields thousands of shingles, each tens of bytes
- Comparison: computing |A ∩ B| means touching every shingle in both sets
- Scale: with n documents there are O(n²) pairs to examine
MinHash to the rescue:
MinHash is a technique that estimates Jaccard similarity using compact signatures. The key insight: if we apply a random hash function to each shingle and take the minimum hash value, the probability that two documents have the same minimum hash equals their Jaccard similarity.
Pr[min(h(A)) = min(h(B))] = Jaccard(A, B)
By using multiple hash functions and creating a signature of minimum hashes, we can estimate Jaccard with high accuracy using much less space.
```typescript
class MinHash {
  private numHashes: number;
  private hashFunctions: Array<(x: number) => number>;

  constructor(numHashes: number = 128) {
    this.numHashes = numHashes;
    this.hashFunctions = this.generateHashFunctions(numHashes);
  }

  /**
   * Generate word shingles from text
   */
  public shingle(text: string, k: number = 5): Set<string> {
    const words = text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
    const shingles = new Set<string>();
    for (let i = 0; i <= words.length - k; i++) {
      shingles.add(words.slice(i, i + k).join(' '));
    }
    return shingles;
  }

  /**
   * Compute MinHash signature for a set of shingles
   */
  public computeSignature(shingles: Set<string>): number[] {
    // Initialize signature with infinity
    const signature = new Array(this.numHashes).fill(Infinity);

    // Convert shingles to numeric hashes
    const shingleHashes = Array.from(shingles).map(s => this.stringHash(s));

    // For each hash function, find the minimum across all shingles
    for (let i = 0; i < this.numHashes; i++) {
      for (const shingleHash of shingleHashes) {
        const hashValue = this.hashFunctions[i](shingleHash);
        if (hashValue < signature[i]) {
          signature[i] = hashValue;
        }
      }
    }
    return signature;
  }

  /**
   * Estimate Jaccard similarity from two MinHash signatures
   */
  public estimateSimilarity(sig1: number[], sig2: number[]): number {
    if (sig1.length !== sig2.length) {
      throw new Error('Signatures must have same length');
    }
    let matches = 0;
    for (let i = 0; i < sig1.length; i++) {
      if (sig1[i] === sig2[i]) {
        matches++;
      }
    }
    return matches / sig1.length;
  }

  /**
   * Generate a family of hash functions of the form h(x) = (a*x + b) mod p.
   * Note: a*x can exceed Number.MAX_SAFE_INTEGER, so this only approximates
   * a universal hash family - fine for a sketch, but production code should
   * use integer-safe arithmetic.
   */
  private generateHashFunctions(n: number): Array<(x: number) => number> {
    const p = 2147483647; // Large prime (2^31 - 1)
    const functions: Array<(x: number) => number> = [];
    for (let i = 0; i < n; i++) {
      const a = Math.floor(Math.random() * (p - 1)) + 1;
      const b = Math.floor(Math.random() * p);
      functions.push((x: number) => (a * x + b) % p);
    }
    return functions;
  }

  /**
   * Simple string to number hash
   */
  private stringHash(str: string): number {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32-bit integer
    }
    return Math.abs(hash);
  }
}

// Usage example
const minhash = new MinHash(128);

const doc1 = "The quick brown fox jumps over the lazy dog";
const doc2 = "The quick brown fox leaps over the lazy dog"; // One word different
const doc3 = "A completely different document about cats";

const shingles1 = minhash.shingle(doc1, 3);
const shingles2 = minhash.shingle(doc2, 3);
const shingles3 = minhash.shingle(doc3, 3);

const sig1 = minhash.computeSignature(shingles1);
const sig2 = minhash.computeSignature(shingles2);
const sig3 = minhash.computeSignature(shingles3);

console.log('doc1 vs doc2:', minhash.estimateSimilarity(sig1, sig2)); // ~0.4 (true Jaccard is 4/10)
console.log('doc1 vs doc3:', minhash.estimateSimilarity(sig1, sig3)); // ~0.0
console.log('doc2 vs doc3:', minhash.estimateSimilarity(sig2, sig3)); // ~0.0
```

The number of hash functions (signature length) trades off space against accuracy. With 128 hashes, the standard error of the similarity estimate stays below ~5% while using just 128 integers (~512 bytes) per document. For billions of documents, this is vastly more efficient than storing all shingles.
MinHash gives us compact signatures that estimate similarity, but we still face the all-pairs comparison problem: with n documents, comparing every pair is O(n²)—infeasible for billions of pages.
Locality-Sensitive Hashing (LSH) solves this by hashing documents into buckets such that similar documents are likely to hash to the same bucket.
LSH for MinHash (Banding Technique):

1. Split each signature of length n into b bands of r rows each (n = b × r)
2. Hash each band into a bucket, using a separate hash table per band
3. Two documents whose signatures agree on all r rows of any band land in the same bucket and become a candidate pair
4. Verify candidate pairs by comparing their full signatures
Why it works:

For two documents with Jaccard similarity s:

Pr[all r rows of one band match] = s^r
Pr[a given band does not match] = 1 - s^r
Pr[no band matches] = (1 - s^r)^b
Pr[candidate pair] = 1 - (1 - s^r)^b
By tuning b and r, we create an S-curve that sharply separates similar from dissimilar pairs.
Example parameter choices for a 128-value signature (threshold ~(1/b)^(1/r)):

| Bands (b) | Rows (r) | Threshold (~) | P(Candidate) at 0.8 similarity | P(Candidate) at 0.3 similarity |
|---|---|---|---|---|
| 32 | 4 | ~0.42 | ~100% | ~23% |
| 16 | 8 | ~0.71 | ~95% | ~0.1% |
| 64 | 2 | ~0.13 | ~100% | ~99.8% |
| 8 | 16 | ~0.88 | ~20% | ~0% |
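For two documents with Jaccard similarity s, the probability of becoming a candidate pair under b bands of r rows is 1 - (1 - s^r)^b; a few lines make it easy to explore (b, r) settings:

```typescript
// P(two documents with Jaccard similarity s become an LSH candidate pair)
function candidateProbability(s: number, bands: number, rows: number): number {
  return 1 - Math.pow(1 - Math.pow(s, rows), bands);
}

// b = 16, r = 8 (a 128-hash signature)
console.log(candidateProbability(0.8, 16, 8).toFixed(3)); // ≈ 0.947
console.log(candidateProbability(0.3, 16, 8).toFixed(4)); // ≈ 0.0010
```

Plotting this function against s produces the S-curve: increasing r steepens the curve and pushes the threshold right, while increasing b pushes it left.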
```typescript
class LSHIndex {
  private bands: number;
  private rows: number;
  private buckets: Map<string, Set<string>>[]; // One bucket map per band

  constructor(signatureLength: number, bands: number) {
    if (signatureLength % bands !== 0) {
      throw new Error('Signature length must be divisible by bands');
    }
    this.bands = bands;
    this.rows = signatureLength / bands;

    // Initialize bucket maps for each band
    this.buckets = [];
    for (let i = 0; i < bands; i++) {
      this.buckets.push(new Map());
    }
  }

  /**
   * Add a document (by its signature) to the index
   */
  public add(docId: string, signature: number[]): void {
    for (let band = 0; band < this.bands; band++) {
      const bandSignature = this.getBandSignature(signature, band);
      const bucketKey = this.hashBand(bandSignature);
      if (!this.buckets[band].has(bucketKey)) {
        this.buckets[band].set(bucketKey, new Set());
      }
      this.buckets[band].get(bucketKey)!.add(docId);
    }
  }

  /**
   * Find candidate documents that might be similar to the query
   */
  public findCandidates(signature: number[]): Set<string> {
    const candidates = new Set<string>();
    for (let band = 0; band < this.bands; band++) {
      const bandSignature = this.getBandSignature(signature, band);
      const bucketKey = this.hashBand(bandSignature);
      const bucket = this.buckets[band].get(bucketKey);
      if (bucket) {
        for (const docId of bucket) {
          candidates.add(docId);
        }
      }
    }
    return candidates;
  }

  private getBandSignature(signature: number[], band: number): number[] {
    const start = band * this.rows;
    return signature.slice(start, start + this.rows);
  }

  private hashBand(bandSignature: number[]): string {
    // Simple hash: join values
    return bandSignature.join('-');
  }
}

class NearDuplicateDetector {
  private minhash: MinHash;
  private lshIndex: LSHIndex;
  private signatures: Map<string, number[]> = new Map();
  private similarityThreshold: number;

  constructor(
    numHashes: number = 128,
    bands: number = 16,
    similarityThreshold: number = 0.8
  ) {
    this.minhash = new MinHash(numHashes);
    this.lshIndex = new LSHIndex(numHashes, bands);
    this.similarityThreshold = similarityThreshold;
  }

  /**
   * Check if document is a near-duplicate of existing documents.
   * Returns the list of similar document IDs if duplicates are found.
   */
  public checkNearDuplicate(docId: string, content: string): {
    isDuplicate: boolean;
    similarDocs: Array<{ docId: string; similarity: number }>;
  } {
    // Compute signature
    const shingles = this.minhash.shingle(content, 5);
    const signature = this.minhash.computeSignature(shingles);

    // Find candidates via LSH
    const candidates = this.lshIndex.findCandidates(signature);

    // Verify candidates by computing actual similarity
    const similarDocs: Array<{ docId: string; similarity: number }> = [];
    for (const candidateId of candidates) {
      if (candidateId === docId) continue;
      const candidateSig = this.signatures.get(candidateId);
      if (!candidateSig) continue;
      const similarity = this.minhash.estimateSimilarity(signature, candidateSig);
      if (similarity >= this.similarityThreshold) {
        similarDocs.push({ docId: candidateId, similarity });
      }
    }

    // Sort by similarity descending
    similarDocs.sort((a, b) => b.similarity - a.similarity);

    // Add new document to index
    this.lshIndex.add(docId, signature);
    this.signatures.set(docId, signature);

    return { isDuplicate: similarDocs.length > 0, similarDocs };
  }

  /**
   * Get index statistics
   */
  public getStats(): { documentsIndexed: number } {
    return { documentsIndexed: this.signatures.size };
  }
}

// Usage example
const nearDupDetector = new NearDuplicateDetector(128, 16, 0.7);

const articles = [
  { id: 'article1', content: 'Breaking news: Major event happens in city. Officials respond to situation.' },
  { id: 'article2', content: 'Breaking news: Major event happens in city. Officials respond to the situation.' }, // Near-dup
  { id: 'article3', content: 'Sports update: Team wins championship after long season of competition.' },
  { id: 'article4', content: 'Breaking news: Major event occurs in city. Officials respond to situation.' }, // Near-dup of 1
];

for (const article of articles) {
  const result = nearDupDetector.checkNearDuplicate(article.id, article.content);
  console.log(`${article.id}: isDuplicate=${result.isDuplicate}`);
  if (result.similarDocs.length > 0) {
    console.log(`  Similar to: ${result.similarDocs.map(d => `${d.docId}(${(d.similarity * 100).toFixed(1)}%)`).join(', ')}`);
  }
}
```

LSH trades accuracy for speed. Some similar documents might not become candidates (false negatives), and some dissimilar documents might become candidates (false positives, caught in verification). The banding parameters let you tune this tradeoff. For crawlers, it's usually better to err on the side of more candidates (lower threshold) since verification is cheap compared to re-crawling duplicates.
SimHash is an alternative to MinHash, popularized by Google for web-scale duplicate detection. While MinHash estimates Jaccard similarity, SimHash produces a single hash where similar documents have similar hashes (differing in few bits).
SimHash Algorithm:

1. Extract weighted features from the document (words or n-grams, weighted by importance, e.g. TF-IDF)
2. Hash each feature to a b-bit value
3. Keep a vector v of b running totals; for each feature, add its weight to v[i] when bit i of its hash is 1, subtract it when the bit is 0
4. Set bit i of the fingerprint to 1 if v[i] > 0, else 0
The result is a b-bit fingerprint where similar documents differ in few bits.
Why SimHash?

- One compact b-bit fingerprint per document (64 bits is typical), versus 128+ integers for a MinHash signature
- Similarity is measured by Hamming distance, computable in a few machine instructions
- Fingerprints can be indexed for fast near-neighbor lookup by partitioning their bits
```typescript
class SimHash {
  private hashBits: number;

  constructor(hashBits: number = 64) {
    this.hashBits = hashBits;
  }

  /**
   * Compute SimHash fingerprint for a document
   */
  public compute(text: string): bigint {
    const features = this.extractFeatures(text);
    const weights = this.computeWeights(features);

    // Initialize vector with zeros
    const v = new Array(this.hashBits).fill(0);

    // For each feature
    for (const [feature, weight] of weights) {
      const featureHash = this.hashFeature(feature);
      // For each bit position
      for (let i = 0; i < this.hashBits; i++) {
        const bit = (featureHash >> BigInt(i)) & 1n;
        if (bit === 1n) {
          v[i] += weight;
        } else {
          v[i] -= weight;
        }
      }
    }

    // Create fingerprint: 1 if v[i] > 0, else 0
    let fingerprint = 0n;
    for (let i = 0; i < this.hashBits; i++) {
      if (v[i] > 0) {
        fingerprint |= (1n << BigInt(i));
      }
    }
    return fingerprint;
  }

  /**
   * Compute Hamming distance between two SimHash fingerprints
   */
  public hammingDistance(hash1: bigint, hash2: bigint): number {
    let xor = hash1 ^ hash2;
    let distance = 0;
    while (xor > 0n) {
      distance += Number(xor & 1n);
      xor >>= 1n;
    }
    return distance;
  }

  /**
   * Check if two documents are near-duplicates.
   * A threshold of 3 bits is common for 64-bit SimHash.
   */
  public areNearDuplicates(
    hash1: bigint,
    hash2: bigint,
    threshold: number = 3
  ): boolean {
    return this.hammingDistance(hash1, hash2) <= threshold;
  }

  /**
   * Extract features (word 3-grams) from text
   */
  private extractFeatures(text: string): string[] {
    const words = text.toLowerCase()
      .replace(/[^a-z0-9\s]/g, '')
      .split(/\s+/)
      .filter(w => w.length > 0);
    const features: string[] = [];
    for (let i = 0; i <= words.length - 3; i++) {
      features.push(words.slice(i, i + 3).join(' '));
    }
    return features;
  }

  /**
   * Compute weights for features (simplified: raw counts).
   * In production, use TF-IDF or similar.
   */
  private computeWeights(features: string[]): Map<string, number> {
    const counts = new Map<string, number>();
    for (const feature of features) {
      counts.set(feature, (counts.get(feature) || 0) + 1);
    }
    return counts;
  }

  /**
   * Hash a feature string to a 64-bit value
   * using FNV-1a (in production, use xxHash or MurmurHash)
   */
  private hashFeature(feature: string): bigint {
    let hash = 14695981039346656037n; // FNV offset basis
    const prime = 1099511628211n;     // FNV prime
    for (let i = 0; i < feature.length; i++) {
      hash ^= BigInt(feature.charCodeAt(i));
      hash = (hash * prime) & ((1n << 64n) - 1n); // Keep 64 bits
    }
    return hash;
  }
}

// Efficient lookup by splitting the hash into blocks
class SimHashIndex {
  private hashBits: number;
  private threshold: number;
  private tables: Map<bigint, string[]>[]; // One table per block of bits
  private simhash: SimHash;
  private hashes: Map<string, bigint> = new Map();

  constructor(hashBits: number = 64, threshold: number = 3) {
    this.hashBits = hashBits;
    this.threshold = threshold;
    this.simhash = new SimHash(hashBits);

    // Split hash into (threshold + 1) blocks: two hashes within the
    // threshold must agree exactly on at least one block (pigeonhole)
    const numTables = threshold + 1;
    this.tables = [];
    for (let i = 0; i < numTables; i++) {
      this.tables.push(new Map());
    }
  }

  /**
   * Add document to index
   */
  public add(docId: string, hash: bigint): void {
    this.hashes.set(docId, hash);
    // Add to each table using the corresponding block as key
    for (let t = 0; t < this.tables.length; t++) {
      const blockKey = this.getBlockKey(hash, t);
      if (!this.tables[t].has(blockKey)) {
        this.tables[t].set(blockKey, []);
      }
      this.tables[t].get(blockKey)!.push(docId);
    }
  }

  /**
   * Find near-duplicates of a query hash
   */
  public findNearDuplicates(queryHash: bigint): string[] {
    const candidates = new Set<string>();

    // Check all tables
    for (let t = 0; t < this.tables.length; t++) {
      const blockKey = this.getBlockKey(queryHash, t);
      const bucket = this.tables[t].get(blockKey);
      if (bucket) {
        for (const docId of bucket) {
          candidates.add(docId);
        }
      }
    }

    // Verify candidates
    const duplicates: string[] = [];
    for (const docId of candidates) {
      const docHash = this.hashes.get(docId)!;
      if (this.simhash.areNearDuplicates(queryHash, docHash, this.threshold)) {
        duplicates.push(docId);
      }
    }
    return duplicates;
  }

  /**
   * Get the block of bits used as the key for a given table
   */
  private getBlockKey(hash: bigint, tableIndex: number): bigint {
    const bitsPerBlock = Math.floor(this.hashBits / this.tables.length);
    const shift = BigInt(tableIndex * bitsPerBlock);
    const mask = (1n << BigInt(bitsPerBlock)) - 1n;
    return (hash >> shift) & mask;
  }
}

// Usage example
const simhash = new SimHash(64);

const docA = "The quick brown fox jumps over the lazy dog near the river";
const docB = "The quick brown fox leaps over the lazy dog near the river"; // Near-duplicate
const docC = "A completely different document about programming in Python";

const hashA = simhash.compute(docA);
const hashB = simhash.compute(docB);
const hashC = simhash.compute(docC);

console.log('docA hash:', hashA.toString(16));
console.log('docB hash:', hashB.toString(16));
console.log('docC hash:', hashC.toString(16));

console.log('docA vs docB distance:', simhash.hammingDistance(hashA, hashB)); // small
console.log('docA vs docC distance:', simhash.hammingDistance(hashA, hashC)); // large (typically 20+)
console.log('Near-duplicates?', simhash.areNearDuplicates(hashA, hashB));
```

SimHash produces a single compact hash (64-128 bits), making storage and comparison very efficient. MinHash requires longer signatures (128+ values) but provides better accuracy for highly similar documents. SimHash is typically preferred when storage is critical and you need exact near-duplicate detection (< ~5% difference). MinHash is better for detecting a range of similarity levels.
At web scale, duplicate detection must be distributed across many machines. This introduces coordination challenges: how do you efficiently check a new document against billions of existing fingerprints spread across a cluster?
Distributed architecture options:

- Centralized service: a single fingerprint store queried by every worker. Simple and consistent, but adds a network round-trip per page and becomes a bottleneck.
- Partitioned index: fingerprints sharded across machines (for example by hash prefix), so each lookup touches exactly one shard. Scales horizontally.
- Local-only structures: each worker keeps its own filter. No coordination cost, but duplicates crawled by different workers slip through.
Hybrid architecture for production:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED DUPLICATE DETECTION ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ CRAWLER WORKERS │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Worker 1 │ │ Worker 2 │ │ Worker N │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │ │
│ │ │ │Local │ │ │ │Local │ │ │ │Local │ │ │ │
│ │ │ │Bloom │ │ │ │Bloom │ │ │ │Bloom │ │ │ │
│ │ │ │Filter │ │ │ │Filter │ │ │ │Filter │ │ │ │
│ │ │ └───┬────┘ │ │ └───┬────┘ │ │ └───┬────┘ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │ │
│ │ └──────┼──────┘ └──────┼──────┘ └──────┼──────┘ │ │
│ │ │ │ │ │ │
│ │ │ Bloom says │ │ │ │
│ │ │ 'not seen' │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Bloom says 'possibly seen' │
│ │ → Query SimHash service │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ SIMHASH DEDUP SERVICE │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────────┐ │ │
│ │ │ PARTITIONED INDEX │ │ │
│ │ │ │ │ │
│ │ │ Partition 0 Partition 1 Partition N │ │ │
│ │ │ ──────────── ──────────── ──────────── │ │ │
│ │ │ Hash prefix 0x0 Hash prefix 0x1 Hash prefix 0xF │ │ │
│ │ │ SimHash tables SimHash tables SimHash tables │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Query Flow: │ │
│ │ 1. Receive (docId, content) │ │
│ │ 2. Compute SimHash │ │
│ │ 3. Query partition for hash prefix + neighbors │ │
│ │ 4. Return: duplicates found? (docId list) or unique │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Two-tier design: │
│ - Tier 1 (Local Bloom): Fast, in-memory, catches most 'definitely new' docs │
│ - Tier 2 (SimHash Service): Accurate, catches near-duplicates │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Why two tiers?
Local Bloom Filter (per worker): Catches 70-90% of lookups as "definitely not seen." Avoids network round-trip for most new pages.
SimHash Service: Handles 'possibly seen' from Bloom filter and detects near-duplicates that Bloom filter can't catch.
This architecture minimizes network traffic (most checks are local) while providing comprehensive duplicate and near-duplicate detection.
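The local Bloom filter in tier 1 is the probabilistic structure mentioned at the start of this page: it answers "definitely not seen" with certainty and "possibly seen" with a tunable false-positive rate, at a cost of a few bits per entry. A minimal sketch follows; the FNV-style mixing constants are illustrative, while the double-hashing trick for deriving k probe positions and the sizing formulas are standard:

```typescript
class BloomFilter {
  private bits: Uint8Array;
  private numBits: number;
  private numHashes: number;

  constructor(expectedItems: number, falsePositiveRate: number = 0.01) {
    // Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hash functions
    this.numBits = Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (Math.LN2 * Math.LN2));
    this.numHashes = Math.max(1, Math.round((this.numBits / expectedItems) * Math.LN2));
    this.bits = new Uint8Array(Math.ceil(this.numBits / 8));
  }

  add(item: string): void {
    for (const pos of this.positions(item)) {
      this.bits[pos >> 3] |= 1 << (pos & 7);
    }
  }

  /** false = definitely not seen; true = possibly seen */
  mightContain(item: string): boolean {
    for (const pos of this.positions(item)) {
      if ((this.bits[pos >> 3] & (1 << (pos & 7))) === 0) return false;
    }
    return true;
  }

  // Derive k bit positions from two base hashes (double hashing)
  private positions(item: string): number[] {
    const h1 = this.hash(item, 0x811c9dc5);
    const h2 = this.hash(item, 0x01000193);
    const out: number[] = [];
    for (let i = 0; i < this.numHashes; i++) {
      out.push((h1 + i * h2) % this.numBits);
    }
    return out;
  }

  // FNV-style 32-bit string hash with a seed
  private hash(str: string, seed: number): number {
    let h = seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return h >>> 0; // Unsigned 32-bit
  }
}

const bloom = new BloomFilter(10000, 0.01);
bloom.add('sha256-fingerprint-abc');
console.log(bloom.mightContain('sha256-fingerprint-abc')); // true
console.log(bloom.mightContain('never-added'));            // false (with high probability)
```

A Bloom filter never produces false negatives, which is exactly the property tier 1 relies on: a "not seen" answer lets the worker skip the network round-trip safely.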
Implementing duplicate detection in a production crawler involves several practical decisions:

- Where to check: at URL discovery (cheapest), after fetch (catches exact duplicates), or after content extraction (catches template duplicates)
- What to store: raw content hashes, MinHash signatures, SimHash fingerprints, or a combination
- Thresholds: how similar two pages must be before one is skipped, balancing missed duplicates against discarded unique content
- Retention: fingerprints for billions of pages must eventually be compacted or aged out
Content changes over time. A page that was a duplicate yesterday may have unique updates today. Production systems must balance deduplication against freshness—periodically rechecking 'known duplicates' to detect divergence, and updating fingerprints when content changes significantly.
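A sketch of that recheck policy (the record shape, field names, and threshold here are illustrative, not from any particular system):

```typescript
interface DupRecord {
  fingerprint: string;   // Last known content fingerprint
  lastVerified: number;  // Epoch ms when the duplicate verdict was last confirmed
}

// Re-fetch a "known duplicate" once its verification is older than maxAgeMs,
// so a page that has diverged from its original gets promoted back to unique.
function shouldRecheck(record: DupRecord, nowMs: number, maxAgeMs: number): boolean {
  return nowMs - record.lastVerified > maxAgeMs;
}

const weekMs = 7 * 24 * 3600 * 1000;
const record: DupRecord = { fingerprint: 'abc123', lastVerified: 0 };
console.log(shouldRecheck(record, weekMs + 1, weekMs)); // true: stale, verify again
```

In practice the recheck interval would itself adapt, for example growing for pages that stay duplicates across many verifications.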
We've explored the complete landscape of duplicate detection—from simple URL matching to sophisticated near-duplicate detection at scale.
What's Next:
The next page explores Distributed Crawling—how to scale a crawler across hundreds of machines while maintaining coordination, consistency, and efficiency. We'll cover worker architecture, work distribution, failure handling, and the coordination mechanisms that tie distributed crawler components together.
You now understand how to detect duplicates at web scale using a combination of exact matching and similarity-preserving hashing techniques. Effective deduplication can reduce storage costs by 25-30% and improve crawl efficiency proportionally—a significant optimization for any large-scale crawler.