Hash functions transform strings into integers, enabling O(1) comparison. But this compression comes with a fundamental mathematical consequence: collisions are inevitable.
When two different strings produce the same hash value, we have a collision. In pattern matching, this means a substring might have the same hash as our pattern even though the actual characters differ. If we only compared hashes, we'd report false matches.
This page examines collisions deeply—why they happen, how to detect them, their impact on performance, and strategies to minimize their occurrence. Understanding collisions is essential for building robust, production-ready Rabin-Karp implementations.
By the end of this page, you will: (1) understand why collisions are mathematically unavoidable, (2) see how Rabin-Karp maintains correctness despite collisions, (3) analyze the expected number of collisions, (4) implement verification strategies, and (5) apply techniques to reduce collision probability.
The mathematical foundation of collisions comes from a simple counting argument called the pigeonhole principle.
The pigeonhole principle:
If you have n items and k containers where n > k, at least one container must hold more than one item.
Applied to hashing, the strings are the items and the hash values are the containers.
For example, with lowercase English letters (σ = 26), a 5-character pattern has 26^5 = 11,881,376 possible values, and a 20-character pattern has 26^20 ≈ 2 × 10^28.
But consider: a hash function with modulus q can produce at most q distinct values, no matter how long the input strings are.
Once σ^m exceeds q, there are vastly more strings than hash values, so many strings must share hashes.
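A quick check makes the counting concrete (assuming a typical modulus near 10^9): once 26^m exceeds q, the pigeonhole principle forces collisions.

```python
q = 10**9 + 7  # a typical Rabin-Karp modulus

# Compare the number of distinct strings (26^m) against the number of
# available hash values (q): once 26^m > q, collisions are forced.
for m in (5, 7, 20):
    forced = 26**m > q
    print(f"m={m}: 26^{m} = {26**m}, collisions forced: {forced}")
```

For m = 5 the string count still fits below q, but from m = 7 onward collisions are mathematically guaranteed.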
Birthday paradox intuition:
The famous birthday paradox states that in a group of just 23 people, there's a >50% chance two share a birthday (out of 365). The probability of collision grows faster than intuition suggests.
For hash functions with q possible values, the probability of at least one collision among n hashes is approximately:
P(collision) ≈ 1 - e^(-n²/2q)
With q = 10^9 and n = 50,000 hashes (a modest text):
P(collision) ≈ 1 - e^(-50000²/2×10^9)
≈ 1 - e^(-1.25)
≈ 0.71 (71%!)
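As a sanity check, the approximation above can be evaluated directly (a small sketch using the birthday bound):

```python
import math

def birthday_collision_prob(n: int, q: int) -> float:
    """Approximate probability of at least one collision among n hashes
    drawn uniformly from q possible values (birthday bound)."""
    return 1 - math.exp(-n * n / (2 * q))

p = birthday_collision_prob(50_000, 10**9)
print(f"P(collision) ≈ {p:.2f}")  # ≈ 0.71
```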
Key insight: Even with a large modulus, collisions become probable with enough hash comparisons. Rabin-Karp must handle this gracefully.
A collision doesn't mean your hash function is broken. It's a mathematical certainty when compressing large data into small fingerprints. The goal isn't to eliminate collisions—it's to minimize them and handle them correctly when they occur.
The Rabin-Karp algorithm maintains correctness through a simple but crucial strategy: verification on hash match.
The two-phase comparison:
Hash comparison (O(1)): Compare hash values
Character verification (O(m)): Compare actual characters
Why this works:
The verification step catches all collisions, ensuring we never report false matches.
```python
def rabin_karp_with_verification(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp with explicit collision handling via verification.
    Returns all positions where pattern occurs in text.
    """
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []

    base = 31
    mod = 10**9 + 7

    # Compute pattern hash
    pattern_hash = 0
    for c in pattern:
        pattern_hash = (pattern_hash * base + ord(c)) % mod

    # Compute first window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Precompute base^(m-1) for rolling
    base_power = pow(base, m - 1, mod)

    matches = []
    collision_count = 0  # Track collisions for analysis

    for i in range(n - m + 1):
        # Phase 1: Hash comparison (O(1))
        if window_hash == pattern_hash:
            # Phase 2: Character verification (O(m))
            if text[i:i+m] == pattern:
                matches.append(i)  # Genuine match
            else:
                collision_count += 1  # False positive (collision)

        # Roll the hash
        if i < n - m:
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
            if window_hash < 0:
                window_hash += mod

    if collision_count > 0:
        print(f"Note: {collision_count} hash collision(s) detected and handled")

    return matches


# Example usage
text = "AABAACAADAABAABA"
pattern = "AABA"
result = rabin_karp_with_verification(text, pattern)
print(f"Pattern found at positions: {result}")
# Output: Pattern found at positions: [0, 9, 12]
```

Think of the hash comparison as a fast filter that eliminates most non-matches in O(1). The verification is the safety net that catches the rare false positives. This two-phase approach gives us correctness guarantees while still being fast in the average case.
Not all collisions are equal. Understanding different collision types helps in analyzing algorithm behavior and choosing mitigation strategies.
Type 1: Spurious hits (false positives)
A text substring has the same hash as the pattern but different content.
Pattern: "ABC" hash = 12345
Substring: "XYZ" hash = 12345
→ Hash match, but "ABC" ≠ "XYZ"
→ Verification catches this
Impact: Extra O(m) verification work per spurious hit.
Type 2: Benign collisions (among text substrings)
Two text substrings have the same hash, but neither matches the pattern hash.
Pattern hash: 99999
Text[0:3] = "AAA" hash = 55555
Text[4:7] = "BBB" hash = 55555
→ Neither matches pattern hash, so no extra work
Impact: None—we never compare these substrings to the pattern.
| Type | Description | Detection | Performance Impact |
|---|---|---|---|
| Spurious Hit | Text substring hash = pattern hash, but strings differ | Verification phase | O(m) per occurrence |
| Benign Collision | Two text substrings share hash, neither = pattern hash | Not detected (irrelevant) | None |
| True Match | Text substring hash = pattern hash AND strings match | Verification confirms | O(m) required work |
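A spurious hit is easy to manufacture with a deliberately tiny modulus. The sketch below (illustrative helper `poly_hash` with mod = 97, not a production parameter) brute-forces two distinct strings that share a hash, which is guaranteed by pigeonhole since there are 3^5 = 243 candidate strings and only 97 buckets:

```python
from itertools import product

def poly_hash(s: str, base: int = 31, mod: int = 97) -> int:
    """Polynomial hash with a deliberately tiny modulus."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

# 3^5 = 243 distinct strings over "ABC" but only 97 hash buckets:
# the pigeonhole principle guarantees a collision.
seen = {}
collision = None
for tup in product("ABC", repeat=5):
    s = "".join(tup)
    h = poly_hash(s)
    if h in seen:
        collision = (seen[h], s)
        break
    seen[h] = s

a, b = collision
print(f"{a!r} and {b!r} both hash to {poly_hash(a)}")
```

If either of these strings were the pattern, the other would be a spurious hit that only verification can reject.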
Probability analysis:
Assuming uniform hash distribution with modulus q, a non-matching substring collides with the pattern's hash with probability ≈ 1/q, so the expected number of spurious hits per search is ≈ (n − m + 1)/q.
With q = 10^9 and n = 10^6, that works out to roughly 10^6/10^9 = 10^-3 spurious hits per search.
In practice, we expect about 1 spurious hit per 1000 pattern searches of million-character texts. The verification overhead is negligible.
The 'expected' analysis assumes random or typical input. A malicious adversary who knows your hash parameters can craft input where every position is a spurious hit, forcing O(nm) worst case. This is why security-sensitive applications need additional measures (see later sections).
The verification phase seems simple—just compare strings character by character. But there are implementation nuances that affect both correctness and performance.
Basic verification:
```python
if text[i:i+m] == pattern:
    # match found
    ...
```
In Python, this is clean and efficient because string slicing and comparison are highly optimized C operations.
Character-by-character verification:
In lower-level languages or when avoiding string allocation:
```c
bool matches = true;
for (int j = 0; j < m && matches; j++) {
    if (text[i + j] != pattern[j]) {
        matches = false;
    }
}
```
This can be faster because it short-circuits on first mismatch.
```python
def verify_match(text: str, pattern: str, start: int) -> bool:
    """
    Verify if text[start:start+m] equals pattern.
    Short-circuits on first mismatch for efficiency.
    """
    m = len(pattern)
    for j in range(m):
        if text[start + j] != pattern[j]:
            return False
    return True


def verify_match_optimized(text: str, pattern: str, start: int) -> bool:
    """
    Optimized verification with early termination heuristics.
    Checks first, last, and middle characters before full scan.
    This catches many mismatches faster for long patterns.
    """
    m = len(pattern)
    # Quick checks at strategic positions
    if text[start] != pattern[0]:
        return False
    if text[start + m - 1] != pattern[m - 1]:
        return False
    if m > 2 and text[start + m // 2] != pattern[m // 2]:
        return False
    # Full verification if quick checks pass
    for j in range(1, m - 1):
        if text[start + j] != pattern[j]:
            return False
    return True


# Benchmarking
import time

text = "A" * 1000000 + "B"    # Million 'A's followed by 'B'
pattern = "A" * 999999 + "B"  # Almost matches at position 0, matches at position 1

# Test at position 0 (mismatch at last character)
start = time.time()
for _ in range(1000):
    verify_match(text, pattern, 0)
print(f"Basic verification: {time.time() - start:.4f}s")

start = time.time()
for _ in range(1000):
    verify_match_optimized(text, pattern, 0)
print(f"Optimized verification: {time.time() - start:.4f}s")
```

The optimized verification is most helpful for long patterns where spurious hits might match many initial characters before diverging. For short patterns or language built-ins with SIMD optimization, basic comparison is often faster.
While collisions are inevitable, we can dramatically reduce their probability through careful parameter selection and algorithm modifications.
Strategy 1: Use a larger modulus
Collision probability at each position ≈ 1/q. Doubling q halves collision probability.
q = 10^9: P(collision) ≈ 10^-9 per position
q = 10^18: P(collision) ≈ 10^-18 per position
Tradeoff: Larger q requires 64-bit or 128-bit arithmetic.
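As a sketch of the first strategy (assuming Python's arbitrary-precision integers; in C or C++ the multiplication would need a 128-bit intermediate), the Mersenne prime 2^61 − 1 is a popular large modulus:

```python
MOD = (1 << 61) - 1  # Mersenne prime 2^61 - 1, a common large modulus

def poly_hash_61(s: str, base: int = 131) -> int:
    """Polynomial hash under a ~2^61 modulus: per-position collision
    probability drops to roughly 4 * 10^-19."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % MOD
    return h

print(poly_hash_61("collision-resistant"))
```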
Strategy 2: Double hashing
Compute two independent hashes with different (base, modulus) pairs. Report a match only if BOTH hashes match.
P(both collide) = P(first collides) × P(second collides)
= (1/q₁) × (1/q₂)
≈ 10^-18 with two 10^9 moduli
This is the most reliable collision reduction technique.
```python
def rabin_karp_double_hash(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp with double hashing for near-zero collision probability.
    Uses two independent hash functions. Both must match for a potential hit.
    This reduces false positive probability to ~10^-18.
    """
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []

    # Two independent hash parameters
    base1, mod1 = 31, 10**9 + 7
    base2, mod2 = 37, 10**9 + 9  # Different prime base and modulus

    def compute_hash(s: str, base: int, mod: int) -> int:
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    # Pattern hashes
    ph1 = compute_hash(pattern, base1, mod1)
    ph2 = compute_hash(pattern, base2, mod2)

    # Initial window hashes
    wh1 = compute_hash(text[:m], base1, mod1)
    wh2 = compute_hash(text[:m], base2, mod2)

    # Base powers for rolling
    bp1 = pow(base1, m - 1, mod1)
    bp2 = pow(base2, m - 1, mod2)

    matches = []
    for i in range(n - m + 1):
        # Both hashes must match for potential hit
        if wh1 == ph1 and wh2 == ph2:
            # With double hashing, verification is almost always redundant
            # but we include it for absolute correctness
            if text[i:i+m] == pattern:
                matches.append(i)

        if i < n - m:
            # Roll both hashes
            out_c = ord(text[i])
            in_c = ord(text[i + m])
            wh1 = ((wh1 - out_c * bp1) * base1 + in_c) % mod1
            wh2 = ((wh2 - out_c * bp2) * base2 + in_c) % mod2
            if wh1 < 0:
                wh1 += mod1
            if wh2 < 0:
                wh2 += mod2

    return matches


# Example usage
text = "ABABCABABCABABC"
pattern = "ABABC"
print(f"Matches at: {rabin_karp_double_hash(text, pattern)}")
```

| Strategy | Collision Probability | Overhead | Use Case |
|---|---|---|---|
| Single hash (q=10^9) | ~10^-9 per position | None | Most applications |
| Single hash (q=10^18) | ~10^-18 per position | 128-bit math | High reliability |
| Double hash (q₁,q₂=10^9) | ~10^-18 per position | 2× hash work | Security-sensitive |
| Triple hash | ~10^-27 per position | 3× hash work | Cryptographic |
In adversarial settings (e.g., web applications where attackers can craft input), use randomly chosen hash parameters at runtime. This prevents attackers from pre-computing collision-inducing inputs. Generate base and modulus from a cryptographic random source at startup.
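A minimal sketch of that runtime randomization (the prime list below reuses the moduli that appear elsewhere on this page; a fuller implementation might generate a fresh random prime instead):

```python
import secrets

# Illustrative pool of known large primes (same moduli used earlier on this page).
CANDIDATE_PRIMES = [10**9 + 7, 10**9 + 9, 10**9 + 21]

def random_hash_params() -> tuple[int, int]:
    """Pick (base, mod) from a cryptographic random source at startup,
    so attackers cannot precompute collision-inducing inputs."""
    mod = secrets.choice(CANDIDATE_PRIMES)
    base = secrets.randbelow(mod - 256) + 256  # base comfortably above char codes
    return base, mod

base, mod = random_hash_params()
print(f"base={base}, mod={mod}")
```

Because the parameters are chosen per process, an input crafted to collide under one (base, mod) pair is no more dangerous than random data under another.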
Understanding worst-case scenarios helps us recognize when Rabin-Karp might underperform and when to consider alternative algorithms.
Worst case 1: Adversarial input
An attacker who knows your hash parameters can construct strings that all hash to the same value. This forces verification at every position.
Example (simplified):
If hash("AAA") = 1000 and attacker finds hash("XYZ") = 1000,
they can fill the entire text with hash-1000 substrings.
→ Every position requires O(m) verification
→ Total: O(nm)
Worst case 2: Many actual matches
If the pattern appears at many positions, we must verify each one.
Text: "AAAAAAAAAA" (n = 10)
Pattern: "AA" (m = 2)
Matches at: 0, 1, 2, 3, 4, 5, 6, 7, 8 (9 matches!)
→ We verify 9 times, each O(2)
→ But this is correct, necessary work
Worst case 3: Hash function weakness
Poorly chosen hash parameters can create systematic collisions.
Example: base = 256, mod = 256 (terrible choice)
Since base ≡ 0 (mod 256), every term except the last vanishes:
hash(s) = code of the last character of s
→ "ABC" and "XYC" hash to the same value
→ Massive collision rate
Worst case 4: Small modulus with many comparisons
Text length: 10^6
Modulus: 10^4 (too small)
Expected collisions: 10^6 / 10^4 = 100 per pattern search
→ 100 extra O(m) verifications
→ Significant slowdown
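To see the effect, here is a small experiment (illustrative helper on random text with a fixed seed; exact counts will vary with the seed): an undersized modulus produces measurably more spurious hits than a large one.

```python
import random

def count_spurious(text: str, pattern: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """Count spurious hits: hash matches that fail character verification."""
    n, m = len(text), len(pattern)
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    wh = 0
    for i in range(m):
        wh = (wh * base + ord(text[i])) % mod
    bp = pow(base, m - 1, mod)
    spurious = 0
    for i in range(n - m + 1):
        if wh == ph and text[i:i+m] != pattern:
            spurious += 1
        if i < n - m:
            wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
    return spurious

random.seed(0)  # reproducible demo
text = "".join(random.choices("ABCDEFGHIJ", k=100_000))
pattern = "".join(random.choices("ABCDEFGHIJ", k=8))

s_small = count_spurious(text, pattern, mod=10**4)      # roughly n/q ≈ 10 expected
s_large = count_spurious(text, pattern, mod=10**9 + 7)  # roughly 0 expected
print(f"mod=10^4: {s_small} spurious hits; mod=10^9+7: {s_large}")
```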
Mitigation strategies: use a sufficiently large (or randomized) modulus, verify every hash match, add a second independent hash in high-stakes settings, and fall back to a worst-case-linear algorithm when guarantees matter.
If you're in a security-sensitive context where adversarial input is possible, or if you need guaranteed O(n+m) worst case, consider using KMP instead. Rabin-Karp's strength is its simplicity and its ability to search for multiple patterns simultaneously—not its worst-case guarantees.
Let's derive the expected number of collisions and understand how they affect overall complexity.
Expected collisions per search:
Assuming uniform hash distribution, each of the n − m + 1 window positions collides with the pattern hash with probability ≈ 1/q, for an expected (n − m + 1)/q ≈ n/q spurious hits per search.
Example calculations:
| Text Length (n) | Pattern Length (m) | Modulus (q) | Expected Spurious Hits |
|---|---|---|---|
| 1,000 | 10 | 10^9 | ≈ 0.000001 |
| 1,000,000 | 100 | 10^9 | ≈ 0.001 |
| 1,000,000,000 | 100 | 10^9 | ≈ 1 |
| 1,000,000,000 | 100 | 10^18 | ≈ 10^-9 |
Expected running time analysis:
Total time = Preprocessing + Hash comparisons + Verifications
= O(m) + O(n) × O(1) + (matches + spurious) × O(m)
= O(m) + O(n) + O(k × m)
Where k = actual matches + spurious hits.
For typical cases (few matches, rare collisions): k × m ≪ n, so the total time is O(n + m).
For pathological cases (many matches or collisions): k can approach n, and the total degrades toward O(nm).
Key insight: The algorithm is fast when the pattern is somewhat selective (doesn't match everywhere) and hash collisions are rare.
```python
def analyze_collisions(text: str, pattern: str, runs: int = 100) -> dict:
    """
    Experimentally analyze collision statistics for given text/pattern.
    Runs Rabin-Karp multiple times with different random seeds
    and collects collision statistics.
    """
    import random

    n, m = len(text), len(pattern)
    collision_counts = []

    for _ in range(runs):
        # Randomize hash parameters
        base = random.choice([31, 37, 41, 43, 47, 53])
        mod = random.choice([10**9 + 7, 10**9 + 9, 10**9 + 21])

        collisions = 0

        # Compute pattern hash
        ph = 0
        for c in pattern:
            ph = (ph * base + ord(c)) % mod

        # Compute and compare window hashes
        wh = 0
        for i in range(m):
            wh = (wh * base + ord(text[i])) % mod

        bp = pow(base, m - 1, mod)

        for i in range(n - m + 1):
            if wh == ph and text[i:i+m] != pattern:
                collisions += 1
            if i < n - m:
                wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
                if wh < 0:
                    wh += mod

        collision_counts.append(collisions)

    return {
        "mean_collisions": sum(collision_counts) / len(collision_counts),
        "max_collisions": max(collision_counts),
        "min_collisions": min(collision_counts),
        "expected_collisions": (n - m + 1) / (10**9),  # Theoretical
    }


# Test on various inputs
print("Random text analysis:")
import random
random_text = ''.join(random.choices('ABCDEFGHIJ', k=10000))
random_pattern = ''.join(random.choices('ABCDEFGHIJ', k=5))
print(analyze_collisions(random_text, random_pattern))

print("Repetitive text analysis:")
repetitive_text = "AB" * 5000
repetitive_pattern = "ABAB"
print(analyze_collisions(repetitive_text, repetitive_pattern))
```

Rabin-Karp can be implemented in two fundamentally different modes, named after famous gambling destinations:
Monte Carlo variant (skip verification): report every hash match as a match without comparing characters. Runs in guaranteed O(n + m), but may report false positives with small probability.
Las Vegas variant (always verify): confirm every hash match by character comparison. The answer is always correct, but the running time is probabilistic: O(n + m) expected, O(nm) in the worst case.
The tradeoff:
| Variant | Time Complexity | Correctness | Use Case |
|---|---|---|---|
| Monte Carlo | O(n+m) guaranteed | Probabilistic (may have false positives) | Probabilistic data structures, approximate matching |
| Las Vegas | O(n+m) expected, O(nm) worst | Always correct | Standard pattern matching |
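A minimal Monte Carlo sketch (single hash for brevity; the only change from the verified version is that hash equality alone is trusted):

```python
def rabin_karp_monte_carlo(text: str, pattern: str,
                           base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Monte Carlo Rabin-Karp: trusts the hash and never verifies.
    Guaranteed O(n + m) time, but may (very rarely) report a false positive."""
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    wh = 0
    for i in range(m):
        wh = (wh * base + ord(text[i])) % mod
    bp = pow(base, m - 1, mod)
    hits = []
    for i in range(n - m + 1):
        if wh == ph:
            hits.append(i)  # reported without character verification
        if i < n - m:
            wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp_monte_carlo("AABAACAADAABAABA", "AABA"))
# → [0, 9, 12]
```

On this small input the hashes never wrap around the modulus, so the reported positions are exactly the true matches.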
Monte Carlo applications:
The Monte Carlo variant is surprisingly useful in specific contexts: probabilistic data structures, approximate matching, and settings where a hard O(n + m) time bound matters more than absolute certainty.
Making Monte Carlo reliable:
With double or triple hashing, the Monte Carlo variant becomes practically equivalent to Las Vegas:
P(false positive) = 1/(q₁ × q₂) ≈ 10^-18
In the lifetime of the universe, you won't see a false positive.
If you're using double hashing with large primes (both > 10^9), the probability of a spurious hit is so low that verification becomes practically redundant. You can skip it for a slight speedup—though most production code keeps verification for defense-in-depth.
We've covered hash collisions comprehensively. Let's consolidate the key insights: collisions are mathematically unavoidable (pigeonhole principle and the birthday bound); verification on every hash match preserves correctness; the expected number of spurious hits is roughly (n − m + 1)/q, negligible for large moduli; double hashing drives the false-positive probability to around 10^-18; and randomized hash parameters defend against adversarial input.
What's next:
Now that we understand the complete mechanics of Rabin-Karp—hashing, rolling updates, and collision handling—we're ready to analyze its complexity rigorously. The next page provides detailed time complexity analysis, including expected case, worst case, and the conditions that affect performance.
You now have a complete understanding of hash collisions in Rabin-Karp—why they occur, how to handle them correctly, and strategies to minimize their impact. This knowledge is essential for building robust, production-ready string matching implementations.