Hash functions transform strings into integers, enabling O(1) comparison. But this compression comes with a fundamental mathematical consequence: collisions are inevitable.
When two different strings produce the same hash value, we have a collision. In pattern matching, this means a substring might have the same hash as our pattern even though the actual characters differ. If we only compared hashes, we'd report false matches.
This page examines collisions deeply—why they happen, how to detect them, their impact on performance, and strategies to minimize their occurrence. Understanding collisions is essential for building robust, production-ready Rabin-Karp implementations.
By the end of this page, you will: (1) understand why collisions are mathematically unavoidable, (2) see how Rabin-Karp maintains correctness despite collisions, (3) analyze the expected number of collisions, (4) implement verification strategies, and (5) apply techniques to reduce collision probability.
The mathematical foundation of collisions comes from a simple counting argument called the pigeonhole principle.
The pigeonhole principle:
If you have n items and k containers where n > k, at least one container must hold more than one item.
Applied to hashing, the strings are the items and the hash values are the containers.
For example, with lowercase English letters (σ = 26), a 5-character pattern has 26^5 = 11,881,376 possible values, and a 20-character pattern has 26^20 ≈ 2 × 10^28.
But consider: a hash function with modulus q can produce at most q distinct values, no matter how long the input strings are.
Once σ^m exceeds q, there are vastly more strings than hash values, so many strings must share hashes.
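A quick check makes the counting concrete (assuming a typical modulus near 10^9): once 26^m exceeds q, the pigeonhole principle forces collisions.

```python
q = 10**9 + 7  # a typical Rabin-Karp modulus

# Compare the number of distinct strings (26^m) against the number of
# available hash values (q): once 26^m > q, collisions are forced.
for m in (5, 7, 20):
    forced = 26**m > q
    print(f"m={m}: 26^{m} = {26**m}, collisions forced: {forced}")
```

For m = 5 the string count still fits below q, but from m = 7 onward collisions are mathematically guaranteed.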
Birthday paradox intuition:
The famous birthday paradox states that in a group of just 23 people, there's a >50% chance two share a birthday (out of 365). The probability of collision grows faster than intuition suggests.
For hash functions with q possible values, the probability of at least one collision among n hashes is approximately:
P(collision) ≈ 1 - e^(-n²/2q)
With q = 10^9 and n = 50,000 hashes (a modest text):
P(collision) ≈ 1 - e^(-50000²/2×10^9)
≈ 1 - e^(-1.25)
≈ 0.71 (71%!)
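As a sanity check, the approximation above can be evaluated directly (a small sketch using the birthday bound):

```python
import math

def birthday_collision_prob(n: int, q: int) -> float:
    """Approximate probability of at least one collision among n hashes
    drawn uniformly from q possible values (birthday bound)."""
    return 1 - math.exp(-n * n / (2 * q))

p = birthday_collision_prob(50_000, 10**9)
print(f"P(collision) ≈ {p:.2f}")  # ≈ 0.71
```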
Key insight: Even with a large modulus, collisions become probable with enough hash comparisons. Rabin-Karp must handle this gracefully.
A collision doesn't mean your hash function is broken. It's a mathematical certainty when compressing large data into small fingerprints. The goal isn't to eliminate collisions—it's to minimize them and handle them correctly when they occur.
The Rabin-Karp algorithm maintains correctness through a simple but crucial strategy: verification on hash match.
The two-phase comparison:
Hash comparison (O(1)): Compare hash values
Character verification (O(m)): Compare actual characters
Why this works:
The verification step catches all collisions, ensuring we never report false matches.
```python
def rabin_karp_with_verification(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp with explicit collision handling via verification.
    Returns all positions where pattern occurs in text.
    """
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []

    base = 31
    mod = 10**9 + 7

    # Compute pattern hash
    pattern_hash = 0
    for c in pattern:
        pattern_hash = (pattern_hash * base + ord(c)) % mod

    # Compute first window hash
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Precompute base^(m-1) for rolling
    base_power = pow(base, m - 1, mod)

    matches = []
    collision_count = 0  # Track collisions for analysis

    for i in range(n - m + 1):
        # Phase 1: Hash comparison (O(1))
        if window_hash == pattern_hash:
            # Phase 2: Character verification (O(m))
            if text[i:i+m] == pattern:
                matches.append(i)  # Genuine match
            else:
                collision_count += 1  # False positive (collision)

        # Roll the hash
        if i < n - m:
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod
            if window_hash < 0:
                window_hash += mod

    if collision_count > 0:
        print(f"Note: {collision_count} hash collision(s) detected and handled")

    return matches


# Example usage
text = "AABAACAADAABAABA"
pattern = "AABA"
result = rabin_karp_with_verification(text, pattern)
print(f"Pattern found at positions: {result}")
# Output: Pattern found at positions: [0, 9, 12]
```

Think of the hash comparison as a fast filter that eliminates most non-matches in O(1). The verification is the safety net that catches the rare false positives. This two-phase approach gives us correctness guarantees while still being fast in the average case.
Not all collisions are equal. Understanding different collision types helps in analyzing algorithm behavior and choosing mitigation strategies.
Type 1: Spurious hits (false positives)
A text substring has the same hash as the pattern but different content.
Pattern: "ABC" hash = 12345
Substring: "XYZ" hash = 12345
→ Hash match, but "ABC" ≠ "XYZ"
→ Verification catches this
Impact: Extra O(m) verification work per spurious hit.
Type 2: Benign collisions (among text substrings)
Two text substrings have the same hash, but neither matches the pattern hash.
Pattern hash: 99999
Text[0:3] = "AAA" hash = 55555
Text[4:7] = "BBB" hash = 55555
→ Neither matches pattern hash, so no extra work
Impact: None—we never compare these substrings to the pattern.
| Type | Description | Detection | Performance Impact |
|---|---|---|---|
| Spurious Hit | Text substring hash = pattern hash, but strings differ | Verification phase | O(m) per occurrence |
| Benign Collision | Two text substrings share hash, neither = pattern hash | Not detected (irrelevant) | None |
| True Match | Text substring hash = pattern hash AND strings match | Verification confirms | O(m) required work |
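A spurious hit is easy to manufacture with a deliberately tiny modulus. The sketch below (illustrative helper `poly_hash` with mod = 97, not a production parameter) brute-forces two distinct strings that share a hash, which is guaranteed by pigeonhole since there are 3^5 = 243 candidate strings and only 97 buckets:

```python
from itertools import product

def poly_hash(s: str, base: int = 31, mod: int = 97) -> int:
    """Polynomial hash with a deliberately tiny modulus."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

# 3^5 = 243 distinct strings over "ABC" but only 97 hash buckets:
# the pigeonhole principle guarantees a collision.
seen = {}
collision = None
for tup in product("ABC", repeat=5):
    s = "".join(tup)
    h = poly_hash(s)
    if h in seen:
        collision = (seen[h], s)
        break
    seen[h] = s

a, b = collision
print(f"{a!r} and {b!r} both hash to {poly_hash(a)}")
```

If either of these strings were the pattern, the other would be a spurious hit that only verification can reject.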
Probability analysis:
Assuming uniform hash distribution with modulus q, a non-matching substring collides with the pattern's hash with probability ≈ 1/q, so the expected number of spurious hits per search is ≈ (n − m + 1)/q.
With q = 10^9 and n = 10^6, that works out to roughly 10^6/10^9 = 10^-3 spurious hits per search.
In practice, we expect about 1 spurious hit per 1000 pattern searches of million-character texts. The verification overhead is negligible.
The 'expected' analysis assumes random or typical input. A malicious adversary who knows your hash parameters can craft input where every position is a spurious hit, forcing O(nm) worst case. This is why security-sensitive applications need additional measures (see later sections).
The verification phase seems simple—just compare strings character by character. But there are implementation nuances that affect both correctness and performance.
Basic verification:
```python
if text[i:i+m] == pattern:
    # match found
    ...
```
In Python, this is clean and efficient because string slicing and comparison are highly optimized C operations.
Character-by-character verification:
In lower-level languages or when avoiding string allocation:
```c
bool matches = true;
for (int j = 0; j < m && matches; j++) {
    if (text[i + j] != pattern[j]) {
        matches = false;
    }
}
```
This can be faster because it short-circuits on first mismatch.
```python
def verify_match(text: str, pattern: str, start: int) -> bool:
    """
    Verify if text[start:start+m] equals pattern.
    Short-circuits on first mismatch for efficiency.
    """
    m = len(pattern)
    for j in range(m):
        if text[start + j] != pattern[j]:
            return False
    return True


def verify_match_optimized(text: str, pattern: str, start: int) -> bool:
    """
    Optimized verification with early termination heuristics.
    Checks first, last, and middle characters before full scan.
    This catches many mismatches faster for long patterns.
    """
    m = len(pattern)
    # Quick checks at strategic positions
    if text[start] != pattern[0]:
        return False
    if text[start + m - 1] != pattern[m - 1]:
        return False
    if m > 2 and text[start + m // 2] != pattern[m // 2]:
        return False
    # Full verification if quick checks pass
    for j in range(1, m - 1):
        if text[start + j] != pattern[j]:
            return False
    return True


# Benchmarking
import time

text = "A" * 1000000 + "B"    # Million 'A's followed by 'B'
pattern = "A" * 999999 + "B"  # Almost matches at position 0, matches at position 1

# Test at position 0 (mismatch at last character)
start = time.time()
for _ in range(1000):
    verify_match(text, pattern, 0)
print(f"Basic verification: {time.time() - start:.4f}s")

start = time.time()
for _ in range(1000):
    verify_match_optimized(text, pattern, 0)
print(f"Optimized verification: {time.time() - start:.4f}s")
```

The optimized verification is most helpful for long patterns where spurious hits might match many initial characters before diverging. For short patterns or language built-ins with SIMD optimization, basic comparison is often faster.
While collisions are inevitable, we can dramatically reduce their probability through careful parameter selection and algorithm modifications.
Strategy 1: Use a larger modulus
Collision probability at each position ≈ 1/q. Doubling q halves collision probability.
q = 10^9: P(collision) ≈ 10^-9 per position
q = 10^18: P(collision) ≈ 10^-18 per position
Tradeoff: Larger q requires 64-bit or 128-bit arithmetic.
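As a sketch of the first strategy (assuming Python's arbitrary-precision integers; in C or C++ the multiplication would need a 128-bit intermediate), the Mersenne prime 2^61 − 1 is a popular large modulus:

```python
MOD = (1 << 61) - 1  # Mersenne prime 2^61 - 1, a common large modulus

def poly_hash_61(s: str, base: int = 131) -> int:
    """Polynomial hash under a ~2^61 modulus: per-position collision
    probability drops to roughly 4 * 10^-19."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % MOD
    return h

print(poly_hash_61("collision-resistant"))
```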
Strategy 2: Double hashing
Compute two independent hashes with different (base, modulus) pairs. Report a match only if BOTH hashes match.
P(both collide) = P(first collides) × P(second collides)
= (1/q₁) × (1/q₂)
≈ 10^-18 with two 10^9 moduli
This is the most reliable collision reduction technique.
```python
def rabin_karp_double_hash(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp with double hashing for near-zero collision probability.
    Uses two independent hash functions. Both must match for a potential hit.
    This reduces false positive probability to ~10^-18.
    """
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []

    # Two independent hash parameters
    base1, mod1 = 31, 10**9 + 7
    base2, mod2 = 37, 10**9 + 9  # Different prime base and modulus

    def compute_hash(s: str, base: int, mod: int) -> int:
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        return h

    # Pattern hashes
    ph1 = compute_hash(pattern, base1, mod1)
    ph2 = compute_hash(pattern, base2, mod2)

    # Initial window hashes
    wh1 = compute_hash(text[:m], base1, mod1)
    wh2 = compute_hash(text[:m], base2, mod2)

    # Base powers for rolling
    bp1 = pow(base1, m - 1, mod1)
    bp2 = pow(base2, m - 1, mod2)

    matches = []
    for i in range(n - m + 1):
        # Both hashes must match for potential hit
        if wh1 == ph1 and wh2 == ph2:
            # With double hashing, verification is almost always redundant
            # but we include it for absolute correctness
            if text[i:i+m] == pattern:
                matches.append(i)

        if i < n - m:
            # Roll both hashes
            out_c = ord(text[i])
            in_c = ord(text[i + m])
            wh1 = ((wh1 - out_c * bp1) * base1 + in_c) % mod1
            wh2 = ((wh2 - out_c * bp2) * base2 + in_c) % mod2
            if wh1 < 0:
                wh1 += mod1
            if wh2 < 0:
                wh2 += mod2

    return matches


# Example usage
text = "ABABCABABCABABC"
pattern = "ABABC"
print(f"Matches at: {rabin_karp_double_hash(text, pattern)}")
```

| Strategy | Collision Probability | Overhead | Use Case |
|---|---|---|---|
| Single hash (q=10^9) | ~10^-9 per position | None | Most applications |
| Single hash (q=10^18) | ~10^-18 per position | 128-bit math | High reliability |
| Double hash (q₁,q₂=10^9) | ~10^-18 per position | 2× hash work | Security-sensitive |
| Triple hash | ~10^-27 per position | 3× hash work | Cryptographic |
In adversarial settings (e.g., web applications where attackers can craft input), use randomly chosen hash parameters at runtime. This prevents attackers from pre-computing collision-inducing inputs. Generate base and modulus from a cryptographic random source at startup.
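A minimal sketch of that runtime randomization (the prime list below reuses the moduli that appear elsewhere on this page; a fuller implementation might generate a fresh random prime instead):

```python
import secrets

# Illustrative pool of known large primes (same moduli used earlier on this page).
CANDIDATE_PRIMES = [10**9 + 7, 10**9 + 9, 10**9 + 21]

def random_hash_params() -> tuple[int, int]:
    """Pick (base, mod) from a cryptographic random source at startup,
    so attackers cannot precompute collision-inducing inputs."""
    mod = secrets.choice(CANDIDATE_PRIMES)
    base = secrets.randbelow(mod - 256) + 256  # base comfortably above char codes
    return base, mod

base, mod = random_hash_params()
print(f"base={base}, mod={mod}")
```

Because the parameters are chosen per process, an input crafted to collide under one (base, mod) pair is no more dangerous than random data under another.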
Understanding worst-case scenarios helps us recognize when Rabin-Karp might underperform and when to consider alternative algorithms.
Worst case 1: Adversarial input
An attacker who knows your hash parameters can construct strings that all hash to the same value. This forces verification at every position.
Example (simplified):
If hash("AAA") = 1000 and attacker finds hash("XYZ") = 1000,
they can fill the entire text with hash-1000 substrings.
→ Every position requires O(m) verification
→ Total: O(nm)
Worst case 2: Many actual matches
If the pattern appears at many positions, we must verify each one.
Text: "AAAAAAAAAA" (n = 10)
Pattern: "AA" (m = 2)
Matches at: 0, 1, 2, 3, 4, 5, 6, 7, 8 (9 matches!)
→ We verify 9 times, each O(2)
→ But this is correct, necessary work
Worst case 3: Hash function weakness
Poorly chosen hash parameters can create systematic collisions.
Example: base = 256, mod = 256 (terrible choice)
Since base ≡ 0 (mod 256), every term except the last vanishes:
hash(s) = code of the last character of s
→ "ABC" and "XYC" hash to the same value
→ Massive collision rate
Worst case 4: Small modulus with many comparisons
Text length: 10^6
Modulus: 10^4 (too small)
Expected collisions: 10^6 / 10^4 = 100 per pattern search
→ 100 extra O(m) verifications
→ Significant slowdown
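To see the effect, here is a small experiment (illustrative helper on random text with a fixed seed; exact counts will vary with the seed): an undersized modulus produces measurably more spurious hits than a large one.

```python
import random

def count_spurious(text: str, pattern: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """Count spurious hits: hash matches that fail character verification."""
    n, m = len(text), len(pattern)
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    wh = 0
    for i in range(m):
        wh = (wh * base + ord(text[i])) % mod
    bp = pow(base, m - 1, mod)
    spurious = 0
    for i in range(n - m + 1):
        if wh == ph and text[i:i+m] != pattern:
            spurious += 1
        if i < n - m:
            wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
    return spurious

random.seed(0)  # reproducible demo
text = "".join(random.choices("ABCDEFGHIJ", k=100_000))
pattern = "".join(random.choices("ABCDEFGHIJ", k=8))

s_small = count_spurious(text, pattern, mod=10**4)      # roughly n/q ≈ 10 expected
s_large = count_spurious(text, pattern, mod=10**9 + 7)  # roughly 0 expected
print(f"mod=10^4: {s_small} spurious hits; mod=10^9+7: {s_large}")
```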
Mitigation strategies: use a sufficiently large (or randomized) modulus, verify every hash match, add a second independent hash in high-stakes settings, and fall back to a worst-case-linear algorithm when guarantees matter.
If you're in a security-sensitive context where adversarial input is possible, or if you need guaranteed O(n+m) worst case, consider using KMP instead. Rabin-Karp's strength is its simplicity and its ability to search for multiple patterns simultaneously—not its worst-case guarantees.
Let's derive the expected number of collisions and understand how they affect overall complexity.
Expected collisions per search:
Assuming uniform hash distribution, each of the n − m + 1 window positions collides with the pattern hash with probability ≈ 1/q, for an expected (n − m + 1)/q ≈ n/q spurious hits per search.
Example calculations:
| Text Length (n) | Pattern Length (m) | Modulus (q) | Expected Spurious Hits |
|---|---|---|---|
| 1,000 | 10 | 10^9 | ≈ 0.000001 |
| 1,000,000 | 100 | 10^9 | ≈ 0.001 |
| 1,000,000,000 | 100 | 10^9 | ≈ 1 |
| 1,000,000,000 | 100 | 10^18 | ≈ 10^-9 |
Expected running time analysis:
Total time = Preprocessing + Hash comparisons + Verifications
= O(m) + O(n) × O(1) + (matches + spurious) × O(m)
= O(m) + O(n) + O(k × m)
Where k = actual matches + spurious hits.
For typical cases (few matches, rare collisions): k × m ≪ n, so the total time is O(n + m).
For pathological cases (many matches or collisions): k can approach n, and the total degrades toward O(nm).
Key insight: The algorithm is fast when the pattern is somewhat selective (doesn't match everywhere) and hash collisions are rare.
```python
def analyze_collisions(text: str, pattern: str, runs: int = 100) -> dict:
    """
    Experimentally analyze collision statistics for given text/pattern.
    Runs Rabin-Karp multiple times with different random seeds
    and collects collision statistics.
    """
    import random

    n, m = len(text), len(pattern)
    collision_counts = []

    for _ in range(runs):
        # Randomize hash parameters
        base = random.choice([31, 37, 41, 43, 47, 53])
        mod = random.choice([10**9 + 7, 10**9 + 9, 10**9 + 21])

        collisions = 0

        # Compute pattern hash
        ph = 0
        for c in pattern:
            ph = (ph * base + ord(c)) % mod

        # Compute and compare window hashes
        wh = 0
        for i in range(m):
            wh = (wh * base + ord(text[i])) % mod

        bp = pow(base, m - 1, mod)

        for i in range(n - m + 1):
            if wh == ph and text[i:i+m] != pattern:
                collisions += 1
            if i < n - m:
                wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
                if wh < 0:
                    wh += mod

        collision_counts.append(collisions)

    return {
        "mean_collisions": sum(collision_counts) / len(collision_counts),
        "max_collisions": max(collision_counts),
        "min_collisions": min(collision_counts),
        "expected_collisions": (n - m + 1) / (10**9),  # Theoretical
    }


# Test on various inputs
print("Random text analysis:")
import random
random_text = ''.join(random.choices('ABCDEFGHIJ', k=10000))
random_pattern = ''.join(random.choices('ABCDEFGHIJ', k=5))
print(analyze_collisions(random_text, random_pattern))

print("Repetitive text analysis:")
repetitive_text = "AB" * 5000
repetitive_pattern = "ABAB"
print(analyze_collisions(repetitive_text, repetitive_pattern))
```

Rabin-Karp can be implemented in two fundamentally different modes, named after famous gambling destinations:
Monte Carlo variant (skip verification): report every hash match as a match without comparing characters. Runs in guaranteed O(n + m), but may report false positives with small probability.
Las Vegas variant (always verify): confirm every hash match by character comparison. The answer is always correct, but the running time is probabilistic: O(n + m) expected, O(nm) in the worst case.
The tradeoff:
| Variant | Time Complexity | Correctness | Use Case |
|---|---|---|---|
| Monte Carlo | O(n+m) guaranteed | Probabilistic (may have false positives) | Probabilistic data structures, approximate matching |
| Las Vegas | O(n+m) expected, O(nm) worst | Always correct | Standard pattern matching |
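A minimal Monte Carlo sketch (single hash for brevity; the only change from the verified version is that hash equality alone is trusted):

```python
def rabin_karp_monte_carlo(text: str, pattern: str,
                           base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Monte Carlo Rabin-Karp: trusts the hash and never verifies.
    Guaranteed O(n + m) time, but may (very rarely) report a false positive."""
    n, m = len(text), len(pattern)
    if m > n or m == 0:
        return []
    ph = 0
    for c in pattern:
        ph = (ph * base + ord(c)) % mod
    wh = 0
    for i in range(m):
        wh = (wh * base + ord(text[i])) % mod
    bp = pow(base, m - 1, mod)
    hits = []
    for i in range(n - m + 1):
        if wh == ph:
            hits.append(i)  # reported without character verification
        if i < n - m:
            wh = ((wh - ord(text[i]) * bp) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp_monte_carlo("AABAACAADAABAABA", "AABA"))
# → [0, 9, 12]
```

On this small input the hashes never wrap around the modulus, so the reported positions are exactly the true matches.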
Monte Carlo applications:
The Monte Carlo variant is surprisingly useful in specific contexts: probabilistic data structures, approximate matching, and settings where a hard O(n + m) time bound matters more than absolute certainty.
Making Monte Carlo reliable:
With double or triple hashing, the Monte Carlo variant becomes practically equivalent to Las Vegas:
P(false positive) = 1/(q₁ × q₂) ≈ 10^-18
In the lifetime of the universe, you won't see a false positive.
If you're using double hashing with large primes (both > 10^9), the probability of a spurious hit is so low that verification becomes practically redundant. You can skip it for a slight speedup—though most production code keeps verification for defense-in-depth.
We've covered hash collisions comprehensively. Let's consolidate the key insights: collisions are mathematically unavoidable (pigeonhole principle and the birthday bound); verification on every hash match preserves correctness; the expected number of spurious hits is roughly (n − m + 1)/q, negligible for large moduli; double hashing drives the false-positive probability to around 10^-18; and randomized hash parameters defend against adversarial input.
What's next:
Now that we understand the complete mechanics of Rabin-Karp—hashing, rolling updates, and collision handling—we're ready to analyze its complexity rigorously. The next page provides detailed time complexity analysis, including expected case, worst case, and the conditions that affect performance.
You now have a complete understanding of hash collisions in Rabin-Karp—why they occur, how to handle them correctly, and strategies to minimize their impact. This knowledge is essential for building robust, production-ready string matching implementations.