No matter how carefully you choose your hash parameters, collisions are mathematically inevitable. By the pigeonhole principle, when you map an infinite set of possible strings to a finite set of hash values {0, 1, ..., M-1}, multiple strings must share the same hash.
In practice, the question isn't whether collisions occur, but how often they occur and what to do when they do.
This page provides a deep treatment of collision handling—from understanding probability to implementing robust multi-hash solutions that make false positives virtually impossible.
By the end of this page, you will understand the birthday paradox and its implications for hash collisions, master double and triple hashing techniques, know when to verify hash matches with actual string comparison, and be able to choose the right collision handling strategy for your problem.
Before we can handle collisions, we need to understand how likely they are. This requires careful probabilistic analysis.
Single Pair Collision Probability:
For two randomly chosen distinct strings, the probability they have the same hash is:
$$P(\text{collision}) = \frac{1}{M}$$
For M = 10^9 + 7, this is about 10^-9, or one in a billion. Sounds safe, right?
The Birthday Paradox:
The famous birthday paradox reveals a counterintuitive truth: when comparing many strings, collisions happen much sooner than expected.
If you have n strings, there are C(n, 2) = n(n-1)/2 pairs, and each pair collides with probability 1/M. The probability that no pair collides is roughly (1 - 1/M)^(n(n-1)/2) ≈ e^(-n(n-1)/2M), so the probability of at least one collision is approximately:
$$P(\text{any collision}) \approx 1 - e^{-\frac{n^2}{2M}}$$
This reaches 50% when n ≈ √(2M ln 2) ≈ 1.18√M.
| Number of Strings (n) | Expected Collisions | P(≥1 collision) | Risk Level |
|---|---|---|---|
| 100 | ~0.000005 | ~0.0005% | Negligible |
| 1,000 | ~0.0005 | ~0.05% | Very Low |
| 10,000 | ~0.05 | ~5% | Low |
| 31,623 (≈√M) | ~0.5 | ~39% | Moderate |
| 100,000 | ~5 | ~99.3% | Very High |
| 1,000,000 | ~500 | ~100% | Certain |
With M = 10⁹, comparing 100,000 substrings (common in longest repeated substring problems on strings of length 100,000) gives you ~5 expected collisions. This is why single-hash solutions fail in practice for large inputs!
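These estimates follow directly from the approximation above. Here is a minimal sketch (not from the original implementation) that reproduces the table's numbers:

```python
import math

M = 10**9  # size of the hash value space

for n in [100, 1_000, 10_000, 31_623, 100_000, 1_000_000]:
    pairs = n * (n - 1) / 2            # number of string pairs, C(n, 2)
    expected = pairs / M               # expected number of colliding pairs
    p_any = 1 - math.exp(-pairs / M)   # P(at least one collision)
    print(f"n={n:>9,}  expected={expected:12.6f}  P(>=1 collision)={p_any:.4%}")
```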
Practical Implications:
Consider finding the longest repeated substring in a 100,000-character string: the binary-search approach hashes roughly 100,000 substrings at each candidate length, and as the table above shows, a single hash then expects several collisions, producing occasional false repeats.
The Solution: Use multiple independent hash functions. If two strings collide under one hash, they're extremely unlikely to collide under a second, independent hash.
Double Hash Collision Probability:
With two independent hashes using moduli M₁ and M₂:
$$P(\text{collision}) = \frac{1}{M_1 \times M_2}$$
For M₁ = M₂ = 10^9:
$$P = \frac{1}{10^{18}} \approx 10^{-18}$$
Even with n = 100,000 strings (comparing 5 billion pairs), expected collisions ≈ 5 × 10^-9 ≈ 0.
Double hashing is the standard technique for reducing collision probability to negligible levels. The concept is simple: instead of one hash value, compute two independent hash values for each string. Two strings are considered equal only if both hashes match.
Independence Requirement:
For the probability calculation to work, the two hash functions must be independent. This means they should use different bases and different moduli, with all four values chosen as distinct primes.
Standard Double Hash Setup:
The standard setup uses BASE1 = 31 with MOD1 = 10^9 + 7 for the first hash, and BASE2 = 37 with MOD2 = 10^9 + 9 for the second. These use different primes for both base and modulus, ensuring independence.
```python
class DoubleHash:
    """
    Double hashing implementation for virtually collision-free
    string comparison.

    Uses two independent polynomial hash functions. Two strings
    match only if both hashes match, reducing collision probability
    from ~10^-9 to ~10^-18.
    """

    # First hash function parameters
    BASE1 = 31
    MOD1 = 10**9 + 7

    # Second hash function parameters (independent)
    BASE2 = 37
    MOD2 = 10**9 + 9

    def __init__(self, s: str):
        """
        Initialize with precomputed prefix hashes for both hash functions.

        Time: O(n)
        Space: O(n)
        """
        self.s = s
        self.n = len(s)

        # Prefix hashes and powers for first hash
        self.h1 = [0] * (self.n + 1)
        self.p1 = [1] * (self.n + 1)

        # Prefix hashes and powers for second hash
        self.h2 = [0] * (self.n + 1)
        self.p2 = [1] * (self.n + 1)

        for i in range(self.n):
            c = ord(s[i])
            self.h1[i + 1] = (self.h1[i] * self.BASE1 + c) % self.MOD1
            self.p1[i + 1] = (self.p1[i] * self.BASE1) % self.MOD1
            self.h2[i + 1] = (self.h2[i] * self.BASE2 + c) % self.MOD2
            self.p2[i + 1] = (self.p2[i] * self.BASE2) % self.MOD2

    def get_hash(self, left: int, right: int) -> tuple[int, int]:
        """
        Get the double hash of substring s[left:right+1].

        Time: O(1)
        Returns: Tuple (hash1, hash2) - both must match for equality
        """
        length = right - left + 1

        # Python's % always returns a non-negative result for a
        # positive modulus, so no sign correction is needed
        hash1 = (self.h1[right + 1] - self.h1[left] * self.p1[length]) % self.MOD1
        hash2 = (self.h2[right + 1] - self.h2[left] * self.p2[length]) % self.MOD2

        return (hash1, hash2)

    def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """
        Compare two substrings using double hash.

        Time: O(1)

        The probability of a false positive (different strings with
        same double hash) is approximately 1/10^18 - essentially zero.
        """
        if r1 - l1 != r2 - l2:  # Different lengths
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)


def find_longest_repeated_substring_safe(s: str) -> str:
    """
    Find the longest substring that appears at least twice.
    Uses double hashing for virtually guaranteed correctness.

    Time: O(n log n) average
    """
    if len(s) <= 1:
        return ""

    hasher = DoubleHash(s)
    n = len(s)

    def has_repeated_of_length(length: int) -> int:
        """
        Check if there's a repeated substring of given length.
        Returns starting index if found, -1 otherwise.
        """
        seen = {}  # hash tuple -> starting index
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Double hash match - extremely high confidence
                # Can optionally verify with string comparison
                return i
            seen[h] = i
        return -1

    # Binary search for longest length
    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        idx = has_repeated_of_length(mid)
        if idx != -1:
            result = s[idx:idx + mid]
            left = mid + 1
        else:
            right = mid - 1

    return result


# Demonstration
print("Testing double hashing:")
s = "banana"
hasher = DoubleHash(s)

print(f"String: '{s}'")
print(f"\n'ana' at position 1: {hasher.get_hash(1, 3)}")
print(f"'ana' at position 3: {hasher.get_hash(3, 5)}")
print(f"Match: {hasher.compare(1, 3, 3, 5)}")  # True

print(f"\n'ban' at position 0: {hasher.get_hash(0, 2)}")
print(f"'ana' at position 1: {hasher.get_hash(1, 3)}")
print(f"Match: {hasher.compare(0, 2, 1, 3)}")  # False

print(f"\nLongest repeated substring: '{find_longest_repeated_substring_safe(s)}'")
```

Instead of storing a pair (hash1, hash2), you can combine them into a single value: combined = hash1 * MOD2 + hash2. This works if combined fits in your integer type (needs ~60 bits for the standard parameters).
It simplifies storage and comparison while maintaining the same collision resistance.
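A minimal sketch of this packing trick (the helper names are illustrative, not part of the implementation above):

```python
MOD1 = 10**9 + 7
MOD2 = 10**9 + 9

def combine(hash1: int, hash2: int) -> int:
    # hash1 < MOD1 and hash2 < MOD2, so the result is below
    # MOD1 * MOD2 ≈ 10^18, which fits comfortably in 60 bits
    return hash1 * MOD2 + hash2

def split(combined: int) -> tuple[int, int]:
    # Inverse of combine: recover the original pair
    return divmod(combined, MOD2)

assert split(combine(123, 456)) == (123, 456)
```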
Even with double hashing, false positives are theoretically possible (probability ~10^-18). For some applications, even this infinitesimal risk is unacceptable. The solution is verification: when hashes match, confirm equality with actual string comparison.
The Trade-off:
| Approach | Time Complexity | Correctness |
|---|---|---|
| Single hash only | O(n) preprocessing, O(1) compare | ~10^-9 error rate |
| Double hash only | O(n) preprocessing, O(1) compare | ~10^-18 error rate |
| Hash + verify | O(n) preprocessing, O(m) compare on match | 100% correct |
Where m is the length of the substring being compared.
When Verification Is Worth It:
In pattern matching (Rabin-Karp), we might check thousands of positions but only match a few. If we verify only on hash match, the O(m) verification cost is incurred only for true matches plus rare false positives. Total extra work: O(matches × m) which is usually small.
The Hybrid Strategy:
For the optimal balance between speed and correctness, use the hash for fast rejection and fall back to an O(m) string comparison only when the hashes match; the VerifiedHash class below implements exactly this hybrid.
Alternative: Double Hash without Verification
For many applications (like competitive programming), double hashing provides sufficient confidence without verification.
But in production systems performing billions of operations daily, even a 10^-18 failure probability is not strictly zero, and some domains demand guaranteed correctness.
```python
class VerifiedHash:
    """
    Hash-based string comparison with optional verification.

    Provides guaranteed correctness when verification is enabled,
    while still benefiting from hash-based fast rejection.
    """

    BASE = 31
    MOD = 10**9 + 7

    def __init__(self, s: str, verify: bool = True):
        """
        Args:
            s: The string to hash
            verify: If True, confirm hash matches with string comparison
        """
        self.s = s
        self.n = len(s)
        self.verify = verify

        # Build prefix hashes
        self.h = [0] * (self.n + 1)
        self.p = [1] * (self.n + 1)
        for i in range(self.n):
            self.h[i + 1] = (self.h[i] * self.BASE + ord(s[i])) % self.MOD
            self.p[i + 1] = (self.p[i] * self.BASE) % self.MOD

    def _get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1]."""
        length = right - left + 1
        # Python's % is always non-negative for a positive modulus
        return (self.h[right + 1] - self.h[left] * self.p[length]) % self.MOD

    def _get_substring(self, left: int, right: int) -> str:
        """Get actual substring for verification."""
        return self.s[left:right + 1]

    def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """
        Compare two substrings. Uses hash for fast rejection, then
        verifies if hashes match (when verification is enabled).

        Time: O(1) for rejection, O(m) for verified match
        """
        # Quick rejection by length
        if r1 - l1 != r2 - l2:
            return False

        # Hash comparison (fast path - most rejections happen here)
        if self._get_hash(l1, r1) != self._get_hash(l2, r2):
            return False  # Definitely different

        # Hashes match - verify if configured
        if self.verify:
            return self._get_substring(l1, r1) == self._get_substring(l2, r2)
        return True  # Trust the hash


def rabin_karp_verified(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp pattern matching with verified results.
    Uses hash for fast candidate filtering, string comparison
    for confirmation.

    Time: O(n + m) average, O(nm) worst case
    Worst case only if many false hash matches (extremely rare)
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    BASE, MOD = 31, 10**9 + 7

    # Compute pattern hash
    pattern_hash = 0
    for c in pattern:
        pattern_hash = (pattern_hash * BASE + ord(c)) % MOD

    # Compute first window hash and BASE^(m-1)
    window_hash = 0
    power = 1
    for i in range(m):
        window_hash = (window_hash * BASE + ord(text[i])) % MOD
        if i < m - 1:
            power = (power * BASE) % MOD

    occurrences = []
    for i in range(n - m + 1):
        # Update hash for positions after 0
        if i > 0:
            # Remove leftmost character, add new rightmost
            window_hash = (window_hash - ord(text[i - 1]) * power) % MOD
            window_hash = (window_hash * BASE + ord(text[i + m - 1])) % MOD

        # Check hash match
        if window_hash == pattern_hash:
            # VERIFY: Actually compare strings
            if text[i:i + m] == pattern:
                occurrences.append(i)

    return occurrences


# Demonstrate the importance of verification
def collision_demo():
    """
    Show a case where single hash matches but strings differ.
    This is rare but possible - verification catches it.
    """
    import random
    random.seed(42)

    # Small modulus so collisions are easy to find in a demo
    BASE, MOD = 31, 10**5 + 3

    def quick_hash(s: str) -> int:
        h = 0
        for c in s:
            h = (h * BASE + ord(c)) % MOD
        return h

    # Generate strings until we find collisions. Length 8 gives
    # 3^8 = 6561 distinct strings, plenty for birthday collisions
    # against a modulus of ~10^5 (length 5 gives only 243 distinct
    # strings, which would likely find none)
    seen = {}
    collisions = []
    for _ in range(100000):
        s = ''.join(random.choices('abc', k=8))
        h = quick_hash(s)
        if h in seen and seen[h] != s:
            collisions.append((seen[h], s))
            if len(collisions) >= 5:
                break
        seen[h] = s

    print("Collision examples (with small modulus for demo):")
    print(f"Using base={BASE}, mod={MOD}")
    for s1, s2 in collisions[:5]:
        print(f"  '{s1}' and '{s2}' → both hash to {quick_hash(s1)}")
        print(f"  Are they equal? {s1 == s2}")


collision_demo()

# Pattern matching demo
print("\nRabin-Karp with verification:")
text = "ababababca"
pattern = "abab"
matches = rabin_karp_verified(text, pattern)
print(f"Text: '{text}'")
print(f"Pattern: '{pattern}'")
print(f"Found at positions: {matches}")  # [0, 2, 4]
```

Beyond double hashing, several advanced strategies exist for specific use cases.
Triple Hashing:
For applications requiring ultra-low collision rates, add a third independent hash function; the per-pair collision probability drops to roughly 1/(M₁ × M₂ × M₃) ≈ 10^-27. The TripleHash class below shows this.
Rolling Hash Families:
When comparing substrings across multiple strings, hash every string with the same base and modulus so that substring hashes from different strings are directly comparable, as sketched below.
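A minimal sketch of this idea (the helper functions are illustrative, not part of a library): both strings share one set of parameters, so a substring hash from one can be compared against a substring hash from the other.

```python
BASE, MOD = 31, 10**9 + 7  # same parameters for every string

def prefix_hashes(s: str) -> tuple[list[int], list[int]]:
    """Prefix hashes and base powers for one string."""
    h = [0] * (len(s) + 1)
    p = [1] * (len(s) + 1)
    for i, c in enumerate(s):
        h[i + 1] = (h[i] * BASE + ord(c)) % MOD
        p[i + 1] = (p[i] * BASE) % MOD
    return h, p

def sub_hash(h: list[int], p: list[int], left: int, right: int) -> int:
    """Hash of the substring at [left, right] (inclusive)."""
    return (h[right + 1] - h[left] * p[right - left + 1]) % MOD

h1, p1 = prefix_hashes("banana")
h2, p2 = prefix_hashes("urban")
# 'ban': positions 0-2 in "banana", positions 2-4 in "urban"
print(sub_hash(h1, p1, 0, 2) == sub_hash(h2, p2, 2, 4))  # True
```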
Combining Hash and Position:
When ordering matters (like in sorting by hash):
```python
# Store (hash, original_index) pairs
# Sort by hash, use index for collision handling
hashed_items = [(hash(s), i, s) for i, s in enumerate(strings)]
hashed_items.sort()  # Groups equal hashes together
```
A powerful pattern: use hashing to create 'buckets' of potentially equal items, then use O(m) comparison within buckets. This turns O(n² × m) all-pairs comparison into O(n × m) hash computation + O(k² × m) verification per bucket, where k is the bucket size (usually 1-2 with good hashing).
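To make this bucket pattern concrete, here is a minimal sketch building on the sorted list above (Python's built-in hash stands in for a polynomial hash; the variable names are illustrative):

```python
from itertools import groupby

strings = ["apple", "banana", "apple", "cherry", "banana"]
hashed_items = sorted((hash(s), i, s) for i, s in enumerate(strings))

# Equal hashes are now adjacent; verify equality only inside each run
for h, run in groupby(hashed_items, key=lambda item: item[0]):
    bucket = list(run)
    if len(bucket) > 1:
        # Hash collision or true duplicate - group by the actual string
        by_string = {}
        for _, i, s in bucket:
            by_string.setdefault(s, []).append(i)
        for s, indices in by_string.items():
            if len(indices) > 1:
                print(f"duplicate {s!r} at indices {indices}")
```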
Large Modulus Alternative:
Instead of multiple ~10^9 moduli, use a single very large one, such as the Mersenne prime 2^61 - 1: one hash function then already gives a ~10^-18 per-pair collision probability (see the MersenneHash class below).
Hash Combination Techniques:
When you need a single integer from double hash:
```python
# Method 1: Tuple-based (conceptually clearest)
hash_pair = (hash1, hash2)

# Method 2: Combined into one large integer
if hash1 < 2**31 and hash2 < 2**31:
    combined = (hash1 << 32) | hash2  # Fits in 64-bit

# Method 3: String representation (for hash tables without pair support)
hash_key = f"{hash1}_{hash2}"
```
```python
from collections import defaultdict
from typing import List, Tuple


class TripleHash:
    """
    Triple hashing for ultra-low collision probability.
    Collision probability: ~10^-27 per pair
    """

    PARAMS = [
        (31, 10**9 + 7),
        (37, 10**9 + 9),
        (41, 998244353),
    ]

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)

        # Build prefix arrays for all three hash functions
        self.hashes = []
        self.powers = []

        for base, mod in self.PARAMS:
            h = [0] * (self.n + 1)
            p = [1] * (self.n + 1)
            for i in range(self.n):
                h[i + 1] = (h[i] * base + ord(s[i])) % mod
                p[i + 1] = (p[i] * base) % mod
            self.hashes.append(h)
            self.powers.append(p)

    def get_hash(self, left: int, right: int) -> tuple[int, int, int]:
        """Return triple hash of s[left:right+1]."""
        length = right - left + 1
        result = []
        for h, p, (_, mod) in zip(self.hashes, self.powers, self.PARAMS):
            # Python's % is non-negative for a positive modulus
            result.append((h[right + 1] - h[left] * p[length]) % mod)
        return tuple(result)


class MersenneHash:
    """
    Single large hash using Mersenne prime 2^61 - 1.
    Provides ~10^-18 collision probability with single hash function.
    Efficient modulo via bit operations.
    """

    MOD = (1 << 61) - 1
    BASE = 31

    @staticmethod
    def mersenne_mod(x: int) -> int:
        """Fast modulo for 2^61 - 1 (x must be non-negative)."""
        # x mod (2^61 - 1) = (x mod 2^61) + (x >> 61)
        # May need one more reduction
        while x >= MersenneHash.MOD:
            x = (x & MersenneHash.MOD) + (x >> 61)
        return x

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)
        self.h = [0] * (self.n + 1)
        self.p = [1] * (self.n + 1)

        for i in range(self.n):
            self.h[i + 1] = self.mersenne_mod(
                self.h[i] * self.BASE + ord(s[i])
            )
            self.p[i + 1] = self.mersenne_mod(self.p[i] * self.BASE)

    def get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1]."""
        length = right - left + 1
        # Reduce the product first, then correct the possibly negative
        # difference into [0, MOD) - mersenne_mod assumes x >= 0
        val = self.h[right + 1] - self.mersenne_mod(self.h[left] * self.p[length])
        return val + self.MOD if val < 0 else val


def hash_simple(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """Simple polynomial hash for grouping."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h


def group_by_hash(strings: List[str]) -> dict[int, List[str]]:
    """
    Group strings by hash for efficient duplicate detection.
    Pattern: Hash first, then compare within groups.
    """
    groups = defaultdict(list)
    for s in strings:
        groups[hash_simple(s)].append(s)
    return groups


def find_duplicates_efficient(strings: List[str]) -> List[Tuple[str, List[int]]]:
    """
    Find all duplicate strings efficiently.

    Uses hash-then-verify pattern:
    1. Group by hash (O(n × m))
    2. Compare within groups (O(k² × m) per bucket, k = bucket size)

    Much faster than naive O(n² × m) all-pairs comparison.
    """
    # Group by hash
    hash_to_indices = defaultdict(list)
    for i, s in enumerate(strings):
        hash_to_indices[hash_simple(s)].append(i)

    # Find actual duplicates within each group
    duplicates = []
    for indices in hash_to_indices.values():
        if len(indices) > 1:
            # Compare strings in this bucket
            actual_groups = defaultdict(list)
            for idx in indices:
                actual_groups[strings[idx]].append(idx)
            for s, group_indices in actual_groups.items():
                if len(group_indices) > 1:
                    duplicates.append((s, group_indices))

    return duplicates


# Demonstration
print("Triple Hash Demo:")
th = TripleHash("banana")
print(f"  'ana' (1-3): {th.get_hash(1, 3)}")
print(f"  'ana' (3-5): {th.get_hash(3, 5)}")

print("\nMersenne Hash Demo:")
mh = MersenneHash("banana")
print(f"  'ana' (1-3): {mh.get_hash(1, 3)}")
print(f"  'ana' (3-5): {mh.get_hash(3, 5)}")

print("\nDuplicate Detection Demo:")
strings = ["apple", "banana", "apple", "cherry", "banana", "banana"]
dupes = find_duplicates_efficient(strings)
print(f"  Input: {strings}")
print(f"  Duplicates: {dupes}")
```

With multiple options available, how do you choose the right collision handling strategy? Here's a decision framework:
| Scenario | Recommended Approach | Why |
|---|---|---|
| Competitive programming | Double hash, no verify | Speed matters; 10^-18 error acceptable |
| Production string matching | Single hash + verify | Guaranteed correct; O(1) average case |
| Deduplication at scale | Hash groups + verify | Minimize comparisons in large datasets |
| Security-sensitive | Cryptographic hash (SipHash) | Polynomial hashes vulnerable to attack |
| Ultra-critical correctness | Mersenne hash + verify | Maximum safety with single hash |
| Educational/learning | Single hash with analysis | Understand failure modes first |
The Cost-Benefit Analysis:
Single Hash: fastest and simplest (O(n) preprocessing, O(1) comparisons), but the ~10^-9 per-pair error rate produces real collisions once you compare on the order of 10^5 strings.
Double Hash: roughly twice the preprocessing and comparison cost, in exchange for a ~10^-18 per-pair error rate that is negligible at any realistic input size.
Single Hash + Verification: guaranteed correct; comparisons stay O(1) on hash mismatch and cost O(m) only on a match, so the extra work scales with the number of true matches.
When in doubt: use double hashing for competitive programming, and single hash + verification for everything else. Double hashing adds little overhead and dramatically reduces collision risk. Verification adds safety at minimal average cost.
What's Next:
We've now covered the fundamentals of string hashing—polynomial hashing, parameter selection, and collision handling. The final page explores applications beyond exact pattern matching: substring comparison, rolling hash for plagiarism detection, string similarity, and more.
You now understand string hash collision probability, how to implement double hashing, when to verify hash matches, and how to choose the right strategy for your application. These techniques make string hashing reliable enough for production use.