No matter how carefully you choose your hash parameters, collisions are mathematically inevitable. By the pigeonhole principle, when you map an infinite set of possible strings to a finite set of hash values {0, 1, ..., M-1}, multiple strings must share the same hash.
In practice, the question isn't whether collisions occur, but how often they occur and what to do when they do.
This page provides a deep treatment of collision handling—from understanding probability to implementing robust multi-hash solutions that make false positives virtually impossible.
By the end of this page, you will understand the birthday paradox and its implications for hash collisions, master double and triple hashing techniques, know when to verify hash matches with actual string comparison, and be able to choose the right collision handling strategy for your problem.
Before we can handle collisions, we need to understand how likely they are. This requires careful probabilistic analysis.
Single Pair Collision Probability:
For two randomly chosen distinct strings, the probability they have the same hash is:
$$P(\text{collision}) = \frac{1}{M}$$
For M = 10^9 + 7, this is about 10^-9, or one in a billion. Sounds safe, right?
The Birthday Paradox:
The famous birthday paradox reveals a counterintuitive truth: when comparing many strings, collisions happen much sooner than expected.
If you have n strings, there are C(n, 2) = n(n-1)/2 pairs, and each pair collides with probability 1/M. The probability that no pair collides is roughly (1 - 1/M)^(n(n-1)/2) ≈ e^(-n(n-1)/2M), so the probability of at least one collision is approximately:
$$P(\text{any collision}) \approx 1 - e^{-\frac{n^2}{2M}}$$
This reaches 50% when n ≈ √(2M ln 2) ≈ 1.18√M.
| Number of Strings (n) | Expected Collisions | P(≥1 collision) | Risk Level |
|---|---|---|---|
| 100 | ~0.000005 | ~0.0005% | Negligible |
| 1,000 | ~0.0005 | ~0.05% | Very Low |
| 10,000 | ~0.05 | ~5% | Low |
| 31,623 (≈√M) | ~0.5 | ~39% | Moderate |
| 100,000 | ~5 | ~99.3% | Very High |
| 1,000,000 | ~500 | ~100% | Certain |
With M = 10⁹, comparing 100,000 substrings (common in longest repeated substring problems on strings of length 100,000) gives you ~5 expected collisions. This is why single-hash solutions fail in practice for large inputs!
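These estimates follow directly from the approximation above. Here is a minimal sketch (not from the original implementation) that reproduces the table's numbers:

```python
import math

M = 10**9  # size of the hash value space

for n in [100, 1_000, 10_000, 31_623, 100_000, 1_000_000]:
    pairs = n * (n - 1) / 2            # number of string pairs, C(n, 2)
    expected = pairs / M               # expected number of colliding pairs
    p_any = 1 - math.exp(-pairs / M)   # P(at least one collision)
    print(f"n={n:>9,}  expected={expected:12.6f}  P(>=1 collision)={p_any:.4%}")
```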
Practical Implications:
Consider finding the longest repeated substring in a 100,000-character string: the binary-search approach hashes roughly 100,000 substrings at each candidate length, and as the table above shows, a single hash then expects several collisions, producing occasional false repeats.
The Solution: Use multiple independent hash functions. If two strings collide under one hash, they're extremely unlikely to collide under a second, independent hash.
Double Hash Collision Probability:
With two independent hashes using moduli M₁ and M₂:
$$P(\text{collision}) = \frac{1}{M_1 \times M_2}$$
For M₁ = M₂ = 10^9:
$$P = \frac{1}{10^{18}} \approx 10^{-18}$$
Even with n = 100,000 strings (comparing 5 billion pairs), expected collisions ≈ 5 × 10^-9 ≈ 0.
Double hashing is the standard technique for reducing collision probability to negligible levels. The concept is simple: instead of one hash value, compute two independent hash values for each string. Two strings are considered equal only if both hashes match.
Independence Requirement:
For the probability calculation to work, the two hash functions must be independent. This means they should use different bases and different moduli, with all four values chosen as distinct primes.
Standard Double Hash Setup:
The standard setup uses BASE1 = 31 with MOD1 = 10^9 + 7 for the first hash, and BASE2 = 37 with MOD2 = 10^9 + 9 for the second. These use different primes for both base and modulus, ensuring independence.
```python
class DoubleHash:
    """
    Double hashing implementation for virtually collision-free
    string comparison.

    Uses two independent polynomial hash functions. Two strings
    match only if both hashes match, reducing collision probability
    from ~10^-9 to ~10^-18.
    """

    # First hash function parameters
    BASE1 = 31
    MOD1 = 10**9 + 7

    # Second hash function parameters (independent)
    BASE2 = 37
    MOD2 = 10**9 + 9

    def __init__(self, s: str):
        """
        Initialize with precomputed prefix hashes for both hash functions.

        Time: O(n)
        Space: O(n)
        """
        self.s = s
        self.n = len(s)

        # Prefix hashes and powers for first hash
        self.h1 = [0] * (self.n + 1)
        self.p1 = [1] * (self.n + 1)

        # Prefix hashes and powers for second hash
        self.h2 = [0] * (self.n + 1)
        self.p2 = [1] * (self.n + 1)

        for i in range(self.n):
            c = ord(s[i])
            self.h1[i + 1] = (self.h1[i] * self.BASE1 + c) % self.MOD1
            self.p1[i + 1] = (self.p1[i] * self.BASE1) % self.MOD1
            self.h2[i + 1] = (self.h2[i] * self.BASE2 + c) % self.MOD2
            self.p2[i + 1] = (self.p2[i] * self.BASE2) % self.MOD2

    def get_hash(self, left: int, right: int) -> tuple[int, int]:
        """
        Get the double hash of substring s[left:right+1].

        Time: O(1)
        Returns: Tuple (hash1, hash2) - both must match for equality
        """
        length = right - left + 1

        # Python's % always returns a non-negative result for a
        # positive modulus, so no sign correction is needed
        hash1 = (self.h1[right + 1] - self.h1[left] * self.p1[length]) % self.MOD1
        hash2 = (self.h2[right + 1] - self.h2[left] * self.p2[length]) % self.MOD2

        return (hash1, hash2)

    def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """
        Compare two substrings using double hash.

        Time: O(1)

        The probability of a false positive (different strings with
        same double hash) is approximately 1/10^18 - essentially zero.
        """
        if r1 - l1 != r2 - l2:  # Different lengths
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)


def find_longest_repeated_substring_safe(s: str) -> str:
    """
    Find the longest substring that appears at least twice.
    Uses double hashing for virtually guaranteed correctness.

    Time: O(n log n) average
    """
    if len(s) <= 1:
        return ""

    hasher = DoubleHash(s)
    n = len(s)

    def has_repeated_of_length(length: int) -> int:
        """
        Check if there's a repeated substring of given length.
        Returns starting index if found, -1 otherwise.
        """
        seen = {}  # hash tuple -> starting index
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Double hash match - extremely high confidence
                # Can optionally verify with string comparison
                return i
            seen[h] = i
        return -1

    # Binary search for longest length
    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        idx = has_repeated_of_length(mid)
        if idx != -1:
            result = s[idx:idx + mid]
            left = mid + 1
        else:
            right = mid - 1

    return result


# Demonstration
print("Testing double hashing:")
s = "banana"
hasher = DoubleHash(s)

print(f"String: '{s}'")
print(f"\n'ana' at position 1: {hasher.get_hash(1, 3)}")
print(f"'ana' at position 3: {hasher.get_hash(3, 5)}")
print(f"Match: {hasher.compare(1, 3, 3, 5)}")  # True

print(f"\n'ban' at position 0: {hasher.get_hash(0, 2)}")
print(f"'ana' at position 1: {hasher.get_hash(1, 3)}")
print(f"Match: {hasher.compare(0, 2, 1, 3)}")  # False

print(f"\nLongest repeated substring: '{find_longest_repeated_substring_safe(s)}'")
```

Instead of storing a pair (hash1, hash2), you can combine them into a single value: combined = hash1 * MOD2 + hash2. This works if combined fits in your integer type (needs ~60 bits for the standard parameters).
It simplifies storage and comparison while maintaining the same collision resistance.
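A minimal sketch of this packing trick (the helper names are illustrative, not part of the implementation above):

```python
MOD1 = 10**9 + 7
MOD2 = 10**9 + 9

def combine(hash1: int, hash2: int) -> int:
    # hash1 < MOD1 and hash2 < MOD2, so the result is below
    # MOD1 * MOD2 ≈ 10^18, which fits comfortably in 60 bits
    return hash1 * MOD2 + hash2

def split(combined: int) -> tuple[int, int]:
    # Inverse of combine: recover the original pair
    return divmod(combined, MOD2)

assert split(combine(123, 456)) == (123, 456)
```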
Even with double hashing, false positives are theoretically possible (probability ~10^-18). For some applications, even this infinitesimal risk is unacceptable. The solution is verification: when hashes match, confirm equality with actual string comparison.
The Trade-off:
| Approach | Time Complexity | Correctness |
|---|---|---|
| Single hash only | O(n) preprocessing, O(1) compare | ~10^-9 error rate |
| Double hash only | O(n) preprocessing, O(1) compare | ~10^-18 error rate |
| Hash + verify | O(n) preprocessing, O(m) compare on match | 100% correct |
Where m is the length of the substring being compared.
When Verification Is Worth It:
In pattern matching (Rabin-Karp), we might check thousands of positions but only match a few. If we verify only on hash match, the O(m) verification cost is incurred only for true matches plus rare false positives. Total extra work: O(matches × m) which is usually small.
The Hybrid Strategy:
For the optimal balance between speed and correctness, use the hash for fast rejection and fall back to an O(m) string comparison only when the hashes match; the VerifiedHash class below implements exactly this hybrid.
Alternative: Double Hash without Verification
For many applications (like competitive programming), double hashing provides sufficient confidence without verification.
But in production systems performing billions of operations daily, even a 10^-18 failure probability is not strictly zero, and some domains demand guaranteed correctness.
```python
class VerifiedHash:
    """
    Hash-based string comparison with optional verification.

    Provides guaranteed correctness when verification is enabled,
    while still benefiting from hash-based fast rejection.
    """

    BASE = 31
    MOD = 10**9 + 7

    def __init__(self, s: str, verify: bool = True):
        """
        Args:
            s: The string to hash
            verify: If True, confirm hash matches with string comparison
        """
        self.s = s
        self.n = len(s)
        self.verify = verify

        # Build prefix hashes
        self.h = [0] * (self.n + 1)
        self.p = [1] * (self.n + 1)
        for i in range(self.n):
            self.h[i + 1] = (self.h[i] * self.BASE + ord(s[i])) % self.MOD
            self.p[i + 1] = (self.p[i] * self.BASE) % self.MOD

    def _get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1]."""
        length = right - left + 1
        # Python's % is always non-negative for a positive modulus
        return (self.h[right + 1] - self.h[left] * self.p[length]) % self.MOD

    def _get_substring(self, left: int, right: int) -> str:
        """Get actual substring for verification."""
        return self.s[left:right + 1]

    def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """
        Compare two substrings. Uses hash for fast rejection, then
        verifies if hashes match (when verification is enabled).

        Time: O(1) for rejection, O(m) for verified match
        """
        # Quick rejection by length
        if r1 - l1 != r2 - l2:
            return False

        # Hash comparison (fast path - most rejections happen here)
        if self._get_hash(l1, r1) != self._get_hash(l2, r2):
            return False  # Definitely different

        # Hashes match - verify if configured
        if self.verify:
            return self._get_substring(l1, r1) == self._get_substring(l2, r2)
        return True  # Trust the hash


def rabin_karp_verified(text: str, pattern: str) -> list[int]:
    """
    Rabin-Karp pattern matching with verified results.
    Uses hash for fast candidate filtering, string comparison
    for confirmation.

    Time: O(n + m) average, O(nm) worst case
    Worst case only if many false hash matches (extremely rare)
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    BASE, MOD = 31, 10**9 + 7

    # Compute pattern hash
    pattern_hash = 0
    for c in pattern:
        pattern_hash = (pattern_hash * BASE + ord(c)) % MOD

    # Compute first window hash and BASE^(m-1)
    window_hash = 0
    power = 1
    for i in range(m):
        window_hash = (window_hash * BASE + ord(text[i])) % MOD
        if i < m - 1:
            power = (power * BASE) % MOD

    occurrences = []
    for i in range(n - m + 1):
        # Update hash for positions after 0
        if i > 0:
            # Remove leftmost character, add new rightmost
            window_hash = (window_hash - ord(text[i - 1]) * power) % MOD
            window_hash = (window_hash * BASE + ord(text[i + m - 1])) % MOD

        # Check hash match
        if window_hash == pattern_hash:
            # VERIFY: Actually compare strings
            if text[i:i + m] == pattern:
                occurrences.append(i)

    return occurrences


# Demonstrate the importance of verification
def collision_demo():
    """
    Show a case where single hash matches but strings differ.
    This is rare but possible - verification catches it.
    """
    import random
    random.seed(42)

    # Small modulus so collisions are easy to find in a demo
    BASE, MOD = 31, 10**5 + 3

    def quick_hash(s: str) -> int:
        h = 0
        for c in s:
            h = (h * BASE + ord(c)) % MOD
        return h

    # Generate strings until we find collisions. Length 8 gives
    # 3^8 = 6561 distinct strings, plenty for birthday collisions
    # against a modulus of ~10^5 (length 5 gives only 243 distinct
    # strings, which would likely find none)
    seen = {}
    collisions = []
    for _ in range(100000):
        s = ''.join(random.choices('abc', k=8))
        h = quick_hash(s)
        if h in seen and seen[h] != s:
            collisions.append((seen[h], s))
            if len(collisions) >= 5:
                break
        seen[h] = s

    print("Collision examples (with small modulus for demo):")
    print(f"Using base={BASE}, mod={MOD}")
    for s1, s2 in collisions[:5]:
        print(f"  '{s1}' and '{s2}' → both hash to {quick_hash(s1)}")
        print(f"  Are they equal? {s1 == s2}")


collision_demo()

# Pattern matching demo
print("\nRabin-Karp with verification:")
text = "ababababca"
pattern = "abab"
matches = rabin_karp_verified(text, pattern)
print(f"Text: '{text}'")
print(f"Pattern: '{pattern}'")
print(f"Found at positions: {matches}")  # [0, 2, 4]
```

Beyond double hashing, several advanced strategies exist for specific use cases.
Triple Hashing:
For applications requiring ultra-low collision rates, add a third independent hash function; the per-pair collision probability drops to roughly 1/(M₁ × M₂ × M₃) ≈ 10^-27. The TripleHash class below shows this.
Rolling Hash Families:
When comparing substrings across multiple strings, hash every string with the same base and modulus so that substring hashes from different strings are directly comparable, as sketched below.
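A minimal sketch of this idea (the helper functions are illustrative, not part of a library): both strings share one set of parameters, so a substring hash from one can be compared against a substring hash from the other.

```python
BASE, MOD = 31, 10**9 + 7  # same parameters for every string

def prefix_hashes(s: str) -> tuple[list[int], list[int]]:
    """Prefix hashes and base powers for one string."""
    h = [0] * (len(s) + 1)
    p = [1] * (len(s) + 1)
    for i, c in enumerate(s):
        h[i + 1] = (h[i] * BASE + ord(c)) % MOD
        p[i + 1] = (p[i] * BASE) % MOD
    return h, p

def sub_hash(h: list[int], p: list[int], left: int, right: int) -> int:
    """Hash of the substring at [left, right] (inclusive)."""
    return (h[right + 1] - h[left] * p[right - left + 1]) % MOD

h1, p1 = prefix_hashes("banana")
h2, p2 = prefix_hashes("urban")
# 'ban': positions 0-2 in "banana", positions 2-4 in "urban"
print(sub_hash(h1, p1, 0, 2) == sub_hash(h2, p2, 2, 4))  # True
```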
Combining Hash and Position:
When ordering matters (like in sorting by hash):
```python
# Store (hash, original_index) pairs
# Sort by hash, use index for collision handling
hashed_items = [(hash(s), i, s) for i, s in enumerate(strings)]
hashed_items.sort()  # Groups equal hashes together
```
A powerful pattern: use hashing to create 'buckets' of potentially equal items, then use O(m) comparison within buckets. This turns O(n² × m) all-pairs comparison into O(n × m) hash computation + O(k² × m) verification per bucket, where k is the bucket size (usually 1-2 with good hashing).
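To make this bucket pattern concrete, here is a minimal sketch building on the sorted list above (Python's built-in hash stands in for a polynomial hash; the variable names are illustrative):

```python
from itertools import groupby

strings = ["apple", "banana", "apple", "cherry", "banana"]
hashed_items = sorted((hash(s), i, s) for i, s in enumerate(strings))

# Equal hashes are now adjacent; verify equality only inside each run
for h, run in groupby(hashed_items, key=lambda item: item[0]):
    bucket = list(run)
    if len(bucket) > 1:
        # Hash collision or true duplicate - group by the actual string
        by_string = {}
        for _, i, s in bucket:
            by_string.setdefault(s, []).append(i)
        for s, indices in by_string.items():
            if len(indices) > 1:
                print(f"duplicate {s!r} at indices {indices}")
```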
Large Modulus Alternative:
Instead of multiple ~10^9 moduli, use a single very large one, such as the Mersenne prime 2^61 - 1: one hash function then already gives a ~10^-18 per-pair collision probability (see the MersenneHash class below).
Hash Combination Techniques:
When you need a single integer from double hash:
```python
# Method 1: Tuple-based (conceptually clearest)
hash_pair = (hash1, hash2)

# Method 2: Combined into one large integer
if hash1 < 2**31 and hash2 < 2**31:
    combined = (hash1 << 32) | hash2  # Fits in 64-bit

# Method 3: String representation (for hash tables without pair support)
hash_key = f"{hash1}_{hash2}"
```
```python
from collections import defaultdict
from typing import List, Tuple


class TripleHash:
    """
    Triple hashing for ultra-low collision probability.
    Collision probability: ~10^-27 per pair
    """

    PARAMS = [
        (31, 10**9 + 7),
        (37, 10**9 + 9),
        (41, 998244353),
    ]

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)

        # Build prefix arrays for all three hash functions
        self.hashes = []
        self.powers = []

        for base, mod in self.PARAMS:
            h = [0] * (self.n + 1)
            p = [1] * (self.n + 1)
            for i in range(self.n):
                h[i + 1] = (h[i] * base + ord(s[i])) % mod
                p[i + 1] = (p[i] * base) % mod
            self.hashes.append(h)
            self.powers.append(p)

    def get_hash(self, left: int, right: int) -> tuple[int, int, int]:
        """Return triple hash of s[left:right+1]."""
        length = right - left + 1
        result = []
        for h, p, (_, mod) in zip(self.hashes, self.powers, self.PARAMS):
            # Python's % is non-negative for a positive modulus
            result.append((h[right + 1] - h[left] * p[length]) % mod)
        return tuple(result)


class MersenneHash:
    """
    Single large hash using Mersenne prime 2^61 - 1.
    Provides ~10^-18 collision probability with single hash function.
    Efficient modulo via bit operations.
    """

    MOD = (1 << 61) - 1
    BASE = 31

    @staticmethod
    def mersenne_mod(x: int) -> int:
        """Fast modulo for 2^61 - 1 (x must be non-negative)."""
        # x mod (2^61 - 1) = (x mod 2^61) + (x >> 61)
        # May need one more reduction
        while x >= MersenneHash.MOD:
            x = (x & MersenneHash.MOD) + (x >> 61)
        return x

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)
        self.h = [0] * (self.n + 1)
        self.p = [1] * (self.n + 1)

        for i in range(self.n):
            self.h[i + 1] = self.mersenne_mod(
                self.h[i] * self.BASE + ord(s[i])
            )
            self.p[i + 1] = self.mersenne_mod(self.p[i] * self.BASE)

    def get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1]."""
        length = right - left + 1
        # Reduce the product first, then correct the possibly negative
        # difference into [0, MOD) - mersenne_mod assumes x >= 0
        val = self.h[right + 1] - self.mersenne_mod(self.h[left] * self.p[length])
        return val + self.MOD if val < 0 else val


def hash_simple(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """Simple polynomial hash for grouping."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h


def group_by_hash(strings: List[str]) -> dict[int, List[str]]:
    """
    Group strings by hash for efficient duplicate detection.
    Pattern: Hash first, then compare within groups.
    """
    groups = defaultdict(list)
    for s in strings:
        groups[hash_simple(s)].append(s)
    return groups


def find_duplicates_efficient(strings: List[str]) -> List[Tuple[str, List[int]]]:
    """
    Find all duplicate strings efficiently.

    Uses hash-then-verify pattern:
    1. Group by hash (O(n × m))
    2. Compare within groups (O(k² × m) per bucket, k = bucket size)

    Much faster than naive O(n² × m) all-pairs comparison.
    """
    # Group by hash
    hash_to_indices = defaultdict(list)
    for i, s in enumerate(strings):
        hash_to_indices[hash_simple(s)].append(i)

    # Find actual duplicates within each group
    duplicates = []
    for indices in hash_to_indices.values():
        if len(indices) > 1:
            # Compare strings in this bucket
            actual_groups = defaultdict(list)
            for idx in indices:
                actual_groups[strings[idx]].append(idx)
            for s, group_indices in actual_groups.items():
                if len(group_indices) > 1:
                    duplicates.append((s, group_indices))

    return duplicates


# Demonstration
print("Triple Hash Demo:")
th = TripleHash("banana")
print(f"  'ana' (1-3): {th.get_hash(1, 3)}")
print(f"  'ana' (3-5): {th.get_hash(3, 5)}")

print("\nMersenne Hash Demo:")
mh = MersenneHash("banana")
print(f"  'ana' (1-3): {mh.get_hash(1, 3)}")
print(f"  'ana' (3-5): {mh.get_hash(3, 5)}")

print("\nDuplicate Detection Demo:")
strings = ["apple", "banana", "apple", "cherry", "banana", "banana"]
dupes = find_duplicates_efficient(strings)
print(f"  Input: {strings}")
print(f"  Duplicates: {dupes}")
```

With multiple options available, how do you choose the right collision handling strategy? Here's a decision framework:
| Scenario | Recommended Approach | Why |
|---|---|---|
| Competitive programming | Double hash, no verify | Speed matters; 10^-18 error acceptable |
| Production string matching | Single hash + verify | Guaranteed correct; O(1) average case |
| Deduplication at scale | Hash groups + verify | Minimize comparisons in large datasets |
| Security-sensitive | Cryptographic hash (SipHash) | Polynomial hashes vulnerable to attack |
| Ultra-critical correctness | Mersenne hash + verify | Maximum safety with single hash |
| Educational/learning | Single hash with analysis | Understand failure modes first |
The Cost-Benefit Analysis:
Single Hash: fastest and simplest (O(n) preprocessing, O(1) comparisons), but the ~10^-9 per-pair error rate produces real collisions once you compare on the order of 10^5 strings.
Double Hash: roughly twice the preprocessing and comparison cost, in exchange for a ~10^-18 per-pair error rate that is negligible at any realistic input size.
Single Hash + Verification: guaranteed correct; comparisons stay O(1) on hash mismatch and cost O(m) only on a match, so the extra work scales with the number of true matches.
When in doubt: use double hashing for competitive programming, and single hash + verification for everything else. Double hashing adds little overhead and dramatically reduces collision risk. Verification adds safety at minimal average cost.
What's Next:
We've now covered the fundamentals of string hashing—polynomial hashing, parameter selection, and collision handling. The final page explores applications beyond exact pattern matching: substring comparison, rolling hash for plagiarism detection, string similarity, and more.
You now understand string hash collision probability, how to implement double hashing, when to verify hash matches, and how to choose the right strategy for your application. These techniques make string hashing reliable enough for production use.