We've seen how the naive algorithm compares patterns character by character, and how the KMP algorithm cleverly uses prefix information to avoid redundant comparisons. But what if we could take an entirely different approach—one that doesn't compare characters at all?

The radical insight: Instead of asking "do these two strings have the same characters?", what if we asked "do these two strings have the same fingerprint?"

This is the foundational idea behind the Rabin-Karp algorithm, developed by Michael O. Rabin and Richard M. Karp in 1987. Rather than comparing strings directly, we compute a hash value (a numerical fingerprint) for both the pattern and each candidate substring in the text. If the hash values match, we have a potential match; if they differ, we can instantly rule out that position.
By the end of this page, you will understand: (1) why hashing is a powerful paradigm for string comparison, (2) how hash functions transform the pattern matching problem, (3) the key properties that make a hash function suitable for string matching, and (4) the conceptual framework that makes Rabin-Karp possible.
Before introducing the hashing approach, let's crystallize exactly what makes string comparison expensive and why a different paradigm might help.

The fundamental challenge:

When we compare a pattern P of length m against a substring of text T, the naive approach examines characters one by one:

```
Pattern:    A B C D E
Substring:  A B C D X
            ✓ ✓ ✓ ✓ ✗
```

Each comparison costs O(1), but we might need up to m comparisons before discovering a mismatch—and that mismatch might occur at the very last character. For n-m+1 possible starting positions, this yields O(nm) total work in the worst case.

Why this matters:

The core inefficiency isn't just the character comparisons—it's that we learn nothing global from local comparisons. When we compare ABCDE against ABCDX and find they differ at position 5, we've gained no information about how ABCDE compares to any other substring. We start from scratch at the next position.
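To make that cost concrete, here is a minimal sketch of the naive matcher just described (a plain reference implementation for comparison—the name naive_match is ours, purely for illustration):

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Character-by-character matching: up to m comparisons at each of n-m+1 positions."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # every possible starting position
        for j in range(m):              # up to m character comparisons
            if text[i + j] != pattern[j]:
                break                   # a mismatch here tells us nothing about other positions
        else:
            matches.append(i)           # all m characters matched
    return matches

print(naive_match("ABCDEABCDX", "ABCDE"))  # [0]
```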
| Strategy | What It Learns | Information Reuse |
|---|---|---|
| Naive | Single character equality | None—starts fresh each position |
| KMP | Prefix structure of pattern | Shifts based on failure function |
| Hashing | Numerical fingerprint equality | Reuses computation across positions |
The key observation:

What if, instead of character-by-character comparison, we could reduce each m-character string to a single number? Then string comparison becomes integer comparison—a single O(1) operation, regardless of string length.

This is precisely what hash functions enable.
A hash function is a mathematical function that maps data of arbitrary size to fixed-size values. For string matching, we need a hash function h that takes a string and produces an integer:

```
h("HELLO") → 7429831
h("WORLD") → 2198457
h("HELLO") → 7429831   (same input always gives same output)
```

Essential properties for string matching:

1. Deterministic: the same string always produces the same hash value.
2. Efficient to compute: hashing a string of length m should take at most O(m) time.
3. Low collision rate: different strings should rarely share a hash value.
4. Updatable: ideally, the hash of the next window can be derived cheaply from the hash of the current one (the "rolling" property developed below).
By the pigeonhole principle, if we hash more strings than there are possible hash values, some strings must share a hash. For example, if our hash function produces 32-bit integers (about 4 billion values), hashing more than 4 billion distinct strings guarantees collisions. The Rabin-Karp algorithm must handle this gracefully.
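A tiny demonstration of the pigeonhole principle in action, using a deliberately small hash range so a collision is guaranteed (tiny_hash is a throwaway helper just for this illustration):

```python
def tiny_hash(s: str, buckets: int = 10) -> int:
    """Map a string into one of only 10 possible hash values."""
    return sum(ord(c) for c in s) % buckets

seen: dict[int, str] = {}
words = ["cat", "dog", "bird", "fish", "lion", "bear",
         "wolf", "deer", "frog", "crab", "seal"]
for word in words:  # 11 distinct strings, only 10 possible hash values
    h = tiny_hash(word)
    if h in seen:
        # With these words, 'dog' and 'lion' both land in bucket 4
        print(f"Collision: {seen[h]!r} and {word!r} both hash to {h}")
        break
    seen[h] = word
```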
The hash-based comparison paradigm:

With a good hash function, our pattern matching strategy becomes:

1. Compute hash(pattern) — the fingerprint we're looking for
2. For each position i in text:
   - Compute hash(text[i:i+m]) — fingerprint of current substring
   - If hashes differ → definitely not a match, move on
   - If hashes match → possibly a match, verify with character comparison

The verification step is crucial: matching hashes don't guarantee matching strings (due to collisions), but differing hashes do guarantee non-matching strings. We use hash comparison as a fast filter.
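Here is a minimal sketch of that filter-and-verify loop, written against an arbitrary string hash function h (find_matches is an illustrative name; note that this version still recomputes each window's hash from scratch—exactly the cost addressed below):

```python
from typing import Callable

def find_matches(text: str, pattern: str, h: Callable[[str], int]) -> list[int]:
    """Filter candidates by hash equality, then verify them character by character."""
    m = len(pattern)
    target = h(pattern)               # the fingerprint we're looking for
    matches = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if h(window) == target:       # fast filter: differing hashes can never match
            if window == pattern:     # verification guards against hash collisions
                matches.append(i)
    return matches

# Works with any deterministic string hash; Python's built-in hash() is
# deterministic within a single run.
print(find_matches("ABABCABAB", "ABAB", hash))  # [0, 5]
```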
To understand the power of hashing, we need to analyze what happens in the best case versus the worst case for pattern matching.

Naive approach analysis:
- At each position, we compare up to m characters
- n-m+1 positions to check
- Worst case: O(nm) total comparisons
- Most comparisons are wasted on non-matching substrings

Hash-based approach analysis:
- Compute pattern hash once: O(m)
- At each position, compare two integers: O(1)
- n-m+1 positions to check: O(n) hash comparisons
- Only verify (with O(m) character comparison) when hashes match

The key insight: Most positions won't match. If hashes are uniformly distributed and collisions are rare, the hash comparison filters out nearly all non-matching positions in O(1) each.
A concrete example:

Suppose we're searching for a 100-character pattern in a 1,000,000-character text.

Naive approach (worst case):
- ~999,901 positions × 100 comparisons = ~100 million character comparisons

Hash-based approach (expected case):
- 1 pattern hash computation: 100 operations
- ~999,901 hash comparisons: ~1 million integer comparisons
- Assume collision rate of 1/1,000,000: maybe 1-2 false positives to verify
- Total: ~1 million operations

The hash-based approach is approximately 100× faster in the expected case, scaling with text length rather than the product of text and pattern lengths.
There's a critical problem with the naive hashing approach we've described so far. Let's think carefully about the cost of computing hash values.

The hidden cost:

If computing hash(text[i:i+m]) requires reading all m characters of the substring, then computing hashes for all n-m+1 positions costs:

```
(n - m + 1) × m = O(nm)
```

This is exactly as expensive as the naive algorithm! We've gained nothing.

The critical question:

Can we compute the hash of text[i+1:i+1+m] more efficiently if we already know the hash of text[i:i+m]?

In other words: can we update a hash value incrementally as our window slides, rather than recomputing it from scratch?
The entire viability of hash-based pattern matching depends on finding a hash function that can be updated in O(1) time as the window slides. Without this property—called the 'rolling' property—hashing offers no advantage over naive comparison. This is exactly what the polynomial rolling hash provides.
The sliding window insight:

Consider two consecutive substrings of text:

```
Position i:   [a₁ a₂ a₃ ... aₘ]
Position i+1:    [a₂ a₃ ... aₘ aₘ₊₁]
```

These substrings share m-1 characters. A clever hash function should be able to:
1. "Remove" the contribution of a₁ (the character leaving the window)
2. "Add" the contribution of aₘ₊₁ (the character entering the window)

If both operations take O(1) time, then updating the hash takes O(1) time, and computing all n-m+1 hashes takes only O(n) time total.

This is the essence of the rolling hash technique that we'll explore in detail in the next page.
Before introducing the polynomial rolling hash, let's examine a simpler hash function to build intuition about why hash design matters.

Sum-based hash:

The simplest hash function just sums the character values:

```
hash("ABC") = ord('A') + ord('B') + ord('C') = 65 + 66 + 67 = 198
```

Advantages:
- Easy to compute: O(m) for a string of length m
- Easy to update in O(1): hash("BCD") = hash("ABC") - ord('A') + ord('D')

Fatal flaw:
- Produces the same hash for any permutation of characters
- hash("ABC") = hash("BAC") = hash("CBA") = 198
- Collision rate is catastrophically high
```python
def sum_hash(s):
    """Simple sum-based hash - demonstrating high collision rate."""
    return sum(ord(c) for c in s)

# All these produce the same hash!
print(sum_hash("ABC"))  # 198
print(sum_hash("BAC"))  # 198
print(sum_hash("CBA"))  # 198

# Even completely different strings can collide
print(sum_hash("AZ"))   # 155
print(sum_hash("BY"))   # 155
print(sum_hash("CX"))   # 155

# For pattern matching, this would cause massive false positives
```

Why position matters:

The sum-based hash fails because it ignores character positions. In pattern matching, "ABC" and "CBA" are completely different strings that should never match. We need a hash function where character positions affect the output.

The solution:

Polynomial hashing treats the string as a number in some base b, where each character's position determines its weight:

```
hash("ABC") = ord('A')×b² + ord('B')×b¹ + ord('C')×b⁰
```

Now "ABC" and "CBA" produce different hashes because characters at different positions contribute differently. This is the foundation of the rolling hash we'll explore next.
The solution to our collision problem is the polynomial hash function, which treats strings as numbers in a chosen base. This is the hash function used by Rabin-Karp and many other string algorithms.

Definition:

For a string s = s₀s₁s₂...sₘ₋₁ of length m, the polynomial hash is:

```
hash(s) = (s₀ × bᵐ⁻¹ + s₁ × bᵐ⁻² + ... + sₘ₋₂ × b¹ + sₘ₋₁ × b⁰) mod q
```

Where:
- b is the base (typically a prime slightly larger than the alphabet size)
- q is a large prime modulus (to keep hash values bounded)
- sᵢ is the numerical value of the character at position i

Why this works:

Consider hashing "ABC" with base b = 31:

```
hash("ABC") = 65×31² + 66×31¹ + 67×31⁰
            = 65×961 + 66×31 + 67×1
            = 62465 + 2046 + 67
            = 64578

hash("CBA") = 67×31² + 66×31¹ + 65×31⁰
            = 67×961 + 66×31 + 65×1
            = 64387 + 2046 + 65
            = 66498
```

Different hashes for different strings!
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """
    Polynomial hash function for strings.

    Treats string as a number in the given base.
    Uses modular arithmetic to prevent integer overflow.

    Args:
        s: The string to hash
        base: The base to use (a small prime; ideally at least the alphabet size)
        mod: Large prime modulus to keep values bounded

    Returns:
        Hash value in range [0, mod-1]
    """
    hash_value = 0
    for char in s:
        # hash = (hash * base + char_value) mod q
        # This is Horner's method for polynomial evaluation
        hash_value = (hash_value * base + ord(char)) % mod
    return hash_value


# Now different strings produce different hashes
print(f"hash('ABC') = {polynomial_hash('ABC')}")        # 64578
print(f"hash('CBA') = {polynomial_hash('CBA')}")        # 66498
print(f"hash('BAC') = {polynomial_hash('BAC')}")        # 65508

# Same string always produces same hash (deterministic)
print(f"hash('ABC') again = {polynomial_hash('ABC')}")  # 64578
```

The code uses Horner's method to compute the polynomial: instead of computing each power of b separately (expensive), we iteratively multiply and add: hash = (...((s₀ × b + s₁) × b + s₂) × b + ...) × b + sₘ₋₁. This gives O(m) time with constant extra memory.
The polynomial hash we've just seen has a crucial property: it can be updated in O(1) time as our window slides. Let's understand why this works mathematically.

The rolling hash update formula:

Suppose we have computed:

```
H₁ = hash(s[i:i+m]) = s[i]×bᵐ⁻¹ + s[i+1]×bᵐ⁻² + ... + s[i+m-1]×b⁰
```

We want to compute:

```
H₂ = hash(s[i+1:i+1+m]) = s[i+1]×bᵐ⁻¹ + s[i+2]×bᵐ⁻² + ... + s[i+m]×b⁰
```

The relationship between them:

```
H₂ = (H₁ - s[i]×bᵐ⁻¹) × b + s[i+m]
```

Breaking this down:
1. Subtract: Remove the leftmost character's contribution: s[i]×bᵐ⁻¹
2. Shift: Multiply by b to shift all remaining terms one position left
3. Add: Add the new rightmost character: s[i+m]×b⁰

All three operations are O(1) (assuming we precompute bᵐ⁻¹).
Visual representation:

```
Window at position i:  [A][B][C][D][E]
                        ↓  ↓  ↓  ↓  ↓
Hash components:       A×b⁴ + B×b³ + C×b² + D×b¹ + E×b⁰

Slide window right:    [B][C][D][E][F]
                        ↓  ↓  ↓  ↓  ↓
New hash:              B×b⁴ + C×b³ + D×b² + E×b¹ + F×b⁰

Update formula:
1. Remove A's contribution: (old_hash - A×b⁴)
2. Shift left: × b
3. Add F at rightmost: + F×b⁰
```

This transformation takes constant time regardless of window size. A 10-character window updates as quickly as a 10,000-character window.
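A quick numeric check of the update formula (a small illustration using the polynomial hash without a modulus, so the arithmetic stays exact and easy to inspect; poly_hash is an unmodded helper defined just for this check):

```python
BASE = 31

def poly_hash(s: str) -> int:
    """Plain polynomial hash with no modulus, for illustration only."""
    h = 0
    for c in s:
        h = h * BASE + ord(c)
    return h

text, m = "ABCDEF", 5
old_hash = poly_hash(text[0:5])  # hash("ABCDE")

# Remove 'A', shift everything left by one position, append 'F'
rolled = (old_hash - ord("A") * BASE ** (m - 1)) * BASE + ord(text[5])

print(rolled == poly_hash(text[1:6]))  # True: equals hash("BCDEF")
```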
The polynomial structure isn't arbitrary—it's specifically designed to enable these O(1) rolling updates. The base-weighted positions act like digits in a number system, where shifting left is multiplication and shifting right is division. This mathematical elegance is why polynomial hashing became the standard for rolling hash algorithms.
Now we can assemble the complete framework for hash-based pattern matching:

Algorithm skeleton:

1. Preprocessing (O(m)):
   - Compute pattern_hash = hash(pattern)
   - Compute bᵐ⁻¹ mod q for use in rolling updates
   - Compute text_hash = hash(text[0:m]) for the first window

2. Matching (O(n)):
   - Compare text_hash with pattern_hash
   - If equal: verify with character-by-character comparison
   - Roll the window: update text_hash using the O(1) formula
   - Repeat until end of text

This gives O(n+m) expected time. We'll analyze this precisely in a later page, including understanding when and why the worst case can degrade.
```python
def rabin_karp_conceptual(text: str, pattern: str) -> list[int]:
    """
    Conceptual implementation of hash-based pattern matching.
    This shows the structure; the full rolling hash comes in the next page.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    base = 31
    mod = 10**9 + 7

    # Step 1: Compute pattern hash (O(m))
    pattern_hash = 0
    for char in pattern:
        pattern_hash = (pattern_hash * base + ord(char)) % mod

    # Step 2: Compute initial window hash (O(m))
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Step 3: Precompute b^(m-1) for rolling (O(m))
    base_power = pow(base, m - 1, mod)

    matches = []

    # Step 4: Slide window and compare (O(n))
    for i in range(n - m + 1):
        # Hash comparison: O(1)
        if window_hash == pattern_hash:
            # Potential match - verify with actual comparison: O(m)
            if text[i:i+m] == pattern:
                matches.append(i)

        # Roll the hash to the next position: O(1)
        if i < n - m:
            # Remove leftmost character, shift, add new rightmost
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod

    return matches


# Example usage
text = "ABCABCABC"
pattern = "ABC"
print(f"Pattern '{pattern}' found at positions: {rabin_karp_conceptual(text, pattern)}")
# Output: Pattern 'ABC' found at positions: [0, 3, 6]
```

Key observations from this framework:

1. Hash comparisons dominate: We perform n-m+1 hash comparisons, each O(1)
2. Character comparisons are rare: Only when hashes match do we verify
3. The rolling update is essential: Without it, computing each window's hash would take O(m)
4. Modular arithmetic is crucial: Without mod, hash values would grow exponentially large

The next page explores the rolling hash technique in complete detail, including handling negative modular values and optimizing the implementation.
We've established the foundational concepts that make Rabin-Karp possible. Let's consolidate what we've learned:

- Hashing turns string comparison into integer comparison: a single O(1) check per position instead of up to m character comparisons.
- Collisions are unavoidable (pigeonhole principle), so matching hashes only signal a potential match that must be verified; differing hashes definitively rule a position out.
- Recomputing each window's hash from scratch would cost O(nm), no better than the naive algorithm; the rolling update is what makes hashing pay off.
- The polynomial hash weights each character by its position, avoiding the permutation collisions of a sum-based hash, and its structure allows O(1) updates as the window slides.
What's next:

In the next page, we'll dive deep into the rolling hash technique—the mathematical machinery that makes O(1) hash updates possible. We'll explore the precise formulas, handle edge cases with modular arithmetic, and build a production-ready implementation.
You now understand why hashing is a powerful paradigm for pattern matching and why the polynomial hash function is the right choice. The conceptual foundation is set—next, we master the rolling hash mechanics that make Rabin-Karp truly efficient.