We've seen how the naive algorithm compares patterns character by character, and how the KMP algorithm cleverly uses prefix information to avoid redundant comparisons. But what if we could take an entirely different approach—one that doesn't compare characters at all?
The radical insight: Instead of asking "do these two strings have the same characters?", what if we asked "do these two strings have the same fingerprint?"
This is the foundational idea behind the Rabin-Karp algorithm, developed by Michael O. Rabin and Richard M. Karp in 1987. Rather than comparing strings directly, we compute a hash value (a numerical fingerprint) for both the pattern and each candidate substring in the text. If the hash values match, we have a potential match; if they differ, we can instantly rule out that position.
By the end of this page, you will understand: (1) why hashing is a powerful paradigm for string comparison, (2) how hash functions transform the pattern matching problem, (3) the key properties that make a hash function suitable for string matching, and (4) the conceptual framework that makes Rabin-Karp possible.
Before introducing the hashing approach, let's crystallize exactly what makes string comparison expensive and why a different paradigm might help.
The fundamental challenge:
When we compare a pattern P of length m against a substring of text T, the naive approach examines characters one by one:
Pattern:    A B C D E
Substring:  A B C D X
            ✓ ✓ ✓ ✓ ✗
Each comparison costs O(1), but we might need up to m comparisons before discovering a mismatch—and that mismatch might occur at the very last character. For n-m+1 possible starting positions, this yields O(nm) total work in the worst case.
Why this matters:
The core inefficiency isn't just the character comparisons—it's that we learn nothing global from local comparisons. When we compare ABCDE against ABCDX and find they differ at position 5, we've gained no information about how ABCDE compares to any other substring. We start from scratch at the next position.
| Strategy | What It Learns | Information Reuse |
|---|---|---|
| Naive | Single character equality | None—starts fresh each position |
| KMP | Prefix structure of pattern | Shifts based on failure function |
| Hashing | Numerical fingerprint equality | Reuses computation across positions |
The key observation:
What if, instead of character-by-character comparison, we could reduce each m-character string to a single number? Then string comparison becomes integer comparison—a single O(1) operation, regardless of string length.
This is precisely what hash functions enable.
A hash function is a mathematical function that maps data of arbitrary size to fixed-size values. For string matching, we need a hash function h that takes a string and produces an integer:
h("HELLO") → 7429831
h("WORLD") → 2198457
h("HELLO") → 7429831 (same input always gives same output)
Essential properties for string matching:

- Deterministic: the same string always produces the same hash value
- Fast to compute: hashing an m-character string should take O(m) time
- Well-distributed: different strings should rarely share a hash value
By the pigeonhole principle, if we hash more strings than there are possible hash values, some strings must share a hash. For example, if our hash function produces 32-bit integers (about 4 billion values), hashing more than 4 billion distinct strings guarantees collisions. The Rabin-Karp algorithm must handle this gracefully.
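To see the pigeonhole principle in action, here is a small sketch that forces collisions by shrinking the hash range to just 100 possible values (`tiny_hash` is purely illustrative, not a hash anyone would use):

```python
def tiny_hash(s: str, mod: int = 100) -> int:
    """Polynomial-style hash squeezed into only 100 possible values."""
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) % mod
    return h

# 101 distinct strings, but only 100 possible hash values:
# at least two of them MUST collide (pigeonhole principle).
seen: dict[int, str] = {}
for i in range(101):
    s = f"key{i}"
    h = tiny_hash(s)
    if h in seen:
        print(f"Collision: {seen[h]!r} and {s!r} both hash to {h}")
        break
    seen[h] = s
```

A real hash function uses a far larger range (e.g. a prime near 10⁹), which makes collisions rare rather than impossible.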
The hash-based comparison paradigm:
With a good hash function, our pattern matching strategy becomes:
1. Compute hash(pattern) — the fingerprint we're looking for
2. For each position i, compute hash(text[i:i+m]) — the fingerprint of the current substring
3. If the fingerprints match, verify the match character by character

Step 3 is crucial: matching hashes don't guarantee matching strings (due to collisions), but differing hashes do guarantee non-matching strings. We use hash comparison as a fast filter.
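The three-step paradigm can be sketched directly with Python's built-in `hash` (the function name `find_with_hash_filter` is just illustrative). Note that within a single Python process `hash` is deterministic, so equal strings always produce equal hashes:

```python
def find_with_hash_filter(text: str, pattern: str) -> list[int]:
    """Pattern matching with a hash used as a fast filter.

    Caveat: computing hash(window) still reads all m characters,
    so this sketch is not yet faster than naive matching overall.
    """
    n, m = len(text), len(pattern)
    target = hash(pattern)
    matches = []
    for i in range(n - m + 1):
        window = text[i:i + m]
        if hash(window) == target:   # O(1) integer comparison as a filter
            if window == pattern:    # verify to rule out collisions
                matches.append(i)
    return matches

print(find_with_hash_filter("ABABCABAB", "ABAB"))  # [0, 5]
```

The caveat in the docstring is the subject of the next section: without a way to reuse work between windows, hashing alone buys us nothing.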
To understand the power of hashing, we need to analyze what happens in the best case versus the worst case for pattern matching.
Naive approach analysis:

- Best case: a mismatch at the first character of each window, roughly O(n) total
- Worst case: up to m comparisons per position before a mismatch, O(nm) total

Hash-based approach analysis:

- Every position: a single integer comparison, O(1) each
- Only on a hash match: an O(m) character-by-character verification to rule out a collision
The key insight: Most positions won't match. If hashes are uniformly distributed and collisions are rare, the hash comparison filters out nearly all non-matching positions in O(1) each.
A concrete example:
Suppose we're searching for a 100-character pattern in a 1,000,000-character text.
Naive approach (worst case):

- Up to (1,000,000 − 100 + 1) × 100 ≈ 10⁸ character comparisons

Hash-based approach (expected case):

- About 10⁶ hash comparisons, one per window, plus an O(m) verification only where hashes match
The hash-based approach is approximately 100× faster in expected case, scaling with text length rather than the product of text and pattern lengths.
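A quick back-of-the-envelope check of these counts:

```python
n, m = 1_000_000, 100
positions = n - m + 1

naive_ops = positions * m   # worst case: up to m character comparisons per window
hash_ops = positions        # expected case: one O(1) hash comparison per window

print(naive_ops)               # 99990100
print(hash_ops)                # 999901
print(naive_ops // hash_ops)   # 100
```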
There's a critical problem with the naive hashing approach we've described so far. Let's think carefully about the cost of computing hash values.
The hidden cost:
If computing hash(text[i:i+m]) requires reading all m characters of the substring, then computing hashes for all n-m+1 positions costs:
(n - m + 1) × m = O(nm)
This is exactly as expensive as the naive algorithm! We've gained nothing.
The critical question:
Can we compute the hash of text[i+1:i+1+m] more efficiently if we already know the hash of text[i:i+m]?
In other words: can we update a hash value incrementally as our window slides, rather than recomputing it from scratch?
The entire viability of hash-based pattern matching depends on finding a hash function that can be updated in O(1) time as the window slides. Without this property—called the 'rolling' property—hashing offers no advantage over naive comparison. This is exactly what the polynomial rolling hash provides.
The sliding window insight:
Consider two consecutive substrings of text:
Position i: [a₁ a₂ a₃ ... aₘ]
Position i+1: [a₂ a₃ ... aₘ aₘ₊₁]
These substrings share m-1 characters. A clever hash function should be able to:

1. Remove the contribution of the outgoing character a₁ in O(1) time
2. Add the contribution of the incoming character aₘ₊₁ in O(1) time
If both operations take O(1) time, then updating the hash takes O(1) time, and computing all n-m+1 hashes takes only O(n) time total.
This is the essence of the rolling hash technique that we'll explore in detail in the next page.
Before introducing the polynomial rolling hash, let's examine a simpler hash function to build intuition about why hash design matters.
Sum-based hash:
The simplest hash function just sums the character values:
hash("ABC") = ord('A') + ord('B') + ord('C') = 65 + 66 + 67 = 198
Advantages:

- Trivially simple to implement and reason about
- Computable in O(m) time
- Naturally "rolls": sliding the window right just subtracts the outgoing character and adds the incoming one
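One genuine virtue of the sum hash is that it already rolls in O(1), previewing the sliding-window update we want (the helper names here are illustrative):

```python
def sum_hash(s: str) -> int:
    """Sum-of-character-codes hash (illustration only)."""
    return sum(ord(c) for c in s)

def roll_sum_hash(old_hash: int, outgoing: str, incoming: str) -> int:
    """O(1) update when the window slides one position right."""
    return old_hash - ord(outgoing) + ord(incoming)

text = "HELLO"
h = sum_hash(text[0:3])                 # hash of "HEL"
h = roll_sum_hash(h, text[0], text[3])  # now hash of "ELL"
print(h == sum_hash(text[1:4]))         # True
```

Unfortunately, as the next example shows, what the sum hash gains in rollability it loses in collision resistance.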
Fatal flaw:
```python
def sum_hash(s):
    """Simple sum-based hash - demonstrating high collision rate."""
    return sum(ord(c) for c in s)

# All these produce the same hash!
print(sum_hash("ABC"))  # 198
print(sum_hash("BAC"))  # 198
print(sum_hash("CBA"))  # 198

# Even completely different strings can collide
print(sum_hash("AZ"))  # 155
print(sum_hash("BY"))  # 155
print(sum_hash("CX"))  # 155

# For pattern matching, this would cause massive false positives
```

Why position matters:
The sum-based hash fails because it ignores character positions. In pattern matching, "ABC" and "CBA" are completely different strings that should never match. We need a hash function where character positions affect the output.
The solution:
Polynomial hashing treats the string as a number in some base b, where each character's position determines its weight:
hash("ABC") = ord('A')×b² + ord('B')×b¹ + ord('C')×b⁰
Now "ABC" and "CBA" produce different hashes because characters at different positions contribute differently. This is the foundation of the rolling hash we'll explore next.
The solution to our collision problem is the polynomial hash function, which treats strings as numbers in a chosen base. This is the hash function used by Rabin-Karp and many other string algorithms.
Definition:
For a string s = s₀s₁s₂...sₘ₋₁ of length m, the polynomial hash is:
hash(s) = (s₀ × bᵐ⁻¹ + s₁ × bᵐ⁻² + ... + sₘ₋₂ × b¹ + sₘ₋₁ × b⁰) mod q
Where:

- b is the base, typically a small prime such as 31, or a value larger than the alphabet size such as 256
- q is a large prime modulus that keeps hash values bounded, e.g. 10⁹ + 7
- s₀, s₁, ..., sₘ₋₁ are the numeric character values (e.g. ASCII codes)
Why this works:
Consider hashing "ABC" with base b = 31:
hash("ABC") = 65×31² + 66×31¹ + 67×31⁰
= 65×961 + 66×31 + 67×1
= 62465 + 2046 + 67
= 64578
hash("CBA") = 67×31² + 66×31¹ + 65×31⁰
= 67×961 + 66×31 + 65×1
= 64387 + 2046 + 65
= 66498
Different hashes for different strings!
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """
    Polynomial hash function for strings.

    Treats string as a number in the given base.
    Uses modular arithmetic to prevent integer overflow.

    Args:
        s: The string to hash
        base: The base to use (commonly a small prime like 31)
        mod: Large prime modulus to keep values bounded

    Returns:
        Hash value in range [0, mod-1]
    """
    hash_value = 0
    for char in s:
        # hash = (hash * base + char_value) mod q
        # This is Horner's method for polynomial evaluation
        hash_value = (hash_value * base + ord(char)) % mod
    return hash_value


# Now different strings produce different hashes
print(f"hash('ABC') = {polynomial_hash('ABC')}")  # 64578
print(f"hash('CBA') = {polynomial_hash('CBA')}")  # 66498
print(f"hash('BAC') = {polynomial_hash('BAC')}")  # 65508

# Same string always produces same hash (deterministic)
print(f"hash('ABC') again = {polynomial_hash('ABC')}")  # 64578
```

The code uses Horner's method to compute the polynomial: instead of computing each power of b separately (expensive), we iteratively multiply and add: hash = (...((s₀ × b + s₁) × b + s₂) × b + ...) × b + sₘ₋₁. This gives O(m) time with constant extra memory.
The polynomial hash we've just seen has a crucial property: it can be updated in O(1) time as our window slides. Let's understand why this works mathematically.
The rolling hash update formula:
Suppose we have computed:
H₁ = hash(s[i:i+m]) = s[i]×bᵐ⁻¹ + s[i+1]×bᵐ⁻² + ... + s[i+m-1]×b⁰
We want to compute:
H₂ = hash(s[i+1:i+1+m]) = s[i+1]×bᵐ⁻¹ + s[i+2]×bᵐ⁻² + ... + s[i+m]×b⁰
The relationship between them:
H₂ = (H₁ - s[i]×bᵐ⁻¹) × b + s[i+m]
Breaking this down:
1. Subtract s[i]×bᵐ⁻¹ — removes the outgoing character's contribution
2. Multiply by b — shifts every remaining character's weight up one position
3. Add s[i+m]×b⁰ — appends the incoming character at the rightmost position

All three operations are O(1) (assuming we precompute bᵐ⁻¹).
Visual representation:
Window at position i: [A][B][C][D][E]
↓ ↓ ↓ ↓ ↓
Hash components: A×b⁴ + B×b³ + C×b² + D×b¹ + E×b⁰
Slide window right: [B][C][D][E][F]
↓ ↓ ↓ ↓ ↓
New hash: B×b⁴ + C×b³ + D×b² + E×b¹ + F×b⁰
Update formula:
1. Remove A's contribution: (old_hash - A×b⁴)
2. Shift left: × b
3. Add F at rightmost: + F×b⁰
This transformation takes constant time regardless of window size. A 10-character window updates as quickly as a 10,000-character window.
The polynomial structure isn't arbitrary—it's specifically designed to enable these O(1) rolling updates. The base-weighted positions act like digits in a number system, where shifting left is multiplication and shifting right is division. This mathematical elegance is why polynomial hashing became the standard for rolling hash algorithms.
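We can sanity-check the update formula numerically, reusing the polynomial hash from earlier (a standalone sketch with the same base and modulus):

```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """Horner's-method polynomial hash."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

base, mod = 31, 10**9 + 7
s = "ABCDEF"
m = 5

h1 = polynomial_hash(s[0:5])        # hash of "ABCDE"
base_power = pow(base, m - 1, mod)  # b^(m-1) mod q, precomputed

# Roll: remove 'A', shift left by one base position, add 'F'
h2 = ((h1 - ord(s[0]) * base_power) * base + ord(s[5])) % mod

print(h2 == polynomial_hash(s[1:6]))  # True
```

Python's `%` operator always returns a non-negative result for a positive modulus, so the subtraction going negative is handled automatically; languages like C or Java need an explicit adjustment, which we'll cover next page.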
Now we can assemble the complete framework for hash-based pattern matching:
Algorithm skeleton:
Preprocessing (O(m)):
1. Compute pattern_hash = hash(pattern)
2. Precompute bᵐ⁻¹ mod q for use in rolling updates
3. Compute text_hash = hash(text[0:m]) for the first window

Matching (O(n)):

1. Compare text_hash with pattern_hash
2. On a hash match, verify the substring character by character
3. Roll text_hash to the next position using the O(1) formula

This gives O(n+m) expected time. We'll analyze this precisely in a later page, including understanding when and why the worst case can degrade.
```python
def rabin_karp_conceptual(text: str, pattern: str) -> list[int]:
    """
    Conceptual implementation of hash-based pattern matching.
    This shows the structure; full rolling hash comes in next page.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    base = 31
    mod = 10**9 + 7

    # Step 1: Compute pattern hash (O(m))
    pattern_hash = 0
    for char in pattern:
        pattern_hash = (pattern_hash * base + ord(char)) % mod

    # Step 2: Compute initial window hash (O(m))
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Step 3: Precompute b^(m-1) for rolling (O(m))
    base_power = pow(base, m - 1, mod)

    matches = []

    # Step 4: Slide window and compare (O(n))
    for i in range(n - m + 1):
        # Hash comparison: O(1)
        if window_hash == pattern_hash:
            # Potential match - verify with actual comparison: O(m)
            if text[i:i+m] == pattern:
                matches.append(i)

        # Roll the hash to next position: O(1)
        if i < n - m:
            # Remove leftmost character, shift, add new rightmost
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod

    return matches


# Example usage
text = "ABCABCABC"
pattern = "ABC"
print(f"Pattern '{pattern}' found at positions: {rabin_karp_conceptual(text, pattern)}")
# Output: Pattern 'ABC' found at positions: [0, 3, 6]
```

Key observations from this framework:

- The hash comparison acts as a fast filter; character-by-character verification runs only on hash matches, guarding against collisions
- The rolling update keeps each slide at O(1), so the main loop costs O(n) plus verification work
- Because every hash match is verified, the algorithm never reports a false match
The next page explores the rolling hash technique in complete detail, including handling negative modular values and optimizing the implementation.
We've established the foundational concepts that make Rabin-Karp possible. Let's consolidate what we've learned:

- Hashing reduces m-character string comparison to a single integer comparison
- Collisions are unavoidable (pigeonhole principle), so every hash match must be verified
- Naively rehashing each window costs O(nm)—no better than brute force
- The polynomial hash is position-sensitive and supports O(1) rolling updates, giving O(n+m) expected time
What's next:
In the next page, we'll dive deep into the rolling hash technique—the mathematical machinery that makes O(1) hash updates possible. We'll explore the precise formulas, handle edge cases with modular arithmetic, and build a production-ready implementation.
You now understand why hashing is a powerful paradigm for pattern matching and why the polynomial hash function is the right choice. The conceptual foundation is set—next, we master the rolling hash mechanics that make Rabin-Karp truly efficient.