We've seen how the naive algorithm compares patterns character by character, and how the KMP algorithm cleverly uses prefix information to avoid redundant comparisons. But what if we could take an entirely different approach—one that doesn't compare characters at all?

The radical insight: Instead of asking "do these two strings have the same characters?", what if we asked "do these two strings have the same fingerprint?"

This is the foundational idea behind the Rabin-Karp algorithm, developed by Michael O. Rabin and Richard M. Karp in 1987. Rather than comparing strings directly, we compute a hash value (a numerical fingerprint) for both the pattern and each candidate substring in the text. If the hash values match, we have a potential match; if they differ, we can instantly rule out that position.
By the end of this page, you will understand: (1) why hashing is a powerful paradigm for string comparison, (2) how hash functions transform the pattern matching problem, (3) the key properties that make a hash function suitable for string matching, and (4) the conceptual framework that makes Rabin-Karp possible.
Before introducing the hashing approach, let's crystallize exactly what makes string comparison expensive and why a different paradigm might help.

The fundamental challenge:

When we compare a pattern P of length m against a substring of text T, the naive approach examines characters one by one:

```
Pattern:    A B C D E
Substring:  A B C D X
            ✓ ✓ ✓ ✓ ✗
```

Each comparison costs O(1), but we might need up to m comparisons before discovering a mismatch—and that mismatch might occur at the very last character. For n-m+1 possible starting positions, this yields O(nm) total work in the worst case.

Why this matters:

The core inefficiency isn't just the character comparisons—it's that we learn nothing global from local comparisons. When we compare ABCDE against ABCDX and find they differ at position 5, we've gained no information about how ABCDE compares to any other substring. We start from scratch at the next position.
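To make that cost concrete, here is a minimal sketch of the naive matcher just described (a plain reference implementation for comparison—the name naive_match is ours, purely for illustration):

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Character-by-character matching: up to m comparisons at each of n-m+1 positions."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # every possible starting position
        for j in range(m):              # up to m character comparisons
            if text[i + j] != pattern[j]:
                break                   # a mismatch here tells us nothing about other positions
        else:
            matches.append(i)           # all m characters matched
    return matches

print(naive_match("ABCDEABCDX", "ABCDE"))  # [0]
```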
| Strategy | What It Learns | Information Reuse |
|---|---|---|
| Naive | Single character equality | None—starts fresh each position |
| KMP | Prefix structure of pattern | Shifts based on failure function |
| Hashing | Numerical fingerprint equality | Reuses computation across positions |
The key observation:

What if, instead of character-by-character comparison, we could reduce each m-character string to a single number? Then string comparison becomes integer comparison—a single O(1) operation, regardless of string length.

This is precisely what hash functions enable.
A hash function is a mathematical function that maps data of arbitrary size to fixed-size values. For string matching, we need a hash function h that takes a string and produces an integer:

```
h("HELLO") → 7429831
h("WORLD") → 2198457
h("HELLO") → 7429831   (same input always gives same output)
```

Essential properties for string matching:

1. Deterministic: the same string always produces the same hash value.
2. Efficient to compute: hashing a string of length m should take at most O(m) time.
3. Low collision rate: different strings should rarely share a hash value.
4. Updatable: ideally, the hash of the next window can be derived cheaply from the hash of the current one (the "rolling" property developed below).
By the pigeonhole principle, if we hash more strings than there are possible hash values, some strings must share a hash. For example, if our hash function produces 32-bit integers (about 4 billion values), hashing more than 4 billion distinct strings guarantees collisions. The Rabin-Karp algorithm must handle this gracefully.
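A tiny demonstration of the pigeonhole principle in action, using a deliberately small hash range so a collision is guaranteed (tiny_hash is a throwaway helper just for this illustration):

```python
def tiny_hash(s: str, buckets: int = 10) -> int:
    """Map a string into one of only 10 possible hash values."""
    return sum(ord(c) for c in s) % buckets

seen: dict[int, str] = {}
words = ["cat", "dog", "bird", "fish", "lion", "bear",
         "wolf", "deer", "frog", "crab", "seal"]
for word in words:  # 11 distinct strings, only 10 possible hash values
    h = tiny_hash(word)
    if h in seen:
        # With these words, 'dog' and 'lion' both land in bucket 4
        print(f"Collision: {seen[h]!r} and {word!r} both hash to {h}")
        break
    seen[h] = word
```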
The hash-based comparison paradigm:

With a good hash function, our pattern matching strategy becomes:

1. Compute hash(pattern) — the fingerprint we're looking for
2. For each position i in text:
   - Compute hash(text[i:i+m]) — fingerprint of current substring
   - If hashes differ → definitely not a match, move on
   - If hashes match → possibly a match, verify with character comparison

The verification step is crucial: matching hashes don't guarantee matching strings (due to collisions), but differing hashes do guarantee non-matching strings. We use hash comparison as a fast filter.
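Here is a minimal sketch of that filter-and-verify loop, written against an arbitrary string hash function h (find_matches is an illustrative name; note that this version still recomputes each window's hash from scratch—exactly the cost addressed below):

```python
from typing import Callable

def find_matches(text: str, pattern: str, h: Callable[[str], int]) -> list[int]:
    """Filter candidates by hash equality, then verify them character by character."""
    m = len(pattern)
    target = h(pattern)               # the fingerprint we're looking for
    matches = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if h(window) == target:       # fast filter: differing hashes can never match
            if window == pattern:     # verification guards against hash collisions
                matches.append(i)
    return matches

# Works with any deterministic string hash; Python's built-in hash() is
# deterministic within a single run.
print(find_matches("ABABCABAB", "ABAB", hash))  # [0, 5]
```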
To understand the power of hashing, we need to analyze what happens in the best case versus the worst case for pattern matching.

Naive approach analysis:
- At each position, we compare up to m characters
- n-m+1 positions to check
- Worst case: O(nm) total comparisons
- Most comparisons are wasted on non-matching substrings

Hash-based approach analysis:
- Compute pattern hash once: O(m)
- At each position, compare two integers: O(1)
- n-m+1 positions to check: O(n) hash comparisons
- Only verify (with O(m) character comparison) when hashes match

The key insight: Most positions won't match. If hashes are uniformly distributed and collisions are rare, the hash comparison filters out nearly all non-matching positions in O(1) each.
A concrete example:

Suppose we're searching for a 100-character pattern in a 1,000,000-character text.

Naive approach (worst case):
- ~999,901 positions × 100 comparisons = ~100 million character comparisons

Hash-based approach (expected case):
- 1 pattern hash computation: 100 operations
- ~999,901 hash comparisons: ~1 million integer comparisons
- Assume collision rate of 1/1,000,000: maybe 1-2 false positives to verify
- Total: ~1 million operations

The hash-based approach is approximately 100× faster in the expected case, scaling with text length rather than the product of text and pattern lengths.
There's a critical problem with the naive hashing approach we've described so far. Let's think carefully about the cost of computing hash values.

The hidden cost:

If computing hash(text[i:i+m]) requires reading all m characters of the substring, then computing hashes for all n-m+1 positions costs:

```
(n - m + 1) × m = O(nm)
```

This is exactly as expensive as the naive algorithm! We've gained nothing.

The critical question:

Can we compute the hash of text[i+1:i+1+m] more efficiently if we already know the hash of text[i:i+m]?

In other words: can we update a hash value incrementally as our window slides, rather than recomputing it from scratch?
The entire viability of hash-based pattern matching depends on finding a hash function that can be updated in O(1) time as the window slides. Without this property—called the 'rolling' property—hashing offers no advantage over naive comparison. This is exactly what the polynomial rolling hash provides.
The sliding window insight:

Consider two consecutive substrings of text:

```
Position i:   [a₁ a₂ a₃ ... aₘ]
Position i+1:    [a₂ a₃ ... aₘ aₘ₊₁]
```

These substrings share m-1 characters. A clever hash function should be able to:
1. "Remove" the contribution of a₁ (the character leaving the window)
2. "Add" the contribution of aₘ₊₁ (the character entering the window)

If both operations take O(1) time, then updating the hash takes O(1) time, and computing all n-m+1 hashes takes only O(n) time total.

This is the essence of the rolling hash technique that we'll explore in detail in the next page.
Before introducing the polynomial rolling hash, let's examine a simpler hash function to build intuition about why hash design matters.

Sum-based hash:

The simplest hash function just sums the character values:

```
hash("ABC") = ord('A') + ord('B') + ord('C') = 65 + 66 + 67 = 198
```

Advantages:
- Easy to compute: O(m) for a string of length m
- Easy to update in O(1): hash("BCD") = hash("ABC") - ord('A') + ord('D')

Fatal flaw:
- Produces the same hash for any permutation of characters
- hash("ABC") = hash("BAC") = hash("CBA") = 198
- Collision rate is catastrophically high
```python
def sum_hash(s):
    """Simple sum-based hash - demonstrating high collision rate."""
    return sum(ord(c) for c in s)

# All these produce the same hash!
print(sum_hash("ABC"))  # 198
print(sum_hash("BAC"))  # 198
print(sum_hash("CBA"))  # 198

# Even completely different strings can collide
print(sum_hash("AZ"))   # 155
print(sum_hash("BY"))   # 155
print(sum_hash("CX"))   # 155

# For pattern matching, this would cause massive false positives
```

Why position matters:

The sum-based hash fails because it ignores character positions. In pattern matching, "ABC" and "CBA" are completely different strings that should never match. We need a hash function where character positions affect the output.

The solution:

Polynomial hashing treats the string as a number in some base b, where each character's position determines its weight:

```
hash("ABC") = ord('A')×b² + ord('B')×b¹ + ord('C')×b⁰
```

Now "ABC" and "CBA" produce different hashes because characters at different positions contribute differently. This is the foundation of the rolling hash we'll explore next.
The solution to our collision problem is the polynomial hash function, which treats strings as numbers in a chosen base. This is the hash function used by Rabin-Karp and many other string algorithms.

Definition:

For a string s = s₀s₁s₂...sₘ₋₁ of length m, the polynomial hash is:

```
hash(s) = (s₀ × bᵐ⁻¹ + s₁ × bᵐ⁻² + ... + sₘ₋₂ × b¹ + sₘ₋₁ × b⁰) mod q
```

Where:
- b is the base (typically a prime slightly larger than the alphabet size)
- q is a large prime modulus (to keep hash values bounded)
- sᵢ is the numerical value of the character at position i

Why this works:

Consider hashing "ABC" with base b = 31:

```
hash("ABC") = 65×31² + 66×31¹ + 67×31⁰
            = 65×961 + 66×31 + 67×1
            = 62465 + 2046 + 67
            = 64578

hash("CBA") = 67×31² + 66×31¹ + 65×31⁰
            = 67×961 + 66×31 + 65×1
            = 64387 + 2046 + 65
            = 66498
```

Different hashes for different strings!
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """
    Polynomial hash function for strings.

    Treats string as a number in the given base.
    Uses modular arithmetic to prevent integer overflow.

    Args:
        s: The string to hash
        base: The base to use (a small prime; ideally at least the alphabet size)
        mod: Large prime modulus to keep values bounded

    Returns:
        Hash value in range [0, mod-1]
    """
    hash_value = 0
    for char in s:
        # hash = (hash * base + char_value) mod q
        # This is Horner's method for polynomial evaluation
        hash_value = (hash_value * base + ord(char)) % mod
    return hash_value


# Now different strings produce different hashes
print(f"hash('ABC') = {polynomial_hash('ABC')}")        # 64578
print(f"hash('CBA') = {polynomial_hash('CBA')}")        # 66498
print(f"hash('BAC') = {polynomial_hash('BAC')}")        # 65508

# Same string always produces same hash (deterministic)
print(f"hash('ABC') again = {polynomial_hash('ABC')}")  # 64578
```

The code uses Horner's method to compute the polynomial: instead of computing each power of b separately (expensive), we iteratively multiply and add: hash = (...((s₀ × b + s₁) × b + s₂) × b + ...) × b + sₘ₋₁. This gives O(m) time with constant extra memory.
The polynomial hash we've just seen has a crucial property: it can be updated in O(1) time as our window slides. Let's understand why this works mathematically.

The rolling hash update formula:

Suppose we have computed:

```
H₁ = hash(s[i:i+m]) = s[i]×bᵐ⁻¹ + s[i+1]×bᵐ⁻² + ... + s[i+m-1]×b⁰
```

We want to compute:

```
H₂ = hash(s[i+1:i+1+m]) = s[i+1]×bᵐ⁻¹ + s[i+2]×bᵐ⁻² + ... + s[i+m]×b⁰
```

The relationship between them:

```
H₂ = (H₁ - s[i]×bᵐ⁻¹) × b + s[i+m]
```

Breaking this down:
1. Subtract: Remove the leftmost character's contribution: s[i]×bᵐ⁻¹
2. Shift: Multiply by b to shift all remaining terms one position left
3. Add: Add the new rightmost character: s[i+m]×b⁰

All three operations are O(1) (assuming we precompute bᵐ⁻¹).
Visual representation:

```
Window at position i:  [A][B][C][D][E]
                        ↓  ↓  ↓  ↓  ↓
Hash components:       A×b⁴ + B×b³ + C×b² + D×b¹ + E×b⁰

Slide window right:    [B][C][D][E][F]
                        ↓  ↓  ↓  ↓  ↓
New hash:              B×b⁴ + C×b³ + D×b² + E×b¹ + F×b⁰

Update formula:
1. Remove A's contribution: (old_hash - A×b⁴)
2. Shift left: × b
3. Add F at rightmost: + F×b⁰
```

This transformation takes constant time regardless of window size. A 10-character window updates as quickly as a 10,000-character window.
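A quick numeric check of the update formula (a small illustration using the polynomial hash without a modulus, so the arithmetic stays exact and easy to inspect; poly_hash is an unmodded helper defined just for this check):

```python
BASE = 31

def poly_hash(s: str) -> int:
    """Plain polynomial hash with no modulus, for illustration only."""
    h = 0
    for c in s:
        h = h * BASE + ord(c)
    return h

text, m = "ABCDEF", 5
old_hash = poly_hash(text[0:5])  # hash("ABCDE")

# Remove 'A', shift everything left by one position, append 'F'
rolled = (old_hash - ord("A") * BASE ** (m - 1)) * BASE + ord(text[5])

print(rolled == poly_hash(text[1:6]))  # True: equals hash("BCDEF")
```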
The polynomial structure isn't arbitrary—it's specifically designed to enable these O(1) rolling updates. The base-weighted positions act like digits in a number system, where shifting left is multiplication and shifting right is division. This mathematical elegance is why polynomial hashing became the standard for rolling hash algorithms.
Now we can assemble the complete framework for hash-based pattern matching:

Algorithm skeleton:

1. Preprocessing (O(m)):
   - Compute pattern_hash = hash(pattern)
   - Compute bᵐ⁻¹ mod q for use in rolling updates
   - Compute text_hash = hash(text[0:m]) for the first window

2. Matching (O(n)):
   - Compare text_hash with pattern_hash
   - If equal: verify with character-by-character comparison
   - Roll the window: update text_hash using the O(1) formula
   - Repeat until end of text

This gives O(n+m) expected time. We'll analyze this precisely in a later page, including understanding when and why the worst case can degrade.
```python
def rabin_karp_conceptual(text: str, pattern: str) -> list[int]:
    """
    Conceptual implementation of hash-based pattern matching.
    This shows the structure; the full rolling hash comes in the next page.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    base = 31
    mod = 10**9 + 7

    # Step 1: Compute pattern hash (O(m))
    pattern_hash = 0
    for char in pattern:
        pattern_hash = (pattern_hash * base + ord(char)) % mod

    # Step 2: Compute initial window hash (O(m))
    window_hash = 0
    for i in range(m):
        window_hash = (window_hash * base + ord(text[i])) % mod

    # Step 3: Precompute b^(m-1) for rolling (O(m))
    base_power = pow(base, m - 1, mod)

    matches = []

    # Step 4: Slide window and compare (O(n))
    for i in range(n - m + 1):
        # Hash comparison: O(1)
        if window_hash == pattern_hash:
            # Potential match - verify with actual comparison: O(m)
            if text[i:i+m] == pattern:
                matches.append(i)

        # Roll the hash to the next position: O(1)
        if i < n - m:
            # Remove leftmost character, shift, add new rightmost
            outgoing = ord(text[i])
            incoming = ord(text[i + m])
            window_hash = ((window_hash - outgoing * base_power) * base + incoming) % mod

    return matches


# Example usage
text = "ABCABCABC"
pattern = "ABC"
print(f"Pattern '{pattern}' found at positions: {rabin_karp_conceptual(text, pattern)}")
# Output: Pattern 'ABC' found at positions: [0, 3, 6]
```

Key observations from this framework:

1. Hash comparisons dominate: We perform n-m+1 hash comparisons, each O(1)
2. Character comparisons are rare: Only when hashes match do we verify
3. The rolling update is essential: Without it, computing each window's hash would take O(m)
4. Modular arithmetic is crucial: Without mod, hash values would grow exponentially large

The next page explores the rolling hash technique in complete detail, including handling negative modular values and optimizing the implementation.
We've established the foundational concepts that make Rabin-Karp possible. Let's consolidate what we've learned:

- Hashing turns string comparison into integer comparison: a single O(1) check per position instead of up to m character comparisons.
- Collisions are unavoidable (pigeonhole principle), so matching hashes only signal a potential match that must be verified; differing hashes definitively rule a position out.
- Recomputing each window's hash from scratch would cost O(nm), no better than the naive algorithm; the rolling update is what makes hashing pay off.
- The polynomial hash weights each character by its position, avoiding the permutation collisions of a sum-based hash, and its structure allows O(1) updates as the window slides.
What's next:

In the next page, we'll dive deep into the rolling hash technique—the mathematical machinery that makes O(1) hash updates possible. We'll explore the precise formulas, handle edge cases with modular arithmetic, and build a production-ready implementation.
You now understand why hashing is a powerful paradigm for pattern matching and why the polynomial hash function is the right choice. The conceptual foundation is set—next, we master the rolling hash mechanics that make Rabin-Karp truly efficient.