Rabin-Karp Algorithm - Learning Module

Loading content...

0/276

Rolling Hash Technique

The Art of Incremental Computation

In the previous page, we established that polynomial hashing can transform string comparison from O(m) character-by-character to O(1) integer comparison. But we also identified a critical challenge: if computing each window's hash costs O(m), we lose all the efficiency gains.\n\nThe rolling hash technique solves this beautifully. It exploits the mathematical structure of polynomial hashing to update hash values incrementally as the window slides—computing each new hash in O(1) time based on the previous one, rather than recomputing from scratch.\n\nThis page provides a complete, rigorous treatment of rolling hash mechanics, including the mathematical derivation, modular arithmetic subtleties that trip up many implementations, and production-ready code patterns.

What You Will Learn

By the end of this page, you will: (1) understand the precise mathematical formula for rolling hash updates, (2) master handling negative values in modular arithmetic, (3) see the complete derivation of the O(1) update formula, (4) implement a bulletproof rolling hash, and (5) understand parameter selection for base and modulus.

The Mathematical Foundation

Let's establish the precise mathematical framework for rolling hash. We'll use notation that makes the derivation crystal clear.\n\nDefinitions:\n\n- Text: T = t₀t₁t₂...t_{n-1} (length n)\n- Pattern: P = p₀p₁p₂...p_{m-1} (length m)\n- Base: b (a prime number)\n- Modulus: q (a large prime number)\n- Character value: v(c) = ord(c) for character c\n\nPolynomial hash of a string S = s₀s₁...s_{m-1}:\n\n\nH(S) = Σᵢ₌₀^{m-1} [v(sᵢ) × b^{m-1-i}] mod q\n = v(s₀)×b^{m-1} + v(s₁)×b^{m-2} + ... + v(s_{m-1})×b⁰ mod q\n\n\nThis treats the string as a number in base b. The leftmost character has the highest weight.

The window hash notation:\n\nLet H_i denote the hash of the substring T[i:i+m]:\n\n\nH_i = H(T[i:i+m]) = Σⱼ₌₀^{m-1} [v(t_{i+j}) × b^{m-1-j}] mod q\n\n\nExpanding for a concrete example with m = 5:\n\n\nH_i = v(t_i)×b⁴ + v(t_{i+1})×b³ + v(t_{i+2})×b² + v(t_{i+3})×b¹ + v(t_{i+4})×b⁰\n\n\nSimilarly for the next window:\n\n\nH_{i+1} = v(t_{i+1})×b⁴ + v(t_{i+2})×b³ + v(t_{i+3})×b² + v(t_{i+4})×b¹ + v(t_{i+5})×b⁰\n\n\nCrucial observation: Both expressions share four terms. The structure of polynomial hashing allows us to relate H_{i+1} to H_i algebraically.

Why Polynomial Structure Enables Rolling

The key insight is that multiplying by b is equivalent to 'shifting left' in the polynomial representation. This is exactly like how multiplying a decimal number by 10 shifts all digits left. This structural property enables the O(1) update formula.

Deriving the Rolling Update Formula

Now let's derive the rolling update formula step by step. We want to express H_{i+1} in terms of H_i.\n\nStarting point:\n\n\nH_i = v(t_i)×b^{m-1} + v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰\n\n\nStep 1: Remove the outgoing character\n\nThe character t_i is leaving the window. Its contribution is v(t_i)×b^{m-1}. We subtract it:\n\n\nH_i - v(t_i)×b^{m-1} = v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰\n\n\nStep 2: Shift all terms left (multiply by b)\n\nEach remaining character needs its exponent increased by 1 (effectively shifting left in the window):\n\n\n(H_i - v(t_i)×b^{m-1}) × b = v(t_{i+1})×b^{m-1} + v(t_{i+2})×b^{m-2} + ... + v(t_{i+m-1})×b¹\n\n\nStep 3: Add the incoming character\n\nThe character t_{i+m} enters the window at the rightmost position (power b⁰):\n\n\n(H_i - v(t_i)×b^{m-1}) × b + v(t_{i+m}) = H_{i+1}\n

The rolling hash update formula:\n\n\nH_{i+1} = (H_i - v(outgoing) × b^{m-1}) × b + v(incoming) (mod q)\n\n\nWhere:\n- outgoing = t_i (character leaving the left side of window)\n- incoming = t_{i+m} (character entering the right side of window)\n- b^{m-1} is precomputed once at the start\n\nOperation count:\n- 1 multiplication: v(outgoing) × b^{m-1}\n- 1 subtraction: H_i - ...\n- 1 multiplication: ... × b\n- 1 addition: ... + v(incoming)\n- 1 modulo: ... mod q\n\nAll O(1) operations, regardless of window size m!

Compare to Direct Computation

Computing H_{i+1} from scratch would require m multiplications and m-1 additions for a total of O(m) operations. The rolling formula reduces this to exactly 5 operations, making the update truly O(1). For a pattern of length 1000, this is a 200× improvement per window position.

Modular Arithmetic Deep Dive

The rolling hash formula involves modular arithmetic, which introduces subtleties that cause many implementations to fail. Let's address them thoroughly.\n\nWhy we need modular arithmetic:\n\nWithout the mod operation, hash values would grow exponentially:\n\n\nFor a 100-character string with base 31:\nMaximum hash ≈ 127 × 31^99 ≈ 10^149\n\n\nThis vastly exceeds what any integer type can hold (64-bit integers max out around 10^19). The modulus keeps values bounded.\n\nThe negativity problem:\n\nConsider the subtraction in our formula: H_i - v(outgoing) × b^{m-1}\n\nEven though H_i was computed mod q, the term v(outgoing) × b^{m-1} might be larger than the result, producing a negative value. Different programming languages handle negative modulo differently:\n\n\nPython: (-5) % 3 = 1 (always positive)\nC/Java: (-5) % 3 = -2 (sign of dividend)\nJS: (-5) % 3 = -2 (sign of dividend)\n

1
2
3
4
5
6
7
8
9
10
11
# Python handles negative modulo correctly
print((-5) % 3)  # Output: 1 (always in range [0, q-1])
 
# But explicit handling is still good practice for clarity
def safe_mod(x: int, q: int) -> int:
    """Ensure modulo result is always non-negative."""
    return ((x % q) + q) % q
 
# Example showing the difference
print(safe_mod(-5, 3))   # Output: 1
print(safe_mod(-100, 7)) # Output: 5

The corrected rolling update formula:\n\n\nH_{i+1} = ((H_i - v(outgoing) × b^{m-1} % q) × b + v(incoming)) % q\n ↑ ↑ ↑\n Make non-negative Main computation Final reduction\n\n\nIn practice, a robust implementation ensures non-negativity at each step:\n\npython\ntemp = (H_i - v(outgoing) * power_of_base) % q\ntemp = (temp + q) % q # Handle negative result\nH_next = (temp * base + v(incoming)) % q\n

Common Bug Alert

Failing to handle negative intermediate values is the #1 bug in rolling hash implementations. Always test your implementation with cases where the outgoing character has a higher contribution than the current hash value. For example, test with text 'ZA...' where Z's contribution (high value) is subtracted early.

Choosing Base and Modulus

The choice of base b and modulus q significantly affects collision probability and numerical stability. Let's analyze the considerations.\n\nChoosing the base b:\n\nThe base should be:\n1. Larger than the alphabet size — Ensures distinct characters produce distinct contributions\n2. Prime — Reduces collision patterns from multiplicative relationships\n3. Not too large — Prevents overflow during intermediate computations\n\nCommon choices:\n- For ASCII (128 characters): b = 131 or 257\n- For lowercase English (26 characters): b = 31 or 37\n- For Unicode (65536+ characters): b = 65537\n\nThe value 31 is popular because it's prime, works well for lowercase English, and x * 31 = (x << 5) - x can be computed efficiently.

Common Base and Modulus Choices
Use Case	Base (b)	Modulus (q)	Collision Probability
General ASCII	131	10^9 + 7	~1 / 10^9
Lowercase English	31	10^9 + 7	~1 / 10^9
High precision	10007	10^18 + 9	~1 / 10^18
Competition (safe)	31 + 37 (double hash)	(10^9+7, 10^9+9)	~1 / 10^18

Choosing the modulus q:\n\nThe modulus should be:\n1. Prime — Ensures uniform distribution of hash values\n2. Large — Collision probability ≈ 1/q, larger is better\n3. Safe for multiplication — q² should not overflow your integer type\n\nCommon choices:\n- 10^9 + 7 (fits in 32 bits, q² fits in 64 bits)\n- 10^9 + 9 (another prime, often used with 10^9 + 7 for double hashing)\n- 2^61 - 1 (Mersenne prime, very efficient mod operation)\n- 10^18 + 9 (for 128-bit integers or big integer libraries)\n\nThe 64-bit consideration:\n\nWith base b ≈ 100 and modulus q ≈ 10^9:\n- Maximum intermediate value before mod: (q-1) × b ≈ 10^11\n- This fits comfortably in 64 bits (max ≈ 10^19)\n\nIf you need larger moduli, you'll need 128-bit integers or careful handling.

Double Hashing for Safety

In competitive programming or security-sensitive applications, use two independent hash functions with different (b₁,q₁) and (b₂,q₂). Two strings match only if BOTH hashes match. Collision probability becomes ~1/(q₁×q₂) ≈ 10^-18 — essentially zero for practical purposes.

Complete Rolling Hash Implementation

Now let's build a complete, production-ready rolling hash implementation that handles all edge cases correctly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
class RollingHash:
    """
    Production-ready rolling hash implementation.
    
    Supports O(1) window sliding for efficient string matching.
    Uses polynomial hashing with configurable base and modulus.
    """
    
    def __init__(self, s: str, window_size: int, base: int = 31, mod: int = 10**9 + 7):
        """
        Initialize rolling hash for string s with given window size.
        
        Args:
            s: The string to hash
            window_size: Size of the sliding window (m)
            base: Hash base (should be prime > max character value)
            mod: Hash modulus (should be large prime)
        """
        self.s = s
        self.n = len(s)
        self.m = window_size
        self.base = base
        self.mod = mod
        
        # Precompute base^(m-1) mod q for rolling updates
        self.base_power = pow(base, window_size - 1, mod)
        
        # Compute initial hash for first window
        self.current_hash = 0
        for i in range(min(window_size, self.n)):
            self.current_hash = (self.current_hash * base + ord(s[i])) % mod
        
        # Track current window position
        self.window_start = 0
    
    def get_hash(self) -> int:
        """Return hash of current window."""
        return self.current_hash
    
    def slide(self) -> bool:
        """
        Slide window one position to the right.
        
        Returns:
            True if slide was successful, False if at end of string
        """
        if self.window_start + self.m >= self.n:
            return False  # Cannot slide further
        
        # Character leaving the window (leftmost)
        outgoing = ord(self.s[self.window_start])
        
        # Character entering the window (rightmost)
        incoming = ord(self.s[self.window_start + self.m])
        
        # Rolling update formula with safe modular arithmetic
        # Step 1: Remove outgoing character's contribution
        self.current_hash = (self.current_hash - outgoing * self.base_power) % self.mod
        
        # Step 2: Python handles negative mod correctly, but be explicit
        if self.current_hash < 0:
            self.current_hash += self.mod
        
        # Step 3: Shift left (multiply by base) and add incoming
        self.current_hash = (self.current_hash * self.base + incoming) % self.mod
        
        self.window_start += 1
        return True
    
    def get_window(self) -> str:
        """Return current window substring (for debugging)."""
        return self.s[self.window_start:self.window_start + self.m]
 
 
# Example usage
text = "ABCABCABC"
window_size = 3
 
rh = RollingHash(text, window_size)
 
print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
while rh.slide():
    print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
 
# Output:
# Window: ABC, Hash: 66214
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214  <- Same hash for same substring!
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214

Visualizing Rolling Updates

Let's trace through a concrete example to see exactly how rolling updates work step by step.\n\nSetup:\n- Text: "ABCDE"\n- Window size: 3\n- Base: 10 (simplified for clarity)\n- Modulus: 1000 (simplified)\n- Character values: A=1, B=2, C=3, D=4, E=5\n\nInitial hash computation (window "ABC"):\n\n\nH₀ = 1×10² + 2×10¹ + 3×10⁰\n = 100 + 20 + 3\n = 123\n\n\nFirst roll: "ABC" → "BCD"\n\n\n Step 1: Remove 'A' contribution\n H - A×10² = 123 - 1×100 = 23\n \n Step 2: Shift left (multiply by base)\n 23 × 10 = 230\n \n Step 3: Add 'D' contribution\n 230 + 4 = 234\n \nH₁ = 234\n\n\nVerification: Direct computation of "BCD":\n\nH("BCD") = 2×10² + 3×10¹ + 4×10⁰ = 200 + 30 + 4 = 234 ✓\n

Second roll: "BCD" → "CDE"\n\n\n Step 1: Remove 'B' contribution\n H - B×10² = 234 - 2×100 = 34\n \n Step 2: Shift left\n 34 × 10 = 340\n \n Step 3: Add 'E' contribution\n 340 + 5 = 345\n \nH₂ = 345\n\n\nVerification: Direct computation of "CDE":\n\nH("CDE") = 3×10² + 4×10¹ + 5×10⁰ = 300 + 40 + 5 = 345 ✓\n\n\nKey observation: Each roll required exactly 3 arithmetic operations (subtract, multiply, add) regardless of window size. If the window were 1000 characters, the update would still take exactly 3 operations.

Rolling Hash Progression Through Text "ABCDE" (window size 3)
Position	Window	Outgoing	Incoming	Hash Computation	Result
0	ABC	—	—	1×100 + 2×10 + 3×1	123
1	BCD	A	D	(123 - 1×100)×10 + 4	234
2	CDE	B	E	(234 - 2×100)×10 + 5	345

The Decimal Number Analogy

With base 10, you can literally read the hash as a number with digits being character values. Hash 234 for 'BCD' shows B=2 in hundreds place, C=3 in tens place, D=4 in ones place. With larger bases like 31, the interpretation is similar but the digits don't align with decimal reading.

Edge Cases and Robustness

A production-ready rolling hash must handle various edge cases correctly. Let's examine them:\n\nEdge Case 1: Window equals string length\n\nWhen m = n, there's only one window position. The rolling hash should:\n- Compute the initial hash\n- Return false on first slide attempt\n\nEdge Case 2: Empty string or zero-size window\n\nThese should be handled gracefully (return empty results or raise sensible errors).\n\nEdge Case 3: Single-character window\n\nWhen m = 1:\n- base_power = base^0 = 1\n- Hash of single character = ord(char) mod q\n- Rolling update: new_hash = incoming (outgoing was the entire previous hash)

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def test_edge_cases():
    """Test rolling hash edge cases."""
    
    # Edge case 1: Window equals string
    text = "ABC"
    rh = RollingHash(text, 3)
    assert rh.get_hash() == polynomial_hash("ABC")
    assert rh.slide() == False  # Can't slide, at end
    print("✓ Edge case 1: Window equals string")
    
    # Edge case 2: Single character window
    text = "ABCDE"
    rh = RollingHash(text, 1)
    hashes = [rh.get_hash()]
    while rh.slide():
        hashes.append(rh.get_hash())
    # Each hash should be the individual character's hash
    expected = [polynomial_hash(c) for c in "ABCDE"]
    assert hashes == expected
    print("✓ Edge case 2: Single character window")
    
    # Edge case 3: Repeated patterns
    text = "ABCABCABC"
    rh = RollingHash(text, 3)
    pattern_hash = rh.get_hash()  # Hash of "ABC"
    matches = [0]
    pos = 1
    while rh.slide():
        if rh.get_hash() == pattern_hash:
            matches.append(pos)
        pos += 1
    assert matches == [0, 3, 6]  # "ABC" appears at these positions
    print("✓ Edge case 3: Repeated patterns detected")
    
    # Edge case 4: Characters with high values (testing negative mod)
    text = "ZZZ" + "AAA"  # Z has high ord value
    rh = RollingHash(text, 3)
    initial_hash = rh.get_hash()
    rh.slide()
    # Should not raise errors or produce negative hashes
    assert rh.get_hash() >= 0
    print("✓ Edge case 4: High-value characters handled")
 
test_edge_cases()

Integer Overflow Considerations

In languages with fixed-size integers (C++, Java), check for overflow. The intermediate value (hash - outgoing × base_power) × base can approach q × base ≈ 10^10 with typical parameters. Use 64-bit integers (long long in C++, long in Java) for safety. Python's arbitrary-precision integers handle this automatically.

Complexity Analysis

Let's rigorously analyze the time and space complexity of the rolling hash technique.\n\nTime Complexity:\n\n1. Initialization:\n - Computing base^(m-1): O(log m) using fast exponentiation, or O(m) iteratively\n - Computing initial hash: O(m) for processing m characters\n - Total initialization: O(m)\n\n2. Per-slide operation:\n - Subtraction: O(1)\n - Multiplication: O(1)\n - Addition: O(1)\n - Modulo: O(1)\n - Total per slide: O(1)\n\n3. Processing all positions:\n - n - m + 1 windows total\n - Each window (after first) updated in O(1)\n - Total: O(n - m + 1) = O(n)

Rolling Hash Complexity Summary
Operation	Time Complexity	Notes
Initialize	O(m)	One-time cost to set up first window
Single slide	O(1)	Constant time regardless of window size
All n-m+1 positions	O(n)	After initialization
Complete algorithm	O(n + m)	Which is O(n) when m ≤ n

Space Complexity:\n\n- String storage: O(n) (assumed already provided)\n- Hash value: O(1)\n- Base power: O(1)\n- Position tracking: O(1)\n- Total auxiliary space: O(1)\n\nComparison with other approaches:\n\n| Approach | Time to hash all windows | Space |\n|----------|-------------------------|-------|\n| Naive (recompute each) | O(nm) | O(1) |\n| Rolling hash | O(n) | O(1) |\n| Precompute all hashes | O(n) | O(n) |\n\nRolling hash achieves optimal time with minimal space.

The Power of Amortization

Even though each individual hash involves O(m) 'work' conceptually, the rolling technique amortizes this across all positions. Each character in the text contributes to exactly O(1) hash updates (entering once, leaving once), giving total O(n) work.

Summary: Mastering Rolling Hash

We've covered the rolling hash technique in complete detail. Let's consolidate the key points:

Key Takeaways

•The rolling update formula: H_{i+1} = (H_i - out × b^{m-1}) × b + in (mod q)
•O(1) updates enable O(n) total hashing — This is what makes Rabin-Karp efficient.
•Modular arithmetic requires care — Handle negative intermediate values correctly.
•Base should exceed alphabet size — Typically 31-37 for English, 131-257 for ASCII.
•Modulus should be large prime — 10^9 + 7 is standard, double hashing for extra safety.
•Precompute b^{m-1} — Avoids repeated exponentiation during rolling.
•Test edge cases thoroughly — Single-character windows, high-value characters, repeated patterns.

What's next:\n\nThe rolling hash is the engine of Rabin-Karp, but hashing brings an unavoidable issue: collisions. Two different strings can produce the same hash, leading to false positives. In the next page, we'll explore how Rabin-Karp handles collisions, the impact on complexity, and strategies for minimizing their occurrence.

Page Complete

You now have a complete understanding of the rolling hash technique—the mathematical foundation, implementation details, edge cases, and complexity analysis. This is the core machinery that powers efficient hash-based string matching.