Data Structures & AlgorithmsRabin-Karp Algorithm

Rabin-Karp Algorithm — Rolling Hash

LevelAdvanced

Duration75 mins

TopicRabin-Karp Algorithm

2 / 5

Rolling Hash Technique

The Art of Incremental Computation

In the previous page, we established that polynomial hashing can transform string comparison from O(m) character-by-character to O(1) integer comparison. But we also identified a critical challenge: if computing each window's hash costs O(m), we lose all the efficiency gains.

The rolling hash technique solves this beautifully. It exploits the mathematical structure of polynomial hashing to update hash values incrementally as the window slides—computing each new hash in O(1) time based on the previous one, rather than recomputing from scratch.

This page provides a complete, rigorous treatment of rolling hash mechanics, including the mathematical derivation, modular arithmetic subtleties that trip up many implementations, and production-ready code patterns.

What You Will Learn

By the end of this page, you will: (1) understand the precise mathematical formula for rolling hash updates, (2) master handling negative values in modular arithmetic, (3) see the complete derivation of the O(1) update formula, (4) implement a bulletproof rolling hash, and (5) understand parameter selection for base and modulus.

The Mathematical Foundation

Let's establish the precise mathematical framework for rolling hash. We'll use notation that makes the derivation crystal clear.

Definitions:

Text: T = t₀t₁t₂...t_{n-1} (length n)
Pattern: P = p₀p₁p₂...p_{m-1} (length m)
Base: b (a prime number)
Modulus: q (a large prime number)
Character value: v(c) = ord(c) for character c

Polynomial hash of a string S = s₀s₁...s_{m-1}:

H(S) = Σᵢ₌₀^{m-1} [v(sᵢ) × b^{m-1-i}] mod q
     = v(s₀)×b^{m-1} + v(s₁)×b^{m-2} + ... + v(s_{m-1})×b⁰ mod q

This treats the string as a number in base b. The leftmost character has the highest weight.

The window hash notation:

Let H_i denote the hash of the substring T[i:i+m]:

H_i = H(T[i:i+m]) = Σⱼ₌₀^{m-1} [v(t_{i+j}) × b^{m-1-j}] mod q

Expanding for a concrete example with m = 5:

H_i = v(t_i)×b⁴ + v(t_{i+1})×b³ + v(t_{i+2})×b² + v(t_{i+3})×b¹ + v(t_{i+4})×b⁰

Similarly for the next window:

H_{i+1} = v(t_{i+1})×b⁴ + v(t_{i+2})×b³ + v(t_{i+3})×b² + v(t_{i+4})×b¹ + v(t_{i+5})×b⁰

Crucial observation: Both expressions share four terms. The structure of polynomial hashing allows us to relate H_{i+1} to H_i algebraically.

Why Polynomial Structure Enables Rolling

The key insight is that multiplying by b is equivalent to 'shifting left' in the polynomial representation. This is exactly like how multiplying a decimal number by 10 shifts all digits left. This structural property enables the O(1) update formula.

Deriving the Rolling Update Formula

Now let's derive the rolling update formula step by step. We want to express H_{i+1} in terms of H_i.

Starting point:

H_i = v(t_i)×b^{m-1} + v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰

Step 1: Remove the outgoing character

The character t_i is leaving the window. Its contribution is v(t_i)×b^{m-1}. We subtract it:

H_i - v(t_i)×b^{m-1} = v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰

Step 2: Shift all terms left (multiply by b)

Each remaining character needs its exponent increased by 1 (effectively shifting left in the window):

(H_i - v(t_i)×b^{m-1}) × b = v(t_{i+1})×b^{m-1} + v(t_{i+2})×b^{m-2} + ... + v(t_{i+m-1})×b¹

Step 3: Add the incoming character

The character t_{i+m} enters the window at the rightmost position (power b⁰):

(H_i - v(t_i)×b^{m-1}) × b + v(t_{i+m}) = H_{i+1}

The rolling hash update formula:

H_{i+1} = (H_i - v(outgoing) × b^{m-1}) × b + v(incoming)   (mod q)

Where:

outgoing = t_i (character leaving the left side of window)
incoming = t_{i+m} (character entering the right side of window)
b^{m-1} is precomputed once at the start

Operation count:

1 multiplication: v(outgoing) × b^{m-1}
1 subtraction: H_i - ...
1 multiplication: ... × b
1 addition: ... + v(incoming)
1 modulo: ... mod q

All O(1) operations, regardless of window size m!

Compare to Direct Computation

Computing H_{i+1} from scratch would require m multiplications and m-1 additions for a total of O(m) operations. The rolling formula reduces this to exactly 5 operations, making the update truly O(1). For a pattern of length 1000, this is a 200× improvement per window position.

Modular Arithmetic Deep Dive

The rolling hash formula involves modular arithmetic, which introduces subtleties that cause many implementations to fail. Let's address them thoroughly.

Why we need modular arithmetic:

Without the mod operation, hash values would grow exponentially:

For a 100-character string with base 31:
Maximum hash ≈ 127 × 31^99 ≈ 10^149

This vastly exceeds what any integer type can hold (64-bit integers max out around 10^19). The modulus keeps values bounded.

The negativity problem:

Consider the subtraction in our formula: H_i - v(outgoing) × b^{m-1}

Even though H_i was computed mod q, the term v(outgoing) × b^{m-1} might be larger than the result, producing a negative value. Different programming languages handle negative modulo differently:

Python:  (-5) % 3 = 1   (always positive)
C/Java:  (-5) % 3 = -2  (sign of dividend)
JS:      (-5) % 3 = -2  (sign of dividend)

1
2
3
4
5
6
7
8
9
10
11
# Python handles negative modulo correctly
print((-5) % 3)  # Output: 1 (always in range [0, q-1])
 
# But explicit handling is still good practice for clarity
def safe_mod(x: int, q: int) -> int:
    """Ensure modulo result is always non-negative."""
    return ((x % q) + q) % q
 
# Example showing the difference
print(safe_mod(-5, 3))   # Output: 1
print(safe_mod(-100, 7)) # Output: 5

The corrected rolling update formula:

H_{i+1} = ((H_i - v(outgoing) × b^{m-1} % q) × b + v(incoming)) % q
         ↑                        ↑                           ↑
      Make non-negative     Main computation          Final reduction

In practice, a robust implementation ensures non-negativity at each step:

temp = (H_i - v(outgoing) * power_of_base) % q
temp = (temp + q) % q  # Handle negative result
H_next = (temp * base + v(incoming)) % q

Common Bug Alert

Failing to handle negative intermediate values is the #1 bug in rolling hash implementations. Always test your implementation with cases where the outgoing character has a higher contribution than the current hash value. For example, test with text 'ZA...' where Z's contribution (high value) is subtracted early.

Choosing Base and Modulus

The choice of base b and modulus q significantly affects collision probability and numerical stability. Let's analyze the considerations.

Choosing the base b:

The base should be:

Larger than the alphabet size — Ensures distinct characters produce distinct contributions
Prime — Reduces collision patterns from multiplicative relationships
Not too large — Prevents overflow during intermediate computations

Common choices:

For ASCII (128 characters): b = 131 or 257
For lowercase English (26 characters): b = 31 or 37
For Unicode (65536+ characters): b = 65537

The value 31 is popular because it's prime, works well for lowercase English, and x * 31 = (x << 5) - x can be computed efficiently.

Common Base and Modulus Choices
Use Case	Base (b)	Modulus (q)	Collision Probability
General ASCII	131	10^9 + 7	~1 / 10^9
Lowercase English	31	10^9 + 7	~1 / 10^9
High precision	10007	10^18 + 9	~1 / 10^18
Competition (safe)	31 + 37 (double hash)	(10^9+7, 10^9+9)	~1 / 10^18

Choosing the modulus q:

The modulus should be:

Prime — Ensures uniform distribution of hash values
Large — Collision probability ≈ 1/q, larger is better
Safe for multiplication — q² should not overflow your integer type

Common choices:

10^9 + 7 (fits in 32 bits, q² fits in 64 bits)
10^9 + 9 (another prime, often used with 10^9 + 7 for double hashing)
2^61 - 1 (Mersenne prime, very efficient mod operation)
10^18 + 9 (for 128-bit integers or big integer libraries)

The 64-bit consideration:

With base b ≈ 100 and modulus q ≈ 10^9:

Maximum intermediate value before mod: (q-1) × b ≈ 10^11
This fits comfortably in 64 bits (max ≈ 10^19)

If you need larger moduli, you'll need 128-bit integers or careful handling.

Double Hashing for Safety

In competitive programming or security-sensitive applications, use two independent hash functions with different (b₁,q₁) and (b₂,q₂). Two strings match only if BOTH hashes match. Collision probability becomes ~1/(q₁×q₂) ≈ 10^-18 — essentially zero for practical purposes.

Complete Rolling Hash Implementation

Now let's build a complete, production-ready rolling hash implementation that handles all edge cases correctly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
class RollingHash:
    """
    Production-ready rolling hash implementation.
    
    Supports O(1) window sliding for efficient string matching.
    Uses polynomial hashing with configurable base and modulus.
    """
    
    def __init__(self, s: str, window_size: int, base: int = 31, mod: int = 10**9 + 7):
        """
        Initialize rolling hash for string s with given window size.
        
        Args:
            s: The string to hash
            window_size: Size of the sliding window (m)
            base: Hash base (should be prime > max character value)
            mod: Hash modulus (should be large prime)
        """
        self.s = s
        self.n = len(s)
        self.m = window_size
        self.base = base
        self.mod = mod
        
        # Precompute base^(m-1) mod q for rolling updates
        self.base_power = pow(base, window_size - 1, mod)
        
        # Compute initial hash for first window
        self.current_hash = 0
        for i in range(min(window_size, self.n)):
            self.current_hash = (self.current_hash * base + ord(s[i])) % mod
        
        # Track current window position
        self.window_start = 0
    
    def get_hash(self) -> int:
        """Return hash of current window."""
        return self.current_hash
    
    def slide(self) -> bool:
        """
        Slide window one position to the right.
        
        Returns:
            True if slide was successful, False if at end of string
        """
        if self.window_start + self.m >= self.n:
            return False  # Cannot slide further
        
        # Character leaving the window (leftmost)
        outgoing = ord(self.s[self.window_start])
        
        # Character entering the window (rightmost)
        incoming = ord(self.s[self.window_start + self.m])
        
        # Rolling update formula with safe modular arithmetic
        # Step 1: Remove outgoing character's contribution
        self.current_hash = (self.current_hash - outgoing * self.base_power) % self.mod
        
        # Step 2: Python handles negative mod correctly, but be explicit
        if self.current_hash < 0:
            self.current_hash += self.mod
        
        # Step 3: Shift left (multiply by base) and add incoming
        self.current_hash = (self.current_hash * self.base + incoming) % self.mod
        
        self.window_start += 1
        return True
    
    def get_window(self) -> str:
        """Return current window substring (for debugging)."""
        return self.s[self.window_start:self.window_start + self.m]
 
 
# Example usage
text = "ABCABCABC"
window_size = 3
 
rh = RollingHash(text, window_size)
 
print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
while rh.slide():
    print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
 
# Output:
# Window: ABC, Hash: 66214
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214  <- Same hash for same substring!
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214

Visualizing Rolling Updates

Let's trace through a concrete example to see exactly how rolling updates work step by step.

Setup:

Text: "ABCDE"
Window size: 3
Base: 10 (simplified for clarity)
Modulus: 1000 (simplified)
Character values: A=1, B=2, C=3, D=4, E=5

Initial hash computation (window "ABC"):

H₀ = 1×10² + 2×10¹ + 3×10⁰
   = 100 + 20 + 3
   = 123

First roll: "ABC" → "BCD"

      Step 1: Remove 'A' contribution
      H - A×10² = 123 - 1×100 = 23
      
      Step 2: Shift left (multiply by base)
      23 × 10 = 230
      
      Step 3: Add 'D' contribution
      230 + 4 = 234
      
H₁ = 234

Verification: Direct computation of "BCD":

H("BCD") = 2×10² + 3×10¹ + 4×10⁰ = 200 + 30 + 4 = 234 ✓

Second roll: "BCD" → "CDE"

      Step 1: Remove 'B' contribution
      H - B×10² = 234 - 2×100 = 34
      
      Step 2: Shift left
      34 × 10 = 340
      
      Step 3: Add 'E' contribution
      340 + 5 = 345
      
H₂ = 345

Verification: Direct computation of "CDE":

H("CDE") = 3×10² + 4×10¹ + 5×10⁰ = 300 + 40 + 5 = 345 ✓

Key observation: Each roll required exactly 3 arithmetic operations (subtract, multiply, add) regardless of window size. If the window were 1000 characters, the update would still take exactly 3 operations.

Rolling Hash Progression Through Text "ABCDE" (window size 3)
Position	Window	Outgoing	Incoming	Hash Computation	Result
0	ABC	—	—	1×100 + 2×10 + 3×1	123
1	BCD	A	D	(123 - 1×100)×10 + 4	234
2	CDE	B	E	(234 - 2×100)×10 + 5	345

The Decimal Number Analogy

With base 10, you can literally read the hash as a number with digits being character values. Hash 234 for 'BCD' shows B=2 in hundreds place, C=3 in tens place, D=4 in ones place. With larger bases like 31, the interpretation is similar but the digits don't align with decimal reading.

Edge Cases and Robustness

A production-ready rolling hash must handle various edge cases correctly. Let's examine them:

Edge Case 1: Window equals string length

When m = n, there's only one window position. The rolling hash should:

Compute the initial hash
Return false on first slide attempt

Edge Case 2: Empty string or zero-size window

These should be handled gracefully (return empty results or raise sensible errors).

Edge Case 3: Single-character window

When m = 1:

base_power = base^0 = 1
Hash of single character = ord(char) mod q
Rolling update: new_hash = incoming (outgoing was the entire previous hash)

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def test_edge_cases():
    """Test rolling hash edge cases."""
    
    # Edge case 1: Window equals string
    text = "ABC"
    rh = RollingHash(text, 3)
    assert rh.get_hash() == polynomial_hash("ABC")
    assert rh.slide() == False  # Can't slide, at end
    print("✓ Edge case 1: Window equals string")
    
    # Edge case 2: Single character window
    text = "ABCDE"
    rh = RollingHash(text, 1)
    hashes = [rh.get_hash()]
    while rh.slide():
        hashes.append(rh.get_hash())
    # Each hash should be the individual character's hash
    expected = [polynomial_hash(c) for c in "ABCDE"]
    assert hashes == expected
    print("✓ Edge case 2: Single character window")
    
    # Edge case 3: Repeated patterns
    text = "ABCABCABC"
    rh = RollingHash(text, 3)
    pattern_hash = rh.get_hash()  # Hash of "ABC"
    matches = [0]
    pos = 1
    while rh.slide():
        if rh.get_hash() == pattern_hash:
            matches.append(pos)
        pos += 1
    assert matches == [0, 3, 6]  # "ABC" appears at these positions
    print("✓ Edge case 3: Repeated patterns detected")
    
    # Edge case 4: Characters with high values (testing negative mod)
    text = "ZZZ" + "AAA"  # Z has high ord value
    rh = RollingHash(text, 3)
    initial_hash = rh.get_hash()
    rh.slide()
    # Should not raise errors or produce negative hashes
    assert rh.get_hash() >= 0
    print("✓ Edge case 4: High-value characters handled")
 
test_edge_cases()

Integer Overflow Considerations

In languages with fixed-size integers (C++, Java), check for overflow. The intermediate value (hash - outgoing × base_power) × base can approach q × base ≈ 10^10 with typical parameters. Use 64-bit integers (long long in C++, long in Java) for safety. Python's arbitrary-precision integers handle this automatically.

Complexity Analysis

Let's rigorously analyze the time and space complexity of the rolling hash technique.

Time Complexity:

Initialization:
- Computing base^(m-1): O(log m) using fast exponentiation, or O(m) iteratively
- Computing initial hash: O(m) for processing m characters
- Total initialization: O(m)
Per-slide operation:
- Subtraction: O(1)
- Multiplication: O(1)
- Addition: O(1)
- Modulo: O(1)
- Total per slide: O(1)
Processing all positions:
- n - m + 1 windows total
- Each window (after first) updated in O(1)
- Total: O(n - m + 1) = O(n)

Rolling Hash Complexity Summary
Operation	Time Complexity	Notes
Initialize	O(m)	One-time cost to set up first window
Single slide	O(1)	Constant time regardless of window size
All n-m+1 positions	O(n)	After initialization
Complete algorithm	O(n + m)	Which is O(n) when m ≤ n

Space Complexity:

String storage: O(n) (assumed already provided)
Hash value: O(1)
Base power: O(1)
Position tracking: O(1)
Total auxiliary space: O(1)

Comparison with other approaches:

Approach	Time to hash all windows	Space
Naive (recompute each)	O(nm)	O(1)
Rolling hash	O(n)	O(1)
Precompute all hashes	O(n)	O(n)

Rolling hash achieves optimal time with minimal space.

The Power of Amortization

Even though each individual hash involves O(m) 'work' conceptually, the rolling technique amortizes this across all positions. Each character in the text contributes to exactly O(1) hash updates (entering once, leaving once), giving total O(n) work.

Summary: Mastering Rolling Hash

We've covered the rolling hash technique in complete detail. Let's consolidate the key points:

Key Takeaways

•The rolling update formula: H_{i+1} = (H_i - out × b^{m-1}) × b + in (mod q)
•O(1) updates enable O(n) total hashing — This is what makes Rabin-Karp efficient.
•Modular arithmetic requires care — Handle negative intermediate values correctly.
•Base should exceed alphabet size — Typically 31-37 for English, 131-257 for ASCII.
•Modulus should be large prime — 10^9 + 7 is standard, double hashing for extra safety.
•Precompute b^{m-1} — Avoids repeated exponentiation during rolling.
•Test edge cases thoroughly — Single-character windows, high-value characters, repeated patterns.

What's next:

The rolling hash is the engine of Rabin-Karp, but hashing brings an unavoidable issue: collisions. Two different strings can produce the same hash, leading to false positives. In the next page, we'll explore how Rabin-Karp handles collisions, the impact on complexity, and strategies for minimizing their occurrence.

Page Complete

You now have a complete understanding of the rolling hash technique—the mathematical foundation, implementation details, edge cases, and complexity analysis. This is the core machinery that powers efficient hash-based string matching.

2 / 5

Loading learning content...

Data Structures & AlgorithmsRabin-Karp Algorithm

Rabin-Karp Algorithm — Rolling Hash

LevelAdvanced

Duration75 mins

TopicRabin-Karp Algorithm

2 / 5

Rolling Hash Technique

The Art of Incremental Computation

What You Will Learn

The Mathematical Foundation

Let's establish the precise mathematical framework for rolling hash. We'll use notation that makes the derivation crystal clear.

Definitions:

Text: T = t₀t₁t₂...t_{n-1} (length n)
Pattern: P = p₀p₁p₂...p_{m-1} (length m)
Base: b (a prime number)
Modulus: q (a large prime number)
Character value: v(c) = ord(c) for character c

Polynomial hash of a string S = s₀s₁...s_{m-1}:

H(S) = Σᵢ₌₀^{m-1} [v(sᵢ) × b^{m-1-i}] mod q
     = v(s₀)×b^{m-1} + v(s₁)×b^{m-2} + ... + v(s_{m-1})×b⁰ mod q

This treats the string as a number in base b. The leftmost character has the highest weight.

The window hash notation:

Let H_i denote the hash of the substring T[i:i+m]:

H_i = H(T[i:i+m]) = Σⱼ₌₀^{m-1} [v(t_{i+j}) × b^{m-1-j}] mod q

Expanding for a concrete example with m = 5:

H_i = v(t_i)×b⁴ + v(t_{i+1})×b³ + v(t_{i+2})×b² + v(t_{i+3})×b¹ + v(t_{i+4})×b⁰

Similarly for the next window:

H_{i+1} = v(t_{i+1})×b⁴ + v(t_{i+2})×b³ + v(t_{i+3})×b² + v(t_{i+4})×b¹ + v(t_{i+5})×b⁰

Crucial observation: Both expressions share four terms. The structure of polynomial hashing allows us to relate H_{i+1} to H_i algebraically.

Why Polynomial Structure Enables Rolling

Deriving the Rolling Update Formula

Now let's derive the rolling update formula step by step. We want to express H_{i+1} in terms of H_i.

Starting point:

H_i = v(t_i)×b^{m-1} + v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰

Step 1: Remove the outgoing character

The character t_i is leaving the window. Its contribution is v(t_i)×b^{m-1}. We subtract it:

H_i - v(t_i)×b^{m-1} = v(t_{i+1})×b^{m-2} + v(t_{i+2})×b^{m-3} + ... + v(t_{i+m-1})×b⁰

Step 2: Shift all terms left (multiply by b)

Each remaining character needs its exponent increased by 1 (effectively shifting left in the window):

(H_i - v(t_i)×b^{m-1}) × b = v(t_{i+1})×b^{m-1} + v(t_{i+2})×b^{m-2} + ... + v(t_{i+m-1})×b¹

Step 3: Add the incoming character

The character t_{i+m} enters the window at the rightmost position (power b⁰):

(H_i - v(t_i)×b^{m-1}) × b + v(t_{i+m}) = H_{i+1}

The rolling hash update formula:

H_{i+1} = (H_i - v(outgoing) × b^{m-1}) × b + v(incoming)   (mod q)

Where:

outgoing = t_i (character leaving the left side of window)
incoming = t_{i+m} (character entering the right side of window)
b^{m-1} is precomputed once at the start

Operation count:

1 multiplication: v(outgoing) × b^{m-1}
1 subtraction: H_i - ...
1 multiplication: ... × b
1 addition: ... + v(incoming)
1 modulo: ... mod q

All O(1) operations, regardless of window size m!

Compare to Direct Computation

Modular Arithmetic Deep Dive

The rolling hash formula involves modular arithmetic, which introduces subtleties that cause many implementations to fail. Let's address them thoroughly.

Why we need modular arithmetic:

Without the mod operation, hash values would grow exponentially:

For a 100-character string with base 31:
Maximum hash ≈ 127 × 31^99 ≈ 10^149

This vastly exceeds what any integer type can hold (64-bit integers max out around 10^19). The modulus keeps values bounded.

The negativity problem:

Consider the subtraction in our formula: H_i - v(outgoing) × b^{m-1}

Even though H_i was computed mod q, the term v(outgoing) × b^{m-1} might be larger than the result, producing a negative value. Different programming languages handle negative modulo differently:

Python:  (-5) % 3 = 1   (always positive)
C/Java:  (-5) % 3 = -2  (sign of dividend)
JS:      (-5) % 3 = -2  (sign of dividend)

1
2
3
4
5
6
7
8
9
10
11
# Python handles negative modulo correctly
print((-5) % 3)  # Output: 1 (always in range [0, q-1])
 
# But explicit handling is still good practice for clarity
def safe_mod(x: int, q: int) -> int:
    """Ensure modulo result is always non-negative."""
    return ((x % q) + q) % q
 
# Example showing the difference
print(safe_mod(-5, 3))   # Output: 1
print(safe_mod(-100, 7)) # Output: 5

The corrected rolling update formula:

H_{i+1} = ((H_i - v(outgoing) × b^{m-1} % q) × b + v(incoming)) % q
         ↑                        ↑                           ↑
      Make non-negative     Main computation          Final reduction

In practice, a robust implementation ensures non-negativity at each step:

temp = (H_i - v(outgoing) * power_of_base) % q
temp = (temp + q) % q  # Handle negative result
H_next = (temp * base + v(incoming)) % q

Common Bug Alert

Choosing Base and Modulus

The choice of base b and modulus q significantly affects collision probability and numerical stability. Let's analyze the considerations.

Choosing the base b:

The base should be:

Larger than the alphabet size — Ensures distinct characters produce distinct contributions
Prime — Reduces collision patterns from multiplicative relationships
Not too large — Prevents overflow during intermediate computations

Common choices:

For ASCII (128 characters): b = 131 or 257
For lowercase English (26 characters): b = 31 or 37
For Unicode (65536+ characters): b = 65537

The value 31 is popular because it's prime, works well for lowercase English, and x * 31 = (x << 5) - x can be computed efficiently.

Common Base and Modulus Choices
Use Case	Base (b)	Modulus (q)	Collision Probability
General ASCII	131	10^9 + 7	~1 / 10^9
Lowercase English	31	10^9 + 7	~1 / 10^9
High precision	10007	10^18 + 9	~1 / 10^18
Competition (safe)	31 + 37 (double hash)	(10^9+7, 10^9+9)	~1 / 10^18

Choosing the modulus q:

The modulus should be:

Prime — Ensures uniform distribution of hash values
Large — Collision probability ≈ 1/q, larger is better
Safe for multiplication — q² should not overflow your integer type

Common choices:

10^9 + 7 (fits in 32 bits, q² fits in 64 bits)
10^9 + 9 (another prime, often used with 10^9 + 7 for double hashing)
2^61 - 1 (Mersenne prime, very efficient mod operation)
10^18 + 9 (for 128-bit integers or big integer libraries)

The 64-bit consideration:

With base b ≈ 100 and modulus q ≈ 10^9:

Maximum intermediate value before mod: (q-1) × b ≈ 10^11
This fits comfortably in 64 bits (max ≈ 10^19)

If you need larger moduli, you'll need 128-bit integers or careful handling.

Double Hashing for Safety

Complete Rolling Hash Implementation

Now let's build a complete, production-ready rolling hash implementation that handles all edge cases correctly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
class RollingHash:
    """
    Production-ready rolling hash implementation.
    
    Supports O(1) window sliding for efficient string matching.
    Uses polynomial hashing with configurable base and modulus.
    """
    
    def __init__(self, s: str, window_size: int, base: int = 31, mod: int = 10**9 + 7):
        """
        Initialize rolling hash for string s with given window size.
        
        Args:
            s: The string to hash
            window_size: Size of the sliding window (m)
            base: Hash base (should be prime > max character value)
            mod: Hash modulus (should be large prime)
        """
        self.s = s
        self.n = len(s)
        self.m = window_size
        self.base = base
        self.mod = mod
        
        # Precompute base^(m-1) mod q for rolling updates
        self.base_power = pow(base, window_size - 1, mod)
        
        # Compute initial hash for first window
        self.current_hash = 0
        for i in range(min(window_size, self.n)):
            self.current_hash = (self.current_hash * base + ord(s[i])) % mod
        
        # Track current window position
        self.window_start = 0
    
    def get_hash(self) -> int:
        """Return hash of current window."""
        return self.current_hash
    
    def slide(self) -> bool:
        """
        Slide window one position to the right.
        
        Returns:
            True if slide was successful, False if at end of string
        """
        if self.window_start + self.m >= self.n:
            return False  # Cannot slide further
        
        # Character leaving the window (leftmost)
        outgoing = ord(self.s[self.window_start])
        
        # Character entering the window (rightmost)
        incoming = ord(self.s[self.window_start + self.m])
        
        # Rolling update formula with safe modular arithmetic
        # Step 1: Remove outgoing character's contribution
        self.current_hash = (self.current_hash - outgoing * self.base_power) % self.mod
        
        # Step 2: Python handles negative mod correctly, but be explicit
        if self.current_hash < 0:
            self.current_hash += self.mod
        
        # Step 3: Shift left (multiply by base) and add incoming
        self.current_hash = (self.current_hash * self.base + incoming) % self.mod
        
        self.window_start += 1
        return True
    
    def get_window(self) -> str:
        """Return current window substring (for debugging)."""
        return self.s[self.window_start:self.window_start + self.m]
 
 
# Example usage
text = "ABCABCABC"
window_size = 3
 
rh = RollingHash(text, window_size)
 
print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
while rh.slide():
    print(f"Window: {rh.get_window()}, Hash: {rh.get_hash()}")
 
# Output:
# Window: ABC, Hash: 66214
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214  <- Same hash for same substring!
# Window: BCA, Hash: 67331
# Window: CAB, Hash: 68417
# Window: ABC, Hash: 66214

Visualizing Rolling Updates

Let's trace through a concrete example to see exactly how rolling updates work step by step.

Setup:

Text: "ABCDE"
Window size: 3
Base: 10 (simplified for clarity)
Modulus: 1000 (simplified)
Character values: A=1, B=2, C=3, D=4, E=5

Initial hash computation (window "ABC"):

H₀ = 1×10² + 2×10¹ + 3×10⁰
   = 100 + 20 + 3
   = 123

First roll: "ABC" → "BCD"

      Step 1: Remove 'A' contribution
      H - A×10² = 123 - 1×100 = 23
      
      Step 2: Shift left (multiply by base)
      23 × 10 = 230
      
      Step 3: Add 'D' contribution
      230 + 4 = 234
      
H₁ = 234

Verification: Direct computation of "BCD":

H("BCD") = 2×10² + 3×10¹ + 4×10⁰ = 200 + 30 + 4 = 234 ✓

Second roll: "BCD" → "CDE"

      Step 1: Remove 'B' contribution
      H - B×10² = 234 - 2×100 = 34
      
      Step 2: Shift left
      34 × 10 = 340
      
      Step 3: Add 'E' contribution
      340 + 5 = 345
      
H₂ = 345

Verification: Direct computation of "CDE":

H("CDE") = 3×10² + 4×10¹ + 5×10⁰ = 300 + 40 + 5 = 345 ✓

Rolling Hash Progression Through Text "ABCDE" (window size 3)
Position	Window	Outgoing	Incoming	Hash Computation	Result
0	ABC	—	—	1×100 + 2×10 + 3×1	123
1	BCD	A	D	(123 - 1×100)×10 + 4	234
2	CDE	B	E	(234 - 2×100)×10 + 5	345

The Decimal Number Analogy

Edge Cases and Robustness

A production-ready rolling hash must handle various edge cases correctly. Let's examine them:

Edge Case 1: Window equals string length

When m = n, there's only one window position. The rolling hash should:

Compute the initial hash
Return false on first slide attempt

Edge Case 2: Empty string or zero-size window

These should be handled gracefully (return empty results or raise sensible errors).

Edge Case 3: Single-character window

When m = 1:

base_power = base^0 = 1
Hash of single character = ord(char) mod q
Rolling update: new_hash = incoming (outgoing was the entire previous hash)

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def test_edge_cases():
    """Test rolling hash edge cases."""
    
    # Edge case 1: Window equals string
    text = "ABC"
    rh = RollingHash(text, 3)
    assert rh.get_hash() == polynomial_hash("ABC")
    assert rh.slide() == False  # Can't slide, at end
    print("✓ Edge case 1: Window equals string")
    
    # Edge case 2: Single character window
    text = "ABCDE"
    rh = RollingHash(text, 1)
    hashes = [rh.get_hash()]
    while rh.slide():
        hashes.append(rh.get_hash())
    # Each hash should be the individual character's hash
    expected = [polynomial_hash(c) for c in "ABCDE"]
    assert hashes == expected
    print("✓ Edge case 2: Single character window")
    
    # Edge case 3: Repeated patterns
    text = "ABCABCABC"
    rh = RollingHash(text, 3)
    pattern_hash = rh.get_hash()  # Hash of "ABC"
    matches = [0]
    pos = 1
    while rh.slide():
        if rh.get_hash() == pattern_hash:
            matches.append(pos)
        pos += 1
    assert matches == [0, 3, 6]  # "ABC" appears at these positions
    print("✓ Edge case 3: Repeated patterns detected")
    
    # Edge case 4: Characters with high values (testing negative mod)
    text = "ZZZ" + "AAA"  # Z has high ord value
    rh = RollingHash(text, 3)
    initial_hash = rh.get_hash()
    rh.slide()
    # Should not raise errors or produce negative hashes
    assert rh.get_hash() >= 0
    print("✓ Edge case 4: High-value characters handled")
 
test_edge_cases()

Integer Overflow Considerations

Complexity Analysis

Let's rigorously analyze the time and space complexity of the rolling hash technique.

Time Complexity:

Initialization:
- Computing base^(m-1): O(log m) using fast exponentiation, or O(m) iteratively
- Computing initial hash: O(m) for processing m characters
- Total initialization: O(m)
Per-slide operation:
- Subtraction: O(1)
- Multiplication: O(1)
- Addition: O(1)
- Modulo: O(1)
- Total per slide: O(1)
Processing all positions:
- n - m + 1 windows total
- Each window (after first) updated in O(1)
- Total: O(n - m + 1) = O(n)

Rolling Hash Complexity Summary
Operation	Time Complexity	Notes
Initialize	O(m)	One-time cost to set up first window
Single slide	O(1)	Constant time regardless of window size
All n-m+1 positions	O(n)	After initialization
Complete algorithm	O(n + m)	Which is O(n) when m ≤ n

Space Complexity:

String storage: O(n) (assumed already provided)
Hash value: O(1)
Base power: O(1)
Position tracking: O(1)
Total auxiliary space: O(1)

Comparison with other approaches:

Approach	Time to hash all windows	Space
Naive (recompute each)	O(nm)	O(1)
Rolling hash	O(n)	O(1)
Precompute all hashes	O(n)	O(n)

Rolling hash achieves optimal time with minimal space.

The Power of Amortization

Summary: Mastering Rolling Hash

We've covered the rolling hash technique in complete detail. Let's consolidate the key points:

Key Takeaways

•The rolling update formula: H_{i+1} = (H_i - out × b^{m-1}) × b + in (mod q)
•O(1) updates enable O(n) total hashing — This is what makes Rabin-Karp efficient.
•Modular arithmetic requires care — Handle negative intermediate values correctly.
•Base should exceed alphabet size — Typically 31-37 for English, 131-257 for ASCII.
•Modulus should be large prime — 10^9 + 7 is standard, double hashing for extra safety.
•Precompute b^{m-1} — Avoids repeated exponentiation during rolling.
•Test edge cases thoroughly — Single-character windows, high-value characters, repeated patterns.

What's next:

Page Complete

2 / 5