Imagine you need to find all occurrences of a pattern within a massive text file—say, searching for a specific DNA sequence in the human genome (approximately 3 billion characters). The naive approach of comparing character-by-character at every position would require trillions of operations. But what if you could represent any string as a single number and compare numbers instead of strings?
This is the fundamental insight behind string hashing: transform strings into numerical fingerprints (hash values) that can be compared in constant time. The magic lies in polynomial rolling hashes, a technique that not only computes these fingerprints efficiently but also allows us to update them incrementally as we slide through a text—turning what could be an O(n × m) operation into an O(n + m) average-case operation.
By the end of this page, you will deeply understand how polynomial hashing works mathematically, why polynomials provide excellent distribution properties, how to compute hash values incrementally in O(1) time per position, and the theoretical foundation that makes rolling hashes so powerful for string algorithms.
Before diving into polynomial hashes specifically, let's establish what we need from a string hash function and why most simple approaches fail.
The Goal:
A hash function H maps strings to integers: H: Σ* → {0, 1, ..., M-1}
where Σ is our alphabet and M is the size of our hash space. We want this function to have several properties:

- **Position-aware:** rearranging characters should change the hash.
- **Low collision rate:** distinct strings should rarely share a hash value.
- **Efficient:** computable in O(n) time for a string of length n.
- **Rolling-friendly:** sliding a fixed-size window by one character should update the hash in O(1).
Consider summing ASCII values: H('abc') = 97 + 98 + 99 = 294. This fails because 'abc' and 'bca' have the same hash (294), and 'cba' also has the same hash. Position information is lost, creating massive collisions. Any hash that ignores character positions will suffer from this fundamental flaw.
The Position Problem:
Two strings are different if they have different characters OR if the same characters appear in different positions. Therefore, our hash function must encode positional information. This is where polynomials enter the picture.
The Polynomial Insight:
Consider representing a string s = s₀s₁s₂...sₙ₋₁ as a polynomial:
H(s) = s₀ · bⁿ⁻¹ + s₁ · bⁿ⁻² + ... + sₙ₋₂ · b¹ + sₙ₋₁ · b⁰
where b is called the base and each sᵢ is the numerical value of the character at position i.
This representation has a beautiful property: two different strings produce the same hash value only when the base b happens to be a root of the polynomial formed by their difference. For a random choice of b, this is extremely unlikely.
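To make the "root of the difference polynomial" idea concrete, here is a tiny illustrative sketch: for the strings "ab" and "ba", the hash H(s, b) = ord(s[0]) · b + ord(s[1]) gives a difference polynomial P(b) = (97 - 98)b + (98 - 97) = -b + 1, which vanishes only at b = 1.

```python
# Brute-force over small bases: "ab" and "ba" collide only where the
# difference polynomial P(b) = -b + 1 vanishes, i.e. at the single root b = 1.
def h(s: str, b: int) -> int:
    value = 0
    for ch in s:
        value = value * b + ord(ch)
    return value

colliding_bases = [b for b in range(100) if h("ab", b) == h("ba", b)]
print(colliding_bases)  # [1]
```

Any sensible base (say, b ≥ 31) avoids this lone root, so the two anagrams hash differently.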
| Approach | Position-Aware | Collision Rate | Rolling Update | Example Issue |
|---|---|---|---|---|
| Sum of ASCII values | No | Very High | Yes | 'abc' = 'bca' = 'cba' |
| XOR of ASCII values | No | Very High | Yes | 'ab' XOR = 'ba' XOR |
| Product of ASCII values | No | High | No | 'ab' × 'ba' same |
| Polynomial hash | Yes | Low (tunable) | Yes | None (probabilistic) |
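The table's failure modes are easy to reproduce. A short sketch (the helper names are ours, chosen for illustration):

```python
# Position-blind hashes assign identical values to anagrams;
# the polynomial hash does not.
def sum_hash(s: str) -> int:
    return sum(ord(c) for c in s)

def xor_hash(s: str) -> int:
    h = 0
    for c in s:
        h ^= ord(c)
    return h

def poly_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

print(sum_hash("abc") == sum_hash("bca") == sum_hash("cba"))  # True
print(xor_hash("ab") == xor_hash("ba"))                        # True
print(poly_hash("abc") == poly_hash("bca"))                    # False
```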
Let's develop the polynomial hash formula rigorously. We'll see exactly why it works and understand the mathematical machinery that makes it so effective.
The Polynomial Representation:
Given a string s of length m, we interpret it as a polynomial evaluated at base b:
$$H(s) = \sum_{i=0}^{m-1} s[i] \cdot b^{m-1-i} \pmod{M}$$
Or equivalently, using position from the right:
$$H(s) = s[0] \cdot b^{m-1} + s[1] \cdot b^{m-2} + \cdots + s[m-2] \cdot b^1 + s[m-1] \cdot b^0 \pmod{M}$$
Why Polynomials?
Polynomials have a remarkable property from basic field theory: over a field, a polynomial of degree n-1 has at most n-1 roots. This means that for two different strings of length m, their hash values (as polynomials evaluated at base b) can only collide if b happens to be a root of their difference polynomial, which has at most m-1 roots out of M possible values of b.
For two distinct strings of length m, if we choose the base b uniformly at random from {0, 1, ..., M-1}, the probability that their polynomial hashes collide is at most (m-1)/M. For m = 1000 and M = 10⁹, this is less than 0.0001%!
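This bound is easy to sanity-check empirically. The sketch below (purely illustrative, with the modulus shrunk to the small prime 10007 so every base can be tried exhaustively) counts how many bases make two fixed distinct strings collide; the analysis guarantees at most m - 1 such bases:

```python
# Exhaustively count colliding bases for two distinct strings of length m.
# The difference polynomial has degree at most m - 1, so over the prime
# field Z_M it can have at most m - 1 roots.
M = 10007  # a small prime, chosen only to make exhaustive search feasible
s1, s2 = "hello", "world"
m = len(s1)

def poly_hash(s: str, base: int, mod: int) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

colliding = sum(1 for b in range(M) if poly_hash(s1, b, M) == poly_hash(s2, b, M))
print(f"Colliding bases: {colliding} of {M} (bound: at most {m - 1})")
```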
Concrete Example:
Let's compute the polynomial hash of "cat" with base b = 31 and modulus M = 10⁹ + 7, using ASCII values 'c' = 99, 'a' = 97, 't' = 116:

H("cat") = 99 · 31² + 97 · 31¹ + 116 · 31⁰ = 95,139 + 3,007 + 116 = 98,262

Now compare with "act":

H("act") = 97 · 31² + 99 · 31¹ + 116 · 31⁰ = 93,217 + 3,069 + 116 = 96,402

Different strings → different hashes. The position-weighting through powers of b ensures anagrams produce different hash values.
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """
    Compute the polynomial hash of a string.

    The hash is computed as:
        H(s) = s[0] * base^(n-1) + s[1] * base^(n-2) + ... + s[n-1] * base^0

    We use Horner's method for efficient computation:
        H(s) = ((s[0] * base + s[1]) * base + s[2]) * base + ...

    Time Complexity: O(n) where n = len(s)
    Space Complexity: O(1)
    """
    h = 0
    for char in s:
        # Horner's method: multiply previous by base, add new character
        h = (h * base + ord(char)) % mod
    return h


# Example usage
print(f"Hash of 'cat': {polynomial_hash('cat')}")        # 98262
print(f"Hash of 'act': {polynomial_hash('act')}")        # 96402
print(f"Hash of 'cat' again: {polynomial_hash('cat')}")  # 98262 (deterministic)

# Demonstrating collision resistance
words = ['hello', 'world', 'olleh', 'dlrow', 'cat', 'tac']
hashes = {word: polynomial_hash(word) for word in words}
for word, h in hashes.items():
    print(f"  '{word}' -> {h}")
```

Instead of computing each power of base separately (which would require O(n²) operations, or O(n) with precomputation), we use Horner's method to evaluate the polynomial iteratively. This transforms H(s) = s₀b³ + s₁b² + s₂b + s₃ into H(s) = ((s₀ × b + s₁) × b + s₂) × b + s₃, computing the entire hash in a single O(n) pass with O(1) space.
The true power of polynomial hashing emerges when we need to compute hashes for many overlapping substrings. In pattern matching, we slide a window of length m across a text of length n, computing the hash of each window position. A naive approach would compute n-m+1 hashes, each in O(m) time, for O((n-m+1) × m) = O(nm) total time.
The Rolling Hash Insight:
Consider two adjacent windows in the text: the window at position i covers text[i..i+m-1], and the window at position i+1 covers text[i+1..i+m].
These windows share m-1 characters! The rolling hash exploits this overlap to compute the new hash from the old hash in O(1) time.
The Rolling Update Formula:
Let H(i) = hash of text[i..i+m-1].
Then:
$$H(i+1) = (H(i) - text[i] \cdot b^{m-1}) \cdot b + text[i+m] \pmod{M}$$
Intuition:

- Subtract text[i] · b^(m-1) to remove the leftmost character's contribution.
- Multiply by b to shift every remaining character's weight up by one power.
- Add text[i+m] as the new lowest-order (b⁰) term.
Step-by-Step Example:
Let's trace through rolling the hash for text "abcde" with window size 3, using base = 31 (the numbers stay small, so no modulus is needed):

Initial Hash (window "abc"):

H = 97 · 31² + 98 · 31 + 99 = 93,217 + 3,038 + 99 = 96,354

Roll to "bcd" (remove 'a' = 97, add 'd' = 100):

H = (96,354 − 97 · 31²) · 31 + 100 = 3,137 · 31 + 100 = 97,347

Verification (compute directly):

H("bcd") = 98 · 31² + 99 · 31 + 100 = 94,178 + 3,069 + 100 = 97,347 ✓

Roll to "cde" (remove 'b' = 98, add 'e' = 101):

H = (97,347 − 98 · 31²) · 31 + 101 = 3,169 · 31 + 101 = 98,340

Verification:

H("cde") = 99 · 31² + 100 · 31 + 101 = 95,139 + 3,100 + 101 = 98,340 ✓
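The trace above can be checked mechanically; a minimal sketch (modulus omitted since the numbers stay small):

```python
# Verify the rolling update against direct recomputation for "abcde",
# window size 3, base 31.
base = 31

def direct_hash(s: str) -> int:
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

h = direct_hash("abc")
print(h)  # 96354
h = (h - ord('a') * base**2) * base + ord('d')  # roll "abc" -> "bcd"
print(h, direct_hash("bcd"))  # 97347 97347
h = (h - ord('b') * base**2) * base + ord('e')  # roll "bcd" -> "cde"
print(h, direct_hash("cde"))  # 98340 98340
```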
```python
class RollingHash:
    """
    A rolling hash implementation for efficient substring hashing.

    This class allows computing hash values for all substrings of a
    fixed length in O(n) total time, by updating the hash incrementally.
    """

    def __init__(self, base: int = 31, mod: int = 10**9 + 7):
        self.base = base
        self.mod = mod
        self.power = 1  # Will hold base^(window_size - 1)

    def compute_initial_hash(self, s: str, window_size: int) -> int:
        """
        Compute the hash of the first window and prepare the power multiplier.

        Time Complexity: O(window_size)
        """
        self.window_size = window_size
        h = 0
        # Compute hash using Horner's method
        for i in range(window_size):
            h = (h * self.base + ord(s[i])) % self.mod
        # Precompute base^(window_size - 1) for rolling
        self.power = 1
        for _ in range(window_size - 1):
            self.power = (self.power * self.base) % self.mod
        return h

    def roll(self, current_hash: int, old_char: str, new_char: str) -> int:
        """
        Roll the hash window: remove old_char from the left,
        add new_char to the right.

        Time Complexity: O(1)

        Formula: H(i+1) = (H(i) - old * base^(m-1)) * base + new
        """
        # Remove contribution of old (leftmost) character
        h = (current_hash - ord(old_char) * self.power) % self.mod
        # Shift left (multiply by base) and add new (rightmost) character
        h = (h * self.base + ord(new_char)) % self.mod
        # Handle negative modulo (Python handles this, but explicit for clarity)
        return h if h >= 0 else h + self.mod


def find_all_substring_hashes(text: str, window_size: int) -> list[int]:
    """
    Compute the hash of every substring of given length.

    Time Complexity: O(n) where n = len(text)
    Space Complexity: O(n - window_size + 1) for the result
    """
    if len(text) < window_size:
        return []

    rh = RollingHash()
    hashes = []

    # Compute initial hash
    current_hash = rh.compute_initial_hash(text, window_size)
    hashes.append(current_hash)

    # Roll through the rest of the text
    for i in range(len(text) - window_size):
        current_hash = rh.roll(
            current_hash,
            text[i],                # Character leaving the window
            text[i + window_size]   # Character entering the window
        )
        hashes.append(current_hash)

    return hashes


# Example: Find all occurrences of pattern in text using rolling hash
def rabin_karp_search(text: str, pattern: str) -> list[int]:
    """
    Find all occurrences of pattern in text using the Rabin-Karp algorithm.

    Time Complexity: O(n + m) average, O(nm) worst case
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    rh = RollingHash()
    # Compute pattern hash
    pattern_hash = rh.compute_initial_hash(pattern, m)

    rh_text = RollingHash()
    text_hash = rh_text.compute_initial_hash(text, m)

    occurrences = []
    for i in range(n - m + 1):
        if i > 0:
            text_hash = rh_text.roll(text_hash, text[i - 1], text[i + m - 1])
        # If hashes match, verify with actual string comparison
        if text_hash == pattern_hash:
            if text[i:i + m] == pattern:  # Guard against collisions
                occurrences.append(i)

    return occurrences


# Demonstration
text = "abracadabra"
pattern = "abra"
print(f"Text: '{text}'")
print(f"Pattern: '{pattern}'")
print(f"Found at positions: {rabin_karp_search(text, pattern)}")  # [0, 7]
```

For many applications, we need to query the hash of arbitrary substrings, not just consecutive windows. The prefix hash array technique enables O(1) substring hash queries after O(n) preprocessing.
The Prefix Hash Concept:
Define prefix_hash[i] = H(s[0..i-1]) = hash of the first i characters.
Then for any substring s[l..r] (0-indexed, inclusive):
$$H(s[l..r]) = (prefix\_hash[r+1] - prefix\_hash[l] \cdot power[r-l+1]) \pmod{M}$$

where power[k] = b^k mod M.
Think of it like extracting digits from a decimal number. In base 10, the "hash" of a digit string is simply the number it spells: the first four digits of 12345 give 1234, and the first digit gives 1. To extract the middle digits "234", compute 1234 − 1 × 10³ = 234. Similarly, prefix_hash[5] covers s[0..4] and prefix_hash[2] covers s[0..1]; to get s[2..4], we subtract prefix_hash[2], scaled by b³ to align the place values, from prefix_hash[5].
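Here is that decimal analogy as runnable code (purely illustrative):

```python
# In base 10 the polynomial hash of a digit string is just the number
# it spells, so prefix subtraction is ordinary arithmetic.
s = "12345"
prefix4 = int(s[:4])   # 1234, covers s[0..3]
prefix1 = int(s[:1])   # 1, covers s[0..0]
# Scale the short prefix by 10^3 to align place values, then subtract.
middle = prefix4 - prefix1 * 10**3
print(middle)  # 234
```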
Why This Works:
Let's trace through mathematically:
prefix_hash[5] = s[0]·b⁴ + s[1]·b³ + s[2]·b² + s[3]·b¹ + s[4]·b⁰
prefix_hash[2] = s[0]·b¹ + s[1]·b⁰
prefix_hash[2] × b³ = s[0]·b⁴ + s[1]·b³ (scaled to match)
prefix_hash[5] - prefix_hash[2] × b³ = s[2]·b² + s[3]·b¹ + s[4]·b⁰ = H(s[2..4])
The key insight is that we multiply the prefix hash by an appropriate power of b to align the polynomial terms before subtraction.
Implementation Considerations:
```python
class StringHasher:
    """
    A string hasher that supports O(1) substring hash queries
    after O(n) preprocessing.

    This is the preferred approach when you need to query many
    arbitrary substrings of the same string.
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        """
        Precompute prefix hashes and powers.

        Time Complexity: O(n)
        Space Complexity: O(n)
        """
        self.s = s
        self.base = base
        self.mod = mod
        self.n = len(s)

        # prefix_hash[i] = hash of s[0..i-1] (first i characters)
        # prefix_hash[0] = 0 (empty string)
        self.prefix_hash = [0] * (self.n + 1)
        # power[i] = base^i mod M
        self.power = [1] * (self.n + 1)

        # Build prefix hashes using Horner's method
        for i in range(self.n):
            self.prefix_hash[i + 1] = (
                self.prefix_hash[i] * self.base + ord(s[i])
            ) % self.mod

        # Precompute powers of base
        for i in range(1, self.n + 1):
            self.power[i] = (self.power[i - 1] * self.base) % self.mod

    def get_hash(self, left: int, right: int) -> int:
        """
        Get the hash of substring s[left..right] (0-indexed, inclusive).

        Time Complexity: O(1)

        This uses the formula:
            H(s[l..r]) = prefix_hash[r+1] - prefix_hash[l] * base^(r-l+1)
        """
        length = right - left + 1
        # Subtract the prefix contribution (scaled appropriately)
        result = (
            self.prefix_hash[right + 1]
            - self.prefix_hash[left] * self.power[length]
        ) % self.mod
        # Handle negative modulo
        return result if result >= 0 else result + self.mod

    def compare_substrings(
        self, l1: int, r1: int, l2: int, r2: int
    ) -> bool:
        """
        Compare two substrings s[l1..r1] and s[l2..r2] using hashes.
        Returns True if they are equal (with high probability).

        Note: False positives are possible due to hash collisions.

        Time Complexity: O(1)
        """
        if r1 - l1 != r2 - l2:  # Different lengths must be different
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)


# Example: Find the longest repeated substring using binary search + hashing
def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.

    Uses binary search on length + rolling hash for O(n log n) average time.
    """
    if len(s) <= 1:
        return ""

    hasher = StringHasher(s)
    n = len(s)

    def has_repeated_of_length(length: int) -> int:
        """Check if there's a repeated substring of given length.
        Returns starting index if found, -1 otherwise."""
        seen = {}
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Verify to guard against collision
                if s[i:i+length] == s[seen[h]:seen[h]+length]:
                    return i
            seen[h] = i
        return -1

    # Binary search for the longest length
    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        idx = has_repeated_of_length(mid)
        if idx != -1:
            result = s[idx:idx + mid]
            left = mid + 1   # Try longer
        else:
            right = mid - 1  # Try shorter

    return result


# Demonstration
s = "banana"
print(f"String: '{s}'")
print(f"Longest repeated substring: '{longest_repeated_substring(s)}'")  # 'ana'

hasher = StringHasher(s)
print(f"\nHash queries:")
print(f"  Hash of 'ban' (0..2): {hasher.get_hash(0, 2)}")
print(f"  Hash of 'ana' (1..3): {hasher.get_hash(1, 3)}")
print(f"  Hash of 'ana' (3..5): {hasher.get_hash(3, 5)}")
print(f"  Substrings equal? {hasher.compare_substrings(1, 3, 3, 5)}")  # True
```

It's worth understanding why polynomial hashing works so well from a theoretical perspective. The collision resistance of polynomial hashing comes from deep results in algebra and number theory.
The Schwartz-Zippel Lemma:
This fundamental result states: For a non-zero polynomial P(x) of degree d over a field F, if we pick x uniformly at random from a finite subset S ⊆ F, then:
Pr[P(x) = 0] ≤ d / |S|
Application to String Hashing:
Given two distinct strings s₁ and s₂ of length m, define:
P(x) = H(s₁, x) - H(s₂, x)
where H(s, x) is the polynomial hash of s with base x.
Since s₁ ≠ s₂, at least one coefficient differs, so P(x) is a non-zero polynomial of degree at most m-1.
By Schwartz-Zippel: Pr[P(b) = 0] ≤ (m-1)/M
For m = 10³ and M = 10⁹: collision probability < 10⁻⁶ per pair!
The above analysis is for a single pair of strings. When comparing many strings (like all substrings of length m), the birthday paradox kicks in. With √M ≈ 31,623 substrings, you have a ~50% chance of at least one collision. This is why using a large modulus (10⁹+7) or multiple hash functions is crucial for large-scale applications.
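A quick simulation makes the birthday effect tangible. The sketch below (illustrative, with a deliberately undersized modulus of 10⁶) hashes 5,000 random 12-character strings and counts collisions between distinct strings; with roughly C(5000, 2) / 10⁶ ≈ 12.5 expected collisions, at least one is nearly certain:

```python
import random

# Birthday effect: with M = 10^6, a few thousand random strings
# almost surely produce a collision between two *different* strings.
random.seed(42)
M, BASE = 10**6, 31

def poly_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % M
    return h

seen = {}   # hash value -> first string that produced it
collisions = 0
for _ in range(5000):
    s = "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=12))
    h = poly_hash(s)
    if h in seen and seen[h] != s:
        collisions += 1
    seen.setdefault(h, s)
print(f"Collisions among 5000 random strings mod 10^6: {collisions}")
```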
Uniformity of Distribution:
Polynomial hashes distribute uniformly because:

- When b is coprime to M, multiplication by b permutes the residues mod M, so each Horner step spreads values across the whole hash space.
- Every character perturbs the running value before the next multiplication, so every position influences the final hash.
- Over a prime modulus the residues form a field, so the Schwartz-Zippel root bound applies directly to difference polynomials.
Why Prime Moduli Matter:
Consider what happens with a composite modulus like M = 256 (2⁸): if the base shares a factor with M (say b = 32 = 2⁵), then b² = 1024 ≡ 0 (mod 256), so every character except the last two vanishes from the hash entirely. More generally, Z/256 is not a field, so a non-zero difference polynomial can have far more roots than its degree, and the collision guarantee breaks down.

With a prime M: the residues {0, 1, ..., M-1} form a field, no power of a non-zero base vanishes, and every non-zero difference polynomial of degree at most m-1 has at most m-1 roots, preserving the (m-1)/M collision bound.
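A concrete demonstration of the composite-modulus failure, assuming base 32 and M = 256 (both chosen for illustration): since 32² ≡ 0 (mod 256), only the last two characters can influence the hash, so wildly different strings with the same two-character suffix always collide:

```python
# With base = 32 and mod = 256, 32^2 = 1024 is divisible by 256, so
# every character except the final two is multiplied by zero.
def poly_hash(s: str, base: int, mod: int) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

a = poly_hash("completely_unrelated_prefix_xy", 32, 256)
b = poly_hash("xy", 32, 256)
print(a, b, a == b)  # 121 121 True
```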
| Scenario | M = 10⁶ | M = 10⁹ | M = 10¹⁸ (double hash) |
|---|---|---|---|
| Single pair (m=1000) | ~0.1% | ~0.0001% | ~10⁻¹³% |
| 100 substrings | ~10% | ~0.01% | ~10⁻⁹% |
| 10,000 substrings | ~99.995% | ~0.5% | ~10⁻⁵% |
| 1,000,000 substrings | Certain | ~39% | ~0.005% |
For competitive programming and most applications: Use M = 10⁹ + 7 (this is prime). For production systems handling large data: Use double hashing (two different base/mod pairs) or a 64-bit hash. For cryptographic applications: Polynomial hashes are NOT secure—use SHA-256 or similar.
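A minimal double-hashing sketch (the class-free design and the particular parameter pairs are our own illustration, not a standard API): a false match must now collide under two independent (base, mod) pairs simultaneously, which squares the already-tiny collision probability:

```python
# Double hashing: compare strings by a pair of independent polynomial
# hashes. Both moduli are primes commonly used for hashing.
PARAMS = [(131, 10**9 + 7), (137, 998244353)]

def double_hash(s: str) -> tuple[int, int]:
    hashes = []
    for base, mod in PARAMS:
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        hashes.append(h)
    return tuple(hashes)

print(double_hash("cat") == double_hash("cat"))  # True (deterministic)
print(double_hash("cat") == double_hash("act"))  # False
```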
We've now established the mathematical foundations of polynomial rolling hashes, the workhorse technique for efficient string matching. Let's consolidate the key insights:

- Treating characters as coefficients of a polynomial in base b encodes position, so anagrams hash differently.
- Horner's method evaluates the polynomial in a single O(n) pass with O(1) extra space.
- The rolling update recovers each new window's hash from the previous one in O(1) time.
- Prefix hash arrays answer arbitrary substring hash queries in O(1) after O(n) preprocessing.
- The Schwartz-Zippel lemma bounds the per-pair collision probability by (m-1)/M for a randomly chosen base.
Connections to Other Topics:

- The rolling update is the engine of the Rabin-Karp pattern-matching algorithm shown earlier.
- O(1) substring comparison enables binary-search tricks such as finding the longest repeated substring.
- Substring hash values plug directly into hash tables and sets for duplicate detection.
What's Next:
Now that we understand polynomial hashing, the next critical question is: How do we choose the base and modulus? Poor choices can lead to more collisions, vulnerability to adversarial inputs, or arithmetic overflow. The next page covers parameter selection in depth.
You now understand polynomial rolling hashes from first principles—the mathematical foundation, the rolling update technique, prefix array preprocessing, and the theoretical guarantees. These concepts form the basis for all string hashing applications we'll explore in subsequent pages.