Imagine you need to find all occurrences of a pattern within a massive text file—say, searching for a specific DNA sequence in the human genome (approximately 3 billion characters). The naive approach of comparing character-by-character at every position would require trillions of operations. But what if you could represent any string as a single number and compare numbers instead of strings?
This is the fundamental insight behind string hashing: transform strings into numerical fingerprints (hash values) that can be compared in constant time. The magic lies in polynomial rolling hashes, a technique that not only computes these fingerprints efficiently but also allows us to update them incrementally as we slide through a text—turning what could be an O(n × m) operation into an O(n + m) average-case operation.
By the end of this page, you will deeply understand how polynomial hashing works mathematically, why polynomials provide excellent distribution properties, how to compute hash values incrementally in O(1) time per position, and the theoretical foundation that makes rolling hashes so powerful for string algorithms.
Before diving into polynomial hashes specifically, let's establish what we need from a string hash function and why most simple approaches fail.
The Goal:
A hash function H maps strings to integers: H: Σ* → {0, 1, ..., M-1}
where Σ is our alphabet and M is the size of our hash space. We want this function to have several properties:

- **Position-aware:** rearranging characters should change the hash.
- **Low collision rate:** distinct strings should rarely share a hash value.
- **Efficient:** computable in O(n) time for a string of length n.
- **Rolling-friendly:** sliding a fixed-size window by one character should update the hash in O(1).
Consider summing ASCII values: H('abc') = 97 + 98 + 99 = 294. This fails because 'abc' and 'bca' have the same hash (294), and 'cba' also has the same hash. Position information is lost, creating massive collisions. Any hash that ignores character positions will suffer from this fundamental flaw.
The Position Problem:
Two strings are different if they have different characters OR if the same characters appear in different positions. Therefore, our hash function must encode positional information. This is where polynomials enter the picture.
The Polynomial Insight:
Consider representing a string s = s₀s₁s₂...sₙ₋₁ as a polynomial:
H(s) = s₀ · bⁿ⁻¹ + s₁ · bⁿ⁻² + ... + sₙ₋₂ · b¹ + sₙ₋₁ · b⁰
where b is called the base and each sᵢ is the numerical value of the character at position i.
This representation has a beautiful property: two different strings produce the same hash value only when the base b happens to be a root of the polynomial formed by their difference. For a random choice of b, this is extremely unlikely.
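To make the "root of the difference polynomial" idea concrete, here is a tiny illustrative sketch: for the strings "ab" and "ba", the hash H(s, b) = ord(s[0]) · b + ord(s[1]) gives a difference polynomial P(b) = (97 - 98)b + (98 - 97) = -b + 1, which vanishes only at b = 1.

```python
# Brute-force over small bases: "ab" and "ba" collide only where the
# difference polynomial P(b) = -b + 1 vanishes, i.e. at the single root b = 1.
def h(s: str, b: int) -> int:
    value = 0
    for ch in s:
        value = value * b + ord(ch)
    return value

colliding_bases = [b for b in range(100) if h("ab", b) == h("ba", b)]
print(colliding_bases)  # [1]
```

Any sensible base (say, b ≥ 31) avoids this lone root, so the two anagrams hash differently.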
| Approach | Position-Aware | Collision Rate | Rolling Update | Example Issue |
|---|---|---|---|---|
| Sum of ASCII values | No | Very High | Yes | 'abc' = 'bca' = 'cba' |
| XOR of ASCII values | No | Very High | Yes | 'ab' XOR = 'ba' XOR |
| Product of ASCII values | No | High | No | 'ab' × 'ba' same |
| Polynomial hash | Yes | Low (tunable) | Yes | None (probabilistic) |
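The table's failure modes are easy to reproduce. A short sketch (the helper names are ours, chosen for illustration):

```python
# Position-blind hashes assign identical values to anagrams;
# the polynomial hash does not.
def sum_hash(s: str) -> int:
    return sum(ord(c) for c in s)

def xor_hash(s: str) -> int:
    h = 0
    for c in s:
        h ^= ord(c)
    return h

def poly_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

print(sum_hash("abc") == sum_hash("bca") == sum_hash("cba"))  # True
print(xor_hash("ab") == xor_hash("ba"))                        # True
print(poly_hash("abc") == poly_hash("bca"))                    # False
```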
Let's develop the polynomial hash formula rigorously. We'll see exactly why it works and understand the mathematical machinery that makes it so effective.
The Polynomial Representation:
Given a string s of length m, we interpret it as a polynomial evaluated at base b:
$$H(s) = \sum_{i=0}^{m-1} s[i] \cdot b^{m-1-i} \pmod{M}$$
Or equivalently, using position from the right:
$$H(s) = s[0] \cdot b^{m-1} + s[1] \cdot b^{m-2} + \cdots + s[m-2] \cdot b^1 + s[m-1] \cdot b^0 \pmod{M}$$
Why Polynomials?
Polynomials have a remarkable property from basic field theory: over a field, a polynomial of degree n-1 has at most n-1 roots. This means that for two different strings of length m, their hash values (as polynomials evaluated at base b) can only collide if b happens to be a root of their difference polynomial, which has at most m-1 roots out of M possible values of b.
For two distinct strings of length m, if we choose the base b uniformly at random from {0, 1, ..., M-1}, the probability that their polynomial hashes collide is at most (m-1)/M. For m = 1000 and M = 10⁹, this is less than 0.0001%!
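This bound is easy to sanity-check empirically. The sketch below (purely illustrative, with the modulus shrunk to the small prime 10007 so every base can be tried exhaustively) counts how many bases make two fixed distinct strings collide; the analysis guarantees at most m - 1 such bases:

```python
# Exhaustively count colliding bases for two distinct strings of length m.
# The difference polynomial has degree at most m - 1, so over the prime
# field Z_M it can have at most m - 1 roots.
M = 10007  # a small prime, chosen only to make exhaustive search feasible
s1, s2 = "hello", "world"
m = len(s1)

def poly_hash(s: str, base: int, mod: int) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

colliding = sum(1 for b in range(M) if poly_hash(s1, b, M) == poly_hash(s2, b, M))
print(f"Colliding bases: {colliding} of {M} (bound: at most {m - 1})")
```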
Concrete Example:
Let's compute the polynomial hash of "cat" with base b = 31 and modulus M = 10⁹ + 7, using ASCII values 'c' = 99, 'a' = 97, 't' = 116:

H("cat") = 99 · 31² + 97 · 31¹ + 116 · 31⁰ = 95,139 + 3,007 + 116 = 98,262

Now compare with "act":

H("act") = 97 · 31² + 99 · 31¹ + 116 · 31⁰ = 93,217 + 3,069 + 116 = 96,402

Different strings → different hashes. The position-weighting through powers of b ensures anagrams produce different hash values.
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> int:
    """
    Compute the polynomial hash of a string.

    The hash is computed as:
        H(s) = s[0] * base^(n-1) + s[1] * base^(n-2) + ... + s[n-1] * base^0

    We use Horner's method for efficient computation:
        H(s) = ((s[0] * base + s[1]) * base + s[2]) * base + ...

    Time Complexity: O(n) where n = len(s)
    Space Complexity: O(1)
    """
    h = 0
    for char in s:
        # Horner's method: multiply previous by base, add new character
        h = (h * base + ord(char)) % mod
    return h


# Example usage
print(f"Hash of 'cat': {polynomial_hash('cat')}")        # 98262
print(f"Hash of 'act': {polynomial_hash('act')}")        # 96402
print(f"Hash of 'cat' again: {polynomial_hash('cat')}")  # 98262 (deterministic)

# Demonstrating collision resistance
words = ['hello', 'world', 'olleh', 'dlrow', 'cat', 'tac']
hashes = {word: polynomial_hash(word) for word in words}
for word, h in hashes.items():
    print(f"  '{word}' -> {h}")
```

Instead of computing each power of base separately (which would require O(n²) operations, or O(n) with precomputation), we use Horner's method to evaluate the polynomial iteratively. This transforms H(s) = s₀b³ + s₁b² + s₂b + s₃ into H(s) = ((s₀ × b + s₁) × b + s₂) × b + s₃, computing the entire hash in a single O(n) pass with O(1) space.
The true power of polynomial hashing emerges when we need to compute hashes for many overlapping substrings. In pattern matching, we slide a window of length m across a text of length n, computing the hash of each window position. A naive approach would compute n-m+1 hashes, each in O(m) time, for O((n-m+1) × m) = O(nm) total time.
The Rolling Hash Insight:
Consider two adjacent windows in the text: the window at position i covers text[i..i+m-1], and the window at position i+1 covers text[i+1..i+m].
These windows share m-1 characters! The rolling hash exploits this overlap to compute the new hash from the old hash in O(1) time.
The Rolling Update Formula:
Let H(i) = hash of text[i..i+m-1].
Then:
$$H(i+1) = (H(i) - text[i] \cdot b^{m-1}) \cdot b + text[i+m] \pmod{M}$$
Intuition:

- Subtract text[i] · b^(m-1) to remove the leftmost character's contribution.
- Multiply by b to shift every remaining character's weight up by one power.
- Add text[i+m] as the new lowest-order (b⁰) term.
Step-by-Step Example:
Let's trace through rolling the hash for text "abcde" with window size 3, using base = 31 (the numbers stay small, so no modulus is needed):

Initial Hash (window "abc"):

H = 97 · 31² + 98 · 31 + 99 = 93,217 + 3,038 + 99 = 96,354

Roll to "bcd" (remove 'a' = 97, add 'd' = 100):

H = (96,354 − 97 · 31²) · 31 + 100 = 3,137 · 31 + 100 = 97,347

Verification (compute directly):

H("bcd") = 98 · 31² + 99 · 31 + 100 = 94,178 + 3,069 + 100 = 97,347 ✓

Roll to "cde" (remove 'b' = 98, add 'e' = 101):

H = (97,347 − 98 · 31²) · 31 + 101 = 3,169 · 31 + 101 = 98,340

Verification:

H("cde") = 99 · 31² + 100 · 31 + 101 = 95,139 + 3,100 + 101 = 98,340 ✓
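The trace above can be checked mechanically; a minimal sketch (modulus omitted since the numbers stay small):

```python
# Verify the rolling update against direct recomputation for "abcde",
# window size 3, base 31.
base = 31

def direct_hash(s: str) -> int:
    h = 0
    for c in s:
        h = h * base + ord(c)
    return h

h = direct_hash("abc")
print(h)  # 96354
h = (h - ord('a') * base**2) * base + ord('d')  # roll "abc" -> "bcd"
print(h, direct_hash("bcd"))  # 97347 97347
h = (h - ord('b') * base**2) * base + ord('e')  # roll "bcd" -> "cde"
print(h, direct_hash("cde"))  # 98340 98340
```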
```python
class RollingHash:
    """
    A rolling hash implementation for efficient substring hashing.

    This class allows computing hash values for all substrings of a
    fixed length in O(n) total time, by updating the hash incrementally.
    """

    def __init__(self, base: int = 31, mod: int = 10**9 + 7):
        self.base = base
        self.mod = mod
        self.power = 1  # Will hold base^(window_size - 1)

    def compute_initial_hash(self, s: str, window_size: int) -> int:
        """
        Compute the hash of the first window and prepare the power multiplier.

        Time Complexity: O(window_size)
        """
        self.window_size = window_size
        h = 0
        # Compute hash using Horner's method
        for i in range(window_size):
            h = (h * self.base + ord(s[i])) % self.mod
        # Precompute base^(window_size - 1) for rolling
        self.power = 1
        for _ in range(window_size - 1):
            self.power = (self.power * self.base) % self.mod
        return h

    def roll(self, current_hash: int, old_char: str, new_char: str) -> int:
        """
        Roll the hash window: remove old_char from the left,
        add new_char to the right.

        Time Complexity: O(1)

        Formula: H(i+1) = (H(i) - old * base^(m-1)) * base + new
        """
        # Remove contribution of old (leftmost) character
        h = (current_hash - ord(old_char) * self.power) % self.mod
        # Shift left (multiply by base) and add new (rightmost) character
        h = (h * self.base + ord(new_char)) % self.mod
        # Handle negative modulo (Python handles this, but explicit for clarity)
        return h if h >= 0 else h + self.mod


def find_all_substring_hashes(text: str, window_size: int) -> list[int]:
    """
    Compute the hash of every substring of given length.

    Time Complexity: O(n) where n = len(text)
    Space Complexity: O(n - window_size + 1) for the result
    """
    if len(text) < window_size:
        return []

    rh = RollingHash()
    hashes = []

    # Compute initial hash
    current_hash = rh.compute_initial_hash(text, window_size)
    hashes.append(current_hash)

    # Roll through the rest of the text
    for i in range(len(text) - window_size):
        current_hash = rh.roll(
            current_hash,
            text[i],                # Character leaving the window
            text[i + window_size]   # Character entering the window
        )
        hashes.append(current_hash)

    return hashes


# Example: Find all occurrences of pattern in text using rolling hash
def rabin_karp_search(text: str, pattern: str) -> list[int]:
    """
    Find all occurrences of pattern in text using the Rabin-Karp algorithm.

    Time Complexity: O(n + m) average, O(nm) worst case
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    rh = RollingHash()
    # Compute pattern hash
    pattern_hash = rh.compute_initial_hash(pattern, m)

    rh_text = RollingHash()
    text_hash = rh_text.compute_initial_hash(text, m)

    occurrences = []
    for i in range(n - m + 1):
        if i > 0:
            text_hash = rh_text.roll(text_hash, text[i - 1], text[i + m - 1])
        # If hashes match, verify with actual string comparison
        if text_hash == pattern_hash:
            if text[i:i + m] == pattern:  # Guard against collisions
                occurrences.append(i)

    return occurrences


# Demonstration
text = "abracadabra"
pattern = "abra"
print(f"Text: '{text}'")
print(f"Pattern: '{pattern}'")
print(f"Found at positions: {rabin_karp_search(text, pattern)}")  # [0, 7]
```

For many applications, we need to query the hash of arbitrary substrings, not just consecutive windows. The prefix hash array technique enables O(1) substring hash queries after O(n) preprocessing.
The Prefix Hash Concept:
Define prefix_hash[i] = H(s[0..i-1]) = hash of the first i characters.
Then for any substring s[l..r] (0-indexed, inclusive):
$$H(s[l..r]) = (prefix\_hash[r+1] - prefix\_hash[l] \cdot power[r-l+1]) \pmod{M}$$

where power[k] = b^k mod M.
Think of it like extracting digits from a decimal number. In base 10, the "hash" of a digit string is simply the number it spells: the first four digits of 12345 give 1234, and the first digit gives 1. To extract the middle digits "234", compute 1234 − 1 × 10³ = 234. Similarly, prefix_hash[5] covers s[0..4] and prefix_hash[2] covers s[0..1]; to get s[2..4], we subtract prefix_hash[2], scaled by b³ to align the place values, from prefix_hash[5].
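Here is that decimal analogy as runnable code (purely illustrative):

```python
# In base 10 the polynomial hash of a digit string is just the number
# it spells, so prefix subtraction is ordinary arithmetic.
s = "12345"
prefix4 = int(s[:4])   # 1234, covers s[0..3]
prefix1 = int(s[:1])   # 1, covers s[0..0]
# Scale the short prefix by 10^3 to align place values, then subtract.
middle = prefix4 - prefix1 * 10**3
print(middle)  # 234
```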
Why This Works:
Let's trace through mathematically:
prefix_hash[5] = s[0]·b⁴ + s[1]·b³ + s[2]·b² + s[3]·b¹ + s[4]·b⁰
prefix_hash[2] = s[0]·b¹ + s[1]·b⁰
prefix_hash[2] × b³ = s[0]·b⁴ + s[1]·b³ (scaled to match)
prefix_hash[5] - prefix_hash[2] × b³ = s[2]·b² + s[3]·b¹ + s[4]·b⁰ = H(s[2..4])
The key insight is that we multiply the prefix hash by an appropriate power of b to align the polynomial terms before subtraction.
Implementation Considerations:
```python
class StringHasher:
    """
    A string hasher that supports O(1) substring hash queries
    after O(n) preprocessing.

    This is the preferred approach when you need to query many
    arbitrary substrings of the same string.
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        """
        Precompute prefix hashes and powers.

        Time Complexity: O(n)
        Space Complexity: O(n)
        """
        self.s = s
        self.base = base
        self.mod = mod
        self.n = len(s)

        # prefix_hash[i] = hash of s[0..i-1] (first i characters)
        # prefix_hash[0] = 0 (empty string)
        self.prefix_hash = [0] * (self.n + 1)
        # power[i] = base^i mod M
        self.power = [1] * (self.n + 1)

        # Build prefix hashes using Horner's method
        for i in range(self.n):
            self.prefix_hash[i + 1] = (
                self.prefix_hash[i] * self.base + ord(s[i])
            ) % self.mod

        # Precompute powers of base
        for i in range(1, self.n + 1):
            self.power[i] = (self.power[i - 1] * self.base) % self.mod

    def get_hash(self, left: int, right: int) -> int:
        """
        Get the hash of substring s[left..right] (0-indexed, inclusive).

        Time Complexity: O(1)

        This uses the formula:
            H(s[l..r]) = prefix_hash[r+1] - prefix_hash[l] * base^(r-l+1)
        """
        length = right - left + 1
        # Subtract the prefix contribution (scaled appropriately)
        result = (
            self.prefix_hash[right + 1]
            - self.prefix_hash[left] * self.power[length]
        ) % self.mod
        # Handle negative modulo
        return result if result >= 0 else result + self.mod

    def compare_substrings(
        self, l1: int, r1: int, l2: int, r2: int
    ) -> bool:
        """
        Compare two substrings s[l1..r1] and s[l2..r2] using hashes.
        Returns True if they are equal (with high probability).

        Note: False positives are possible due to hash collisions.

        Time Complexity: O(1)
        """
        if r1 - l1 != r2 - l2:  # Different lengths must be different
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)


# Example: Find the longest repeated substring using binary search + hashing
def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.

    Uses binary search on length + rolling hash for O(n log n) average time.
    """
    if len(s) <= 1:
        return ""

    hasher = StringHasher(s)
    n = len(s)

    def has_repeated_of_length(length: int) -> int:
        """Check if there's a repeated substring of given length.
        Returns starting index if found, -1 otherwise."""
        seen = {}
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Verify to guard against collision
                if s[i:i+length] == s[seen[h]:seen[h]+length]:
                    return i
            seen[h] = i
        return -1

    # Binary search for the longest length
    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        idx = has_repeated_of_length(mid)
        if idx != -1:
            result = s[idx:idx + mid]
            left = mid + 1   # Try longer
        else:
            right = mid - 1  # Try shorter

    return result


# Demonstration
s = "banana"
print(f"String: '{s}'")
print(f"Longest repeated substring: '{longest_repeated_substring(s)}'")  # 'ana'

hasher = StringHasher(s)
print(f"\nHash queries:")
print(f"  Hash of 'ban' (0..2): {hasher.get_hash(0, 2)}")
print(f"  Hash of 'ana' (1..3): {hasher.get_hash(1, 3)}")
print(f"  Hash of 'ana' (3..5): {hasher.get_hash(3, 5)}")
print(f"  Substrings equal? {hasher.compare_substrings(1, 3, 3, 5)}")  # True
```

It's worth understanding why polynomial hashing works so well from a theoretical perspective. The collision resistance of polynomial hashing comes from deep results in algebra and number theory.
The Schwartz-Zippel Lemma:
This fundamental result states: For a non-zero polynomial P(x) of degree d over a field F, if we pick x uniformly at random from a finite subset S ⊆ F, then:
Pr[P(x) = 0] ≤ d / |S|
Application to String Hashing:
Given two distinct strings s₁ and s₂ of length m, define:
P(x) = H(s₁, x) - H(s₂, x)
where H(s, x) is the polynomial hash of s with base x.
Since s₁ ≠ s₂, at least one coefficient differs, so P(x) is a non-zero polynomial of degree at most m-1.
By Schwartz-Zippel: Pr[P(b) = 0] ≤ (m-1)/M
For m = 10³ and M = 10⁹: collision probability < 10⁻⁶ per pair!
The above analysis is for a single pair of strings. When comparing many strings (like all substrings of length m), the birthday paradox kicks in. With √M ≈ 31,623 substrings, you have a ~50% chance of at least one collision. This is why using a large modulus (10⁹+7) or multiple hash functions is crucial for large-scale applications.
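A quick simulation makes the birthday effect tangible. The sketch below (illustrative, with a deliberately undersized modulus of 10⁶) hashes 5,000 random 12-character strings and counts collisions between distinct strings; with roughly C(5000, 2) / 10⁶ ≈ 12.5 expected collisions, at least one is nearly certain:

```python
import random

# Birthday effect: with M = 10^6, a few thousand random strings
# almost surely produce a collision between two *different* strings.
random.seed(42)
M, BASE = 10**6, 31

def poly_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * BASE + ord(c)) % M
    return h

seen = {}   # hash value -> first string that produced it
collisions = 0
for _ in range(5000):
    s = "".join(random.choices("abcdefghijklmnopqrstuvwxyz", k=12))
    h = poly_hash(s)
    if h in seen and seen[h] != s:
        collisions += 1
    seen.setdefault(h, s)
print(f"Collisions among 5000 random strings mod 10^6: {collisions}")
```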
Uniformity of Distribution:
Polynomial hashes distribute uniformly because:

- When b is coprime to M, multiplication by b permutes the residues mod M, so each Horner step spreads values across the whole hash space.
- Every character perturbs the running value before the next multiplication, so every position influences the final hash.
- Over a prime modulus the residues form a field, so the Schwartz-Zippel root bound applies directly to difference polynomials.
Why Prime Moduli Matter:
Consider what happens with a composite modulus like M = 256 (2⁸): if the base shares a factor with M (say b = 32 = 2⁵), then b² = 1024 ≡ 0 (mod 256), so every character except the last two vanishes from the hash entirely. More generally, Z/256 is not a field, so a non-zero difference polynomial can have far more roots than its degree, and the collision guarantee breaks down.

With a prime M: the residues {0, 1, ..., M-1} form a field, no power of a non-zero base vanishes, and every non-zero difference polynomial of degree at most m-1 has at most m-1 roots, preserving the (m-1)/M collision bound.
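A concrete demonstration of the composite-modulus failure, assuming base 32 and M = 256 (both chosen for illustration): since 32² ≡ 0 (mod 256), only the last two characters can influence the hash, so wildly different strings with the same two-character suffix always collide:

```python
# With base = 32 and mod = 256, 32^2 = 1024 is divisible by 256, so
# every character except the final two is multiplied by zero.
def poly_hash(s: str, base: int, mod: int) -> int:
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

a = poly_hash("completely_unrelated_prefix_xy", 32, 256)
b = poly_hash("xy", 32, 256)
print(a, b, a == b)  # 121 121 True
```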
| Scenario | M = 10⁶ | M = 10⁹ | M = 10¹⁸ (double hash) |
|---|---|---|---|
| Single pair (m=1000) | ~0.1% | ~0.0001% | ~10⁻¹³% |
| 100 substrings | ~10% | ~0.01% | ~10⁻⁹% |
| 10,000 substrings | ~99.995% | ~0.5% | ~10⁻⁵% |
| 1,000,000 substrings | Certain | ~39% | ~0.005% |
For competitive programming and most applications: Use M = 10⁹ + 7 (this is prime). For production systems handling large data: Use double hashing (two different base/mod pairs) or a 64-bit hash. For cryptographic applications: Polynomial hashes are NOT secure—use SHA-256 or similar.
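A minimal double-hashing sketch (the class-free design and the particular parameter pairs are our own illustration, not a standard API): a false match must now collide under two independent (base, mod) pairs simultaneously, which squares the already-tiny collision probability:

```python
# Double hashing: compare strings by a pair of independent polynomial
# hashes. Both moduli are primes commonly used for hashing.
PARAMS = [(131, 10**9 + 7), (137, 998244353)]

def double_hash(s: str) -> tuple[int, int]:
    hashes = []
    for base, mod in PARAMS:
        h = 0
        for c in s:
            h = (h * base + ord(c)) % mod
        hashes.append(h)
    return tuple(hashes)

print(double_hash("cat") == double_hash("cat"))  # True (deterministic)
print(double_hash("cat") == double_hash("act"))  # False
```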
We've now established the mathematical foundations of polynomial rolling hashes, the workhorse technique for efficient string matching. Let's consolidate the key insights:

- Treating characters as coefficients of a polynomial in base b encodes position, so anagrams hash differently.
- Horner's method evaluates the polynomial in a single O(n) pass with O(1) extra space.
- The rolling update recovers each new window's hash from the previous one in O(1) time.
- Prefix hash arrays answer arbitrary substring hash queries in O(1) after O(n) preprocessing.
- The Schwartz-Zippel lemma bounds the per-pair collision probability by (m-1)/M for a randomly chosen base.
Connections to Other Topics:

- The rolling update is the engine of the Rabin-Karp pattern-matching algorithm shown earlier.
- O(1) substring comparison enables binary-search tricks such as finding the longest repeated substring.
- Substring hash values plug directly into hash tables and sets for duplicate detection.
What's Next:
Now that we understand polynomial hashing, the next critical question is: How do we choose the base and modulus? Poor choices can lead to more collisions, vulnerability to adversarial inputs, or arithmetic overflow. The next page covers parameter selection in depth.
You now understand polynomial rolling hashes from first principles—the mathematical foundation, the rolling update technique, prefix array preprocessing, and the theoretical guarantees. These concepts form the basis for all string hashing applications we'll explore in subsequent pages.