Loading content...
You've learned that polynomial hashing transforms strings into numbers using the formula H(s) = Σ s[i] × b^(n-1-i) mod M. But here's a critical question that separates robust implementations from fragile ones: How do you choose the base b and the modulus M?
Poor parameter choices can lead to:
This page provides a rigorous, comprehensive guide to parameter selection—covering the mathematical constraints, practical recommendations, and defensive techniques that production systems demand.
By the end of this page, you will understand why the base must exceed alphabet size, why the modulus should be prime and large, how to prevent integer overflow, how to select parameters that resist adversarial attacks, and the standard parameter choices used in competitive programming and industry.
The base b determines how character values are weighted based on their positions. Choosing it correctly is essential for avoiding trivial collisions.
The Fundamental Constraint: b > alphabet_size
If your strings use an alphabet of size σ (e.g., σ=26 for lowercase letters, σ=128 for ASCII, σ=256 for extended ASCII), the base must satisfy:
b > σ
Why?
Consider what happens if b = σ or b < σ. With b = 26 for lowercase letters ('a'=0, 'b'=1, ...):
H("ba") = 1 × 26 + 0 = 26 H("az") = 0 × 26 + 25 = 25 (close!)
But more critically, with b = 10 and digits: H("19") = 1 × 10 + 9 = 19
This is exactly the decimal number 19! When the base equals the alphabet size, the hash becomes a simple base conversion, which is fine but limits flexibility.
If b ≤ max_char_value, certain character combinations can 'carry' into adjacent positions, creating collisions. For example, with b=10 and ASCII: H([character 10]) = 10 = 1×10 + 0 = H([character 1][character 0]). Always ensure b > max possible character value.
Recommended Base Values:
Why Odd Primes Are Preferred:
Common Standard Choices:
| Context | Base | Reasoning |
|---|---|---|
| Competitive programming | 31 | Prime, > 26, fits well in calculations |
| General ASCII strings | 31 or 37 | Small primes, efficient multiplication |
| Case-sensitive strings | 53 or 131 | > 52 (letters) or > 126 (printable ASCII) |
| Unicode strings | 257 | Prime just above 256 |
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879
def recommended_base(alphabet_description: str) -> int: """ Return a recommended base for the given alphabet type. The base must be: 1. Greater than the maximum character value in the alphabet 2. Preferably prime (for better distribution) 3. Not too large (to avoid overflow issues) """ recommendations = { 'lowercase': 31, # 'a'-'z' (26 chars, values 0-25 or 97-122) 'alphanumeric': 37, # a-z, A-Z, 0-9 (62 chars) 'ascii_printable': 131, # Characters 32-126 (95 chars) 'ascii_full': 257, # Characters 0-255 (256 chars) 'unicode': 1000003, # Large prime for Unicode } return recommendations.get(alphabet_description, 31) def validate_base(base: int, characters: str) -> tuple[bool, str]: """ Validate that a base is appropriate for the given character set. """ max_char = max(ord(c) for c in characters) if base <= max_char: return (False, f"Base {base} must be > max character value {max_char}") # Check if base is prime (simple check) def is_prime(n): if n < 2: return False for i in range(2, int(n**0.5) + 1): if n % i == 0: return False return True if not is_prime(base): return (True, f"Warning: Base {base} is not prime, consider using a prime") return (True, f"Base {base} is valid for this character set") # Example: Demonstrate the problem with too-small basesdef show_collision_risk(): """Demonstrate why base must exceed alphabet size.""" # With base = 26 for lowercase letters (using 'a'=0 mapping) base = 26 def hash_with_base(s: str, b: int) -> int: h = 0 for c in s: h = h * b + (ord(c) - ord('a')) return h # These should have different hashes but... test_pairs = [ ("ba", "aa"), # Different strings ] print("Demonstrating collision risk with small base:") print(f"Using base = {base}") # 'ba' = 1*26 + 0 = 26 # 'aa' = 0*26 + 0 = 0 (no collision here) # But consider longer strings and patterns... for s in ["aa", "ba", "ab", "bb"]: h = hash_with_base(s, base) print(f" H('{s}') = {h}") print("\nWith proper base = 31:") base = 31 for s in ["aa", "ba", "ab", "bb"]: h = hash_with_base(s, base) print(f" H('{s}') = {h}") show_collision_risk()The modulus M determines the size of our hash space and has profound implications for collision probability, arithmetic safety, and computational efficiency.
Why Use a Modulus at All?
Without a modulus, polynomial hashes grow exponentially with string length:
H(s) = s[0] × b^(n-1) + ... + s[n-1] × b^0
For a 1000-character string with b=31:
The modulus constrains hash values to a fixed range [0, M-1], enabling:
We want M as LARGE as possible (fewer collisions) but small enough to avoid INTEGER OVERFLOW during computation. The sweet spot depends on your programming language's integer type and whether you need efficient arithmetic.
Why the Modulus Should Be Prime:
A prime modulus provides several mathematical advantages:
Maximum distribution uniformity: For prime M, the multiplicative group (ℤ/Mℤ)* has order M-1, the maximum possible. This means powers of b cycle through the maximum number of distinct values.
No zero divisors: In ℤ/Mℤ with prime M, if ab ≡ 0 (mod M), then a ≡ 0 or b ≡ 0. This prevents 'hidden zeros' from polluting hash computations.
Modular inverses exist: Every non-zero element has a multiplicative inverse, enabling division operations when needed.
Schwartz-Zippel applies directly: The collision probability bounds assume field arithmetic, which requires prime M.
Why Composite Moduli Fail:
Consider M = 256 = 2^8:
Even with odd base, M = 256 has only 128 'units' (elements with inverses), reducing effective hash space by half.
| Modulus | Binary Size | Primality | Use Case | Notes |
|---|---|---|---|---|
| 10^9 + 7 | ~30 bits | Prime ✓ | Competitive programming | Standard choice, fits 32-bit with room |
| 10^9 + 9 | ~30 bits | Prime ✓ | Alternative to 10^9+7 | Useful for double hashing |
| 998244353 | ~30 bits | Prime ✓ | NTT-friendly | = 119 × 2^23 + 1 |
| 2^61 - 1 | 61 bits | Mersenne Prime ✓ | Low-collision applications | Efficient mod via bit ops |
| 2^64 (overflow) | 64 bits | Not prime ✗ | Speed optimization | Relies on natural overflow |
| 10^18 + 9 | ~60 bits | Prime ✓ | Ultra-low collision | Requires 128-bit intermediate |
The 10^9 + 7 Standard:
This modulus has become the de facto standard in competitive programming for good reasons:
When 10^9 + 7 Isn't Enough:
For large-scale applications (millions of substrings), the birthday paradox means collisions become likely. Solutions:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110
# Common modulus choices with their propertiesMODULI = { 'CP_STANDARD': 10**9 + 7, # Standard competitive programming 'CP_ALT': 10**9 + 9, # Alternative for double hashing 'NTT_FRIENDLY': 998244353, # For Number Theoretic Transform 'MERSENNE_61': (1 << 61) - 1, # Mersenne prime, very low collision 'MERSENNE_31': (1 << 31) - 1, # Smaller Mersenne prime} def is_prime(n: int) -> bool: """Miller-Rabin primality test for large numbers.""" if n < 2: return False if n == 2 or n == 3: return True if n % 2 == 0: return False # Write n-1 as 2^r * d r, d = 0, n - 1 while d % 2 == 0: r += 1 d //= 2 # Witnesses to test witnesses = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37] for a in witnesses: if a >= n: continue x = pow(a, d, n) if x == 1 or x == n - 1: continue for _ in range(r - 1): x = pow(x, 2, n) if x == n - 1: break else: return False return True def analyze_modulus(m: int, name: str): """Analyze properties of a modulus choice.""" print(f"\n{name}: {m}") print(f" Prime: {is_prime(m)}") print(f" Bits needed: {m.bit_length()}") print(f" Max product before overflow (64-bit): {(1 << 63) // m}") # Collision probability per pair collision_prob = 1 / m print(f" Collision probability (per pair): {collision_prob:.2e}") # Number of strings before 50% collision (birthday paradox) import math birthday_50 = int(math.sqrt(2 * m * math.log(2))) print(f" 50% collision threshold (birthday): ~{birthday_50:,} strings") # Analyze common modulifor name, mod in MODULI.items(): analyze_modulus(mod, name) def safe_product(a: int, b: int, mod: int) -> int: """ Compute (a * b) % mod safely without overflow. In Python, this is automatic (arbitrary precision). In C++/Java, you'd need __int128 or modular multiplication. """ return (a * b) % mod def demonstrate_overflow_risk(): """ Show why intermediate overflow is a concern. When computing (hash * base + char) % mod: - hash can be up to mod-1 - base is typically 31-1000 - So hash * base can be (mod-1) * base For mod = 10^9+7, base = 31: - Max intermediate: ~3.1 × 10^10 - Fits in 64-bit (max ~9.2 × 10^18) ✓ For mod = 10^18, base = 31: - Max intermediate: ~3.1 × 10^19 - EXCEEDS 64-bit! Need 128-bit arithmetic """ mod_small = 10**9 + 7 mod_large = 10**18 + 9 base = 31 max_64bit = (1 << 63) - 1 intermediate_small = (mod_small - 1) * base intermediate_large = (mod_large - 1) * base print("\nOverflow Risk Analysis:") print(f" 64-bit signed max: {max_64bit:,}") print(f" Mod=10^9+7, base=31: intermediate max = {intermediate_small:,}") print(f" Safe for 64-bit: {intermediate_small < max_64bit}") print(f" Mod=10^18+9, base=31: intermediate max = {intermediate_large:,}") print(f" Safe for 64-bit: {intermediate_large < max_64bit}") demonstrate_overflow_risk()In languages with fixed-width integers (C++, Java, Rust), overflow is a silent killer. Let's understand exactly when overflow occurs and how to prevent it.
The Critical Computation:
When computing rolling hashes, the most dangerous operation is:
hash = (hash * base + character_value) % mod
Before the modulo operation, hash * base + character_value can overflow.
Overflow Analysis for 64-bit Signed Integers:
Max safe value: 2^63 - 1 ≈ 9.2 × 10^18
For this computation:
Max intermediate = (mod - 1) × base + char_max
| Modulus | Base | Max Intermediate | 64-bit Safe? |
|---|---|---|---|
| 10^9 + 7 | 31 | ~3.1 × 10^10 | ✓ Yes |
| 10^9 + 7 | 1000 | ~10^12 | ✓ Yes |
| 10^12 | 31 | ~3.1 × 10^13 | ✓ Yes |
| 10^18 | 31 | ~3.1 × 10^19 | ✗ NO! |
In C++ and Java, signed integer overflow is undefined behavior or wraps silently. Your tests might pass (small inputs don't overflow) while production crashes on large strings. ALWAYS analyze the math before choosing parameters.
Strategies for Large Moduli:
1. Use Unsigned 64-bit with 2^64 Overflow
Instead of explicit modulo, let the arithmetic naturally overflow at 2^64:
uint64_t hash = 0;
for (char c : s) {
hash = hash * base + c; // Natural overflow at 2^64
}
Pros: Very fast (no modulo operation), simple Cons: 2^64 is not prime, slightly worse distribution
2. Use 128-bit Intermediate (C++)
using u64 = uint64_t;
using u128 = __uint128_t;
constexpr u64 MOD = 1e18 + 9;
u64 safe_mul_mod(u64 a, u64 b) {
return (u128)a * b % MOD;
}
3. Modular Exponentiation via Splitting
For environments without 128-bit types, split the multiplication:
a × b = (a_hi × 2^32 + a_lo) × (b_hi × 2^32 + b_lo)
Then reduce each component modulo M carefully.
4. Python/Java BigInteger
These languages handle arbitrary precision automatically:
hash = (hash * base + ord(c)) % mod # Always safe in Python
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596
#include <iostream>#include <string>#include <vector>using namespace std; // Strategy 1: Standard 64-bit with safe modulusnamespace StandardHash { const long long MOD = 1e9 + 7; const long long BASE = 31; long long hash(const string& s) { long long h = 0; for (char c : s) { // Safe: (10^9) * 31 + 127 < 2^63 h = (h * BASE + c) % MOD; } return h; }} // Strategy 2: 128-bit intermediate for large modulusnamespace LargeModHash { using u64 = uint64_t; using u128 = __uint128_t; // GCC/Clang extension const u64 MOD = 1000000000000000009ULL; // 10^18 + 9 const u64 BASE = 31; u64 safe_mod(u128 x) { return x % MOD; } u64 hash(const string& s) { u64 h = 0; for (char c : s) { // Use 128-bit for intermediate h = safe_mod((u128)h * BASE + c); } return h; }} // Strategy 3: Natural 64-bit overflow (non-prime modulus)namespace OverflowHash { using u64 = uint64_t; const u64 BASE = 31; u64 hash(const string& s) { u64 h = 0; for (char c : s) { // Let it overflow naturally at 2^64 h = h * BASE + c; } return h; }} // Strategy 4: Mersenne prime with fast modulonamespace MersenneHash { using u64 = uint64_t; using u128 = __uint128_t; // 2^61 - 1 is a Mersenne prime const u64 MOD = (1ULL << 61) - 1; const u64 BASE = 31; // Fast modulo for Mersenne prime: x % (2^61 - 1) u64 mersenne_mod(u128 x) { // x = a * 2^61 + b // x mod (2^61-1) = a + b (mod 2^61-1) u64 low = x & MOD; // b = x mod 2^61 u64 high = x >> 61; // a = x / 2^61 u64 result = low + high; if (result >= MOD) result -= MOD; return result; } u64 hash(const string& s) { u64 h = 0; for (char c : s) { h = mersenne_mod((u128)h * BASE + c); } return h; }} int main() { string test = "Hello, World! This is a test string for hashing."; cout << "Standard (10^9+7): " << StandardHash::hash(test) << endl; cout << "Large mod (10^18+9): " << LargeModHash::hash(test) << endl; cout << "Overflow (2^64): " << OverflowHash::hash(test) << endl; cout << "Mersenne (2^61-1): " << MersenneHash::hash(test) << endl; return 0;}In competitive programming, you control the test cases. In production systems, attackers control the input. This distinction is crucial.
The Attack Scenario:
If an attacker knows your hash parameters (base and modulus), they can craft inputs that deliberately collide:
Why This Is Possible:
Given base b and modulus M, finding strings with identical hashes is straightforward:
In 2011, researchers demonstrated 'hash flooding' attacks against web frameworks (PHP, ASP.NET, Java, Python) by submitting POST data with colliding keys. A single HTTP request could consume 100% CPU for hours. This led to widespread adoption of randomized hashing.
Defense Strategy 1: Randomize the Base
Instead of using a fixed base, generate a random base at program startup:
#include <random>
const long long BASE = uniform_int_distribution<long long>(31, 1000000)(random_device{});
The attacker cannot predict the base, so cannot construct colliding inputs.
Defense Strategy 2: Randomize the Modulus
Use a random large prime for each run:
import random
def random_large_prime():
while True:
p = random.randint(10**9, 10**10)
if is_prime(p):
return p
Defense Strategy 3: Use Cryptographic Hashing
For security-critical applications, polynomial hashes are insufficient. Use:
Defense Strategy 4: Double Hashing
Use two independent hash functions; require both to match:
hash1 = poly_hash(s, base=31, mod=10**9+7)
hash2 = poly_hash(s, base=37, mod=10**9+9)
combined = (hash1, hash2) # Both must match
Collision probability becomes (1/M₁) × (1/M₂) ≈ 10^-18
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110
import randomimport time class SecureStringHasher: """ String hasher with randomized parameters for adversarial resistance. The base is randomly chosen at instantiation, making it impossible for an attacker to pre-compute colliding inputs. """ def __init__(self, seed: int = None): """ Initialize with random or seeded parameters. Args: seed: For reproducibility in testing; None for random """ if seed is not None: random.seed(seed) # Randomize base from a range of primes prime_bases = [31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151] # Choose random base self.base = random.choice(prime_bases) # Use two different moduli for double hashing self.mod1 = 10**9 + 7 self.mod2 = 10**9 + 9 # Also randomize which of several mod pairs we use mod_choices = [ (10**9 + 7, 10**9 + 9), (998244353, 10**9 + 7), (10**9 + 21, 10**9 + 33), ] self.mod1, self.mod2 = random.choice(mod_choices) def hash(self, s: str) -> tuple[int, int]: """ Compute double hash for collision resistance. Returns a tuple (hash1, hash2) that both must match. """ h1, h2 = 0, 0 for c in s: code = ord(c) h1 = (h1 * self.base + code) % self.mod1 h2 = (h2 * self.base + code) % self.mod2 return (h1, h2) def are_likely_equal(self, s1: str, s2: str) -> bool: """ Check if two strings are likely equal based on hash. Returns True if both hashes match (collisions rare but possible). For guaranteed correctness, verify with string comparison. """ return self.hash(s1) == self.hash(s2) def demonstrate_randomization(): """Show that each hasher instance uses different parameters.""" print("Demonstrating parameter randomization:") for i in range(5): hasher = SecureStringHasher() h = hasher.hash("test") print(f" Instance {i}: base={hasher.base}, mods=({hasher.mod1}, {hasher.mod2})") print(f" hash('test') = {h}") def collision_attack_simulation(): """ Simulate attempting a collision attack. With fixed parameters, an attacker can find collisions. With randomized parameters, they cannot. """ print("\nCollision attack simulation:") # Fixed hasher (vulnerable) class FixedHasher: def hash(self, s: str) -> int: h = 0 for c in s: h = (h * 31 + ord(c)) % (10**9 + 7) return h # Attacker knows parameters and crafts collision fixed = FixedHasher() # Find two different strings with same hash (simplified demo) # In practice, attackers use mathematical techniques target = fixed.hash("hello") print(f" Fixed hasher: hash('hello') = {target}") # With randomized hasher, attacker doesn't know parameters secure = SecureStringHasher() h1 = secure.hash("hello") h2 = secure.hash("world") print(f" Secure hasher: hash('hello') = {h1}") print(f" Secure hasher: hash('world') = {h2}") print(f" Attacker cannot predict these values without knowing parameters!") demonstrate_randomization()collision_attack_simulation()After all this analysis, what should you actually use? Here are battle-tested parameter combinations for different scenarios:
| Use Case | Base | Modulus | Notes |
|---|---|---|---|
| Competitive Programming | 31 | 10^9 + 7 | Standard, works for most problems |
| CP with Double Hash | 31, 37 | 10^9+7, 10^9+9 | For problems that require ultra-low collision |
| Production (low security) | Randomized 31-127 | 10^9 + 7 | Random base per instance |
| Production (high security) | Use SipHash/BLAKE2 | — | Cryptographic hash required |
| Very long strings | 31 | 2^61 - 1 | Large hash space, fast Mersenne mod |
| Rabin-Karp implementation | 256 or 263 | Large prime | Base ≥ 256 for byte strings |
The Most Common Choices Explained:
Base = 31:
Modulus = 10^9 + 7 (1000000007):
Double Hashing with (31, 10^9+7) and (37, 10^9+9):
If in doubt: use base=31, mod=10^9+7. For important applications: add a second hash. For security-sensitive code: don't use polynomial hashing; use a proper cryptographic hash. For performance-critical code with known-safe inputs: consider unsigned overflow hashing.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113
"""Production-ready string hasher with recommended parameters.""" class PolynomialHash: """ Standard polynomial hash implementation with recommended parameters. Uses the competitive programming standard: base=31, mod=10^9+7 with optional double hashing for increased collision resistance. """ # Standard competitive programming parameters BASE1 = 31 MOD1 = 10**9 + 7 # Second set for double hashing BASE2 = 37 MOD2 = 10**9 + 9 def __init__(self, s: str, double_hash: bool = False): """ Initialize hasher with precomputed prefix hashes. Args: s: The string to hash double_hash: If True, compute two independent hashes """ self.s = s self.n = len(s) self.double_hash = double_hash # Precompute prefix hashes and powers (first hash) self.prefix1 = [0] * (self.n + 1) self.power1 = [1] * (self.n + 1) for i in range(self.n): self.prefix1[i + 1] = ( self.prefix1[i] * self.BASE1 + ord(s[i]) ) % self.MOD1 self.power1[i + 1] = ( self.power1[i] * self.BASE1 ) % self.MOD1 # Second hash if requested if double_hash: self.prefix2 = [0] * (self.n + 1) self.power2 = [1] * (self.n + 1) for i in range(self.n): self.prefix2[i + 1] = ( self.prefix2[i] * self.BASE2 + ord(s[i]) ) % self.MOD2 self.power2[i + 1] = ( self.power2[i] * self.BASE2 ) % self.MOD2 def get_hash(self, left: int, right: int) -> int | tuple[int, int]: """ Get hash of substring s[left:right+1] (inclusive). Returns int if single hash, tuple if double hash. Time: O(1) """ length = right - left + 1 h1 = ( self.prefix1[right + 1] - self.prefix1[left] * self.power1[length] ) % self.MOD1 if h1 < 0: h1 += self.MOD1 if not self.double_hash: return h1 h2 = ( self.prefix2[right + 1] - self.prefix2[left] * self.power2[length] ) % self.MOD2 if h2 < 0: h2 += self.MOD2 return (h1, h2) def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool: """Compare two substrings by hash. O(1) time.""" if r1 - l1 != r2 - l2: return False return self.get_hash(l1, r1) == self.get_hash(l2, r2) # Demonstrationdef demo(): # Single hash (standard) text = "abracadabra" hasher = PolynomialHash(text) print(f"String: '{text}'") print(f"\nSingle hash mode:") print(f" hash(0,3) 'abra': {hasher.get_hash(0, 3)}") print(f" hash(7,10) 'abra': {hasher.get_hash(7, 10)}") print(f" Equal? {hasher.compare(0, 3, 7, 10)}") # True # Double hash (when correctness matters) hasher2 = PolynomialHash(text, double_hash=True) print(f"\nDouble hash mode:") print(f" hash(0,3): {hasher2.get_hash(0, 3)}") print(f" hash(7,10): {hasher2.get_hash(7, 10)}") print(f" Equal? {hasher2.compare(0, 3, 7, 10)}") # True demo()Parameter selection for polynomial hashes might seem like a minor implementation detail, but it's the difference between a robust algorithm and one that fails at scale or succumbs to attack.
What's Next:
Even with optimal parameters, collisions will eventually occur—especially when comparing many strings. The next page covers strategies for handling collisions: detection, double hashing as a systematic approach, and verification techniques.
You now understand how to choose hash parameters that are mathematically sound, computationally safe, and resistant to adversarial manipulation. This knowledge is essential for implementing reliable string hashing in any environment.