String Hashing - Learning Module

Loading content...

0/276

Choosing Hash Base and Modulus

The Art and Science of Parameter Selection

You've learned that polynomial hashing transforms strings into numbers using the formula H(s) = Σ s[i] × b^(n-1-i) mod M. But here's a critical question that separates robust implementations from fragile ones: How do you choose the base b and the modulus M?

Poor parameter choices can lead to:

Excessive hash collisions, destroying performance guarantees
Arithmetic overflow, causing silent bugs and incorrect results
Vulnerability to adversarial inputs that deliberately trigger collisions
Hash value clustering that defeats the purpose of hashing entirely

This page provides a rigorous, comprehensive guide to parameter selection—covering the mathematical constraints, practical recommendations, and defensive techniques that production systems demand.

What You Will Learn

By the end of this page, you will understand why the base must exceed alphabet size, why the modulus should be prime and large, how to prevent integer overflow, how to select parameters that resist adversarial attacks, and the standard parameter choices used in competitive programming and industry.

Choosing the Base (b)

The base b determines how character values are weighted based on their positions. Choosing it correctly is essential for avoiding trivial collisions.

The Fundamental Constraint: b > alphabet_size

If your strings use an alphabet of size σ (e.g., σ=26 for lowercase letters, σ=128 for ASCII, σ=256 for extended ASCII), the base must satisfy:

b > σ

Why?

Consider what happens if b = σ or b < σ. With b = 26 for lowercase letters ('a'=0, 'b'=1, ...):

H("ba") = 1 × 26 + 0 = 26 H("az") = 0 × 26 + 25 = 25 (close!)

But more critically, with b = 10 and digits: H("19") = 1 × 10 + 9 = 19

This is exactly the decimal number 19! When the base equals the alphabet size, the hash becomes a simple base conversion, which is fine but limits flexibility.

The Carry Problem

If b ≤ max_char_value, certain character combinations can 'carry' into adjacent positions, creating collisions. For example, with b=10 and ASCII: H([character 10]) = 10 = 1×10 + 0 = H([character 1][character 0]). Always ensure b > max possible character value.

Recommended Base Values:

For lowercase letters only (a-z): b ≥ 27, commonly b = 31
For alphanumeric (a-z, A-Z, 0-9): b ≥ 63, commonly b = 37
For printable ASCII (32-126): b ≥ 127, commonly b = 131
For full ASCII (0-255): b ≥ 256, commonly b = 257 or 263
For Unicode: b ≈ 10^6 or use character code directly with large base

Why Odd Primes Are Preferred:

Coprimality with modulus: If b and M share factors, certain hash values become unreachable
Avoids power patterns: Even bases can create patterns (all hashes become even, etc.)
Maximum period: Prime bases have maximal cycles modulo prime M

Common Standard Choices:

Context	Base	Reasoning
Competitive programming	31	Prime, > 26, fits well in calculations
General ASCII strings	31 or 37	Small primes, efficient multiplication
Case-sensitive strings	53 or 131	> 52 (letters) or > 126 (printable ASCII)
Unicode strings	257	Prime just above 256

base_selection.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
def recommended_base(alphabet_description: str) -> int:
    """
    Return a recommended base for the given alphabet type.
    
    The base must be:
    1. Greater than the maximum character value in the alphabet
    2. Preferably prime (for better distribution)
    3. Not too large (to avoid overflow issues)
    """
    recommendations = {
        'lowercase': 31,      # 'a'-'z' (26 chars, values 0-25 or 97-122)
        'alphanumeric': 37,   # a-z, A-Z, 0-9 (62 chars)
        'ascii_printable': 131,  # Characters 32-126 (95 chars)
        'ascii_full': 257,    # Characters 0-255 (256 chars)
        'unicode': 1000003,   # Large prime for Unicode
    }
    
    return recommendations.get(alphabet_description, 31)
 
 
def validate_base(base: int, characters: str) -> tuple[bool, str]:
    """
    Validate that a base is appropriate for the given character set.
    """
    max_char = max(ord(c) for c in characters)
    
    if base <= max_char:
        return (False, f"Base {base} must be > max character value {max_char}")
    
    # Check if base is prime (simple check)
    def is_prime(n):
        if n < 2: return False
        for i in range(2, int(n**0.5) + 1):
            if n % i == 0: return False
        return True
    
    if not is_prime(base):
        return (True, f"Warning: Base {base} is not prime, consider using a prime")
    
    return (True, f"Base {base} is valid for this character set")
 
 
# Example: Demonstrate the problem with too-small bases
def show_collision_risk():
    """Demonstrate why base must exceed alphabet size."""
    
    # With base = 26 for lowercase letters (using 'a'=0 mapping)
    base = 26
    
    def hash_with_base(s: str, b: int) -> int:
        h = 0
        for c in s:
            h = h * b + (ord(c) - ord('a'))
        return h
    
    # These should have different hashes but...
    test_pairs = [
        ("ba", "aa"),  # Different strings
    ]
    
    print("Demonstrating collision risk with small base:")
    print(f"Using base = {base}")
    
    # 'ba' = 1*26 + 0 = 26
    # 'aa' = 0*26 + 0 = 0 (no collision here)
    # But consider longer strings and patterns...
    
    for s in ["aa", "ba", "ab", "bb"]:
        h = hash_with_base(s, base)
        print(f"  H('{s}') = {h}")
    
    print("\nWith proper base = 31:")
    base = 31
    for s in ["aa", "ba", "ab", "bb"]:
        h = hash_with_base(s, base)
        print(f"  H('{s}') = {h}")
 
 
show_collision_risk()

Choosing the Modulus (M)

The modulus M determines the size of our hash space and has profound implications for collision probability, arithmetic safety, and computational efficiency.

Why Use a Modulus at All?

Without a modulus, polynomial hashes grow exponentially with string length:

H(s) = s[0] × b^(n-1) + ... + s[n-1] × b^0

For a 1000-character string with b=31:

b^999 has approximately 1487 decimal digits
No fixed-precision integer type can hold such values
Comparisons become expensive (arbitrary precision arithmetic)

The modulus constrains hash values to a fixed range [0, M-1], enabling:

Fixed-size integer storage
O(1) comparison and arithmetic
Bounded memory in hash tables

The Two Competing Goals

We want M as LARGE as possible (fewer collisions) but small enough to avoid INTEGER OVERFLOW during computation. The sweet spot depends on your programming language's integer type and whether you need efficient arithmetic.

Why the Modulus Should Be Prime:

A prime modulus provides several mathematical advantages:

Maximum distribution uniformity: For prime M, the multiplicative group (ℤ/Mℤ)* has order M-1, the maximum possible. This means powers of b cycle through the maximum number of distinct values.
No zero divisors: In ℤ/Mℤ with prime M, if ab ≡ 0 (mod M), then a ≡ 0 or b ≡ 0. This prevents 'hidden zeros' from polluting hash computations.
Modular inverses exist: Every non-zero element has a multiplicative inverse, enabling division operations when needed.
Schwartz-Zippel applies directly: The collision probability bounds assume field arithmetic, which requires prime M.

Why Composite Moduli Fail:

Consider M = 256 = 2^8:

If the base b is even (say 32), then b^8 = 0 mod 256
After position 8, early characters stop contributing!
This creates massive collision clusters

Even with odd base, M = 256 has only 128 'units' (elements with inverses), reducing effective hash space by half.

Common Modulus Choices
Modulus	Binary Size	Primality	Use Case	Notes
10^9 + 7	~30 bits	Prime ✓	Competitive programming	Standard choice, fits 32-bit with room
10^9 + 9	~30 bits	Prime ✓	Alternative to 10^9+7	Useful for double hashing
998244353	~30 bits	Prime ✓	NTT-friendly	= 119 × 2^23 + 1
2^61 - 1	61 bits	Mersenne Prime ✓	Low-collision applications	Efficient mod via bit ops
2^64 (overflow)	64 bits	Not prime ✗	Speed optimization	Relies on natural overflow
10^18 + 9	~60 bits	Prime ✓	Ultra-low collision	Requires 128-bit intermediate

The 10^9 + 7 Standard:

This modulus has become the de facto standard in competitive programming for good reasons:

Prime: 10^9 + 7 = 1,000,000,007 is indeed prime
Fits in 32 bits: 2^30 ≈ 10^9, so it fits with room to spare
Safe multiplication: (M-1) × (M-1) ≈ 10^18 fits in 64-bit integers
Memorable: Easy to type without looking up
Collision rate: With ~10^9 possible values, collision probability is ~10^-9 per pair

When 10^9 + 7 Isn't Enough:

For large-scale applications (millions of substrings), the birthday paradox means collisions become likely. Solutions:

Use 2^61 - 1 (Mersenne prime, ~10^18 hash space)
Use double hashing with two different moduli
Verify hash matches with string comparison

modulus_selection.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
# Common modulus choices with their properties
MODULI = {
    'CP_STANDARD': 10**9 + 7,      # Standard competitive programming
    'CP_ALT': 10**9 + 9,          # Alternative for double hashing
    'NTT_FRIENDLY': 998244353,     # For Number Theoretic Transform
    'MERSENNE_61': (1 << 61) - 1,  # Mersenne prime, very low collision
    'MERSENNE_31': (1 << 31) - 1,  # Smaller Mersenne prime
}
 
 
def is_prime(n: int) -> bool:
    """Miller-Rabin primality test for large numbers."""
    if n < 2:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    
    # Write n-1 as 2^r * d
    r, d = 0, n - 1
    while d % 2 == 0:
        r += 1
        d //= 2
    
    # Witnesses to test
    witnesses = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
    
    for a in witnesses:
        if a >= n:
            continue
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True
 
 
def analyze_modulus(m: int, name: str):
    """Analyze properties of a modulus choice."""
    print(f"\n{name}: {m}")
    print(f"  Prime: {is_prime(m)}")
    print(f"  Bits needed: {m.bit_length()}")
    print(f"  Max product before overflow (64-bit): {(1 << 63) // m}")
    
    # Collision probability per pair
    collision_prob = 1 / m
    print(f"  Collision probability (per pair): {collision_prob:.2e}")
    
    # Number of strings before 50% collision (birthday paradox)
    import math
    birthday_50 = int(math.sqrt(2 * m * math.log(2)))
    print(f"  50% collision threshold (birthday): ~{birthday_50:,} strings")
 
 
# Analyze common moduli
for name, mod in MODULI.items():
    analyze_modulus(mod, name)
 
 
def safe_product(a: int, b: int, mod: int) -> int:
    """
    Compute (a * b) % mod safely without overflow.
    
    In Python, this is automatic (arbitrary precision).
    In C++/Java, you'd need __int128 or modular multiplication.
    """
    return (a * b) % mod
 
 
def demonstrate_overflow_risk():
    """
    Show why intermediate overflow is a concern.
    
    When computing (hash * base + char) % mod:
    - hash can be up to mod-1
    - base is typically 31-1000
    - So hash * base can be (mod-1) * base
    
    For mod = 10^9+7, base = 31:
    - Max intermediate: ~3.1 × 10^10
    - Fits in 64-bit (max ~9.2 × 10^18) ✓
    
    For mod = 10^18, base = 31:
    - Max intermediate: ~3.1 × 10^19
    - EXCEEDS 64-bit! Need 128-bit arithmetic
    """
    mod_small = 10**9 + 7
    mod_large = 10**18 + 9
    base = 31
    
    max_64bit = (1 << 63) - 1
    
    intermediate_small = (mod_small - 1) * base
    intermediate_large = (mod_large - 1) * base
    
    print("\nOverflow Risk Analysis:")
    print(f"  64-bit signed max: {max_64bit:,}")
    print(f"  Mod=10^9+7, base=31: intermediate max = {intermediate_small:,}")
    print(f"    Safe for 64-bit: {intermediate_small < max_64bit}")
    print(f"  Mod=10^18+9, base=31: intermediate max = {intermediate_large:,}")
    print(f"    Safe for 64-bit: {intermediate_large < max_64bit}")
 
 
demonstrate_overflow_risk()

Preventing Integer Overflow

In languages with fixed-width integers (C++, Java, Rust), overflow is a silent killer. Let's understand exactly when overflow occurs and how to prevent it.

The Critical Computation:

When computing rolling hashes, the most dangerous operation is:

hash = (hash * base + character_value) % mod

Before the modulo operation, hash * base + character_value can overflow.

Overflow Analysis for 64-bit Signed Integers:

Max safe value: 2^63 - 1 ≈ 9.2 × 10^18

For this computation:

hash can be up to mod - 1
base is typically 31-1000
character_value is typically 0-127 (ASCII) or 0-255

Max intermediate = (mod - 1) × base + char_max

Modulus	Base	Max Intermediate	64-bit Safe?
10^9 + 7	31	~3.1 × 10^10	✓ Yes
10^9 + 7	1000	~10^12	✓ Yes
10^12	31	~3.1 × 10^13	✓ Yes
10^18	31	~3.1 × 10^19	✗ NO!

The Subtle Danger

In C++ and Java, signed integer overflow is undefined behavior or wraps silently. Your tests might pass (small inputs don't overflow) while production crashes on large strings. ALWAYS analyze the math before choosing parameters.

Strategies for Large Moduli:

1. Use Unsigned 64-bit with 2^64 Overflow

Instead of explicit modulo, let the arithmetic naturally overflow at 2^64:

uint64_t hash = 0;
for (char c : s) {
    hash = hash * base + c;  // Natural overflow at 2^64
}

Pros: Very fast (no modulo operation), simple Cons: 2^64 is not prime, slightly worse distribution

2. Use 128-bit Intermediate (C++)

using u64 = uint64_t;
using u128 = __uint128_t;

constexpr u64 MOD = 1e18 + 9;

u64 safe_mul_mod(u64 a, u64 b) {
    return (u128)a * b % MOD;
}

3. Modular Exponentiation via Splitting

For environments without 128-bit types, split the multiplication:

a × b = (a_hi × 2^32 + a_lo) × (b_hi × 2^32 + b_lo)

Then reduce each component modulo M carefully.

4. Python/Java BigInteger

These languages handle arbitrary precision automatically:

hash = (hash * base + ord(c)) % mod  # Always safe in Python

overflow_safe_hash.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <iostream>
#include <string>
#include <vector>
using namespace std;
 
// Strategy 1: Standard 64-bit with safe modulus
namespace StandardHash {
    const long long MOD = 1e9 + 7;
    const long long BASE = 31;
    
    long long hash(const string& s) {
        long long h = 0;
        for (char c : s) {
            // Safe: (10^9) * 31 + 127 < 2^63
            h = (h * BASE + c) % MOD;
        }
        return h;
    }
}
 
// Strategy 2: 128-bit intermediate for large modulus
namespace LargeModHash {
    using u64 = uint64_t;
    using u128 = __uint128_t;  // GCC/Clang extension
    
    const u64 MOD = 1000000000000000009ULL;  // 10^18 + 9
    const u64 BASE = 31;
    
    u64 safe_mod(u128 x) {
        return x % MOD;
    }
    
    u64 hash(const string& s) {
        u64 h = 0;
        for (char c : s) {
            // Use 128-bit for intermediate
            h = safe_mod((u128)h * BASE + c);
        }
        return h;
    }
}
 
// Strategy 3: Natural 64-bit overflow (non-prime modulus)
namespace OverflowHash {
    using u64 = uint64_t;
    const u64 BASE = 31;
    
    u64 hash(const string& s) {
        u64 h = 0;
        for (char c : s) {
            // Let it overflow naturally at 2^64
            h = h * BASE + c;
        }
        return h;
    }
}
 
// Strategy 4: Mersenne prime with fast modulo
namespace MersenneHash {
    using u64 = uint64_t;
    using u128 = __uint128_t;
    
    // 2^61 - 1 is a Mersenne prime
    const u64 MOD = (1ULL << 61) - 1;
    const u64 BASE = 31;
    
    // Fast modulo for Mersenne prime: x % (2^61 - 1)
    u64 mersenne_mod(u128 x) {
        // x = a * 2^61 + b
        // x mod (2^61-1) = a + b (mod 2^61-1)
        u64 low = x & MOD;       // b = x mod 2^61
        u64 high = x >> 61;       // a = x / 2^61
        u64 result = low + high;
        if (result >= MOD) result -= MOD;
        return result;
    }
    
    u64 hash(const string& s) {
        u64 h = 0;
        for (char c : s) {
            h = mersenne_mod((u128)h * BASE + c);
        }
        return h;
    }
}
 
int main() {
    string test = "Hello, World! This is a test string for hashing.";
    
    cout << "Standard (10^9+7): " << StandardHash::hash(test) << endl;
    cout << "Large mod (10^18+9): " << LargeModHash::hash(test) << endl;
    cout << "Overflow (2^64): " << OverflowHash::hash(test) << endl;
    cout << "Mersenne (2^61-1): " << MersenneHash::hash(test) << endl;
    
    return 0;
}

Defending Against Adversarial Inputs

In competitive programming, you control the test cases. In production systems, attackers control the input. This distinction is crucial.

The Attack Scenario:

If an attacker knows your hash parameters (base and modulus), they can craft inputs that deliberately collide:

Hash table DoS: Send many keys with the same hash → O(n²) lookup time
Algorithm disruption: Force Rabin-Karp into worst-case O(nm) performance
Cache poisoning: Make different data appear identical via hash collisions

Why This Is Possible:

Given base b and modulus M, finding strings with identical hashes is straightforward:

For length-2 strings s = [a, b] and t = [c, d]:
H(s) = a × b + d, H(t) = c × b + d
Collision when a × b + d ≡ c × b + e (mod M)
Many solutions exist; finding them is polynomial time

Real-World Attacks

In 2011, researchers demonstrated 'hash flooding' attacks against web frameworks (PHP, ASP.NET, Java, Python) by submitting POST data with colliding keys. A single HTTP request could consume 100% CPU for hours. This led to widespread adoption of randomized hashing.

Defense Strategy 1: Randomize the Base

Instead of using a fixed base, generate a random base at program startup:

#include <random>
const long long BASE = uniform_int_distribution<long long>(31, 1000000)(random_device{});

The attacker cannot predict the base, so cannot construct colliding inputs.

Defense Strategy 2: Randomize the Modulus

Use a random large prime for each run:

import random
def random_large_prime():
    while True:
        p = random.randint(10**9, 10**10)
        if is_prime(p):
            return p

Defense Strategy 3: Use Cryptographic Hashing

For security-critical applications, polynomial hashes are insufficient. Use:

SipHash (fast, used in Python dict, Rust HashMap)
SHA-256/SHA-3 (slower but cryptographically secure)
BLAKE2/BLAKE3 (modern, fast, secure)

Defense Strategy 4: Double Hashing

Use two independent hash functions; require both to match:

hash1 = poly_hash(s, base=31, mod=10**9+7)
hash2 = poly_hash(s, base=37, mod=10**9+9)
combined = (hash1, hash2)  # Both must match

Collision probability becomes (1/M₁) × (1/M₂) ≈ 10^-18

secure_hashing.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
import random
import time
 
class SecureStringHasher:
    """
    String hasher with randomized parameters for adversarial resistance.
    
    The base is randomly chosen at instantiation, making it impossible
    for an attacker to pre-compute colliding inputs.
    """
    
    def __init__(self, seed: int = None):
        """
        Initialize with random or seeded parameters.
        
        Args:
            seed: For reproducibility in testing; None for random
        """
        if seed is not None:
            random.seed(seed)
        
        # Randomize base from a range of primes
        prime_bases = [31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
                       101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151]
        
        # Choose random base
        self.base = random.choice(prime_bases)
        
        # Use two different moduli for double hashing
        self.mod1 = 10**9 + 7
        self.mod2 = 10**9 + 9
        
        # Also randomize which of several mod pairs we use
        mod_choices = [
            (10**9 + 7, 10**9 + 9),
            (998244353, 10**9 + 7),
            (10**9 + 21, 10**9 + 33),
        ]
        self.mod1, self.mod2 = random.choice(mod_choices)
    
    def hash(self, s: str) -> tuple[int, int]:
        """
        Compute double hash for collision resistance.
        
        Returns a tuple (hash1, hash2) that both must match.
        """
        h1, h2 = 0, 0
        for c in s:
            code = ord(c)
            h1 = (h1 * self.base + code) % self.mod1
            h2 = (h2 * self.base + code) % self.mod2
        return (h1, h2)
    
    def are_likely_equal(self, s1: str, s2: str) -> bool:
        """
        Check if two strings are likely equal based on hash.
        
        Returns True if both hashes match (collisions rare but possible).
        For guaranteed correctness, verify with string comparison.
        """
        return self.hash(s1) == self.hash(s2)
 
 
def demonstrate_randomization():
    """Show that each hasher instance uses different parameters."""
    print("Demonstrating parameter randomization:")
    
    for i in range(5):
        hasher = SecureStringHasher()
        h = hasher.hash("test")
        print(f"  Instance {i}: base={hasher.base}, mods=({hasher.mod1}, {hasher.mod2})")
        print(f"    hash('test') = {h}")
 
 
def collision_attack_simulation():
    """
    Simulate attempting a collision attack.
    
    With fixed parameters, an attacker can find collisions.
    With randomized parameters, they cannot.
    """
    print("\nCollision attack simulation:")
    
    # Fixed hasher (vulnerable)
    class FixedHasher:
        def hash(self, s: str) -> int:
            h = 0
            for c in s:
                h = (h * 31 + ord(c)) % (10**9 + 7)
            return h
    
    # Attacker knows parameters and crafts collision
    fixed = FixedHasher()
    
    # Find two different strings with same hash (simplified demo)
    # In practice, attackers use mathematical techniques
    target = fixed.hash("hello")
    print(f"  Fixed hasher: hash('hello') = {target}")
    
    # With randomized hasher, attacker doesn't know parameters
    secure = SecureStringHasher()
    h1 = secure.hash("hello")
    h2 = secure.hash("world")
    print(f"  Secure hasher: hash('hello') = {h1}")
    print(f"  Secure hasher: hash('world') = {h2}")
    print(f"  Attacker cannot predict these values without knowing parameters!")
 
 
demonstrate_randomization()
collision_attack_simulation()

Standard Parameter Combinations

After all this analysis, what should you actually use? Here are battle-tested parameter combinations for different scenarios:

Recommended Parameter Combinations by Use Case
Use Case	Base	Modulus	Notes
Competitive Programming	31	10^9 + 7	Standard, works for most problems
CP with Double Hash	31, 37	10^9+7, 10^9+9	For problems that require ultra-low collision
Production (low security)	Randomized 31-127	10^9 + 7	Random base per instance
Production (high security)	Use SipHash/BLAKE2	—	Cryptographic hash required
Very long strings	31	2^61 - 1	Large hash space, fast Mersenne mod
Rabin-Karp implementation	256 or 263	Large prime	Base ≥ 256 for byte strings

The Most Common Choices Explained:

Base = 31:

Small prime greater than 26 (lowercase letters)
Efficient to compute (some compilers optimize × 31 to shifts)
Standard across languages and codebases

Modulus = 10^9 + 7 (1000000007):

Large enough for low collisions (~10^9 possible values)
Prime (optimal distribution)
Fits in 32-bit with room for addition
(M-1)^2 fits in 64-bit for safe multiplication

Double Hashing with (31, 10^9+7) and (37, 10^9+9):

Two independent hash functions
Collision requires matching BOTH hashes
Probability: ~10^-18 (practically impossible by chance)
Use when correctness is critical and you can't verify with string comparison

Quick Decision Guide

If in doubt: use base=31, mod=10^9+7. For important applications: add a second hash. For security-sensitive code: don't use polynomial hashing; use a proper cryptographic hash. For performance-critical code with known-safe inputs: consider unsigned overflow hashing.

standard_hasher.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
"""
Production-ready string hasher with recommended parameters.
"""
 
class PolynomialHash:
    """
    Standard polynomial hash implementation with recommended parameters.
    
    Uses the competitive programming standard: base=31, mod=10^9+7
    with optional double hashing for increased collision resistance.
    """
    
    # Standard competitive programming parameters
    BASE1 = 31
    MOD1 = 10**9 + 7
    
    # Second set for double hashing
    BASE2 = 37
    MOD2 = 10**9 + 9
    
    def __init__(self, s: str, double_hash: bool = False):
        """
        Initialize hasher with precomputed prefix hashes.
        
        Args:
            s: The string to hash
            double_hash: If True, compute two independent hashes
        """
        self.s = s
        self.n = len(s)
        self.double_hash = double_hash
        
        # Precompute prefix hashes and powers (first hash)
        self.prefix1 = [0] * (self.n + 1)
        self.power1 = [1] * (self.n + 1)
        
        for i in range(self.n):
            self.prefix1[i + 1] = (
                self.prefix1[i] * self.BASE1 + ord(s[i])
            ) % self.MOD1
            self.power1[i + 1] = (
                self.power1[i] * self.BASE1
            ) % self.MOD1
        
        # Second hash if requested
        if double_hash:
            self.prefix2 = [0] * (self.n + 1)
            self.power2 = [1] * (self.n + 1)
            
            for i in range(self.n):
                self.prefix2[i + 1] = (
                    self.prefix2[i] * self.BASE2 + ord(s[i])
                ) % self.MOD2
                self.power2[i + 1] = (
                    self.power2[i] * self.BASE2
                ) % self.MOD2
    
    def get_hash(self, left: int, right: int) -> int | tuple[int, int]:
        """
        Get hash of substring s[left:right+1] (inclusive).
        
        Returns int if single hash, tuple if double hash.
        Time: O(1)
        """
        length = right - left + 1
        
        h1 = (
            self.prefix1[right + 1] 
            - self.prefix1[left] * self.power1[length]
        ) % self.MOD1
        if h1 < 0:
            h1 += self.MOD1
        
        if not self.double_hash:
            return h1
        
        h2 = (
            self.prefix2[right + 1] 
            - self.prefix2[left] * self.power2[length]
        ) % self.MOD2
        if h2 < 0:
            h2 += self.MOD2
        
        return (h1, h2)
    
    def compare(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """Compare two substrings by hash. O(1) time."""
        if r1 - l1 != r2 - l2:
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)
 
 
# Demonstration
def demo():
    # Single hash (standard)
    text = "abracadabra"
    hasher = PolynomialHash(text)
    
    print(f"String: '{text}'")
    print(f"\nSingle hash mode:")
    print(f"  hash(0,3) 'abra': {hasher.get_hash(0, 3)}")
    print(f"  hash(7,10) 'abra': {hasher.get_hash(7, 10)}")
    print(f"  Equal? {hasher.compare(0, 3, 7, 10)}")  # True
    
    # Double hash (when correctness matters)
    hasher2 = PolynomialHash(text, double_hash=True)
    print(f"\nDouble hash mode:")
    print(f"  hash(0,3): {hasher2.get_hash(0, 3)}")
    print(f"  hash(7,10): {hasher2.get_hash(7, 10)}")
    print(f"  Equal? {hasher2.compare(0, 3, 7, 10)}")  # True
 
 
demo()

Summary

Parameter selection for polynomial hashes might seem like a minor implementation detail, but it's the difference between a robust algorithm and one that fails at scale or succumbs to attack.

Key Takeaways

•Base must exceed alphabet size: Prevents trivial collisions where different strings accidentally create the same polynomial.
•Modulus should be large and prime: Maximizes hash space, ensures uniform distribution, and enables proper mathematical properties.
•Prevent integer overflow: Analyze your max intermediate value; use 128-bit arithmetic or Mersenne primes for large moduli.
•Consider adversarial resistance: Randomize parameters in production; use cryptographic hashes for security-critical applications.
•Standard choices exist for good reason: base=31, mod=10^9+7 is battle-tested; double hashing provides extra safety.
•Verify when possible: Hash matches can be false positives; confirm with string comparison for critical correctness.

What's Next:

Even with optimal parameters, collisions will eventually occur—especially when comparing many strings. The next page covers strategies for handling collisions: detection, double hashing as a systematic approach, and verification techniques.

Page Complete

You now understand how to choose hash parameters that are mathematically sound, computationally safe, and resistant to adversarial manipulation. This knowledge is essential for implementing reliable string hashing in any environment.