If hash indexing is the vehicle that delivers O(1) lookups, then the hash function is its engine. Every characteristic of a hash index—its speed, its space efficiency, its resistance to performance degradation—ultimately traces back to the quality of its hash function.
A hash function that produces poor distribution transforms a theoretically elegant O(1) system into a practical O(n) disaster. Conversely, a well-designed hash function can make the difference between a database that struggles under load and one that handles millions of queries per second with ease.
This page examines hash functions with the depth and rigor they deserve: mathematical foundations, quality metrics, specific algorithms, and the practical wisdom that separates production-ready implementations from textbook exercises.
By the end of this page, you will understand the mathematical properties that define a quality hash function, know the specific algorithms used in major database systems, and be able to evaluate and select hash functions for different data distributions and workloads.
Formally, a hash function for database indexing is a mapping:
h: K → {0, 1, 2, ..., M-1}
Where:
- K is the set of all possible key values (the key space)
- M is the number of buckets in the index
- h(k) gives the bucket number assigned to key k
The function maps an arbitrarily large key space to a fixed, manageable bucket space. This mapping is inherently many-to-one: multiple keys must map to the same bucket (unless |K| ≤ M, which is rarely true in practice).
The ideal hash function (theoretical):
If we could achieve a perfect hash function—one that maps each key to a unique bucket with no collisions—hash indexing would be optimal. Perfect hashing is possible when:
- The complete key set is known in advance and never changes
- There is enough storage to dedicate a bucket per key (or to hold the constructed mapping)
- We can afford the upfront cost of constructing a specialized function for that key set
In practice, database indexes rarely meet these conditions. Keys are dynamic, storage is limited, and the hash function must be computed quickly. We therefore aim for the next best thing: a uniform hash function.
An ideal hash function for database indexing distributes keys uniformly across buckets. If we insert n keys into M buckets, each bucket should contain approximately n/M keys. Deviations from this ideal—some buckets with many keys, others with few—directly degrade performance.
The birthday paradox and collision probability:
How many keys can we insert before collisions become likely? This connects to the famous birthday problem. With M buckets:
- The probability of at least one collision exceeds 50% after only about √(2M ln 2) ≈ 1.18√M insertions
- For M = 1,000,000 buckets, that is roughly 1,200 keys—far fewer than the number of rows a real index holds
The implication: collisions are not edge cases—they're the normal operating condition. Hash index design must embrace collisions rather than try to eliminate them.
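The birthday bound is easy to verify numerically. The following minimal sketch (illustrative only) computes the exact probability that at least two of n keys collide when thrown uniformly into M buckets:

```python
import math

def collision_probability(n: int, num_buckets: int) -> float:
    """Probability that at least two of n uniformly hashed keys share a bucket."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision

M = 1_000_000  # one million buckets
for n in (100, 500, 1_000, 1_178, 2_000, 5_000):
    print(f"n = {n:>5}: P(collision) = {collision_probability(n, M):.3f}")

# Collisions become more likely than not around n ≈ sqrt(2 * M * ln 2) ≈ 1.18 * sqrt(M)
print(f"50% threshold ≈ {math.sqrt(2 * M * math.log(2)):.0f} keys")
```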
Expected chain length:
With n keys hashed into M buckets using a uniform hash function:
Expected keys per bucket (load factor) = λ = n/M
The bucket sizes follow approximately a Poisson distribution with mean λ:
P(bucket has k keys) ≈ (e^(-λ) × λ^k) / k!
For λ = 1 (one key per bucket on average):
- P(0 keys) = e⁻¹ ≈ 36.8% of buckets are empty
- P(1 key) ≈ 36.8% hold exactly one key
- P(2 keys) ≈ 18.4%
- P(3 keys) ≈ 6.1%
- P(4 or more keys) ≈ 1.9%
This variance is acceptable—even with uneven distribution, average lookup cost remains O(1) as long as no bucket grows exceptionally large.
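To see the Poisson approximation in action, the sketch below (an illustration, not part of any database engine) compares the predicted bucket-size probabilities against a simulated uniform hash, modeled here by assigning each key to a random bucket:

```python
import math
import random
from collections import Counter

def poisson_pmf(k: int, lam: float) -> float:
    """P(a bucket holds exactly k keys) under the Poisson approximation."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

M = 100_000   # buckets
n = 100_000   # keys -> load factor lambda = 1.0
lam = n / M

# Simulate a perfectly uniform hash: each key lands in a random bucket
bucket_of = Counter(random.randrange(M) for _ in range(n))
size_histogram = Counter(bucket_of.values())
size_histogram[0] = M - len(bucket_of)  # buckets that received no keys

print(f"{'k':>3} {'Poisson':>10} {'Simulated':>10}")
for k in range(5):
    print(f"{k:>3} {poisson_pmf(k, lam):>10.4f} {size_histogram.get(k, 0) / M:>10.4f}")
```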
Not all hash functions are created equal. We evaluate hash functions along several dimensions, each critical for database performance:
Measuring uniformity: Chi-squared test
The chi-squared statistic measures how much observed bucket sizes deviate from expected sizes:
χ² = Σ (observed_i - expected)² / expected
Where expected = n/M for each bucket.
For a good hash function with n keys and M buckets, the ratio χ²/M should be close to 1.0; ratios well above that (say, greater than 2) indicate the distribution deviates significantly from uniform.
Avalanche property:
A strong hash function should satisfy the avalanche property: flipping a single bit of the input changes each output bit with probability close to 50%, so roughly half of the output bits flip on average.
Example of poor avalanche:
h(1000) = 47
h(1001) = 48 // Only 1 bit difference → similar hash
h(1002) = 49 // Sequential pattern → terrible for distribution
Example of good avalanche:
h(1000) = 47328
h(1001) = 91847 // Completely different
h(1002) = 62109 // No pattern
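The avalanche percentages in the table below can be measured directly: flip one random input bit and count how many output bits change. This sketch contrasts a no-mixing hash with a MurmurHash3-style 32-bit finalizer (constants from the published algorithm); both functions are illustrative stand-ins, not any particular database's implementation:

```python
import random

def poor_hash(x: int) -> int:
    """No mixing at all: the key itself (mod 2^32)."""
    return x & 0xFFFFFFFF

def mixed_hash(x: int) -> int:
    """MurmurHash3-style finalizer: multiply and shift so every input bit affects every output bit."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85ebca6b) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xc2b2ae35) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def avalanche_score(hash_fn, trials: int = 10_000) -> float:
    """Average fraction of the 32 output bits that flip when one input bit flips."""
    flipped_bits = 0
    for _ in range(trials):
        key = random.getrandbits(32)
        bit = 1 << random.randrange(32)
        flipped_bits += bin(hash_fn(key) ^ hash_fn(key ^ bit)).count("1")
    return flipped_bits / (trials * 32)

print(f"poor_hash avalanche:  {avalanche_score(poor_hash):.1%}")   # ~3% (only the flipped bit)
print(f"mixed_hash avalanche: {avalanche_score(mixed_hash):.1%}")  # ~50%
```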
| Property | Poor Hash Function | Good Hash Function |
|---|---|---|
| bucket_size_variance | High (some buckets 100x larger than others) | Low (all buckets within 2-3x of mean) |
| chi_squared / M ratio | > 2.0 (significant deviation) | ≈ 1.0 (close to expected) |
| avalanche (% bits changed) | < 30% for single bit flip | ≈ 45–55% for single bit flip |
| speed (ns per hash) | Highly variable | Consistent, predictable |
| pattern sensitivity | Sequential keys cluster | Sequential keys scatter |
Theoretical analysis assumes random key distributions, but real data has patterns: sequential IDs, timestamps, names starting with common letters, zip codes clustered by geography. A hash function must break these patterns to achieve uniform distribution in practice, not just in theory.
The simplest and most widely used hash function is the division method:
h(key) = key mod M
Where M is the number of buckets. Its simplicity is both its strength and its weakness.
Advantages:
- Extremely fast: a single modulo (or division) instruction
- Trivial to implement and easy to reason about
- Works directly on integer keys with no preprocessing
Critical issue: Choice of M
The choice of M dramatically affects distribution quality:
Bad choice: M = power of 2 (e.g., 1024)
h(key) = key mod 1024
This uses only the low 10 bits of the key. If keys share common low bits (e.g., aligned memory addresses, IDs divisible by 4), distribution suffers badly.
Bad choice: M = 10^k (e.g., 10000). The hash then depends only on the last 4 decimal digits, so keys that differ only in higher digits (11234, 21234, 31234) all collide, and any decimal pattern in the data passes straight through to the buckets.
Good choice: M = prime number not close to power of 2
```python
import math
from collections import Counter

def analyze_hash_distribution(keys: list, num_buckets: int) -> dict:
    """
    Analyze the distribution quality of modular hashing.
    Returns statistics about bucket distribution.
    """
    # Hash all keys
    buckets = [key % num_buckets for key in keys]
    bucket_counts = Counter(buckets)

    # Fill in empty buckets with 0
    for i in range(num_buckets):
        if i not in bucket_counts:
            bucket_counts[i] = 0

    counts = list(bucket_counts.values())
    n = len(keys)
    expected = n / num_buckets

    # Calculate statistics
    chi_squared = sum((c - expected) ** 2 / expected for c in counts)
    max_bucket = max(counts)
    min_bucket = min(counts)
    variance = sum((c - expected) ** 2 for c in counts) / num_buckets

    return {
        "total_keys": n,
        "num_buckets": num_buckets,
        "expected_per_bucket": expected,
        "chi_squared": chi_squared,
        "chi_squared_ratio": chi_squared / num_buckets,  # Should be ~1.0
        "max_bucket_size": max_bucket,
        "min_bucket_size": min_bucket,
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }

# Test with different M values
keys = list(range(0, 100000, 4))  # IDs divisible by 4 (common pattern)
n_keys = len(keys)

print(f"Testing with {n_keys} keys (all divisible by 4)")
print("=" * 60)

# Bad: Power of 2
result = analyze_hash_distribution(keys, 1024)
print(f"\nM = 1024 (power of 2):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f} (should be ~1.0)")
print(f"  Max bucket: {result['max_bucket_size']} (expected: {result['expected_per_bucket']:.1f})")
print(f"  Most buckets have: 0 keys")  # Only 256 buckets used!

# Bad: Power of 10
result = analyze_hash_distribution(keys, 10000)
print(f"\nM = 10000 (power of 10):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f}")
print(f"  Max bucket: {result['max_bucket_size']}")

# Good: Prime number
result = analyze_hash_distribution(keys, 10007)
print(f"\nM = 10007 (prime):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f}")
print(f"  Max bucket: {result['max_bucket_size']}")
print(f"  Min bucket: {result['min_bucket_size']}")
print(f"  Std dev: {result['std_dev']:.2f}")

# Output demonstrates how prime M handles patterned keys better
```

When choosing a prime M, pick one not close to powers of 2. For example, 1021 is prime but close to 1024 (2^10). Better choices: 997, 1009, 1013. For larger tables: 99991, 100003, 999983. Many databases pre-compute tables of useful primes.
The multiplication method addresses the division method's sensitivity to bucket count:
h(key) = floor(M × (key × A mod 1))
Where:
- A is a constant with 0 < A < 1
- key × A mod 1 extracts the fractional part of key × A
- M is the number of buckets

The beauty of this method: the choice of M doesn't significantly affect distribution quality. Any M works reasonably well, including powers of 2 (which allow efficient implementation).
The golden ratio connection:
Donald Knuth's analysis shows the optimal A is:
A = (√5 - 1) / 2 ≈ 0.6180339887...
This is the reciprocal of the golden ratio (φ⁻¹ ≈ 0.618). Using this value:
- Sequential keys (k, k+1, k+2, ...) are spread nearly as far apart as possible across the buckets
- No small group of buckets is systematically favored, even for highly patterned keys
This choice is often called Fibonacci hashing.
```python
import math

# The golden ratio constant
GOLDEN_RATIO_INVERSE = (math.sqrt(5) - 1) / 2  # ≈ 0.6180339887

def multiplicative_hash(key: int, num_buckets: int) -> int:
    """
    Multiplicative hash using the golden ratio.
    Works well with any num_buckets, including powers of 2.
    """
    # Multiply key by golden ratio, take fractional part
    product = key * GOLDEN_RATIO_INVERSE
    fractional_part = product - math.floor(product)  # Equivalent to: product % 1

    # Scale to bucket range and floor
    return int(num_buckets * fractional_part)

def multiplicative_hash_fast(key: int, num_buckets: int, num_bits: int = 32) -> int:
    """
    Fast integer-only implementation for power-of-2 num_buckets.
    Uses bit manipulation instead of floating point.

    For 32-bit words, A ≈ 2654435769 / 2^32 ≈ 0.618...
    For 64-bit words, A ≈ 11400714819323198485 / 2^64
    """
    if num_bits == 32:
        A_SCALED = 2654435769  # floor(2^32 * (sqrt(5)-1)/2)
        word_size = 32
    else:
        A_SCALED = 11400714819323198485
        word_size = 64

    # Multiply (overflow is intentional and handled by unsigned arithmetic)
    product = (key * A_SCALED) & ((1 << word_size) - 1)

    # Extract top log2(num_buckets) bits
    p = int(math.log2(num_buckets))
    return product >> (word_size - p)

# Demonstrate distribution quality
print("Multiplicative Hash Distribution Test")
print("=" * 50)

keys = list(range(0, 1000))  # Sequential keys
num_buckets = 16  # Power of 2 (would be bad for division method)

# Count bucket usage
from collections import Counter
buckets_division = Counter(k % num_buckets for k in keys)
buckets_multiplicative = Counter(multiplicative_hash(k, num_buckets) for k in keys)

print(f"\nDivision method (mod {num_buckets}):")
print(f"  Bucket distribution: {dict(sorted(buckets_division.items()))}")
print(f"  Max bucket: {max(buckets_division.values())}")
print(f"  Min bucket: {min(buckets_division.values())}")

print(f"\nMultiplicative method (M={num_buckets}):")
print(f"  Bucket distribution: {dict(sorted(buckets_multiplicative.items()))}")
print(f"  Max bucket: {max(buckets_multiplicative.values())}")
print(f"  Min bucket: {min(buckets_multiplicative.values())}")

# The multiplicative method handles sequential keys and power-of-2 M much better
```

Practical implementation notes:
- Floating-point vs. integer implementation: the floating-point form above is the textbook version; production code uses the integer form to avoid rounding issues and gain speed.
- Scaling A for integer arithmetic: multiply the key by A scaled to the word size (2654435769 ≈ 2³² × 0.618 for 32-bit words, 11400714819323198485 for 64-bit) and let the multiplication wrap around.
- Power-of-2 bucket counts: with M = 2ᵖ, the bucket number is simply the top p bits of the product, extracted with a single shift.
Despite the multiplication method's theoretical advantages, many databases use division with carefully chosen primes. Reasons: simplicity, historical precedent, debuggability (bucket = key mod M is intuitive), and the fact that database keys often aren't as patterned as worst-case analysis suggests.
String keys (names, emails, product codes, UUIDs) require special handling. The challenge: convert a variable-length sequence of characters into a fixed-size integer suitable for bucket addressing.
Naive approaches and their failures:
| Approach | Formula | Problem |
|---|---|---|
| Sum of ASCII values | h = Σ ord(c) | Anagrams collide: 'abc' = 'bca' = 'cab' |
| First N characters | h = hash(s[:N]) | Ignores suffix: 'Smith1' = 'Smith2' |
| Last N characters | h = hash(s[-N:]) | Common suffixes cluster: '.com', '.edu' |
| XOR of all bytes | h = c₁ XOR c₂ XOR ... | Order-insensitive, many collisions |
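These failure modes are easy to reproduce. A short sketch (for illustration) shows the sum-of-ASCII approach sending anagrams to the same bucket, while the polynomial hash described next keeps them apart:

```python
def ascii_sum_hash(s: str, num_buckets: int) -> int:
    """Naive: order-insensitive sum of character codes."""
    return sum(ord(c) for c in s) % num_buckets

def poly_hash(s: str, num_buckets: int, base: int = 31) -> int:
    """Polynomial hash: position-sensitive (see the next subsection)."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % num_buckets
    return h

anagrams = ["abc", "bca", "cab"]
print([ascii_sum_hash(s, 1000) for s in anagrams])  # [294, 294, 294] -- all collide
print([poly_hash(s, 1000) for s in anagrams])       # distinct buckets
```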
The polynomial (rolling) hash:
The standard approach treats the string as a polynomial in some base:
h(s) = (s[0]×aⁿ⁻¹ + s[1]×aⁿ⁻² + ... + s[n-1]×a⁰) mod M
This is equivalent to:
h(s) = Σᵢ s[i] × aⁿ⁻¹⁻ⁱ mod M
Where:
- s[i] is the character code of the i-th character
- a is the base (a small prime, commonly 31 or 131)
- n is the length of the string
- M is the number of buckets
Efficient computation using Horner's method:
h = 0
for each character c in string:
h = (h × a + ord(c)) mod M
This computes the polynomial in O(n) time with constant space.
```python
def polynomial_hash(s: str, num_buckets: int, base: int = 31) -> int:
    """
    Standard polynomial hash using Horner's method.
    Base 31 is common (prime, works well for ASCII).
    """
    h = 0
    for char in s:
        h = (h * base + ord(char)) % num_buckets
    return h

def djb2_hash(s: str, num_buckets: int) -> int:
    """
    Dan Bernstein's djb2 hash function.
    Widely used, good distribution, simple.
    The magic number 5381 has good properties.
    """
    h = 5381
    for char in s:
        # h * 33 + c, using bit shift for speed: h*33 = h*32 + h = (h<<5) + h
        h = ((h << 5) + h) + ord(char)
        h = h & 0xFFFFFFFF  # Keep as 32-bit
    return h % num_buckets

def fnv1a_hash(s: str, num_buckets: int) -> int:
    """
    FNV-1a hash function. Used in many production systems.
    XOR-then-multiply order provides better distribution than FNV-1.
    """
    FNV_OFFSET_BASIS = 2166136261
    FNV_PRIME = 16777619

    h = FNV_OFFSET_BASIS
    for char in s:
        h ^= ord(char)
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h % num_buckets

def sdbm_hash(s: str, num_buckets: int) -> int:
    """
    SDBM database hash function.
    Originally used in the SDBM database library.
    h = h * 65599 + c, implemented as: h * 65536 + h * 64 - h + c
    """
    h = 0
    for char in s:
        h = ord(char) + (h << 6) + (h << 16) - h
        h = h & 0xFFFFFFFF
    return h % num_buckets

# Compare hash functions on similar strings
test_strings = [
    "smith", "Smith", "SMITH",                     # Case variations
    "smith1", "smith2", "smith3",                  # Numeric suffixes
    "asmith", "bsmith", "csmith",                  # Alphabetic prefixes
    "john_smith", "john.smith", "john-smith",      # Separators
]

print("Hash Value Comparison for Similar Strings")
print("=" * 70)
print(f"{'String':<15} {'poly':<10} {'djb2':<12} {'fnv1a':<12} {'sdbm':<12}")
print("-" * 70)

NUM_BUCKETS = 1000

for s in test_strings:
    p = polynomial_hash(s, NUM_BUCKETS)
    d = djb2_hash(s, NUM_BUCKETS)
    f = fnv1a_hash(s, NUM_BUCKETS)
    sb = sdbm_hash(s, NUM_BUCKETS)
    print(f"{s:<15} {p:<10} {d:<12} {f:<12} {sb:<12}")

# All good hash functions produce well-scattered values for similar strings
```

The base 'a' should be a prime; 31 and 37 are traditional choices for simple ASCII text, while 131 (larger than the 128 ASCII values) suits the full character set. For Unicode, larger primes (for example, 1000003) are appropriate. The choice affects collision probability, but any reasonable prime gives good results.
Modern database systems use sophisticated hash functions that balance speed, distribution quality, and security considerations. Let's examine what major systems actually use.
MurmurHash: The modern workhorse
MurmurHash (created by Austin Appleby) is widely used in modern databases and distributed systems. It offers:
- Excellent distribution and avalanche behavior on real-world keys
- High speed: several gigabytes per second on modern CPUs
- 32-bit and 128-bit output variants
- A non-cryptographic design: fast, but not resistant to deliberate attack
MurmurHash3 (32-bit version) outline:
1. Process input in 4-byte chunks
2. Multiply each chunk by large primes
3. Rotate bits and XOR to mix
4. Final avalanche mixing to spread entropy
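To make the outline concrete, here is a minimal Python transcription of the MurmurHash3 x86 32-bit routine, following the published reference constants. It is a readability-first sketch; production systems call an optimized native implementation:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) -- straightforward transcription of the reference."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    rounded_end = length & ~0x3  # largest multiple of 4 <= length

    # Steps 1-3: process 4-byte chunks -- multiply by primes, rotate, XOR-mix into h
    for i in range(0, rounded_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF   # rotate left 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF   # rotate left 13
        h = (h * 5 + 0xe6546b64) & 0xFFFFFFFF

    # Handle the remaining 1-3 tail bytes
    k = 0
    tail = data[rounded_end:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Step 4: final avalanche mixing to spread entropy
    h ^= length
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

print(murmur3_32(b"hello") % 1024)   # map the 32-bit hash to a bucket as usual
```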
xxHash: Even faster
xxHash by Yann Collet prioritizes raw speed:
- Processes data at close to RAM bandwidth (see the ~15 GB/s figure in the table below)
- Offers 32-, 64-, and 128-bit variants (XXH32, XXH64, XXH3)
- Maintains excellent distribution quality despite the focus on throughput
CityHash / FarmHash (Google)
Optimized for modern CPUs:
- Exploit modern instruction sets and instruction-level parallelism
- Tuned especially for short string keys, the common case in practice
- FarmHash is the successor to CityHash and is used throughout Google's infrastructure
| Hash Function | Speed (GB/s) | Output Size | Quality | Use Case |
|---|---|---|---|---|
| Simple modulo | ~10 GB/s | Variable | Poor on patterns | Educational only |
| djb2 | ~3 GB/s | 32-bit | Good | Legacy systems |
| FNV-1a | ~2 GB/s | 32/64-bit | Good | General purpose |
| MurmurHash3 | ~4 GB/s | 32/128-bit | Excellent | Hash tables, bloom filters |
| xxHash | ~15 GB/s | 32/64/128-bit | Excellent | High-throughput systems |
| SipHash | ~2 GB/s | 64-bit | Excellent + secure | Hash DoS prevention |
| CityHash64 | ~10 GB/s | 64-bit | Excellent | Google systems |
Predictable hash functions can be exploited. An attacker who knows your hash function can craft inputs that all collide, turning O(1) lookups into O(n). This is called a 'hash flooding' or 'hash DoS' attack. Systems exposed to untrusted input (web servers, APIs) should use keyed hash functions like SipHash with random seeds.
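Python's standard library does not expose SipHash directly (although CPython uses it internally, with a per-process seed, for str hashing), but the keyed-hash idea can be sketched with hashlib.blake2b, which accepts a key parameter. The function and names below are illustrative:

```python
import hashlib
import os

# Per-process random key: without it, an attacker cannot predict bucket assignments
SECRET_KEY = os.urandom(16)

def keyed_bucket(key: str, num_buckets: int) -> int:
    """Keyed hash: the secret key changes every run, defeating precomputed collision sets."""
    digest = hashlib.blake2b(key.encode(), key=SECRET_KEY, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets

print(keyed_bucket("alice@example.com", 1024))
print(keyed_bucket("bob@example.com", 1024))  # values differ between process runs
```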
Many database lookups involve composite keys—multiple columns that together form the search key. Hashing composite keys requires combining individual component hashes into a single hash value.
The challenge:
Given a key (department_id, employee_id), we need:
- h(10, 100) ≠ h(100, 10) — order matters
- h(10, 100) ≠ h(10, 101) — small changes should produce different hashes

Naive (bad) approaches:
| Method | Formula | Issue |
|---|---|---|
| Simple XOR | h₁ XOR h₂ | (10,10) and (20,20) both produce 0 |
| Simple addition | h₁ + h₂ | (10,20) = (15,15) = (5,25) = ... |
| Concatenation | h(str(k₁) + str(k₂)) | Ambiguous without a separator ('12'+'3' = '1'+'23'), plus string-conversion overhead |
Recommended approaches:
1. Hash-combine pattern (used in Boost, many databases):
combined = h1
combined = combined XOR (h2 + 0x9e3779b9 + (combined << 6) + (combined >> 2))
The magic constant 0x9e3779b9 is derived from the golden ratio (2³² / φ).
2. Polynomial combination:
combined = a * h1 + h2
Where a is a large prime.
3. Sequential hashing (feeding one into the next):
combined = hash(hash(k1) + k2)
4. Tuple hashing (treat as byte sequence): Serialize all components into a byte buffer, hash the entire buffer.
```python
def hash_combine(seed: int, value: int) -> int:
    """
    Boost-style hash combine.
    Mixes 'value' into 'seed' in a way that respects ordering
    and produces good distribution.
    The magic number 0x9e3779b9 is 2^32 / golden_ratio.
    """
    MAGIC = 0x9e3779b9
    seed ^= value + MAGIC + (seed << 6) + (seed >> 2)
    return seed & 0xFFFFFFFF  # Keep as 32-bit

def hash_composite_key(*components, num_buckets: int) -> int:
    """
    Hash a composite key (multiple columns) into a single bucket.
    Each component can be int or string.
    """
    result = 0
    for component in components:
        if isinstance(component, str):
            # Hash string first
            component_hash = hash_string(component)
        else:
            component_hash = component & 0xFFFFFFFF
        result = hash_combine(result, component_hash)
    return result % num_buckets

def hash_string(s: str) -> int:
    """Simple polynomial hash for strings."""
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Test composite key hashing
print("Composite Key Hash Distribution")
print("=" * 50)

NUM_BUCKETS = 100

# Test that order matters
key1 = (10, 100)
key2 = (100, 10)
h1 = hash_composite_key(*key1, num_buckets=NUM_BUCKETS)
h2 = hash_composite_key(*key2, num_buckets=NUM_BUCKETS)
print(f"h(10, 100) = {h1}")
print(f"h(100, 10) = {h2}")
print(f"Order matters: {h1 != h2}")

# Test with strings
key3 = ("sales", 42, "2024-01-15")
key4 = ("sales", 43, "2024-01-15")
h3 = hash_composite_key(*key3, num_buckets=NUM_BUCKETS)
h4 = hash_composite_key(*key4, num_buckets=NUM_BUCKETS)
print(f"\nh('sales', 42, '2024-01-15') = {h3}")
print(f"h('sales', 43, '2024-01-15') = {h4}")
print(f"Different keys: {h3 != h4}")

# Verify distribution across many keys
from collections import Counter
buckets = Counter()
for dept in range(10):
    for emp in range(100):
        bucket = hash_composite_key(dept, emp, num_buckets=NUM_BUCKETS)
        buckets[bucket] += 1

print(f"\nDistribution of 1000 composite keys into 100 buckets:")
print(f"Min bucket size: {min(buckets.values())}")
print(f"Max bucket size: {max(buckets.values())}")
print(f"Expected (uniform): 10")
```

Unlike B+-tree indexes, hash indexes on composite keys cannot support partial-key lookups. If you hash (dept_id, emp_id) together, you cannot efficiently query 'all employees in department 10'. The key components lose their individual identity once combined. This is another reason B+-trees often win for composite keys.
The hash function is the heart of hash-based indexing. Choosing the right function—or evaluating the one your database uses—requires understanding both theory and practical trade-offs.
| Scenario | Recommended Approach |
|---|---|
| Integer primary keys, uniform distribution | Division with prime M |
| Integer keys, patterned or sequential | Multiplication method |
| String keys, trusted input | djb2, FNV-1a, or polynomial hash |
| String keys, untrusted input | SipHash with random seed |
| High-throughput systems | xxHash or MurmurHash3 |
| Composite keys | Hash-combine pattern with quality base hash |
| Database partitioning | Use database's built-in hash functions |
What's next:
With hash functions understood, the next page explores buckets: the storage containers where hashed entries reside. We'll examine bucket structure, sizing decisions, overflow handling, and how bucket design affects overall hash index performance.
You now understand hash functions at a deep level: mathematical foundations, quality metrics, specific algorithms for integers and strings, and practical selection guidelines. This knowledge enables you to evaluate hash index implementations and make informed decisions about hash function choice.