If hash indexing is the vehicle that delivers O(1) lookups, then the hash function is its engine. Every characteristic of a hash index—its speed, its space efficiency, its resistance to performance degradation—ultimately traces back to the quality of its hash function.
A hash function that produces poor distribution transforms a theoretically elegant O(1) system into a practical O(n) disaster. Conversely, a well-designed hash function can make the difference between a database that struggles under load and one that handles millions of queries per second with ease.
This page examines hash functions with the depth and rigor they deserve: mathematical foundations, quality metrics, specific algorithms, and the practical wisdom that separates production-ready implementations from textbook exercises.
By the end of this page, you will understand the mathematical properties that define a quality hash function, know the specific algorithms used in major database systems, and be able to evaluate and select hash functions for different data distributions and workloads.
Formally, a hash function for database indexing is a mapping:
h: K → {0, 1, 2, ..., M-1}
Where:
- K is the set of all possible key values (the key space)
- M is the number of buckets in the index
- h(k) gives the bucket number assigned to key k
The function maps an arbitrarily large key space to a fixed, manageable bucket space. This mapping is inherently many-to-one: multiple keys must map to the same bucket (unless |K| ≤ M, which is rarely true in practice).
The ideal hash function (theoretical):
If we could achieve a perfect hash function—one that maps each key to a unique bucket with no collisions—hash indexing would be optimal. Perfect hashing is possible when:
- The complete key set is known in advance and never changes
- There is enough storage to dedicate a bucket per key (or to hold the constructed mapping)
- We can afford the upfront cost of constructing a specialized function for that key set
In practice, database indexes rarely meet these conditions. Keys are dynamic, storage is limited, and the hash function must be computed quickly. We therefore aim for the next best thing: a uniform hash function.
An ideal hash function for database indexing distributes keys uniformly across buckets. If we insert n keys into M buckets, each bucket should contain approximately n/M keys. Deviations from this ideal—some buckets with many keys, others with few—directly degrade performance.
The birthday paradox and collision probability:
How many keys can we insert before collisions become likely? This connects to the famous birthday problem. With M buckets:
- The probability of at least one collision exceeds 50% after only about √(2M ln 2) ≈ 1.18√M insertions
- For M = 1,000,000 buckets, that is roughly 1,200 keys—far fewer than the number of rows a real index holds
The implication: collisions are not edge cases—they're the normal operating condition. Hash index design must embrace collisions rather than try to eliminate them.
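The birthday bound is easy to verify numerically. The following minimal sketch (illustrative only) computes the exact probability that at least two of n keys collide when thrown uniformly into M buckets:

```python
import math

def collision_probability(n: int, num_buckets: int) -> float:
    """Probability that at least two of n uniformly hashed keys share a bucket."""
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (num_buckets - i) / num_buckets
    return 1.0 - p_no_collision

M = 1_000_000  # one million buckets
for n in (100, 500, 1_000, 1_178, 2_000, 5_000):
    print(f"n = {n:>5}: P(collision) = {collision_probability(n, M):.3f}")

# Collisions become more likely than not around n ≈ sqrt(2 * M * ln 2) ≈ 1.18 * sqrt(M)
print(f"50% threshold ≈ {math.sqrt(2 * M * math.log(2)):.0f} keys")
```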
Expected chain length:
With n keys hashed into M buckets using a uniform hash function:
Expected keys per bucket (load factor) = λ = n/M
The bucket sizes follow approximately a Poisson distribution with mean λ:
P(bucket has k keys) ≈ (e^(-λ) × λ^k) / k!
For λ = 1 (one key per bucket on average):
- P(0 keys) = e⁻¹ ≈ 36.8% of buckets are empty
- P(1 key) ≈ 36.8% hold exactly one key
- P(2 keys) ≈ 18.4%
- P(3 keys) ≈ 6.1%
- P(4 or more keys) ≈ 1.9%
This variance is acceptable—even with uneven distribution, average lookup cost remains O(1) as long as no bucket grows exceptionally large.
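To see the Poisson approximation in action, the sketch below (an illustration, not part of any database engine) compares the predicted bucket-size probabilities against a simulated uniform hash, modeled here by assigning each key to a random bucket:

```python
import math
import random
from collections import Counter

def poisson_pmf(k: int, lam: float) -> float:
    """P(a bucket holds exactly k keys) under the Poisson approximation."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

M = 100_000   # buckets
n = 100_000   # keys -> load factor lambda = 1.0
lam = n / M

# Simulate a perfectly uniform hash: each key lands in a random bucket
bucket_of = Counter(random.randrange(M) for _ in range(n))
size_histogram = Counter(bucket_of.values())
size_histogram[0] = M - len(bucket_of)  # buckets that received no keys

print(f"{'k':>3} {'Poisson':>10} {'Simulated':>10}")
for k in range(5):
    print(f"{k:>3} {poisson_pmf(k, lam):>10.4f} {size_histogram.get(k, 0) / M:>10.4f}")
```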
Not all hash functions are created equal. We evaluate hash functions along several dimensions, each critical for database performance:
Measuring uniformity: Chi-squared test
The chi-squared statistic measures how much observed bucket sizes deviate from expected sizes:
χ² = Σ (observed_i - expected)² / expected
Where expected = n/M for each bucket.
For a good hash function with n keys and M buckets, the ratio χ²/M should be close to 1.0; ratios well above that (say, greater than 2) indicate the distribution deviates significantly from uniform.
Avalanche property:
A strong hash function should satisfy the avalanche property: flipping a single bit of the input changes each output bit with probability close to 50%, so roughly half of the output bits flip on average.
Example of poor avalanche:
h(1000) = 47
h(1001) = 48 // Only 1 bit difference → similar hash
h(1002) = 49 // Sequential pattern → terrible for distribution
Example of good avalanche:
h(1000) = 47328
h(1001) = 91847 // Completely different
h(1002) = 62109 // No pattern
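The avalanche percentages in the table below can be measured directly: flip one random input bit and count how many output bits change. This sketch contrasts a no-mixing hash with a MurmurHash3-style 32-bit finalizer (constants from the published algorithm); both functions are illustrative stand-ins, not any particular database's implementation:

```python
import random

def poor_hash(x: int) -> int:
    """No mixing at all: the key itself (mod 2^32)."""
    return x & 0xFFFFFFFF

def mixed_hash(x: int) -> int:
    """MurmurHash3-style finalizer: multiply and shift so every input bit affects every output bit."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85ebca6b) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xc2b2ae35) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def avalanche_score(hash_fn, trials: int = 10_000) -> float:
    """Average fraction of the 32 output bits that flip when one input bit flips."""
    flipped_bits = 0
    for _ in range(trials):
        key = random.getrandbits(32)
        bit = 1 << random.randrange(32)
        flipped_bits += bin(hash_fn(key) ^ hash_fn(key ^ bit)).count("1")
    return flipped_bits / (trials * 32)

print(f"poor_hash avalanche:  {avalanche_score(poor_hash):.1%}")   # ~3% (only the flipped bit)
print(f"mixed_hash avalanche: {avalanche_score(mixed_hash):.1%}")  # ~50%
```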
| Property | Poor Hash Function | Good Hash Function |
|---|---|---|
| bucket_size_variance | High (some buckets 100x larger than others) | Low (all buckets within 2-3x of mean) |
| chi_squared / M ratio | > 2.0 (significant deviation) | ≈ 1.0 (close to expected) |
| avalanche (% bits changed) | < 30% for single bit flip | ≈ 45–55% for single bit flip |
| speed (ns per hash) | Highly variable | Consistent, predictable |
| pattern sensitivity | Sequential keys cluster | Sequential keys scatter |
Theoretical analysis assumes random key distributions, but real data has patterns: sequential IDs, timestamps, names starting with common letters, zip codes clustered by geography. A hash function must break these patterns to achieve uniform distribution in practice, not just in theory.
The simplest and most widely used hash function is the division method:
h(key) = key mod M
Where M is the number of buckets. Its simplicity is both its strength and its weakness.
Advantages:
- Extremely fast: a single modulo (or division) instruction
- Trivial to implement and easy to reason about
- Works directly on integer keys with no preprocessing
Critical issue: Choice of M
The choice of M dramatically affects distribution quality:
Bad choice: M = power of 2 (e.g., 1024)
h(key) = key mod 1024
This uses only the low 10 bits of the key. If keys share common low bits (e.g., aligned memory addresses, IDs divisible by 4), distribution suffers badly.
Bad choice: M = 10^k (e.g., 10000). The hash then depends only on the last 4 decimal digits, so keys that differ only in higher digits (11234, 21234, 31234) all collide, and any decimal pattern in the data passes straight through to the buckets.
Good choice: M = prime number not close to power of 2
```python
import math
from collections import Counter

def analyze_hash_distribution(keys: list, num_buckets: int) -> dict:
    """
    Analyze the distribution quality of modular hashing.
    Returns statistics about bucket distribution.
    """
    # Hash all keys
    buckets = [key % num_buckets for key in keys]
    bucket_counts = Counter(buckets)

    # Fill in empty buckets with 0
    for i in range(num_buckets):
        if i not in bucket_counts:
            bucket_counts[i] = 0

    counts = list(bucket_counts.values())
    n = len(keys)
    expected = n / num_buckets

    # Calculate statistics
    chi_squared = sum((c - expected) ** 2 / expected for c in counts)
    max_bucket = max(counts)
    min_bucket = min(counts)
    variance = sum((c - expected) ** 2 for c in counts) / num_buckets

    return {
        "total_keys": n,
        "num_buckets": num_buckets,
        "expected_per_bucket": expected,
        "chi_squared": chi_squared,
        "chi_squared_ratio": chi_squared / num_buckets,  # Should be ~1.0
        "max_bucket_size": max_bucket,
        "min_bucket_size": min_bucket,
        "variance": variance,
        "std_dev": math.sqrt(variance),
    }

# Test with different M values
keys = list(range(0, 100000, 4))  # IDs divisible by 4 (common pattern)
n_keys = len(keys)

print(f"Testing with {n_keys} keys (all divisible by 4)")
print("=" * 60)

# Bad: Power of 2
result = analyze_hash_distribution(keys, 1024)
print(f"\nM = 1024 (power of 2):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f} (should be ~1.0)")
print(f"  Max bucket: {result['max_bucket_size']} (expected: {result['expected_per_bucket']:.1f})")
print(f"  Most buckets have: 0 keys")  # Only 256 buckets used!

# Bad: Power of 10
result = analyze_hash_distribution(keys, 10000)
print(f"\nM = 10000 (power of 10):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f}")
print(f"  Max bucket: {result['max_bucket_size']}")

# Good: Prime number
result = analyze_hash_distribution(keys, 10007)
print(f"\nM = 10007 (prime):")
print(f"  Chi-squared ratio: {result['chi_squared_ratio']:.2f}")
print(f"  Max bucket: {result['max_bucket_size']}")
print(f"  Min bucket: {result['min_bucket_size']}")
print(f"  Std dev: {result['std_dev']:.2f}")

# Output demonstrates how prime M handles patterned keys better
```

When choosing a prime M, pick one not close to powers of 2. For example, 1021 is prime but close to 1024 (2^10). Better choices: 997, 1009, 1013. For larger tables: 99991, 100003, 999983. Many databases pre-compute tables of useful primes.
The multiplication method addresses the division method's sensitivity to bucket count:
h(key) = floor(M × (key × A mod 1))
Where:
- A is a constant with 0 < A < 1
- key × A mod 1 extracts the fractional part of key × A
- M is the number of buckets

The beauty of this method: the choice of M doesn't significantly affect distribution quality. Any M works reasonably well, including powers of 2 (which allow efficient implementation).
The golden ratio connection:
Donald Knuth's analysis shows the optimal A is:
A = (√5 - 1) / 2 ≈ 0.6180339887...
This is the reciprocal of the golden ratio (φ⁻¹ ≈ 0.618). Using this value:
- Sequential keys (k, k+1, k+2, ...) are spread nearly as far apart as possible across the buckets
- No small group of buckets is systematically favored, even for highly patterned keys
This choice is often called Fibonacci hashing.
```python
import math

# The golden ratio constant
GOLDEN_RATIO_INVERSE = (math.sqrt(5) - 1) / 2  # ≈ 0.6180339887

def multiplicative_hash(key: int, num_buckets: int) -> int:
    """
    Multiplicative hash using the golden ratio.
    Works well with any num_buckets, including powers of 2.
    """
    # Multiply key by golden ratio, take fractional part
    product = key * GOLDEN_RATIO_INVERSE
    fractional_part = product - math.floor(product)  # Equivalent to: product % 1

    # Scale to bucket range and floor
    return int(num_buckets * fractional_part)

def multiplicative_hash_fast(key: int, num_buckets: int, num_bits: int = 32) -> int:
    """
    Fast integer-only implementation for power-of-2 num_buckets.
    Uses bit manipulation instead of floating point.

    For 32-bit words, A ≈ 2654435769 / 2^32 ≈ 0.618...
    For 64-bit words, A ≈ 11400714819323198485 / 2^64
    """
    if num_bits == 32:
        A_SCALED = 2654435769  # floor(2^32 * (sqrt(5)-1)/2)
        word_size = 32
    else:
        A_SCALED = 11400714819323198485
        word_size = 64

    # Multiply (overflow is intentional and handled by unsigned arithmetic)
    product = (key * A_SCALED) & ((1 << word_size) - 1)

    # Extract top log2(num_buckets) bits
    p = int(math.log2(num_buckets))
    return product >> (word_size - p)

# Demonstrate distribution quality
print("Multiplicative Hash Distribution Test")
print("=" * 50)

keys = list(range(0, 1000))  # Sequential keys
num_buckets = 16  # Power of 2 (would be bad for division method)

# Count bucket usage
from collections import Counter
buckets_division = Counter(k % num_buckets for k in keys)
buckets_multiplicative = Counter(multiplicative_hash(k, num_buckets) for k in keys)

print(f"\nDivision method (mod {num_buckets}):")
print(f"  Bucket distribution: {dict(sorted(buckets_division.items()))}")
print(f"  Max bucket: {max(buckets_division.values())}")
print(f"  Min bucket: {min(buckets_division.values())}")

print(f"\nMultiplicative method (M={num_buckets}):")
print(f"  Bucket distribution: {dict(sorted(buckets_multiplicative.items()))}")
print(f"  Max bucket: {max(buckets_multiplicative.values())}")
print(f"  Min bucket: {min(buckets_multiplicative.values())}")

# The multiplicative method handles sequential keys and power-of-2 M much better
```

Practical implementation notes:
- Floating-point vs. integer implementation: the floating-point form above is the textbook version; production code uses the integer form to avoid rounding issues and gain speed.
- Scaling A for integer arithmetic: multiply the key by A scaled to the word size (2654435769 ≈ 2³² × 0.618 for 32-bit words, 11400714819323198485 for 64-bit) and let the multiplication wrap around.
- Power-of-2 bucket counts: with M = 2ᵖ, the bucket number is simply the top p bits of the product, extracted with a single shift.
Despite the multiplication method's theoretical advantages, many databases use division with carefully chosen primes. Reasons: simplicity, historical precedent, debuggability (bucket = key mod M is intuitive), and the fact that database keys often aren't as patterned as worst-case analysis suggests.
String keys (names, emails, product codes, UUIDs) require special handling. The challenge: convert a variable-length sequence of characters into a fixed-size integer suitable for bucket addressing.
Naive approaches and their failures:
| Approach | Formula | Problem |
|---|---|---|
| Sum of ASCII values | h = Σ ord(c) | Anagrams collide: 'abc' = 'bca' = 'cab' |
| First N characters | h = hash(s[:N]) | Ignores suffix: 'Smith1' = 'Smith2' |
| Last N characters | h = hash(s[-N:]) | Common suffixes cluster: '.com', '.edu' |
| XOR of all bytes | h = c₁ XOR c₂ XOR ... | Order-insensitive, many collisions |
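These failure modes are easy to reproduce. A short sketch (for illustration) shows the sum-of-ASCII approach sending anagrams to the same bucket, while the polynomial hash described next keeps them apart:

```python
def ascii_sum_hash(s: str, num_buckets: int) -> int:
    """Naive: order-insensitive sum of character codes."""
    return sum(ord(c) for c in s) % num_buckets

def poly_hash(s: str, num_buckets: int, base: int = 31) -> int:
    """Polynomial hash: position-sensitive (see the next subsection)."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % num_buckets
    return h

anagrams = ["abc", "bca", "cab"]
print([ascii_sum_hash(s, 1000) for s in anagrams])  # [294, 294, 294] -- all collide
print([poly_hash(s, 1000) for s in anagrams])       # distinct buckets
```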
The polynomial (rolling) hash:
The standard approach treats the string as a polynomial in some base:
h(s) = (s[0]×aⁿ⁻¹ + s[1]×aⁿ⁻² + ... + s[n-1]×a⁰) mod M
This is equivalent to:
h(s) = Σᵢ s[i] × aⁿ⁻¹⁻ⁱ mod M
Where:
- s[i] is the character code of the i-th character
- a is the base (a small prime, commonly 31 or 131)
- n is the length of the string
- M is the number of buckets
Efficient computation using Horner's method:
h = 0
for each character c in string:
h = (h × a + ord(c)) mod M
This computes the polynomial in O(n) time with constant space.
```python
def polynomial_hash(s: str, num_buckets: int, base: int = 31) -> int:
    """
    Standard polynomial hash using Horner's method.
    Base 31 is common (prime, works well for ASCII).
    """
    h = 0
    for char in s:
        h = (h * base + ord(char)) % num_buckets
    return h

def djb2_hash(s: str, num_buckets: int) -> int:
    """
    Dan Bernstein's djb2 hash function.
    Widely used, good distribution, simple.
    The magic number 5381 has good properties.
    """
    h = 5381
    for char in s:
        # h * 33 + c, using bit shift for speed: h*33 = h*32 + h = (h<<5) + h
        h = ((h << 5) + h) + ord(char)
        h = h & 0xFFFFFFFF  # Keep as 32-bit
    return h % num_buckets

def fnv1a_hash(s: str, num_buckets: int) -> int:
    """
    FNV-1a hash function. Used in many production systems.
    XOR-then-multiply order provides better distribution than FNV-1.
    """
    FNV_OFFSET_BASIS = 2166136261
    FNV_PRIME = 16777619

    h = FNV_OFFSET_BASIS
    for char in s:
        h ^= ord(char)
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h % num_buckets

def sdbm_hash(s: str, num_buckets: int) -> int:
    """
    SDBM database hash function.
    Originally used in the SDBM database library.
    h = h * 65599 + c, implemented as: h * 65536 + h * 64 - h + c
    """
    h = 0
    for char in s:
        h = ord(char) + (h << 6) + (h << 16) - h
        h = h & 0xFFFFFFFF
    return h % num_buckets

# Compare hash functions on similar strings
test_strings = [
    "smith", "Smith", "SMITH",                     # Case variations
    "smith1", "smith2", "smith3",                  # Numeric suffixes
    "asmith", "bsmith", "csmith",                  # Alphabetic prefixes
    "john_smith", "john.smith", "john-smith",      # Separators
]

print("Hash Value Comparison for Similar Strings")
print("=" * 70)
print(f"{'String':<15} {'poly':<10} {'djb2':<12} {'fnv1a':<12} {'sdbm':<12}")
print("-" * 70)

NUM_BUCKETS = 1000

for s in test_strings:
    p = polynomial_hash(s, NUM_BUCKETS)
    d = djb2_hash(s, NUM_BUCKETS)
    f = fnv1a_hash(s, NUM_BUCKETS)
    sb = sdbm_hash(s, NUM_BUCKETS)
    print(f"{s:<15} {p:<10} {d:<12} {f:<12} {sb:<12}")

# All good hash functions produce well-scattered values for similar strings
```

The base 'a' should be a prime; 31 and 37 are traditional choices for simple ASCII text, while 131 (larger than the 128 ASCII values) suits the full character set. For Unicode, larger primes (for example, 1000003) are appropriate. The choice affects collision probability, but any reasonable prime gives good results.
Modern database systems use sophisticated hash functions that balance speed, distribution quality, and security considerations. Let's examine what major systems actually use.
MurmurHash: The modern workhorse
MurmurHash (created by Austin Appleby) is widely used in modern databases and distributed systems. It offers:
- Excellent distribution and avalanche behavior on real-world keys
- High speed: several gigabytes per second on modern CPUs
- 32-bit and 128-bit output variants
- A non-cryptographic design: fast, but not resistant to deliberate attack
MurmurHash3 (32-bit version) outline:
1. Process input in 4-byte chunks
2. Multiply each chunk by large primes
3. Rotate bits and XOR to mix
4. Final avalanche mixing to spread entropy
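To make the outline concrete, here is a minimal Python transcription of the MurmurHash3 x86 32-bit routine, following the published reference constants. It is a readability-first sketch; production systems call an optimized native implementation:

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit) -- straightforward transcription of the reference."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed & 0xFFFFFFFF
    length = len(data)
    rounded_end = length & ~0x3  # largest multiple of 4 <= length

    # Steps 1-3: process 4-byte chunks -- multiply by primes, rotate, XOR-mix into h
    for i in range(0, rounded_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF   # rotate left 15
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF   # rotate left 13
        h = (h * 5 + 0xe6546b64) & 0xFFFFFFFF

    # Handle the remaining 1-3 tail bytes
    k = 0
    tail = data[rounded_end:]
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Step 4: final avalanche mixing to spread entropy
    h ^= length
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

print(murmur3_32(b"hello") % 1024)   # map the 32-bit hash to a bucket as usual
```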
xxHash: Even faster
xxHash by Yann Collet prioritizes raw speed:
- Processes data at close to RAM bandwidth (see the ~15 GB/s figure in the table below)
- Offers 32-, 64-, and 128-bit variants (XXH32, XXH64, XXH3)
- Maintains excellent distribution quality despite the focus on throughput
CityHash / FarmHash (Google)
Optimized for modern CPUs:
- Exploit modern instruction sets and instruction-level parallelism
- Tuned especially for short string keys, the common case in practice
- FarmHash is the successor to CityHash and is used throughout Google's infrastructure
| Hash Function | Speed (GB/s) | Output Size | Quality | Use Case |
|---|---|---|---|---|
| Simple modulo | ~10 GB/s | Variable | Poor on patterns | Educational only |
| djb2 | ~3 GB/s | 32-bit | Good | Legacy systems |
| FNV-1a | ~2 GB/s | 32/64-bit | Good | General purpose |
| MurmurHash3 | ~4 GB/s | 32/128-bit | Excellent | Hash tables, bloom filters |
| xxHash | ~15 GB/s | 32/64/128-bit | Excellent | High-throughput systems |
| SipHash | ~2 GB/s | 64-bit | Excellent + secure | Hash DoS prevention |
| CityHash64 | ~10 GB/s | 64-bit | Excellent | Google systems |
Predictable hash functions can be exploited. An attacker who knows your hash function can craft inputs that all collide, turning O(1) lookups into O(n). This is called a 'hash flooding' or 'hash DoS' attack. Systems exposed to untrusted input (web servers, APIs) should use keyed hash functions like SipHash with random seeds.
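Python's standard library does not expose SipHash directly (although CPython uses it internally, with a per-process seed, for str hashing), but the keyed-hash idea can be sketched with hashlib.blake2b, which accepts a key parameter. The function and names below are illustrative:

```python
import hashlib
import os

# Per-process random key: without it, an attacker cannot predict bucket assignments
SECRET_KEY = os.urandom(16)

def keyed_bucket(key: str, num_buckets: int) -> int:
    """Keyed hash: the secret key changes every run, defeating precomputed collision sets."""
    digest = hashlib.blake2b(key.encode(), key=SECRET_KEY, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets

print(keyed_bucket("alice@example.com", 1024))
print(keyed_bucket("bob@example.com", 1024))  # values differ between process runs
```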
Many database lookups involve composite keys—multiple columns that together form the search key. Hashing composite keys requires combining individual component hashes into a single hash value.
The challenge:
Given a key (department_id, employee_id), we need:
- h(10, 100) ≠ h(100, 10) — order matters
- h(10, 100) ≠ h(10, 101) — small changes should produce different hashes

Naive (bad) approaches:
| Method | Formula | Issue |
|---|---|---|
| Simple XOR | h₁ XOR h₂ | (10,10) and (20,20) both produce 0 |
| Simple addition | h₁ + h₂ | (10,20) = (15,15) = (5,25) = ... |
| Concatenation | h(str(k₁) + str(k₂)) | Ambiguous without a separator ('12'+'3' = '1'+'23'), plus string-conversion overhead |
Recommended approaches:
1. Hash-combine pattern (used in Boost, many databases):
combined = h1
combined = combined XOR (h2 + 0x9e3779b9 + (combined << 6) + (combined >> 2))
The magic constant 0x9e3779b9 is derived from the golden ratio (2³² / φ).
2. Polynomial combination:
combined = a * h1 + h2
Where a is a large prime.
3. Sequential hashing (feeding one into the next):
combined = hash(hash(k1) + k2)
4. Tuple hashing (treat as byte sequence): Serialize all components into a byte buffer, hash the entire buffer.
```python
def hash_combine(seed: int, value: int) -> int:
    """
    Boost-style hash combine.
    Mixes 'value' into 'seed' in a way that respects ordering
    and produces good distribution.
    The magic number 0x9e3779b9 is 2^32 / golden_ratio.
    """
    MAGIC = 0x9e3779b9
    seed ^= value + MAGIC + (seed << 6) + (seed >> 2)
    return seed & 0xFFFFFFFF  # Keep as 32-bit

def hash_composite_key(*components, num_buckets: int) -> int:
    """
    Hash a composite key (multiple columns) into a single bucket.
    Each component can be int or string.
    """
    result = 0
    for component in components:
        if isinstance(component, str):
            # Hash string first
            component_hash = hash_string(component)
        else:
            component_hash = component & 0xFFFFFFFF
        result = hash_combine(result, component_hash)
    return result % num_buckets

def hash_string(s: str) -> int:
    """Simple polynomial hash for strings."""
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Test composite key hashing
print("Composite Key Hash Distribution")
print("=" * 50)

NUM_BUCKETS = 100

# Test that order matters
key1 = (10, 100)
key2 = (100, 10)
h1 = hash_composite_key(*key1, num_buckets=NUM_BUCKETS)
h2 = hash_composite_key(*key2, num_buckets=NUM_BUCKETS)
print(f"h(10, 100) = {h1}")
print(f"h(100, 10) = {h2}")
print(f"Order matters: {h1 != h2}")

# Test with strings
key3 = ("sales", 42, "2024-01-15")
key4 = ("sales", 43, "2024-01-15")
h3 = hash_composite_key(*key3, num_buckets=NUM_BUCKETS)
h4 = hash_composite_key(*key4, num_buckets=NUM_BUCKETS)
print(f"\nh('sales', 42, '2024-01-15') = {h3}")
print(f"h('sales', 43, '2024-01-15') = {h4}")
print(f"Different keys: {h3 != h4}")

# Verify distribution across many keys
from collections import Counter
buckets = Counter()
for dept in range(10):
    for emp in range(100):
        bucket = hash_composite_key(dept, emp, num_buckets=NUM_BUCKETS)
        buckets[bucket] += 1

print(f"\nDistribution of 1000 composite keys into 100 buckets:")
print(f"Min bucket size: {min(buckets.values())}")
print(f"Max bucket size: {max(buckets.values())}")
print(f"Expected (uniform): 10")
```

Unlike B+-tree indexes, hash indexes on composite keys cannot support partial-key lookups. If you hash (dept_id, emp_id) together, you cannot efficiently query 'all employees in department 10'. The key components lose their individual identity once combined. This is another reason B+-trees often win for composite keys.
The hash function is the heart of hash-based indexing. Choosing the right function—or evaluating the one your database uses—requires understanding both theory and practical trade-offs.
| Scenario | Recommended Approach |
|---|---|
| Integer primary keys, uniform distribution | Division with prime M |
| Integer keys, patterned or sequential | Multiplication method |
| String keys, trusted input | djb2, FNV-1a, or polynomial hash |
| String keys, untrusted input | SipHash with random seed |
| High-throughput systems | xxHash or MurmurHash3 |
| Composite keys | Hash-combine pattern with quality base hash |
| Database partitioning | Use database's built-in hash functions |
What's next:
With hash functions understood, the next page explores buckets: the storage containers where hashed entries reside. We'll examine bucket structure, sizing decisions, overflow handling, and how bucket design affects overall hash index performance.
You now understand hash functions at a deep level: mathematical foundations, quality metrics, specific algorithms for integers and strings, and practical selection guidelines. This knowledge enables you to evaluate hash index implementations and make informed decisions about hash function choice.