In the previous page, we examined deliberate attacks that exploit hash table vulnerabilities. But there's an equally important source of hash table performance problems that requires no attacker at all: poorly designed hash functions.
A bad hash function doesn't announce itself with errors or crashes. Instead, it silently degrades performance—what should be O(1) operations gradually become O(n) as data accumulates in a few overloaded buckets. The system works, but slowly. Engineers blame the database, the network, or insufficient hardware, never suspecting that a simple mathematical function is the root cause.
This insidious failure mode has caused countless production incidents. Understanding what makes a hash function good (or bad) is fundamental to hash table mastery.
By the end of this page, you will be able to identify poor hash function choices, understand why they fail, recognize common anti-patterns, and evaluate hash functions for your specific use cases. You'll learn to see hash function quality as a critical design decision, not an afterthought.
Before we can identify poor hash functions, we need a clear understanding of what properties a good hash function should have. Recall the three fundamental properties from earlier in this chapter:
The Three Pillars of Hash Function Quality:

- Determinism: the same key must always produce the same hash value.
- Uniformity: hash values should spread keys evenly across all buckets.
- Efficiency: computing the hash must be fast relative to the work it saves.
The Critical Property: Uniformity
Of these three, uniformity is where most hash functions fail. Determinism is easy—just don't use random numbers. Efficiency is usually acceptable—even slow hash functions are faster than searching. But achieving good uniformity requires understanding your data and choosing (or designing) a hash function that spreads that specific type of data evenly.
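To see why uniformity is the hard property, consider a sketch of a function that satisfies the other two perfectly. The `const_hash` below is a deliberately bad illustration, not a pattern anyone ships:

```python
def const_hash(key):
    """Deterministic (same input, same output) and efficient (O(1)),
    but maximally non-uniform: every key lands in the same bucket."""
    return 42

# Determinism and efficiency hold trivially...
print(const_hash("alice") == const_hash("alice"))  # True
# ...but every key collides with every other key, so one chain holds everything:
print(const_hash("alice") == const_hash("bob"))    # True
```

Determinism and efficiency can be verified at a glance; uniformity can only be judged against the data you actually store.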
The Uniformity Challenge:
Real-world data is rarely uniform. Consider these common patterns:

- Sequential IDs: auto-incrementing keys such as 1000, 1001, 1002, ...
- Timestamps: values clustered around recent times and business hours
- Strings with shared prefixes: URLs, file paths, and zip codes from the same region
A hash function must transform these clustered, patterned inputs into uniformly distributed outputs. If it fails to break up input patterns, collisions cluster and performance degrades.
```text
INPUT DATA:  Sequential user IDs: 1000, 1001, 1002, 1003, ..., 1999
HASH TABLE:  10 buckets (indices 0-9)

IDEAL DISTRIBUTION (Good Hash Function):
═══════════════════════════════════════════════════════════════════

Bucket 0: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 1: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 2: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 3: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 4: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 5: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 6: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 7: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 8: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 9: ●●●●●●●●●●●●●●●●●●●● (100 elements)

Average chain length: 100
Lookup time: O(100) = manageable

TERRIBLE DISTRIBUTION (hash(x) = x % 10):
═══════════════════════════════════════════════════════════════════

Bucket 0: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
Bucket 1: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
Bucket 2: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
... (still even for this example, but see next example)

THE PROBLEM EMERGES with multiples of 10:
Input: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, ...

Bucket 0: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (ALL elements!)
Bucket 1: (empty)
Bucket 2: (empty)
...
Bucket 9: (empty)

ALL keys go to bucket 0!
Lookup time: O(n) — hash table becomes a linked list
```

Poor hash functions fail when input data has patterns that align with the hash function's behavior. The more structured your data, the more important it is to choose a hash function that thoroughly scrambles those patterns. Real-world data is almost always patterned.
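The worst case in the diagram takes only a few lines to reproduce. The `bucket_counts` helper below is a small utility written for this illustration:

```python
from collections import Counter

def bucket_counts(keys, hash_func, num_buckets):
    """Tally how many keys land in each bucket."""
    counts = Counter(hash_func(k) % num_buckets for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Multiples of 10 through the identity hash: everything hits bucket 0
keys = range(10, 1010, 10)  # 100 keys: 10, 20, ..., 1000
print(bucket_counts(keys, lambda x: x, 10))
# → [100, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

One chain holds all 100 keys; the other nine buckets never see a single element.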
Let's examine specific hash function designs that frequently cause problems in practice. These anti-patterns appear surprisingly often in production code.
Anti-Pattern 1: Using Modulo with Non-Prime Table Sizes
When hash table size shares common factors with key values, collisions cluster predictably:
```python
# BAD: Table size of 100 (highly composite: 100 = 2² × 5²)
def bad_hash(key):
    return key % 100

# Keys that are multiples of 10:
# 10 % 100 = 10
# 20 % 100 = 20
# ...
# 100 % 100 = 0
# 110 % 100 = 10  ← Collision with 10!
# 210 % 100 = 10  ← Another collision!

# Keys that are multiples of 50:
# 50 % 100 = 50
# 100 % 100 = 0
# 150 % 100 = 50  ← Collision!
# 200 % 100 = 0   ← Collision!

# Pattern: Any factor of 100 creates collision clusters

# BETTER: Use a prime table size
def better_hash(key):
    return key % 97  # 97 is prime

# With prime modulus, patterns in input data
# are less likely to align with table structure

# Keys that are multiples of 10:
# 10 % 97 = 10
# 20 % 97 = 20
# ...
# 100 % 97 = 3   ← Different bucket!
# 110 % 97 = 13  ← Different bucket!
# 210 % 97 = 16  ← Different bucket!

# Prime modulus "breaks up" regular patterns
```

Anti-Pattern 2: Ignoring Parts of the Key
Hash functions that only consider part of the key waste information and cause unnecessary collisions:
```python
# BAD: Only using first character
def terrible_hash(string_key):
    return ord(string_key[0]) if string_key else 0

# Results for common names:
# "Alice" → 65 (A)
# "Aaron" → 65 (A)  ← Collision!
# "Amy"   → 65 (A)  ← Collision!
# "Anna"  → 65 (A)  ← Collision!
# "Bob"   → 66 (B)
# "Bill"  → 66 (B)  ← Collision!

# With only 26 possible values, maximum 26 buckets are useful!

# BAD: Only using last N characters
def bad_suffix_hash(key):
    return hash(key[-3:])  # Only last 3 chars

# For file paths:
# "/var/log/app.log"     → hash("log")
# "/home/user/debug.log" → hash("log")  ← Collision!
# "/tmp/error.log"       → hash("log")  ← Collision!

# BAD: Truncating long keys
def truncated_hash(key):
    return hash(key[:10])  # Only first 10 chars

# For URLs:
# "https://api.example.com/users/123"  → hash("https://ap")
# "https://api.example.com/users/456"  → hash("https://ap")  ← Collision!
# "https://api.example.com/orders/789" → hash("https://ap")  ← Collision!
```

Anti-Pattern 3: Simple Additive Hashes
Just summing character values (or similar simple operations) creates many collision equivalence classes:
```python
# BAD: Summing character values (or byte values)
def additive_hash(s):
    return sum(ord(c) for c in s)

# Collisions are trivial to find:
# "ab" = 97 + 98 = 195
# "ba" = 98 + 97 = 195  ← Same hash!

# More examples:
# "abc" = 97 + 98 + 99 = 294
# "cab" = 99 + 97 + 98 = 294  ← Collision!
# "bac" = 98 + 97 + 99 = 294  ← Collision!
# Any permutation collides!

# Even worse - different strings:
# "ab" = 195
# "Z" + chr(195 - 90) = 195
# Can construct collisions easily

# This hash is ORDER-INSENSITIVE.
# That is catastrophic for strings, where order matters.

# The hash function treats "stop" and "pots" as identical!
# "stop" = 115 + 116 + 111 + 112 = 454
# "pots" = 112 + 111 + 116 + 115 = 454
# "spot" = 115 + 112 + 111 + 116 = 454
# "tops" = 116 + 111 + 112 + 115 = 454
```

Anti-Pattern 4: XOR Without Position Encoding
Using XOR alone has similar problems to addition—it's associative and commutative:
```python
# BAD: Simple XOR of all bytes
def xor_hash(s):
    result = 0
    for c in s:
        result ^= ord(c)
    return result

# XOR is commutative and associative:
# "ab" = ord('a') ^ ord('b') = 97 ^ 98 = 3
# "ba" = ord('b') ^ ord('a') = 98 ^ 97 = 3  ← Collision!

# Worse: XOR has self-cancellation
# x ^ x = 0
# "abba" = a ^ b ^ b ^ a = (a ^ a) ^ (b ^ b) = 0 ^ 0 = 0
# "cddc" = c ^ d ^ d ^ c = (c ^ c) ^ (d ^ d) = 0 ^ 0 = 0
# Any string with pairs of repeated chars hashes to 0!

# Even length matters:
# "ab" = 3
# "abab" = a ^ b ^ a ^ b = 0  ← Different string, very different hash!

# XOR is useful for COMBINING multiple independent hash values,
# but not for hashing sequential data directly
```

Hash functions using only addition, multiplication, or XOR without position-dependent weighting treat "abc" the same as "cba". For data where order matters (strings, sequences, coordinates), this is catastrophic. Good hash functions incorporate position information to distinguish otherwise-similar inputs.
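A minimal fix is to weight each character by its position. The polynomial hash below is a common sketch of this idea (base 31 is a conventional choice, not the only one):

```python
def polynomial_hash(s, base=31, mod=2**32):
    """Position-weighted hash: each character's contribution depends
    on where it appears, so permutations no longer collide."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

# Additive and XOR hashes treat these pairs as identical; this one does not:
print(polynomial_hash("stop") == polynomial_hash("pots"))  # False
print(polynomial_hash("ab") == polynomial_hash("ba"))      # False
```

Because each character is multiplied by a different power of the base, reordering the input changes the output.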
These anti-patterns aren't just theoretical—they've caused real problems in production systems. Let's examine documented cases where poor hash function choices led to serious issues.
Case Study 1: Database Key Clustering
A company stored customer records in a hash-partitioned distributed database. The partition key was the customer ID, an auto-incrementing integer. The partition function was simple: partition = customer_id % num_partitions.
The problem? They had 8 partitions (a power of 2). Customer IDs from batch imports came in multiples of 1000. Since 1000 = 8 × 125, all batch-imported customers landed on the same partition:
That single partition became overloaded while others sat nearly empty.
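The arithmetic is easy to check directly: since 1000 is a multiple of 8, every batch-imported ID maps to partition 0.

```python
# Batch imports assigned customer IDs in multiples of 1000
batch_ids = range(1000, 50000, 1000)

# partition = customer_id % num_partitions, with 8 partitions
partitions = {customer_id % 8 for customer_id in batch_ids}
print(partitions)  # → {0}: 1000 = 8 × 125, so every batch ID hits partition 0
```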
```text
EXPECTED: Uniform distribution across 8 partitions
═══════════════════════════════════════════════════════════════════

Partition 0: ████████████░░░ 12.5% of data
Partition 1: ████████████░░░ 12.5% of data
Partition 2: ████████████░░░ 12.5% of data
Partition 3: ████████████░░░ 12.5% of data
Partition 4: ████████████░░░ 12.5% of data
Partition 5: ████████████░░░ 12.5% of data
Partition 6: ████████████░░░ 12.5% of data
Partition 7: ████████████░░░ 12.5% of data

ACTUAL: Skewed distribution due to ID patterns
═══════════════════════════════════════════════════════════════════

Partition 0: ████████████████████████████████████████ 85% of data!
Partition 1: ██░░░░░░░░░░░░░ 2% of data
Partition 2: ██░░░░░░░░░░░░░ 2% of data
Partition 3: ██░░░░░░░░░░░░░ 2% of data
Partition 4: ██░░░░░░░░░░░░░ 2% of data
Partition 5: ██░░░░░░░░░░░░░ 2% of data
Partition 6: ██░░░░░░░░░░░░░ 2% of data
Partition 7: ██░░░░░░░░░░░░░ 3% of data

Result:
- Partition 0 server overloaded
- Query latencies 10x higher than expected
- Horizontal scaling ineffective (other partitions unused)
```

Case Study 2: Cache Key Collisions
A web application cached rendered pages using URL paths as keys. The cache used a custom hash function that only considered the first 8 characters of the URL path.
For a site structure like:
```text
/products/electronics/phones/iphone-15
/products/electronics/phones/samsung-s24
/products/electronics/phones/pixel-8
```

All URLs hashed to the same value because they shared the prefix /product. Cache hit rates plummeted as entries constantly evicted each other, and the "cache" provided no benefit.
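The failure is visible immediately if you look at what a first-8-characters hash actually sees (a quick sketch using string slicing):

```python
urls = [
    "/products/electronics/phones/iphone-15",
    "/products/electronics/phones/samsung-s24",
    "/products/electronics/phones/pixel-8",
]

# A hash that reads only the first 8 characters sees one distinct input
prefixes = {url[:8] for url in urls}
print(prefixes)  # → {'/product'}: all three pages collide on one cache slot
```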
Case Study 3: The Zip Code Catastrophe
A shipping logistics system stored package data in a hash table keyed by destination zip code. The hash function was zipcode % table_size.
Zip codes in the United States follow geographic patterns. Most zip codes in a region share prefixes. A shipping hub serving primarily the 9XXXX region (California) found that most packages clustered in a small number of buckets:
The distribution was nowhere near uniform.
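The clustering can be sketched numerically. Assuming a table of 16,384 buckets (a hypothetical size chosen for illustration), the dense 90000-96199 range maps into a narrow band of residues and leaves most of the table empty:

```python
# Zip codes served by a California hub (9XXXX region)
zips = range(90000, 96200)
table_size = 16384

used_buckets = {z % table_size for z in zips}
print(len(used_buckets), "of", table_size, "buckets used")
# → 6200 of 16384 buckets used; the rest sit empty
```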
| Scenario | Poor Hash Choice | Result | Better Approach |
|---|---|---|---|
| Sequential IDs | ID % table_size | Predictable bucket assignment | Multiply-shift hash or bit mixing |
| Timestamps | timestamp % table_size | Time-correlated clustering | Mix all bits with Murmur/xxHash |
| IP addresses | Last octet only | 256 possible values | Hash full IP as 32-bit integer |
| URLs | First N chars | Prefix overlap clustering | Full string hash (FNV, xxHash) |
| Geographic data | Coordinate truncation | Spatial clustering preserved | Space-filling curve or full-precision hash |
Notice the common thread: real-world data has structure and patterns. Hash functions fail when they preserve these patterns instead of destroying them. The solution is always to use a hash function that thoroughly mixes all bits of the input, regardless of input patterns.
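As a sketch of that principle, running a patterned key through a MurmurHash3-style finalizer before taking the modulus breaks up the alignment:

```python
def fmix32(h):
    """32-bit finalization mixer in the style of MurmurHash3."""
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

ids = range(1000, 81000, 1000)  # IDs in multiples of 1000, as in Case Study 1
print(len({i % 8 for i in ids}))          # 1: raw modulo uses a single bucket
print(len({fmix32(i) % 8 for i in ids}))  # multiple buckets once bits are mixed
```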
How do you know if your hash function is good for your data? Beyond code review and intuition, we can empirically measure distribution quality.
Metric 1: Chi-Squared Test for Uniformity
The chi-squared (χ²) test measures whether observed frequencies deviate significantly from expected uniform distribution:
```python
import scipy.stats as stats

def test_hash_uniformity(hash_func, test_keys, num_buckets):
    """
    Test whether hash_func distributes test_keys uniformly.

    Returns: chi-squared statistic, p-value, and bucket counts.
    Lower chi-squared and higher p-value = better uniformity.
    """
    # Count elements in each bucket
    bucket_counts = [0] * num_buckets
    for key in test_keys:
        bucket = hash_func(key) % num_buckets
        bucket_counts[bucket] += 1

    # Expected count per bucket (uniform distribution)
    expected = len(test_keys) / num_buckets

    # Chi-squared statistic
    chi_sq = sum((obs - expected)**2 / expected for obs in bucket_counts)

    # Degrees of freedom = num_buckets - 1
    df = num_buckets - 1

    # P-value: probability of seeing this or a more extreme result if truly uniform
    p_value = 1 - stats.chi2.cdf(chi_sq, df)

    return chi_sq, p_value, bucket_counts

# Example: Test bad vs good hash function.
# Note: Python's built-in hash() is the identity for small integers,
# so it would NOT fix integer patterns; use an explicit bit mixer instead.
def bad_hash(x):
    return x  # Identity - preserves patterns

def good_hash(x):
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF  # mix the bits
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF
    return x ^ (x >> 16)

# Test with IDs assigned in steps of 10 (a common batch-allocation pattern)
test_keys = list(range(1000, 100000, 10))
num_buckets = 100

bad_chi, bad_p, bad_counts = test_hash_uniformity(bad_hash, test_keys, num_buckets)
good_chi, good_p, good_counts = test_hash_uniformity(good_hash, test_keys, num_buckets)

print(f"Bad hash:  χ² = {bad_chi:.1f}, p-value = {bad_p:.6f}")
print(f"Good hash: χ² = {good_chi:.1f}, p-value = {good_p:.6f}")

# Interpretation:
# - χ² close to df (99) = good uniformity
# - χ² >> df = non-uniform (patterns detected)
# - p-value > 0.05 = consistent with uniform distribution
# - p-value < 0.05 = likely non-uniform
```

Metric 2: Coefficient of Variation
A simpler metric is the coefficient of variation (CV) of bucket sizes—the standard deviation divided by the mean:
```python
import statistics

def hash_distribution_cv(hash_func, test_keys, num_buckets):
    """
    Calculate coefficient of variation of bucket counts.

    CV = 0:      Perfect uniformity (all buckets have the same count)
    CV ≈ 1/√m:   Expected for random distribution (m = mean bucket count)
    CV >> 1/√m:  Poor distribution
    """
    bucket_counts = [0] * num_buckets
    for key in test_keys:
        bucket = hash_func(key) % num_buckets
        bucket_counts[bucket] += 1

    mean = statistics.mean(bucket_counts)
    stdev = statistics.stdev(bucket_counts)
    cv = stdev / mean if mean > 0 else float('inf')

    # Expected CV for a truly random distribution
    expected_cv = 1 / (len(test_keys) / num_buckets) ** 0.5

    return cv, expected_cv, max(bucket_counts), min(bucket_counts)

# Example output:
# Good hash: CV=0.12, Expected=0.10, Max=105, Min=85
# Bad hash:  CV=5.24, Expected=0.10, Max=900, Min=0
```

Metric 3: Avalanche Effect
A good hash function exhibits the "avalanche effect"—a single bit change in input causes approximately 50% of output bits to flip. This ensures that similar inputs produce dissimilar outputs:
```python
import random

def count_bit_differences(a, b, bit_width=32):
    """Count the number of differing bits between a and b."""
    xor = a ^ b
    count = 0
    for i in range(bit_width):
        if xor & (1 << i):
            count += 1
    return count

def test_avalanche(hash_func, num_tests=1000):
    """
    Test avalanche effect: changing 1 input bit should flip ~50% of output bits.

    Returns: average percentage of flipped bits for single-bit input changes.
    Ideal: 50%
    Poor: <30% or >70%
    """
    total_flips = 0
    bit_width = 32

    for _ in range(num_tests):
        # Random input
        x = random.randint(0, 2**32 - 1)
        h1 = hash_func(x) & 0xFFFFFFFF  # Ensure 32-bit

        # Flip one random bit
        bit_position = random.randint(0, 31)
        x_flipped = x ^ (1 << bit_position)
        h2 = hash_func(x_flipped) & 0xFFFFFFFF

        # Count bit differences
        flips = count_bit_differences(h1, h2, bit_width)
        total_flips += flips

    avg_flips = total_flips / num_tests
    avg_percent = (avg_flips / bit_width) * 100
    return avg_percent

# Typical results:
# Identity hash (h(x)=x):  ~3.1% (only the changed bit differs)
# Add constant (h(x)=x+C): ~3.1% (addition doesn't spread bits)
# Multiply hash:           ~25%  (better, but not great)
# xxHash/Murmur:           ~50%  (excellent avalanche)
```

The best test for a hash function is against your actual production data patterns. Generate representative test data (sequential IDs, timestamps, user-submitted strings, etc.) and measure distribution quality. A hash function that works well for random data may fail catastrophically for your specific input patterns.
With an understanding of what can go wrong, let's examine proven hash function families and when to use each.
For General-Purpose Hashing:
| Hash Function | Speed | Quality | Best For |
|---|---|---|---|
| xxHash (xxh32/xxh64) | Extremely fast | Excellent | General purpose, large data |
| MurmurHash3 | Very fast | Excellent | Hash tables, checksums |
| FNV-1a | Fast | Good | Short strings, simple implementation |
| CityHash | Very fast | Excellent | Strings, Google's choice |
| SipHash | Moderate | Excellent + secure | Hash tables with untrusted input |
| djb2 | Fast | Acceptable | Simple string hashing |
Implementation Examples:
```python
# djb2 - Simple but effective string hash
def djb2_hash(s):
    """
    Dan Bernstein's hash function.
    - Uses magic constant 5381 (empirically chosen)
    - Multiply by 33 (= 32 + 1 = (hash << 5) + hash)
    - Addition incorporates each character
    """
    hash_value = 5381
    for char in s:
        hash_value = ((hash_value << 5) + hash_value) + ord(char)
    return hash_value & 0xFFFFFFFF  # Keep 32 bits

# FNV-1a - Fowler-Noll-Vo hash (variant a)
def fnv1a_hash(data):
    """
    FNV-1a hash - XOR then multiply.
    - Excellent distribution
    - Simple implementation
    - Good for strings and binary data
    """
    FNV_PRIME = 0x01000193   # 16777619
    FNV_OFFSET = 0x811c9dc5  # 2166136261

    hash_value = FNV_OFFSET
    for byte in data.encode() if isinstance(data, str) else data:
        hash_value ^= byte
        hash_value = (hash_value * FNV_PRIME) & 0xFFFFFFFF
    return hash_value

# Knuth's multiplicative hash (for integers)
def mult_hash(key, table_size):
    """
    Knuth's multiplicative method for integers.
    A = 2654435769 ≈ 2^32 / φ (golden ratio)
    """
    A = 2654435769
    # Multiply mod 2^32, then take the high bits (they're better mixed)
    return (((key * A) & 0xFFFFFFFF) >> 16) % table_size

# MurmurHash3 (simplified finalization mixer)
def murmur3_fmix32(h):
    """
    MurmurHash3 finalizer - excellent bit mixing.
    Use this to improve poor hash values.
    """
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Usage: If you have a weak hash, strengthen it:
# good_hash = murmur3_fmix32(weak_hash(key))
```

Key Principles for Hash Function Selection:
- Prefer built-ins: Python's hash(), Java's hashCode(), etc. are well-tested and typically sufficient.

Modern programming languages have well-designed built-in hash functions. Unless you have specific requirements (cryptographic security, extreme speed, embedded systems), the default hash function is usually the right choice. Custom hash functions are a common source of subtle bugs.
What if you've inherited code with a poor hash function, or you're stuck with a hash that can't be changed (e.g., database partition keys)? There are techniques to improve distribution without replacing the hash function entirely.
Technique 1: Hash Function Chaining (Bit Mixing)
Apply a high-quality mixing function to the output of a poor hash:
```python
def improve_hash(poor_hash_value):
    """
    Apply bit mixing to improve distribution of a poor hash value.
    Uses MurmurHash3 finalization.
    """
    h = poor_hash_value & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Before: poor_hash(sequential_id) clusters badly
# After:  improve_hash(poor_hash(sequential_id)) is uniform

# Example with sequential IDs:
print("Poor hash distribution (ID % 16):")
for i in range(1000, 1016):
    print(f"  {i} → bucket {i % 16}")

print("Improved hash distribution:")
for i in range(1000, 1016):
    print(f"  {i} → bucket {improve_hash(i) % 16}")

# Output shows improved scattering across buckets
```

Technique 2: Salting
Combine a constant salt value with the key before hashing to change the collision pattern:
```python
# Without salt: clustering happens
def unsalted_hash(key, table_size):
    return hash(key) % table_size

# With salt: patterns are disrupted
SALT = 0x9E3779B9  # Golden ratio constant

def salted_hash(key, table_size):
    if isinstance(key, int):
        mixed = key ^ SALT
        mixed = (mixed * 0x85ebca6b) & 0xFFFFFFFF
    else:
        # For strings, prepend salt or combine differently
        mixed = hash(f"{SALT}{key}")
    return mixed % table_size

# For database sharding with problematic keys:
def shard_key(user_id, num_shards):
    """
    Improved sharding that avoids clustering from sequential IDs.
    """
    # Mix the user_id with a salt
    mixed = user_id ^ 0x9E3779B9
    mixed = (mixed * 2654435761) & 0xFFFFFFFF
    mixed ^= mixed >> 16
    return mixed % num_shards
```

Technique 3: Secondary Hash (Double Hashing)
For stubborn clustering, use a completely different hash function as a secondary check:
```python
class DoubleHashTable:
    """
    Use two independent hash functions to combat poor distribution.
    Primary hash determines bucket; secondary hash resolves collisions.
    """

    def __init__(self, size):
        self.size = size
        self.table = [None] * size

    def _hash1(self, key):
        """Primary hash - even if poor, secondary will help."""
        return hash(key) % self.size

    def _hash2(self, key):
        """Secondary hash - non-zero step; coprime with size when size is prime."""
        # Use a different algorithm for independence
        h = 5381
        for c in str(key):
            h = ((h << 5) + h) ^ ord(c)
        # Ensure a non-zero step size
        return 1 + (h % (self.size - 1))

    def insert(self, key, value):
        """Insert with double hashing probe sequence."""
        h1 = self._hash1(key)
        h2 = self._hash2(key)
        for i in range(self.size):
            idx = (h1 + i * h2) % self.size
            if self.table[idx] is None or self.table[idx][0] == key:
                self.table[idx] = (key, value)
                return
        raise Exception("Hash table full")
```

While these techniques can improve poor hash functions, they add complexity and computational overhead. It's always better to choose a good hash function from the start. Reserve these techniques for legacy systems where changing the hash function would require data migration or break compatibility.
When working with custom objects or composite keys, you'll need to design hash functions that combine multiple fields. This is where many mistakes happen.
The Combination Problem:
Given an object with fields (a, b, c), how do we combine their individual hash values into one hash for the object?
Bad Approach: Simple XOR
```python
# BAD: XOR combination
class BadPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash(self.x) ^ hash(self.y)  # Don't do this!

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Problems:
p1 = BadPoint(3, 5)
p2 = BadPoint(5, 3)
print(hash(p1) == hash(p2))  # True! Order doesn't matter with XOR

p3 = BadPoint(7, 7)
print(hash(p3))  # 0! x^x = 0 for any x

# Many distinct points will collide:
# (1,2), (2,1), (3,0), (0,3) all have related hashes
```

Good Approach: Position-Weighted Combination
```python
# GOOD: Python's recommended approach
class GoodPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Use tuple hashing - properly handles ordering and combination
        return hash((self.x, self.y))

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Alternative: Explicit prime-based combination
class BetterPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Combine with primes to preserve order
        result = 17  # Non-zero start
        result = 31 * result + hash(self.x)
        result = 31 * result + hash(self.y)
        return result

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Now (3,5) and (5,3) have different hashes
p1 = BetterPoint(3, 5)
p2 = BetterPoint(5, 3)
print(hash(p1) == hash(p2))  # False - correctly distinguished

# The same pattern in Java:
'''
@Override
public int hashCode() {
    int result = 17;
    result = 31 * result + x;
    result = 31 * result + y;
    return result;
}
'''
```

Best Practices for Custom Hash Functions:
- Use tuple hashing where available: hash((a, b, c)) in Python is optimized and correct.
- Combine fields with a prime multiplier: 31 * running_hash + field_hash

If you override equals(), you MUST override hashCode(). Objects that are equal must have identical hash codes. Violating this contract causes hash table lookups to fail—you can insert an object but never find it again!
Poor hash function choices silently undermine hash table performance. Understanding what makes a hash function good—and recognizing common failures—is essential for building efficient systems.
What's Next:
Hash functions and collision handling determine when hash tables work well. But there are problems where hash tables, even with perfect hash functions, are simply the wrong choice. The next page explores scenarios where ordered data structures like balanced binary search trees are fundamentally better suited to the problem requirements.
You now understand what makes hash functions fail and how to choose, evaluate, and fix them. This knowledge helps you avoid subtle performance bugs that can silently cripple system performance. Next, we'll examine when hash tables themselves are the wrong tool for the job.