Imagine you're managing a restaurant with 100 tables. Your host's job is to assign arriving parties to tables. Now consider two hosts:
Host A has a strange policy: every party whose last name starts with 'S' goes to Table 1. Every 'J' goes to Table 2. Every 'M' goes to Table 3. The remaining 97 tables? They're for the other 23 letters, most of which are rare.
Host B distributes parties evenly across all 100 tables based on their complete name, ensuring roughly equal distribution.
Which restaurant runs better? Host A's approach creates massive queues at a few tables while others sit empty. Host B's approach balances the load, giving every party quick service.
Hash function uniformity is exactly this problem. A hash function that clusters outputs creates expensive 'crowded tables' (long bucket chains), while a uniform hash function distributes the load evenly (short chains, O(1) access).
By the end of this page, you will understand: (1) What uniformity means mathematically, (2) Why uniformity is essential for O(1) hash table performance, (3) How to measure and evaluate uniformity, (4) Common patterns that destroy uniformity, and (5) How to design and select uniform hash functions.
Uniformity, also called the uniform distribution property, describes how evenly a hash function spreads its outputs across the available output space. A perfectly uniform hash function would give each possible output value an equal probability of being selected.
Formally, for a hash function h that maps keys to integers in the range [0, m-1]:
```
DEFINITION (Simple Uniform Hashing Assumption - SUHA)
=====================================================
A hash function h satisfies the Simple Uniform Hashing Assumption if:

    For a randomly selected key k from the key space K,
    P(h(k) = i) = 1/m    for all i ∈ {0, 1, 2, ..., m-1}

In words: "Each hash bucket is equally likely to receive any given key."

IDEAL DISTRIBUTION
==================
If we insert n keys into a hash table with m buckets using a perfectly
uniform hash function:

    Expected keys per bucket = n/m (the load factor α)
    Variance in bucket sizes ≈ 0   (all buckets equally full)

NON-UNIFORM DISTRIBUTION (PROBLEM)
==================================
If the hash function is non-uniform:

    Some buckets receive far more than n/m keys
    Some buckets receive far fewer than n/m keys
    Result: O(1) degrades toward O(n) in crowded buckets
```

Important clarification: the Simple Uniform Hashing Assumption (SUHA) is an idealization; no real hash function achieves perfect uniformity for all possible inputs. However, well-designed hash functions come close enough for practical purposes. The closer we get to uniform distribution, the better our hash table performs.
Uniformity does NOT mean randomness. A hash function must be deterministic (same input → same output). Uniformity means that across all possible inputs, outputs are spread evenly. It's the distribution of outputs that's uniform, not the process of generating them.
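To make that distinction concrete, here is a minimal sketch (the `det_hash` helper is illustrative, not a standard library function): the function is fully deterministic, yet its outputs spread roughly evenly across buckets.

```python
from collections import Counter

def det_hash(key: str, m: int = 8) -> int:
    """Deterministic toy hash: the same key always maps to the same bucket."""
    h = 0
    for c in key:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h % m

# Determinism: repeated calls with the same key give the same bucket.
assert det_hash("alice") == det_hash("alice")

# Uniformity: across many different keys, the buckets fill roughly evenly.
buckets = Counter(det_hash(f"user_{i}") for i in range(10_000))
print(sorted(buckets.items()))  # each of the 8 buckets holds roughly 1,250 keys
```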
```python
import random
from collections import Counter

def uniform_hash(key: str, table_size: int = 10) -> int:
    """A reasonably uniform hash function."""
    h = 0
    for i, char in enumerate(key):
        # Mix position and character value thoroughly
        h = (h * 31 + ord(char) * (i + 1)) & 0xFFFFFFFF
    return h % table_size

def terrible_hash(key: str, table_size: int = 10) -> int:
    """A deliberately terrible hash function."""
    # Only uses first character - extremely non-uniform!
    if not key:
        return 0
    return ord(key[0]) % table_size

def analyze_distribution(hash_fn, keys: list, table_size: int = 10):
    """Analyze how evenly a hash function distributes keys."""
    distribution = Counter(hash_fn(k, table_size) for k in keys)
    total_keys = len(keys)
    expected_per_bucket = total_keys / table_size

    print(f"\nDistribution Analysis (n={total_keys}, m={table_size}):")
    print(f"Expected keys per bucket: {expected_per_bucket:.1f}")
    print("Actual distribution:")

    max_bucket = max(distribution.values())
    for bucket in range(table_size):
        count = distribution.get(bucket, 0)
        bar = '█' * count
        ratio = count / expected_per_bucket if expected_per_bucket > 0 else 0
        status = '(optimal)' if 0.5 <= ratio <= 1.5 else '(skewed!)' if ratio > 2 else ''
        print(f"  Bucket {bucket}: {bar:50} {count:4} {status}")

    # Calculate standard deviation
    counts = [distribution.get(i, 0) for i in range(table_size)]
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    std_dev = variance ** 0.5

    print("\nStatistics:")
    print(f"  Min bucket size: {min(counts)}")
    print(f"  Max bucket size: {max(counts)}")
    print(f"  Standard deviation: {std_dev:.2f}")
    print(f"  Max/Expected ratio: {max_bucket / expected_per_bucket:.2f}x")

    return std_dev

# Generate test keys - common first names
test_keys = [
    "Alice", "Bob", "Charlie", "David", "Emma", "Frank", "Grace", "Henry",
    "Ivy", "Jack", "Kate", "Leo", "Mia", "Noah", "Olivia", "Peter",
    "Quinn", "Rose", "Sam", "Tina", "Uma", "Victor", "Wendy", "Xavier",
    "Yuki", "Zoe", "Adam", "Beth", "Carl", "Diana", "Eric", "Fiona",
    "George", "Helen", "Ian", "Julia", "Kevin", "Laura", "Mike"
]

print("=== UNIFORM HASH FUNCTION ===")
uniform_std = analyze_distribution(uniform_hash, test_keys)

print("\n" + "="*60)
print("=== TERRIBLE HASH FUNCTION ===")
terrible_std = analyze_distribution(terrible_hash, test_keys)

print("\n=== COMPARISON ===")
print(f"Uniform hash std deviation:  {uniform_std:.2f}")
print(f"Terrible hash std deviation: {terrible_std:.2f}")
print(f"The terrible hash is {terrible_std/uniform_std:.1f}x more skewed!")
```

Hash tables promise O(1) average-case operations. But this promise depends entirely on uniformity. Without uniform distribution, hash tables silently degrade to O(n), and you may never know why your system is slow.
| Metric | Uniform Hash | Non-Uniform Hash |
|---|---|---|
| Average bucket size | n/m (load factor) | Varies wildly: 0 to n |
| Lookup time (average) | O(1 + α) | O(n) in worst buckets |
| Insert time (average) | O(1) | O(n) in crowded buckets |
| Delete time (average) | O(1 + α) | O(n) in worst buckets |
| Memory efficiency | All buckets used equally | Some buckets overflow, others empty |
| Predictability | Consistent performance | Unpredictable, input-dependent |
The Mathematics of Degradation:
Let's see exactly how non-uniformity destroys performance. Consider a hash table with m = 1000 buckets and n = 10,000 keys.
With perfect uniformity:

- Every bucket holds about n/m = 10,000 / 1,000 = 10 keys (the load factor α = 10).
- A lookup scans at most ~10 entries, and about 5 on average for a successful search.

With extreme non-uniformity (everything in one bucket):

- One bucket holds all 10,000 keys; the other 999 buckets sit empty.
- A lookup in that bucket scans up to 10,000 entries, roughly 5,000 on average.

The difference: roughly 1,000x slower!
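A quick sanity check of that arithmetic in code (assuming a chained hash table where a worst-case lookup scans the entire bucket; the variable names are just for illustration):

```python
n, m = 10_000, 1_000

# Perfect uniformity: every bucket holds about the load factor alpha = n/m keys.
alpha = n / m                       # 10 keys per bucket
uniform_worst_lookup = alpha        # scan at most ~10 entries

# Extreme non-uniformity: one bucket holds all n keys.
skewed_worst_lookup = n             # scan up to 10,000 entries

print(alpha)                                        # 10.0
print(skewed_worst_lookup / uniform_worst_lookup)   # 1000.0 -> ~1,000x slower
```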
```python
import time
from typing import Callable, List, Tuple
from collections import defaultdict

class InstrumentedHashTable:
    """Hash table that counts comparisons to measure performance."""

    def __init__(self, hash_fn: Callable[[str], int], size: int = 1000):
        self.hash_fn = hash_fn
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.total_comparisons = 0

    def _hash(self, key: str) -> int:
        return self.hash_fn(key) % self.size

    def insert(self, key: str, value):
        index = self._hash(key)
        for i, (k, v) in enumerate(self.buckets[index]):
            self.total_comparisons += 1
            if k == key:
                self.buckets[index][i] = (key, value)
                return
        self.buckets[index].append((key, value))

    def lookup(self, key: str):
        index = self._hash(key)
        for k, v in self.buckets[index]:
            self.total_comparisons += 1
            if k == key:
                return v
        return None

    def get_bucket_stats(self) -> Tuple[float, int, int]:
        sizes = [len(b) for b in self.buckets]
        return sum(sizes) / len(sizes), max(sizes), min(sizes)

# Hash functions with different uniformity
def good_hash(key: str) -> int:
    h = 0
    for i, c in enumerate(key):
        h = (h * 31 + ord(c) * (i + 1)) & 0xFFFFFFFF
    return h

def bad_hash(key: str) -> int:
    # Only uses length - many strings have same length!
    return len(key)

def terrible_hash(key: str) -> int:
    # Constant - all keys go to same bucket!
    return 42

# Generate diverse test data
test_keys = [f"user_{i:05d}" for i in range(10000)]

def benchmark(name: str, hash_fn: Callable):
    print(f"\n{'='*60}")
    print(f"Testing: {name}")
    print('='*60)

    table = InstrumentedHashTable(hash_fn, size=1000)

    # Insert all keys
    start = time.perf_counter()
    for key in test_keys:
        table.insert(key, {"data": key})
    insert_time = time.perf_counter() - start

    # Reset comparison counter for lookups
    table.total_comparisons = 0

    # Lookup all keys
    start = time.perf_counter()
    for key in test_keys:
        table.lookup(key)
    lookup_time = time.perf_counter() - start

    avg_size, max_size, min_size = table.get_bucket_stats()

    print("Bucket Statistics:")
    print(f"  Average bucket size: {avg_size:.2f}")
    print(f"  Maximum bucket size: {max_size}")
    print(f"  Minimum bucket size: {min_size}")
    print(f"  Max/Avg ratio: {max_size/avg_size:.1f}x")
    print("\nPerformance:")
    print(f"  Total comparisons for {len(test_keys)} lookups: {table.total_comparisons:,}")
    print(f"  Average comparisons per lookup: {table.total_comparisons/len(test_keys):.2f}")
    print(f"  Lookup time: {lookup_time*1000:.2f}ms")

    return table.total_comparisons

# Run benchmarks
good_comparisons = benchmark("Good Hash (Uniform)", good_hash)
bad_comparisons = benchmark("Bad Hash (Length-based)", bad_hash)
terrible_comparisons = benchmark("Terrible Hash (Constant)", terrible_hash)

print(f"\n{'='*60}")
print("SUMMARY: Impact of Uniformity on Performance")
print('='*60)
print(f"Good hash comparisons:     {good_comparisons:>12,} (baseline)")
print(f"Bad hash comparisons:      {bad_comparisons:>12,} ({bad_comparisons/good_comparisons:.1f}x worse)")
print(f"Terrible hash comparisons: {terrible_comparisons:>12,} ({terrible_comparisons/good_comparisons:.1f}x worse)")
```

Non-uniform hashing is insidious because the code looks correct. The hash table API works. Data is stored and retrieved. But under the surface, O(1) has silently become O(n). You only discover the problem when your system mysteriously slows down under load, and by then finding the root cause is extremely difficult.
Certain hash function designs are notorious for producing non-uniform distributions. Understanding these anti-patterns helps you avoid them and recognize them in existing code.
| Anti-Pattern | Example | Why It Fails | Real Impact |
|---|---|---|---|
| Using only first character | h(s) = ord(s[0]) % m | Names cluster by first letter; 'S' and 'J' dominate | Buckets 10, 18, 19 get 40% of keys |
| Using string length | h(s) = len(s) % m | Most strings have similar lengths (5-15 chars) | Buckets 5-15 get 95% of keys |
| Simple sum of characters | h(s) = sum(ord(c)) % m | Anagrams hash to same value | 'abc', 'bca', 'cab' all collide |
| Ignoring structure | h(ip) = sum(octets) % m | 192.168.1.x all hash similarly | Local network IPs cluster |
| Mod by small number | h(n) = n % 10 | Only 10 possible outputs | Sequential IDs fill buckets in order |
| Using only high bits | h(n) = n >> 24 | Similar numbers hash identically | User IDs 1-16M all hash to 0 |
```python
from collections import Counter

def show_distribution(name: str, hash_fn, keys: list, table_size: int = 20):
    """Show how badly a hash function distributes keys."""
    distribution = Counter(hash_fn(k) % table_size for k in keys)
    total = len(keys)
    expected = total / table_size

    print(f"\n{name}")
    print(f"Keys: {total}, Buckets: {table_size}, Expected per bucket: {expected:.1f}")

    # Find most crowded buckets
    crowded = distribution.most_common(5)
    empty = table_size - len(distribution)

    print(f"Most crowded buckets: {crowded}")
    print(f"Empty buckets: {empty}/{table_size}")
    print(f"Worst bucket has {crowded[0][1] / expected:.1f}x the keys it should")

# Test data - realistic user names
names = [
    "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller",
    "Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez",
    "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
    "Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark",
    "Ramirez", "Lewis", "Robinson", "Walker", "Young", "Allen", "King",
    "Wright", "Scott", "Torres", "Nguyen", "Hill", "Flores", "Green",
    "Adams", "Nelson", "Baker", "Hall", "Rivera", "Campbell", "Mitchell",
    "Carter", "Roberts", "Turner", "Phillips", "Evans", "Parker", "Edwards"
]

# Anti-pattern 1: First character only
def first_char_hash(s: str) -> int:
    return ord(s[0]) if s else 0

# Anti-pattern 2: String length
def length_hash(s: str) -> int:
    return len(s)

# Anti-pattern 3: Simple sum
def sum_hash(s: str) -> int:
    return sum(ord(c) for c in s)

# Good hash for comparison
def good_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

show_distribution("ANTI-PATTERN: First Character Hash", first_char_hash, names)
show_distribution("ANTI-PATTERN: Length Hash", length_hash, names)
show_distribution("ANTI-PATTERN: Sum Hash", sum_hash, names)
show_distribution("GOOD: Polynomial Hash", good_hash, names)

# Special case: Anagram collision in sum hash
print("\n=== ANAGRAM COLLISION DEMONSTRATION ===")
anagrams = ["listen", "silent", "enlist", "tinsel", "inlets"]
for word in anagrams:
    print(f"  sum_hash('{word}') = {sum_hash(word)}")

print("\nAll anagrams hash to the same value!")
print("Expected: 5 different hashes")
print(f"Actual: {len(set(sum_hash(w) for w in anagrams))} unique hash(es)")
```

A good hash function must use ALL the input data, mix it thoroughly, and ensure that small changes in input produce large, unpredictable changes in output. Simple operations like sum, length, or first-character extraction leave too much information on the table.
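The demo above covers the string anti-patterns from the table. The two integer anti-patterns (modulo by a small number, keeping only high bits) fail just as badly; a brief sketch, with the key ranges chosen purely for illustration:

```python
from collections import Counter

# Anti-pattern: modulo by a small number - only 10 distinct buckets can ever
# be produced, so sequential IDs cycle through them in lockstep.
mod_buckets = Counter(n % 10 for n in range(100_000))
print(len(mod_buckets), "buckets ever used")       # 10

# Anti-pattern: keeping only the high bits - every ID below 2**24
# (about 16.7 million) shifts down to the same value.
high_bit_buckets = Counter(n >> 24 for n in range(1, 1_000_000))
print(high_bit_buckets)                            # Counter({0: 999999})
```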
The avalanche effect is a property where a small change in the input produces a drastic, unpredictable change in the output. Specifically, changing a single bit of input should change approximately 50% of the output bits. This property is crucial for uniformity because it ensures that similar inputs don't cluster together.
```
STRICT AVALANCHE CRITERION (SAC)
================================
A hash function h satisfies the Strict Avalanche Criterion if:

    For any input x and any bit position i:
    Flipping bit i of x changes each bit of h(x) with probability 0.5

In other words:
    - Change 1 bit of input
    - Expect ~50% of output bits to change
    - Which specific bits change should be unpredictable

PRACTICAL AVALANCHE
===================
For practical hash functions, we want:

    hamming_distance(h(x), h(x')) ≈ output_bits / 2

Where x' differs from x by a single bit or character.

WHY AVALANCHE ENABLES UNIFORMITY
================================
Without avalanche:
    - Similar inputs → Similar hashes
    - "user_001" and "user_002" might differ by 1 bit
    - They land in same or adjacent buckets
    - Real data has patterns → Clustering occurs

With avalanche:
    - Similar inputs → Completely different hashes
    - "user_001" → 0x7A3F2C1D
    - "user_002" → 0xE891B5A0 (completely different!)
    - Patterns in input don't create patterns in output
```
```python
def count_differing_bits(a: int, b: int) -> int:
    """Count how many bits differ between two integers."""
    return bin(a ^ b).count('1')

def measure_avalanche(hash_fn, base_string: str, bits: int = 32):
    """Measure avalanche effect by changing one character."""
    base_hash = hash_fn(base_string)

    print(f"\nAvalanche analysis for: '{base_string}'")
    print(f"Base hash: {base_hash:032b} ({base_hash})")
    print("\nChanging each character position:")

    total_bits_changed = 0
    tests = 0

    for i in range(len(base_string)):
        # Create variant with one character changed
        chars = list(base_string)
        original_char = chars[i]
        chars[i] = chr(ord(chars[i]) + 1)  # Increment character
        variant = ''.join(chars)

        variant_hash = hash_fn(variant)
        bits_changed = count_differing_bits(base_hash, variant_hash)
        percentage = (bits_changed / bits) * 100

        print(f"  Position {i}: '{original_char}'→'{chars[i]}' "
              f"hash={variant_hash:08x}, bits changed: {bits_changed}/{bits} ({percentage:.0f}%)")

        total_bits_changed += bits_changed
        tests += 1

    avg_bits = total_bits_changed / tests
    ideal = bits / 2

    print("\nSummary:")
    print(f"  Average bits changed: {avg_bits:.1f}/{bits}")
    print(f"  Ideal (50%): {ideal}")
    print(f"  Avalanche quality: {(avg_bits/ideal)*100:.0f}%")

    return avg_bits / bits

# Poor hash: No avalanche
def poor_hash(s: str) -> int:
    return sum(ord(c) for c in s) & 0xFFFFFFFF

# Better hash: Some mixing but weak avalanche
def medium_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Good hash: Strong avalanche through thorough mixing
def strong_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    # Additional mixing
    h ^= (h >> 16)
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= (h >> 13)
    return h

# Test avalanche
test_string = "hello_world"

print("=" * 60)
print("POOR HASH (Simple Sum)")
print("=" * 60)
poor_score = measure_avalanche(poor_hash, test_string)

print("\n" + "=" * 60)
print("MEDIUM HASH (Polynomial)")
print("=" * 60)
medium_score = measure_avalanche(medium_hash, test_string)

print("\n" + "=" * 60)
print("STRONG HASH (With Mixing)")
print("=" * 60)
strong_score = measure_avalanche(strong_hash, test_string)

print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)
print(f"Poor hash avalanche:   {poor_score*100:.0f}% (should be ~50%)")
print(f"Medium hash avalanche: {medium_score*100:.0f}%")
print(f"Strong hash avalanche: {strong_score*100:.0f}%")
```

Cryptographic hash functions (SHA-256, Blake2) have excellent avalanche properties because security requires it. Non-cryptographic hash functions (MurmurHash, xxHash) balance avalanche quality with speed. For hash tables, you don't need crypto-grade avalanche; you need just enough to prevent clustering from realistic input patterns.
Creating a uniform hash function requires deliberate design. Here are the key principles and techniques used by well-designed hash functions:
```python
"""
Building a uniform hash function from first principles.
We'll improve incrementally, measuring uniformity at each step.
"""

from collections import Counter
import random

def test_uniformity(hash_fn, num_keys: int = 10000, table_size: int = 100):
    """Calculate chi-square statistic to measure uniformity."""
    keys = [f"key_{i:06d}" for i in range(num_keys)]
    distribution = Counter(hash_fn(k) % table_size for k in keys)

    expected = num_keys / table_size
    chi_square = sum(
        (distribution.get(i, 0) - expected) ** 2 / expected
        for i in range(table_size)
    )
    # Lower is better; perfect uniformity would give ~table_size-1
    return chi_square

# Version 1: Terrible - only uses first character
def hash_v1(s: str) -> int:
    return ord(s[0]) if s else 0

# Version 2: Bad - simple sum (anagram problem)
def hash_v2(s: str) -> int:
    return sum(ord(c) for c in s)

# Version 3: Better - position-weighted sum
def hash_v3(s: str) -> int:
    h = 0
    for i, c in enumerate(s):
        h += ord(c) * (i + 1)  # Position matters now
    return h

# Version 4: Good - polynomial rolling hash
def hash_v4(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Version 5: Excellent - polynomial with finalization
def hash_v5(s: str) -> int:
    h = 0
    # Character processing with prime multiplier
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    # Finalization: mix all bits thoroughly
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Version 6: Production-quality (djb2 inspired with improvements)
def hash_v6(s: str) -> int:
    h = 5381  # Magic starting value (prime)
    for c in s:
        # h * 33 + c, implemented as shift-add for efficiency
        h = ((h << 5) + h + ord(c)) & 0xFFFFFFFF
    # Finalization
    h ^= h >> 16
    h = (h * 0x7feb352d) & 0xFFFFFFFF
    h ^= h >> 15
    return h

# Test all versions
print("Uniformity Comparison (Chi-Square Test)")
print("Lower is better. Perfect uniformity ≈ 99 for table_size=100")
print("="*60)

versions = [
    ("V1: First char only", hash_v1),
    ("V2: Simple sum", hash_v2),
    ("V3: Position-weighted", hash_v3),
    ("V4: Polynomial", hash_v4),
    ("V5: Poly + finalization", hash_v5),
    ("V6: Production quality", hash_v6),
]

for name, fn in versions:
    chi_sq = test_uniformity(fn)
    quality = "❌ Poor" if chi_sq > 200 else "⚠️ Fair" if chi_sq > 120 else "✓ Good"
    print(f"{name:30} χ² = {chi_sq:8.1f} {quality}")
```

The chi-square (χ²) statistic measures how much an observed distribution deviates from the expected one. For a hash function mapping n keys to m buckets: χ² = Σ(observed - expected)² / expected. A perfectly uniform distribution gives χ² ≈ m-1. Values much higher than that indicate non-uniformity.
An often-overlooked aspect of uniformity is how the hash function interacts with the table size. Even a uniform hash function can produce non-uniform bucket distributions if the table size is poorly chosen.
```
THE PROBLEM WITH POWER-OF-2 SIZES
=================================
If table_size = 2^k, then:

    index = hash(key) % table_size
          = hash(key) & (table_size - 1)   // Only uses lowest k bits!

If your hash function has weak lower bits (common!),
power-of-2 sizes expose this weakness.

Example:
    hash(key) = key * 1024            // Lower 10 bits are always 0!
    table_size = 1024                 // Power of 2
    index = (key * 1024) % 1024 = 0   // Everything goes to bucket 0!

THE PRIME NUMBER SOLUTION
=========================
Using a prime table size helps because:
    - Primes have no factors to align with hash patterns
    - Modulo by prime spreads values more evenly
    - Common patterns in data (multiples, sequences) don't cluster

Example:
    table_size = 1009   // Prime
    Sequential keys 1,2,3,... spread across all buckets
    Even if hash is weak, prime breaks up patterns

MODERN APPROACH: BETTER HASH, ANY SIZE
======================================
Modern hash functions (MurmurHash, xxHash, etc.) have such good
bit distribution that power-of-2 sizes work fine.

But for simple hash functions (polynomial, djb2), prefer primes
or use a final mixing step to improve lower-bit quality.
```
```python
from collections import Counter

def weak_hash(key: int) -> int:
    """A hash function with weak lower bits."""
    # Multiplying by a power of 2 leaves the lowest bits always zero
    return key * 128

def strong_hash(key: int) -> int:
    """A hash function with good bit distribution."""
    h = key
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def analyze_size_effect(hash_fn, table_size: int, num_keys: int = 1000):
    """Show distribution for a given table size."""
    keys = list(range(num_keys))
    distribution = Counter(hash_fn(k) % table_size for k in keys)

    used_buckets = len(distribution)
    max_bucket = max(distribution.values())
    expected = num_keys / table_size

    return used_buckets, max_bucket, expected

print("Effect of Table Size on Distribution")
print("="*70)

# Test different table sizes
sizes = [
    (100, "100 (not prime, has factors 2,5)"),
    (101, "101 (prime)"),
    (128, "128 (power of 2)"),
    (127, "127 (prime)"),
    (256, "256 (power of 2)"),
    (257, "257 (prime)"),
    (1000, "1000 (highly composite)"),
    (1009, "1009 (prime)"),
]

print("\nWEAK HASH (poor lower bits):")
print("-"*70)
for size, description in sizes:
    used, max_b, expected = analyze_size_effect(weak_hash, size)
    crowding = max_b / expected
    quality = "✓" if crowding < 2 else "⚠️" if crowding < 5 else "❌"
    print(f"  Size {description:35} Used: {used:4}/{size:<4} "
          f"Max: {max_b:4} (×{crowding:.1f} expected) {quality}")

print("\nSTRONG HASH (good bit mixing):")
print("-"*70)
for size, description in sizes:
    used, max_b, expected = analyze_size_effect(strong_hash, size)
    crowding = max_b / expected
    quality = "✓" if crowding < 2 else "⚠️" if crowding < 5 else "❌"
    print(f"  Size {description:35} Used: {used:4}/{size:<4} "
          f"Max: {max_b:4} (×{crowding:.1f} expected) {quality}")

print("\n💡 Observation: Strong hash works well with any size.")
print("   Weak hash needs prime sizes to compensate.")
```

If you control the hash function, use a modern well-designed one (MurmurHash3, xxHash, FNV-1a) and power-of-2 sizes are fine. If you're using a potentially weak hash function or can't control it, prefer prime table sizes. Many production hash tables (Java's HashMap, Python's dict) keep power-of-2 sizes but add their own bit-mixing or probing steps to compensate for weak low-order hash bits.
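For reference, FNV-1a is simple enough to write in a few lines. This 32-bit sketch uses the published offset basis and prime; it is meant as an illustration, not a replacement for a vetted library implementation:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = 0x811C9DC5                           # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # FNV-1a 32-bit prime
    return h

print(hex(fnv1a_32(b"hello")))   # deterministic, well-mixed 32-bit value
```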
We've explored uniformity, the property that makes hash tables actually deliver O(1) performance. Let's consolidate the key insights:

- Uniformity (SUHA) means every bucket is equally likely to receive a given key: P(h(k) = i) = 1/m.
- Uniform distribution keeps bucket chains near the load factor α = n/m, which is what makes O(1 + α) operations possible; non-uniform hashing silently degrades toward O(n).
- Uniformity can be measured: inspect the bucket distribution, compute the standard deviation of bucket sizes, or run a chi-square test (χ² ≈ m - 1 for a uniform hash).
- Anti-patterns such as hashing only the first character, the string length, or a plain character sum discard information and cluster real-world keys.
- Good designs use every byte of input, mix thoroughly (avalanche), and finish with a finalization step; the choice of table size (prime vs. power of 2) interacts with hash quality.
What's next:
We've established that hash functions must be deterministic (correctness) and uniform (performance). But there's a third critical property: efficiency. A hash function that's deterministic and uniform but takes 10 seconds to compute defeats the purpose. The next page explores how to design hash functions that are fast to compute while maintaining the other essential properties.
You now understand uniformity—why it matters, how to measure it, what destroys it, and how to design for it. You can evaluate hash functions for uniformity and recognize the anti-patterns that cause silent performance degradation. Next, we'll explore the third essential property: computational efficiency.