Imagine you're managing a restaurant with 100 tables. Your host's job is to assign arriving parties to tables. Now consider two hosts:
Host A has a strange policy: every party whose last name starts with 'S' goes to Table 1. Every 'J' goes to Table 2. Every 'M' goes to Table 3. The remaining 97 tables? They're for the other 23 letters, most of which are rare.
Host B distributes parties evenly across all 100 tables based on their complete name, ensuring roughly equal distribution.
Which restaurant runs better? Host A's approach creates massive queues at a few tables while others sit empty. Host B's approach balances the load, giving every party quick service.
Hash function uniformity is exactly this problem. A hash function that clusters outputs creates expensive 'crowded tables' (long bucket chains), while a uniform hash function distributes the load evenly (short chains, O(1) access).
By the end of this page, you will understand: (1) What uniformity means mathematically, (2) Why uniformity is essential for O(1) hash table performance, (3) How to measure and evaluate uniformity, (4) Common patterns that destroy uniformity, and (5) How to design and select uniform hash functions.
Uniformity, also called the uniform distribution property, describes how evenly a hash function spreads its outputs across the available output space. A perfectly uniform hash function would give each possible output value an equal probability of being selected.
Formally, for a hash function h that maps keys to integers in the range [0, m-1]:
```
DEFINITION (Simple Uniform Hashing Assumption - SUHA)
=====================================================
A hash function h satisfies the Simple Uniform Hashing Assumption if:

    For a randomly selected key k from the key space K,
    P(h(k) = i) = 1/m    for all i ∈ {0, 1, 2, ..., m-1}

In words: "Each hash bucket is equally likely to receive any given key."

IDEAL DISTRIBUTION
==================
If we insert n keys into a hash table with m buckets using a perfectly
uniform hash function:

    Expected keys per bucket = n/m (the load factor α)
    Variance in bucket sizes ≈ 0   (all buckets equally full)

NON-UNIFORM DISTRIBUTION (PROBLEM)
==================================
If the hash function is non-uniform:

    Some buckets receive far more than n/m keys
    Some buckets receive far fewer than n/m keys
    Result: O(1) degrades toward O(n) in crowded buckets
```

Important clarification: the Simple Uniform Hashing Assumption (SUHA) is an idealization; no real hash function achieves perfect uniformity for all possible inputs. However, well-designed hash functions come close enough for practical purposes. The closer we get to uniform distribution, the better our hash table performs.
Uniformity does NOT mean randomness. A hash function must be deterministic (same input → same output). Uniformity means that across all possible inputs, outputs are spread evenly. It's the distribution of outputs that's uniform, not the process of generating them.
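To make that distinction concrete, here is a minimal sketch (the `det_hash` helper is illustrative, not a standard library function): the function is fully deterministic, yet its outputs spread roughly evenly across buckets.

```python
from collections import Counter

def det_hash(key: str, m: int = 8) -> int:
    """Deterministic toy hash: the same key always maps to the same bucket."""
    h = 0
    for c in key:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h % m

# Determinism: repeated calls with the same key give the same bucket.
assert det_hash("alice") == det_hash("alice")

# Uniformity: across many different keys, the buckets fill roughly evenly.
buckets = Counter(det_hash(f"user_{i}") for i in range(10_000))
print(sorted(buckets.items()))  # each of the 8 buckets holds roughly 1,250 keys
```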
```python
import random
from collections import Counter

def uniform_hash(key: str, table_size: int = 10) -> int:
    """A reasonably uniform hash function."""
    h = 0
    for i, char in enumerate(key):
        # Mix position and character value thoroughly
        h = (h * 31 + ord(char) * (i + 1)) & 0xFFFFFFFF
    return h % table_size

def terrible_hash(key: str, table_size: int = 10) -> int:
    """A deliberately terrible hash function."""
    # Only uses first character - extremely non-uniform!
    if not key:
        return 0
    return ord(key[0]) % table_size

def analyze_distribution(hash_fn, keys: list, table_size: int = 10):
    """Analyze how evenly a hash function distributes keys."""
    distribution = Counter(hash_fn(k, table_size) for k in keys)
    total_keys = len(keys)
    expected_per_bucket = total_keys / table_size

    print(f"\nDistribution Analysis (n={total_keys}, m={table_size}):")
    print(f"Expected keys per bucket: {expected_per_bucket:.1f}")
    print("Actual distribution:")

    max_bucket = max(distribution.values())
    for bucket in range(table_size):
        count = distribution.get(bucket, 0)
        bar = '█' * count
        ratio = count / expected_per_bucket if expected_per_bucket > 0 else 0
        status = '(optimal)' if 0.5 <= ratio <= 1.5 else '(skewed!)' if ratio > 2 else ''
        print(f"  Bucket {bucket}: {bar:50} {count:4} {status}")

    # Calculate standard deviation
    counts = [distribution.get(i, 0) for i in range(table_size)]
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    std_dev = variance ** 0.5

    print("\nStatistics:")
    print(f"  Min bucket size: {min(counts)}")
    print(f"  Max bucket size: {max(counts)}")
    print(f"  Standard deviation: {std_dev:.2f}")
    print(f"  Max/Expected ratio: {max_bucket / expected_per_bucket:.2f}x")

    return std_dev

# Generate test keys - common first names
test_keys = [
    "Alice", "Bob", "Charlie", "David", "Emma", "Frank", "Grace", "Henry",
    "Ivy", "Jack", "Kate", "Leo", "Mia", "Noah", "Olivia", "Peter",
    "Quinn", "Rose", "Sam", "Tina", "Uma", "Victor", "Wendy", "Xavier",
    "Yuki", "Zoe", "Adam", "Beth", "Carl", "Diana", "Eric", "Fiona",
    "George", "Helen", "Ian", "Julia", "Kevin", "Laura", "Mike"
]

print("=== UNIFORM HASH FUNCTION ===")
uniform_std = analyze_distribution(uniform_hash, test_keys)

print("\n" + "="*60)
print("=== TERRIBLE HASH FUNCTION ===")
terrible_std = analyze_distribution(terrible_hash, test_keys)

print("\n=== COMPARISON ===")
print(f"Uniform hash std deviation:  {uniform_std:.2f}")
print(f"Terrible hash std deviation: {terrible_std:.2f}")
print(f"The terrible hash is {terrible_std/uniform_std:.1f}x more skewed!")
```

Hash tables promise O(1) average-case operations. But this promise depends entirely on uniformity. Without uniform distribution, hash tables silently degrade to O(n), and you may never know why your system is slow.
| Metric | Uniform Hash | Non-Uniform Hash |
|---|---|---|
| Average bucket size | n/m (load factor) | Varies wildly: 0 to n |
| Lookup time (average) | O(1 + α) | O(n) in worst buckets |
| Insert time (average) | O(1) | O(n) in crowded buckets |
| Delete time (average) | O(1 + α) | O(n) in worst buckets |
| Memory efficiency | All buckets used equally | Some buckets overflow, others empty |
| Predictability | Consistent performance | Unpredictable, input-dependent |
The Mathematics of Degradation:
Let's see exactly how non-uniformity destroys performance. Consider a hash table with m = 1000 buckets and n = 10,000 keys.
With perfect uniformity:

- Every bucket holds about n/m = 10,000 / 1,000 = 10 keys (the load factor α = 10).
- A lookup scans at most ~10 entries, and about 5 on average for a successful search.

With extreme non-uniformity (everything in one bucket):

- One bucket holds all 10,000 keys; the other 999 buckets sit empty.
- A lookup in that bucket scans up to 10,000 entries, roughly 5,000 on average.

The difference: roughly 1,000x slower!
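A quick sanity check of that arithmetic in code (assuming a chained hash table where a worst-case lookup scans the entire bucket; the variable names are just for illustration):

```python
n, m = 10_000, 1_000

# Perfect uniformity: every bucket holds about the load factor alpha = n/m keys.
alpha = n / m                       # 10 keys per bucket
uniform_worst_lookup = alpha        # scan at most ~10 entries

# Extreme non-uniformity: one bucket holds all n keys.
skewed_worst_lookup = n             # scan up to 10,000 entries

print(alpha)                                        # 10.0
print(skewed_worst_lookup / uniform_worst_lookup)   # 1000.0 -> ~1,000x slower
```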
```python
import time
from typing import Callable, List, Tuple
from collections import defaultdict

class InstrumentedHashTable:
    """Hash table that counts comparisons to measure performance."""

    def __init__(self, hash_fn: Callable[[str], int], size: int = 1000):
        self.hash_fn = hash_fn
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.total_comparisons = 0

    def _hash(self, key: str) -> int:
        return self.hash_fn(key) % self.size

    def insert(self, key: str, value):
        index = self._hash(key)
        for i, (k, v) in enumerate(self.buckets[index]):
            self.total_comparisons += 1
            if k == key:
                self.buckets[index][i] = (key, value)
                return
        self.buckets[index].append((key, value))

    def lookup(self, key: str):
        index = self._hash(key)
        for k, v in self.buckets[index]:
            self.total_comparisons += 1
            if k == key:
                return v
        return None

    def get_bucket_stats(self) -> Tuple[float, int, int]:
        sizes = [len(b) for b in self.buckets]
        return sum(sizes) / len(sizes), max(sizes), min(sizes)

# Hash functions with different uniformity
def good_hash(key: str) -> int:
    h = 0
    for i, c in enumerate(key):
        h = (h * 31 + ord(c) * (i + 1)) & 0xFFFFFFFF
    return h

def bad_hash(key: str) -> int:
    # Only uses length - many strings have same length!
    return len(key)

def terrible_hash(key: str) -> int:
    # Constant - all keys go to same bucket!
    return 42

# Generate diverse test data
test_keys = [f"user_{i:05d}" for i in range(10000)]

def benchmark(name: str, hash_fn: Callable):
    print(f"\n{'='*60}")
    print(f"Testing: {name}")
    print('='*60)

    table = InstrumentedHashTable(hash_fn, size=1000)

    # Insert all keys
    start = time.perf_counter()
    for key in test_keys:
        table.insert(key, {"data": key})
    insert_time = time.perf_counter() - start

    # Reset comparison counter for lookups
    table.total_comparisons = 0

    # Lookup all keys
    start = time.perf_counter()
    for key in test_keys:
        table.lookup(key)
    lookup_time = time.perf_counter() - start

    avg_size, max_size, min_size = table.get_bucket_stats()

    print("Bucket Statistics:")
    print(f"  Average bucket size: {avg_size:.2f}")
    print(f"  Maximum bucket size: {max_size}")
    print(f"  Minimum bucket size: {min_size}")
    print(f"  Max/Avg ratio: {max_size/avg_size:.1f}x")
    print("\nPerformance:")
    print(f"  Total comparisons for {len(test_keys)} lookups: {table.total_comparisons:,}")
    print(f"  Average comparisons per lookup: {table.total_comparisons/len(test_keys):.2f}")
    print(f"  Lookup time: {lookup_time*1000:.2f}ms")

    return table.total_comparisons

# Run benchmarks
good_comparisons = benchmark("Good Hash (Uniform)", good_hash)
bad_comparisons = benchmark("Bad Hash (Length-based)", bad_hash)
terrible_comparisons = benchmark("Terrible Hash (Constant)", terrible_hash)

print(f"\n{'='*60}")
print("SUMMARY: Impact of Uniformity on Performance")
print('='*60)
print(f"Good hash comparisons:     {good_comparisons:>12,} (baseline)")
print(f"Bad hash comparisons:      {bad_comparisons:>12,} ({bad_comparisons/good_comparisons:.1f}x worse)")
print(f"Terrible hash comparisons: {terrible_comparisons:>12,} ({terrible_comparisons/good_comparisons:.1f}x worse)")
```

Non-uniform hashing is insidious because the code looks correct. The hash table API works. Data is stored and retrieved. But under the surface, O(1) has silently become O(n). You only discover the problem when your system mysteriously slows down under load, and by then finding the root cause is extremely difficult.
Certain hash function designs are notorious for producing non-uniform distributions. Understanding these anti-patterns helps you avoid them and recognize them in existing code.
| Anti-Pattern | Example | Why It Fails | Real Impact |
|---|---|---|---|
| Using only first character | h(s) = ord(s[0]) % m | Names cluster by first letter; 'S' and 'J' dominate | Buckets 10, 18, 19 get 40% of keys |
| Using string length | h(s) = len(s) % m | Most strings have similar lengths (5-15 chars) | Buckets 5-15 get 95% of keys |
| Simple sum of characters | h(s) = sum(ord(c)) % m | Anagrams hash to same value | 'abc', 'bca', 'cab' all collide |
| Ignoring structure | h(ip) = sum(octets) % m | 192.168.1.x all hash similarly | Local network IPs cluster |
| Mod by small number | h(n) = n % 10 | Only 10 possible outputs | Sequential IDs fill buckets in order |
| Using only high bits | h(n) = n >> 24 | Similar numbers hash identically | User IDs 1-16M all hash to 0 |
```python
from collections import Counter

def show_distribution(name: str, hash_fn, keys: list, table_size: int = 20):
    """Show how badly a hash function distributes keys."""
    distribution = Counter(hash_fn(k) % table_size for k in keys)
    total = len(keys)
    expected = total / table_size

    print(f"\n{name}")
    print(f"Keys: {total}, Buckets: {table_size}, Expected per bucket: {expected:.1f}")

    # Find most crowded buckets
    crowded = distribution.most_common(5)
    empty = table_size - len(distribution)

    print(f"Most crowded buckets: {crowded}")
    print(f"Empty buckets: {empty}/{table_size}")
    print(f"Worst bucket has {crowded[0][1] / expected:.1f}x the keys it should")

# Test data - realistic user names
names = [
    "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller",
    "Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez",
    "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
    "Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark",
    "Ramirez", "Lewis", "Robinson", "Walker", "Young", "Allen", "King",
    "Wright", "Scott", "Torres", "Nguyen", "Hill", "Flores", "Green",
    "Adams", "Nelson", "Baker", "Hall", "Rivera", "Campbell", "Mitchell",
    "Carter", "Roberts", "Turner", "Phillips", "Evans", "Parker", "Edwards"
]

# Anti-pattern 1: First character only
def first_char_hash(s: str) -> int:
    return ord(s[0]) if s else 0

# Anti-pattern 2: String length
def length_hash(s: str) -> int:
    return len(s)

# Anti-pattern 3: Simple sum
def sum_hash(s: str) -> int:
    return sum(ord(c) for c in s)

# Good hash for comparison
def good_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

show_distribution("ANTI-PATTERN: First Character Hash", first_char_hash, names)
show_distribution("ANTI-PATTERN: Length Hash", length_hash, names)
show_distribution("ANTI-PATTERN: Sum Hash", sum_hash, names)
show_distribution("GOOD: Polynomial Hash", good_hash, names)

# Special case: Anagram collision in sum hash
print("\n=== ANAGRAM COLLISION DEMONSTRATION ===")
anagrams = ["listen", "silent", "enlist", "tinsel", "inlets"]
for word in anagrams:
    print(f"  sum_hash('{word}') = {sum_hash(word)}")

print("\nAll anagrams hash to the same value!")
print("Expected: 5 different hashes")
print(f"Actual: {len(set(sum_hash(w) for w in anagrams))} unique hash(es)")
```

A good hash function must use ALL the input data, mix it thoroughly, and ensure that small changes in input produce large, unpredictable changes in output. Simple operations like sum, length, or first-character extraction leave too much information on the table.
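The demo above covers the string anti-patterns from the table. The two integer anti-patterns (modulo by a small number, keeping only high bits) fail just as badly; a brief sketch, with the key ranges chosen purely for illustration:

```python
from collections import Counter

# Anti-pattern: modulo by a small number - only 10 distinct buckets can ever
# be produced, so sequential IDs cycle through them in lockstep.
mod_buckets = Counter(n % 10 for n in range(100_000))
print(len(mod_buckets), "buckets ever used")       # 10

# Anti-pattern: keeping only the high bits - every ID below 2**24
# (about 16.7 million) shifts down to the same value.
high_bit_buckets = Counter(n >> 24 for n in range(1, 1_000_000))
print(high_bit_buckets)                            # Counter({0: 999999})
```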
The avalanche effect is a property where a small change in the input produces a drastic, unpredictable change in the output. Specifically, changing a single bit of input should change approximately 50% of the output bits. This property is crucial for uniformity because it ensures that similar inputs don't cluster together.
```
STRICT AVALANCHE CRITERION (SAC)
================================
A hash function h satisfies the Strict Avalanche Criterion if:

    For any input x and any bit position i:
    Flipping bit i of x changes each bit of h(x) with probability 0.5

In other words:
    - Change 1 bit of input
    - Expect ~50% of output bits to change
    - Which specific bits change should be unpredictable

PRACTICAL AVALANCHE
===================
For practical hash functions, we want:

    hamming_distance(h(x), h(x')) ≈ output_bits / 2

Where x' differs from x by a single bit or character.

WHY AVALANCHE ENABLES UNIFORMITY
================================
Without avalanche:
    - Similar inputs → Similar hashes
    - "user_001" and "user_002" might differ by 1 bit
    - They land in same or adjacent buckets
    - Real data has patterns → Clustering occurs

With avalanche:
    - Similar inputs → Completely different hashes
    - "user_001" → 0x7A3F2C1D
    - "user_002" → 0xE891B5A0 (completely different!)
    - Patterns in input don't create patterns in output
```
```python
def count_differing_bits(a: int, b: int) -> int:
    """Count how many bits differ between two integers."""
    return bin(a ^ b).count('1')

def measure_avalanche(hash_fn, base_string: str, bits: int = 32):
    """Measure avalanche effect by changing one character."""
    base_hash = hash_fn(base_string)

    print(f"\nAvalanche analysis for: '{base_string}'")
    print(f"Base hash: {base_hash:032b} ({base_hash})")
    print("\nChanging each character position:")

    total_bits_changed = 0
    tests = 0

    for i in range(len(base_string)):
        # Create variant with one character changed
        chars = list(base_string)
        original_char = chars[i]
        chars[i] = chr(ord(chars[i]) + 1)  # Increment character
        variant = ''.join(chars)

        variant_hash = hash_fn(variant)
        bits_changed = count_differing_bits(base_hash, variant_hash)
        percentage = (bits_changed / bits) * 100

        print(f"  Position {i}: '{original_char}'→'{chars[i]}' "
              f"hash={variant_hash:08x}, bits changed: {bits_changed}/{bits} ({percentage:.0f}%)")

        total_bits_changed += bits_changed
        tests += 1

    avg_bits = total_bits_changed / tests
    ideal = bits / 2

    print("\nSummary:")
    print(f"  Average bits changed: {avg_bits:.1f}/{bits}")
    print(f"  Ideal (50%): {ideal}")
    print(f"  Avalanche quality: {(avg_bits/ideal)*100:.0f}%")

    return avg_bits / bits

# Poor hash: No avalanche
def poor_hash(s: str) -> int:
    return sum(ord(c) for c in s) & 0xFFFFFFFF

# Better hash: Some mixing but weak avalanche
def medium_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Good hash: Strong avalanche through thorough mixing
def strong_hash(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    # Additional mixing
    h ^= (h >> 16)
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= (h >> 13)
    return h

# Test avalanche
test_string = "hello_world"

print("=" * 60)
print("POOR HASH (Simple Sum)")
print("=" * 60)
poor_score = measure_avalanche(poor_hash, test_string)

print("\n" + "=" * 60)
print("MEDIUM HASH (Polynomial)")
print("=" * 60)
medium_score = measure_avalanche(medium_hash, test_string)

print("\n" + "=" * 60)
print("STRONG HASH (With Mixing)")
print("=" * 60)
strong_score = measure_avalanche(strong_hash, test_string)

print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)
print(f"Poor hash avalanche:   {poor_score*100:.0f}% (should be ~50%)")
print(f"Medium hash avalanche: {medium_score*100:.0f}%")
print(f"Strong hash avalanche: {strong_score*100:.0f}%")
```

Cryptographic hash functions (SHA-256, Blake2) have excellent avalanche properties because security requires it. Non-cryptographic hash functions (MurmurHash, xxHash) balance avalanche quality with speed. For hash tables, you don't need crypto-grade avalanche; you need just enough to prevent clustering from realistic input patterns.
Creating a uniform hash function requires deliberate design. Here are the key principles and techniques used by well-designed hash functions:
```python
"""
Building a uniform hash function from first principles.
We'll improve incrementally, measuring uniformity at each step.
"""

from collections import Counter
import random

def test_uniformity(hash_fn, num_keys: int = 10000, table_size: int = 100):
    """Calculate chi-square statistic to measure uniformity."""
    keys = [f"key_{i:06d}" for i in range(num_keys)]
    distribution = Counter(hash_fn(k) % table_size for k in keys)

    expected = num_keys / table_size
    chi_square = sum(
        (distribution.get(i, 0) - expected) ** 2 / expected
        for i in range(table_size)
    )
    # Lower is better; perfect uniformity would give ~table_size-1
    return chi_square

# Version 1: Terrible - only uses first character
def hash_v1(s: str) -> int:
    return ord(s[0]) if s else 0

# Version 2: Bad - simple sum (anagram problem)
def hash_v2(s: str) -> int:
    return sum(ord(c) for c in s)

# Version 3: Better - position-weighted sum
def hash_v3(s: str) -> int:
    h = 0
    for i, c in enumerate(s):
        h += ord(c) * (i + 1)  # Position matters now
    return h

# Version 4: Good - polynomial rolling hash
def hash_v4(s: str) -> int:
    h = 0
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

# Version 5: Excellent - polynomial with finalization
def hash_v5(s: str) -> int:
    h = 0
    # Character processing with prime multiplier
    for c in s:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    # Finalization: mix all bits thoroughly
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Version 6: Production-quality (djb2 inspired with improvements)
def hash_v6(s: str) -> int:
    h = 5381  # Magic starting value (prime)
    for c in s:
        # h * 33 + c, implemented as shift-add for efficiency
        h = ((h << 5) + h + ord(c)) & 0xFFFFFFFF
    # Finalization
    h ^= h >> 16
    h = (h * 0x7feb352d) & 0xFFFFFFFF
    h ^= h >> 15
    return h

# Test all versions
print("Uniformity Comparison (Chi-Square Test)")
print("Lower is better. Perfect uniformity ≈ 99 for table_size=100")
print("="*60)

versions = [
    ("V1: First char only", hash_v1),
    ("V2: Simple sum", hash_v2),
    ("V3: Position-weighted", hash_v3),
    ("V4: Polynomial", hash_v4),
    ("V5: Poly + finalization", hash_v5),
    ("V6: Production quality", hash_v6),
]

for name, fn in versions:
    chi_sq = test_uniformity(fn)
    quality = "❌ Poor" if chi_sq > 200 else "⚠️ Fair" if chi_sq > 120 else "✓ Good"
    print(f"{name:30} χ² = {chi_sq:8.1f} {quality}")
```

The chi-square (χ²) statistic measures how much an observed distribution deviates from the expected one. For a hash function mapping n keys to m buckets: χ² = Σ(observed - expected)² / expected. A perfectly uniform distribution gives χ² ≈ m-1. Values much higher than that indicate non-uniformity.
An often-overlooked aspect of uniformity is how the hash function interacts with the table size. Even a uniform hash function can produce non-uniform bucket distributions if the table size is poorly chosen.
```
THE PROBLEM WITH POWER-OF-2 SIZES
=================================
If table_size = 2^k, then:

    index = hash(key) % table_size
          = hash(key) & (table_size - 1)   // Only uses lowest k bits!

If your hash function has weak lower bits (common!),
power-of-2 sizes expose this weakness.

Example:
    hash(key) = key * 1024            // Lower 10 bits are always 0!
    table_size = 1024                 // Power of 2
    index = (key * 1024) % 1024 = 0   // Everything goes to bucket 0!

THE PRIME NUMBER SOLUTION
=========================
Using a prime table size helps because:
    - Primes have no factors to align with hash patterns
    - Modulo by prime spreads values more evenly
    - Common patterns in data (multiples, sequences) don't cluster

Example:
    table_size = 1009   // Prime
    Sequential keys 1,2,3,... spread across all buckets
    Even if hash is weak, prime breaks up patterns

MODERN APPROACH: BETTER HASH, ANY SIZE
======================================
Modern hash functions (MurmurHash, xxHash, etc.) have such good
bit distribution that power-of-2 sizes work fine.

But for simple hash functions (polynomial, djb2), prefer primes
or use a final mixing step to improve lower-bit quality.
```
```python
from collections import Counter

def weak_hash(key: int) -> int:
    """A hash function with weak lower bits."""
    # Multiplying by a power of 2 leaves the lowest bits always zero
    return key * 128

def strong_hash(key: int) -> int:
    """A hash function with good bit distribution."""
    h = key
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def analyze_size_effect(hash_fn, table_size: int, num_keys: int = 1000):
    """Show distribution for a given table size."""
    keys = list(range(num_keys))
    distribution = Counter(hash_fn(k) % table_size for k in keys)

    used_buckets = len(distribution)
    max_bucket = max(distribution.values())
    expected = num_keys / table_size

    return used_buckets, max_bucket, expected

print("Effect of Table Size on Distribution")
print("="*70)

# Test different table sizes
sizes = [
    (100, "100 (not prime, has factors 2,5)"),
    (101, "101 (prime)"),
    (128, "128 (power of 2)"),
    (127, "127 (prime)"),
    (256, "256 (power of 2)"),
    (257, "257 (prime)"),
    (1000, "1000 (highly composite)"),
    (1009, "1009 (prime)"),
]

print("\nWEAK HASH (poor lower bits):")
print("-"*70)
for size, description in sizes:
    used, max_b, expected = analyze_size_effect(weak_hash, size)
    crowding = max_b / expected
    quality = "✓" if crowding < 2 else "⚠️" if crowding < 5 else "❌"
    print(f"  Size {description:35} Used: {used:4}/{size:<4} "
          f"Max: {max_b:4} (×{crowding:.1f} expected) {quality}")

print("\nSTRONG HASH (good bit mixing):")
print("-"*70)
for size, description in sizes:
    used, max_b, expected = analyze_size_effect(strong_hash, size)
    crowding = max_b / expected
    quality = "✓" if crowding < 2 else "⚠️" if crowding < 5 else "❌"
    print(f"  Size {description:35} Used: {used:4}/{size:<4} "
          f"Max: {max_b:4} (×{crowding:.1f} expected) {quality}")

print("\n💡 Observation: Strong hash works well with any size.")
print("   Weak hash needs prime sizes to compensate.")
```

If you control the hash function, use a modern well-designed one (MurmurHash3, xxHash, FNV-1a) and power-of-2 sizes are fine. If you're using a potentially weak hash function or can't control it, prefer prime table sizes. Many production hash tables (Java's HashMap, Python's dict) keep power-of-2 sizes but add their own bit-mixing or probing steps to compensate for weak low-order hash bits.
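For reference, FNV-1a is simple enough to write in a few lines. This 32-bit sketch uses the published offset basis and prime; it is meant as an illustration, not a replacement for a vetted library implementation:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = 0x811C9DC5                           # FNV-1a 32-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF    # FNV-1a 32-bit prime
    return h

print(hex(fnv1a_32(b"hello")))   # deterministic, well-mixed 32-bit value
```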
We've explored uniformity, the property that makes hash tables actually deliver O(1) performance. Let's consolidate the key insights:

- Uniformity (SUHA) means every bucket is equally likely to receive a given key: P(h(k) = i) = 1/m.
- Uniform distribution keeps bucket chains near the load factor α = n/m, which is what makes O(1 + α) operations possible; non-uniform hashing silently degrades toward O(n).
- Uniformity can be measured: inspect the bucket distribution, compute the standard deviation of bucket sizes, or run a chi-square test (χ² ≈ m - 1 for a uniform hash).
- Anti-patterns such as hashing only the first character, the string length, or a plain character sum discard information and cluster real-world keys.
- Good designs use every byte of input, mix thoroughly (avalanche), and finish with a finalization step; the choice of table size (prime vs. power of 2) interacts with hash quality.
What's next:
We've established that hash functions must be deterministic (correctness) and uniform (performance). But there's a third critical property: efficiency. A hash function that's deterministic and uniform but takes 10 seconds to compute defeats the purpose. The next page explores how to design hash functions that are fast to compute while maintaining the other essential properties.
You now understand uniformity—why it matters, how to measure it, what destroys it, and how to design for it. You can evaluate hash functions for uniformity and recognize the anti-patterns that cause silent performance degradation. Next, we'll explore the third essential property: computational efficiency.