In the previous page, we examined deliberate attacks that exploit hash table vulnerabilities. But there's an equally important source of hash table performance problems that requires no attacker at all: poorly designed hash functions.
A bad hash function doesn't announce itself with errors or crashes. Instead, it silently degrades performance—what should be O(1) operations gradually become O(n) as data accumulates in a few overloaded buckets. The system works, but slowly. Engineers blame the database, the network, or insufficient hardware, never suspecting that a simple mathematical function is the root cause.
This insidious failure mode has caused countless production incidents. Understanding what makes a hash function good (or bad) is fundamental to hash table mastery.
By the end of this page, you will be able to identify poor hash function choices, understand why they fail, recognize common anti-patterns, and evaluate hash functions for your specific use cases. You'll learn to see hash function quality as a critical design decision, not an afterthought.
Before we can identify poor hash functions, we need a clear understanding of what properties a good hash function should have. Recall the three fundamental properties from earlier in this chapter:
The Three Pillars of Hash Function Quality:

- Determinism: the same key must always produce the same hash value.
- Uniformity: hash values should spread keys evenly across all buckets.
- Efficiency: computing the hash must be fast relative to the work it saves.
The Critical Property: Uniformity
Of these three, uniformity is where most hash functions fail. Determinism is easy—just don't use random numbers. Efficiency is usually acceptable—even slow hash functions are faster than searching. But achieving good uniformity requires understanding your data and choosing (or designing) a hash function that spreads that specific type of data evenly.
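To see why uniformity is the hard property, consider a sketch of a function that satisfies the other two perfectly. The `const_hash` below is a deliberately bad illustration, not a pattern anyone ships:

```python
def const_hash(key):
    """Deterministic (same input, same output) and efficient (O(1)),
    but maximally non-uniform: every key lands in the same bucket."""
    return 42

# Determinism and efficiency hold trivially...
print(const_hash("alice") == const_hash("alice"))  # True
# ...but every key collides with every other key, so one chain holds everything:
print(const_hash("alice") == const_hash("bob"))    # True
```

Determinism and efficiency can be verified at a glance; uniformity can only be judged against the data you actually store.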
The Uniformity Challenge:
Real-world data is rarely uniform. Consider these common patterns:

- Sequential IDs: auto-incrementing keys such as 1000, 1001, 1002, ...
- Timestamps: values clustered around recent times and business hours
- Strings with shared prefixes: URLs, file paths, and zip codes from the same region
A hash function must transform these clustered, patterned inputs into uniformly distributed outputs. If it fails to break up input patterns, collisions cluster and performance degrades.
```text
INPUT DATA:  Sequential user IDs: 1000, 1001, 1002, 1003, ..., 1999
HASH TABLE:  10 buckets (indices 0-9)

IDEAL DISTRIBUTION (Good Hash Function):
═══════════════════════════════════════════════════════════════════

Bucket 0: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 1: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 2: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 3: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 4: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 5: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 6: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 7: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 8: ●●●●●●●●●●●●●●●●●●●● (100 elements)
Bucket 9: ●●●●●●●●●●●●●●●●●●●● (100 elements)

Average chain length: 100
Lookup time: O(100) = manageable

TERRIBLE DISTRIBUTION (hash(x) = x % 10):
═══════════════════════════════════════════════════════════════════

Bucket 0: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
Bucket 1: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
Bucket 2: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (100 elements)
... (still even for this example, but see next example)

THE PROBLEM EMERGES with multiples of 10:
Input: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, ...

Bucket 0: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●... (ALL elements!)
Bucket 1: (empty)
Bucket 2: (empty)
...
Bucket 9: (empty)

ALL keys go to bucket 0!
Lookup time: O(n) — hash table becomes a linked list
```

Poor hash functions fail when input data has patterns that align with the hash function's behavior. The more structured your data, the more important it is to choose a hash function that thoroughly scrambles those patterns. Real-world data is almost always patterned.
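The worst case in the diagram takes only a few lines to reproduce. The `bucket_counts` helper below is a small utility written for this illustration:

```python
from collections import Counter

def bucket_counts(keys, hash_func, num_buckets):
    """Tally how many keys land in each bucket."""
    counts = Counter(hash_func(k) % num_buckets for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Multiples of 10 through the identity hash: everything hits bucket 0
keys = range(10, 1010, 10)  # 100 keys: 10, 20, ..., 1000
print(bucket_counts(keys, lambda x: x, 10))
# → [100, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

One chain holds all 100 keys; the other nine buckets never see a single element.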
Let's examine specific hash function designs that frequently cause problems in practice. These anti-patterns appear surprisingly often in production code.
Anti-Pattern 1: Using Modulo with Non-Prime Table Sizes
When hash table size shares common factors with key values, collisions cluster predictably:
```python
# BAD: Table size of 100 (highly composite: 100 = 2² × 5²)
def bad_hash(key):
    return key % 100

# Keys that are multiples of 10:
# 10 % 100 = 10
# 20 % 100 = 20
# ...
# 100 % 100 = 0
# 110 % 100 = 10  ← Collision with 10!
# 210 % 100 = 10  ← Another collision!

# Keys that are multiples of 50:
# 50 % 100 = 50
# 100 % 100 = 0
# 150 % 100 = 50  ← Collision!
# 200 % 100 = 0   ← Collision!

# Pattern: Any factor of 100 creates collision clusters

# BETTER: Use a prime table size
def better_hash(key):
    return key % 97  # 97 is prime

# With prime modulus, patterns in input data
# are less likely to align with table structure

# Keys that are multiples of 10:
# 10 % 97 = 10
# 20 % 97 = 20
# ...
# 100 % 97 = 3   ← Different bucket!
# 110 % 97 = 13  ← Different bucket!
# 210 % 97 = 16  ← Different bucket!

# Prime modulus "breaks up" regular patterns
```

Anti-Pattern 2: Ignoring Parts of the Key
Hash functions that only consider part of the key waste information and cause unnecessary collisions:
```python
# BAD: Only using first character
def terrible_hash(string_key):
    return ord(string_key[0]) if string_key else 0

# Results for common names:
# "Alice" → 65 (A)
# "Aaron" → 65 (A)  ← Collision!
# "Amy"   → 65 (A)  ← Collision!
# "Anna"  → 65 (A)  ← Collision!
# "Bob"   → 66 (B)
# "Bill"  → 66 (B)  ← Collision!

# With only 26 possible values, maximum 26 buckets are useful!

# BAD: Only using last N characters
def bad_suffix_hash(key):
    return hash(key[-3:])  # Only last 3 chars

# For file paths:
# "/var/log/app.log"     → hash("log")
# "/home/user/debug.log" → hash("log")  ← Collision!
# "/tmp/error.log"       → hash("log")  ← Collision!

# BAD: Truncating long keys
def truncated_hash(key):
    return hash(key[:10])  # Only first 10 chars

# For URLs:
# "https://api.example.com/users/123"  → hash("https://ap")
# "https://api.example.com/users/456"  → hash("https://ap")  ← Collision!
# "https://api.example.com/orders/789" → hash("https://ap")  ← Collision!
```

Anti-Pattern 3: Simple Additive Hashes
Just summing character values (or similar simple operations) creates many collision equivalence classes:
```python
# BAD: Summing character values (or byte values)
def additive_hash(s):
    return sum(ord(c) for c in s)

# Collisions are trivial to find:
# "ab" = 97 + 98 = 195
# "ba" = 98 + 97 = 195  ← Same hash!

# More examples:
# "abc" = 97 + 98 + 99 = 294
# "cab" = 99 + 97 + 98 = 294  ← Collision!
# "bac" = 98 + 97 + 99 = 294  ← Collision!
# Any permutation collides!

# Even worse - different strings:
# "ab" = 195
# "Z" + chr(195 - 90) = 195
# Can construct collisions easily

# This hash is ORDER-INSENSITIVE.
# That is catastrophic for strings, where order matters.

# The hash function treats "stop" and "pots" as identical!
# "stop" = 115 + 116 + 111 + 112 = 454
# "pots" = 112 + 111 + 116 + 115 = 454
# "spot" = 115 + 112 + 111 + 116 = 454
# "tops" = 116 + 111 + 112 + 115 = 454
```

Anti-Pattern 4: XOR Without Position Encoding
Using XOR alone has similar problems to addition—it's associative and commutative:
```python
# BAD: Simple XOR of all bytes
def xor_hash(s):
    result = 0
    for c in s:
        result ^= ord(c)
    return result

# XOR is commutative and associative:
# "ab" = ord('a') ^ ord('b') = 97 ^ 98 = 3
# "ba" = ord('b') ^ ord('a') = 98 ^ 97 = 3  ← Collision!

# Worse: XOR has self-cancellation
# x ^ x = 0
# "abba" = a ^ b ^ b ^ a = (a ^ a) ^ (b ^ b) = 0 ^ 0 = 0
# "cddc" = c ^ d ^ d ^ c = (c ^ c) ^ (d ^ d) = 0 ^ 0 = 0
# Any string with pairs of repeated chars hashes to 0!

# Even length matters:
# "ab" = 3
# "abab" = a ^ b ^ a ^ b = 0  ← Different string, very different hash!

# XOR is useful for COMBINING multiple independent hash values,
# but not for hashing sequential data directly
```

Hash functions using only addition, multiplication, or XOR without position-dependent weighting treat "abc" the same as "cba". For data where order matters (strings, sequences, coordinates), this is catastrophic. Good hash functions incorporate position information to distinguish otherwise-similar inputs.
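A minimal fix is to weight each character by its position. The polynomial hash below is a common sketch of this idea (base 31 is a conventional choice, not the only one):

```python
def polynomial_hash(s, base=31, mod=2**32):
    """Position-weighted hash: each character's contribution depends
    on where it appears, so permutations no longer collide."""
    h = 0
    for c in s:
        h = (h * base + ord(c)) % mod
    return h

# Additive and XOR hashes treat these pairs as identical; this one does not:
print(polynomial_hash("stop") == polynomial_hash("pots"))  # False
print(polynomial_hash("ab") == polynomial_hash("ba"))      # False
```

Because each character is multiplied by a different power of the base, reordering the input changes the output.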
These anti-patterns aren't just theoretical—they've caused real problems in production systems. Let's examine documented cases where poor hash function choices led to serious issues.
Case Study 1: Database Key Clustering
A company stored customer records in a hash-partitioned distributed database. The partition key was the customer ID, an auto-incrementing integer. The partition function was simple: partition = customer_id % num_partitions.
The problem? They had 8 partitions (a power of 2). Customer IDs from batch imports came in multiples of 1000. Since 1000 = 8 × 125, all batch-imported customers landed on the same partition:
That single partition became overloaded while others sat nearly empty.
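The arithmetic is easy to check directly: since 1000 is a multiple of 8, every batch-imported ID maps to partition 0.

```python
# Batch imports assigned customer IDs in multiples of 1000
batch_ids = range(1000, 50000, 1000)

# partition = customer_id % num_partitions, with 8 partitions
partitions = {customer_id % 8 for customer_id in batch_ids}
print(partitions)  # → {0}: 1000 = 8 × 125, so every batch ID hits partition 0
```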
```text
EXPECTED: Uniform distribution across 8 partitions
═══════════════════════════════════════════════════════════════════

Partition 0: ████████████░░░ 12.5% of data
Partition 1: ████████████░░░ 12.5% of data
Partition 2: ████████████░░░ 12.5% of data
Partition 3: ████████████░░░ 12.5% of data
Partition 4: ████████████░░░ 12.5% of data
Partition 5: ████████████░░░ 12.5% of data
Partition 6: ████████████░░░ 12.5% of data
Partition 7: ████████████░░░ 12.5% of data

ACTUAL: Skewed distribution due to ID patterns
═══════════════════════════════════════════════════════════════════

Partition 0: ████████████████████████████████████████ 85% of data!
Partition 1: ██░░░░░░░░░░░░░ 2% of data
Partition 2: ██░░░░░░░░░░░░░ 2% of data
Partition 3: ██░░░░░░░░░░░░░ 2% of data
Partition 4: ██░░░░░░░░░░░░░ 2% of data
Partition 5: ██░░░░░░░░░░░░░ 2% of data
Partition 6: ██░░░░░░░░░░░░░ 2% of data
Partition 7: ██░░░░░░░░░░░░░ 3% of data

Result:
- Partition 0 server overloaded
- Query latencies 10x higher than expected
- Horizontal scaling ineffective (other partitions unused)
```

Case Study 2: Cache Key Collisions
A web application cached rendered pages using URL paths as keys. The cache used a custom hash function that only considered the first 8 characters of the URL path.
For a site structure like:
```text
/products/electronics/phones/iphone-15
/products/electronics/phones/samsung-s24
/products/electronics/phones/pixel-8
```

All URLs hashed to the same value because they shared the prefix /product. Cache hit rates plummeted as entries constantly evicted each other, and the "cache" provided no benefit.
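The failure is visible immediately if you look at what a first-8-characters hash actually sees (a quick sketch using string slicing):

```python
urls = [
    "/products/electronics/phones/iphone-15",
    "/products/electronics/phones/samsung-s24",
    "/products/electronics/phones/pixel-8",
]

# A hash that reads only the first 8 characters sees one distinct input
prefixes = {url[:8] for url in urls}
print(prefixes)  # → {'/product'}: all three pages collide on one cache slot
```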
Case Study 3: The Zip Code Catastrophe
A shipping logistics system stored package data in a hash table keyed by destination zip code. The hash function was zipcode % table_size.
Zip codes in the United States follow geographic patterns. Most zip codes in a region share prefixes. A shipping hub serving primarily the 9XXXX region (California) found that most packages clustered in a small number of buckets:
The distribution was nowhere near uniform.
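The clustering can be sketched numerically. Assuming a table of 16,384 buckets (a hypothetical size chosen for illustration), the dense 90000-96199 range maps into a narrow band of residues and leaves most of the table empty:

```python
# Zip codes served by a California hub (9XXXX region)
zips = range(90000, 96200)
table_size = 16384

used_buckets = {z % table_size for z in zips}
print(len(used_buckets), "of", table_size, "buckets used")
# → 6200 of 16384 buckets used; the rest sit empty
```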
| Scenario | Poor Hash Choice | Result | Better Approach |
|---|---|---|---|
| Sequential IDs | ID % table_size | Predictable bucket assignment | Multiply-shift hash or bit mixing |
| Timestamps | timestamp % table_size | Time-correlated clustering | Mix all bits with Murmur/xxHash |
| IP addresses | Last octet only | 256 possible values | Hash full IP as 32-bit integer |
| URLs | First N chars | Prefix overlap clustering | Full string hash (FNV, xxHash) |
| Geographic data | Coordinate truncation | Spatial clustering preserved | Space-filling curve or full-precision hash |
Notice the common thread: real-world data has structure and patterns. Hash functions fail when they preserve these patterns instead of destroying them. The solution is always to use a hash function that thoroughly mixes all bits of the input, regardless of input patterns.
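As a sketch of that principle, running a patterned key through a MurmurHash3-style finalizer before taking the modulus breaks up the alignment:

```python
def fmix32(h):
    """32-bit finalization mixer in the style of MurmurHash3."""
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

ids = range(1000, 81000, 1000)  # IDs in multiples of 1000, as in Case Study 1
print(len({i % 8 for i in ids}))          # 1: raw modulo uses a single bucket
print(len({fmix32(i) % 8 for i in ids}))  # multiple buckets once bits are mixed
```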
How do you know if your hash function is good for your data? Beyond code review and intuition, we can empirically measure distribution quality.
Metric 1: Chi-Squared Test for Uniformity
The chi-squared (χ²) test measures whether observed frequencies deviate significantly from expected uniform distribution:
```python
import scipy.stats as stats

def test_hash_uniformity(hash_func, test_keys, num_buckets):
    """
    Test whether hash_func distributes test_keys uniformly.

    Returns: chi-squared statistic, p-value, and bucket counts.
    Lower chi-squared and higher p-value = better uniformity.
    """
    # Count elements in each bucket
    bucket_counts = [0] * num_buckets
    for key in test_keys:
        bucket = hash_func(key) % num_buckets
        bucket_counts[bucket] += 1

    # Expected count per bucket (uniform distribution)
    expected = len(test_keys) / num_buckets

    # Chi-squared statistic
    chi_sq = sum((obs - expected)**2 / expected for obs in bucket_counts)

    # Degrees of freedom = num_buckets - 1
    df = num_buckets - 1

    # P-value: probability of seeing this or a more extreme result if truly uniform
    p_value = 1 - stats.chi2.cdf(chi_sq, df)

    return chi_sq, p_value, bucket_counts

# Example: Test bad vs good hash function.
# Note: Python's built-in hash() is the identity for small integers,
# so it would NOT fix integer patterns; use an explicit bit mixer instead.
def bad_hash(x):
    return x  # Identity - preserves patterns

def good_hash(x):
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF  # mix the bits
    x = ((x ^ (x >> 16)) * 0x45d9f3b) & 0xFFFFFFFF
    return x ^ (x >> 16)

# Test with IDs assigned in steps of 10 (a common batch-allocation pattern)
test_keys = list(range(1000, 100000, 10))
num_buckets = 100

bad_chi, bad_p, bad_counts = test_hash_uniformity(bad_hash, test_keys, num_buckets)
good_chi, good_p, good_counts = test_hash_uniformity(good_hash, test_keys, num_buckets)

print(f"Bad hash:  χ² = {bad_chi:.1f}, p-value = {bad_p:.6f}")
print(f"Good hash: χ² = {good_chi:.1f}, p-value = {good_p:.6f}")

# Interpretation:
# - χ² close to df (99) = good uniformity
# - χ² >> df = non-uniform (patterns detected)
# - p-value > 0.05 = consistent with uniform distribution
# - p-value < 0.05 = likely non-uniform
```

Metric 2: Coefficient of Variation
A simpler metric is the coefficient of variation (CV) of bucket sizes—the standard deviation divided by the mean:
```python
import statistics

def hash_distribution_cv(hash_func, test_keys, num_buckets):
    """
    Calculate coefficient of variation of bucket counts.

    CV = 0:      Perfect uniformity (all buckets have the same count)
    CV ≈ 1/√m:   Expected for random distribution (m = mean bucket count)
    CV >> 1/√m:  Poor distribution
    """
    bucket_counts = [0] * num_buckets
    for key in test_keys:
        bucket = hash_func(key) % num_buckets
        bucket_counts[bucket] += 1

    mean = statistics.mean(bucket_counts)
    stdev = statistics.stdev(bucket_counts)
    cv = stdev / mean if mean > 0 else float('inf')

    # Expected CV for a truly random distribution
    expected_cv = 1 / (len(test_keys) / num_buckets) ** 0.5

    return cv, expected_cv, max(bucket_counts), min(bucket_counts)

# Example output:
# Good hash: CV=0.12, Expected=0.10, Max=105, Min=85
# Bad hash:  CV=5.24, Expected=0.10, Max=900, Min=0
```

Metric 3: Avalanche Effect
A good hash function exhibits the "avalanche effect"—a single bit change in input causes approximately 50% of output bits to flip. This ensures that similar inputs produce dissimilar outputs:
```python
import random

def count_bit_differences(a, b, bit_width=32):
    """Count the number of differing bits between a and b."""
    xor = a ^ b
    count = 0
    for i in range(bit_width):
        if xor & (1 << i):
            count += 1
    return count

def test_avalanche(hash_func, num_tests=1000):
    """
    Test avalanche effect: changing 1 input bit should flip ~50% of output bits.

    Returns: average percentage of flipped bits for single-bit input changes.
    Ideal: 50%
    Poor: <30% or >70%
    """
    total_flips = 0
    bit_width = 32

    for _ in range(num_tests):
        # Random input
        x = random.randint(0, 2**32 - 1)
        h1 = hash_func(x) & 0xFFFFFFFF  # Ensure 32-bit

        # Flip one random bit
        bit_position = random.randint(0, 31)
        x_flipped = x ^ (1 << bit_position)
        h2 = hash_func(x_flipped) & 0xFFFFFFFF

        # Count bit differences
        flips = count_bit_differences(h1, h2, bit_width)
        total_flips += flips

    avg_flips = total_flips / num_tests
    avg_percent = (avg_flips / bit_width) * 100
    return avg_percent

# Typical results:
# Identity hash (h(x)=x):  ~3.1% (only the changed bit differs)
# Add constant (h(x)=x+C): ~3.1% (addition doesn't spread bits)
# Multiply hash:           ~25%  (better, but not great)
# xxHash/Murmur:           ~50%  (excellent avalanche)
```

The best test for a hash function is against your actual production data patterns. Generate representative test data (sequential IDs, timestamps, user-submitted strings, etc.) and measure distribution quality. A hash function that works well for random data may fail catastrophically for your specific input patterns.
With an understanding of what can go wrong, let's examine proven hash function families and when to use each.
For General-Purpose Hashing:
| Hash Function | Speed | Quality | Best For |
|---|---|---|---|
| xxHash (xxh32/xxh64) | Extremely fast | Excellent | General purpose, large data |
| MurmurHash3 | Very fast | Excellent | Hash tables, checksums |
| FNV-1a | Fast | Good | Short strings, simple implementation |
| CityHash | Very fast | Excellent | Strings, Google's choice |
| SipHash | Moderate | Excellent + secure | Hash tables with untrusted input |
| djb2 | Fast | Acceptable | Simple string hashing |
Implementation Examples:
```python
# djb2 - Simple but effective string hash
def djb2_hash(s):
    """
    Dan Bernstein's hash function.
    - Uses magic constant 5381 (empirically chosen)
    - Multiply by 33 (= 32 + 1 = (hash << 5) + hash)
    - Addition incorporates each character
    """
    hash_value = 5381
    for char in s:
        hash_value = ((hash_value << 5) + hash_value) + ord(char)
    return hash_value & 0xFFFFFFFF  # Keep 32 bits

# FNV-1a - Fowler-Noll-Vo hash (variant a)
def fnv1a_hash(data):
    """
    FNV-1a hash - XOR then multiply.
    - Excellent distribution
    - Simple implementation
    - Good for strings and binary data
    """
    FNV_PRIME = 0x01000193   # 16777619
    FNV_OFFSET = 0x811c9dc5  # 2166136261

    hash_value = FNV_OFFSET
    for byte in data.encode() if isinstance(data, str) else data:
        hash_value ^= byte
        hash_value = (hash_value * FNV_PRIME) & 0xFFFFFFFF
    return hash_value

# Knuth's multiplicative hash (for integers)
def mult_hash(key, table_size):
    """
    Knuth's multiplicative method for integers.
    A = 2654435769 ≈ 2^32 / φ (golden ratio)
    """
    A = 2654435769
    # Multiply mod 2^32, then take the high bits (they're better mixed)
    return (((key * A) & 0xFFFFFFFF) >> 16) % table_size

# MurmurHash3 (simplified finalization mixer)
def murmur3_fmix32(h):
    """
    MurmurHash3 finalizer - excellent bit mixing.
    Use this to improve poor hash values.
    """
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Usage: If you have a weak hash, strengthen it:
# good_hash = murmur3_fmix32(weak_hash(key))
```

Key Principles for Hash Function Selection:
- Prefer built-ins: Python's hash(), Java's hashCode(), etc. are well-tested and typically sufficient.

Modern programming languages have well-designed built-in hash functions. Unless you have specific requirements (cryptographic security, extreme speed, embedded systems), the default hash function is usually the right choice. Custom hash functions are a common source of subtle bugs.
What if you've inherited code with a poor hash function, or you're stuck with a hash that can't be changed (e.g., database partition keys)? There are techniques to improve distribution without replacing the hash function entirely.
Technique 1: Hash Function Chaining (Bit Mixing)
Apply a high-quality mixing function to the output of a poor hash:
```python
def improve_hash(poor_hash_value):
    """
    Apply bit mixing to improve distribution of a poor hash value.
    Uses MurmurHash3 finalization.
    """
    h = poor_hash_value & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

# Before: poor_hash(sequential_id) clusters badly
# After:  improve_hash(poor_hash(sequential_id)) is uniform

# Example with sequential IDs:
print("Poor hash distribution (ID % 16):")
for i in range(1000, 1016):
    print(f"  {i} → bucket {i % 16}")

print("Improved hash distribution:")
for i in range(1000, 1016):
    print(f"  {i} → bucket {improve_hash(i) % 16}")

# Output shows improved scattering across buckets
```

Technique 2: Salting
Combine a constant salt value with the key before hashing to change the collision pattern:
```python
# Without salt: clustering happens
def unsalted_hash(key, table_size):
    return hash(key) % table_size

# With salt: patterns are disrupted
SALT = 0x9E3779B9  # Golden ratio constant

def salted_hash(key, table_size):
    if isinstance(key, int):
        mixed = key ^ SALT
        mixed = (mixed * 0x85ebca6b) & 0xFFFFFFFF
    else:
        # For strings, prepend salt or combine differently
        mixed = hash(f"{SALT}{key}")
    return mixed % table_size

# For database sharding with problematic keys:
def shard_key(user_id, num_shards):
    """
    Improved sharding that avoids clustering from sequential IDs.
    """
    # Mix the user_id with a salt
    mixed = user_id ^ 0x9E3779B9
    mixed = (mixed * 2654435761) & 0xFFFFFFFF
    mixed ^= mixed >> 16
    return mixed % num_shards
```

Technique 3: Secondary Hash (Double Hashing)
For stubborn clustering, use a completely different hash function as a secondary check:
```python
class DoubleHashTable:
    """
    Use two independent hash functions to combat poor distribution.
    Primary hash determines bucket; secondary hash resolves collisions.
    """

    def __init__(self, size):
        self.size = size
        self.table = [None] * size

    def _hash1(self, key):
        """Primary hash - even if poor, secondary will help."""
        return hash(key) % self.size

    def _hash2(self, key):
        """Secondary hash - non-zero step; coprime with size when size is prime."""
        # Use a different algorithm for independence
        h = 5381
        for c in str(key):
            h = ((h << 5) + h) ^ ord(c)
        # Ensure a non-zero step size
        return 1 + (h % (self.size - 1))

    def insert(self, key, value):
        """Insert with double hashing probe sequence."""
        h1 = self._hash1(key)
        h2 = self._hash2(key)
        for i in range(self.size):
            idx = (h1 + i * h2) % self.size
            if self.table[idx] is None or self.table[idx][0] == key:
                self.table[idx] = (key, value)
                return
        raise Exception("Hash table full")
```

While these techniques can improve poor hash functions, they add complexity and computational overhead. It's always better to choose a good hash function from the start. Reserve these techniques for legacy systems where changing the hash function would require data migration or break compatibility.
When working with custom objects or composite keys, you'll need to design hash functions that combine multiple fields. This is where many mistakes happen.
The Combination Problem:
Given an object with fields (a, b, c), how do we combine their individual hash values into one hash for the object?
Bad Approach: Simple XOR
```python
# BAD: XOR combination
class BadPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash(self.x) ^ hash(self.y)  # Don't do this!

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Problems:
p1 = BadPoint(3, 5)
p2 = BadPoint(5, 3)
print(hash(p1) == hash(p2))  # True! Order doesn't matter with XOR

p3 = BadPoint(7, 7)
print(hash(p3))  # 0! x^x = 0 for any x

# Many distinct points will collide:
# (1,2), (2,1), (3,0), (0,3) all have related hashes
```

Good Approach: Position-Weighted Combination
```python
# GOOD: Python's recommended approach
class GoodPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Use tuple hashing - properly handles ordering and combination
        return hash((self.x, self.y))

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Alternative: Explicit prime-based combination
class BetterPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        # Combine with primes to preserve order
        result = 17  # Non-zero start
        result = 31 * result + hash(self.x)
        result = 31 * result + hash(self.y)
        return result

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

# Now (3,5) and (5,3) have different hashes
p1 = BetterPoint(3, 5)
p2 = BetterPoint(5, 3)
print(hash(p1) == hash(p2))  # False - correctly distinguished

# The same pattern in Java:
'''
@Override
public int hashCode() {
    int result = 17;
    result = 31 * result + x;
    result = 31 * result + y;
    return result;
}
'''
```

Best Practices for Custom Hash Functions:
- Use tuple hashing where available: hash((a, b, c)) in Python is optimized and correct.
- Combine fields with a prime multiplier: 31 * running_hash + field_hash

If you override equals(), you MUST override hashCode(). Objects that are equal must have identical hash codes. Violating this contract causes hash table lookups to fail—you can insert an object but never find it again!
Poor hash function choices silently undermine hash table performance. Understanding what makes a hash function good—and recognizing common failures—is essential for building efficient systems.
What's Next:
Hash functions and collision handling determine when hash tables work well. But there are problems where hash tables, even with perfect hash functions, are simply the wrong choice. The next page explores scenarios where ordered data structures like balanced binary search trees are fundamentally better suited to the problem requirements.
You now understand what makes hash functions fail and how to choose, evaluate, and fix them. This knowledge helps you avoid subtle performance bugs that can silently cripple system performance. Next, we'll examine when hash tables themselves are the wrong tool for the job.