If buckets are the containers of a hash index, then the hash function is the routing mechanism that directs each key to its designated bucket. The quality of this routing determines everything: lookup performance, space utilization, and the frequency of collisions that degrade O(1) operations to linear searches.
A hash function for database indexing must satisfy demanding, often conflicting requirements. It must be fast enough that it doesn't become the bottleneck in high-throughput systems. It must distribute keys uniformly to prevent bucket hotspots. It must be deterministic so the same key always reaches the same bucket. And it must handle arbitrary key types—integers, strings, composite keys, binary blobs—with equal effectiveness.
Designing or selecting a hash function is not a trivial decision. The wrong choice can turn an O(1) data structure into an O(n) liability, silently degrading system performance until it becomes a crisis.
By the end of this page, you will understand: (1) The mathematical properties that define an effective hash function, (2) Common hash algorithms used in database systems, (3) How to evaluate hash function quality, (4) Techniques for hashing different key types, and (5) Production-grade implementation considerations.
A function h: K → {0, 1, ..., N-1} that maps keys from domain K to bucket indices must exhibit several properties to be useful for database indexing. Understanding these properties helps in both selecting existing hash functions and recognizing when a chosen function is underperforming.
The most fundamental requirement: given the same key and the same hash function configuration, the output must always be identical.
h(k) = h(k), always
This seems obvious, but violations occur more often than expected. Hashing uninitialized memory (such as struct padding), floating-point keys with inconsistent NaN or signed-zero handling, or functions that sample random values internally all break determinism. Without it, a stored record becomes unfindable.
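A common way this bug appears in practice is hashing a struct's raw bytes, which includes compiler-inserted padding. The sketch below is illustrative; `murmurhash3_32` stands in for whatever byte-oriented hash the index uses (its full implementation appears later on this page).

```c
#include <stdint.h>
#include <string.h>

uint32_t murmurhash3_32(const void *key, size_t len, uint32_t seed); /* defined later */

typedef struct {
    uint8_t  category;   /* 1 byte */
    /* compiler inserts 7 padding bytes here */
    uint64_t product_id; /* 8 bytes */
} CompositeKey;

/* BROKEN: padding bytes are indeterminate, so two "equal" keys
 * can produce different hashes */
uint32_t hash_key_broken(const CompositeKey *k) {
    return murmurhash3_32(k, sizeof(*k), 0);
}

/* CORRECT: hash only the fields (or memset the struct to zero
 * before populating it) */
uint32_t hash_key_fixed(const CompositeKey *k) {
    uint8_t buf[1 + 8];
    buf[0] = k->category;
    memcpy(buf + 1, &k->product_id, sizeof(k->product_id));
    return murmurhash3_32(buf, sizeof(buf), 0);
}
```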
The hash function should distribute keys as evenly as possible across all bucket indices. Formally, for a uniform random key k drawn from K:
P(h(k) = i) = 1/N for all buckets i ∈ {0, 1, ..., N-1}
In practice, real keys are not uniformly random. They cluster (sequential IDs), have patterns (timestamps), or share prefixes (URLs). A good hash function must break these patterns and produce outputs that appear random even when inputs are structured.
| Property | Definition | Consequence of Violation |
|---|---|---|
| Determinism | Same input always yields same output | Records become unfindable after storage |
| Uniformity | All outputs equally probable | Bucket hotspots cause overflow chains |
| Efficiency | Computation completes in O(key_length) | Hash computation becomes bottleneck |
| Sensitivity | Small input changes → large output changes | Similar keys cluster in same buckets |
| Completeness | Uses all input bits in computation | Unused bits create unnecessary collisions |
A desirable property for high-quality hash functions: changing a single bit in the input should change approximately half of the output bits, so that similar keys land in unrelated buckets.
The avalanche effect quantifies how well the hash function "mixes" input bits. Poor mixing leads to clustering; good mixing leads to pseudo-random distribution.
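A simple way to check this empirically is to flip each input bit and count how many output bits change. A minimal sketch, assuming the `murmurhash3_32` function shown later on this page and the GCC/Clang `__builtin_popcount` intrinsic:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

uint32_t murmurhash3_32(const void *key, size_t len, uint32_t seed); /* defined below */

/* Count differing output bits when each input bit of a 64-bit key
 * is flipped. Ideal avalanche: average near 16 of 32 output bits. */
void avalanche_test(uint64_t key) {
    uint32_t base = murmurhash3_32(&key, sizeof(key), 0);
    int total = 0;
    for (int bit = 0; bit < 64; bit++) {
        uint64_t flipped = key ^ (1ULL << bit);
        uint32_t h = murmurhash3_32(&flipped, sizeof(flipped), 0);
        total += __builtin_popcount(base ^ h); /* differing output bits */
    }
    printf("avg output bits changed: %.2f (ideal: 16)\n", total / 64.0);
}
```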
Unlike cryptographic hashing where security justifies computational expense, database hash functions must be fast. Hash computation occurs on every index operation: each insert, lookup, update, and delete recomputes the hash of the search key.
For a table processing 100,000 queries per second, even a 1μs hash computation adds 100ms of aggregate CPU time per second—a significant overhead.
Do NOT use cryptographic hash functions (SHA-256, MD5) for database indexing. They're designed for security, not speed. SHA-256 takes ~30-100 CPU cycles per byte; database hash functions aim for <1 cycle per byte. The collision resistance of crypto hashes is irrelevant for indexing—we handle collisions explicitly.
Several hash function families have proven effective for database indexing. Understanding their characteristics helps in selecting the right function for specific workloads.
The simplest approach: divide the key by the number of buckets and use the remainder.
h(k) = k mod N
For integer keys, this is extremely fast (a single CPU instruction) but has limitations: the distribution is entirely at the mercy of N. If N is a power of 2, the result uses only the key's low-order bits, and keys that share a common factor with N map to only a fraction of the buckets.
Mitigation: Choose N as a prime number not close to a power of 2. For example, if you need ~1000 buckets, use 1009 (prime) rather than 1024 (power of 2).
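To see the failure mode concretely, consider keys that arrive in strides of 4, a common pattern with aligned or batched IDs. A small demonstration, assuming a power-of-2 bucket count:

```c
#include <stdio.h>

/* With N = 1024 and keys that are multiples of 4, only every 4th
 * bucket is ever used: 75% of the table sits empty. */
int main(void) {
    for (int i = 0; i < 8; i++) {
        int key = 4 * i;
        printf("key=%d -> bucket=%d\n", key, key % 1024);
    }
    /* Output: buckets 0, 4, 8, 12, ... and never 1, 2, 3, 5, ... */
    return 0;
}
```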
```c
/* Division method hash - simplest form */
#include <stdint.h>
#include <stdio.h>

#define BUCKET_COUNT 1009 /* Prime near 1000 */

uint32_t hash_division(uint64_t key, uint32_t bucket_count) {
    return (uint32_t)(key % bucket_count);
}
/* Example usage: uint32_t bucket = hash_division(user_id, BUCKET_COUNT); */

/* Problem demonstration: sequential keys */
int main(void) {
    for (int i = 0; i < 10; i++) {
        printf("key=%d -> bucket=%u\n", i, hash_division(i, BUCKET_COUNT));
    }
    /* Output: 0,1,2,3,4,5,6,7,8,9 - no mixing! */
    return 0;
}
```

The multiplication method addresses the weaknesses of division by introducing a mixing multiply:
h(k) = ⌊N × (k × A mod 1)⌋
Where A is a carefully chosen constant. Knuth recommends A = (√5 - 1) / 2 ≈ 0.618033988... (the reciprocal of the golden ratio).
The multiplication method breaks up sequential key patterns, and when N is a power of 2 the bucket index is computed with just a multiply and a shift, with no division instruction required.
```c
/* Knuth's multiplication method */
#include <stdint.h>
#include <stdio.h>

/* Golden ratio constant in fixed-point representation */
/* A = (sqrt(5) - 1) / 2 ≈ 0.618033988... */
/* As 32-bit fixed point: A * 2^32 = 2654435769 */
#define KNUTH_MULTIPLIER 2654435769UL

/* Example: 1024 buckets (2^10) */
#define LOG2_BUCKETS 10

uint32_t hash_multiplication(uint64_t key, uint32_t log2_buckets) {
    /* Multiply key by golden ratio constant */
    uint32_t mixed = (uint32_t)(key * KNUTH_MULTIPLIER);

    /* Extract high-order bits for bucket index */
    /* log2_buckets = log2(N), assumes N is power of 2 */
    return mixed >> (32 - log2_buckets);
}
/* Example usage: uint32_t bucket = hash_multiplication(user_id, LOG2_BUCKETS); */

/* Sequential keys now distribute better */
int main(void) {
    for (int i = 0; i < 10; i++) {
        printf("key=%d -> bucket=%u\n", i,
               hash_multiplication(i, LOG2_BUCKETS));
    }
    /* Output: varied distribution, not sequential */
    return 0;
}
```

For variable-length keys like strings, a common approach treats the string as a polynomial:
h(s) = (s[0] × p^(n-1) + s[1] × p^(n-2) + ... + s[n-1]) mod N
Where p is a prime base (commonly 31, 37, or 53 for alphanumeric strings) and s[i] are character codes.
Horner's method evaluates this efficiently in O(n) time:
```
h = 0
for each character c in string:
    h = (h × p + c) mod N
```
```c
/* Polynomial rolling hash for strings */
#include <stdint.h>
#include <stddef.h>

uint32_t hash_polynomial(const char *str, size_t len, uint32_t bucket_count) {
    const uint64_t BASE = 31; /* Prime base for mixing */
    uint64_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        /* Intermediate results wrap mod 2^64, implicit for uint64_t */
        hash = hash * BASE + (unsigned char)str[i];
    }

    return (uint32_t)(hash % bucket_count);
}

/* Improved version: reduce modulo a large prime, mix at end */
uint32_t hash_polynomial_improved(const char *str, size_t len,
                                  uint32_t bucket_count) {
    const uint64_t BASE = 31;
    const uint64_t MOD = 0x1FFFFFFFFFFFFFFFULL; /* Mersenne prime 2^61 - 1 */
    uint64_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        /* 128-bit intermediate (GCC/Clang extension) avoids overflow
         * before the reduction */
        hash = (uint64_t)(((__uint128_t)hash * BASE
                           + (unsigned char)str[i]) % MOD);
    }

    /* Apply secondary mixing before modulo by bucket count */
    hash ^= (hash >> 33);
    hash *= 0xff51afd7ed558ccdULL;
    hash ^= (hash >> 33);

    return (uint32_t)(hash % bucket_count);
}
```

String hash functions must handle: (1) Case sensitivity (should 'ABC' = 'abc'?), (2) Character encoding (UTF-8 vs. UTF-16), (3) Locale-specific comparisons, (4) Trailing spaces. These choices must match the equality semantics of the index: a hash that ignores case must correspond to case-insensitive lookups, as the sketch below illustrates.
Modern database systems employ sophisticated hash functions designed for both distribution quality and computational efficiency. These functions leverage CPU features like pipelining, SIMD instructions, and hardware multipliers.
Developed by Austin Appleby, MurmurHash3 is widely used in databases (including Apache Cassandra, Redis, and many others). Its key characteristics: strong avalanche behavior, throughput of several GB/s on modern hardware, and both 32-bit and 128-bit output variants.
```c
/* MurmurHash3 (32-bit variant) */
#include <stdint.h>
#include <stddef.h>

/* MurmurHash3 finalization mix (32-bit) */
static inline uint32_t fmix32(uint32_t h) {
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}

/* MurmurHash3 for 32-bit output */
uint32_t murmurhash3_32(const void *key, size_t len, uint32_t seed) {
    const uint8_t *data = (const uint8_t*)key;
    const int nblocks = len / 4;

    uint32_t h1 = seed;
    const uint32_t c1 = 0xcc9e2d51;
    const uint32_t c2 = 0x1b873593;

    /* Body: process 4-byte blocks */
    const uint32_t *blocks = (const uint32_t*)(data);
    for (int i = 0; i < nblocks; i++) {
        uint32_t k1 = blocks[i];
        k1 *= c1;
        k1 = (k1 << 15) | (k1 >> 17); /* ROTL32 */
        k1 *= c2;

        h1 ^= k1;
        h1 = (h1 << 13) | (h1 >> 19);
        h1 = h1 * 5 + 0xe6546b64;
    }

    /* Tail: process remaining bytes */
    const uint8_t *tail = data + nblocks * 4;
    uint32_t k1 = 0;
    switch (len & 3) {
        case 3: k1 ^= tail[2] << 16; /* fall through */
        case 2: k1 ^= tail[1] << 8;  /* fall through */
        case 1: k1 ^= tail[0];
                k1 *= c1;
                k1 = (k1 << 15) | (k1 >> 17);
                k1 *= c2;
                h1 ^= k1;
    }

    /* Finalization */
    h1 ^= len;
    h1 = fmix32(h1);
    return h1;
}
```

Developed by Yann Collet (creator of LZ4 compression), xxHash is one of the fastest non-cryptographic hash functions available.
xxHash achieves its speed by processing the input in several independent accumulator lanes, which keeps the CPU's instruction pipelines full, combined with wide 64-bit multiplies and cheap rotate-based mixing; the simplified sketch below illustrates the lane idea.
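A simplified illustration of the multi-lane technique (not the real xxHash algorithm: the actual seeding, lane merging, and tail handling differ, though the per-lane round and the prime constants follow xxHash64's published scheme):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: four independent lanes let the CPU execute
 * four multiply/rotate chains in parallel instead of one serial
 * dependency chain. */
uint64_t multilane_sketch(const uint64_t *words, size_t nwords) {
    const uint64_t P1 = 0x9E3779B185EBCA87ULL; /* xxHash64 prime 1 */
    const uint64_t P2 = 0xC2B2AE3D27D4EB4FULL; /* xxHash64 prime 2 */
    uint64_t lane[4] = { P1 + P2, P2, 0, (uint64_t)0 - P1 }; /* seed = 0 */

    for (size_t i = 0; i + 4 <= nwords; i += 4) {
        for (int j = 0; j < 4; j++) {
            lane[j] += words[i + j] * P2;
            lane[j] = (lane[j] << 31) | (lane[j] >> 33); /* ROTL64 */
            lane[j] *= P1;
        }
    }
    /* Real xxHash merges lanes carefully and processes the tail;
     * omitted here for brevity */
    return lane[0] ^ lane[1] ^ lane[2] ^ lane[3];
}
```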
| Hash Function | Throughput (GB/s) | Quality Score | Use Case |
|---|---|---|---|
| Division (mod) | 50+ | Poor | Integer keys only, prime N required |
| Polynomial | 0.5-2 | Medium | Strings, compatible with range queries |
| FNV-1a | 2-3 | Good | Simple, good for short keys |
| MurmurHash3 | 3-4 | Excellent | General purpose, widely adopted |
| xxHash64 | 10-15 | Excellent | High-throughput systems |
| CityHash | 5-10 | Excellent | Google's hash, good for strings |
| SHA-256 | 0.3-0.5 | Cryptographic | NEVER use for indexing |
For most database indexing scenarios, MurmurHash3 or xxHash64 are strong default choices: both offer excellent distribution, high speed, and years of production testing. Use xxHash for maximum throughput; use MurmurHash3 for broader ecosystem compatibility.
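In practice, wiring one of these functions into a hash index is a one-liner. A sketch, assuming the `murmurhash3_32` implementation above:

```c
#include <stdint.h>
#include <stddef.h>

uint32_t murmurhash3_32(const void *key, size_t len, uint32_t seed); /* from above */

/* Map a key to a bucket: hash first, then reduce to the bucket range.
 * Because the hash output is already well-mixed, reducing by any N is
 * safe here, even a power of 2 (unlike the raw division method). */
uint32_t key_to_bucket(const void *key, size_t len,
                       uint32_t seed, uint32_t num_buckets) {
    return murmurhash3_32(key, len, seed) % num_buckets;
}
```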
Real-world database indexes often use composite keys—keys composed of multiple columns (e.g., (user_id, timestamp) or (category, product_id)). Hashing composite keys requires combining individual column hashes into a single bucket identifier.
The simplest method concatenates column values and hashes the result:
composite_key = column1 || column2 || ... || columnN
bucket = hash(composite_key) mod N
This works but has drawbacks: every column must be copied into a contiguous buffer, and variable-length columns make the encoding ambiguous unless lengths or separators are included: ('ab', 'c') and ('a', 'bc') concatenate to the same bytes, as the sketch below shows.
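A minimal sketch of disambiguating concatenation with length prefixes (the two-column signature and the final hash call are illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Length-prefix each column so ('ab','c') and ('a','bc') serialize
 * differently. Returns the number of bytes written into buf, which
 * the caller then hashes, e.g. murmurhash3_32(buf, n, seed). */
size_t serialize_columns(uint8_t *buf, const char *c1, const char *c2) {
    size_t n1 = strlen(c1), n2 = strlen(c2), off = 0;
    uint32_t len;

    len = (uint32_t)n1;
    memcpy(buf + off, &len, 4); off += 4; /* 4-byte length prefix */
    memcpy(buf + off, c1, n1);  off += n1;

    len = (uint32_t)n2;
    memcpy(buf + off, &len, 4); off += 4;
    memcpy(buf + off, c2, n2);  off += n2;
    return off;
}
```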
Production systems typically hash each column independently and combine the results:
```c
#include <stdint.h>
#include <stddef.h>

/* Assumed 64-bit MurmurHash3 variant */
uint64_t murmurhash3_64(const void *key, size_t len, uint64_t seed);

/* Method 1: XOR combination (simple but weak) */
uint64_t hash_composite_xor(uint64_t h1, uint64_t h2) {
    return h1 ^ h2;
}
/* Problem: XOR is commutative, so (a,b) = (b,a) */
/* Problem: XOR of same values is 0 */

/* Method 2: Polynomial combination (better) */
uint64_t hash_composite_poly(uint64_t h1, uint64_t h2) {
    const uint64_t PRIME = 31;
    return h1 * PRIME + h2;
}
/* Preserves order: (a,b) ≠ (b,a) */
/* Still has collision patterns */

/* Method 3: Boost-style hash combine (widely used) */
uint64_t hash_combine(uint64_t seed, uint64_t value) {
    /* Magic constant derived from golden ratio */
    return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

/* Usage for composite key (user_id, timestamp) */
uint64_t hash_user_timestamp(uint64_t user_id, uint64_t timestamp,
                             uint32_t bucket_count) {
    uint64_t h = 0;
    h = hash_combine(h, murmurhash3_64(&user_id, sizeof(user_id), 0));
    h = hash_combine(h, murmurhash3_64(&timestamp, sizeof(timestamp), 0));
    return h % bucket_count;
}

/* Method 4: Murmur-style mixing (excellent distribution) */
uint64_t hash_composite_murmur(const void **columns, const size_t *lengths,
                               int num_columns, uint32_t seed) {
    uint64_t h = seed;
    for (int i = 0; i < num_columns; i++) {
        /* Seed each column hash with the running hash */
        uint64_t col_hash = murmurhash3_64(columns[i], lengths[i], h);
        /* Mix with rotation and multiplication */
        h = ((h << 5) | (h >> 59)) ^ col_hash;
        h *= 0xc4ceb9fe1a85ec53ULL;
    }
    /* Final mixing */
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return h;
}
```

When columns can be NULL, hash functions must handle this consistently. Common options: (1) map NULL to a fixed sentinel hash value, (2) skip NULL columns entirely when combining, or (3) hash an explicit per-column null-indicator byte before the column value.
Most SQL databases use option 1 or 3. The choice affects query planning—NULL = NULL evaluates to UNKNOWN in SQL, which has implications for hash join probes.
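A sketch of option 3, hashing a null-indicator byte ahead of each column (`hash_combine` is the helper defined above; `murmurhash3_64` is the assumed 64-bit variant):

```c
#include <stdint.h>
#include <stddef.h>

uint64_t hash_combine(uint64_t seed, uint64_t value);                /* from above */
uint64_t murmurhash3_64(const void *key, size_t len, uint64_t seed); /* assumed */

/* Hash one possibly-NULL column into a running combined hash.
 * The indicator byte ensures (NULL, x) and (x, NULL) differ, and
 * that a NULL never collides with a value whose hash happens to
 * equal a sentinel. */
uint64_t hash_nullable_column(uint64_t h, const void *value, size_t len) {
    uint8_t is_null = (value == NULL);
    h = hash_combine(h, is_null);
    if (!is_null) {
        h = hash_combine(h, murmurhash3_64(value, len, 0));
    }
    return h;
}
```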
When combining column hashes, the combination function should NOT be commutative. The keys (user_id=1, product_id=2) and (user_id=2, product_id=1) must hash to different buckets. XOR-only combination violates this. Always use order-preserving combinations like polynomial or Boost-style hash_combine.
How do you know if your hash function is performing well? Production database systems should monitor hash index quality and alert when distribution becomes pathological.
The chi-square statistic measures deviation from expected uniform distribution:
χ² = Σ (observed_i - expected)² / expected
For a hash index with N buckets and n records, expected records per bucket = n/N.
Interpretation: a χ² close to the degrees of freedom (N - 1) indicates a healthy, uniform distribution; a value far above it signals clustering; a value suspiciously below it usually means the test data is artificial or the counting code is buggy.
```c
/* Chi-square test for hash distribution quality */
#include <stdint.h>
#include <stdio.h>

double compute_chi_square(const uint32_t *bucket_counts,
                          uint32_t num_buckets,
                          uint64_t total_records) {
    double expected = (double)total_records / num_buckets;
    double chi_square = 0.0;

    for (uint32_t i = 0; i < num_buckets; i++) {
        double diff = bucket_counts[i] - expected;
        chi_square += (diff * diff) / expected;
    }
    return chi_square;
}

/* Evaluate hash function quality */
void evaluate_hash_quality(const uint32_t *bucket_counts,
                           uint32_t num_buckets,
                           uint64_t total_records) {
    double chi_sq = compute_chi_square(bucket_counts, num_buckets,
                                       total_records);
    double expected_chi_sq = num_buckets - 1; /* degrees of freedom */
    double ratio = chi_sq / expected_chi_sq;

    printf("Chi-square statistic: %.2f\n", chi_sq);
    printf("Expected (uniform):   %.2f\n", expected_chi_sq);
    printf("Ratio:                %.2f\n", ratio);

    /* Check the most severe condition first */
    if (ratio > 2.0) {
        printf("CRITICAL: Severe clustering, consider different hash function\n");
    } else if (ratio > 1.5) {
        printf("WARNING: Significant non-uniformity detected\n");
    } else if (ratio < 0.9) {
        printf("WARNING: Suspiciously uniform (check for bugs)\n");
    } else {
        printf("OK: Distribution within acceptable range\n");
    }
}
```

Beyond chi-square, database administrators should monitor:
| Metric | Healthy Value | Warning Threshold | Action if Exceeded |
|---|---|---|---|
| Max bucket load / Average load | < 3 | 5 | Check for key patterns; consider rehash |
| Empty bucket % | 20-40% (at 70% fill) | 60% | Index oversized; consider rebuild |
| Overflow chain % | < 5% | 15% | Index undersized; consider rebuild |
| Avg overflow chain length | < 2 | 4 | Severe clustering; change hash function |
| Longest overflow chain | < 10 | 20 | Investigate specific key patterns |
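A sketch of computing the first two metrics from a bucket-count array (the warning thresholds mirror the table above):

```c
#include <stdint.h>
#include <stdio.h>

/* Report max/average load ratio and empty-bucket percentage */
void report_load_metrics(const uint32_t *bucket_counts,
                         uint32_t num_buckets, uint64_t total_records) {
    uint32_t max_load = 0, empty = 0;
    for (uint32_t i = 0; i < num_buckets; i++) {
        if (bucket_counts[i] > max_load) max_load = bucket_counts[i];
        if (bucket_counts[i] == 0) empty++;
    }
    double avg_load = (double)total_records / num_buckets;
    printf("max/avg load ratio: %.2f (warn above 5)\n", max_load / avg_load);
    printf("empty buckets:      %.1f%% (warn above 60%%)\n",
           100.0 * empty / num_buckets);
}
```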
Always test hash functions with REALISTIC data, not random inputs. Hash functions that perform well on random data may fail spectacularly on production patterns like sequential IDs, timestamps with second granularity, or URLs with common prefixes. Extract a sample of production keys and measure actual distribution.
Many hash functions accept a seed parameter that modifies their output. Understanding how to use seeds effectively can improve distribution and provide defense against certain attack vectors.
Breaking systematic patterns: If a hash function clusters on certain key patterns, a different seed may break that clustering.
Universal hashing: Randomly selecting a hash function from a family provides probabilistic guarantees about worst-case performance.
Hash DoS mitigation: Adversaries who know your hash function can craft keys that all collide. Random seeds prevent this (more relevant for hash tables than database indexes, but still applicable).
A hash function family H is universal if for any two distinct keys k₁ and k₂:
P[h(k₁) = h(k₂)] ≤ 1/N when h is chosen uniformly at random from H.
With universal hashing, the expected number of collisions for any key is bounded, regardless of the key distribution.
```c
/* Simple universal hashing family for integers */
/* h(k) = ((a * k + b) mod p) mod N */
/* where p is prime > key range, a ∈ [1, p-1], b ∈ [0, p-1] */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t a;           /* Multiplier, must be non-zero */
    uint64_t b;           /* Offset */
    uint64_t prime;       /* Large prime, > max key value */
    uint32_t num_buckets;
} UniversalHashParams;

/* Initialize with random parameters */
UniversalHashParams universal_hash_init(uint32_t num_buckets, uint64_t seed) {
    /* Use seed to generate random a and b */
    /* Note: lrand48() yields 31-bit values, so a and b cover only
       part of [0, p-1]; a full-width PRNG would be stronger */
    srand48(seed);
    UniversalHashParams params;
    params.prime = 2305843009213693951ULL; /* Mersenne prime 2^61 - 1 */
    params.num_buckets = num_buckets;
    params.a = (lrand48() % (params.prime - 1)) + 1; /* Non-zero */
    params.b = lrand48() % params.prime;
    return params;
}

/* Universal hash function */
uint32_t universal_hash(const UniversalHashParams *p, uint64_t key) {
    /* Note: keys must be below p->prime for the universality guarantee */
    __uint128_t temp = (__uint128_t)p->a * key + p->b;
    uint64_t h = temp % p->prime;
    return (uint32_t)(h % p->num_buckets);
}

/* Usage: different seeds create different hash functions:
 *   UniversalHashParams h1 = universal_hash_init(1009, 12345);
 *   UniversalHashParams h2 = universal_hash_init(1009, 67890);
 * h1 and h2 will distribute the same keys differently. */
```

Once a seed is chosen for a hash index, it MUST be stored persistently in the index metadata. If the seed changes (even due to a software upgrade), all previously stored records become unfindable since their bucket locations would change.
For database hash indexes:
Generate seed at index creation: Use a cryptographically secure random number generator (e.g., /dev/urandom, CryptGenRandom).
Store seed in index metadata: Include it in the index header page, replicated in the system catalog.
Use seed consistently: All hash computations for that index must use the same seed.
Consider reproducibility: For testing and debugging, allow deterministic seed specification.
```sql
-- Hypothetical syntax for seed specification
CREATE HASH INDEX idx_users_email ON users(email)
WITH (hash_seed = 0xDEADBEEF, buckets = 10007);
```
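A sketch of what persisting the seed might look like in an index header page (the field layout is hypothetical, not any particular database's on-disk format):

```c
#include <stdint.h>

/* Hypothetical index header page layout: the seed lives alongside
 * the other parameters every hash computation depends on. */
typedef struct {
    uint32_t magic;        /* identifies the page as an index header */
    uint32_t version;      /* on-disk format version */
    uint64_t hash_seed;    /* fixed at CREATE INDEX, never changes */
    uint32_t num_buckets;  /* bucket count at creation */
    uint32_t hash_func_id; /* which hash algorithm this index uses */
} HashIndexHeader;
```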
Implementing hash functions for production database systems requires attention to correctness, performance, and platform compatibility.
```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <math.h>

uint64_t murmurhash3_64(const void *key, size_t len, uint64_t seed); /* assumed */

/* Safe 32-bit integer read, handles unaligned access and endianness */
static inline uint32_t read_u32(const uint8_t *p) {
#if defined(__GNUC__) && defined(__x86_64__)
    /* x86 allows unaligned access; use memcpy for strict aliasing */
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v; /* x86 is little-endian */
#else
    /* Portable: byte-at-a-time, little-endian */
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
#endif
}

/* Safe double hashing - normalize representation */
uint64_t hash_double(double value) {
    /* Handle special cases */
    if (isnan(value)) {
        return 0x7FF8000000000000ULL; /* Canonical NaN */
    }
    if (value == 0.0) {
        return 0; /* Both +0.0 and -0.0 hash the same */
    }

    /* Use bit representation for normal values */
    uint64_t bits;
    memcpy(&bits, &value, sizeof(bits));
    return murmurhash3_64(&bits, sizeof(bits), 0);
}
```

Hash function performance varies significantly across CPU architectures. A function that's fast on Intel may be slower on ARM. Always benchmark on your production hardware. Cloud deployments should test on the specific instance types used.
We have explored the critical role of hash functions in database indexing—the algorithms that transform keys into bucket locations and determine the effectiveness of the entire hash index structure.
What's Next:
With bucket architecture and hash functions established, we now face an inevitable reality: collisions. The next page examines overflow handling—what happens when multiple keys hash to the same bucket and the bucket fills beyond capacity.
You now understand the principles and practice of hash function design for database indexing. This knowledge enables you to select appropriate hash functions, evaluate their quality in production, and troubleshoot distribution problems. Next, we'll examine how to handle the collisions that even the best hash functions cannot prevent.