At the core of every Bloom filter lie its hash functions. These functions transform arbitrary elements into bit array positions, and their quality directly determines the filter's effectiveness. A Bloom filter with k hash functions requires computing k different positions for every insert and query operation.
But here's a challenge: where do k independent hash functions come from? Computing truly independent hash functions is expensive, and maintaining k separate hash implementations is impractical. This page explores the elegant solutions that make Bloom filters efficient in practice.
By the end of this page, you will understand the hash function requirements for Bloom filters, master the double-hashing technique that generates k hashes from just two base hashes, learn about hash function quality and uniformity, and be equipped to implement production-grade Bloom filter hashing.
Before diving into implementation techniques, let's precisely define what we need from Bloom filter hash functions.
Formal Requirements:
Determinism — The same input must always produce the same output. This ensures inserted elements can be queried reliably.
Uniformity — Outputs should be uniformly distributed across the bit array. Non-uniform distribution creates 'hot spots' that fill up faster, increasing false positive rates.
Independence (or near-independence) — The k hash functions should behave as if they're statistically independent. Correlation between hashes reduces effective entropy.
Speed — Hash computation occurs on every insert and query. Slow hashes become a performance bottleneck, especially for high-throughput applications.
Avalanche property — Small input changes should produce dramatically different outputs. This prevents similar inputs from clustering in the bit array.
| Property | Importance for Bloom Filter | Consequence if Violated |
|---|---|---|
| Determinism | Critical | False negatives possible (queries fail for inserted elements) |
| Uniformity | High | Uneven bit distribution → higher FP rate than calculated |
| Independence | High | Correlated hashes → reduced entropy → higher FP rate |
| Speed | Moderate-High | Slow operations → limited throughput |
| Avalanche | Moderate | Similar inputs cluster → uneven distribution |
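To make the avalanche and uniformity properties concrete, here is a small sketch (illustrative only) comparing a naive character-sum hash against FNV-1a: similar inputs land on adjacent bit positions under the naive hash, but are scattered by a hash with good avalanche behavior.

```typescript
// Naive hash: sum of character codes (poor avalanche, clusters similar inputs)
function charSumHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h += s.charCodeAt(i);
  return h;
}

// FNV-1a 32-bit (good avalanche for this demonstration)
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

const m = 1000; // bit array size for the demo

// "user1" and "user2" differ in one character...
// ...so the naive hash puts them on ADJACENT positions (496 and 497)
console.log(charSumHash("user1") % m, charSumHash("user2") % m);
// ...while FNV-1a scatters them across the array
console.log(fnv1a32("user1") % m, fnv1a32("user2") % m);
```

Elements that cluster like this fill a small region of the bit array quickly, producing a higher false positive rate than the standard formulas predict.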
Bloom filters typically do NOT require cryptographic hash functions (like SHA-256). Cryptographic hashes provide collision resistance and pre-image resistance—properties important for security but not needed here. Non-cryptographic hashes like MurmurHash, xxHash, or FNV are faster and provide sufficient uniformity for Bloom filter use.
The most straightforward approach to generating k hash functions is to simply use k different hash algorithms. Let's examine why this approach, while correct, is impractical.
Naive Implementation:
```typescript
// NAIVE APPROACH: k separate hash function implementations
// This works but is impractical

function naiveBloomHash(element: string, k: number, m: number): number[] {
  const positions: number[] = [];

  // Using k different hash algorithms
  positions.push(md5Hash(element) % m);     // Hash 1
  positions.push(sha1Hash(element) % m);    // Hash 2
  positions.push(sha256Hash(element) % m);  // Hash 3
  positions.push(fnv1aHash(element) % m);   // Hash 4
  positions.push(murmurHash(element) % m);  // Hash 5
  positions.push(crc32Hash(element) % m);   // Hash 6
  positions.push(jenkinsHash(element) % m); // Hash 7
  // ... need more hash functions if k > 7!

  return positions.slice(0, k);
}

// Problems with this approach:
// 1. Limited scalability - where do we get k=13 different hash functions?
// 2. Code complexity - each hash is a separate implementation to maintain
// 3. Speed - computing many different hashes is slow
// 4. Independence not guaranteed - some hashes might correlate
```

We need a better approach—one that generates arbitrarily many hash-like values from a simpler foundation. This is where the elegance of enhanced double hashing comes in.
The breakthrough insight, formalized by Kirsch and Mitzenmacher in 2006, is that we can generate k hash values using only two base hash functions. This dramatic simplification forms the basis of virtually all practical Bloom filter implementations.
The Formula:
Given two independent hash functions h₁(x) and h₂(x), we generate the i-th hash as:
gᵢ(x) = (h₁(x) + i × h₂(x)) mod m
For i = 0, 1, 2, ..., k-1
This produces k different hash values while only computing two underlying hashes.
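To make the formula concrete, here is a tiny worked example using made-up base hash values (not produced by any real hash function):

```typescript
// Toy base hash values for some element x (illustrative only)
const h1 = 17;
const h2 = 5;
const m = 100; // bit array size
const k = 4;   // number of hash functions

const positions: number[] = [];
for (let i = 0; i < k; i++) {
  // g_i(x) = (h1 + i * h2) mod m
  positions.push((h1 + i * h2) % m);
}
console.log(positions); // [17, 22, 27, 32]
```

Each additional "hash" costs only one addition and one modulo; the two base values are computed once.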
```typescript
/**
 * Generate k hash positions using double hashing.
 *
 * Based on "Less Hashing, Same Performance: Building a Better Bloom Filter"
 * by Kirsch and Mitzenmacher (2006).
 */
function doubleHash(element: string, k: number, m: number): number[] {
  // Compute one 128-bit hash and split it into two base hash values
  const hashValue = murmurHash3_128(element);
  const h1 = hashValue.high; // Upper 64 bits
  const h2 = hashValue.low;  // Lower 64 bits

  const positions: number[] = [];
  for (let i = 0; i < k; i++) {
    // g_i(x) = (h1(x) + i * h2(x)) mod m
    // Normalize into [0, m) rather than using Math.abs, which would
    // bias the distribution when the intermediate value is negative.
    // Note: with true 64-bit hash values, use BigInt (see the enhanced
    // version below) to avoid precision loss in Number arithmetic.
    const position = (((h1 + i * h2) % m) + m) % m;
    positions.push(position);
  }

  return positions;
}

// Example usage
const element = "hello@example.com";
const k = 7;       // 7 hash functions
const m = 1000000; // 1 million bits

const positions = doubleHash(element, k, m);
// Returns e.g. [342891, 517234, 691577, 865920, 40263, 214606, 388949]
// 7 different positions from just ONE underlying hash computation!
```

Why This Works:
The mathematical proof (beyond our scope here) shows that for Bloom filter purposes, double hashing produces hash values that are 'sufficiently independent.' The key insight is that the resulting false positive rate is asymptotically the same as with k fully independent hash functions: the correlation introduced by the linear combination is too weak to measurably degrade the filter.
This has been validated both theoretically and empirically across many implementations.
With double hashing, the cost of k hash functions is almost identical to the cost of ONE hash function. Computing k positions becomes O(k) simple arithmetic operations after the initial O(1) hash. For k=10, this is roughly 10x faster than computing 10 separate hashes.
The original double hashing formula can be further refined to improve hash distribution, especially for larger values of k. The enhanced double hashing technique adds a quadratic term:
gᵢ(x) = (h₁(x) + i × h₂(x) + i²) mod m
The i² term provides additional mixing that helps when k is large, and guards against degenerate cases where h₂(x) mod m is 0 or shares a factor with m, which would cause the plain linear probe sequence to repeat positions.
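To see what the quadratic term buys, consider the pathological case where h₂(x) mod m happens to be 0: plain double hashing then collapses all k probes onto a single position, while the i² term keeps them distinct. A small sketch with illustrative values:

```typescript
const m = 64; // bit array size
const k = 5;  // number of hash functions
const h1 = 42;
const h2 = 128; // pathological: h2 % m === 0

const plain: number[] = [];
const enhanced: number[] = [];
for (let i = 0; i < k; i++) {
  plain.push((h1 + i * h2) % m);            // collapses: i * h2 vanishes mod m
  enhanced.push((h1 + i * h2 + i * i) % m); // the i^2 term still varies
}
console.log(plain);    // [42, 42, 42, 42, 42] - effectively ONE hash function
console.log(enhanced); // [42, 43, 46, 51, 58] - five distinct positions
```

With plain double hashing, such an element would set only one bit, sharply raising its chance of colliding with other elements; the quadratic term restores the spread.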
```typescript
/**
 * Enhanced double hashing with a quadratic term.
 * Provides better distribution for large k values.
 */
function enhancedDoubleHash(element: string, k: number, m: number): number[] {
  const hashBytes = murmurHash3_128(element);

  // Extract two 64-bit hash values
  const h1 = BigInt(hashBytes.high);
  const h2 = BigInt(hashBytes.low);
  const mBig = BigInt(m);

  const positions: number[] = [];
  for (let i = 0; i < k; i++) {
    const iBig = BigInt(i);
    // g_i(x) = (h1(x) + i * h2(x) + i^2) mod m
    // The i^2 term adds quadratic probing for better distribution
    const combinedHash = (h1 + iBig * h2 + iBig * iBig) % mBig;
    // Ensure a positive result (BigInt % can be negative for negative operands)
    const position = Number(combinedHash < 0n ? combinedHash + mBig : combinedHash);
    positions.push(position);
  }

  return positions;
}

/**
 * Production-ready implementation with performance optimizations
 */
class BloomFilterHasher {
  private readonly m: number;
  private readonly k: number;

  constructor(bitArraySize: number, numHashFunctions: number) {
    this.m = bitArraySize;
    this.k = numHashFunctions;
  }

  getPositions(element: string | Buffer): number[] {
    // Get a 128-bit hash from MurmurHash3
    const hash = this.computeHash(element);

    // Split the digest into two 64-bit values
    const h1 = hash.readBigUInt64LE(0);
    const h2 = hash.readBigUInt64LE(8);

    const positions: number[] = new Array(this.k);
    const mBig = BigInt(this.m);
    const h2Mod = h2 % mBig; // Pre-reduce for loop efficiency

    // Start at g_0 = h1 mod m, then update incrementally
    let combinedHash = h1 % mBig;
    for (let i = 0; i < this.k; i++) {
      positions[i] = Number(combinedHash);
      // Increment: next = current + h2 + 2*i + 1 (derived from the quadratic formula)
      combinedHash = (combinedHash + h2Mod + BigInt(2 * i + 1)) % mBig;
    }

    return positions;
  }

  private computeHash(element: string | Buffer): Buffer {
    // Use a fast, high-quality hash like MurmurHash3 or xxHash
    return murmurHash3_x64_128(element);
  }
}
```

The Quadratic Increment Optimization:
Notice in the optimized implementation that we don't compute i² directly in the loop. Instead, we use the mathematical identity:
(i+1)² = i² + 2i + 1
This means we can update the hash incrementally by adding (h₂ + 2i + 1) at each step, avoiding the multiplication entirely. This small optimization matters for high-throughput applications.
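The identity is easy to sanity-check: stepping by (h₂ + 2i + 1) reproduces the direct (h₁ + i·h₂ + i²) mod m evaluation at every index. A minimal sketch with arbitrary toy constants:

```typescript
// Toy constants (any non-negative values work)
const m = 1_000_003n;
const h1 = 123456789n;
const h2 = 987654321n;
const k = 10;

let incremental = h1 % m; // g_0 = h1 mod m
for (let i = 0; i < k; i++) {
  // Direct evaluation of g_i = (h1 + i*h2 + i^2) mod m
  const iB = BigInt(i);
  const direct = (h1 + iB * h2 + iB * iB) % m;
  if (direct !== incremental) throw new Error(`mismatch at i=${i}`);
  // Incremental step: g_{i+1} = g_i + h2 + (2i + 1), all mod m
  incremental = (incremental + h2 + BigInt(2 * i + 1)) % m;
}
console.log("incremental update matches direct computation for all i");
```

In the hot loop this replaces two BigInt multiplications per index with one addition, which is exactly the trick the optimized hasher above relies on.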
Standard double hashing (without the i² term) is sufficient for most applications where k ≤ 10. Enhanced double hashing provides a noticeable improvement when k > 15, or when h₂(x) mod m can be 0 or share a large common factor with m, which would cause plain double hashing to repeat positions. When in doubt, use enhanced double hashing; the overhead is negligible.
While double hashing reduces our need to just two base hashes, choosing the right base hash function is still crucial. The quality of the base hash directly affects Bloom filter performance.
Popular Choices for Bloom Filters:
| Hash Function | Output Size | Speed | Quality | Best For |
|---|---|---|---|---|
| MurmurHash3 (128-bit) | 128 bits | Very Fast | Excellent | General purpose, most Bloom filters |
| xxHash | 64/128 bits | Extremely Fast | Excellent | High-throughput applications |
| SipHash | 64/128 bits | Fast | Excellent | Security-sensitive applications |
| FNV-1a | 32/64 bits | Fast | Good | Simple implementations, short keys |
| CityHash | 64/128 bits | Very Fast | Excellent | String-heavy workloads |
| SHA-256 (truncated) | 256 bits | Slow | Excellent | When cryptographic properties needed |
```typescript
import { createHash } from 'crypto';
import murmurhash3 from 'murmurhash3js-revisited';

/**
 * Hash function strategies for different use cases
 */

// Strategy 1: MurmurHash3 128-bit (recommended for most cases)
// Fast, excellent distribution, widely available
function murmurhash128(data: string, seed: number = 0): [bigint, bigint] {
  const hash = murmurhash3.x64.hash128(data, seed);
  // Returns a hex string; convert to two 64-bit values
  const h1 = BigInt('0x' + hash.substring(0, 16));
  const h2 = BigInt('0x' + hash.substring(16, 32));
  return [h1, h2];
}

// Strategy 2: Two independent hashes with different seeds
// Slightly more computation but provably independent
function twoSeedHash(data: string): [bigint, bigint] {
  const h1 = BigInt(murmurhash3.x86.hash32(data, 0));
  const h2 = BigInt(murmurhash3.x86.hash32(data, 0x9747b28c)); // Different seed
  // Combine pairs of 32-bit hashes into 64-bit values for more bits
  const h1_2 = BigInt(murmurhash3.x86.hash32(data, 0x12345678));
  const h2_2 = BigInt(murmurhash3.x86.hash32(data, 0x87654321));
  return [(h1 << 32n) | h1_2, (h2 << 32n) | h2_2];
}

// Strategy 3: Cryptographic hash for security-sensitive applications
// Slower but provides collision resistance
function cryptoHash(data: string): [bigint, bigint] {
  const hash = createHash('sha256').update(data).digest();
  const h1 = hash.readBigUInt64BE(0);
  const h2 = hash.readBigUInt64BE(8);
  return [h1, h2];
}

// Strategy 4: Simple FNV-1a for constrained environments
// Minimal code, decent quality for small filters
function fnv1a64(data: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let hash = FNV_OFFSET;
  for (let i = 0; i < data.length; i++) {
    hash ^= BigInt(data.charCodeAt(i));
    hash = (hash * FNV_PRIME) & 0xFFFFFFFFFFFFFFFFn; // 64-bit mask
  }
  return hash;
}
```

How do we verify that our hash functions are actually producing uniform distributions? Poor distribution leads to higher-than-expected false positive rates. Here's how to analyze hash quality empirically.
The Chi-Square Test for Uniformity:
We can use statistical tests to verify that hash outputs are uniformly distributed across the bit array.
```typescript
/**
 * Analyze hash function distribution quality
 */
function analyzeHashDistribution(
  hashFn: (input: string) => number[],
  numSamples: number,
  m: number,
  k: number
): {
  chiSquare: number;
  expectedPerBucket: number;
  variance: number;
  isUniform: boolean;
} {
  // Count hits per bit position
  const bucketCounts = new Array(m).fill(0);

  // Generate random inputs and hash them
  for (let i = 0; i < numSamples; i++) {
    const input = `test_element_${i}_${Math.random()}`;
    const positions = hashFn(input);
    for (const pos of positions) {
      bucketCounts[pos]++;
    }
  }

  // Expected count per bucket under a uniform distribution
  const totalHashes = numSamples * k;
  const expected = totalHashes / m;

  // Calculate the chi-square statistic
  let chiSquare = 0;
  for (let i = 0; i < m; i++) {
    const diff = bucketCounts[i] - expected;
    chiSquare += (diff * diff) / expected;
  }

  // Calculate the variance of the bucket counts
  let sumSquaredDiff = 0;
  for (let i = 0; i < m; i++) {
    sumSquaredDiff += Math.pow(bucketCounts[i] - expected, 2);
  }
  const variance = sumSquaredDiff / m;

  // For a uniform distribution, chi-square should be approximately m
  // (degrees of freedom = m - 1); chi-square >> m indicates non-uniformity
  const isUniform = chiSquare < m * 1.5; // Rough threshold

  return { chiSquare, expectedPerBucket: expected, variance, isUniform };
}

// Visualization of hash distribution
function visualizeDistribution(bucketCounts: number[], numBuckets: number): void {
  const max = Math.max(...bucketCounts);
  const bucketWidth = Math.ceil(bucketCounts.length / numBuckets);

  console.log("Hash Distribution Histogram:");
  console.log("============================");

  for (let b = 0; b < numBuckets; b++) {
    const start = b * bucketWidth;
    const end = Math.min(start + bucketWidth, bucketCounts.length);
    let sum = 0;
    for (let i = start; i < end; i++) {
      sum += bucketCounts[i];
    }
    const avg = sum / (end - start);
    const barLength = Math.round((avg / max) * 50);
    const bar = '█'.repeat(barLength);
    console.log(`[${start.toString().padStart(6)}-${end.toString().padStart(6)}] ${bar}`);
  }
}

/*
Example output for a GOOD hash function:

Hash Distribution Histogram:
============================
[     0- 10000] ██████████████████████████████████████████████████
[ 10000- 20000] █████████████████████████████████████████████████
[ 20000- 30000] ██████████████████████████████████████████████████
[ 30000- 40000] █████████████████████████████████████████████████
[ 40000- 50000] ██████████████████████████████████████████████████

Example output for a BAD hash function:

Hash Distribution Histogram:
============================
[     0- 10000] ████████████████████████████████████████████████████████
[ 10000- 20000] ███████████
[ 20000- 30000] ██████████████████████████████████████████████████
[ 30000- 40000] ████████████████████████████████████████████████████████████
[ 40000- 50000] ██████████████████
*/
```

If your Bloom filter's observed false positive rate significantly exceeds the theoretical rate, poor hash distribution is a likely culprit. Run distribution analysis on a sample of your actual data—some hash functions perform poorly on specific input patterns (e.g., sequential integers, similar strings).
Bloom filters can store any data type, but the hash function must accept all types consistently. How we serialize data for hashing affects both correctness and performance.
```typescript
/**
 * Type-safe Bloom filter hashing for multiple data types
 */
class UniversalHasher {
  private readonly encoder = new TextEncoder();

  /**
   * Convert any value to a hashable byte array.
   * Critical: the same value must ALWAYS produce the same bytes.
   */
  toBytes(value: unknown): Uint8Array {
    if (value === null) {
      return new Uint8Array([0]); // Null sentinel
    }
    if (value === undefined) {
      return new Uint8Array([1]); // Undefined sentinel
    }
    if (typeof value === 'string') {
      // Prefix with a type marker to distinguish from number "123"
      return this.withTypePrefix(2, this.encoder.encode(value));
    }
    if (typeof value === 'number') {
      // Use the Float64 representation for all numbers
      const buffer = new ArrayBuffer(9);
      const view = new DataView(buffer);
      view.setUint8(0, 3); // Type marker
      view.setFloat64(1, value, true); // Little-endian
      return new Uint8Array(buffer);
    }
    if (typeof value === 'bigint') {
      // Convert BigInt to a variable-length byte array
      const hex = value.toString(16);
      return this.withTypePrefix(4, this.encoder.encode(hex));
    }
    if (typeof value === 'boolean') {
      return new Uint8Array([5, value ? 1 : 0]);
    }
    if (value instanceof Date) {
      const buffer = new ArrayBuffer(9);
      const view = new DataView(buffer);
      view.setUint8(0, 6);
      view.setFloat64(1, value.getTime(), true);
      return new Uint8Array(buffer);
    }
    if (value instanceof Uint8Array) {
      return this.withTypePrefix(7, value);
    }
    if (Array.isArray(value)) {
      // Hash an array as its concatenated element encodings
      const elementBytes = value.map(v => this.toBytes(v));
      return this.withTypePrefix(8, this.concatenate(elementBytes));
    }
    if (typeof value === 'object') {
      // Sort keys for consistent ordering
      const sortedEntries = Object.entries(value)
        .sort(([a], [b]) => a.localeCompare(b));
      const parts: Uint8Array[] = [];
      for (const [key, val] of sortedEntries) {
        parts.push(this.encoder.encode(key));
        parts.push(this.toBytes(val));
      }
      return this.withTypePrefix(9, this.concatenate(parts));
    }
    // Fallback: stringify
    return this.withTypePrefix(10, this.encoder.encode(String(value)));
  }

  private withTypePrefix(type: number, data: Uint8Array): Uint8Array {
    const result = new Uint8Array(data.length + 1);
    result[0] = type;
    result.set(data, 1);
    return result;
  }

  private concatenate(arrays: Uint8Array[]): Uint8Array {
    const totalLength = arrays.reduce((sum, arr) => sum + arr.length + 4, 0);
    const result = new Uint8Array(totalLength);
    let offset = 0;
    for (const arr of arrays) {
      // Length-prefix each part to prevent ambiguity
      const view = new DataView(result.buffer, offset);
      view.setUint32(0, arr.length, true);
      result.set(arr, offset + 4);
      offset += arr.length + 4;
    }
    return result;
  }
}

// Usage
const hasher = new UniversalHasher();

// All of these produce distinct, consistent byte arrays
hasher.toBytes("123");        // String "123"
hasher.toBytes(123);          // Number 123
hasher.toBytes(123n);         // BigInt 123
hasher.toBytes([1, 2, 3]);    // Array
hasher.toBytes({a: 1, b: 2}); // Object - same bytes regardless of property order
```

Hash functions are the engine that powers Bloom filters.
Understanding how to generate and validate them is essential for building effective implementations. Let's consolidate the key insights:

- Bloom filter hashes must be deterministic, uniform, near-independent, and fast; cryptographic strength is usually unnecessary.
- Double hashing (Kirsch and Mitzenmacher, 2006) generates k positions from just two base hashes: gᵢ(x) = (h₁(x) + i × h₂(x)) mod m.
- Enhanced double hashing adds an i² term for better distribution at large k, and can be computed incrementally without multiplication.
- Choose a fast, high-quality base hash (MurmurHash3, xxHash) and verify uniformity empirically on your own data, for example with a chi-square test.
- Serialize inputs consistently (type markers, length prefixes, sorted object keys) so equal values always hash to the same bytes.
What's next:
With the theoretical foundations complete, the next page explores real-world applications of Bloom filters. We'll see how they power web caching, spell checkers, database systems, and distributed networks—with concrete examples of how the concepts we've learned translate into production systems.
You now understand the hash function design that enables practical Bloom filter implementations. The double hashing technique—generating k hashes from just two base hash computations—is the key insight that makes Bloom filters efficient. Combined with careful data serialization, these techniques produce robust, performant implementations.