At the core of every Bloom filter lie its hash functions. These functions transform arbitrary elements into bit array positions, and their quality directly determines the filter's effectiveness. A Bloom filter with k hash functions requires computing k different positions for every insert and query operation.
But here's a challenge: where do k independent hash functions come from? Computing truly independent hash functions is expensive, and maintaining k separate hash implementations is impractical. This page explores the elegant solutions that make Bloom filters efficient in practice.
By the end of this page, you will understand the hash function requirements for Bloom filters, master the double-hashing technique that generates k hashes from just two base hashes, learn about hash function quality and uniformity, and be equipped to implement production-grade Bloom filter hashing.
Before diving into implementation techniques, let's precisely define what we need from Bloom filter hash functions.
Formal Requirements:
Determinism — The same input must always produce the same output. This ensures inserted elements can be queried reliably.
Uniformity — Outputs should be uniformly distributed across the bit array. Non-uniform distribution creates 'hot spots' that fill up faster, increasing false positive rates.
Independence (or near-independence) — The k hash functions should behave as if they're statistically independent. Correlation between hashes reduces effective entropy.
Speed — Hash computation occurs on every insert and query. Slow hashes become a performance bottleneck, especially for high-throughput applications.
Avalanche property — Small input changes should produce dramatically different outputs. This prevents similar inputs from clustering in the bit array.
| Property | Importance for Bloom Filter | Consequence if Violated |
|---|---|---|
| Determinism | Critical | False negatives possible (queries fail for inserted elements) |
| Uniformity | High | Uneven bit distribution → higher FP rate than calculated |
| Independence | High | Correlated hashes → reduced entropy → higher FP rate |
| Speed | Moderate-High | Slow operations → limited throughput |
| Avalanche | Moderate | Similar inputs cluster → uneven distribution |
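To make the avalanche and uniformity properties concrete, here is a small sketch (illustrative only) comparing a naive character-sum hash against FNV-1a: similar inputs land on adjacent bit positions under the naive hash, but are scattered by a hash with good avalanche behavior.

```typescript
// Naive hash: sum of character codes (poor avalanche, clusters similar inputs)
function charSumHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) h += s.charCodeAt(i);
  return h;
}

// FNV-1a 32-bit (good avalanche for this demonstration)
function fnv1a32(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

const m = 1000; // bit array size for the demo

// "user1" and "user2" differ in one character...
// ...so the naive hash puts them on ADJACENT positions (496 and 497)
console.log(charSumHash("user1") % m, charSumHash("user2") % m);
// ...while FNV-1a scatters them across the array
console.log(fnv1a32("user1") % m, fnv1a32("user2") % m);
```

Elements that cluster like this fill a small region of the bit array quickly, producing a higher false positive rate than the standard formulas predict.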
Bloom filters typically do NOT require cryptographic hash functions (like SHA-256). Cryptographic hashes provide collision resistance and pre-image resistance—properties important for security but not needed here. Non-cryptographic hashes like MurmurHash, xxHash, or FNV are faster and provide sufficient uniformity for Bloom filter use.
The most straightforward approach to generating k hash functions is to simply use k different hash algorithms. Let's examine why this approach, while correct, is impractical.
Naive Implementation:
```typescript
// NAIVE APPROACH: k separate hash function implementations
// This works but is impractical

function naiveBloomHash(element: string, k: number, m: number): number[] {
  const positions: number[] = [];

  // Using k different hash algorithms
  positions.push(md5Hash(element) % m);     // Hash 1
  positions.push(sha1Hash(element) % m);    // Hash 2
  positions.push(sha256Hash(element) % m);  // Hash 3
  positions.push(fnv1aHash(element) % m);   // Hash 4
  positions.push(murmurHash(element) % m);  // Hash 5
  positions.push(crc32Hash(element) % m);   // Hash 6
  positions.push(jenkinsHash(element) % m); // Hash 7
  // ... need more hash functions if k > 7!

  return positions.slice(0, k);
}

// Problems with this approach:
// 1. Limited scalability - where do we get k=13 different hash functions?
// 2. Code complexity - each hash is a separate implementation to maintain
// 3. Speed - computing many different hashes is slow
// 4. Independence not guaranteed - some hashes might correlate
```

We need a better approach—one that generates arbitrarily many hash-like values from a simpler foundation. This is where the elegance of enhanced double hashing comes in.
The breakthrough insight, formalized by Kirsch and Mitzenmacher in 2006, is that we can generate k hash values using only two base hash functions. This dramatic simplification forms the basis of virtually all practical Bloom filter implementations.
The Formula:
Given two independent hash functions h₁(x) and h₂(x), we generate the i-th hash as:
gᵢ(x) = (h₁(x) + i × h₂(x)) mod m
For i = 0, 1, 2, ..., k-1
This produces k different hash values while only computing two underlying hashes.
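To make the formula concrete, here is a tiny worked example using made-up base hash values (not produced by any real hash function):

```typescript
// Toy base hash values for some element x (illustrative only)
const h1 = 17;
const h2 = 5;
const m = 100; // bit array size
const k = 4;   // number of hash functions

const positions: number[] = [];
for (let i = 0; i < k; i++) {
  // g_i(x) = (h1 + i * h2) mod m
  positions.push((h1 + i * h2) % m);
}
console.log(positions); // [17, 22, 27, 32]
```

Each additional "hash" costs only one addition and one modulo; the two base values are computed once.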
```typescript
/**
 * Generate k hash positions using double hashing.
 *
 * Based on "Less Hashing, Same Performance: Building a Better Bloom Filter"
 * by Kirsch and Mitzenmacher (2006).
 */
function doubleHash(element: string, k: number, m: number): number[] {
  // Compute one 128-bit hash and split it into two base hash values
  const hashValue = murmurHash3_128(element);
  const h1 = hashValue.high; // Upper 64 bits
  const h2 = hashValue.low;  // Lower 64 bits

  const positions: number[] = [];
  for (let i = 0; i < k; i++) {
    // g_i(x) = (h1(x) + i * h2(x)) mod m
    // Normalize into [0, m) rather than using Math.abs, which would
    // bias the distribution when the intermediate value is negative.
    // Note: with true 64-bit hash values, use BigInt (see the enhanced
    // version below) to avoid precision loss in Number arithmetic.
    const position = (((h1 + i * h2) % m) + m) % m;
    positions.push(position);
  }

  return positions;
}

// Example usage
const element = "hello@example.com";
const k = 7;       // 7 hash functions
const m = 1000000; // 1 million bits

const positions = doubleHash(element, k, m);
// Returns e.g. [342891, 517234, 691577, 865920, 40263, 214606, 388949]
// 7 different positions from just ONE underlying hash computation!
```

Why This Works:
The mathematical proof (beyond our scope here) shows that for Bloom filter purposes, double hashing produces hash values that are 'sufficiently independent.' The key insight is that the resulting false positive rate is asymptotically the same as with k fully independent hash functions: the correlation introduced by the linear combination is too weak to measurably degrade the filter.
This has been validated both theoretically and empirically across many implementations.
With double hashing, the cost of k hash functions is almost identical to the cost of ONE hash function. Computing k positions becomes O(k) simple arithmetic operations after the initial O(1) hash. For k=10, this is roughly 10x faster than computing 10 separate hashes.
The original double hashing formula can be further refined to improve hash distribution, especially for larger values of k. The enhanced double hashing technique adds a quadratic term:
gᵢ(x) = (h₁(x) + i × h₂(x) + i²) mod m
The i² term provides additional mixing that helps when k is large, and guards against degenerate cases where h₂(x) mod m is 0 or shares a factor with m, which would cause the plain linear probe sequence to repeat positions.
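To see what the quadratic term buys, consider the pathological case where h₂(x) mod m happens to be 0: plain double hashing then collapses all k probes onto a single position, while the i² term keeps them distinct. A small sketch with illustrative values:

```typescript
const m = 64; // bit array size
const k = 5;  // number of hash functions
const h1 = 42;
const h2 = 128; // pathological: h2 % m === 0

const plain: number[] = [];
const enhanced: number[] = [];
for (let i = 0; i < k; i++) {
  plain.push((h1 + i * h2) % m);            // collapses: i * h2 vanishes mod m
  enhanced.push((h1 + i * h2 + i * i) % m); // the i^2 term still varies
}
console.log(plain);    // [42, 42, 42, 42, 42] - effectively ONE hash function
console.log(enhanced); // [42, 43, 46, 51, 58] - five distinct positions
```

With plain double hashing, such an element would set only one bit, sharply raising its chance of colliding with other elements; the quadratic term restores the spread.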
```typescript
/**
 * Enhanced double hashing with a quadratic term.
 * Provides better distribution for large k values.
 */
function enhancedDoubleHash(element: string, k: number, m: number): number[] {
  const hashBytes = murmurHash3_128(element);

  // Extract two 64-bit hash values
  const h1 = BigInt(hashBytes.high);
  const h2 = BigInt(hashBytes.low);
  const mBig = BigInt(m);

  const positions: number[] = [];
  for (let i = 0; i < k; i++) {
    const iBig = BigInt(i);
    // g_i(x) = (h1(x) + i * h2(x) + i^2) mod m
    // The i^2 term adds quadratic probing for better distribution
    const combinedHash = (h1 + iBig * h2 + iBig * iBig) % mBig;
    // Ensure a positive result (BigInt % can be negative for negative operands)
    const position = Number(combinedHash < 0n ? combinedHash + mBig : combinedHash);
    positions.push(position);
  }

  return positions;
}

/**
 * Production-ready implementation with performance optimizations
 */
class BloomFilterHasher {
  private readonly m: number;
  private readonly k: number;

  constructor(bitArraySize: number, numHashFunctions: number) {
    this.m = bitArraySize;
    this.k = numHashFunctions;
  }

  getPositions(element: string | Buffer): number[] {
    // Get a 128-bit hash from MurmurHash3
    const hash = this.computeHash(element);

    // Split the digest into two 64-bit values
    const h1 = hash.readBigUInt64LE(0);
    const h2 = hash.readBigUInt64LE(8);

    const positions: number[] = new Array(this.k);
    const mBig = BigInt(this.m);
    const h2Mod = h2 % mBig; // Pre-reduce for loop efficiency

    // Start at g_0 = h1 mod m, then update incrementally
    let combinedHash = h1 % mBig;
    for (let i = 0; i < this.k; i++) {
      positions[i] = Number(combinedHash);
      // Increment: next = current + h2 + 2*i + 1 (derived from the quadratic formula)
      combinedHash = (combinedHash + h2Mod + BigInt(2 * i + 1)) % mBig;
    }

    return positions;
  }

  private computeHash(element: string | Buffer): Buffer {
    // Use a fast, high-quality hash like MurmurHash3 or xxHash
    return murmurHash3_x64_128(element);
  }
}
```

The Quadratic Increment Optimization:
Notice in the optimized implementation that we don't compute i² directly in the loop. Instead, we use the mathematical identity:
(i+1)² = i² + 2i + 1
This means we can update the hash incrementally by adding (h₂ + 2i + 1) at each step, avoiding the multiplication entirely. This small optimization matters for high-throughput applications.
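The identity is easy to sanity-check: stepping by (h₂ + 2i + 1) reproduces the direct (h₁ + i·h₂ + i²) mod m evaluation at every index. A minimal sketch with arbitrary toy constants:

```typescript
// Toy constants (any non-negative values work)
const m = 1_000_003n;
const h1 = 123456789n;
const h2 = 987654321n;
const k = 10;

let incremental = h1 % m; // g_0 = h1 mod m
for (let i = 0; i < k; i++) {
  // Direct evaluation of g_i = (h1 + i*h2 + i^2) mod m
  const iB = BigInt(i);
  const direct = (h1 + iB * h2 + iB * iB) % m;
  if (direct !== incremental) throw new Error(`mismatch at i=${i}`);
  // Incremental step: g_{i+1} = g_i + h2 + (2i + 1), all mod m
  incremental = (incremental + h2 + BigInt(2 * i + 1)) % m;
}
console.log("incremental update matches direct computation for all i");
```

In the hot loop this replaces two BigInt multiplications per index with one addition, which is exactly the trick the optimized hasher above relies on.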
Standard double hashing (without the i² term) is sufficient for most applications where k ≤ 10. Enhanced double hashing provides a noticeable improvement when k > 15, or when h₂(x) mod m can be 0 or share a large common factor with m, which would cause plain double hashing to repeat positions. When in doubt, use enhanced double hashing; the overhead is negligible.
While double hashing reduces our need to just two base hashes, choosing the right base hash function is still crucial. The quality of the base hash directly affects Bloom filter performance.
Popular Choices for Bloom Filters:
| Hash Function | Output Size | Speed | Quality | Best For |
|---|---|---|---|---|
| MurmurHash3 (128-bit) | 128 bits | Very Fast | Excellent | General purpose, most Bloom filters |
| xxHash | 64/128 bits | Extremely Fast | Excellent | High-throughput applications |
| SipHash | 64/128 bits | Fast | Excellent | Security-sensitive applications |
| FNV-1a | 32/64 bits | Fast | Good | Simple implementations, short keys |
| CityHash | 64/128 bits | Very Fast | Excellent | String-heavy workloads |
| SHA-256 (truncated) | 256 bits | Slow | Excellent | When cryptographic properties needed |
```typescript
import { createHash } from 'crypto';
import murmurhash3 from 'murmurhash3js-revisited';

/**
 * Hash function strategies for different use cases
 */

// Strategy 1: MurmurHash3 128-bit (recommended for most cases)
// Fast, excellent distribution, widely available
function murmurhash128(data: string, seed: number = 0): [bigint, bigint] {
  const hash = murmurhash3.x64.hash128(data, seed);
  // Returns a hex string; convert to two 64-bit values
  const h1 = BigInt('0x' + hash.substring(0, 16));
  const h2 = BigInt('0x' + hash.substring(16, 32));
  return [h1, h2];
}

// Strategy 2: Two independent hashes with different seeds
// Slightly more computation but provably independent
function twoSeedHash(data: string): [bigint, bigint] {
  const h1 = BigInt(murmurhash3.x86.hash32(data, 0));
  const h2 = BigInt(murmurhash3.x86.hash32(data, 0x9747b28c)); // Different seed
  // Combine pairs of 32-bit hashes into 64-bit values for more bits
  const h1_2 = BigInt(murmurhash3.x86.hash32(data, 0x12345678));
  const h2_2 = BigInt(murmurhash3.x86.hash32(data, 0x87654321));
  return [(h1 << 32n) | h1_2, (h2 << 32n) | h2_2];
}

// Strategy 3: Cryptographic hash for security-sensitive applications
// Slower but provides collision resistance
function cryptoHash(data: string): [bigint, bigint] {
  const hash = createHash('sha256').update(data).digest();
  const h1 = hash.readBigUInt64BE(0);
  const h2 = hash.readBigUInt64BE(8);
  return [h1, h2];
}

// Strategy 4: Simple FNV-1a for constrained environments
// Minimal code, decent quality for small filters
function fnv1a64(data: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let hash = FNV_OFFSET;
  for (let i = 0; i < data.length; i++) {
    hash ^= BigInt(data.charCodeAt(i));
    hash = (hash * FNV_PRIME) & 0xFFFFFFFFFFFFFFFFn; // 64-bit mask
  }
  return hash;
}
```

How do we verify that our hash functions are actually producing uniform distributions? Poor distribution leads to higher-than-expected false positive rates. Here's how to analyze hash quality empirically.
The Chi-Square Test for Uniformity:
We can use statistical tests to verify that hash outputs are uniformly distributed across the bit array.
```typescript
/**
 * Analyze hash function distribution quality
 */
function analyzeHashDistribution(
  hashFn: (input: string) => number[],
  numSamples: number,
  m: number,
  k: number
): {
  chiSquare: number;
  expectedPerBucket: number;
  variance: number;
  isUniform: boolean;
} {
  // Count hits per bit position
  const bucketCounts = new Array(m).fill(0);

  // Generate random inputs and hash them
  for (let i = 0; i < numSamples; i++) {
    const input = `test_element_${i}_${Math.random()}`;
    const positions = hashFn(input);
    for (const pos of positions) {
      bucketCounts[pos]++;
    }
  }

  // Expected count per bucket under a uniform distribution
  const totalHashes = numSamples * k;
  const expected = totalHashes / m;

  // Calculate the chi-square statistic
  let chiSquare = 0;
  for (let i = 0; i < m; i++) {
    const diff = bucketCounts[i] - expected;
    chiSquare += (diff * diff) / expected;
  }

  // Calculate the variance of the bucket counts
  let sumSquaredDiff = 0;
  for (let i = 0; i < m; i++) {
    sumSquaredDiff += Math.pow(bucketCounts[i] - expected, 2);
  }
  const variance = sumSquaredDiff / m;

  // For a uniform distribution, chi-square should be approximately m
  // (degrees of freedom = m - 1); chi-square >> m indicates non-uniformity
  const isUniform = chiSquare < m * 1.5; // Rough threshold

  return { chiSquare, expectedPerBucket: expected, variance, isUniform };
}

// Visualization of hash distribution
function visualizeDistribution(bucketCounts: number[], numBuckets: number): void {
  const max = Math.max(...bucketCounts);
  const bucketWidth = Math.ceil(bucketCounts.length / numBuckets);

  console.log("Hash Distribution Histogram:");
  console.log("============================");

  for (let b = 0; b < numBuckets; b++) {
    const start = b * bucketWidth;
    const end = Math.min(start + bucketWidth, bucketCounts.length);
    let sum = 0;
    for (let i = start; i < end; i++) {
      sum += bucketCounts[i];
    }
    const avg = sum / (end - start);
    const barLength = Math.round((avg / max) * 50);
    const bar = '█'.repeat(barLength);
    console.log(`[${start.toString().padStart(6)}-${end.toString().padStart(6)}] ${bar}`);
  }
}

/*
Example output for a GOOD hash function:

Hash Distribution Histogram:
============================
[     0- 10000] ██████████████████████████████████████████████████
[ 10000- 20000] █████████████████████████████████████████████████
[ 20000- 30000] ██████████████████████████████████████████████████
[ 30000- 40000] █████████████████████████████████████████████████
[ 40000- 50000] ██████████████████████████████████████████████████

Example output for a BAD hash function:

Hash Distribution Histogram:
============================
[     0- 10000] ████████████████████████████████████████████████████████
[ 10000- 20000] ███████████
[ 20000- 30000] ██████████████████████████████████████████████████
[ 30000- 40000] ████████████████████████████████████████████████████████████
[ 40000- 50000] ██████████████████
*/
```

If your Bloom filter's observed false positive rate significantly exceeds the theoretical rate, poor hash distribution is a likely culprit. Run distribution analysis on a sample of your actual data—some hash functions perform poorly on specific input patterns (e.g., sequential integers, similar strings).
Bloom filters can store any data type, but the hash function must accept all types consistently. How we serialize data for hashing affects both correctness and performance.
```typescript
/**
 * Type-safe Bloom filter hashing for multiple data types
 */
class UniversalHasher {
  private readonly encoder = new TextEncoder();

  /**
   * Convert any value to a hashable byte array.
   * Critical: the same value must ALWAYS produce the same bytes.
   */
  toBytes(value: unknown): Uint8Array {
    if (value === null) {
      return new Uint8Array([0]); // Null sentinel
    }
    if (value === undefined) {
      return new Uint8Array([1]); // Undefined sentinel
    }
    if (typeof value === 'string') {
      // Prefix with a type marker to distinguish from number "123"
      return this.withTypePrefix(2, this.encoder.encode(value));
    }
    if (typeof value === 'number') {
      // Use the Float64 representation for all numbers
      const buffer = new ArrayBuffer(9);
      const view = new DataView(buffer);
      view.setUint8(0, 3); // Type marker
      view.setFloat64(1, value, true); // Little-endian
      return new Uint8Array(buffer);
    }
    if (typeof value === 'bigint') {
      // Convert BigInt to a variable-length byte array
      const hex = value.toString(16);
      return this.withTypePrefix(4, this.encoder.encode(hex));
    }
    if (typeof value === 'boolean') {
      return new Uint8Array([5, value ? 1 : 0]);
    }
    if (value instanceof Date) {
      const buffer = new ArrayBuffer(9);
      const view = new DataView(buffer);
      view.setUint8(0, 6);
      view.setFloat64(1, value.getTime(), true);
      return new Uint8Array(buffer);
    }
    if (value instanceof Uint8Array) {
      return this.withTypePrefix(7, value);
    }
    if (Array.isArray(value)) {
      // Hash an array as its concatenated element encodings
      const elementBytes = value.map(v => this.toBytes(v));
      return this.withTypePrefix(8, this.concatenate(elementBytes));
    }
    if (typeof value === 'object') {
      // Sort keys for consistent ordering
      const sortedEntries = Object.entries(value)
        .sort(([a], [b]) => a.localeCompare(b));
      const parts: Uint8Array[] = [];
      for (const [key, val] of sortedEntries) {
        parts.push(this.encoder.encode(key));
        parts.push(this.toBytes(val));
      }
      return this.withTypePrefix(9, this.concatenate(parts));
    }
    // Fallback: stringify
    return this.withTypePrefix(10, this.encoder.encode(String(value)));
  }

  private withTypePrefix(type: number, data: Uint8Array): Uint8Array {
    const result = new Uint8Array(data.length + 1);
    result[0] = type;
    result.set(data, 1);
    return result;
  }

  private concatenate(arrays: Uint8Array[]): Uint8Array {
    const totalLength = arrays.reduce((sum, arr) => sum + arr.length + 4, 0);
    const result = new Uint8Array(totalLength);
    let offset = 0;
    for (const arr of arrays) {
      // Length-prefix each part to prevent ambiguity
      const view = new DataView(result.buffer, offset);
      view.setUint32(0, arr.length, true);
      result.set(arr, offset + 4);
      offset += arr.length + 4;
    }
    return result;
  }
}

// Usage
const hasher = new UniversalHasher();

// All of these produce distinct, consistent byte arrays
hasher.toBytes("123");        // String "123"
hasher.toBytes(123);          // Number 123
hasher.toBytes(123n);         // BigInt 123
hasher.toBytes([1, 2, 3]);    // Array
hasher.toBytes({a: 1, b: 2}); // Object - same bytes regardless of property order
```

Hash functions are the engine that powers Bloom filters.
Understanding how to generate and validate them is essential for building effective implementations. Let's consolidate the key insights:

- Bloom filter hashes must be deterministic, uniform, near-independent, and fast; cryptographic strength is usually unnecessary.
- Double hashing (Kirsch and Mitzenmacher, 2006) generates k positions from just two base hashes: gᵢ(x) = (h₁(x) + i × h₂(x)) mod m.
- Enhanced double hashing adds an i² term for better distribution at large k, and can be computed incrementally without multiplication.
- Choose a fast, high-quality base hash (MurmurHash3, xxHash) and verify uniformity empirically on your own data, for example with a chi-square test.
- Serialize inputs consistently (type markers, length prefixes, sorted object keys) so equal values always hash to the same bytes.
What's next:
With the theoretical foundations complete, the next page explores real-world applications of Bloom filters. We'll see how they power web caching, spell checkers, database systems, and distributed networks—with concrete examples of how the concepts we've learned translate into production systems.
You now understand the hash function design that enables practical Bloom filter implementations. The double hashing technique—generating k hashes from just two base hash computations—is the key insight that makes Bloom filters efficient. Combined with careful data serialization, these techniques produce robust, performant implementations.