We've learned the mechanics of chaining and discussed its tradeoffs qualitatively. Now it's time for the mathematics—the rigorous analysis that allows you to predict and optimize hash table performance.
Understanding performance implications isn't just academic. When you're designing a system, you need to answer questions such as how large to make the table, what load factor to target, when to resize, and how long lookups will actually take.
This page gives you the mathematical tools to answer these questions with confidence.
By the end of this page, you will master time complexity analysis for all chaining operations, understand the load factor's mathematical impact on performance, learn expected chain length distributions, and know optimal sizing and resizing strategies.
Let's establish precise time complexity for each operation. We'll use these variables: n, the number of entries stored; m, the number of buckets; and α = n/m, the load factor.
The key insight is that α (load factor) determines average chain length, which in turn determines average operation time.
| Operation | Best Case | Average Case | Worst Case | Space |
|---|---|---|---|---|
| Insert | O(1) | O(1) * | O(n) | O(1) |
| Lookup (hit) | O(1) | O(1 + α) | O(n) | O(1) |
| Lookup (miss) | O(1) | O(1 + α) | O(n) | O(1) |
| Delete | O(1) | O(1 + α) | O(n) | O(1) |
| Resize | — | O(n) | O(n) | O(n) |
The asterisk: insert is O(1) if we insert at the head of the chain without checking for duplicates. If we must check for existing keys first (the standard behavior for maps), insert becomes O(1 + α) like lookup, since we traverse the chain to check.
Breaking Down the Complexity:
The 1 in O(1 + α) is the cost of hashing the key and indexing the bucket array; the α is the cost of walking the chain, whose average length is α. In practice, with a good hash function and a reasonable load factor, α is small (< 1), making average-case operations effectively O(1).
Critical Assumption: These bounds assume a uniform hash function that distributes keys evenly across buckets. Poor hash functions can cause clustering that defeats this analysis.
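To make the O(1 + α) breakdown concrete, here is a minimal lookup sketch, assuming a bucket array of singly linked nodes and a caller-supplied hash function (names are illustrative, not the exact API of the ChainingHashTable used later on this page):

```typescript
interface ChainNode<K, V> {
  key: K;
  value: V;
  next: ChainNode<K, V> | null;
}

// Illustrative chained lookup: the point is the cost structure, not a full class.
function chainedGet<K, V>(
  buckets: Array<ChainNode<K, V> | null>,
  hash: (key: K) => number,
  key: K
): V | undefined {
  // O(1): hash the key and index the bucket array
  const index = Math.abs(hash(key)) % buckets.length;

  // O(α) on average: walk the chain in that bucket (=== equality as a simplification)
  let current = buckets[index];
  while (current !== null) {
    if (current.key === key) {
      return current.value; // successful lookup
    }
    current = current.next;
  }
  return undefined; // unsuccessful lookup: the whole chain was traversed
}
```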
The load factor (α = n/m) is the single most important variable for hash table performance. It represents the average number of entries per bucket—equivalently, the average chain length.
Interpreting Load Factor: α < 1 means more buckets than entries (many buckets sit empty), α = 1 means one entry per bucket on average, and α > 1 means chains average more than one entry each.
Unlike open addressing (which cannot exceed α = 1.0), chaining can operate at any load factor. But performance degrades as α increases.
| Load Factor (α) | Avg Chain Length | Expected Probes (Successful) | Expected Probes (Unsuccessful) | Relative Performance |
|---|---|---|---|---|
| 0.25 | 0.25 | 1.125 | 1.25 | Excellent |
| 0.50 | 0.50 | 1.25 | 1.50 | Very Good |
| 0.75 | 0.75 | 1.375 | 1.75 | Good |
| 1.00 | 1.00 | 1.50 | 2.00 | Acceptable |
| 1.50 | 1.50 | 1.75 | 2.50 | Degraded |
| 2.00 | 2.00 | 2.00 | 3.00 | Poor |
| 5.00 | 5.00 | 3.50 | 6.00 | Severely degraded |
The Formulas:
For a hash table with chaining and load factor α:
Unsuccessful Search (key not found):
Expected probes = 1 + α
We traverse the entire chain, on average α elements.
Successful Search (key found):
Expected probes = 1 + α/2
On average, we find the key halfway through the chain.
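The probe-count table above follows directly from these two formulas. A quick sketch to reproduce it (plain arithmetic, nothing implementation-specific):

```typescript
// Expected probe counts for a chained hash table at load factor alpha.
function expectedProbes(alpha: number): { successful: number; unsuccessful: number } {
  return {
    successful: 1 + alpha / 2, // key found halfway down the chain on average
    unsuccessful: 1 + alpha,   // the whole chain is traversed on average
  };
}

for (const alpha of [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 5.0]) {
  const { successful, unsuccessful } = expectedProbes(alpha);
  console.log(`α=${alpha}: successful=${successful}, unsuccessful=${unsuccessful}`);
}
// α=0.75 gives successful=1.375, unsuccessful=1.75, matching the table above
```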
Most hash table implementations use 0.75 as the default load factor threshold for resizing. This balances memory efficiency (not too many empty buckets) with performance (chains stay short). Java's HashMap, Python's dict, and many others use this threshold.
Why 0.75?
At α = 0.75, expected probes are about 1.375 for a successful lookup and 1.75 for an unsuccessful one, and roughly 47% of buckets are still empty (e^(-0.75) ≈ 0.47).
Increasing α to 0.9 shrinks the bucket array by about 17% (0.75/0.9 ≈ 0.83) but lengthens average chains by 20%. Decreasing it to 0.5 adds about 50% more buckets for only ~15% fewer probes on an unsuccessful lookup (1.5 vs. 1.75). The returns diminish rapidly in both directions.
With a uniform hash function, chain lengths follow a Poisson distribution with parameter λ = α (the load factor). This mathematical model lets us predict the probability of any chain length.
The Poisson Probability Formula:
P(chain length = k) = (α^k × e^(-α)) / k!
Where k is the chain length, e is Euler's number (≈ 2.718), and α is the load factor. At α = 1.0 (one entry per bucket on average), the distribution looks like this:
| Chain Length | Probability | Interpretation |
|---|---|---|
| 0 (empty) | 36.79% | About 1/3 of buckets are empty |
| 1 | 36.79% | About 1/3 have exactly one entry |
| 2 | 18.39% | About 1/5 have two entries |
| 3 | 6.13% | Small fraction have three |
| 4 | 1.53% | Rare |
| 5 | 0.31% | Very rare |
| ≥6 | 0.06% | Extremely rare |
Even with α = 1.0 (as many entries as buckets), over 90% of buckets have 0, 1, or 2 entries. Long chains are genuinely rare with a uniform hash function. This is why chaining performs well in practice.
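The distribution table above can be reproduced directly from the Poisson formula. A small sketch, using α = 1.0 as in the table:

```typescript
// P(chain length = k) for a uniform hash function at load factor alpha.
function poissonPmf(alpha: number, k: number): number {
  let factorial = 1;
  for (let i = 2; i <= k; i++) factorial *= i;
  return (Math.pow(alpha, k) * Math.exp(-alpha)) / factorial;
}

// Reproduce the table at α = 1.0
for (let k = 0; k <= 6; k++) {
  console.log(`P(length = ${k}) = ${(poissonPmf(1.0, k) * 100).toFixed(2)}%`);
}
// Prints 36.79%, 36.79%, 18.39%, 6.13%, 1.53%, 0.31%, 0.05%
```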
Probability of Long Chains:
The probability that any specific bucket has a chain length ≥ k is approximately:
P(length ≥ k) ≈ (eα/k)^k for k > α
For a table with m buckets, the expected number of chains of length ≥ k is:
E[chains ≥ k] = m × P(length ≥ k)
| Chain Length ≥ | Probability | Expected Count in 1M Buckets |
|---|---|---|
| 5 | ~0.095% | ~950 buckets |
| 8 | ~0.0012% | ~12 buckets |
| 10 | ~0.000019% | ~0.19 buckets (unlikely any) |
| 15 | ~10^-10 | ~0 (effectively impossible) |
Practical Implication:
With a good hash function and reasonable load factor, chains exceeding 10 elements are essentially impossible in tables of any practical size. This bounds the worst-case behavior in practice (though theory must still consider adversarial inputs that could defeat the hash function).
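The expected-count figures above come from multiplying a tail probability by the number of buckets. The sketch below computes the exact Poisson tail by summing the terms below k; the results depend on the load factor you plug in (α = 0.75 is assumed here) and will differ somewhat from the rounded figures in the table, but land in the same ballpark:

```typescript
// Exact Poisson tail: P(chain length >= k) at load factor alpha.
function poissonTail(alpha: number, k: number): number {
  let cumulative = 0;
  let term = Math.exp(-alpha); // P(0)
  for (let i = 0; i < k; i++) {
    cumulative += term;
    term = (term * alpha) / (i + 1); // derive P(i+1) from P(i)
  }
  return 1 - cumulative; // P(length >= k)
}

// Expected number of chains of length >= k in a table with m buckets.
function expectedLongChains(alpha: number, k: number, m: number): number {
  return m * poissonTail(alpha, k);
}

console.log(expectedLongChains(0.75, 5, 1_000_000));  // on the order of a thousand buckets
console.log(expectedLongChains(0.75, 10, 1_000_000)); // far below 1: effectively never
```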
Most hash tables grow dynamically—when the load factor exceeds a threshold, they allocate a larger array and rehash all entries. This is expensive (O(n)), but it happens rarely enough that the cost amortizes to O(1) per operation.
The Amortization Argument:
Each entry 'pays' a constant amount toward the eventual resize.
When resizing, doubling the capacity (m → 2m) is the standard strategy. It ensures the next resize is at least n more insertions away, maintaining O(1) amortized insert. Growing by a smaller factor (e.g., 1.5x) uses less memory but resizes more often.
Cost Analysis of Doubling:
Consider inserting 2^k elements into an initially empty table:
| Insertions | Table Size | Resize Cost | Cumulative Resize Cost |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 2 | 1 | 1 |
| 4 | 4 | 2 | 3 |
| 8 | 8 | 4 | 7 |
| 16 | 16 | 8 | 15 |
| 2^k | 2^k | 2^(k-1) | 2^k - 1 |
After n insertions, total resize cost ≈ n. Average cost per insertion = O(1).
```typescript
class ChainingHashTable<K, V> {
  // ... existing code ...

  /**
   * Resizes the hash table by creating a new larger array
   * and rehashing all existing entries.
   */
  private resize(newCapacity: number): void {
    console.log(`Resizing: ${this.buckets.length} → ${newCapacity}`);

    // Save reference to old buckets
    const oldBuckets = this.buckets;

    // Create new, larger bucket array
    this.buckets = new Array(newCapacity).fill(null);
    this.size = 0; // Will be repopulated

    // Rehash all existing entries
    for (const head of oldBuckets) {
      let current = head;
      while (current !== null) {
        // Note: this.put() uses the new bucket array
        this.put(current.key, current.value);
        current = current.next;
      }
    }
  }

  /**
   * Insertion with automatic resizing when load factor exceeded.
   */
  put(key: K, value: V): V | undefined {
    // ... insertion logic ...

    // Check if resize needed AFTER insertion
    if (this.loadFactor > this.loadFactorThreshold) {
      this.resize(this.buckets.length * 2);
    }

    return oldValue;
  }
}
```

While amortized cost is O(1), the individual resize operation is O(n). For latency-sensitive applications (real-time systems, trading), this spike may be unacceptable. Solutions include incremental resizing (rehash a few entries per operation) or pre-allocating based on expected size.
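One of those mitigations, incremental resizing, can be sketched as follows. This is a hypothetical structure, not part of the ChainingHashTable above: the old and new bucket arrays coexist, every operation migrates a bounded number of old buckets, and lookups check both arrays until migration finishes (duplicate-key handling is omitted for brevity).

```typescript
// Sketch of incremental rehashing: no single operation pays the full O(n) cost.
class IncrementalChainingTable<K, V> {
  private buckets: Array<Array<[K, V]>>;
  private oldBuckets: Array<Array<[K, V]>> | null = null;
  private migrateCursor = 0;            // next old bucket to migrate
  private readonly bucketsPerStep = 4;  // bounded migration work per operation
  private size = 0;

  constructor(private hash: (key: K) => number, capacity = 16) {
    this.buckets = Array.from({ length: capacity }, () => [] as Array<[K, V]>);
  }

  private index(key: K, capacity: number): number {
    return Math.abs(this.hash(key)) % capacity;
  }

  // Move a handful of old chains into the new array; called on every operation.
  private migrateStep(): void {
    if (this.oldBuckets === null) return;
    let moved = 0;
    while (moved < this.bucketsPerStep && this.migrateCursor < this.oldBuckets.length) {
      for (const [k, v] of this.oldBuckets[this.migrateCursor]) {
        this.buckets[this.index(k, this.buckets.length)].push([k, v]);
      }
      this.migrateCursor++;
      moved++;
    }
    if (this.migrateCursor >= this.oldBuckets.length) this.oldBuckets = null; // migration done
  }

  put(key: K, value: V): void {
    this.migrateStep();
    // Begin a gradual resize once the load factor crosses 0.75
    if (this.oldBuckets === null && this.size / this.buckets.length > 0.75) {
      this.oldBuckets = this.buckets;
      this.buckets = Array.from({ length: this.oldBuckets.length * 2 }, () => [] as Array<[K, V]>);
      this.migrateCursor = 0;
    }
    // New entries always go into the new array
    this.buckets[this.index(key, this.buckets.length)].push([key, value]);
    this.size++;
  }

  get(key: K): V | undefined {
    this.migrateStep();
    // During a migration the key may still live in the old array, so check both.
    const chains = [this.buckets[this.index(key, this.buckets.length)]];
    if (this.oldBuckets !== null) {
      chains.push(this.oldBuckets[this.index(key, this.oldBuckets.length)]);
    }
    for (const chain of chains) {
      for (const [k, v] of chain) {
        if (k === key) return v;
      }
    }
    return undefined;
  }
}
```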
Understanding space complexity helps you predict memory usage and make sizing decisions.
Total Space:
Space = Bucket Array + All Chain Nodes
= m × pointer_size + n × node_size
= O(m + n)
Since m = n/α (buckets = entries / load_factor), and typical α ≈ 0.75:
Space = O(n/α + n) = O(n × (1/α + 1)) = O(n)
Space grows linearly with entries, multiplied by constants depending on load factor and node overhead.
| Component | Per Entry Bytes | Total (1M entries) | Notes |
|---|---|---|---|
| Bucket array (α=0.75) | ~10.7 bytes (8-byte pointer ÷ 0.75) | ~10.7 MB | 1.33M buckets × 8 bytes |
| Node: key (int64) | 8 bytes | 8 MB | The actual key |
| Node: value (int64) | 8 bytes | 8 MB | The actual value |
| Node: next pointer | 8 bytes | 8 MB | Chain link |
| Node overhead (allocator) | 8-16 bytes | 8-16 MB | Memory allocator bookkeeping |
| TOTAL | ~43-51 bytes/entry | ~43-51 MB | For an int64 → int64 map |
Comparing with Open Addressing:
Open addressing stores entries directly in the bucket array—no separate nodes:
Open Addressing Space = m × entry_size
= (n/α) × (key_size + value_size)
For 1M entries at α = 0.5 (open addressing needs a lower load factor): 2M buckets × 16 bytes ≈ 32 MB, versus roughly 43-51 MB for chaining.
Chaining uses more memory for small entries due to pointer overhead, but the gap narrows for larger entries.
Lower load factors use more memory but improve speed. A table at α=0.5 uses twice the bucket memory as α=1.0, but has half the average chain length. Profile your specific workload to find the right balance.
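The numbers in the table above can be approximated with a back-of-the-envelope calculator. The per-field sizes below mirror the table's assumptions (8-byte pointers, int64 keys and values, 8-16 bytes of allocator overhead); real figures depend on your runtime and allocator:

```typescript
// Rough memory estimate (in bytes) for a chaining hash table.
function chainingMemoryBytes(
  entries: number,
  loadFactor: number,
  keyBytes = 8,
  valueBytes = 8,
  allocatorOverhead = 12 // midpoint of the 8-16 byte range above
): number {
  const buckets = Math.ceil(entries / loadFactor);
  const bucketArray = buckets * 8; // one pointer per bucket
  const nodes = entries * (keyBytes + valueBytes + 8 + allocatorOverhead); // +8 for the next pointer
  return bucketArray + nodes;
}

// Rough estimate for open addressing: entries stored inline in the bucket array.
function openAddressingMemoryBytes(
  entries: number,
  loadFactor: number,
  keyBytes = 8,
  valueBytes = 8
): number {
  return Math.ceil(entries / loadFactor) * (keyBytes + valueBytes);
}

console.log(chainingMemoryBytes(1_000_000, 0.75) / 1e6);      // ≈ 47 MB
console.log(openAddressingMemoryBytes(1_000_000, 0.5) / 1e6); // ≈ 32 MB
```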
Knowing how to size your hash table optimally can prevent performance issues and memory waste:
Given Expected Entries (n), Find Optimal Capacity (m):
m = ceiling(n / target_load_factor)
Example:
- Expected 10,000 entries
- Target α = 0.75
- m = ceiling(10,000 / 0.75) = ceiling(13,333.3) = 13,334 buckets
Practically, round up to a power of 2 for efficient modulo (bit masking):
m = next_power_of_2(n / α) = 16,384 buckets
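A power-of-2 capacity lets the modulo be replaced by a bit mask. A short helper sketch (function names are illustrative):

```typescript
// Round up to the next power of two so that (hash & (capacity - 1))
// can replace the slower (hash % capacity).
function nextPowerOf2(x: number): number {
  let power = 1;
  while (power < x) power *= 2;
  return power;
}

function optimalCapacity(expectedEntries: number, targetLoadFactor = 0.75): number {
  return nextPowerOf2(Math.ceil(expectedEntries / targetLoadFactor));
}

const capacity = optimalCapacity(10_000);                              // 16384
const bucketIndex = (hashValue: number) => hashValue & (capacity - 1); // bit-mask indexing
```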
| Expected Entries | Load Factor | Minimum Capacity | Power-of-2 Capacity |
|---|---|---|---|
| 100 | 0.75 | 134 | 256 |
| 1,000 | 0.75 | 1,334 | 2,048 |
| 10,000 | 0.75 | 13,334 | 16,384 |
| 100,000 | 0.75 | 133,334 | 262,144 (256K) |
| 1,000,000 | 0.75 | 1,333,334 | 2,097,152 (2M) |
If you know the approximate size in advance, pre-allocate! Starting with capacity 16 and growing to 1M entries requires ~17 resizes. Pre-allocating with 2M capacity requires zero resizes, saving significant CPU time during population.
```typescript
// BAD: Default capacity, many resizes
const badMap = new ChainingHashTable<string, number>();
for (let i = 0; i < 1000000; i++) {
  badMap.put(`key_${i}`, i);
}
// Triggers ~17 resize operations!

// GOOD: Pre-sized capacity, zero resizes
const expectedSize = 1000000;
const loadFactor = 0.75;
const optimalCapacity = Math.pow(2, Math.ceil(Math.log2(expectedSize / loadFactor)));
// optimalCapacity = 2097152

const goodMap = new ChainingHashTable<string, number>(optimalCapacity);
for (let i = 0; i < 1000000; i++) {
  goodMap.put(`key_${i}`, i);
}
// No resize operations needed!
```

The O(n) worst case occurs when all keys hash to the same bucket, creating a single chain of length n. This can happen due to a poor or mismatched hash function, pathological key patterns, or adversarial inputs deliberately crafted to collide.
In the worst case, the hash table degenerates to a linked list, and all operations become O(n).
In 2011, researchers demonstrated 'hash collision DoS attacks' against web frameworks (PHP, Python, Java, Node.js). Attackers sent carefully crafted POST data where all keys hashed to the same bucket, turning O(1) parsing into O(n²) and crashing servers. This led to randomized hash functions and per-process hash seeds becoming standard security measures.
Protection Strategies: seed the hash function with a random per-process value, use a keyed hash (such as SipHash) that attackers cannot predict, limit the size of untrusted inputs, and convert long chains to balanced trees.
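As an illustration of seeding, here is a sketch that mixes a per-process random seed into a simple FNV-1a string hash. The FNV constants are standard; the seeding scheme itself is a simplified illustration, and production systems typically use a keyed hash such as SipHash:

```typescript
// A random per-process seed makes bucket assignments unpredictable, so attackers
// cannot precompute colliding keys offline. Simplified illustration only.
const HASH_SEED = Math.floor(Math.random() * 0xffffffff);

function seededStringHash(key: string): number {
  let hash = (0x811c9dc5 ^ HASH_SEED) >>> 0; // FNV-1a offset basis mixed with the seed
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV-1a prime
  }
  return hash;
}
```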
Java's Hybrid Defense:
Java's HashMap provides a reference implementation of tree conversion:
Chain length < 8: Linked list (simple, fast for short chains)
Chain length ≥ 8: Convert to red-black tree (O(log n) guarantee)
Chain shrinks to ≤ 6: Convert back to linked list (hysteresis)
This bounds worst-case lookup to O(log n) while maintaining the O(1) average case. The tree threshold of 8 is chosen because, under the Poisson model with a well-distributed hash, the probability of any chain ever reaching length 8 is vanishingly small (on the order of one in ten million), so trees almost never appear for honest inputs yet cap the damage when the hash function is being deliberately defeated.
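A toy sketch of just the threshold decision, using constants that mirror Java's TREEIFY_THRESHOLD and UNTREEIFY_THRESHOLD (Java also requires a minimum table capacity of 64 before treeifying, omitted here, and the tree data structure itself is out of scope):

```typescript
const TREEIFY_THRESHOLD = 8;   // convert list -> tree when a chain reaches this length
const UNTREEIFY_THRESHOLD = 6; // convert tree -> list when it shrinks to this length

type BucketRepresentation = 'list' | 'tree';

// Decide the bucket representation after an insert or delete. The gap between
// 8 and 6 (hysteresis) prevents a bucket from flipping back and forth.
function nextRepresentation(current: BucketRepresentation, chainLength: number): BucketRepresentation {
  if (current === 'list' && chainLength >= TREEIFY_THRESHOLD) return 'tree';
  if (current === 'tree' && chainLength <= UNTREEIFY_THRESHOLD) return 'list';
  return current;
}
```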
Theory is essential, but let's ground our understanding in real-world performance numbers. These benchmarks give you practical expectations:
Typical Operation Times (Modern Hardware, 2020s):
| Operation | Time (nanoseconds) | Context |
|---|---|---|
| Hash computation (string, 20 chars) | 15-30 ns | Depends on hash algorithm |
| Bucket access (cache hit) | 1-4 ns | L1/L2 cache |
| Bucket access (cache miss) | 50-100 ns | Main memory |
| Chain traversal (per node) | 3-15 ns | Pointer chasing |
| Key comparison (integer) | 1-2 ns | Simple equality |
| Key comparison (string, 20 chars) | 5-20 ns | Depends on match point |
| Full lookup (successful, α=0.75) | 30-100 ns | Typical case |
| Full lookup (cache-cold) | 200-400 ns | First access to table |
Throughput Expectations: at 30-100 ns per cache-warm lookup, a single core sustains roughly 10-30 million operations per second; cache-cold access at 200-400 ns drops that to roughly 2-5 million per second.
Scaling Behavior:
Hash tables maintain near-constant operation time regardless of size—that's their key advantage. A table with 1 billion entries performs comparably to one with 1,000 entries (assuming both fit in memory).
These numbers are guidelines, not guarantees. Your performance depends on key types, value sizes, access patterns, cache behavior, and hash function choice. Always benchmark with representative data before making architectural decisions.
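A minimal way to check these numbers on your own hardware is a micro-benchmark like the rough Node.js sketch below, here using the built-in Map as a stand-in; swap in your own table (e.g., the ChainingHashTable above) and representative keys:

```typescript
import { performance } from 'node:perf_hooks';

// Rough lookup micro-benchmark. Representative keys and access patterns matter
// far more than the harness itself.
const N = 1_000_000;
const table = new Map<string, number>();
for (let i = 0; i < N; i++) table.set(`key_${i}`, i);

const start = performance.now();
let checksum = 0;
for (let i = 0; i < N; i++) {
  checksum += table.get(`key_${i}`) ?? 0; // keep the loop from being optimized away
}
const elapsedMs = performance.now() - start;

console.log(`checksum=${checksum}`);
console.log(`${((elapsedMs * 1e6) / N).toFixed(1)} ns per lookup`);
```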
We've thoroughly analyzed the performance characteristics of hash tables with chaining. The key insights: average operations cost O(1 + α), so keeping the load factor bounded keeps them effectively O(1); chain lengths follow a Poisson distribution, so long chains are rare with a uniform hash; a resize threshold around α = 0.75 balances memory and speed; doubling the capacity keeps inserts O(1) amortized; and seeded hashes or tree conversion bound the adversarial worst case.
Module Complete:
You've now completed a comprehensive study of collision handling with chaining. You understand how chaining resolves collisions with per-bucket chains, how to implement insertion, lookup, deletion, and resizing, and how to analyze and predict performance using the load factor and the mathematics on this page.
Next, you'll explore open addressing—the alternative collision resolution strategy that stores all entries directly in the bucket array without separate chains.
Congratulations! You've mastered collision handling with chaining—from concept through implementation to rigorous performance analysis. You can now make informed decisions about hash table design and predict performance characteristics with confidence.