In our exploration of hash tables, we've encountered a remarkable promise: O(1) average-time access to any element. We've learned how hash functions transform arbitrary keys into array indices, seemingly giving us instant lookup capabilities. But there's a fundamental problem lurking beneath this elegant abstraction—a problem that every hash table implementation must confront.
What happens when two different keys produce the same hash index?
This isn't a rare edge case or a sign of a poorly designed hash function. It's a mathematical certainty that will occur in any real-world hash table. Understanding why collisions happen, how to detect them, and how to handle them gracefully is essential knowledge for any engineer working with hash-based data structures.
By the end of this page, you will understand why hash collisions are mathematically inevitable, how the pigeonhole principle guarantees their occurrence, and why collision resolution is not optional but essential. You'll develop intuition for collision probability and see exactly what happens when a collision occurs.
Let's set the stage with a concrete example. Imagine we have a hash table with 10 slots (indices 0-9) and a simple hash function that uses the modulo operation:
hash(key) = key % 10
Now, let's insert some integer keys:
| Key | Hash Value (key % 10) | Stored At Index |
|---|---|---|
| 15 | 15 % 10 = 5 | 5 |
| 27 | 27 % 10 = 7 | 7 |
| 42 | 42 % 10 = 2 | 2 |
| 35 | 35 % 10 = 5 | 5 ← COLLISION! |
The problem is now clear. When we attempt to insert key 35, our hash function computes the index as 5—the same index already occupied by key 15. Both keys are distinct and valid, yet they're competing for the same slot.
This is a hash collision: two or more distinct keys mapping to the same index in the hash table's underlying array.
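You can reproduce the collision above in a couple of lines. Here is a minimal sketch of the same key % 10 scheme:

```typescript
// Minimal sketch of the size-10 table's hash function.
const hash = (key: number): number => key % 10;

console.log(hash(15)); // 5
console.log(hash(27)); // 7
console.log(hash(42)); // 2
console.log(hash(35)); // 5: the same index as key 15, so the keys collide
```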
A collision doesn't mean your hash function is broken. Even a theoretically perfect hash function will produce collisions when the number of possible keys exceeds the number of available slots. Collisions are an inherent property of mapping a larger space to a smaller one.
Why can't we just make the hash table bigger?
This is a natural first instinct—if collisions happen because we're cramming keys into limited slots, why not allocate more slots? A larger table does reduce collision frequency, but it can never eliminate collisions entirely: the space of possible keys dwarfs any practical table size, and, as we'll see shortly, collisions become likely long before a table is full.
The inevitability of hash collisions is rooted in a simple but profound mathematical principle: the Pigeonhole Principle.
If you have more pigeons than pigeonholes, and every pigeon must be placed in a hole, then at least one hole must contain more than one pigeon.
In the context of hash tables:
If n items are placed into m containers, where n > m, then at least one container must hold more than one item. More generally, at least one container must hold at least ⌈n/m⌉ items.
Applying the Pigeonhole Principle to Hash Tables:
Consider these realistic scenarios:
| Scenario | Keys (Pigeons) | Slots (Holes) | Collision Guaranteed? |
|---|---|---|---|
| Store 11 items in size-10 table | 11 | 10 | Yes, by pigeonhole |
| Store 1001 items in size-1000 table | 1,001 | 1,000 | Yes, by pigeonhole |
| Hash all possible 8-char alphanumeric strings | ~218 trillion (62⁸) | ~4.3 billion (2³²) | Overwhelmingly yes |
| Hash all possible URLs | Effectively infinite | Any finite size | Absolutely guaranteed |
The Key Insight:
The key space (all possible keys you might want to store) is almost always vastly larger than any practical hash table size. When you hash a string, you're reducing an effectively unbounded input space down to about four billion output values (for a 32-bit hash) or about eighteen quintillion (for a 64-bit hash). Either way, you're mapping a larger space to a smaller space.
This compression is what makes hash tables useful (constant-time access regardless of input size), but it's also what makes collisions unavoidable.
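To see the compression in action, consider the classic multiply-by-31 string hash (the same style used in the collision detection example later on this page). Even tiny inputs can collide—the two-character strings "Aa" and "BB" hash to the same 32-bit value:

```typescript
// Multiply-by-31 string hash, kept in unsigned 32-bit range.
function stringHash(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

console.log(stringHash("Aa")); // 2112
console.log(stringHash("BB")); // 2112: identical hashes from distinct keys
```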
Hash tables trade perfect uniqueness for speed. We accept that multiple keys might hash to the same index, but we gain O(1) average access time. The art of hash table design is managing this trade-off through good hash functions and smart collision handling.
The pigeonhole principle tells us collisions are inevitable when we have more keys than slots. But here's what surprises many engineers: collisions occur with high probability long before you've filled half the table.
This phenomenon is illustrated by the famous Birthday Paradox:
In a room of just 23 people, there's a greater than 50% chance that two people share a birthday.
With 365 possible birthdays and only 23 people (far less than half of 365), why is a collision so likely? The answer reveals something crucial about how collisions accumulate in hash tables.
The Mathematics of Birthday Collisions:
The probability that 23 people all have different birthdays is:
P(no collision) = 365/365 × 364/365 × 363/365 × ... × 343/365
≈ 0.493
Therefore:
P(at least one collision) = 1 - 0.493 ≈ 0.507 (50.7%)
Each new person reduces the pool of "safe" birthdays, and these reductions compound multiplicatively.
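These figures are easy to verify with a few lines of code. The following is a quick sketch (the function names are illustrative, not from any library); it checks both the birthday figure above and the first-collision estimate discussed next:

```typescript
// Probability that n keys, hashed uniformly into m slots, are all distinct.
function probAllDistinct(n: number, m: number): number {
  let p = 1;
  for (let i = 0; i < n; i++) p *= (m - i) / m;
  return p;
}

console.log(1 - probAllDistinct(23, 365)); // ≈ 0.507: the birthday paradox

// Expected insertions before the first collision: ≈ sqrt(pi * m / 2).
const expectedFirstCollision = (m: number) => Math.sqrt((Math.PI * m) / 2);
console.log(Math.round(expectedFirstCollision(1_000_000))); // ≈ 1253
```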
In a hash table with m slots, the expected number of insertions before a collision is approximately √(π × m / 2). For a table with 1 million slots, you can expect a collision after only about 1,250 insertions—that's just 0.125% full!
| Table Size (m) | Expected Insertions Until Collision | Percentage Full |
|---|---|---|
| 100 | ≈ 12 | 12% |
| 1,000 | ≈ 40 | 4% |
| 10,000 | ≈ 125 | 1.25% |
| 1,000,000 | ≈ 1,253 | 0.125% |
| 1,000,000,000 | ≈ 39,633 | 0.004% |
The Implication for Hash Table Design:
This mathematical reality has a profound implication: collision handling is not an optimization reserved for large or nearly-full tables. It is core machinery that will be exercised almost immediately, no matter how generously you size the table.
Let's trace through exactly what happens when a collision occurs, step by step. We'll use a hash table with 8 slots, storing key-value pairs where keys are strings and values are integers.
Step 1: Insert ("apple", 100)
Assume our hash function computes hash("apple") % 8 = 3.
The table now looks like:
```
Index 0: EMPTY
Index 1: EMPTY
Index 2: EMPTY
Index 3: ("apple", 100)  ← Just inserted
Index 4: EMPTY
Index 5: EMPTY
Index 6: EMPTY
Index 7: EMPTY
```

Step 2: Insert ("banana", 200)
hash("banana") % 8 = 6. No collision, insertion succeeds:
Step 3: Insert ("cherry", 300)
hash("cherry") % 8 = 3. COLLISION! Index 3 is already occupied by "apple".
At this exact moment, the hash table faces a critical decision: Where should ("cherry", 300) go? We cannot simply overwrite ("apple", 100)—these are different keys with different values. We cannot leave "cherry" uninserted—the user asked us to store it. We need a collision resolution strategy.
The Collision Creates a Fundamental Problem:
Without collision handling:

- Overwriting ("apple", 100) would silently destroy stored data.
- Refusing to insert ("cherry", 300) would make the table unable to store valid keys.
- Either failure breaks the hash table's basic contract.

With proper collision handling (which we'll explore in detail with chaining):

- Both ("apple", 100) and ("cherry", 300) coexist at index 3.
- Lookups compare actual keys to return the right value for each.
- Average-case O(1) performance is preserved.
Before we can handle collisions, we must first detect them. Collision detection is straightforward but essential—we check whether the target slot is already occupied before inserting.
Here's how collision detection works in a basic hash table implementation:
```typescript
interface Entry<K, V> {
  key: K;
  value: V;
}

class BasicHashTable<K, V> {
  private buckets: (Entry<K, V> | null)[];
  private size: number;

  constructor(capacity: number = 16) {
    this.buckets = new Array(capacity).fill(null);
    this.size = 0;
  }

  private hash(key: K): number {
    // Simplified hash function for demonstration
    const keyStr = String(key);
    let hash = 0;
    for (let i = 0; i < keyStr.length; i++) {
      hash = (hash * 31 + keyStr.charCodeAt(i)) >>> 0;
    }
    return hash % this.buckets.length;
  }

  insert(key: K, value: V): boolean {
    const index = this.hash(key);

    // COLLISION DETECTION
    if (this.buckets[index] !== null) {
      const existingEntry = this.buckets[index]!;

      // Check if it's the same key (update) or different key (collision)
      if (existingEntry.key === key) {
        // Same key - update value (not a collision)
        existingEntry.value = value;
        return true;
      } else {
        // COLLISION DETECTED!
        // Different key wants the same slot
        console.log(`Collision: "${key}" collides with "${existingEntry.key}" at index ${index}`);
        // Without collision handling, we cannot insert
        // This is where chaining or open addressing comes in
        return false;
      }
    }

    // No collision - slot is empty
    this.buckets[index] = { key, value };
    this.size++;
    return true;
  }
}

// Demonstration
const table = new BasicHashTable<string, number>(8);
console.log(table.insert("apple", 100));  // true
console.log(table.insert("banana", 200)); // true
console.log(table.insert("cherry", 300)); // Depends on hash - may collide
```

Notice the important distinction: when the same key is inserted twice, we update the value (standard hash map behavior). Only when different keys hash to the same index do we have a true collision that requires resolution.
The Collision Detection Algorithm:

1. Compute the target index: index = hash(key).
2. If the slot at index is empty, there is no collision: insert directly.
3. If the slot is occupied, compare keys: a matching key means an update; a different key means a true collision.

This simple check is the foundation upon which all collision resolution strategies build.
Not all collisions are created equal. As we'll see when exploring collision resolution strategies, it's useful to distinguish between two types of collisions:

- Primary collisions: two distinct keys hash to the same home index. These are the unavoidable collisions guaranteed by the pigeonhole principle.
- Secondary collisions: a key's target slot is occupied by an entry that didn't hash there at all, but was displaced there by the resolution strategy itself, as happens with open addressing's probing (see the sketch below).
Why This Distinction Matters:
Primary collisions are unavoidable—they're inherent to hashing. Secondary collisions are a consequence of how we handle collisions and can be influenced by our choice of resolution strategy.
The chaining approach (which we'll explore in depth) eliminates secondary collisions entirely because each slot contains a collection of entries. Open addressing, by contrast, must deal with both types.
Understanding this distinction helps explain why different collision resolution strategies have different performance characteristics, especially as the table fills up.
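To make secondary collisions concrete, here is a minimal, hypothetical sketch of linear probing (covered properly in a later module). The probeInsert helper is illustrative only; it assumes the table is not full and the key is not already present:

```typescript
type Slot<V> = { key: string; value: V } | null;

// Hypothetical linear-probing insert, for illustration only.
// Assumes the table is not full and the key is not already present.
function probeInsert<V>(
  slots: Slot<V>[],
  homeIndex: number,
  key: string,
  value: V
): number {
  let i = homeIndex;
  while (slots[i] !== null) {
    // Occupied. This may be a primary collision (another key's home slot)
    // or a secondary collision (a key that was itself displaced here).
    i = (i + 1) % slots.length; // probe the next slot
  }
  slots[i] = { key, value };
  return i; // the key may land far from its home index
}
```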
One of chaining's greatest strengths is that it only deals with primary collisions. There's no concept of 'trying another slot'—every key goes exactly where its hash says, and multiple keys simply share that location in a linked structure.
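As a preview of the next page, here is a minimal sketch of a chained insert, assuming each bucket is simply an array of entries (real implementations often use linked lists):

```typescript
interface ChainEntry<K, V> {
  key: K;
  value: V;
}

// Minimal chaining sketch: every key goes to its home bucket, period.
class ChainedTable<K, V> {
  private buckets: ChainEntry<K, V>[][];

  constructor(capacity: number = 8) {
    this.buckets = Array.from({ length: capacity }, () => []);
  }

  private hash(key: K): number {
    const s = String(key);
    let h = 0;
    for (let i = 0; i < s.length; i++) h = (h * 31 + s.charCodeAt(i)) >>> 0;
    return h % this.buckets.length;
  }

  insert(key: K, value: V): void {
    const chain = this.buckets[this.hash(key)];
    const existing = chain.find((e) => e.key === key);
    if (existing) {
      existing.value = value; // same key: update in place
    } else {
      chain.push({ key, value }); // colliding keys simply share the bucket
    }
  }
}
```

Note that insert can never fail: a colliding key simply joins the chain at its home index, which is exactly why chaining never produces secondary collisions.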
Collisions don't just complicate insertion—they affect every hash table operation. Let's examine the impact on each core operation:
| Operation | Without Collisions | With Collisions (Naive) | With Collision Handling |
|---|---|---|---|
| Insert | O(1) - Direct placement | Fails or overwrites | O(1) avg - Place in chain/probe |
| Lookup | O(1) - Direct access | May return wrong value | O(k) where k = chain length |
| Delete | O(1) - Clear slot | May delete wrong entry | O(k) - Find then remove |
| Update | O(1) - Direct modification | May update wrong entry | O(k) - Find then modify |
The Lookup Challenge:
Lookup becomes particularly interesting with collisions. Consider looking up "cherry" when it shares index 3 with "apple":
```
function lookup(key):
    index = hash(key)

    // Without collision handling:
    return table[index].value
    // WRONG! Might return "apple" when looking for "cherry"

    // With collision handling (chaining):
    chain = table[index]
    for each entry in chain:
        if entry.key == key:
            return entry.value
    return NOT_FOUND

// CRITICAL: We must compare the actual key, not just the hash!
```

A hash gets you to the right neighborhood, but you must always verify you're at the right house. The hash index tells you WHERE to look, but you must COMPARE the actual keys to confirm identity. This is why hash functions must be paired with equality comparisons.
The Delete Challenge:
Deletion with collisions is even trickier, especially in open addressing schemes (which we'll cover in a later module). With chaining, deletion is relatively straightforward—find the entry in the chain and remove it. With open addressing, deletion requires special "tombstone" markers to avoid breaking probe sequences.
This complexity is one reason many engineers prefer chaining: it keeps the mental model simpler.
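Here is a minimal sketch of chained deletion, assuming the same array-of-entries bucket layout as the chaining preview above:

```typescript
// Chained deletion: find the entry by key and splice it out of its bucket.
function deleteFromChain<K, V>(
  chain: { key: K; value: V }[],
  key: K
): boolean {
  const i = chain.findIndex((e) => e.key === key);
  if (i === -1) return false; // key not present
  chain.splice(i, 1); // remove in place; no tombstones required
  return true;
}
```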
We've now thoroughly explored the collision problem—the fundamental challenge that every hash table must address. Let's consolidate the key insights:

- Collisions are mathematically inevitable: the pigeonhole principle guarantees them whenever keys outnumber slots, and the key space almost always dwarfs the table.
- Collisions arrive early: birthday-paradox math predicts a first collision after roughly √(π × m / 2) insertions, long before the table is half full.
- Detection is simple; resolution is essential: checking whether a slot is occupied is easy, but deciding what to do next requires a strategy.
- Collisions affect every operation: insert, lookup, update, and delete must all compare actual keys, not just hash indices.
What's Next:
Now that we understand why collisions happen and what problems they create, we're ready to explore the most common solution: chaining with linked lists. In the next page, we'll see how chaining elegantly resolves collisions by storing multiple entries at each slot, maintaining O(1) average-case performance while handling any number of collisions gracefully.
You now understand the collision problem in depth—its mathematical inevitability, its probabilistic behavior, and its impact on hash table operations. This foundation prepares you to appreciate why chaining is such an elegant and widely-used solution.