In an open-addressed hash table, searching for an element is fundamentally different from chaining. With chaining, we navigate to a bucket and traverse a linked list — the search is localized to one chain. With open addressing, the element we seek could be anywhere along its probe sequence, scattered across the entire table.
This scattering creates subtle challenges: How do we know when to stop probing? What happens when we encounter a gap in the probe sequence? How do deletions affect our ability to find elements that were inserted later?
This page provides a rigorous treatment of search operations in open-addressed hash tables. We'll develop the algorithm, prove its correctness, analyze its complexity, and understand the subtle invariants that make it work.
By the end of this page, you will understand:
• The fundamental search algorithm for open addressing
• Why empty slots serve as definitive 'not found' signals
• How the probe sequence invariant guarantees correctness
• The distinction between unsuccessful and successful searches
• Performance analysis for different probing strategies
• Edge cases and their proper handling
• The relationship between insertion order and search path
The search algorithm for open addressing follows the same probe sequence used during insertion. For a key k, we generate the probe sequence h(k, 0), h(k, 1), h(k, 2), ... and examine each slot in order.
The algorithm terminates in three cases:

1. We find a slot containing k — the search succeeds.
2. We find an empty slot — the search fails immediately.
3. We exhaust the probe sequence after examining all m slots — the search fails (possible only when the table is completely full).

The critical insight is case 2: an empty slot is a definitive negative. If k were in the table, it would have been placed at or before this empty slot during insertion.
```python
def search(self, key):
    """
    Search for key in an open-addressed hash table.

    Returns:
        The associated value if found, None otherwise.

    Complexity:
        - Average case: O(1/(1-α)) for unsuccessful search
        - Best case: O(1) when key is at initial hash position
        - Worst case: O(n) when probing entire table
    """
    probes = 0  # For analysis

    for index in self._probe_sequence(key):
        probes += 1
        slot = self.table[index]

        # Case 2: Empty slot — key definitely not in table.
        # An empty slot means no key was ever placed here, and
        # therefore no key with this probe sequence could have
        # been placed beyond this point.
        if slot.is_empty():
            return None

        # Case 1: Found the key — return its value
        if slot.key == key:
            return slot.value

        # Otherwise the slot holds a different key (or a tombstone):
        # continue probing along the sequence.

        # Safety check: prevent infinite loop if table is full
        if probes >= self.capacity:
            return None

    return None  # Case 3: Exhausted probe sequence
```

The correctness of search relies on a fundamental invariant maintained by insertion:
Invariant: For every key k in the table located at position p, if we follow the probe sequence for k, position p is the first empty slot we encounter.
This invariant ensures that searching for k will find it before encountering any empty slot. The empty slot marks the boundary — no element with probe sequence h(k, i) can exist beyond where we insert a new element.
Why Empty Slots Terminate the Search:
Consider how insertion works: when inserting key k, we follow the probe sequence until we find an empty slot, then place k there. This means:

1. Every slot probed before k's final position was occupied at the moment k was inserted.
2. Without deletions, an occupied slot never becomes empty again — so those slots remain occupied.
3. k sits at the first empty slot its probe sequence encountered at insertion time.
Now, during search:

1. We follow the exact same probe sequence for k.
2. Every slot before k's position is still occupied, so we cannot encounter an empty slot before reaching k.
3. If instead we reach an empty slot without having seen k, then k was never inserted past that point — insertion would have used that empty slot (or an earlier one).

Conclusion: If we find an empty slot, the key cannot exist later in the sequence. The search can safely terminate.
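This reasoning can be checked mechanically on a toy linear-probing table (a minimal sketch — `MiniTable` is a hypothetical stand-in for the page's table class, and it is insertion-only, so the invariant holds):

```python
class MiniTable:
    """Insertion-only linear-probing table used to verify the invariant."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity

    def _probe_sequence(self, key):
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            yield (h + i) % self.capacity

    def insert(self, key):
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] == key:
                self.keys[index] = key
                return index
        raise RuntimeError("table full")

t = MiniTable(11)
for k in ["A", "B", "C", "D", "E"]:
    t.insert(k)

# Invariant check: walking each key's probe sequence reaches the key
# before hitting any empty slot.
for k in ["A", "B", "C", "D", "E"]:
    for index in t._probe_sequence(k):
        assert t.keys[index] is not None, "hit an empty slot before the key"
        if t.keys[index] == k:
            break
```

Because no slot is ever re-emptied, the check passes no matter which slots the (randomized) string hashes pick.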
Tracing a Search Example:
Let's trace searching for key 'E' in a table with linear probing (m = 11):
Table State:

```
Index:     0  1  2  3  4  5  6  7  8  9  10
Contents:  _  A  B  C  _  D  E  _  F  _  _
```

Search for 'E' where h₁('E') = 4:

```
Probe sequence: 4, 5, 6, 7, ...

Probe 0: index = 4, slot is EMPTY
→ Search terminates, 'E' not found!
```
Wait, that's wrong! We can see 'E' is at index 6, but our search stopped at the empty slot at index 4. What happened?
This scenario is impossible if the table was built correctly. If 'E' is at index 6, then when 'E' was inserted, index 4 must have been occupied. The empty slot at index 4 means either:

1. An element was deleted from index 4 without special handling, or
2. The table state shown was never actually produced by a valid sequence of insertions.
This illustrates why deletions are dangerous in open-addressed hash tables!
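The failure mode is easy to reproduce with a toy linear-probing table over small integer keys, assuming h(k) = k mod m (all names here are hypothetical, not the page's table class):

```python
m = 11
keys = [None] * m

def probe(k):
    for i in range(m):
        yield (k + i) % m  # linear probing with h(k) = k mod m

def insert(k):
    for idx in probe(k):
        if keys[idx] is None:
            keys[idx] = k
            return

def search(k):
    for idx in probe(k):
        if keys[idx] is None:
            return False  # empty slot: definitive miss
        if keys[idx] == k:
            return True
    return False

# 4, 15, and 26 all hash to slot 4; they land at slots 4, 5, 6.
for k in (4, 15, 26):
    insert(k)
assert search(26)      # found at slot 6

# Naive deletion: simply empty the slot holding 15 (slot 5).
keys[5] = None
assert not search(26)  # broken! the search stops at the empty slot 5
```

Key 26 is still physically in the table, yet the search reports it missing — exactly the hazard tombstones are designed to prevent.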
The performance characteristics of searching differ significantly depending on whether the key exists in the table. Understanding this distinction is crucial for accurate performance analysis and system design.
Why Successful Search is Faster:
The key insight is that a successful search retraces the insertion path — it follows the same probe sequence that was followed when the key was inserted. On average, the table was less full when the key was inserted than it is now, so the original insertion required fewer probes.
Unsuccessful search, however, must probe until it finds an empty slot in the current state of the table — which is as full as it has ever been. The probe must traverse the entire run of occupied slots surrounding the key's home position before reaching an empty slot.
Successful search performance is tied to when each key was inserted, not just how full the table currently is. Keys inserted early (when the table was sparse) will be found quickly. Keys inserted recently (when the table was dense) may require many probes.
This is why the formula for successful search involves an average over all insertion points, not just the current load factor.
Mathematical Analysis:
For uniform hashing (each probe sequence is equally likely to be any permutation of table indices), the expected number of probes is:
Unsuccessful search: The expected number of probes is the expected number of occupied slots encountered before finding an empty one. With load factor α:
E[probes | unsuccessful] = 1/(1 - α)
Successful search: We average over all keys currently in the table. If key i was inserted when the load factor was αᵢ = (i-1)/m:
E[probes | successful] = (1/n) × Σᵢ₌₁ⁿ 1/(1 - αᵢ)
                       = (1/n) × Σᵢ₌₁ⁿ m/(m - i + 1)
                       ≈ (1/α) × ln(1/(1 - α))
Approximating the harmonic sum Σ m/(m - i + 1) by an integral yields the closed form for successful search.
| Load Factor (α) | Successful Search | Unsuccessful Search | Ratio |
|---|---|---|---|
| 0.25 | 1.15 probes | 1.33 probes | 1.16× |
| 0.50 | 1.39 probes | 2.00 probes | 1.44× |
| 0.75 | 1.85 probes | 4.00 probes | 2.16× |
| 0.90 | 2.56 probes | 10.00 probes | 3.91× |
| 0.95 | 3.15 probes | 20.00 probes | 6.35× |
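The table's entries follow directly from the two formulas above; a quick numeric sketch to check them:

```python
import math

def successful(alpha):
    """Expected probes for a hit: (1/α) ln(1/(1-α))."""
    return (1 / alpha) * math.log(1 / (1 - alpha))

def unsuccessful(alpha):
    """Expected probes for a miss: 1/(1-α)."""
    return 1 / (1 - alpha)

for alpha in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"α={alpha:.2f}: hit≈{successful(alpha):.2f}, miss≈{unsuccessful(alpha):.2f}")
# α=0.50 gives ≈1.39 probes for a hit vs exactly 2.00 for a miss
```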
Notice how the ratio widens dramatically at high load factors. At α = 0.95, unsuccessful searches require over six times as many probes as successful ones. This asymmetry has an important implication: workloads dominated by lookups of absent keys (membership tests, cache misses) degrade fastest as the table fills, so such tables should be resized at a lower load factor.
Understanding the geometry of search in open addressing helps build intuition for performance characteristics. Let's visualize what happens during search operations.
Linear Probing Search Visualization:
```
Table: [A] [C] [D] [_] [B] [E] [_] [F] [G] [H] [_]
Index:  0   1   2   3   4   5   6   7   8   9  10
```

Search for 'B' where h('B') = 1:

```
Probe 0: index 1 → 'C' (not B, continue)
Probe 1: index 2 → 'D' (not B, continue)
Probe 2: index 3 → empty (STOP! B not found)
```

Wait, but B is at index 4!
This illustrates a critical point: if B is at index 4, then when B was inserted, indices 1, 2, and 3 must have been occupied. The current empty slot at index 3 means something was deleted.
Without special handling, deletions break the search guarantee. We'll address this with tombstones in the next page.
Probe Chains and Their Structure:
A probe chain is the sequence of occupied slots that a search must traverse before either finding the key or hitting an empty slot.
For linear probing: chains are contiguous runs of occupied slots (clusters). A search scans the cluster containing the key's home slot, so chain length equals cluster length — and clusters grow as they absorb neighbors (primary clustering).

For quadratic probing: chains jump through the table with growing gaps, which breaks up primary clustering. However, all keys sharing a home slot still follow the same chain (secondary clustering).

For double hashing: each key follows its own key-dependent stride, so chains are scattered and largely independent — the closest practical approximation to uniform hashing.
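The contrast between the three chain shapes is visible from just a few probe indices for the same home slot (a sketch with illustrative values: m = 11, home slot 4, and a second hash of 7):

```python
m = 11

def linear(h, i):
    return (h + i) % m

def quadratic(h, i):
    return (h + i * i) % m

def double_hashing(h, h2, i):
    return (h + i * h2) % m

h, h2 = 4, 7  # example hash values
print([linear(h, i) for i in range(5)])              # [4, 5, 6, 7, 8]   contiguous run
print([quadratic(h, i) for i in range(5)])           # [4, 5, 8, 2, 9]   growing gaps
print([double_hashing(h, h2, i) for i in range(5)])  # [4, 0, 7, 3, 10]  key-dependent stride
```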
```python
import random

def analyze_probe_chains(table):
    """
    Analyze probe chain statistics for an open-addressed table.
    Returns metrics useful for understanding table performance.
    """
    total_probes_successful = 0
    total_probes_unsuccessful = 0
    keys_analyzed = 0
    max_chain = 0

    # For each key in the table, count probes to find it
    for key in table.all_keys():
        probes = 0
        for index in table._probe_sequence(key):
            probes += 1
            if table.keys[index] == key:
                break
        total_probes_successful += probes
        max_chain = max(max_chain, probes)
        keys_analyzed += 1

    # For random keys NOT in the table, measure unsuccessful search
    test_keys = [random.randint(0, 10**9) for _ in range(1000)]
    test_keys = [k for k in test_keys if k not in table][:100]

    for key in test_keys:
        probes = 0
        for index in table._probe_sequence(key):
            probes += 1
            if table.keys[index] is None:
                break
        total_probes_unsuccessful += probes

    return {
        'avg_successful': total_probes_successful / keys_analyzed,
        'avg_unsuccessful': total_probes_unsuccessful / len(test_keys),
        'max_chain': max_chain,
        'load_factor': keys_analyzed / table.capacity,
    }
```

If you observe probe chains significantly longer than expected, it indicates one of:

1. A poor hash function that is clustering keys,
2. A load factor that has grown too high,
3. Severe primary clustering from the probing strategy, or
4. An accumulation of tombstones from deletions.
Monitoring max chain length is a valuable health check for hash table implementations.
Insertion and search are intimately connected in open addressing — they must follow the exact same probe sequence to maintain correctness. This coupling is more rigid than in chaining, where each bucket is independent.
The Fundamental Contract:
For any key k, if insertion places k at index p, then searching for k must probe index p before encountering an empty slot.
This contract is maintained by using identical probe sequence generation for both operations:
```python
class OpenAddressedTable:
    def _probe_sequence(self, key):
        """
        Generate the probe sequence for key.

        CRITICAL: This exact sequence must be used by BOTH
        insert() AND search() for correctness.
        """
        h1 = self._hash1(key)
        h2 = self._hash2(key)  # For double hashing

        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def insert(self, key, value):
        """Insert key-value using the probe sequence."""
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                self.keys[index] = key
                self.values[index] = value
                return True
            if self.keys[index] == key:
                self.values[index] = value  # Update existing key
                return True
        return False  # Table is full

    def search(self, key):
        """Search for key using the SAME probe sequence."""
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None  # Not found
            if self.keys[index] == key:
                return self.values[index]
        return None
```

Why the Probe Sequence Must Be Deterministic:
If the probe sequence varied between insertion and search — for example, if it depended on a random seed or current time — we could insert a key at one position but search for it along a different path, never finding it.
The probe sequence must be:

• Deterministic — the same key always produces the same sequence
• Complete — over m probes it visits every slot, so a full table can still be scanned
• Stable — unchanged for the lifetime of the table (until a resize rebuilds it)
These properties ensure that insertion creates a path that search can reliably follow.
The hash function used must also be stable — it must return the same value for the same key across the lifetime of the table. In some languages (like Python), strings have randomized hashes by default (for security reasons), and the hash of an object can only be meaningful within one process execution.
If you serialize and deserialize a hash table, or transfer it between processes, you must either:

• Rehash every key on load, rebuilding the table with the new process's hash function, or
• Use an explicitly specified, process-independent hash function instead of the language default.
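One way to sidestep Python's per-process hash randomization is to derive slot positions from a cryptographic digest (a sketch; the `stable_hash` helper is hypothetical, not part of the page's table class):

```python
import hashlib

def stable_hash(key: str) -> int:
    """Process-independent hash for strings: the same key maps to the
    same integer in every run, on every machine — unlike Python's
    built-in hash(), which is randomized per process for str."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# A table serialized with slot = stable_hash(key) % capacity can be
# reloaded as-is; with the built-in hash() it would need a full rehash.
capacity = 11
slot = stable_hash("foo") % capacity
assert 0 <= slot < capacity
```

SHA-256 is slower than the built-in hash, so this trade-off only makes sense when tables must survive serialization or cross-process transfer.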
Insertion Order and Search Cost:
An interesting consequence of the insertion-search relationship is that insertion order affects search performance.
Consider three keys A, B, C all hashing to index 5 in a table of size 11:
Order 1: Insert A, then B, then C

```
A inserted at index 5 (1 probe)
B inserted at index 6 (2 probes)
C inserted at index 7 (3 probes)
Total insertion: 6 probes

Search for A: 1 probe
Search for B: 2 probes
Search for C: 3 probes
Total search: 6 probes
```
Order 2: Insert C, then B, then A

```
C inserted at index 5 (1 probe)
B inserted at index 6 (2 probes)
A inserted at index 7 (3 probes)
Total insertion: 6 probes

Search for C: 1 probe
Search for B: 2 probes
Search for A: 3 probes
Total search: 6 probes
```
The total search cost is the same, but the per-key costs differ! If C is accessed more frequently than A, Order 2 is more efficient. This observation leads to optimization techniques like Move-to-Front heuristics.
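The two orders can be simulated directly (a minimal sketch assuming, as above, that A, B, and C all share home slot 5 in a table of size 11):

```python
m = 11
home = 5  # A, B, and C all hash here

def insert(keys, k):
    """Linear-probing insert; returns the number of probes used."""
    for i in range(m):
        idx = (home + i) % m
        if keys[idx] is None:
            keys[idx] = k
            return i + 1
    raise RuntimeError("table full")

def search_cost(keys, k):
    """Probes needed to find k along its probe sequence."""
    for i in range(m):
        if keys[(home + i) % m] == k:
            return i + 1

for order in (["A", "B", "C"], ["C", "B", "A"]):
    keys = [None] * m
    for k in order:
        insert(keys, k)
    costs = {k: search_cost(keys, k) for k in "ABC"}
    print(order, costs, "total:", sum(costs.values()))
# Both orders total 6 probes, but the per-key costs are reversed.
```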
Robust implementations must handle several edge cases correctly. Let's examine each scenario and its proper handling:
1. Searching in an Empty Table
```python
def search(self, key):
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return None  # First slot is empty → not found
        ...
```
The first probe immediately finds an empty slot. The search terminates after exactly 1 probe, returning None. This is O(1) and optimal — we can't do better.
2. Searching in a Full Table (Load Factor = 1)
When the table is completely full:
```python
def search(self, key):
    probes = 0
    for index in self._probe_sequence(key):
        probes += 1
        if self.keys[index] is None:
            return None  # Empty slot (cannot occur while the table is full)
        if self.keys[index] == key:
            return self.values[index]
        if probes >= self.capacity:
            return None  # Full cycle without finding the key
    return None
```
In a full table, unsuccessful search requires probing ALL n slots — O(n) worst case. This is why load factor should never approach 1.
3. Searching for Key That Was Just Inserted
This should always succeed immediately (or within the same number of probes as insertion). If search fails to find a just-inserted key, there's a bug:
```python
# This should always work
table.insert('foo', 42)
assert table.search('foo') == 42  # Must succeed

# Common bugs that break this:
# 1. Search uses a different hash function than insert
# 2. Probe sequence is non-deterministic
# 3. Key comparison is broken (e.g., floating point keys)
```
4. Searching with Keys That Have Same Hash
When multiple keys hash to the same value, they form a chain. The search must traverse the chain:
```python
# Keys 'ABC' and 'BCA' both hash to index 7
table.insert('ABC', 1)
table.insert('BCA', 2)

# Search for 'BCA' (inserted second):
#   probe 0: index 7 → 'ABC' (not BCA, continue)
#   probe 1: index 8 → 'BCA' (found!)

# Search for 'XYZ' that also hashes to 7 but isn't in the table:
#   probe 0: index 7 → 'ABC' (not XYZ)
#   probe 1: index 8 → 'BCA' (not XYZ)
#   probe 2: index 9 → empty (XYZ not found)
```
Search correctness depends entirely on proper key comparison. Common pitfalls:
• Object identity vs equality: Using is instead of == in Python
• Floating point keys: values that differ only by rounding error compare unequal and hash differently — avoid floats as keys, or round/quantize them first (tolerance-based equality is incompatible with hashing, since equal keys must produce equal hashes)
• Mutable keys: If a key is modified after insertion, search will fail
• Encoding issues: 'café' (composed) vs 'café' (decomposed) are different strings
The safest approach is to use immutable, hashable types as keys (strings, numbers, tuples).
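The encoding pitfall in particular is easy to demonstrate: the composed and decomposed forms of 'café' render identically yet are distinct keys. Normalizing keys before both insert and search (sketched here with NFC) avoids the trap:

```python
import unicodedata

composed = "caf\u00e9"     # 'café' with a single precomposed é code point
decomposed = "cafe\u0301"  # 'cafe' followed by a combining acute accent

# They render identically but are different strings:
assert composed != decomposed

# Normalizing both keys to NFC makes them compare (and hash) equal:
assert unicodedata.normalize("NFC", decomposed) == composed
```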
5. Wrap-Around During Search
Probe sequences must wrap around the table:
```
Table size: 11
Key hashes to index 9
Probe sequence: 9, 10, 0, 1, 2, ...  (wraps from 10 to 0)

Probe 0: index 9
Probe 1: index 10
Probe 2: index 0   ← Wrapped around
Probe 3: index 1
...
```
The modulo operation handles this: (hash + offset) % capacity naturally wraps. Ensure your implementation uses modulo, not bounds checking.
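A one-line sketch shows the modulo producing the wrap-around for the trace above:

```python
capacity = 11
h = 9

# The modulo makes the probe sequence wrap from the last slot back to 0:
sequence = [(h + i) % capacity for i in range(5)]
print(sequence)  # [9, 10, 0, 1, 2]
```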
Several techniques can optimize search performance beyond the basic algorithm:
1. Early Termination with Metadata
Store metadata that allows quick rejection of unsuccessful searches:
```python
class OptimizedHashTable:
    def __init__(self, capacity):
        self.keys = [None] * capacity
        self.values = [None] * capacity
        # Metadata: store hash values to avoid key comparison
        self.hashes = [None] * capacity

    def search(self, key):
        key_hash = hash(key)
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None
            # Quick hash check before expensive key comparison
            if self.hashes[index] == key_hash:
                if self.keys[index] == key:
                    return self.values[index]
        return None
```
Hash comparison is much cheaper than full key comparison for strings. If hashes differ, keys definitely differ.
2. SIMD-Accelerated Probing (Swiss Tables)
Modern implementations like Google's SwissTable use SIMD instructions to probe multiple slots simultaneously:
```c
// Conceptual: check 16 slots at once using SSE/AVX
// ('hash_byte' is the search key's control byte broadcast to all lanes)
__m128i group = _mm_load_si128(&metadata[group_index]);
__m128i match = _mm_cmpeq_epi8(group, hash_byte);
uint32_t mask = _mm_movemask_epi8(match);

// 'mask' has a 1 bit for each slot that might contain our key
while (mask) {
    int slot = __builtin_ctz(mask);  // Index of lowest candidate slot
    if (keys[slot] == search_key) {
        return values[slot];
    }
    mask &= (mask - 1);  // Clear lowest set bit
}
```
This turns 16 sequential comparisons into parallel operations, dramatically speeding up both successful and unsuccessful searches.
3. Robin Hood Hashing Optimization
Robin Hood hashing is a variant where elements are reordered during insertion to minimize maximum probe distance:
```python
def search_robin_hood(self, key):
    probe_distance = 0
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return None
        # Early termination: if the element here has a shorter probe
        # distance than our current count, our key can't be further
        # along — Robin Hood insertion would have displaced this element.
        slot_probe_distance = self._get_probe_distance(index)
        if slot_probe_distance < probe_distance:
            return None  # Key would have been placed here
        if self.keys[index] == key:
            return self.values[index]
        probe_distance += 1
    return None
```
Robin Hood hashing bounds the maximum probe distance, providing more predictable worst-case performance.
The biggest practical optimization is often cache-conscious design:
• Linear probing naturally achieves good cache behavior because consecutive probes hit consecutive memory locations
• Store metadata separately from full key-value pairs, so cache lines contain more metadata
• Align slot size to cache line boundaries (64 bytes on modern CPUs)
• Prefetch the next probe while processing the current one
These optimizations can provide 2-5× speedup, often exceeding the gains from better probing strategies.
We've developed a comprehensive understanding of how search operates in open-addressed hash tables. Let's consolidate the key insights:

• Search follows the exact probe sequence used by insertion — the two operations form a contract.
• An empty slot is a definitive 'not found' signal, guaranteed by the insertion invariant.
• Successful searches are cheaper than unsuccessful ones, because they retrace a path created when the table was emptier.
• Expected cost depends sharply on load factor: roughly (1/α)ln(1/(1-α)) probes for hits and 1/(1-α) for misses under uniform hashing.
• Naive deletion (simply emptying a slot) silently breaks the invariant; careful edge-case handling and monitoring of probe chain lengths keep the structure healthy.
What's next:
We've seen that empty slots are crucial for search termination. But what happens when we need to delete an element? Simply emptying its slot would break search for any keys that were inserted later and probed past it. The next page addresses the challenging problem of deletion in open-addressed tables and introduces the tombstone mechanism.
You now understand how searching works in open-addressed hash tables, including the algorithm, correctness invariants, performance characteristics, and edge cases. Next, we'll tackle the thorny problem of deletion — why it's difficult, how tombstones solve it, and what challenges they introduce.