In an open-addressed hash table, searching for an element is fundamentally different from chaining. With chaining, we navigate to a bucket and traverse a linked list — the search is localized to one chain. With open addressing, the element we seek could be anywhere along its probe sequence, scattered across the entire table.
This scattering creates subtle challenges: How do we know when to stop probing? What happens when we encounter a gap in the probe sequence? How do deletions affect our ability to find elements that were inserted later?
This page provides a rigorous treatment of search operations in open-addressed hash tables. We'll develop the algorithm, prove its correctness, analyze its complexity, and understand the subtle invariants that make it work.
By the end of this page, you will understand:
• The fundamental search algorithm for open addressing
• Why empty slots serve as definitive 'not found' signals
• How the probe sequence invariant guarantees correctness
• The distinction between unsuccessful and successful searches
• Performance analysis for different probing strategies
• Edge cases and their proper handling
• The relationship between insertion order and search path
The search algorithm for open addressing follows the same probe sequence used during insertion. For a key k, we generate the probe sequence h(k, 0), h(k, 1), h(k, 2), ... and examine each slot in order.
The algorithm terminates in three cases:

1. We find a slot containing k — the search succeeds.
2. We find an empty slot — the search fails immediately.
3. We exhaust the probe sequence after examining all m slots — the search fails (possible only when the table is completely full).

The critical insight is case 2: an empty slot is a definitive negative. If k were in the table, it would have been placed at or before this empty slot during insertion.
```python
def search(self, key):
    """
    Search for key in an open-addressed hash table.

    Returns:
        The associated value if found, None otherwise.

    Complexity:
        - Average case: O(1/(1-α)) for unsuccessful search
        - Best case: O(1) when key is at initial hash position
        - Worst case: O(n) when probing entire table
    """
    probes = 0  # For analysis

    for index in self._probe_sequence(key):
        probes += 1
        slot = self.table[index]

        # Case 2: Empty slot — key definitely not in table.
        # An empty slot means no key was ever placed here, and
        # therefore no key with this probe sequence could have
        # been placed beyond this point.
        if slot.is_empty():
            return None

        # Case 1: Found the key — return its value
        if slot.key == key:
            return slot.value

        # Otherwise the slot holds a different key (or a tombstone):
        # continue probing along the sequence.

        # Safety check: prevent infinite loop if table is full
        if probes >= self.capacity:
            return None

    return None  # Case 3: Exhausted probe sequence
```

The correctness of search relies on a fundamental invariant maintained by insertion:
Invariant: For every key k in the table located at position p, if we follow the probe sequence for k, position p is the first empty slot we encounter.
This invariant ensures that searching for k will find it before encountering any empty slot. The empty slot marks the boundary — no element with probe sequence h(k, i) can exist beyond where we insert a new element.
Why Empty Slots Terminate the Search:
Consider how insertion works: when inserting key k, we follow the probe sequence until we find an empty slot, then place k there. This means:

1. Every slot probed before k's final position was occupied at the moment k was inserted.
2. Without deletions, an occupied slot never becomes empty again — so those slots remain occupied.
3. k sits at the first empty slot its probe sequence encountered at insertion time.
Now, during search:

1. We follow the exact same probe sequence for k.
2. Every slot before k's position is still occupied, so we cannot encounter an empty slot before reaching k.
3. If instead we reach an empty slot without having seen k, then k was never inserted past that point — insertion would have used that empty slot (or an earlier one).

Conclusion: If we find an empty slot, the key cannot exist later in the sequence. The search can safely terminate.
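This reasoning can be checked mechanically on a toy linear-probing table (a minimal sketch — `MiniTable` is a hypothetical stand-in for the page's table class, and it is insertion-only, so the invariant holds):

```python
class MiniTable:
    """Insertion-only linear-probing table used to verify the invariant."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity

    def _probe_sequence(self, key):
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            yield (h + i) % self.capacity

    def insert(self, key):
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] == key:
                self.keys[index] = key
                return index
        raise RuntimeError("table full")

t = MiniTable(11)
for k in ["A", "B", "C", "D", "E"]:
    t.insert(k)

# Invariant check: walking each key's probe sequence reaches the key
# before hitting any empty slot.
for k in ["A", "B", "C", "D", "E"]:
    for index in t._probe_sequence(k):
        assert t.keys[index] is not None, "hit an empty slot before the key"
        if t.keys[index] == k:
            break
```

Because no slot is ever re-emptied, the check passes no matter which slots the (randomized) string hashes pick.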
Tracing a Search Example:
Let's trace searching for key 'E' in a table with linear probing (m = 11):
Table State:

```
Index:     0  1  2  3  4  5  6  7  8  9  10
Contents:  _  A  B  C  _  D  E  _  F  _  _
```

Search for 'E' where h₁('E') = 4:

```
Probe sequence: 4, 5, 6, 7, ...

Probe 0: index = 4, slot is EMPTY
→ Search terminates, 'E' not found!
```
Wait, that's wrong! We can see 'E' is at index 6, but our search stopped at the empty slot at index 4. What happened?
This scenario is impossible if the table was built correctly. If 'E' is at index 6, then when 'E' was inserted, index 4 must have been occupied. The empty slot at index 4 means either:

1. An element was deleted from index 4 without special handling, or
2. The table state shown was never actually produced by a valid sequence of insertions.
This illustrates why deletions are dangerous in open-addressed hash tables!
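The failure mode is easy to reproduce with a toy linear-probing table over small integer keys, assuming h(k) = k mod m (all names here are hypothetical, not the page's table class):

```python
m = 11
keys = [None] * m

def probe(k):
    for i in range(m):
        yield (k + i) % m  # linear probing with h(k) = k mod m

def insert(k):
    for idx in probe(k):
        if keys[idx] is None:
            keys[idx] = k
            return

def search(k):
    for idx in probe(k):
        if keys[idx] is None:
            return False  # empty slot: definitive miss
        if keys[idx] == k:
            return True
    return False

# 4, 15, and 26 all hash to slot 4; they land at slots 4, 5, 6.
for k in (4, 15, 26):
    insert(k)
assert search(26)      # found at slot 6

# Naive deletion: simply empty the slot holding 15 (slot 5).
keys[5] = None
assert not search(26)  # broken! the search stops at the empty slot 5
```

Key 26 is still physically in the table, yet the search reports it missing — exactly the hazard tombstones are designed to prevent.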
The performance characteristics of searching differ significantly depending on whether the key exists in the table. Understanding this distinction is crucial for accurate performance analysis and system design.
Why Successful Search is Faster:
The key insight is that a successful search retraces the insertion path — it follows the same probe sequence that was followed when the key was inserted. On average, the table was less full when the key was inserted than it is now, so the original insertion required fewer probes.
Unsuccessful search, however, must probe until it finds an empty slot in the current state of the table — which is as full as it has ever been. The probe must traverse the entire run of occupied slots surrounding the key's home position before reaching an empty slot.
Successful search performance is tied to when each key was inserted, not just how full the table currently is. Keys inserted early (when the table was sparse) will be found quickly. Keys inserted recently (when the table was dense) may require many probes.
This is why the formula for successful search involves an average over all insertion points, not just the current load factor.
Mathematical Analysis:
For uniform hashing (each probe sequence is equally likely to be any permutation of table indices), the expected number of probes is:
Unsuccessful search: The expected number of probes is the expected number of occupied slots encountered before finding an empty one. With load factor α:
E[probes | unsuccessful] = 1/(1 - α)
Successful search: We average over all keys currently in the table. If key i was inserted when the load factor was αᵢ = (i-1)/m:
E[probes | successful] = (1/n) × Σᵢ₌₁ⁿ 1/(1 - αᵢ)
                       = (1/n) × Σᵢ₌₁ⁿ m/(m - i + 1)
                       ≈ (1/α) × ln(1/(1 - α))
Approximating the harmonic sum Σ m/(m - i + 1) by an integral yields the closed form for successful search.
| Load Factor (α) | Successful Search | Unsuccessful Search | Ratio |
|---|---|---|---|
| 0.25 | 1.15 probes | 1.33 probes | 1.16× |
| 0.50 | 1.39 probes | 2.00 probes | 1.44× |
| 0.75 | 1.85 probes | 4.00 probes | 2.16× |
| 0.90 | 2.56 probes | 10.00 probes | 3.91× |
| 0.95 | 3.15 probes | 20.00 probes | 6.35× |
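The table's entries follow directly from the two formulas above; a quick numeric sketch to check them:

```python
import math

def successful(alpha):
    """Expected probes for a hit: (1/α) ln(1/(1-α))."""
    return (1 / alpha) * math.log(1 / (1 - alpha))

def unsuccessful(alpha):
    """Expected probes for a miss: 1/(1-α)."""
    return 1 / (1 - alpha)

for alpha in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"α={alpha:.2f}: hit≈{successful(alpha):.2f}, miss≈{unsuccessful(alpha):.2f}")
# α=0.50 gives ≈1.39 probes for a hit vs exactly 2.00 for a miss
```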
Notice how the ratio widens dramatically at high load factors. At α = 0.95, unsuccessful searches require over six times as many probes as successful ones. This asymmetry has an important implication: workloads dominated by lookups of absent keys (membership tests, cache misses) degrade fastest as the table fills, so such tables should be resized at a lower load factor.
Understanding the geometry of search in open addressing helps build intuition for performance characteristics. Let's visualize what happens during search operations.
Linear Probing Search Visualization:
```
Table: [A] [C] [D] [_] [B] [E] [_] [F] [G] [H] [_]
Index:  0   1   2   3   4   5   6   7   8   9  10
```

Search for 'B' where h('B') = 1:

```
Probe 0: index 1 → 'C' (not B, continue)
Probe 1: index 2 → 'D' (not B, continue)
Probe 2: index 3 → empty (STOP! B not found)
```

Wait, but B is at index 4!
This illustrates a critical point: if B is at index 4, then when B was inserted, indices 1, 2, and 3 must have been occupied. The current empty slot at index 3 means something was deleted.
Without special handling, deletions break the search guarantee. We'll address this with tombstones in the next page.
Probe Chains and Their Structure:
A probe chain is the sequence of occupied slots that a search must traverse before either finding the key or hitting an empty slot.
For linear probing: chains are contiguous runs of occupied slots (clusters). A search scans the cluster containing the key's home slot, so chain length equals cluster length — and clusters grow as they absorb neighbors (primary clustering).

For quadratic probing: chains jump through the table with growing gaps, which breaks up primary clustering. However, all keys sharing a home slot still follow the same chain (secondary clustering).

For double hashing: each key follows its own key-dependent stride, so chains are scattered and largely independent — the closest practical approximation to uniform hashing.
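The contrast between the three chain shapes is visible from just a few probe indices for the same home slot (a sketch with illustrative values: m = 11, home slot 4, and a second hash of 7):

```python
m = 11

def linear(h, i):
    return (h + i) % m

def quadratic(h, i):
    return (h + i * i) % m

def double_hashing(h, h2, i):
    return (h + i * h2) % m

h, h2 = 4, 7  # example hash values
print([linear(h, i) for i in range(5)])              # [4, 5, 6, 7, 8]   contiguous run
print([quadratic(h, i) for i in range(5)])           # [4, 5, 8, 2, 9]   growing gaps
print([double_hashing(h, h2, i) for i in range(5)])  # [4, 0, 7, 3, 10]  key-dependent stride
```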
```python
import random

def analyze_probe_chains(table):
    """
    Analyze probe chain statistics for an open-addressed table.
    Returns metrics useful for understanding table performance.
    """
    total_probes_successful = 0
    total_probes_unsuccessful = 0
    keys_analyzed = 0
    max_chain = 0

    # For each key in the table, count probes to find it
    for key in table.all_keys():
        probes = 0
        for index in table._probe_sequence(key):
            probes += 1
            if table.keys[index] == key:
                break
        total_probes_successful += probes
        max_chain = max(max_chain, probes)
        keys_analyzed += 1

    # For random keys NOT in the table, measure unsuccessful search
    test_keys = [random.randint(0, 10**9) for _ in range(1000)]
    test_keys = [k for k in test_keys if k not in table][:100]

    for key in test_keys:
        probes = 0
        for index in table._probe_sequence(key):
            probes += 1
            if table.keys[index] is None:
                break
        total_probes_unsuccessful += probes

    return {
        'avg_successful': total_probes_successful / keys_analyzed,
        'avg_unsuccessful': total_probes_unsuccessful / len(test_keys),
        'max_chain': max_chain,
        'load_factor': keys_analyzed / table.capacity,
    }
```

If you observe probe chains significantly longer than expected, it indicates one of:

1. A poor hash function that is clustering keys,
2. A load factor that has grown too high,
3. Severe primary clustering from the probing strategy, or
4. An accumulation of tombstones from deletions.
Monitoring max chain length is a valuable health check for hash table implementations.
Insertion and search are intimately connected in open addressing — they must follow the exact same probe sequence to maintain correctness. This coupling is more rigid than in chaining, where each bucket is independent.
The Fundamental Contract:
For any key k, if insertion places k at index p, then searching for k must probe index p before encountering an empty slot.
This contract is maintained by using identical probe sequence generation for both operations:
```python
class OpenAddressedTable:
    def _probe_sequence(self, key):
        """
        Generate the probe sequence for key.

        CRITICAL: This exact sequence must be used by BOTH
        insert() AND search() for correctness.
        """
        h1 = self._hash1(key)
        h2 = self._hash2(key)  # For double hashing

        for i in range(self.capacity):
            yield (h1 + i * h2) % self.capacity

    def insert(self, key, value):
        """Insert key-value using the probe sequence."""
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                self.keys[index] = key
                self.values[index] = value
                return True
            if self.keys[index] == key:
                self.values[index] = value  # Update existing key
                return True
        return False  # Table is full

    def search(self, key):
        """Search for key using the SAME probe sequence."""
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None  # Not found
            if self.keys[index] == key:
                return self.values[index]
        return None
```

Why the Probe Sequence Must Be Deterministic:
If the probe sequence varied between insertion and search — for example, if it depended on a random seed or current time — we could insert a key at one position but search for it along a different path, never finding it.
The probe sequence must be:

• Deterministic — the same key always produces the same sequence
• Complete — over m probes it visits every slot, so a full table can still be scanned
• Stable — unchanged for the lifetime of the table (until a resize rebuilds it)
These properties ensure that insertion creates a path that search can reliably follow.
The hash function used must also be stable — it must return the same value for the same key across the lifetime of the table. In some languages (like Python), strings have randomized hashes by default (for security reasons), and the hash of an object can only be meaningful within one process execution.
If you serialize and deserialize a hash table, or transfer it between processes, you must either:

• Rehash every key on load, rebuilding the table with the new process's hash function, or
• Use an explicitly specified, process-independent hash function instead of the language default.
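One way to sidestep Python's per-process hash randomization is to derive slot positions from a cryptographic digest (a sketch; the `stable_hash` helper is hypothetical, not part of the page's table class):

```python
import hashlib

def stable_hash(key: str) -> int:
    """Process-independent hash for strings: the same key maps to the
    same integer in every run, on every machine — unlike Python's
    built-in hash(), which is randomized per process for str."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# A table serialized with slot = stable_hash(key) % capacity can be
# reloaded as-is; with the built-in hash() it would need a full rehash.
capacity = 11
slot = stable_hash("foo") % capacity
assert 0 <= slot < capacity
```

SHA-256 is slower than the built-in hash, so this trade-off only makes sense when tables must survive serialization or cross-process transfer.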
Insertion Order and Search Cost:
An interesting consequence of the insertion-search relationship is that insertion order affects search performance.
Consider three keys A, B, C all hashing to index 5 in a table of size 11:
Order 1: Insert A, then B, then C

```
A inserted at index 5 (1 probe)
B inserted at index 6 (2 probes)
C inserted at index 7 (3 probes)
Total insertion: 6 probes

Search for A: 1 probe
Search for B: 2 probes
Search for C: 3 probes
Total search: 6 probes
```
Order 2: Insert C, then B, then A

```
C inserted at index 5 (1 probe)
B inserted at index 6 (2 probes)
A inserted at index 7 (3 probes)
Total insertion: 6 probes

Search for C: 1 probe
Search for B: 2 probes
Search for A: 3 probes
Total search: 6 probes
```
The total search cost is the same, but the per-key costs differ! If C is accessed more frequently than A, Order 2 is more efficient. This observation leads to optimization techniques like Move-to-Front heuristics.
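The two orders can be simulated directly (a minimal sketch assuming, as above, that A, B, and C all share home slot 5 in a table of size 11):

```python
m = 11
home = 5  # A, B, and C all hash here

def insert(keys, k):
    """Linear-probing insert; returns the number of probes used."""
    for i in range(m):
        idx = (home + i) % m
        if keys[idx] is None:
            keys[idx] = k
            return i + 1
    raise RuntimeError("table full")

def search_cost(keys, k):
    """Probes needed to find k along its probe sequence."""
    for i in range(m):
        if keys[(home + i) % m] == k:
            return i + 1

for order in (["A", "B", "C"], ["C", "B", "A"]):
    keys = [None] * m
    for k in order:
        insert(keys, k)
    costs = {k: search_cost(keys, k) for k in "ABC"}
    print(order, costs, "total:", sum(costs.values()))
# Both orders total 6 probes, but the per-key costs are reversed.
```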
Robust implementations must handle several edge cases correctly. Let's examine each scenario and its proper handling:
1. Searching in an Empty Table
```python
def search(self, key):
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return None  # First slot is empty → not found
        ...
```
The first probe immediately finds an empty slot. The search terminates after exactly 1 probe, returning None. This is O(1) and optimal — we can't do better.
2. Searching in a Full Table (Load Factor = 1)
When the table is completely full:
```python
def search(self, key):
    probes = 0
    for index in self._probe_sequence(key):
        probes += 1
        if self.keys[index] is None:
            return None  # Empty slot (cannot occur while the table is full)
        if self.keys[index] == key:
            return self.values[index]
        if probes >= self.capacity:
            return None  # Full cycle without finding the key
    return None
```
In a full table, unsuccessful search requires probing ALL n slots — O(n) worst case. This is why load factor should never approach 1.
3. Searching for Key That Was Just Inserted
This should always succeed immediately (or within the same number of probes as insertion). If search fails to find a just-inserted key, there's a bug:
```python
# This should always work
table.insert('foo', 42)
assert table.search('foo') == 42  # Must succeed

# Common bugs that break this:
# 1. Search uses a different hash function than insert
# 2. Probe sequence is non-deterministic
# 3. Key comparison is broken (e.g., floating point keys)
```
4. Searching with Keys That Have Same Hash
When multiple keys hash to the same value, they form a chain. The search must traverse the chain:
```python
# Keys 'ABC' and 'BCA' both hash to index 7
table.insert('ABC', 1)
table.insert('BCA', 2)

# Search for 'BCA' (inserted second):
#   probe 0: index 7 → 'ABC' (not BCA, continue)
#   probe 1: index 8 → 'BCA' (found!)

# Search for 'XYZ' that also hashes to 7 but isn't in the table:
#   probe 0: index 7 → 'ABC' (not XYZ)
#   probe 1: index 8 → 'BCA' (not XYZ)
#   probe 2: index 9 → empty (XYZ not found)
```
Search correctness depends entirely on proper key comparison. Common pitfalls:
• Object identity vs equality: Using is instead of == in Python
• Floating point keys: values that differ only by rounding error compare unequal and hash differently — avoid floats as keys, or round/quantize them first (tolerance-based equality is incompatible with hashing, since equal keys must produce equal hashes)
• Mutable keys: If a key is modified after insertion, search will fail
• Encoding issues: 'café' (composed) vs 'café' (decomposed) are different strings
The safest approach is to use immutable, hashable types as keys (strings, numbers, tuples).
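The encoding pitfall in particular is easy to demonstrate: the composed and decomposed forms of 'café' render identically yet are distinct keys. Normalizing keys before both insert and search (sketched here with NFC) avoids the trap:

```python
import unicodedata

composed = "caf\u00e9"     # 'café' with a single precomposed é code point
decomposed = "cafe\u0301"  # 'cafe' followed by a combining acute accent

# They render identically but are different strings:
assert composed != decomposed

# Normalizing both keys to NFC makes them compare (and hash) equal:
assert unicodedata.normalize("NFC", decomposed) == composed
```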
5. Wrap-Around During Search
Probe sequences must wrap around the table:
```
Table size: 11
Key hashes to index 9
Probe sequence: 9, 10, 0, 1, 2, ...  (wraps from 10 to 0)

Probe 0: index 9
Probe 1: index 10
Probe 2: index 0   ← Wrapped around
Probe 3: index 1
...
```
The modulo operation handles this: (hash + offset) % capacity naturally wraps. Ensure your implementation uses modulo, not bounds checking.
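A one-line sketch shows the modulo producing the wrap-around for the trace above:

```python
capacity = 11
h = 9

# The modulo makes the probe sequence wrap from the last slot back to 0:
sequence = [(h + i) % capacity for i in range(5)]
print(sequence)  # [9, 10, 0, 1, 2]
```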
Several techniques can optimize search performance beyond the basic algorithm:
1. Early Termination with Metadata
Store metadata that allows quick rejection of unsuccessful searches:
```python
class OptimizedHashTable:
    def __init__(self, capacity):
        self.keys = [None] * capacity
        self.values = [None] * capacity
        # Metadata: store hash values to avoid key comparison
        self.hashes = [None] * capacity

    def search(self, key):
        key_hash = hash(key)
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None
            # Quick hash check before expensive key comparison
            if self.hashes[index] == key_hash:
                if self.keys[index] == key:
                    return self.values[index]
        return None
```
Hash comparison is much cheaper than full key comparison for strings. If hashes differ, keys definitely differ.
2. SIMD-Accelerated Probing (Swiss Tables)
Modern implementations like Google's SwissTable use SIMD instructions to probe multiple slots simultaneously:
```c
// Conceptual: check 16 slots at once using SSE/AVX
// ('hash_byte' is the search key's control byte broadcast to all lanes)
__m128i group = _mm_load_si128(&metadata[group_index]);
__m128i match = _mm_cmpeq_epi8(group, hash_byte);
uint32_t mask = _mm_movemask_epi8(match);

// 'mask' has a 1 bit for each slot that might contain our key
while (mask) {
    int slot = __builtin_ctz(mask);  // Index of lowest candidate slot
    if (keys[slot] == search_key) {
        return values[slot];
    }
    mask &= (mask - 1);  // Clear lowest set bit
}
```
This turns 16 sequential comparisons into parallel operations, dramatically speeding up both successful and unsuccessful searches.
3. Robin Hood Hashing Optimization
Robin Hood hashing is a variant where elements are reordered during insertion to minimize maximum probe distance:
```python
def search_robin_hood(self, key):
    probe_distance = 0
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return None
        # Early termination: if the element here has a shorter probe
        # distance than our current count, our key can't be further
        # along — Robin Hood insertion would have displaced this element.
        slot_probe_distance = self._get_probe_distance(index)
        if slot_probe_distance < probe_distance:
            return None  # Key would have been placed here
        if self.keys[index] == key:
            return self.values[index]
        probe_distance += 1
    return None
```
Robin Hood hashing bounds the maximum probe distance, providing more predictable worst-case performance.
The biggest practical optimization is often cache-conscious design:
• Linear probing naturally achieves good cache behavior because consecutive probes hit consecutive memory locations
• Store metadata separately from full key-value pairs, so cache lines contain more metadata
• Align slot size to cache line boundaries (64 bytes on modern CPUs)
• Prefetch the next probe while processing the current one
These optimizations can provide 2-5× speedup, often exceeding the gains from better probing strategies.
We've developed a comprehensive understanding of how search operates in open-addressed hash tables. Let's consolidate the key insights:

• Search follows the exact probe sequence used by insertion — the two operations form a contract.
• An empty slot is a definitive 'not found' signal, guaranteed by the insertion invariant.
• Successful searches are cheaper than unsuccessful ones, because they retrace a path created when the table was emptier.
• Expected cost depends sharply on load factor: roughly (1/α)ln(1/(1-α)) probes for hits and 1/(1-α) for misses under uniform hashing.
• Naive deletion (simply emptying a slot) silently breaks the invariant; careful edge-case handling and monitoring of probe chain lengths keep the structure healthy.
What's next:
We've seen that empty slots are crucial for search termination. But what happens when we need to delete an element? Simply emptying its slot would break search for any keys that were inserted later and probed past it. The next page addresses the challenging problem of deletion in open-addressed tables and introduces the tombstone mechanism.
You now understand how searching works in open-addressed hash tables, including the algorithm, correctness invariants, performance characteristics, and edge cases. Next, we'll tackle the thorny problem of deletion — why it's difficult, how tombstones solve it, and what challenges they introduce.