Deletion in open-addressed hash tables presents a fundamental problem that doesn't exist with chaining. In chaining, deleting an element simply removes a node from a linked list — the bucket structure remains intact. In open addressing, deleting an element creates a hole in the probe sequence, potentially severing the connection to elements inserted afterward.
This page confronts the deletion problem head-on. We'll understand why naive deletion breaks the data structure, develop the tombstone (or lazy deletion) solution, analyze its tradeoffs, and explore advanced techniques for managing the complexity tombstones introduce.
By the end of this page, you will understand:
• Why naive deletion breaks search in open addressing
• The tombstone mechanism and how it preserves probe sequence integrity
• How tombstones affect insertion, search, and load factor calculations
• The accumulation problem and why tombstones degrade performance over time
• Strategies for tombstone cleanup: rehashing and periodic purging
• Alternative deletion approaches: backward-shift and move-to-fill
• When to choose each deletion strategy
Let's start with the intuitive approach: to delete a key, simply mark its slot as empty. This seems reasonable — after all, that's how we'd remove an element from an array. But in open addressing, this intuition leads to catastrophic failure.
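To make the failure concrete, here is what naive deletion looks like in code — a minimal sketch written in the style of the full implementation later on this page (the keys/values slot arrays and the _probe_sequence helper are assumed from that implementation):

def delete_naive(self, key) -> bool:
    # Find the key along its probe sequence and blank the slot.
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return False              # empty slot: key is not in the table
        if self.keys[index] == key:
            self.keys[index] = None   # creates a hole in the probe chain!
            self.values[index] = None
            self.size -= 1
            return True
    return False

The walkthrough below shows exactly how that blanked slot breaks later searches.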
The Problem: Severed Probe Chains
Recall the search invariant: when we encounter an empty slot during search, we conclude the key isn't in the table. But if we've deleted an element that was between the initial hash position and the target key, we create a false empty slot that prematurely terminates search.
Step-by-Step Demonstration:
Consider a hash table with linear probing (m = 11). We insert three keys that all hash to index 3:
Step 1: Insert A (h(A) = 3)
[ _ | _ | _ | A | _ | _ | _ | _ | _ | _ | _ ]
0 1 2 3 4 5 6 7 8 9 10
Step 2: Insert B (h(B) = 3) — collides, probes to 4
[ _ | _ | _ | A | B | _ | _ | _ | _ | _ | _ ]
Step 3: Insert C (h(C) = 3) — collides at 3 and 4, probes to 5
[ _ | _ | _ | A | B | C | _ | _ | _ | _ | _ ]
Now let's naively delete B:
Step 4: Delete B — mark slot 4 as empty
[ _ | _ | _ | A | _ | C | _ | _ | _ | _ | _ ]
^
Empty slot created
Now search for C:
Search C (h(C) = 3):
Probe 0: index 3 → A (not C, continue)
Probe 1: index 4 → EMPTY
→ Search terminates. C not found!
But C is still in the table at index 5!
Naive deletion doesn't just cause search to fail — it corrupts the data structure. Elements remain in the table but become unreachable. This is a severe bug:
• Searching for C returns 'not found' even though C exists
• Inserting C again would create a duplicate at a different slot
• The table now violates its fundamental invariants
This corruption is insidious because it doesn't cause an immediate crash — the table appears to function, but returns incorrect results.
The Root Cause:
The problem stems from the dual meaning of 'empty' in naive implementations. A blank slot can mean 'never used' — the key cannot lie further along this probe sequence — or 'used once, then deleted' — the key may still lie further along.
For search, these two meanings require different actions: stop in the first case, keep probing in the second.
Naive deletion conflates these meanings into a single state, causing the search algorithm to stop early when it should continue.
The solution is to introduce a third state for slots: instead of just 'empty' and 'occupied', we add 'deleted' — also known as a tombstone.
The Three Slot States:
| State | Meaning | Search Behavior | Insert Behavior |
|---|---|---|---|
| Empty | Never used | Stop, key not found | Use this slot |
| Occupied | Contains a key | Compare key, continue if mismatch | Update if key matches, else continue probing |
| Tombstone | Previously occupied, now deleted | Continue probing | Can reuse this slot (once the key is confirmed absent) |
How Tombstones Preserve Correctness:
Revisiting our example with tombstones:
After inserting A, B, C:
[ _ | _ | _ | A | B | C | _ | _ | _ | _ | _ ]
Delete B with tombstone (⊗):
[ _ | _ | _ | A | ⊗ | C | _ | _ | _ | _ | _ ]
Search for C:
Probe 0: index 3 → A (not C, continue)
Probe 1: index 4 → TOMBSTONE (continue probing!)
Probe 2: index 5 → C (found!)
The tombstone preserves the probe chain. Search knows to continue past deleted slots, eventually finding C.
class TombstoneHashTable:
    # Sentinel object for tombstones - distinct from None (empty)
    TOMBSTONE = object()

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.keys = [None] * capacity    # None = empty, TOMBSTONE = deleted
        self.values = [None] * capacity
        self.size = 0                    # Active elements (excludes tombstones)
        self.tombstone_count = 0         # Track tombstones separately

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def _probe_sequence(self, key):
        # Linear probing (assumed here): start at the home slot, wrap around,
        # and visit each slot at most once
        start = self._hash(key)
        for i in range(self.capacity):
            yield (start + i) % self.capacity

    def _is_empty(self, index: int) -> bool:
        return self.keys[index] is None

    def _is_tombstone(self, index: int) -> bool:
        return self.keys[index] is self.TOMBSTONE

    def _is_occupied(self, index: int) -> bool:
        return self.keys[index] is not None and not self._is_tombstone(index)

    def search(self, key):
        """
        Search for key, treating tombstones as 'continue probing'.
        """
        for index in self._probe_sequence(key):
            if self._is_empty(index):
                return None          # True empty → key not in table
            if self._is_tombstone(index):
                continue             # Deleted slot → keep probing
            if self.keys[index] == key:
                return self.values[index]
        return None

    def insert(self, key, value) -> bool:
        """
        Insert key-value, reusing tombstone slots when possible.
        """
        first_tombstone = None       # Track first tombstone encountered
        for index in self._probe_sequence(key):
            if self._is_empty(index):
                # Use first tombstone if we found one, else this empty slot
                target = first_tombstone if first_tombstone is not None else index
                self.keys[target] = key
                self.values[target] = value
                if target == first_tombstone:
                    self.tombstone_count -= 1   # Reused a tombstone
                self.size += 1
                return True
            if self._is_tombstone(index):
                if first_tombstone is None:
                    first_tombstone = index     # Remember for potential reuse
                continue
            if self.keys[index] == key:
                self.values[index] = value      # Update existing
                return True
        if first_tombstone is not None:
            # No empty slot anywhere, but a tombstone can still be reused
            self.keys[first_tombstone] = key
            self.values[first_tombstone] = value
            self.tombstone_count -= 1
            self.size += 1
            return True
        return False

    def delete(self, key) -> bool:
        """
        Delete key by replacing it with a tombstone.
        """
        for index in self._probe_sequence(key):
            if self._is_empty(index):
                return False         # Key not found
            if self._is_tombstone(index):
                continue
            if self.keys[index] == key:
                self.keys[index] = self.TOMBSTONE
                self.values[index] = None       # Free the value
                self.size -= 1
                self.tombstone_count += 1
                return True
        return False

The insert algorithm above includes an important optimization: when inserting a new key, we can reuse a tombstone slot.
However, we must still probe to the first empty slot to verify the key doesn't already exist — we can't just stop at the tombstone. This ensures we don't create duplicates.
After confirming the key is new, we insert at the earliest tombstone we encountered, which may shorten future probe sequences for this key.
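As a quick usage sketch (not part of the original walkthrough): the integer keys below are chosen so that 3, 14, 25, and 36 all share home slot 3 under CPython's built-in integer hashing, mirroring the A/B/C example:

table = TombstoneHashTable(capacity=11)
table.insert(3, "A")     # home slot 3
table.insert(14, "B")    # collides, probes to slot 4
table.insert(25, "C")    # collides, probes to slot 5

table.delete(14)                 # slot 4 becomes a tombstone, not empty
print(table.search(25))          # "C" — probing continues past the tombstone
print(table.tombstone_count)     # 1

table.insert(36, "D")            # also hashes to 3; reuses the tombstone at slot 4
print(table.tombstone_count)     # 0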
Tombstones solve the correctness problem but introduce a new issue: tombstones accumulate over time, degrading performance even when the number of active elements is small.
The Problem:
Consider a hash table that undergoes many insert-delete cycles:
Initial: 100 elements, 0 tombstones
After deleting 50: 50 elements, 50 tombstones
After inserting 50 new: 100 elements, ~50 tombstones (some reused)
After deleting 50 different: 50 elements, ~100 tombstones
... repeat ...
Over time, tombstones accumulate. Even if the table contains only a few active elements, it may be littered with tombstones that slow down every operation.
Impact on Performance:
Tombstones affect performance in multiple ways:
1. Search traversal: Search must probe through tombstones, treating them as occupied slots. A table with 50 elements and 200 tombstones performs as if it had 250 elements for unsuccessful search.
2. Load factor calculation: The 'effective load factor' is (elements + tombstones) / capacity, not just elements / capacity. High tombstone counts effectively reduce capacity.
3. Insertion overhead: Insertion must scan past tombstones to verify key uniqueness, then backtrack to use the first tombstone.
4. Memory waste: Tombstones occupy slots but store no useful data.
| Scenario | Active Elements | Tombstones | Effective Load Factor | Avg Probes (Unsuccessful) |
|---|---|---|---|---|
| Optimal | 100 | 0 | 0.50 | 2.00 |
| Light tombstones | 100 | 50 | 0.75 | 4.00 |
| Heavy tombstones | 100 | 100 | 1.00 | ∞ (full) |
| Mostly tombstones | 20 | 180 | 1.00 | ∞ (full) |
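The probe counts in this table follow the standard uniform-hashing estimate for unsuccessful search — roughly 1/(1 − effective load factor) — and assume a capacity of 200. A quick sanity check of those numbers:

def expected_unsuccessful_probes(active: int, tombstones: int, capacity: int) -> float:
    # Uniform-hashing approximation: ~1 / (1 - effective load factor)
    alpha = (active + tombstones) / capacity
    return float("inf") if alpha >= 1.0 else 1.0 / (1.0 - alpha)

print(expected_unsuccessful_probes(100, 0, 200))    # 2.0
print(expected_unsuccessful_probes(100, 50, 200))   # 4.0
print(expected_unsuccessful_probes(100, 100, 200))  # inf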
A critical implementation detail: when checking if the table needs resizing, you must consider both active elements AND tombstones:
effective_load = (self.size + self.tombstone_count) / self.capacity
if effective_load > self.max_load_factor:
self._resize() # Resize also purges tombstones
If you only check self.size, a table with 10 elements and 190 tombstones won't resize, despite being effectively full.
Worst-Case Scenario:
Imagine an adversarial usage pattern: every entry is inserted and then deleted shortly afterward, so each short-lived key leaves a tombstone behind while new keys keep claiming fresh empty slots.
Without cleanup, tombstones accumulate toward 100%, eventually making every operation O(n). The table is effectively full despite being logically empty.
This workload pattern is common in caches with short-lived entries, session stores, and any system with high churn.
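A small churn loop, sketched against the TombstoneHashTable class above (which has no automatic resizing or cleanup), makes the accumulation visible; the key names and capacity are arbitrary:

table = TombstoneHashTable(capacity=101)

for i in range(1000):
    table.insert(f"session-{i}", i)   # short-lived entry
    table.delete(f"session-{i}")      # leaves a tombstone behind
    if i % 200 == 0:
        print(i, "active:", table.size, "tombstones:", table.tombstone_count)

The active count stays at zero while the tombstone count climbs toward the table's capacity, and each operation's probe length grows with it.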
To prevent tombstone accumulation from degrading performance, we need cleanup strategies. The primary approaches are rehashing and lazy cleanup.
Strategy 1: Full Rehash
The most thorough cleanup is to create a new table and reinsert all active elements, discarding tombstones:
def _cleanup_rehash(self):
"""
Create new table with only active elements.
Tombstones are discarded.
"""
old_keys = self.keys
old_values = self.values
# May resize or keep same capacity
new_capacity = self._calculate_new_capacity()
self.keys = [None] * new_capacity
self.values = [None] * new_capacity
self.capacity = new_capacity
self.size = 0
self.tombstone_count = 0 # Reset!
# Reinsert only active elements
for i in range(len(old_keys)):
if old_keys[i] is not None and old_keys[i] is not self.TOMBSTONE:
self.insert(old_keys[i], old_values[i])
Pros: Completely eliminates tombstones, restores optimal performance
Cons: O(n) operation, can cause latency spikes
Strategy 2: Resize Triggers Cleanup
Combine resize with cleanup — whenever we resize (up or down), we get a free tombstone purge:
def _check_load_factor(self):
"""
Check if resize is needed. Consider both elements and tombstones.
"""
effective_load = (self.size + self.tombstone_count) / self.capacity
actual_load = self.size / self.capacity
if effective_load > 0.7:
if actual_load > 0.5:
# Table genuinely full, grow it
self._resize(self.capacity * 2)
else:
# Many tombstones, just cleanup (rehash at same size)
self._resize(self.capacity)
This piggybacks cleanup on resize, avoiding additional operations.
Strategy 3: Periodic Cleanup Based on Tombstone Ratio
Trigger cleanup when tombstones exceed a threshold relative to active elements:
def _maybe_cleanup(self):
"""
Trigger cleanup if tombstones are excessive.
"""
if self.size == 0:
if self.tombstone_count > self.capacity // 4:
self._cleanup_rehash() # Table is mostly tombstones
else:
tombstone_ratio = self.tombstone_count / self.size
if tombstone_ratio > 2.0: # More than 2 tombstones per element
self._cleanup_rehash()
# Call after each delete
def delete(self, key) -> bool:
result = self._delete_internal(key)
if result:
self._maybe_cleanup()
return result
This bounds the tombstone-to-element ratio, preventing unbounded accumulation.
With proper cleanup triggering, the amortized cost per operation remains O(1):
• Each deletion creates one tombstone
• Cleanup is triggered when tombstones reach O(n)
• Cleanup takes O(n) time but 'pays for' all accumulated tombstones
• Average cost per delete: O(1)
This is analogous to dynamic array resizing — occasional expensive operations are amortized across many cheap operations.
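As a rough illustration (numbers invented for the example): with 1,000 active elements and the ratio-2 trigger above, cleanup fires only after roughly 2,000 tombstones have accumulated — at least 2,000 deletes since the last cleanup — and the rehash then touches on the order of a few thousand slots, which amortizes to only one or two extra slot visits per delete.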
Strategy 4: Incremental Cleanup
Spread cleanup work across multiple operations to avoid latency spikes:
def _incremental_cleanup(self, probes_per_op: int = 5):
"""
Clean up a few tombstones each operation.
Spread O(n) work over many operations.
"""
if self.tombstone_count == 0:
return
checked = 0
index = self._cleanup_cursor
while checked < probes_per_op:
if self._is_tombstone(index):
# Try to fill this tombstone from later in its chain
self._try_fill_tombstone(index)
index = (index + 1) % self.capacity
checked += 1
self._cleanup_cursor = index
This 'housekeeping' approach maintains performance without any single expensive operation.
Tombstones aren't the only way to handle deletion. Several alternative approaches avoid tombstones entirely at the cost of more complex delete operations.
Alternative 1: Backward Shift Deletion
After removing an element, shift subsequent elements backward to fill the gap — but only elements that belong in this probe chain:
def delete_backward_shift(self, key) -> bool:
"""
Delete key by shifting elements backward.
No tombstones created.
"""
# Find the key
delete_index = None
for index in self._probe_sequence(key):
if self._is_empty(index):
return False # Not found
if self.keys[index] == key:
delete_index = index
break
if delete_index is None:
return False
    # Shift elements backward to fill the gap left by the deleted key
    gap = delete_index
    scan = (delete_index + 1) % self.capacity
    while not self._is_empty(scan) and scan != delete_index:
        # The element at `scan` may move back only if the gap lies on its
        # probe path (between its home slot and its current slot)
        element_ideal = self._hash(self.keys[scan])
        if self._is_between(element_ideal, gap, scan):
            self.keys[gap] = self.keys[scan]
            self.values[gap] = self.values[scan]
            gap = scan              # the hole moves forward
        # Keep scanning: even if this element stays put, a later element
        # in the cluster may still need to move into the gap
        scan = (scan + 1) % self.capacity
    # Clear the final gap
    self.keys[gap] = None
    self.values[gap] = None
    self.size -= 1
    return True
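The _is_between check is where the correctness of backward shift lives. The helper isn't shown above; a minimal sketch, assuming the argument order used in the call (the element's home slot, the gap, the element's current slot):

def _is_between(self, ideal: int, gap: int, scan: int) -> bool:
    """
    True if `gap` lies cyclically within [ideal, scan): the gap sits on
    the probe path from the element's home slot to its current slot,
    so the element may legally move back into the gap.
    """
    if ideal <= scan:
        return ideal <= gap < scan
    # Probe path wraps around the end of the table
    return gap >= ideal or gap < scan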
Pros: No tombstones, no accumulation problem
Cons: Delete is O(cluster size), complex implementation, cache-unfriendly
Alternative 2: Robin Hood Deletion
Robin Hood hashing tracks the 'probe distance' for each element. During deletion, we can shift elements more intelligently:
def delete_robin_hood(self, key) -> bool:
"""
Robin Hood deletion: shift elements with positive probe distance.
"""
delete_index = self._find(key)
if delete_index is None:
return False
current = delete_index
while True:
next_idx = (current + 1) % self.capacity
if self._is_empty(next_idx):
break
# Elements in Robin Hood have stored probe distances
if self.probe_distances[next_idx] == 0:
break # Element is at ideal position, can't shift
# Move element backward and decrease its probe distance
self.keys[current] = self.keys[next_idx]
self.values[current] = self.values[next_idx]
self.probe_distances[current] = self.probe_distances[next_idx] - 1
current = next_idx
    # Clear final slot
    self.keys[current] = None
    self.values[current] = None
    self.probe_distances[current] = 0
    self.size -= 1
    return True
Robin Hood's probe distance tracking makes the shift logic simpler and more efficient.
Tombstone-free deletion is preferred when:
• Deletions are rare: Backward shift's O(cluster) cost is acceptable
• Read-heavy workloads: Tombstones slow down every search
• Long-running processes: Tombstone accumulation over days/weeks becomes severe
• Memory-constrained systems: Can't afford tombstone cleanup's temporary memory spike
Conversely, tombstones are better when:
• Deletions are frequent: O(1) delete matters
• Latency-sensitive: Can't afford worst-case O(n) backward shift
• Already using Robin Hood: Probe distance makes both approaches viable
Alternative 3: No Deletion Allowed
The simplest approach: don't support deletion at all. This sounds extreme, but it is a valid choice for workloads that only ever add entries, and many high-performance systems use insert-only hash tables when the workload permits.
Let's compare the deletion strategies across key dimensions:
| Characteristic | Tombstones | Backward Shift | Robin Hood Shift |
|---|---|---|---|
| Delete time (average) | O(1) | O(cluster size) | O(cluster size) |
| Delete time (worst) | O(n) search | O(n) | O(n) |
| Search overhead | Must skip tombstones | None added | None added |
| Accumulation problem | Yes | No | No |
| Implementation complexity | Simple | Moderate | Complex |
| Extra memory | None | None | Probe distance array |
| Cache behavior during delete | Excellent | Poor (many writes) | Moderate |
| Cleanup required | Yes (rehash) | No | No |
Decision Framework:
Most production hash tables use tombstones because deletion stays O(1), the implementation remains simple, and cleanup can piggyback on the resizing the table already performs.
Google's Swiss Tables and Facebook's F14 both use tombstones with sophisticated cleanup strategies.
We've thoroughly explored the deletion challenge unique to open addressing: naive deletion severs probe chains, tombstones restore correctness at the cost of gradual accumulation, cleanup strategies keep that accumulation bounded, and shift-based alternatives avoid tombstones entirely by doing more work at delete time.
What's next:
We've now covered the three core operations for open addressing: insertion, search, and deletion. In the final page of this module, we'll compare open addressing with chaining — examining when each collision resolution strategy excels, their performance tradeoffs, and guidance for choosing between them in practice.
You now understand why deletion is problematic in open-addressed hash tables, how tombstones solve the correctness issue, the accumulation problem they create, cleanup strategies to manage them, and alternative deletion approaches that avoid tombstones entirely. Next, we'll synthesize everything with a comprehensive comparison of open addressing versus chaining.