In the previous module, we explored chaining — the strategy of handling hash collisions by maintaining linked lists at each bucket. Chaining is elegant and widely used, but it carries an inherent cost: every collision requires additional memory allocation for linked list nodes, and pointer traversal introduces cache-unfriendly access patterns.
Open addressing takes a fundamentally different approach: instead of storing multiple elements in the same bucket, we find an alternative slot within the table itself. When a collision occurs, we probe — systematically searching for the next available slot. The entire hash table becomes a single contiguous block of memory, eliminating pointer overhead and maximizing cache efficiency.
This page explores the three canonical probing strategies: linear probing, quadratic probing, and double hashing. We'll dissect each strategy's mechanics, analyze their performance characteristics, and understand when each shines or falters.
By the end of this page, you will understand:
• The fundamental principle of open addressing and probing sequences
• How linear probing works and why it causes primary clustering
• How quadratic probing mitigates clustering while introducing new challenges
• How double hashing achieves near-ideal probe sequences
• The mathematical foundations underlying each probing strategy
• Performance characteristics and practical implementation considerations
Open addressing is a collision resolution strategy where all elements are stored directly within the hash table array. When a collision occurs at index h(k), we generate a sequence of alternative indices to probe until we find an empty slot (for insertion) or the target key (for lookup).
The defining characteristic of open addressing is that the load factor α can never exceed 1 — each slot holds at most one element, so a table of size m can contain at most m elements. This constraint, combined with the absence of auxiliary data structures, makes open addressing memory-efficient but sensitive to high load factors.
For a key k, an open addressing scheme generates a probe sequence: a permutation of indices {0, 1, 2, ..., m-1} that determines the order in which we examine slots. The ideal probe sequence touches every slot exactly once — a complete permutation ensures we can always find an empty slot if one exists.
We denote the probe sequence as h(k, 0), h(k, 1), h(k, 2), ..., h(k, m-1), where h(k, i) gives the i-th index to probe for key k.
The probe sequence function:
Every open addressing scheme can be expressed using a general formula:
h(k, i) = (h₁(k) + f(i)) mod m
Where:
• h₁(k) is the primary hash function, giving the initial probe position
• f(i) is the probe offset function, with f(0) = 0 so the first probe is always h₁(k)
• m is the table size
The three probing strategies differ precisely in their choice of f(i):
| Strategy | Probe Offset f(i) | Resulting Probe Sequence |
|---|---|---|
| Linear Probing | f(i) = i | h₁(k), h₁(k)+1, h₁(k)+2, ... |
| Quadratic Probing | f(i) = c₁i + c₂i² | h₁(k), h₁(k)+c₁+c₂, h₁(k)+2c₁+4c₂, ... |
| Double Hashing | f(i) = i × h₂(k) | h₁(k), h₁(k)+h₂(k), h₁(k)+2h₂(k), ... |
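Concretely, all three strategies can share one probing loop that takes f as a parameter. The sketch below is illustrative (it fixes h₂ to a constant, whereas real double hashing derives the step from the key):

```python
def probe_sequence(h1: int, m: int, f):
    """Yield (h1 + f(i)) mod m for i = 0, 1, ..., m-1."""
    for i in range(m):
        yield (h1 + f(i)) % m

step = 7                        # stand-in for h₂(k); key-dependent in practice
linear    = lambda i: i         # f(i) = i
quadratic = lambda i: i * i     # f(i) = i² (c₁ = 0, c₂ = 1)
double    = lambda i: i * step  # f(i) = i · h₂(k)

print(list(probe_sequence(5, 11, linear)))
# [5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4]      every slot exactly once
print(list(probe_sequence(5, 11, quadratic)))
# [5, 6, 9, 3, 10, 8, 8, 10, 3, 9, 6]     repeats: only 6 distinct slots
print(list(probe_sequence(5, 11, double)))
# [5, 1, 8, 4, 0, 7, 3, 10, 6, 2, 9]      a full permutation, since gcd(7, 11) = 1
```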
Why the choice of f(i) matters profoundly:
The probe sequence directly determines:
• Whether every slot can be reached (table coverage)
• How keys cluster around one another
• How cache-friendly the memory access pattern is
• How much arithmetic each probe costs
Let's examine each strategy in depth, understanding not just how it works but why its behavior emerges from its mathematical structure.
Linear probing is the most straightforward open addressing strategy. When a collision occurs at index h₁(k), we simply check the next slot, then the next, wrapping around to the beginning if necessary. The probe sequence is:
h(k, i) = (h₁(k) + i) mod m
For example, if h₁(k) = 5 and the table size m = 11, the probe sequence is: 5 → 6 → 7 → 8 → 9 → 10 → 0 → 1 → 2 → 3 → 4
This sequence visits every slot exactly once — a complete permutation — guaranteeing we find an empty slot if one exists.
```python
class LinearProbingHashTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()  # Tombstone marker

    def _hash(self, key) -> int:
        """Primary hash function."""
        return hash(key) % self.capacity

    def _probe_sequence(self, key):
        """
        Generator for linear probe sequence.
        Yields indices: h(k), h(k)+1, h(k)+2, ... (mod m)
        """
        start = self._hash(key)
        for i in range(self.capacity):
            yield (start + i) % self.capacity

    def insert(self, key, value) -> bool:
        """
        Insert key-value pair using linear probing.
        Returns True if successful, False if table is full.
        """
        if self.size >= self.capacity:
            return False  # Table is full
        for index in self._probe_sequence(key):
            # Empty slot or tombstone — insert here
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            # Key already exists — update value
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False  # Should never reach here if size < capacity

    def search(self, key):
        """
        Search for key using linear probing.
        Returns value if found, None otherwise.
        """
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None  # Empty slot — key not present
            if self.keys[index] == key:
                return self.values[index]
            # If DELETED, continue probing — key might be further
        return None  # Probed entire table without finding key
```

Advantages of Linear Probing:
Simplicity — The implementation is trivial; no secondary hash functions or complex offset calculations.
Excellent cache performance — Consecutive slots are accessed sequentially, maximizing CPU cache utilization. Modern processors prefetch contiguous memory, making linear probing surprisingly fast in practice despite its theoretical clustering issues.
Guaranteed full table coverage — The sequence visits every slot, ensuring correctness.
Memory efficiency — No pointers, no auxiliary structures; just an array.
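The class above stops at insert and search. Deletion is where the DELETED tombstone earns its keep; here is a minimal sketch of a delete method consistent with the class above (my addition, not shown in the original):

```python
# Additional method for LinearProbingHashTable
def delete(self, key) -> bool:
    """
    Remove key by overwriting it with a tombstone.
    Writing None instead would break search: keys stored later
    in the same probe sequence would become unreachable.
    """
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return False  # Reached an empty slot: key is not present
        if self.keys[index] == key:
            self.keys[index] = self.DELETED  # Tombstone, not None
            self.values[index] = None
            self.size -= 1
            return True
    return False  # Probed the entire table
```

One subtlety worth noting: insert above reuses the first tombstone it meets, which can duplicate a key that already exists later in the probe sequence; the next page returns to these invariants.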
Linear probing suffers from primary clustering: occupied slots tend to form contiguous blocks (clusters), and these clusters grow larger over time. When a new key hashes to any slot within a cluster, it joins the end of that cluster, making it even larger.
The probability that a new key joins a cluster of length L is proportional to L + 1: hashing to any of the L occupied slots, or to the empty slot immediately following them, lands the key at the cluster's end. Larger clusters attract more keys, creating a self-reinforcing growth pattern. This clustering effect dramatically increases the expected number of probes at high load factors.
Visualizing Primary Clustering:
Consider inserting keys that hash to positions 3, 4, 3, 5, 4 into a table of size 11:
Initial: [ _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ ]
Indices: 0 1 2 3 4 5 6 7 8 9 10
Insert A (h=3): [ _ | _ | _ | A | _ | _ | _ | _ | _ | _ | _ ]
Insert B (h=4): [ _ | _ | _ | A | B | _ | _ | _ | _ | _ | _ ]
Insert C (h=3): [ _ | _ | _ | A | B | C | _ | _ | _ | _ | _ ] ← C probes 3→4→5
Insert D (h=5): [ _ | _ | _ | A | B | C | D | _ | _ | _ | _ ] ← D probes 5→6
Insert E (h=4): [ _ | _ | _ | A | B | C | D | E | _ | _ | _ ] ← E probes 4→5→6→7
A cluster has formed at indices 3-7. Key E, hashing to position 4, required 4 probes to find an empty slot. As the table fills, these clusters grow and merge, severely degrading performance.
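We can reproduce this exact scenario with the LinearProbingHashTable sketch above, relying on the fact that CPython hashes small non-negative integers to themselves (keys chosen by me so that 14 mod 11 = 3 and 15 mod 11 = 4):

```python
table = LinearProbingHashTable(11)
for key, label in [(3, "A"), (4, "B"), (14, "C"), (5, "D"), (15, "E")]:
    table.insert(key, label)

# Slots 3..7 now hold A, B, C, D, E: the cluster from the diagram
print([table.values[i] for i in range(11)])
# [None, None, None, 'A', 'B', 'C', 'D', 'E', None, None, None]
```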
Theoretical Analysis of Linear Probing:
Donald Knuth's seminal analysis (1963) provides precise formulas for the expected number of probes. Assuming uniform hashing and a load factor α = n/m:
For successful search (finding an existing key):
E[probes] ≈ ½(1 + 1/(1-α))
For unsuccessful search (verifying key absence) and insertion:
E[probes] ≈ ½(1 + 1/(1-α)²)
These formulas reveal the critical sensitivity to load factor:
| Load Factor (α) | Successful Search | Unsuccessful Search/Insert |
|---|---|---|
| 0.50 | 1.50 probes | 2.50 probes |
| 0.75 | 2.50 probes | 8.50 probes |
| 0.90 | 5.50 probes | 50.50 probes |
| 0.95 | 10.50 probes | 200.50 probes |
| 0.99 | 50.50 probes | 5,000.50 probes |
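The rows above follow directly from Knuth's formulas; a short verification sketch (my addition):

```python
def linear_probing_costs(alpha: float) -> tuple[float, float]:
    """Knuth's expected probe counts for linear probing at load factor alpha."""
    successful   = 0.5 * (1 + 1 / (1 - alpha))
    unsuccessful = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    return successful, unsuccessful

for alpha in (0.50, 0.75, 0.90, 0.95, 0.99):
    s, u = linear_probing_costs(alpha)
    print(f"α = {alpha:.2f}: {s:6.2f} successful, {u:8.2f} unsuccessful")
# The α = 0.99 row prints 50.50 and 5000.50, matching the table
```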
At α = 0.90, insertion already requires an average of 50+ probes. At α = 0.95, it jumps to 200+ probes. This degradation, which grows as 1/(1-α)², means linear probing is impractical beyond α ≈ 0.70 without resizing. In practice, implementations resize when the load factor exceeds 0.5 to 0.7 to maintain acceptable performance.
Quadratic probing addresses linear probing's primary clustering by using a quadratic function for probe offsets. Instead of examining consecutive slots, we jump by increasing quadratic increments:
h(k, i) = (h₁(k) + c₁i + c₂i²) mod m
Common choices include:
• c₁ = 0, c₂ = 1: pure quadratic probing, h(k, i) = (h₁(k) + i²) mod m
• c₁ = c₂ = ½: the triangular-number variant, h(k, i) = (h₁(k) + i(i+1)/2) mod m, which pairs with power-of-two table sizes (see Solution 2 below)
The quadratic jumps spread probes across the table, reducing the tendency for clusters to form.
```python
class QuadraticProbingHashTable:
    def __init__(self, capacity: int):
        # Capacity should be prime or power of 2 for full coverage
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def _probe_sequence(self, key, c1: int = 0, c2: int = 1):
        """
        Generator for quadratic probe sequence.
        h(k, i) = (h(k) + c1*i + c2*i²) mod m
        Note: May not cover all slots unless capacity and
        coefficients satisfy specific conditions.
        """
        start = self._hash(key)
        for i in range(self.capacity):
            yield (start + c1 * i + c2 * i * i) % self.capacity

    def _probe_sequence_alternating(self, key):
        """
        Alternative: Alternating sign probe sequence.
        Probes at +1², -1², +2², -2², +3², -3², ...
        This ensures better coverage for any table size.
        """
        start = self._hash(key)
        yield start
        for i in range(1, (self.capacity + 1) // 2 + 1):
            yield (start + i * i) % self.capacity
            yield (start - i * i) % self.capacity

    def insert(self, key, value) -> bool:
        if self.size >= self.capacity // 2:  # More conservative threshold
            return False  # Trigger resize
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False  # Table full or sequence exhausted
```

Unlike linear probing, quadratic probing does not guarantee visiting all slots. For many (m, c₁, c₂) combinations, the probe sequence may cycle through only a subset of indices, potentially failing to find an empty slot even when one exists.
Solution 1: Choose m to be a prime number. When m is prime and c₂ = 1, the sequence h₁(k), h₁(k)+1, h₁(k)+4, h₁(k)+9, ... covers at least m/2 distinct positions.
Solution 2: Use m = 2^k (power of 2) with c₁ = c₂ = ½, giving h(k, i) = (h₁(k) + i(i+1)/2) mod m. This triangular number sequence covers all 2^k positions.
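The power-of-two claim in Solution 2 is easy to verify exhaustively for a small table (a quick check, my addition):

```python
m = 16  # any power of two
for start in range(m):
    visited = {(start + i * (i + 1) // 2) % m for i in range(m)}
    assert visited == set(range(m))  # triangular offsets reach every slot
print(f"i(i+1)/2 mod {m} covers all {m} slots from every starting point")
```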
Why Quadratic Probing Reduces Clustering:
The key insight is that quadratic probing generates different probe sequences for keys with different initial hash values, even when those initial collisions would create the same cluster under linear probing.
Consider two keys A and B where h(A) = 3 and h(B) = 4:
Linear Probing:
• A probes 3, 4, 5, 6, ...
• B probes 4, 5, 6, 7, ...
The two sequences overlap almost entirely, so A and B compete for the same run of slots and merge into a single cluster.
Quadratic Probing (c₁=0, c₂=1, m=11):
• A probes 3, 3+1=4, 3+4=7, 3+9=12 mod 11=1, ...
• B probes 4, 4+1=5, 4+4=8, 4+9=13 mod 11=2, ...
After the first offset the sequences diverge, so adjacent initial hashes no longer fight over the same run of slots.
Two keys that collide at the same initial position still follow the same probe sequence, but keys with merely adjacent initial positions now probe largely disjoint sequences. This prevents the pile-up effect where keys with adjacent initial hashes all compete for the same slots.
Secondary Clustering:
Quadratic probing eliminates primary clustering but introduces secondary clustering: all keys with the same h₁(k) follow the identical probe sequence. If keys A and B both hash to index 5, they compete for the same slots in the same order:
• A probes 5, 5+1, 5+4, 5+9, ... (mod m)
• B probes exactly the same slots in exactly the same order, paying one extra probe for every slot A already claimed
This is less severe than primary clustering because it only affects keys with identical hash values, not keys with merely adjacent hash values. The expected number of colliding keys is smaller, so secondary clustering has a milder performance impact.
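The identical-sequence claim is easy to check directly (a small illustration with values of my choosing):

```python
# With m = 11, keys 5 and 16 share h₁ = 5, so their quadratic
# probe sequences coincide slot for slot (secondary clustering).
m = 11
seq = lambda h1: [(h1 + i * i) % m for i in range(5)]
print(seq(5 % m))   # [5, 6, 9, 3, 10]
print(seq(16 % m))  # [5, 6, 9, 3, 10]  same slots, same order
```

The payoff of trading primary clustering for the milder secondary kind shows up in the expected probe counts: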
| Load Factor (α) | Linear: Unsuccessful Search | Quadratic: Unsuccessful Search |
|---|---|---|
| 0.50 | 2.50 probes | ~2.19 probes |
| 0.75 | 8.50 probes | ~4.64 probes |
| 0.90 | 50.50 probes | ~11.40 probes |
When implementing quadratic probing:
Choose table size carefully — Use a prime number or specific power-of-2 constructions that guarantee full coverage.
Keep load factor below 0.5 — For prime table sizes, only the first m/2 probe positions are guaranteed distinct, so keeping α < 0.5 guarantees an empty slot is found.
Consider the alternating-signs variant — Probing +1, -1, +4, -4, +9, -9, ... often provides better coverage across table sizes.
Double hashing represents the most sophisticated open addressing strategy. It uses two independent hash functions: one to determine the initial position and another to determine the step size:
h(k, i) = (h₁(k) + i × h₂(k)) mod m
Where:
• h₁(k) determines the starting position, exactly as before
• h₂(k) determines the step size between successive probes
• h₂(k) must never be zero and should be coprime with m (both points are addressed below)
The magic of double hashing lies in this key-dependent step size. Unlike linear and quadratic probing where the probe pattern is determined solely by the initial position, double hashing generates a unique probe sequence for each key.
```python
class DoubleHashingTable:
    def __init__(self, capacity: int):
        # Capacity should be prime for best results
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()

    def _hash1(self, key) -> int:
        """Primary hash function: determines starting position."""
        return hash(key) % self.capacity

    def _hash2(self, key) -> int:
        """
        Secondary hash function: determines step size.
        CRITICAL: h₂(k) must never return 0, otherwise we'd
        probe the same position forever!
        Common approach: h₂(k) = prime - (hash(key) % prime)
        where prime < capacity. This ensures h₂ ∈ [1, prime].
        """
        # Use a different prime smaller than capacity
        prime = self._largest_prime_below(self.capacity)
        return prime - (hash(key) % prime)

    def _largest_prime_below(self, n: int) -> int:
        """Find the largest prime strictly less than n."""
        def is_prime(x):
            if x < 2:
                return False
            if x == 2:
                return True
            if x % 2 == 0:
                return False
            for i in range(3, int(x**0.5) + 1, 2):
                if x % i == 0:
                    return False
            return True
        for candidate in range(n - 1, 1, -1):
            if is_prime(candidate):
                return candidate
        return 2

    def _probe_sequence(self, key):
        """
        Generator for double hashing probe sequence.
        h(k, i) = (h₁(k) + i × h₂(k)) mod m
        """
        start = self._hash1(key)
        step = self._hash2(key)
        for i in range(self.capacity):
            yield (start + i * step) % self.capacity

    def insert(self, key, value) -> bool:
        if self.size >= self.capacity:
            return False
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False

    def search(self, key):
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None
            if self.keys[index] == key:
                return self.values[index]
        return None
```

The secondary hash function must never return zero. If h₂(k) = 0, the probe sequence becomes:
h(k, 0) = h₁(k)
h(k, 1) = h₁(k) + 0 = h₁(k)
h(k, 2) = h₁(k) + 0 = h₁(k)
...
We'd infinitely probe the same position! The standard solution is to design h₂ to return values in the range [1, m-1].
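With the division-based construction used in the class above, a quick sanity check (my addition) confirms the zero case cannot occur:

```python
table = DoubleHashingTable(101)  # prime capacity; largest prime below is 97
steps = {table._hash2(k) for k in range(10_000)}
assert 0 not in steps   # never a zero step
assert min(steps) >= 1  # h₂ always lands in [1, 97]
print(f"observed step sizes: {min(steps)}..{max(steps)}")
```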
Ensuring Full Table Coverage:
For double hashing to visit all table slots, gcd(h₂(k), m) must equal 1 for all keys. If h₂(k) and m share a common factor, the probe sequence will only visit a subset of slots.
Example: If m = 10 and h₂(k) = 4, the sequence visits: h₁(k), h₁(k)+4, h₁(k)+8, h₁(k)+12=h₁(k)+2, h₁(k)+6, h₁(k)+10=h₁(k), ...
This only covers {h₁(k), h₁(k)+2, h₁(k)+4, h₁(k)+6, h₁(k)+8} — only 5 positions, not all 10!
Solution: Make m prime. When m is prime, any step size from 1 to m-1 is coprime with m, guaranteeing full coverage.
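A small demonstration (my addition) makes the gcd condition concrete: the number of distinct slots reached is exactly m / gcd(step, m):

```python
from math import gcd

def slots_visited(step: int, m: int) -> int:
    """Count distinct slots reached by (h1 + i*step) mod m over i = 0..m-1."""
    return len({(i * step) % m for i in range(m)})

for step, m in [(4, 10), (6, 10), (4, 11)]:
    print(f"step={step}, m={m}: visits {slots_visited(step, m)} of {m} slots "
          f"(m / gcd = {m // gcd(step, m)})")
# step=4, m=10 visits only 5 slots; with prime m=11, all 11 are reached
```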
Common h₂ Constructions:
Several well-tested secondary hash function designs exist:
1. Division-based (most common):
h₂(k) = q - (k mod q)
where q is a prime smaller than m. This produces values in [1, q].
2. Multiplication-based:
h₂(k) = 1 + ((A · hash(k)) mod (m-1))
where A is a constant. This produces values in [1, m-1].
3. Bit manipulation (for integer keys):
h₂(k) = 1 + (k ^ (k >> 5)) % (m - 1)
XOR and shift operations create independence from h₁.
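As code, the three constructions might look like this (a sketch under my own naming, assuming integer keys and a table size m; the multiplicative constant is one common choice, not mandated by the text):

```python
def h2_division(k: int, q: int = 97) -> int:
    """Division-based: q - (k mod q), values in [1, q]; q a prime below m."""
    return q - (k % q)

def h2_multiplication(k: int, m: int, a: int = 2654435761) -> int:
    """Multiplication-based: 1 + ((a*k) mod (m-1)), values in [1, m-1]."""
    return 1 + (a * k) % (m - 1)

def h2_bitmix(k: int, m: int) -> int:
    """Bit manipulation: XOR-shift mixing, values in [1, m-1]."""
    return 1 + (k ^ (k >> 5)) % (m - 1)
```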
The key property is that h₁ and h₂ should be independent — knowing h₁(k) should give no information about h₂(k).
If h₁ and h₂ are correlated, keys with similar h₁ values will have similar step sizes, recreating clustering effects. True independence ensures that even when two keys collide on h₁, they probe in completely different directions.
The theoretical analysis assumes uniform hashing: the probe sequence is a random permutation of {0, 1, ..., m-1}. Double hashing with good independent hash functions closely approximates this ideal.
Performance Analysis:
Double hashing virtually eliminates both primary and secondary clustering. Each key generates a unique probe sequence (in expectation), providing performance close to the theoretical optimum of uniform hashing.
Under uniform hashing assumptions:
Unsuccessful search/insertion: E[probes] = 1/(1-α)
Successful search: E[probes] = (1/α) × ln(1/(1-α))
| Load Factor (α) | Double: Unsuccessful | Linear: Unsuccessful | Improvement |
|---|---|---|---|
| 0.50 | 2.00 probes | 2.50 probes | 20% fewer probes |
| 0.75 | 4.00 probes | 8.50 probes | 53% fewer probes |
| 0.90 | 10.00 probes | 50.50 probes | 80% fewer probes |
| 0.95 | 20.00 probes | 200.50 probes | 90% fewer probes |
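The double-hashing column comes straight from the uniform-hashing formulas above (another short verification sketch, my addition):

```python
from math import log

for alpha in (0.50, 0.75, 0.90, 0.95):
    unsuccessful = 1 / (1 - alpha)
    successful = (1 / alpha) * log(1 / (1 - alpha))
    print(f"α = {alpha:.2f}: {unsuccessful:5.2f} unsuccessful, "
          f"{successful:5.2f} successful")
# α = 0.90 gives 10.00 unsuccessful probes, versus 50.50 for linear probing
```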
The Tradeoff:
Double hashing's superior performance comes at a cost:
• Two hash computations per probe instead of one
• Probes jump across the table, destroying cache locality
• Correct behavior depends on careful design: h₂ must never return zero, and m is typically constrained to be prime
In practice, the cache advantage of linear probing often outweighs double hashing's theoretical superiority for low load factors (α < 0.5). Double hashing shines when higher load factors are necessary.
Let's synthesize our understanding with a comprehensive comparison across all dimensions that matter in practice:
| Characteristic | Linear Probing | Quadratic Probing | Double Hashing |
|---|---|---|---|
| Probe formula | h₁(k) + i | h₁(k) + c₁i + c₂i² | h₁(k) + i·h₂(k) |
| Clustering type | Primary (severe) | Secondary (moderate) | None (ideal) |
| Cache locality | Excellent | Poor | Poor |
| Table coverage | Always 100% | Partial (needs care) | 100% if gcd=1 |
| Hash functions needed | 1 | 1 | 2 |
| Computation per probe | Addition | Multiplication, addition | Multiplication, addition |
| Best for load factor | α < 0.5 | α < 0.5 | α < 0.75 |
| Implementation complexity | Trivial | Moderate | Complex |
CPython's dict uses open addressing with a custom perturbation-based probe sequence that folds the hash's upper bits into a pseudorandom walk; the 3.6 redesign made the layout compact and insertion-ordered but kept open addressing. Modern implementations like Swiss Tables (Google's Abseil) use SIMD-accelerated linear probing with smart metadata to achieve both cache efficiency and collision resistance.
The theoretical superiority of double hashing rarely translates to practical superiority because cache effects dominate at typical load factors.
We've completed a deep exploration of the three canonical open addressing probing strategies. Let's consolidate the essential insights:
• All open addressing schemes share the form h(k, i) = (h₁(k) + f(i)) mod m; the choice of f(i) is what separates them
• Linear probing (f(i) = i) is simple and cache-friendly but suffers primary clustering, degrading sharply beyond α ≈ 0.7
• Quadratic probing (f(i) = c₁i + c₂i²) breaks up primary clusters but introduces secondary clustering and requires careful (m, c₁, c₂) choices for full coverage
• Double hashing (f(i) = i · h₂(k)) gives each key its own probe sequence, approaching uniform-hashing performance at the cost of a second hash function and lost cache locality
• In every scheme, performance is governed by the load factor; resize before α climbs too high
What's next:
With probing strategies understood, we'll examine how these strategies affect search operations in the next page. Searching in an open-addressed hash table is subtler than insertion — we must correctly handle tombstones, know when to stop probing, and maintain the invariants that make lookup correct.
You now understand the three fundamental probing strategies for open addressing: linear, quadratic, and double hashing. You can analyze their clustering behavior, performance characteristics, and implementation requirements. Next, we'll explore how searching operates within open-addressed tables and the algorithms that maintain correctness across the probe sequence.