In the previous module, we explored chaining — the strategy of handling hash collisions by maintaining linked lists at each bucket. Chaining is elegant and widely used, but it carries an inherent cost: every collision requires additional memory allocation for linked list nodes, and pointer traversal introduces cache-unfriendly access patterns.
Open addressing takes a fundamentally different approach: instead of storing multiple elements in the same bucket, we find an alternative slot within the table itself. When a collision occurs, we probe — systematically searching for the next available slot. The entire hash table becomes a single contiguous block of memory, eliminating pointer overhead and maximizing cache efficiency.
This page explores the three canonical probing strategies: linear probing, quadratic probing, and double hashing. We'll dissect each strategy's mechanics, analyze their performance characteristics, and understand when each shines or falters.
By the end of this page, you will understand:
• The fundamental principle of open addressing and probing sequences
• How linear probing works and why it causes primary clustering
• How quadratic probing mitigates clustering while introducing new challenges
• How double hashing achieves near-ideal probe sequences
• The mathematical foundations underlying each probing strategy
• Performance characteristics and practical implementation considerations
Open addressing is a collision resolution strategy where all elements are stored directly within the hash table array. When a collision occurs at index h(k), we generate a sequence of alternative indices to probe until we find an empty slot (for insertion) or the target key (for lookup).
The defining characteristic of open addressing is that the load factor α can never exceed 1 — each slot holds at most one element, so a table of size m can contain at most m elements. This constraint, combined with the absence of auxiliary data structures, makes open addressing memory-efficient but sensitive to high load factors.
For a key k, an open addressing scheme generates a probe sequence: a permutation of indices {0, 1, 2, ..., m-1} that determines the order in which we examine slots. The ideal probe sequence touches every slot exactly once — a complete permutation ensures we can always find an empty slot if one exists.
We denote the probe sequence as h(k, 0), h(k, 1), h(k, 2), ..., h(k, m-1), where h(k, i) gives the i-th index to probe for key k.
The probe sequence function:
Every open addressing scheme can be expressed using a general formula:
h(k, i) = (h₁(k) + f(i)) mod m
Where:
• h₁(k) is the primary hash function, giving the initial probe position
• f(i) is the probe offset function, with f(0) = 0 so the first probe is always h₁(k)
• m is the table size
The three probing strategies differ precisely in their choice of f(i):
| Strategy | Probe Offset f(i) | Resulting Probe Sequence |
|---|---|---|
| Linear Probing | f(i) = i | h₁(k), h₁(k)+1, h₁(k)+2, ... |
| Quadratic Probing | f(i) = c₁i + c₂i² | h₁(k), h₁(k)+c₁+c₂, h₁(k)+2c₁+4c₂, ... |
| Double Hashing | f(i) = i × h₂(k) | h₁(k), h₁(k)+h₂(k), h₁(k)+2h₂(k), ... |
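Concretely, all three strategies can share one probing loop that takes f as a parameter. The sketch below is illustrative (it fixes h₂ to a constant, whereas real double hashing derives the step from the key):

```python
def probe_sequence(h1: int, m: int, f):
    """Yield (h1 + f(i)) mod m for i = 0, 1, ..., m-1."""
    for i in range(m):
        yield (h1 + f(i)) % m

step = 7                        # stand-in for h₂(k); key-dependent in practice
linear    = lambda i: i         # f(i) = i
quadratic = lambda i: i * i     # f(i) = i² (c₁ = 0, c₂ = 1)
double    = lambda i: i * step  # f(i) = i · h₂(k)

print(list(probe_sequence(5, 11, linear)))
# [5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4]      every slot exactly once
print(list(probe_sequence(5, 11, quadratic)))
# [5, 6, 9, 3, 10, 8, 8, 10, 3, 9, 6]     repeats: only 6 distinct slots
print(list(probe_sequence(5, 11, double)))
# [5, 1, 8, 4, 0, 7, 3, 10, 6, 2, 9]      a full permutation, since gcd(7, 11) = 1
```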
Why the choice of f(i) matters profoundly:
The probe sequence directly determines:
• Whether every slot can be reached (table coverage)
• How keys cluster around one another
• How cache-friendly the memory access pattern is
• How much arithmetic each probe costs
Let's examine each strategy in depth, understanding not just how it works but why its behavior emerges from its mathematical structure.
Linear probing is the most straightforward open addressing strategy. When a collision occurs at index h₁(k), we simply check the next slot, then the next, wrapping around to the beginning if necessary. The probe sequence is:
h(k, i) = (h₁(k) + i) mod m
For example, if h₁(k) = 5 and the table size m = 11, the probe sequence is: 5 → 6 → 7 → 8 → 9 → 10 → 0 → 1 → 2 → 3 → 4
This sequence visits every slot exactly once — a complete permutation — guaranteeing we find an empty slot if one exists.
```python
class LinearProbingHashTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()  # Tombstone marker

    def _hash(self, key) -> int:
        """Primary hash function."""
        return hash(key) % self.capacity

    def _probe_sequence(self, key):
        """
        Generator for linear probe sequence.
        Yields indices: h(k), h(k)+1, h(k)+2, ... (mod m)
        """
        start = self._hash(key)
        for i in range(self.capacity):
            yield (start + i) % self.capacity

    def insert(self, key, value) -> bool:
        """
        Insert key-value pair using linear probing.
        Returns True if successful, False if table is full.
        """
        if self.size >= self.capacity:
            return False  # Table is full
        for index in self._probe_sequence(key):
            # Empty slot or tombstone — insert here
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            # Key already exists — update value
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False  # Should never reach here if size < capacity

    def search(self, key):
        """
        Search for key using linear probing.
        Returns value if found, None otherwise.
        """
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None  # Empty slot — key not present
            if self.keys[index] == key:
                return self.values[index]
            # If DELETED, continue probing — key might be further
        return None  # Probed entire table without finding key
```

Advantages of Linear Probing:
Simplicity — The implementation is trivial; no secondary hash functions or complex offset calculations.
Excellent cache performance — Consecutive slots are accessed sequentially, maximizing CPU cache utilization. Modern processors prefetch contiguous memory, making linear probing surprisingly fast in practice despite its theoretical clustering issues.
Guaranteed full table coverage — The sequence visits every slot, ensuring correctness.
Memory efficiency — No pointers, no auxiliary structures; just an array.
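The class above stops at insert and search. Deletion is where the DELETED tombstone earns its keep; here is a minimal sketch of a delete method consistent with the class above (my addition, not shown in the original):

```python
# Additional method for LinearProbingHashTable
def delete(self, key) -> bool:
    """
    Remove key by overwriting it with a tombstone.
    Writing None instead would break search: keys stored later
    in the same probe sequence would become unreachable.
    """
    for index in self._probe_sequence(key):
        if self.keys[index] is None:
            return False  # Reached an empty slot: key is not present
        if self.keys[index] == key:
            self.keys[index] = self.DELETED  # Tombstone, not None
            self.values[index] = None
            self.size -= 1
            return True
    return False  # Probed the entire table
```

One subtlety worth noting: insert above reuses the first tombstone it meets, which can duplicate a key that already exists later in the probe sequence; the next page returns to these invariants.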
Linear probing suffers from primary clustering: occupied slots tend to form contiguous blocks (clusters), and these clusters grow larger over time. When a new key hashes to any slot within a cluster, it joins the end of that cluster, making it even larger.
The probability that a new key joins a cluster of length L is proportional to L + 1: hashing to any of the L occupied slots, or to the empty slot immediately following them, lands the key at the cluster's end. Larger clusters attract more keys, creating a self-reinforcing growth pattern. This clustering effect dramatically increases the expected number of probes at high load factors.
Visualizing Primary Clustering:
Consider inserting keys that hash to positions 3, 4, 3, 5, 4 into a table of size 11:
Initial: [ _ | _ | _ | _ | _ | _ | _ | _ | _ | _ | _ ]
Indices: 0 1 2 3 4 5 6 7 8 9 10
Insert A (h=3): [ _ | _ | _ | A | _ | _ | _ | _ | _ | _ | _ ]
Insert B (h=4): [ _ | _ | _ | A | B | _ | _ | _ | _ | _ | _ ]
Insert C (h=3): [ _ | _ | _ | A | B | C | _ | _ | _ | _ | _ ] ← C probes 3→4→5
Insert D (h=5): [ _ | _ | _ | A | B | C | D | _ | _ | _ | _ ] ← D probes 5→6
Insert E (h=4): [ _ | _ | _ | A | B | C | D | E | _ | _ | _ ] ← E probes 4→5→6→7
A cluster has formed at indices 3-7. Key E, hashing to position 4, required 4 probes to find an empty slot. As the table fills, these clusters grow and merge, severely degrading performance.
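We can reproduce this exact scenario with the LinearProbingHashTable sketch above, relying on the fact that CPython hashes small non-negative integers to themselves (keys chosen by me so that 14 mod 11 = 3 and 15 mod 11 = 4):

```python
table = LinearProbingHashTable(11)
for key, label in [(3, "A"), (4, "B"), (14, "C"), (5, "D"), (15, "E")]:
    table.insert(key, label)

# Slots 3..7 now hold A, B, C, D, E: the cluster from the diagram
print([table.values[i] for i in range(11)])
# [None, None, None, 'A', 'B', 'C', 'D', 'E', None, None, None]
```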
Theoretical Analysis of Linear Probing:
Donald Knuth's seminal analysis (1963) provides precise formulas for the expected number of probes. Assuming uniform hashing and a load factor α = n/m:
For successful search (finding an existing key):
E[probes] ≈ ½(1 + 1/(1-α))
For unsuccessful search (verifying key absence) and insertion:
E[probes] ≈ ½(1 + 1/(1-α)²)
These formulas reveal the critical sensitivity to load factor:
| Load Factor (α) | Successful Search | Unsuccessful Search/Insert |
|---|---|---|
| 0.50 | 1.50 probes | 2.50 probes |
| 0.75 | 2.50 probes | 8.50 probes |
| 0.90 | 5.50 probes | 50.50 probes |
| 0.95 | 10.50 probes | 200.50 probes |
| 0.99 | 50.50 probes | 5,000.50 probes |
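The rows above follow directly from Knuth's formulas; a short verification sketch (my addition):

```python
def linear_probing_costs(alpha: float) -> tuple[float, float]:
    """Knuth's expected probe counts for linear probing at load factor alpha."""
    successful   = 0.5 * (1 + 1 / (1 - alpha))
    unsuccessful = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    return successful, unsuccessful

for alpha in (0.50, 0.75, 0.90, 0.95, 0.99):
    s, u = linear_probing_costs(alpha)
    print(f"α = {alpha:.2f}: {s:6.2f} successful, {u:8.2f} unsuccessful")
# The α = 0.99 row prints 50.50 and 5000.50, matching the table
```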
At α = 0.90, insertion already requires an average of 50+ probes. At α = 0.95, it jumps to 200+ probes. This degradation, which grows as 1/(1-α)², means linear probing is impractical beyond α ≈ 0.70 without resizing. In practice, implementations resize when the load factor exceeds 0.5 to 0.7 to maintain acceptable performance.
Quadratic probing addresses linear probing's primary clustering by using a quadratic function for probe offsets. Instead of examining consecutive slots, we jump by increasing quadratic increments:
h(k, i) = (h₁(k) + c₁i + c₂i²) mod m
Common choices include:
• c₁ = 0, c₂ = 1: pure quadratic probing, h(k, i) = (h₁(k) + i²) mod m
• c₁ = c₂ = ½: the triangular-number variant, h(k, i) = (h₁(k) + i(i+1)/2) mod m, which pairs with power-of-two table sizes (see Solution 2 below)
The quadratic jumps spread probes across the table, reducing the tendency for clusters to form.
```python
class QuadraticProbingHashTable:
    def __init__(self, capacity: int):
        # Capacity should be prime or power of 2 for full coverage
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()

    def _hash(self, key) -> int:
        return hash(key) % self.capacity

    def _probe_sequence(self, key, c1: int = 0, c2: int = 1):
        """
        Generator for quadratic probe sequence.
        h(k, i) = (h(k) + c1*i + c2*i²) mod m
        Note: May not cover all slots unless capacity and
        coefficients satisfy specific conditions.
        """
        start = self._hash(key)
        for i in range(self.capacity):
            yield (start + c1 * i + c2 * i * i) % self.capacity

    def _probe_sequence_alternating(self, key):
        """
        Alternative: Alternating sign probe sequence.
        Probes at +1², -1², +2², -2², +3², -3², ...
        This ensures better coverage for any table size.
        """
        start = self._hash(key)
        yield start
        for i in range(1, (self.capacity + 1) // 2 + 1):
            yield (start + i * i) % self.capacity
            yield (start - i * i) % self.capacity

    def insert(self, key, value) -> bool:
        if self.size >= self.capacity // 2:  # More conservative threshold
            return False  # Trigger resize
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False  # Table full or sequence exhausted
```

Unlike linear probing, quadratic probing does not guarantee visiting all slots. For many (m, c₁, c₂) combinations, the probe sequence may cycle through only a subset of indices, potentially failing to find an empty slot even when one exists.
Solution 1: Choose m to be a prime number. When m is prime and c₂ = 1, the sequence h₁(k), h₁(k)+1, h₁(k)+4, h₁(k)+9, ... covers at least m/2 distinct positions.
Solution 2: Use m = 2^k (power of 2) with c₁ = c₂ = ½, giving h(k, i) = (h₁(k) + i(i+1)/2) mod m. This triangular number sequence covers all 2^k positions.
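The power-of-two claim in Solution 2 is easy to verify exhaustively for a small table (a quick check, my addition):

```python
m = 16  # any power of two
for start in range(m):
    visited = {(start + i * (i + 1) // 2) % m for i in range(m)}
    assert visited == set(range(m))  # triangular offsets reach every slot
print(f"i(i+1)/2 mod {m} covers all {m} slots from every starting point")
```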
Why Quadratic Probing Reduces Clustering:
The key insight is that quadratic probing generates different probe sequences for keys with different initial hash values, even when those initial collisions would create the same cluster under linear probing.
Consider two keys A and B where h(A) = 3 and h(B) = 4:
Linear Probing:
• A probes 3, 4, 5, 6, ...
• B probes 4, 5, 6, 7, ...
The two sequences overlap almost entirely, so A and B compete for the same run of slots and merge into a single cluster.
Quadratic Probing (c₁=0, c₂=1, m=11):
• A probes 3, 3+1=4, 3+4=7, 3+9=12 mod 11=1, ...
• B probes 4, 4+1=5, 4+4=8, 4+9=13 mod 11=2, ...
After the first offset the sequences diverge, so adjacent initial hashes no longer fight over the same run of slots.
Two keys that collide at the same initial position still follow the same probe sequence, but keys with merely adjacent initial positions now probe largely disjoint sequences. This prevents the pile-up effect where keys with adjacent initial hashes all compete for the same slots.
Secondary Clustering:
Quadratic probing eliminates primary clustering but introduces secondary clustering: all keys with the same h₁(k) follow the identical probe sequence. If keys A and B both hash to index 5, they compete for the same slots in the same order:
• A probes 5, 5+1, 5+4, 5+9, ... (mod m)
• B probes exactly the same slots in exactly the same order, paying one extra probe for every slot A already claimed
This is less severe than primary clustering because it only affects keys with identical hash values, not keys with merely adjacent hash values. The expected number of colliding keys is smaller, so secondary clustering has a milder performance impact.
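The identical-sequence claim is easy to check directly (a small illustration with values of my choosing):

```python
# With m = 11, keys 5 and 16 share h₁ = 5, so their quadratic
# probe sequences coincide slot for slot (secondary clustering).
m = 11
seq = lambda h1: [(h1 + i * i) % m for i in range(5)]
print(seq(5 % m))   # [5, 6, 9, 3, 10]
print(seq(16 % m))  # [5, 6, 9, 3, 10]  same slots, same order
```

The payoff of trading primary clustering for the milder secondary kind shows up in the expected probe counts: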
| Load Factor (α) | Linear: Unsuccessful Search | Quadratic: Unsuccessful Search |
|---|---|---|
| 0.50 | 2.50 probes | ~2.19 probes |
| 0.75 | 8.50 probes | ~4.64 probes |
| 0.90 | 50.50 probes | ~11.40 probes |
When implementing quadratic probing:
Choose table size carefully — Use a prime number or specific power-of-2 constructions that guarantee full coverage.
Keep load factor below 0.5 — For prime table sizes, only the first m/2 probe positions are guaranteed distinct, so keeping α < 0.5 guarantees an empty slot is found.
Consider the alternating-signs variant — Probing +1, -1, +4, -4, +9, -9, ... often provides better coverage across table sizes.
Double hashing represents the most sophisticated open addressing strategy. It uses two independent hash functions: one to determine the initial position and another to determine the step size:
h(k, i) = (h₁(k) + i × h₂(k)) mod m
Where:
• h₁(k) determines the starting position, exactly as before
• h₂(k) determines the step size between successive probes
• h₂(k) must never be zero and should be coprime with m (both points are addressed below)
The magic of double hashing lies in this key-dependent step size. Unlike linear and quadratic probing where the probe pattern is determined solely by the initial position, double hashing generates a unique probe sequence for each key.
```python
class DoubleHashingTable:
    def __init__(self, capacity: int):
        # Capacity should be prime for best results
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.DELETED = object()

    def _hash1(self, key) -> int:
        """Primary hash function: determines starting position."""
        return hash(key) % self.capacity

    def _hash2(self, key) -> int:
        """
        Secondary hash function: determines step size.
        CRITICAL: h₂(k) must never return 0, otherwise we'd
        probe the same position forever!
        Common approach: h₂(k) = prime - (hash(key) % prime)
        where prime < capacity. This ensures h₂ ∈ [1, prime].
        """
        # Use a different prime smaller than capacity
        prime = self._largest_prime_below(self.capacity)
        return prime - (hash(key) % prime)

    def _largest_prime_below(self, n: int) -> int:
        """Find the largest prime strictly less than n."""
        def is_prime(x):
            if x < 2:
                return False
            if x == 2:
                return True
            if x % 2 == 0:
                return False
            for i in range(3, int(x**0.5) + 1, 2):
                if x % i == 0:
                    return False
            return True
        for candidate in range(n - 1, 1, -1):
            if is_prime(candidate):
                return candidate
        return 2

    def _probe_sequence(self, key):
        """
        Generator for double hashing probe sequence.
        h(k, i) = (h₁(k) + i × h₂(k)) mod m
        """
        start = self._hash1(key)
        step = self._hash2(key)
        for i in range(self.capacity):
            yield (start + i * step) % self.capacity

    def insert(self, key, value) -> bool:
        if self.size >= self.capacity:
            return False
        for index in self._probe_sequence(key):
            if self.keys[index] is None or self.keys[index] is self.DELETED:
                self.keys[index] = key
                self.values[index] = value
                self.size += 1
                return True
            if self.keys[index] == key:
                self.values[index] = value
                return True
        return False

    def search(self, key):
        for index in self._probe_sequence(key):
            if self.keys[index] is None:
                return None
            if self.keys[index] == key:
                return self.values[index]
        return None
```

The secondary hash function must never return zero. If h₂(k) = 0, the probe sequence becomes:
h(k, 0) = h₁(k)
h(k, 1) = h₁(k) + 0 = h₁(k)
h(k, 2) = h₁(k) + 0 = h₁(k)
...
We'd infinitely probe the same position! The standard solution is to design h₂ to return values in the range [1, m-1].
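With the division-based construction used in the class above, a quick sanity check (my addition) confirms the zero case cannot occur:

```python
table = DoubleHashingTable(101)  # prime capacity; largest prime below is 97
steps = {table._hash2(k) for k in range(10_000)}
assert 0 not in steps   # never a zero step
assert min(steps) >= 1  # h₂ always lands in [1, 97]
print(f"observed step sizes: {min(steps)}..{max(steps)}")
```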
Ensuring Full Table Coverage:
For double hashing to visit all table slots, gcd(h₂(k), m) must equal 1 for all keys. If h₂(k) and m share a common factor, the probe sequence will only visit a subset of slots.
Example: If m = 10 and h₂(k) = 4, the sequence visits: h₁(k), h₁(k)+4, h₁(k)+8, h₁(k)+12=h₁(k)+2, h₁(k)+6, h₁(k)+10=h₁(k), ...
This only covers {h₁(k), h₁(k)+2, h₁(k)+4, h₁(k)+6, h₁(k)+8} — only 5 positions, not all 10!
Solution: Make m prime. When m is prime, any step size from 1 to m-1 is coprime with m, guaranteeing full coverage.
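A small demonstration (my addition) makes the gcd condition concrete: the number of distinct slots reached is exactly m / gcd(step, m):

```python
from math import gcd

def slots_visited(step: int, m: int) -> int:
    """Count distinct slots reached by (h1 + i*step) mod m over i = 0..m-1."""
    return len({(i * step) % m for i in range(m)})

for step, m in [(4, 10), (6, 10), (4, 11)]:
    print(f"step={step}, m={m}: visits {slots_visited(step, m)} of {m} slots "
          f"(m / gcd = {m // gcd(step, m)})")
# step=4, m=10 visits only 5 slots; with prime m=11, all 11 are reached
```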
Common h₂ Constructions:
Several well-tested secondary hash function designs exist:
1. Division-based (most common):
h₂(k) = q - (k mod q)
where q is a prime smaller than m. This produces values in [1, q].
2. Multiplication-based:
h₂(k) = 1 + ((A · hash(k)) mod (m-1))
where A is a constant. This produces values in [1, m-1].
3. Bit manipulation (for integer keys):
h₂(k) = 1 + (k ^ (k >> 5)) % (m - 1)
XOR and shift operations create independence from h₁.
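As code, the three constructions might look like this (a sketch under my own naming, assuming integer keys and a table size m; the multiplicative constant is one common choice, not mandated by the text):

```python
def h2_division(k: int, q: int = 97) -> int:
    """Division-based: q - (k mod q), values in [1, q]; q a prime below m."""
    return q - (k % q)

def h2_multiplication(k: int, m: int, a: int = 2654435761) -> int:
    """Multiplication-based: 1 + ((a*k) mod (m-1)), values in [1, m-1]."""
    return 1 + (a * k) % (m - 1)

def h2_bitmix(k: int, m: int) -> int:
    """Bit manipulation: XOR-shift mixing, values in [1, m-1]."""
    return 1 + (k ^ (k >> 5)) % (m - 1)
```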
The key property is that h₁ and h₂ should be independent — knowing h₁(k) should give no information about h₂(k).
If h₁ and h₂ are correlated, keys with similar h₁ values will have similar step sizes, recreating clustering effects. True independence ensures that even when two keys collide on h₁, they probe in completely different directions.
The theoretical analysis assumes uniform hashing: the probe sequence is a random permutation of {0, 1, ..., m-1}. Double hashing with good independent hash functions closely approximates this ideal.
Performance Analysis:
Double hashing virtually eliminates both primary and secondary clustering. Each key generates a unique probe sequence (in expectation), providing performance close to the theoretical optimum of uniform hashing.
Under uniform hashing assumptions:
Unsuccessful search/insertion: E[probes] = 1/(1-α)
Successful search: E[probes] = (1/α) × ln(1/(1-α))
| Load Factor (α) | Double: Unsuccessful | Linear: Unsuccessful | Improvement |
|---|---|---|---|
| 0.50 | 2.00 probes | 2.50 probes | 20% fewer probes |
| 0.75 | 4.00 probes | 8.50 probes | 53% fewer probes |
| 0.90 | 10.00 probes | 50.50 probes | 80% fewer probes |
| 0.95 | 20.00 probes | 200.50 probes | 90% fewer probes |
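The double-hashing column comes straight from the uniform-hashing formulas above (another short verification sketch, my addition):

```python
from math import log

for alpha in (0.50, 0.75, 0.90, 0.95):
    unsuccessful = 1 / (1 - alpha)
    successful = (1 / alpha) * log(1 / (1 - alpha))
    print(f"α = {alpha:.2f}: {unsuccessful:5.2f} unsuccessful, "
          f"{successful:5.2f} successful")
# α = 0.90 gives 10.00 unsuccessful probes, versus 50.50 for linear probing
```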
The Tradeoff:
Double hashing's superior performance comes at a cost:
• Two hash computations per probe instead of one
• Probes jump across the table, destroying cache locality
• Correct behavior depends on careful design: h₂ must never return zero, and m is typically constrained to be prime
In practice, the cache advantage of linear probing often outweighs double hashing's theoretical superiority for low load factors (α < 0.5). Double hashing shines when higher load factors are necessary.
Let's synthesize our understanding with a comprehensive comparison across all dimensions that matter in practice:
| Characteristic | Linear Probing | Quadratic Probing | Double Hashing |
|---|---|---|---|
| Probe formula | h₁(k) + i | h₁(k) + c₁i + c₂i² | h₁(k) + i·h₂(k) |
| Clustering type | Primary (severe) | Secondary (moderate) | None (ideal) |
| Cache locality | Excellent | Poor | Poor |
| Table coverage | Always 100% | Partial (needs care) | 100% if gcd=1 |
| Hash functions needed | 1 | 1 | 2 |
| Computation per probe | Addition | Multiplication, addition | Multiplication, addition |
| Best for load factor | α < 0.5 | α < 0.5 | α < 0.75 |
| Implementation complexity | Trivial | Moderate | Complex |
CPython's dict uses open addressing with a custom perturbation-based probe sequence that folds the hash's upper bits into a pseudorandom walk; the 3.6 redesign made the layout compact and insertion-ordered but kept open addressing. Modern implementations like Swiss Tables (Google's Abseil) use SIMD-accelerated linear probing with smart metadata to achieve both cache efficiency and collision resistance.
The theoretical superiority of double hashing rarely translates to practical superiority because cache effects dominate at typical load factors.
We've completed a deep exploration of the three canonical open addressing probing strategies. Let's consolidate the essential insights:
• All open addressing schemes share the form h(k, i) = (h₁(k) + f(i)) mod m; the choice of f(i) is what separates them
• Linear probing (f(i) = i) is simple and cache-friendly but suffers primary clustering, degrading sharply beyond α ≈ 0.7
• Quadratic probing (f(i) = c₁i + c₂i²) breaks up primary clusters but introduces secondary clustering and requires careful (m, c₁, c₂) choices for full coverage
• Double hashing (f(i) = i · h₂(k)) gives each key its own probe sequence, approaching uniform-hashing performance at the cost of a second hash function and lost cache locality
• In every scheme, performance is governed by the load factor; resize before α climbs too high
What's next:
With probing strategies understood, we'll examine how these strategies affect search operations in the next page. Searching in an open-addressed hash table is subtler than insertion — we must correctly handle tombstones, know when to stop probing, and maintain the invariants that make lookup correct.
You now understand the three fundamental probing strategies for open addressing: linear, quadratic, and double hashing. You can analyze their clustering behavior, performance characteristics, and implementation requirements. Next, we'll explore how searching operates within open-addressed tables and the algorithms that maintain correctness across the probe sequence.