If insertion is how hash tables receive data, lookup is how they earn their reputation. The ability to retrieve any element in constant time, no matter how large the collection grows, is the hash table's defining superpower. This O(1) average lookup is why hash tables underpin caches, symbol tables, database indexes, and countless other performance-critical systems.
Yet, like insertion, this O(1) is a probabilistic average, not a universal guarantee. Understanding when lookup truly achieves constant time, when it degrades to linear, and what factors govern this behavior is essential knowledge for any engineer building systems at scale.
This page provides a comprehensive examination of the lookup operation, from its mathematical foundations to its real-world performance characteristics. By the end, you'll understand exactly what makes hash table lookup special—and exactly what can break it.
By the end of this page, you will understand: the step-by-step mechanics of hash table lookup in both chaining and open addressing; the mathematical basis for O(1) average-case performance; how worst-case O(n) lookup manifests; the difference between successful and unsuccessful lookups; and how hash table lookup performance compares to alternative data structures.
The lookup operation answers a fundamental question: "Given a key, does this hash table contain it, and if so, what is its associated value?" The process mirrors insertion but terminates differently—instead of storing, we retrieve or report absence.
The Fundamental Lookup Algorithm:
1. Compute the hash of the key (O(1) for fixed-size keys).
2. Map the hash to a bucket index, typically hash % capacity (O(1)).
3. Access that bucket via array indexing (O(1)).
4. Search within the bucket for the key (O(k), where k is the number of elements in the bucket).
The critical observation: steps 1-3 are O(1), but step 4 depends entirely on the bucket's contents. An empty bucket or immediate match gives instant results. A long chain forces linear traversal.
```python
from typing import Optional, Tuple, TypeVar

K = TypeVar('K')
V = TypeVar('V')


class HashTableLookup:
    """
    Illustrates hash table lookup with detailed step-by-step analysis.
    Each step's complexity contribution is annotated.
    """

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        # Metrics for analysis
        self.comparisons = 0
        self.lookups_performed = 0

    def _hash(self, key: K) -> int:
        """
        Step 1: Compute hash value.
        Time: O(1) for primitive types, O(k) for k-length strings.
        Treated as O(1) for bounded-size keys.
        """
        return hash(key)

    def _get_bucket_index(self, hash_value: int) -> int:
        """
        Step 2: Reduce hash to bucket index.
        Time: O(1) - single modulo operation.
        """
        return hash_value % self.capacity

    def lookup(self, key: K) -> Tuple[bool, Optional[V], int]:
        """
        Complete lookup operation with detailed analysis.

        Returns:
            - found: Whether the key exists
            - value: The associated value (None if not found)
            - comparisons: Number of key comparisons performed
        """
        self.lookups_performed += 1
        comparisons_this_lookup = 0

        # Step 1: Compute hash - O(1)
        hash_value = self._hash(key)

        # Step 2: Compute bucket index - O(1)
        bucket_index = self._get_bucket_index(hash_value)

        # Step 3: Access bucket - O(1) array indexing
        bucket = self.buckets[bucket_index]

        # Step 4: Search within bucket - O(k) where k is bucket length
        # THIS IS WHERE COMPLEXITY VARIANCE OCCURS
        for existing_key, value in bucket:
            comparisons_this_lookup += 1
            self.comparisons += 1
            if existing_key == key:
                # FOUND - successful lookup
                return (True, value, comparisons_this_lookup)

        # NOT FOUND - unsuccessful lookup
        # We traversed entire bucket without match
        return (False, None, comparisons_this_lookup)

    def get(self, key: K, default: V = None) -> Optional[V]:
        """Pythonic interface: return value or default."""
        found, value, _ = self.lookup(key)
        return value if found else default

    def __contains__(self, key: K) -> bool:
        """Enable 'key in table' syntax."""
        found, _, _ = self.lookup(key)
        return found

    def insert(self, key: K, value: V):
        """Helper method to populate the table."""
        bucket_index = hash(key) % self.capacity
        bucket = self.buckets[bucket_index]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get_lookup_statistics(self) -> dict:
        """Return lookup performance statistics."""
        return {
            "total_lookups": self.lookups_performed,
            "total_comparisons": self.comparisons,
            "avg_comparisons": (self.comparisons / self.lookups_performed
                                if self.lookups_performed > 0 else 0),
            "elements": self.size,
            "buckets": self.capacity,
            "load_factor": self.size / self.capacity,
        }


# Demonstration
def demonstrate_lookup_mechanics():
    """Show lookup behavior under different conditions."""
    print("=== Hash Table Lookup Mechanics Demo ===\n")
    table = HashTableLookup(capacity=10)

    # Insert some elements
    test_data = [
        ("apple", "red"),
        ("banana", "yellow"),
        ("cherry", "red"),
        ("date", "brown"),
        ("elderberry", "purple"),
    ]
    for key, value in test_data:
        table.insert(key, value)

    # Perform lookups
    print("Performing lookups:")

    # Successful lookups
    for key in ["apple", "cherry", "elderberry"]:
        found, value, comparisons = table.lookup(key)
        print(f"  lookup('{key}'): found={found}, "
              f"value='{value}', comparisons={comparisons}")

    # Unsuccessful lookups
    for key in ["grape", "mango", "kiwi"]:
        found, value, comparisons = table.lookup(key)
        print(f"  lookup('{key}'): found={found}, comparisons={comparisons}")

    print()
    print("Statistics:", table.get_lookup_statistics())


demonstrate_lookup_mechanics()
```

The entire complexity story of hash table lookup hinges on step 4—searching within the bucket. Everything else is guaranteed O(1). When we say lookup is O(1) on average, we're really saying: 'on average, buckets contain O(1) elements.' When we say worst-case is O(n), we mean: 'in the worst case, one bucket contains everything.'
Lookup complexity analysis distinguishes between two fundamentally different scenarios: searching for a key that exists versus one that doesn't. This isn't merely academic—the distinction has real performance implications.
Successful Lookup (Key Exists):
When seeking a key that's present, we traverse the collision chain until we find a match. On average, the key is neither first nor last—it's somewhere in the middle of its chain.
Expected comparisons (successful) ≈ 1 + α/2
Where α = n/m is the load factor. We check about half the chain before finding our element.
Unsuccessful Lookup (Key Absent):
When seeking a key that doesn't exist, we must examine every element in the bucket before concluding it's absent. There's no early termination—absence can only be proven by exhaustive search.
Expected comparisons (unsuccessful) ≈ α
We traverse the entire average chain length. For high load factors, unsuccessful lookups are measurably slower than successful ones.
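These two approximations follow from a short expected-value argument; here is a sketch under SUHA, assuming chaining with new keys inserted at the head of each chain:

```latex
% Successful lookup: the i-th inserted key x shares its bucket, in expectation,
% with (n - i)/m keys inserted after it; with head insertion these precede x in
% the chain and are each examined before x is found. Averaging over all n keys:
\[
  \mathbb{E}[\text{comparisons} \mid \text{hit}]
    \;=\; 1 + \frac{1}{n}\sum_{i=1}^{n}\frac{n-i}{m}
    \;=\; 1 + \frac{n-1}{2m}
    \;\approx\; 1 + \frac{\alpha}{2}.
\]
% Unsuccessful lookup: the probed bucket is uniformly random, and its expected
% chain length is n/m, all of which must be examined before reporting absence:
\[
  \mathbb{E}[\text{comparisons} \mid \text{miss}] \;=\; \frac{n}{m} \;=\; \alpha.
\]
```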
| Load Factor (α) | Successful Lookup | Unsuccessful Lookup | Ratio (Fail/Success) |
|---|---|---|---|
| 0.25 | 1.125 comparisons | 0.25 comparisons | 0.22x (failures faster) |
| 0.50 | 1.25 comparisons | 0.50 comparisons | 0.40x (failures faster) |
| 0.75 | 1.375 comparisons | 0.75 comparisons | 0.55x (failures faster) |
| 1.00 | 1.50 comparisons | 1.00 comparison | 0.67x (approaching parity) |
| 2.00 | 2.00 comparisons | 2.00 comparisons | 1.00x (equal cost) |
| 5.00 | 3.50 comparisons | 5.00 comparisons | 1.43x (failures slower) |
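As a quick sanity check on the table, the theoretical chaining values can be regenerated in a few lines (a minimal sketch using the formulas above, not a benchmark):

```python
# Recompute the theoretical chaining costs shown in the table:
# successful ≈ 1 + α/2, unsuccessful ≈ α.
for alpha in [0.25, 0.50, 0.75, 1.00, 2.00, 5.00]:
    successful = 1 + alpha / 2
    unsuccessful = alpha
    print(f"α={alpha:4.2f}  successful={successful:.3f}  "
          f"unsuccessful={unsuccessful:.2f}  ratio={unsuccessful / successful:.2f}")
```

The ratio column reaches parity exactly where α = 1 + α/2, that is at α = 2, matching the table.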
The Surprising Inversion:
At low load factors, unsuccessful lookups are actually faster than successful ones on average. This counterintuitive result occurs because most buckets are empty when α is small: an unsuccessful lookup that hashes to an empty bucket finishes after zero comparisons, while a successful lookup always pays at least one comparison to confirm the match.
As load factor increases, this advantage diminishes, and above α = 1, unsuccessful lookups become slower because they traverse longer chains before concluding absence.
```python
import random
import time
from typing import Set


def compare_lookup_types():
    """
    Empirically compare successful vs unsuccessful lookup performance.
    Demonstrates the theoretical distinction in practice.
    """
    class SimpleHashTable:
        def __init__(self, capacity):
            self.capacity = capacity
            self.buckets = [[] for _ in range(capacity)]
            self.comparisons = 0

        def insert(self, key, value):
            idx = hash(key) % self.capacity
            for i, (k, _) in enumerate(self.buckets[idx]):
                if k == key:
                    self.buckets[idx][i] = (key, value)
                    return
            self.buckets[idx].append((key, value))

        def lookup(self, key):
            idx = hash(key) % self.capacity
            bucket = self.buckets[idx]
            comps = 0
            for k, v in bucket:
                comps += 1
                if k == key:
                    self.comparisons += comps
                    return (True, v, comps)
            self.comparisons += comps
            return (False, None, comps)

    # Test parameters
    n_elements = 10000
    n_lookups = 5000

    for load_factor in [0.25, 0.50, 0.75, 1.00, 2.00]:
        capacity = int(n_elements / load_factor)
        table = SimpleHashTable(capacity)

        # Insert known keys
        keys_in_table: Set[int] = set()
        for _ in range(n_elements):
            key = random.randint(0, 10**9)
            keys_in_table.add(key)
            table.insert(key, f"value_{key}")

        # Generate lookup targets: half exist, half don't
        existing_keys = random.sample(list(keys_in_table),
                                      min(n_lookups // 2, len(keys_in_table)))
        missing_keys = []
        while len(missing_keys) < n_lookups // 2:
            candidate = random.randint(0, 10**9)
            if candidate not in keys_in_table:
                missing_keys.append(candidate)

        # Measure successful lookups
        table.comparisons = 0
        success_comparisons = []
        for key in existing_keys:
            _, _, comps = table.lookup(key)
            success_comparisons.append(comps)
        avg_success = (sum(success_comparisons) / len(success_comparisons)
                       if success_comparisons else 0)

        # Measure unsuccessful lookups
        table.comparisons = 0
        fail_comparisons = []
        for key in missing_keys:
            _, _, comps = table.lookup(key)
            fail_comparisons.append(comps)
        avg_fail = (sum(fail_comparisons) / len(fail_comparisons)
                    if fail_comparisons else 0)

        print(f"Load Factor {load_factor:.2f}:")
        print(f"  Successful lookup:   {avg_success:.2f} comparisons "
              f"(expected: {1 + load_factor/2:.2f})")
        print(f"  Unsuccessful lookup: {avg_fail:.2f} comparisons "
              f"(expected: {load_factor:.2f})")
        print(f"  Ratio (fail/success): "
              f"{avg_fail/avg_success if avg_success > 0 else 0:.2f}")
        print()


compare_lookup_types()
```

If your workload involves many 'check if key exists' operations where you expect many negatives (e.g., filtering, deduplication), consider Bloom filters as a preprocessing step. Bloom filters can answer 'definitely not present' in O(1) with zero false negatives, allowing you to skip hash table lookups entirely for non-existent keys.
The O(1) average-case claim for hash table lookup is grounded in probability theory and relies on specific assumptions about hash function behavior. Let's examine the mathematics rigorously.
Setup: a table of m buckets holding n keys, with load factor α = n/m, collisions resolved by chaining, and a hash function assumed to satisfy simple uniform hashing (SUHA).
SUHA Statement: Each key is equally likely to hash to any of the m buckets, independently of other keys.
Chain Length Distribution:
Under SUHA, the number of keys in each bucket follows a Poisson distribution with parameter λ = α for large n and m:
P(bucket has k elements) ≈ (e^(-α) × α^k) / k!
This tells us: a fraction e^(-α) of buckets is empty; the expected chain length is exactly α; and the probability of a chain of length k decays factorially in k, so very long chains are vanishingly rare under a good hash function.
The expected lookup time is dominated by the expected chain length, which is O(α). When α is kept constant (through resizing), this is O(1).
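To make this concrete, here is a worked example at α = 0.75, the load factor used in the simulation below:

```latex
\[
  P(0) = e^{-0.75} \approx 0.472,\qquad
  P(1) = 0.75\,e^{-0.75} \approx 0.354,\qquad
  P(2) = \tfrac{0.75^{2}}{2!}\,e^{-0.75} \approx 0.133,
\]
\[
  P(3) \approx 0.033,\qquad
  P(k \ge 4) = 1 - \sum_{k=0}^{3} P(k) \approx 0.007.
\]
```

So roughly 47% of buckets are empty, about 83% hold at most one element, and chains of length four or more occur in well under 1% of buckets.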
```python
import math
from collections import Counter
import random


def analyze_chain_distribution():
    """
    Demonstrate that chain lengths follow Poisson distribution.
    This validates the mathematical foundation of O(1) analysis.
    """
    def poisson_pmf(k: int, lambda_: float) -> float:
        """Poisson probability mass function."""
        return (math.exp(-lambda_) * (lambda_ ** k)) / math.factorial(k)

    # Parameters
    m = 10000  # Number of buckets
    n = 7500   # Number of elements (load factor = 0.75)
    alpha = n / m

    # Simulate hash table insertion
    buckets = [0] * m  # Count elements per bucket
    for _ in range(n):
        # Simulate uniform hash by using random bucket selection
        bucket_idx = random.randint(0, m - 1)
        buckets[bucket_idx] += 1

    # Count bucket sizes
    size_distribution = Counter(buckets)
    max_size = max(size_distribution.keys())

    print("Hash Table Chain Length Distribution")
    print(f"Buckets: {m}, Elements: {n}, Load Factor: {alpha:.2f}")
    print("=" * 60)
    print(f"{'Chain Length':<15} {'Observed':<15} {'Poisson Expected':<15} {'Deviation'}")
    print("-" * 60)

    for k in range(min(max_size + 1, 10)):
        observed = size_distribution.get(k, 0)
        expected = m * poisson_pmf(k, alpha)
        deviation = (observed - expected) / expected if expected > 0 else 0
        print(f"{k:<15} {observed:<15} {expected:<15.1f} {deviation:+.2%}")

    print()

    # Calculate expected lookup cost
    total_comparisons_for_all = sum(k * count for k, count in size_distribution.items())
    avg_chain_length = total_comparisons_for_all / m if m > 0 else 0

    print(f"Average chain length: {avg_chain_length:.3f} (expected: {alpha:.3f})")
    print(f"Max chain length: {max_size}")
    print(f"Empty buckets: {size_distribution.get(0, 0)} "
          f"({size_distribution.get(0, 0)/m:.1%})")
    print(f"Expected empty: {m * math.exp(-alpha):.0f} ({math.exp(-alpha):.1%})")

    # Prove O(1) by showing average is constant regardless of n
    print()
    print("Verification: Average lookup cost remains constant as n grows")
    print("-" * 60)
    for n_test in [1000, 5000, 10000, 50000, 100000]:
        m_test = int(n_test / 0.75)  # Maintain load factor 0.75
        buckets_test = [0] * m_test
        for _ in range(n_test):
            buckets_test[random.randint(0, m_test - 1)] += 1
        total_comps = sum(k * count for k, count in Counter(buckets_test).items())
        avg_comps = total_comps / m_test
        # For successful lookup: 1 + α/2
        expected_successful = 1 + 0.75 / 2
        print(f"n = {n_test:>6}: avg chain = {avg_comps:.3f}, "
              f"expected success = {expected_successful:.3f}")


analyze_chain_distribution()
```

The key insight is that when we keep α bounded (e.g., α ≤ 0.75), the expected chain length is bounded by a constant (0.75). Thus, expected lookup time is bounded by a constant—giving us O(1). Without load factor management, α grows linearly with n, and lookup becomes O(n/m), which is O(n) if m is fixed.
The O(n) worst case for lookup mirrors insertion: all n elements occupy a single bucket, transforming the hash table into a linked list. But the consequences for lookup are arguably more severe than for insertion.
Why Lookup Worst-Case Is More Damaging:
Lookups are typically more frequent — Most applications read far more than they write. A 100:1 read-to-write ratio is common. Slow lookups compound.
Users wait for lookups — Insertions often happen asynchronously or in batches. Lookups often block user-facing operations. A slow lookup means visible latency.
Unsuccessful lookups are common — Many applications check for existence (caching, deduplication). Worst-case unsuccessful lookup traverses the entire chain every time.
Repeated hits on hot keys — A malicious or pathological access pattern can repeatedly query the degenerate bucket, causing sustained degradation rather than occasional spikes.
```python
import time
import random


class DegenerateHashTable:
    """
    Demonstrates worst-case lookup behavior when all elements collide.
    """

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def insert_with_hash(self, key, value, hash_value: int):
        """Insert with explicit hash value (allows forcing collisions)."""
        bucket_idx = hash_value % self.capacity
        bucket = self.buckets[bucket_idx]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def lookup_timed(self, key, hash_value: int) -> tuple:
        """Lookup with timing and comparison counting."""
        bucket_idx = hash_value % self.capacity
        bucket = self.buckets[bucket_idx]
        start = time.perf_counter_ns()
        comparisons = 0
        for k, v in bucket:
            comparisons += 1
            if k == key:
                end = time.perf_counter_ns()
                return (True, v, comparisons, end - start)
        end = time.perf_counter_ns()
        return (False, None, comparisons, end - start)


def demonstrate_lookup_degradation():
    """Show dramatic performance difference under collision attack."""
    n_elements = 10000
    n_lookups = 1000

    print("=" * 70)
    print("DEMONSTRATION: Lookup Performance Normal vs Degenerate")
    print("=" * 70)
    print()

    # SCENARIO 1: Normal distribution (good hash function)
    print("SCENARIO 1: Normal Hash Distribution")
    print("-" * 50)
    normal_table = DegenerateHashTable(capacity=1000)

    # Insert with uniformly distributed hashes
    keys = [f"key_{i}" for i in range(n_elements)]
    for key in keys:
        normal_table.insert_with_hash(key, "value", hash(key))

    # Measure lookup times
    sample_keys = random.sample(keys, n_lookups)
    normal_times = []
    normal_comps = []
    for key in sample_keys:
        _, _, comps, ns = normal_table.lookup_timed(key, hash(key))
        normal_times.append(ns)
        normal_comps.append(comps)

    print(f"  Elements: {normal_table.size}")
    print(f"  Avg comparisons: {sum(normal_comps)/len(normal_comps):.2f}")
    print(f"  Max comparisons: {max(normal_comps)}")
    print(f"  Avg lookup time: {sum(normal_times)/len(normal_times):.0f} ns")
    print(f"  Max lookup time: {max(normal_times):.0f} ns")
    print()

    # SCENARIO 2: All collisions (adversarial/pathological)
    print("SCENARIO 2: Complete Collision (All Same Bucket)")
    print("-" * 50)
    degenerate_table = DegenerateHashTable(capacity=1000)

    # Insert ALL with the same hash value -> same bucket
    collision_hash = 42
    for key in keys:
        degenerate_table.insert_with_hash(key, "value", collision_hash)

    # Measure lookup times
    degenerate_times = []
    degenerate_comps = []
    for key in sample_keys:
        _, _, comps, ns = degenerate_table.lookup_timed(key, collision_hash)
        degenerate_times.append(ns)
        degenerate_comps.append(comps)

    print(f"  Elements: {degenerate_table.size}")
    print(f"  Avg comparisons: {sum(degenerate_comps)/len(degenerate_comps):.2f}")
    print(f"  Max comparisons: {max(degenerate_comps)}")
    print(f"  Avg lookup time: {sum(degenerate_times)/len(degenerate_times):.0f} ns")
    print(f"  Max lookup time: {max(degenerate_times):.0f} ns")
    print()

    # COMPARISON
    print("COMPARISON:")
    print("-" * 50)
    avg_normal = sum(normal_comps) / len(normal_comps)
    avg_degen = sum(degenerate_comps) / len(degenerate_comps)
    time_normal = sum(normal_times) / len(normal_times)
    time_degen = sum(degenerate_times) / len(degenerate_times)
    print(f"  Comparison increase: {avg_degen / avg_normal:.0f}x")
    print(f"  Time increase: {time_degen / time_normal:.0f}x")
    print()
    print("  This is the difference between O(1) and O(n)!")


demonstrate_lookup_degradation()
```

In 2003, Scott Crosby and Dan Wallach demonstrated algorithmic complexity attacks against Perl's hash tables. By crafting HTTP request parameters that all collided, they could make a web server spend minutes processing a single request that should take milliseconds. Modern languages now use randomized hashing or alternative data structures specifically to prevent these attacks.
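As a small illustration of that defense (a sketch assuming a CPython 3.3+ interpreter with its default string-hash randomization), the same string hashes to different values in separate interpreter processes, so an attacker cannot precompute colliding keys offline:

```python
import subprocess
import sys

# Spawn two fresh interpreters and hash the same string in each.
# With default hash randomization (PYTHONHASHSEED not pinned), the two
# printed values normally differ from run to run.
cmd = [sys.executable, "-c", "print(hash('collide-me'))"]
for run in range(2):
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    print(f"process {run + 1}: hash('collide-me') = {out}")
```

Note that pinning PYTHONHASHSEED to a fixed value (as test suites sometimes do for reproducibility) disables this protection, which is worth remembering for anything internet-facing.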
Open addressing stores all elements directly in the array. Lookup follows the same probe sequence used during insertion, searching until finding the target, an empty slot (indicating absence), or completing a full cycle.
Critical Subtlety: Tombstones Complicate Lookup
With open addressing, deletion creates a problem. If we simply empty a slot that was part of a collision chain, we break lookups for elements that probed past it. The solution is tombstones—markers indicating 'deleted, but keep probing.'
This means unsuccessful lookups cannot stop at the first 'empty' slot—they must stop at a slot that was never occupied (not just deleted). As tombstones accumulate, lookup performance degrades.
Probe Count Analysis (Linear Probing):
For load factor α, expected probes are approximately ½(1 + 1/(1 − α)) for a successful lookup and ½(1 + 1/(1 − α)²) for an unsuccessful one (Knuth's classic estimates for linear probing).
As α → 1, these explode. Linear probing is extremely sensitive to high load factors.
| Load Factor | Successful Lookup | Unsuccessful Lookup | Practical Assessment |
|---|---|---|---|
| 50% | 1.5 probes | 2.5 probes | Excellent—minimal degradation |
| 70% | 2.2 probes | 6.1 probes | Good—still acceptable |
| 80% | 3.0 probes | 13.0 probes | Degrading—consider resize |
| 90% | 5.5 probes | 50.5 probes | Poor—resize urgently needed |
| 95% | 10.5 probes | 200.5 probes | Critical—effectively broken |
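A quick way to regenerate the estimates in this table from the formulas above (a minimal sketch, not a benchmark):

```python
# Regenerate the linear-probing estimates from the table:
# successful ≈ ½(1 + 1/(1-α)), unsuccessful ≈ ½(1 + 1/(1-α)²).
for alpha in [0.50, 0.70, 0.80, 0.90, 0.95]:
    successful = 0.5 * (1 + 1 / (1 - alpha))
    unsuccessful = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    print(f"α={alpha:.2f}  successful≈{successful:5.1f} probes  "
          f"unsuccessful≈{unsuccessful:6.1f} probes")
```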
```python
class OpenAddressingLookup:
    """
    Demonstrates lookup with open addressing and tombstone handling.
    """

    EMPTY = "EMPTY"
    DELETED = "DELETED"

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [self.EMPTY] * capacity
        self.size = 0
        self.deleted_count = 0

    def _hash(self, key) -> int:
        return hash(key)

    def _probe_sequence(self, key):
        """Generate probe sequence using linear probing."""
        start = self._hash(key) % self.capacity
        for offset in range(self.capacity):
            yield (start + offset) % self.capacity

    def insert(self, key, value) -> int:
        """Insert and return number of probes."""
        probes = 0
        for idx in self._probe_sequence(key):
            probes += 1
            slot = self.slots[idx]
            if slot is self.EMPTY or slot is self.DELETED:
                # Found empty/deleted slot - insert
                if slot is self.DELETED:
                    self.deleted_count -= 1
                self.slots[idx] = (key, value)
                self.size += 1
                return probes
            if slot[0] == key:
                # Key exists - update
                self.slots[idx] = (key, value)
                return probes
        raise RuntimeError("Table is full")

    def lookup(self, key) -> tuple:
        """
        Lookup with probe counting.
        Returns: (found, value, probes)

        Key insight: We CANNOT stop at DELETED slots - we must continue
        probing until we find the key or hit a truly EMPTY slot.
        """
        probes = 0
        for idx in self._probe_sequence(key):
            probes += 1
            slot = self.slots[idx]
            if slot is self.EMPTY:
                # Never occupied = key definitely absent
                return (False, None, probes)
            if slot is self.DELETED:
                # Was occupied then deleted - keep probing
                continue
            if slot[0] == key:
                # Found it!
                return (True, slot[1], probes)
        # Full cycle without finding key
        return (False, None, probes)

    def delete(self, key) -> bool:
        """Delete a key, leaving tombstone."""
        for idx in self._probe_sequence(key):
            slot = self.slots[idx]
            if slot is self.EMPTY:
                return False  # Key not found
            if slot is not self.DELETED and slot[0] == key:
                self.slots[idx] = self.DELETED
                self.size -= 1
                self.deleted_count += 1
                return True
        return False

    def load_factor(self) -> float:
        """Effective load including tombstones."""
        return (self.size + self.deleted_count) / self.capacity


def demonstrate_tombstone_impact():
    """Show how deletions affect lookup performance."""
    print("Demonstrating Tombstone Impact on Lookup")
    print("=" * 60)
    capacity = 100
    for scenario, deletions in [("No deletions", 0),
                                ("25% deleted", 25),
                                ("50% deleted", 50)]:
        table = OpenAddressingLookup(capacity=capacity)

        # Insert 70 elements
        for i in range(70):
            table.insert(f"key_{i}", f"value_{i}")

        # Delete specified number
        for i in range(deletions):
            table.delete(f"key_{i}")

        # Measure lookup performance for remaining elements
        probes_list = []
        for i in range(deletions, 70):
            _, _, probes = table.lookup(f"key_{i}")
            probes_list.append(probes)

        # Also measure unsuccessful lookups
        fail_probes = []
        for i in range(100, 200):
            _, _, probes = table.lookup(f"key_{i}")
            fail_probes.append(probes)

        print(f"\n{scenario}:")
        print(f"  Active elements: {table.size}")
        print(f"  Tombstones: {table.deleted_count}")
        print(f"  Effective load: {table.load_factor():.0%}")
        print(f"  Avg probes (success): {sum(probes_list)/len(probes_list):.1f}")
        print(f"  Avg probes (failure): {sum(fail_probes)/len(fail_probes):.1f}")


demonstrate_tombstone_impact()
```

In workloads with many insertions and deletions, tombstones accumulate. Even if the 'live' load factor is low, the probing cost reflects all slots that were ever occupied. Periodic 'compaction' (rebuilding the table without tombstones) is necessary to maintain performance. This is why some implementations prefer chaining—deletion is simply removing a list node.
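One way to perform that compaction, sketched against the OpenAddressingLookup class above (illustrative only; a production rebuild would usually also grow or shrink the capacity at the same time):

```python
def compact(table: OpenAddressingLookup) -> OpenAddressingLookup:
    """Rebuild the table without tombstones by re-inserting live entries."""
    fresh = OpenAddressingLookup(capacity=table.capacity)
    for slot in table.slots:
        # Skip never-occupied slots and tombstones; keep only live (key, value) pairs.
        if slot is not table.EMPTY and slot is not table.DELETED:
            key, value = slot
            fresh.insert(key, value)
    return fresh
```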
Hash tables' O(1) average lookup is their primary advantage. Let's compare against alternative data structures to understand when this matters.
The Competition: an unsorted list or array answers lookups by linear scan in O(n); a sorted array supports binary search in O(log n); a balanced binary search tree (BST) also gives O(log n) while preserving key order; a hash table gives O(1) on average.
The key question: when does O(1) vs O(log n) actually matter?
| Collection Size | O(n) Unsorted | O(log n) BST | O(1) Hash Table |
|---|---|---|---|
| 100 | ~50 comparisons | ~7 comparisons | ~1-2 comparisons |
| 10,000 | ~5,000 comparisons | ~14 comparisons | ~1-2 comparisons |
| 1,000,000 | ~500,000 comparisons | ~20 comparisons | ~1-2 comparisons |
| 100,000,000 | ~50,000,000 comparisons | ~27 comparisons | ~1-2 comparisons |
The Profound Implication:
For 100 million elements, hash tables perform approximately the same number of operations as for 100 elements. BSTs scale logarithmically—still excellent, but growing. Unsorted lists become completely impractical.
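To see this growth pattern on real hardware, here is a rough, machine-dependent sketch comparing membership tests over one million integer keys; absolute numbers will vary, and the containers used (a list scanned linearly, the same sorted list searched with bisect, and a dict) are stand-ins for the three columns above:

```python
import bisect
import random
import timeit

n = 1_000_000
sorted_keys = list(range(n))          # sorted list: reused for scan and binary search
key_map = dict.fromkeys(sorted_keys)  # hash table (dict) holding the same keys
probe = random.randrange(n)           # a key that exists

def linear_scan():
    return probe in sorted_keys                    # O(n): walks the list

def binary_search():
    i = bisect.bisect_left(sorted_keys, probe)     # O(log n)
    return i < n and sorted_keys[i] == probe

def hash_lookup():
    return probe in key_map                        # O(1) on average

for name, fn, reps in [("linear scan", linear_scan, 20),
                       ("binary search", binary_search, 200_000),
                       ("hash lookup", hash_lookup, 200_000)]:
    per_call = timeit.timeit(fn, number=reps) / reps
    print(f"{name:>13}: {per_call * 1e9:10.0f} ns per lookup")
```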
However, hash tables trade order for speed: they cannot iterate keys in sorted order, answer range queries (e.g., 'all keys between 10 and 20'), or find the minimum, maximum, predecessor, or successor of a key without scanning the entire table.
BSTs preserve order at the cost of O(log n) operations—a worthy trade for many applications.
We've thoroughly examined the lookup operation—the crown jewel of hash table performance. Let's consolidate the essential insights:
- Lookup is hash, index, then search: the first three steps are always O(1); the search within the bucket (or along the probe sequence) determines everything else.
- Under SUHA and a bounded load factor, expected cost is O(1): roughly 1 + α/2 comparisons for a successful chaining lookup and α for an unsuccessful one.
- A degenerate hash function or an adversarial key set collapses the table into a single chain, and lookup becomes O(n).
- With open addressing, lookups must probe past tombstones, so deletion-heavy workloads degrade performance until the table is rebuilt.
- Hash tables beat O(log n) structures on raw lookup speed but give up ordering, range queries, and min/max access.
What's Next:
Insert and lookup are the glamorous operations, but a complete understanding requires examining delete—where subtle complexities arise, especially with open addressing. The next page explores deletion mechanics, tombstone strategies, and their performance implications.
You now possess a comprehensive understanding of hash table lookup. You can explain the mathematical foundations of O(1) average case, articulate precisely how and why O(n) worst case emerges, compare successful versus unsuccessful lookups, and evaluate when hash tables are the right choice versus ordered alternatives. Next, we'll complete the picture with delete operations.