When we say a hash table is "an array with a hash function," we're stating a truth that hides enormous subtlety. The way that array is organized—what lives in each position, how entries are arranged, what happens when positions overflow—determines the real-world performance of every hash table operation.
This organizational scheme centers on two concepts: buckets and slots. Understanding these terms precisely, along with the various ways implementations structure them, gives you the mental model needed to predict performance, debug issues, and choose between hash table variants.
By the end of this page, you will understand exactly what buckets and slots are, how they organize data within a hash table, the different strategies for handling multiple items per bucket, and how these design choices affect performance and memory usage.
The terms "bucket" and "slot" are sometimes used interchangeably, and sometimes with distinct meanings. Let's establish precise definitions that apply across implementations.
Bucket:
A bucket is a position in the hash table's underlying array—a location where data associated with a particular hash value can be stored. When a key hashes to index i, we say it "falls into bucket i."
The critical insight: a bucket is not a single storage space for a single item. A bucket is a location that may contain:

- Nothing (the bucket is empty)
- Exactly one entry (a single key-value pair)
- A reference to a structure holding multiple entries (as in separate chaining)
Slot:
A slot is a storage position for a single key-value pair. In implementations where each bucket holds exactly one pair, "bucket" and "slot" are synonymous. In implementations where buckets can hold multiple pairs (chaining), each bucket contains multiple slots.
Think of it this way:
| Term | Definition | In Chaining | In Open Addressing |
|---|---|---|---|
| Bucket | Array position identified by hash(key) % size | A linked list head (can hold many entries) | A single slot (holds zero or one entry) |
| Slot | Space for one key-value pair | One node in the bucket's linked list | One array position = one slot |
| Entry | A stored key-value pair | A node in the chain | An occupied slot |
| Capacity | Number of buckets in the table | Array size | Array size = maximum entries |
In practice, many sources use 'bucket' and 'slot' interchangeably. When reading documentation or papers, rely on context. The key distinction is whether the discussion involves a single storage location (slot) or a hash-indexed position that might contain multiple items (bucket).
The hash table's core structure is an array of buckets. This array has several important properties that affect behavior:
Fixed Size at Creation:
When a hash table is created (or resized), the bucket array has a fixed number of positions. This size is chosen based on expected usage:
bucket_count = initial_capacity // e.g., 16, 32, 64
The array must be allocated contiguously in memory, enabling O(1) access to any bucket:
bucket_address = base_address + (index × bucket_size)
Index Range:
Valid bucket indices range from 0 to capacity - 1. The hash function must produce values in this range (typically via modulo arithmetic):
index = hash(key) % capacity
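This mapping is easy to exercise directly. The sketch below is illustrative only: the function name bucket_index and the capacity of 8 are arbitrary choices, not part of any particular implementation.

```python
def bucket_index(key, capacity):
    """Map any hashable key into the valid range [0, capacity - 1]."""
    return hash(key) % capacity

capacity = 8
for key in ["alice", "bob", 42, (1, 2)]:
    i = bucket_index(key, capacity)
    assert 0 <= i < capacity  # every key lands in a valid bucket
    print(f"{key!r} -> bucket {i}")
```

Note that string hashes are randomized per process in Python, so the specific indices differ between runs; only the range guarantee is stable.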
Bucket States:
Each bucket exists in one of several possible states, depending on the implementation:
```python
from enum import Enum
from typing import Optional, Generic, TypeVar, List
from dataclasses import dataclass, field

K = TypeVar('K')
V = TypeVar('V')


class BucketState(Enum):
    """Possible states of a bucket in open addressing."""
    EMPTY = "empty"        # Never used - search can stop
    OCCUPIED = "occupied"  # Contains valid entry
    DELETED = "deleted"    # Was occupied, now deleted (tombstone)


@dataclass
class Slot(Generic[K, V]):
    """A slot holding one key-value pair."""
    key: K
    value: V


@dataclass
class Bucket(Generic[K, V]):
    """A bucket in open addressing: holds state and optionally one slot."""
    state: BucketState
    slot: Optional[Slot[K, V]]

    @classmethod
    def empty(cls) -> 'Bucket':
        return cls(state=BucketState.EMPTY, slot=None)

    @classmethod
    def with_entry(cls, key: K, value: V) -> 'Bucket':
        return cls(state=BucketState.OCCUPIED, slot=Slot(key, value))

    def is_available(self) -> bool:
        """Can we insert a new entry here?"""
        return self.state in (BucketState.EMPTY, BucketState.DELETED)

    def mark_deleted(self) -> None:
        """Convert occupied bucket to tombstone."""
        self.state = BucketState.DELETED
        self.slot = None


# For chaining: a bucket contains a list of slots
@dataclass
class ChainedBucket(Generic[K, V]):
    """A bucket in chaining: holds a list of key-value pairs."""
    entries: List[Slot[K, V]] = field(default_factory=list)

    def add(self, key: K, value: V) -> None:
        """Add or update an entry in the chain."""
        for entry in self.entries:
            if entry.key == key:
                entry.value = value  # Update existing
                return
        self.entries.append(Slot(key, value))  # Append new

    def get(self, key: K) -> Optional[V]:
        """Find a value by key in the chain."""
        for entry in self.entries:
            if entry.key == key:
                return entry.value
        return None

    def remove(self, key: K) -> bool:
        """Remove an entry by key. Returns True if found."""
        for i, entry in enumerate(self.entries):
            if entry.key == key:
                self.entries.pop(i)
                return True
        return False

    def __len__(self) -> int:
        return len(self.entries)
```

How buckets handle their contents—especially when multiple keys hash to the same index—defines the fundamental character of a hash table implementation. Two primary strategies exist, each with cascading implications for performance and memory usage.
Strategy 1: Separate Chaining (Closed Addressing)
In separate chaining, each bucket contains a pointer to a secondary data structure (typically a linked list) that holds all entries with that hash index.
Key Characteristics:

- The bucket array stores only references; entries live in separately allocated chain nodes
- Any number of entries can share a bucket, so the load factor may exceed 1.0
- Deletion is simple: unlink the node from its chain
- Performance degrades gradually as chains lengthen
Strategy 2: Open Addressing (Closed Hashing)
In open addressing, all entries are stored directly in the bucket array. When a collision occurs, the algorithm probes for the next available slot within the array.
Key Characteristics:

- All entries live directly in the bucket array; there are no external structures
- A collision triggers probing: subsequent buckets are examined until a free slot is found
- The load factor must stay below 1.0, since searches need empty slots to terminate
- Deletion requires tombstone markers to avoid breaking probe sequences
| Aspect | Separate Chaining | Open Addressing |
|---|---|---|
| What bucket contains | Pointer to external chain | Actual key-value entry |
| Collision handling | Append to chain at same bucket | Probe for next available bucket |
| Maximum load factor | Unlimited (chains grow) | < 1.0 (must have empty slots) |
| Memory layout | Non-contiguous (follows pointers) | Contiguous (better cache performance) |
| Memory overhead | Pointer per bucket + node overhead | May need reserved space for tombstones |
| Deletion complexity | Simple (remove from list) | Complex (tombstone markers) |
| Performance at high load | Gradual degradation | Severe degradation near capacity |
| Common use | Java's HashMap (before Java 8) | Python's dict, modern implementations |
Modern implementations often combine strategies. Java 8+ HashMap starts with chaining but converts long chains (≥8 entries) to balanced trees for O(log n) worst-case lookup. Python uses open addressing with a sophisticated probing sequence. Understanding both foundational strategies helps you reason about any hybrid approach.
Let's examine the internal structure of buckets under separate chaining in detail. Understanding this structure helps you trace operations and predict performance.
The Chain Structure:
Each bucket in a chaining implementation contains a reference to the head of a linked list. Each node in the list contains:

- The key
- The value
- Often a cached copy of the key's hash code (to speed up comparisons and resizing)
- A pointer to the next node in the chain (or null at the end)
When the bucket is empty, it typically holds a null reference or a pointer to an empty sentinel.
```python
from typing import Optional, Generic, TypeVar

K = TypeVar('K')
V = TypeVar('V')


class ChainNode(Generic[K, V]):
    """
    A single node in a hash chain.
    Stores key, value, and pointer to next node.
    """
    __slots__ = ('key', 'value', 'hash_code', 'next')

    def __init__(self, key: K, value: V, hash_code: int):
        self.key = key
        self.value = value
        self.hash_code = hash_code  # Cache hash for comparisons
        self.next: Optional[ChainNode[K, V]] = None

    def __repr__(self):
        return f"({self.key}: {self.value})"


class ChainingHashTable(Generic[K, V]):
    """Hash table implementation using separate chaining."""

    def __init__(self, capacity: int = 16):
        # Each bucket is either None (empty) or a ChainNode (head of list)
        self.buckets: list[Optional[ChainNode[K, V]]] = [None] * capacity
        self.capacity = capacity
        self.size = 0

    def _hash_index(self, key: K) -> int:
        """Compute bucket index for a key."""
        return hash(key) % self.capacity

    def put(self, key: K, value: V) -> None:
        """
        Insert or update a key-value pair.

        Process:
        1. Compute bucket index
        2. Search chain for existing key
        3. If found: update value
        4. If not found: prepend new node to chain
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        # Search existing chain
        current = self.buckets[index]
        while current is not None:
            # Use cached hash for fast inequality check
            if current.hash_code == hash_code and current.key == key:
                current.value = value  # Update existing
                return
            current = current.next

        # Key not found: create new node and prepend to chain
        new_node = ChainNode(key, value, hash_code)
        new_node.next = self.buckets[index]  # Link to existing chain
        self.buckets[index] = new_node       # Make new node the head
        self.size += 1

    def get(self, key: K) -> Optional[V]:
        """
        Retrieve value for a key.

        Process:
        1. Compute bucket index
        2. Traverse chain comparing keys
        3. Return value if found, None otherwise
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        current = self.buckets[index]
        while current is not None:
            if current.hash_code == hash_code and current.key == key:
                return current.value
            current = current.next
        return None  # Not found

    def remove(self, key: K) -> bool:
        """
        Remove a key from the table.

        Process:
        1. Compute bucket index
        2. Find the node in chain
        3. Unlink it (update predecessor's next pointer)
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        current = self.buckets[index]
        prev = None
        while current is not None:
            if current.hash_code == hash_code and current.key == key:
                if prev is None:
                    # Removing head of chain
                    self.buckets[index] = current.next
                else:
                    # Removing middle or end of chain
                    prev.next = current.next
                self.size -= 1
                return True
            prev = current
            current = current.next
        return False  # Not found

    def debug_buckets(self) -> None:
        """Print bucket contents for visualization."""
        print(f"Hash Table ({self.size} items in {self.capacity} buckets)")
        print("-" * 50)
        for i, bucket in enumerate(self.buckets):
            if bucket is None:
                print(f"  [{i}]: (empty)")
            else:
                chain_items = []
                current = bucket
                while current:
                    chain_items.append(f"({current.key}: {current.value})")
                    current = current.next
                print(f"  [{i}]: {' → '.join(chain_items)}")


# Example usage and visualization
if __name__ == "__main__":
    ht = ChainingHashTable(capacity=8)

    # Insert some entries
    ht.put("alice", 100)
    ht.put("bob", 200)
    ht.put("charlie", 300)
    ht.put("diana", 400)

    ht.debug_buckets()
    # Output might look like:
    # Hash Table (4 items in 8 buckets)
    # --------------------------------------------------
    #   [0]: (empty)
    #   [1]: (alice: 100)
    #   [2]: (charlie: 300) → (bob: 200)   # Collision!
    #   [3]: (empty)
    #   [4]: (empty)
    #   [5]: (diana: 400)
    #   [6]: (empty)
    #   [7]: (empty)
```

Memory Layout Analysis:
In chaining, memory is not contiguous. The bucket array lives in one location, but each chain node is separately allocated:
```
Bucket Array:  [ptr0] [ptr1] [ptr2] [ptr3] [ptr4] ...
                 |      |      |      |
                 v      v      v      v
                null   Node   Node   null
                        |      |
                        v      v
                       null   Node
                               |
                               v
                              null
```
Each pointer-following operation can cause a cache miss, which is why open addressing (with contiguous storage) often performs better for workloads that fit in cache.
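A rough micro-benchmark illustrates the access-pattern difference. This is a sketch: build_chain and sum_chain are illustrative names, and since every Python object access already involves a pointer, the gap here is narrower than it would be in a systems language; the contrast between a sequential scan and dependent pointer hops is still the point.

```python
import time


class Node:
    """A linked-list node: a value plus a pointer to the next node."""
    __slots__ = ("value", "next")

    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt


def build_chain(n):
    """Build a chain of n separately allocated nodes holding 0..n-1."""
    head = None
    for v in range(n):
        head = Node(v, head)
    return head


def sum_chain(head):
    """Traverse the chain one dependent pointer hop at a time."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next
    return total


n = 100_000
contiguous = list(range(n))
chain = build_chain(n)

t0 = time.perf_counter()
s1 = sum(contiguous)   # sequential scan over contiguous storage
t1 = time.perf_counter()
s2 = sum_chain(chain)  # one pointer dereference per element
t2 = time.perf_counter()

assert s1 == s2  # same total, different memory-access patterns
print(f"contiguous: {t1 - t0:.4f}s  chained: {t2 - t1:.4f}s")
```

Absolute timings vary by machine; what matters is that the chained traversal cannot benefit from sequential prefetching the way the contiguous scan can.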
In open addressing, the bucket array directly stores key-value pairs. There are no chains, no pointers to external structures—just the array itself.
Slot Structure:
Each slot in an open addressing table must track more than just the key and value. It needs state information to distinguish empty from occupied from deleted:
Slot = { state, key, value }
Where state ∈ { EMPTY, OCCUPIED, DELETED }
Why DELETED Matters:
Consider this scenario:

1. Keys A and B both hash to index 3
2. A is inserted at index 3
3. B collides, probes forward, and lands at index 4
4. A is then deleted
If we simply clear slot 3 when deleting A, the search for B will:

1. Start probing at index 3
2. Find slot 3 empty
3. Stop, concluding that B is not in the table
B is still in the table at index 4, but we stopped looking too soon. Tombstones (DELETED state) tell the search: "Someone was here; keep looking."
```python
from enum import Enum
from typing import Iterator, Optional, Generic, TypeVar

K = TypeVar('K')
V = TypeVar('V')


class SlotState(Enum):
    EMPTY = 0     # Never occupied - search can stop
    OCCUPIED = 1  # Contains valid entry
    DELETED = 2   # Was occupied - keep searching (tombstone)


class OpenAddressingTable(Generic[K, V]):
    """
    Hash table using open addressing with linear probing.
    Each bucket directly stores key-value data.
    """

    def __init__(self, capacity: int = 16):
        # Parallel arrays for state, keys, and values
        # (In practice, might use an array of structs/dataclasses)
        self.states = [SlotState.EMPTY] * capacity
        self.keys: list[Optional[K]] = [None] * capacity
        self.values: list[Optional[V]] = [None] * capacity
        self.capacity = capacity
        self.size = 0

    def _probe_sequence(self, key: K) -> Iterator[int]:
        """
        Generate the sequence of indices to probe for a key.
        Linear probing: check indices hash, hash+1, hash+2, ...
        """
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            yield (h + i) % self.capacity

    def put(self, key: K, value: V) -> bool:
        """
        Insert or update a key-value pair.

        During insertion:
        - Skip OCCUPIED slots with different keys
        - Use EMPTY or DELETED slots for new entries
        - If the key already exists, update the value
        """
        first_available = None

        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                # Empty slot: key definitely not in table
                insert_index = first_available if first_available is not None else index
                self.states[insert_index] = SlotState.OCCUPIED
                self.keys[insert_index] = key
                self.values[insert_index] = value
                self.size += 1
                return True

            elif state == SlotState.DELETED:
                # Remember first deleted slot as potential insertion point
                if first_available is None:
                    first_available = index
                # Keep probing - key might be ahead

            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    # Key exists: update value
                    self.values[index] = value
                    return False  # No size increase

        # No EMPTY slot found: reuse the first tombstone if we saw one
        if first_available is not None:
            self.states[first_available] = SlotState.OCCUPIED
            self.keys[first_available] = key
            self.values[first_available] = value
            self.size += 1
            return True

        # Table is full (should rehash before this happens)
        raise RuntimeError("Hash table is full")

    def get(self, key: K) -> Optional[V]:
        """
        Retrieve value for a key.

        During lookup:
        - EMPTY -> key not found (stop)
        - DELETED -> keep probing
        - OCCUPIED -> check key match
        """
        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                return None  # Definitively not in table
            elif state == SlotState.DELETED:
                continue  # Keep looking
            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    return self.values[index]
                # Wrong key, keep probing

        return None  # Probed all slots, not found

    def remove(self, key: K) -> Optional[V]:
        """
        Remove a key from the table.

        CRITICAL: We cannot simply set the slot to EMPTY.
        We must use DELETED (tombstone) to preserve probe chains.
        """
        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                return None  # Key not in table
            elif state == SlotState.DELETED:
                continue  # Keep looking
            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    # Found it - mark as deleted (tombstone)
                    self.states[index] = SlotState.DELETED
                    value = self.values[index]
                    self.keys[index] = None  # Help GC
                    self.values[index] = None
                    self.size -= 1
                    return value

        return None  # Not found

    def debug_slots(self) -> None:
        """Print slot contents for visualization."""
        print(f"\nOpen Addressing Table ({self.size} items in {self.capacity} slots)")
        print("-" * 60)
        for i in range(self.capacity):
            state = self.states[i]
            if state == SlotState.EMPTY:
                print(f"  [{i}]: [EMPTY]")
            elif state == SlotState.DELETED:
                print(f"  [{i}]: [DELETED] (tombstone)")
            else:
                print(f"  [{i}]: [OCCUPIED] {self.keys[i]} → {self.values[i]}")


# Demonstrate tombstone behavior
if __name__ == "__main__":
    ht = OpenAddressingTable(capacity=8)

    # Insert two keys; if they collide, the second probes past the first
    ht.put("keyA", "valueA")
    ht.put("keyB", "valueB")

    print("Before deletion:")
    ht.debug_slots()

    # Delete keyA
    ht.remove("keyA")
    print("\nAfter deleting keyA:")
    ht.debug_slots()

    # Can we still find keyB?
    result = ht.get("keyB")
    print(f"\nLooking up keyB: {result}")  # Should find it!
```

Tombstones are necessary for correctness, but they accumulate over time. A table with many tombstones has poor performance: searches must probe through deleted slots. This is why open addressing tables periodically rehash: they rebuild the table, eliminating tombstones and improving probe sequences.
The number of buckets in a hash table is a critical design parameter. Too few buckets cause excessive collisions; too many waste memory. The load factor quantifies this relationship.
Load Factor Definition:
load_factor = n / m
Where:
n = number of stored entries
m = number of buckets (capacity)
For example:

- 12 entries in 16 buckets → load factor 12/16 = 0.75
- 8 entries in 8 buckets → load factor 1.0
- 20 entries in 8 buckets (chaining only) → load factor 2.5
Load Factor Impact:
| Load Factor | Chaining Performance | Open Addressing Performance | Memory Efficiency |
|---|---|---|---|
| 0.25 | Excellent (very short chains) | Excellent (rare probing) | Poor (75% empty space) |
| 0.50 | Very good (avg 0.5 items/bucket) | Very good (short probe sequences) | Moderate (50% empty) |
| 0.75 | Good (manageable chains) | Good (acceptable probing) | Good (25% empty) |
| 1.0 | Acceptable (avg 1 item/bucket) | Degraded (long probes) | Perfect (no waste) |
| 2.0 | Poor (chains averaging 2 items) | Impossible (no space) | Cannot achieve |
Typical Thresholds:
Chaining: Load factor of 0.75 is common before rehashing. Some implementations allow higher values since chaining degrades gracefully.
Open Addressing: Load factor of 0.5-0.7 is typical. Performance degrades sharply as the table fills, because probe sequences lengthen rapidly once empty slots become scarce.
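This degradation can be quantified. For linear probing under the uniform hashing assumption, the classical approximations due to Knuth estimate expected probe counts as a function of the load factor α. The small calculation below is illustrative and not tied to any particular implementation:

```python
def expected_probes_hit(alpha):
    """Approximate probes for a successful search at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))


def expected_probes_miss(alpha):
    """Approximate probes for an unsuccessful search at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


for alpha in (0.5, 0.7, 0.9):
    print(f"α={alpha}: hit ≈ {expected_probes_hit(alpha):.1f}, "
          f"miss ≈ {expected_probes_miss(alpha):.1f}")
```

At α = 0.5 an unsuccessful search averages about 2.5 probes; at α = 0.9 it averages about 50.5. This is why open addressing tables rehash well before they fill.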
Why Powers of Two for Capacity?
Many implementations use power-of-two bucket counts (16, 32, 64, ...) because:
- hash % capacity can be computed as hash & (capacity - 1) using a fast bitwise AND
- Doubling the capacity on resize means each entry either stays in place or moves by exactly the old capacity, simplifying rehashing

However, power-of-two sizes can interact poorly with hash functions that don't distribute lower bits well. Some implementations use prime capacities or add secondary hashing to mitigate this.
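The modulo-to-mask equivalence is easy to verify for a power-of-two capacity (a minimal sketch; the capacity of 16 is an arbitrary example):

```python
capacity = 16           # must be a power of two for the trick to work
mask = capacity - 1     # 0b1111: keeps only the low bits

for h in [0, 7, 16, 12345, 2**31 - 1]:
    # For non-negative h and power-of-two capacity, the two are identical
    assert h % capacity == h & mask

print("modulo and bitmask agree for power-of-two capacities")
```

The AND is a single-cycle operation, whereas integer division/modulo can cost tens of cycles, which is why this micro-optimization is so common in hash table hot paths.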
```python
def demonstrate_load_factor():
    """Show how load factor affects hash table behavior."""
    entries = [
        ("alice", 1), ("bob", 2), ("charlie", 3), ("diana", 4),
        ("eve", 5), ("frank", 6), ("grace", 7), ("henry", 8)
    ]

    print("=" * 60)
    print("LOAD FACTOR DEMONSTRATION")
    print("=" * 60)

    for capacity in [32, 16, 8, 4]:
        # Simple simulation: count collisions
        buckets = [[] for _ in range(capacity)]

        for key, value in entries:
            index = hash(key) % capacity
            buckets[index].append((key, value))

        # Calculate statistics
        load_factor = len(entries) / capacity
        non_empty = sum(1 for b in buckets if b)
        collisions = sum(1 for b in buckets if len(b) > 1)
        max_chain = max(len(b) for b in buckets)

        print(f"\nCapacity: {capacity} buckets, {len(entries)} entries")
        print(f"  Load Factor: {load_factor:.2f}")
        print(f"  Non-empty buckets: {non_empty}")
        print(f"  Buckets with collisions: {collisions}")
        print(f"  Longest chain: {max_chain}")

        # Visual
        print("  Bucket distribution:", end="")
        for b in buckets:
            print(f" [{len(b)}]", end="")
        print()


# Example output (exact counts vary with hash randomization):
# Capacity: 32 buckets, 8 entries
#   Load Factor: 0.25
#   Non-empty buckets: 8
#   Buckets with collisions: 0
#   Longest chain: 1
#
# Capacity: 8 buckets, 8 entries
#   Load Factor: 1.00
#   Non-empty buckets: 6
#   Buckets with collisions: 2
#   Longest chain: 2

demonstrate_load_factor()
```

The physical memory layout of buckets and slots has significant performance implications that go beyond algorithmic complexity. Understanding memory layout helps explain why modern implementations often prefer open addressing.
Cache Locality:
Modern CPUs are optimized for sequential memory access. When you access one memory location, nearby locations are loaded into cache. This makes sequential access much faster than random access.
Comparison: Probing vs Chain Traversal

Linear probing visits consecutive array slots, so several probes often land in the same cache line and the hardware prefetcher can anticipate the next access. Chain traversal dereferences a pointer to a separately allocated node at each step; each hop can land anywhere in the heap, so every node visited risks a fresh cache miss.
Real-World Measurements:

In practice, these differences matter substantially. Benchmarks of cache-resident tables consistently show open-addressing lookups outperforming chained lookups even when both perform the same number of key comparisons, simply because probed slots are contiguous while chain nodes are scattered across the heap.
Memory Overhead Analysis:
Chaining:

- One pointer per bucket in the array, even when the bucket is empty
- Per-node overhead: a next pointer plus allocator/object-header metadata for every entry
- Entries are allocated individually, fragmenting the heap

Open Addressing:

- Empty slots reserved to keep the load factor below 1.0
- A small state field (or sentinel value) per slot to distinguish empty, occupied, and deleted
- Tombstones occupy slots until the next rehash
Open addressing wastes space on empty slots; chaining wastes space on pointers and allocator overhead. The break-even depends on load factor and entry size.
Python's dict uses open addressing with a sparse design: the hash table contains indices into a dense keys/values array. This provides cache-friendly iteration (the dense array), efficient memory (empty slots are just integers), and good probe performance. This clever design, introduced in Python 3.6, demonstrates how careful memory layout can improve both performance and functionality.
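A heavily simplified sketch of the compact-dict idea follows. This illustrates the layout only: CompactDictSketch is a made-up name, it uses plain linear probing, and it omits resizing, deletion, and CPython's perturbation-based probe sequence.

```python
class CompactDictSketch:
    """Sparse index array pointing into a dense, insertion-ordered entry list."""

    def __init__(self, capacity=8):
        self.indices = [-1] * capacity  # sparse: small ints, -1 means empty
        self.entries = []               # dense: (hash, key, value) in insertion order
        self.capacity = capacity

    def put(self, key, value):
        h = hash(key)
        for i in range(self.capacity):
            slot = (h + i) % self.capacity  # linear probing over the index array
            idx = self.indices[slot]
            if idx == -1:
                # Empty index slot: append to the dense array, record its position
                self.indices[slot] = len(self.entries)
                self.entries.append((h, key, value))
                return
            if self.entries[idx][1] == key:
                # Existing key: update the dense entry in place
                self.entries[idx] = (h, key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        h = hash(key)
        for i in range(self.capacity):
            slot = (h + i) % self.capacity
            idx = self.indices[slot]
            if idx == -1:
                return None  # hit an empty index slot: key absent
            if self.entries[idx][1] == key:
                return self.entries[idx][2]
        return None

    def keys(self):
        # Iteration walks the dense array: cache-friendly and insertion-ordered
        return [k for (_, k, _) in self.entries]
```

The sparse indices array does all the probing; the dense entries list stores the actual data in insertion order, which is why iteration over such a table is both cache-friendly and ordered.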
We've developed a detailed understanding of how hash tables organize their internal storage. Let's consolidate the key concepts:

- A bucket is a hash-indexed position in the underlying array; a slot holds exactly one key-value pair
- Separate chaining stores entries in per-bucket linked lists; open addressing stores them directly in the array and probes on collision
- Open addressing needs tombstones so that deletions don't break probe sequences
- The load factor n / m drives both performance and the decision to rehash
- Contiguous layout gives open addressing better cache behavior; chaining pays for pointer chasing
What's Next:
Now that you understand the internal organization of hash tables, the next page walks through exactly how insertion works: the complete step-by-step process of adding a new key-value pair, handling collisions, and maintaining table invariants.
You now understand the internal architecture of hash tables: buckets and slots, chaining vs open addressing, load factor, and memory layout considerations. This structural knowledge prepares you to trace and reason about hash table operations in detail.