When we say a hash table is "an array with a hash function," we're stating a truth that hides enormous subtlety. The way that array is organized—what lives in each position, how entries are arranged, what happens when positions overflow—determines the real-world performance of every hash table operation.
This organizational scheme centers on two concepts: buckets and slots. Understanding these terms precisely, along with the various ways implementations structure them, gives you the mental model needed to predict performance, debug issues, and choose between hash table variants.
By the end of this page, you will understand exactly what buckets and slots are, how they organize data within a hash table, the different strategies for handling multiple items per bucket, and how these design choices affect performance and memory usage.
The terms "bucket" and "slot" are sometimes used interchangeably, and sometimes with distinct meanings. Let's establish precise definitions that apply across implementations.
Bucket:
A bucket is a position in the hash table's underlying array—a location where data associated with a particular hash value can be stored. When a key hashes to index i, we say it "falls into bucket i."
The critical insight: a bucket is not a single storage space for a single item. A bucket is a location that may contain:

- Nothing (the bucket is empty)
- Exactly one entry (a single key-value pair)
- A reference to a structure holding multiple entries (as in separate chaining)
Slot:
A slot is a storage position for a single key-value pair. In implementations where each bucket holds exactly one pair, "bucket" and "slot" are synonymous. In implementations where buckets can hold multiple pairs (chaining), each bucket contains multiple slots.
Think of it this way:
| Term | Definition | In Chaining | In Open Addressing |
|---|---|---|---|
| Bucket | Array position identified by hash(key) % size | A linked list head (can hold many entries) | A single slot (holds zero or one entry) |
| Slot | Space for one key-value pair | One node in the bucket's linked list | One array position = one slot |
| Entry | A stored key-value pair | A node in the chain | An occupied slot |
| Capacity | Number of buckets in the table | Array size | Array size = maximum entries |
In practice, many sources use 'bucket' and 'slot' interchangeably. When reading documentation or papers, rely on context. The key distinction is whether the discussion involves a single storage location (slot) or a hash-indexed position that might contain multiple items (bucket).
The hash table's core structure is an array of buckets. This array has several important properties that affect behavior:
Fixed Size at Creation:
When a hash table is created (or resized), the bucket array has a fixed number of positions. This size is chosen based on expected usage:
bucket_count = initial_capacity // e.g., 16, 32, 64
The array must be allocated contiguously in memory, enabling O(1) access to any bucket:
bucket_address = base_address + (index × bucket_size)
Index Range:
Valid bucket indices range from 0 to capacity - 1. The hash function must produce values in this range (typically via modulo arithmetic):
index = hash(key) % capacity
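This mapping is easy to exercise directly. The sketch below is illustrative only: the function name bucket_index and the capacity of 8 are arbitrary choices, not part of any particular implementation.

```python
def bucket_index(key, capacity):
    """Map any hashable key into the valid range [0, capacity - 1]."""
    return hash(key) % capacity

capacity = 8
for key in ["alice", "bob", 42, (1, 2)]:
    i = bucket_index(key, capacity)
    assert 0 <= i < capacity  # every key lands in a valid bucket
    print(f"{key!r} -> bucket {i}")
```

Note that string hashes are randomized per process in Python, so the specific indices differ between runs; only the range guarantee is stable.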
Bucket States:
Each bucket exists in one of several possible states, depending on the implementation:
```python
from enum import Enum
from typing import Optional, Generic, TypeVar, List
from dataclasses import dataclass, field

K = TypeVar('K')
V = TypeVar('V')


class BucketState(Enum):
    """Possible states of a bucket in open addressing."""
    EMPTY = "empty"        # Never used - search can stop
    OCCUPIED = "occupied"  # Contains valid entry
    DELETED = "deleted"    # Was occupied, now deleted (tombstone)


@dataclass
class Slot(Generic[K, V]):
    """A slot holding one key-value pair."""
    key: K
    value: V


@dataclass
class Bucket(Generic[K, V]):
    """A bucket in open addressing: holds state and optionally one slot."""
    state: BucketState
    slot: Optional[Slot[K, V]]

    @classmethod
    def empty(cls) -> 'Bucket':
        return cls(state=BucketState.EMPTY, slot=None)

    @classmethod
    def with_entry(cls, key: K, value: V) -> 'Bucket':
        return cls(state=BucketState.OCCUPIED, slot=Slot(key, value))

    def is_available(self) -> bool:
        """Can we insert a new entry here?"""
        return self.state in (BucketState.EMPTY, BucketState.DELETED)

    def mark_deleted(self) -> None:
        """Convert occupied bucket to tombstone."""
        self.state = BucketState.DELETED
        self.slot = None


# For chaining: a bucket contains a list of slots
@dataclass
class ChainedBucket(Generic[K, V]):
    """A bucket in chaining: holds a list of key-value pairs."""
    entries: List[Slot[K, V]] = field(default_factory=list)

    def add(self, key: K, value: V) -> None:
        """Add or update an entry in the chain."""
        for entry in self.entries:
            if entry.key == key:
                entry.value = value  # Update existing
                return
        self.entries.append(Slot(key, value))  # Append new

    def get(self, key: K) -> Optional[V]:
        """Find a value by key in the chain."""
        for entry in self.entries:
            if entry.key == key:
                return entry.value
        return None

    def remove(self, key: K) -> bool:
        """Remove an entry by key. Returns True if found."""
        for i, entry in enumerate(self.entries):
            if entry.key == key:
                self.entries.pop(i)
                return True
        return False

    def __len__(self) -> int:
        return len(self.entries)
```

How buckets handle their contents—especially when multiple keys hash to the same index—defines the fundamental character of a hash table implementation. Two primary strategies exist, each with cascading implications for performance and memory usage.
Strategy 1: Separate Chaining (Closed Addressing)
In separate chaining, each bucket contains a pointer to a secondary data structure (typically a linked list) that holds all entries with that hash index.
Key Characteristics:

- The bucket array stores only references; entries live in separately allocated chain nodes
- Any number of entries can share a bucket, so the load factor may exceed 1.0
- Deletion is simple: unlink the node from its chain
- Performance degrades gradually as chains lengthen
Strategy 2: Open Addressing (Closed Hashing)
In open addressing, all entries are stored directly in the bucket array. When a collision occurs, the algorithm probes for the next available slot within the array.
Key Characteristics:

- All entries live directly in the bucket array; there are no external structures
- A collision triggers probing: subsequent buckets are examined until a free slot is found
- The load factor must stay below 1.0, since searches need empty slots to terminate
- Deletion requires tombstone markers to avoid breaking probe sequences
| Aspect | Separate Chaining | Open Addressing |
|---|---|---|
| What bucket contains | Pointer to external chain | Actual key-value entry |
| Collision handling | Append to chain at same bucket | Probe for next available bucket |
| Maximum load factor | Unlimited (chains grow) | < 1.0 (must have empty slots) |
| Memory layout | Non-contiguous (follows pointers) | Contiguous (better cache performance) |
| Memory overhead | Pointer per bucket + node overhead | May need reserved space for tombstones |
| Deletion complexity | Simple (remove from list) | Complex (tombstone markers) |
| Performance at high load | Gradual degradation | Severe degradation near capacity |
| Common use | Java's HashMap (before Java 8) | Python's dict, modern implementations |
Modern implementations often combine strategies. Java 8+ HashMap starts with chaining but converts long chains (≥8 entries) to balanced trees for O(log n) worst-case lookup. Python uses open addressing with a sophisticated probing sequence. Understanding both foundational strategies helps you reason about any hybrid approach.
Let's examine the internal structure of buckets under separate chaining in detail. Understanding this structure helps you trace operations and predict performance.
The Chain Structure:
Each bucket in a chaining implementation contains a reference to the head of a linked list. Each node in the list contains:

- The key
- The value
- Often a cached copy of the key's hash code (to speed up comparisons and resizing)
- A pointer to the next node in the chain (or null at the end)
When the bucket is empty, it typically holds a null reference or a pointer to an empty sentinel.
```python
from typing import Optional, Generic, TypeVar

K = TypeVar('K')
V = TypeVar('V')


class ChainNode(Generic[K, V]):
    """
    A single node in a hash chain.
    Stores key, value, and pointer to next node.
    """
    __slots__ = ('key', 'value', 'hash_code', 'next')

    def __init__(self, key: K, value: V, hash_code: int):
        self.key = key
        self.value = value
        self.hash_code = hash_code  # Cache hash for comparisons
        self.next: Optional[ChainNode[K, V]] = None

    def __repr__(self):
        return f"({self.key}: {self.value})"


class ChainingHashTable(Generic[K, V]):
    """Hash table implementation using separate chaining."""

    def __init__(self, capacity: int = 16):
        # Each bucket is either None (empty) or a ChainNode (head of list)
        self.buckets: list[Optional[ChainNode[K, V]]] = [None] * capacity
        self.capacity = capacity
        self.size = 0

    def _hash_index(self, key: K) -> int:
        """Compute bucket index for a key."""
        return hash(key) % self.capacity

    def put(self, key: K, value: V) -> None:
        """
        Insert or update a key-value pair.

        Process:
        1. Compute bucket index
        2. Search chain for existing key
        3. If found: update value
        4. If not found: prepend new node to chain
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        # Search existing chain
        current = self.buckets[index]
        while current is not None:
            # Use cached hash for fast inequality check
            if current.hash_code == hash_code and current.key == key:
                current.value = value  # Update existing
                return
            current = current.next

        # Key not found: create new node and prepend to chain
        new_node = ChainNode(key, value, hash_code)
        new_node.next = self.buckets[index]  # Link to existing chain
        self.buckets[index] = new_node       # Make new node the head
        self.size += 1

    def get(self, key: K) -> Optional[V]:
        """
        Retrieve value for a key.

        Process:
        1. Compute bucket index
        2. Traverse chain comparing keys
        3. Return value if found, None otherwise
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        current = self.buckets[index]
        while current is not None:
            if current.hash_code == hash_code and current.key == key:
                return current.value
            current = current.next
        return None  # Not found

    def remove(self, key: K) -> bool:
        """
        Remove a key from the table.

        Process:
        1. Compute bucket index
        2. Find the node in chain
        3. Unlink it (update predecessor's next pointer)
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        current = self.buckets[index]
        prev = None
        while current is not None:
            if current.hash_code == hash_code and current.key == key:
                if prev is None:
                    # Removing head of chain
                    self.buckets[index] = current.next
                else:
                    # Removing middle or end of chain
                    prev.next = current.next
                self.size -= 1
                return True
            prev = current
            current = current.next
        return False  # Not found

    def debug_buckets(self) -> None:
        """Print bucket contents for visualization."""
        print(f"Hash Table ({self.size} items in {self.capacity} buckets)")
        print("-" * 50)
        for i, bucket in enumerate(self.buckets):
            if bucket is None:
                print(f"  [{i}]: (empty)")
            else:
                chain_items = []
                current = bucket
                while current:
                    chain_items.append(f"({current.key}: {current.value})")
                    current = current.next
                print(f"  [{i}]: {' → '.join(chain_items)}")


# Example usage and visualization
if __name__ == "__main__":
    ht = ChainingHashTable(capacity=8)

    # Insert some entries
    ht.put("alice", 100)
    ht.put("bob", 200)
    ht.put("charlie", 300)
    ht.put("diana", 400)

    ht.debug_buckets()
    # Output might look like:
    # Hash Table (4 items in 8 buckets)
    # --------------------------------------------------
    #   [0]: (empty)
    #   [1]: (alice: 100)
    #   [2]: (charlie: 300) → (bob: 200)   # Collision!
    #   [3]: (empty)
    #   [4]: (empty)
    #   [5]: (diana: 400)
    #   [6]: (empty)
    #   [7]: (empty)
```

Memory Layout Analysis:
In chaining, memory is not contiguous. The bucket array lives in one location, but each chain node is separately allocated:
```
Bucket Array:  [ptr0] [ptr1] [ptr2] [ptr3] [ptr4] ...
                 |      |      |      |
                 v      v      v      v
                null   Node   Node   null
                        |      |
                        v      v
                       null   Node
                               |
                               v
                              null
```
Each pointer-following operation can cause a cache miss, which is why open addressing (with contiguous storage) often performs better for workloads that fit in cache.
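A rough micro-benchmark illustrates the access-pattern difference. This is a sketch: build_chain and sum_chain are illustrative names, and since every Python object access already involves a pointer, the gap here is narrower than it would be in a systems language; the contrast between a sequential scan and dependent pointer hops is still the point.

```python
import time


class Node:
    """A linked-list node: a value plus a pointer to the next node."""
    __slots__ = ("value", "next")

    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt


def build_chain(n):
    """Build a chain of n separately allocated nodes holding 0..n-1."""
    head = None
    for v in range(n):
        head = Node(v, head)
    return head


def sum_chain(head):
    """Traverse the chain one dependent pointer hop at a time."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next
    return total


n = 100_000
contiguous = list(range(n))
chain = build_chain(n)

t0 = time.perf_counter()
s1 = sum(contiguous)   # sequential scan over contiguous storage
t1 = time.perf_counter()
s2 = sum_chain(chain)  # one pointer dereference per element
t2 = time.perf_counter()

assert s1 == s2  # same total, different memory-access patterns
print(f"contiguous: {t1 - t0:.4f}s  chained: {t2 - t1:.4f}s")
```

Absolute timings vary by machine; what matters is that the chained traversal cannot benefit from sequential prefetching the way the contiguous scan can.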
In open addressing, the bucket array directly stores key-value pairs. There are no chains, no pointers to external structures—just the array itself.
Slot Structure:
Each slot in an open addressing table must track more than just the key and value. It needs state information to distinguish empty from occupied from deleted:
Slot = { state, key, value }
Where state ∈ { EMPTY, OCCUPIED, DELETED }
Why DELETED Matters:
Consider this scenario:

1. Keys A and B both hash to index 3
2. A is inserted at index 3
3. B collides, probes forward, and lands at index 4
4. A is then deleted
If we simply clear slot 3 when deleting A, the search for B will:

1. Start probing at index 3
2. Find slot 3 empty
3. Stop, concluding that B is not in the table
B is still in the table at index 4, but we stopped looking too soon. Tombstones (DELETED state) tell the search: "Someone was here; keep looking."
```python
from enum import Enum
from typing import Iterator, Optional, Generic, TypeVar

K = TypeVar('K')
V = TypeVar('V')


class SlotState(Enum):
    EMPTY = 0     # Never occupied - search can stop
    OCCUPIED = 1  # Contains valid entry
    DELETED = 2   # Was occupied - keep searching (tombstone)


class OpenAddressingTable(Generic[K, V]):
    """
    Hash table using open addressing with linear probing.
    Each bucket directly stores key-value data.
    """

    def __init__(self, capacity: int = 16):
        # Parallel arrays for state, keys, and values
        # (In practice, might use an array of structs/dataclasses)
        self.states = [SlotState.EMPTY] * capacity
        self.keys: list[Optional[K]] = [None] * capacity
        self.values: list[Optional[V]] = [None] * capacity
        self.capacity = capacity
        self.size = 0

    def _probe_sequence(self, key: K) -> Iterator[int]:
        """
        Generate the sequence of indices to probe for a key.
        Linear probing: check indices hash, hash+1, hash+2, ...
        """
        h = hash(key) % self.capacity
        for i in range(self.capacity):
            yield (h + i) % self.capacity

    def put(self, key: K, value: V) -> bool:
        """
        Insert or update a key-value pair.

        During insertion:
        - Skip OCCUPIED slots with different keys
        - Use EMPTY or DELETED slots for new entries
        - If the key already exists, update the value
        """
        first_available = None

        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                # Empty slot: key definitely not in table
                insert_index = first_available if first_available is not None else index
                self.states[insert_index] = SlotState.OCCUPIED
                self.keys[insert_index] = key
                self.values[insert_index] = value
                self.size += 1
                return True

            elif state == SlotState.DELETED:
                # Remember first deleted slot as potential insertion point
                if first_available is None:
                    first_available = index
                # Keep probing - key might be ahead

            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    # Key exists: update value
                    self.values[index] = value
                    return False  # No size increase

        # No EMPTY slot found: reuse the first tombstone if we saw one
        if first_available is not None:
            self.states[first_available] = SlotState.OCCUPIED
            self.keys[first_available] = key
            self.values[first_available] = value
            self.size += 1
            return True

        # Table is full (should rehash before this happens)
        raise RuntimeError("Hash table is full")

    def get(self, key: K) -> Optional[V]:
        """
        Retrieve value for a key.

        During lookup:
        - EMPTY -> key not found (stop)
        - DELETED -> keep probing
        - OCCUPIED -> check key match
        """
        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                return None  # Definitively not in table
            elif state == SlotState.DELETED:
                continue  # Keep looking
            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    return self.values[index]
                # Wrong key, keep probing

        return None  # Probed all slots, not found

    def remove(self, key: K) -> Optional[V]:
        """
        Remove a key from the table.

        CRITICAL: We cannot simply set the slot to EMPTY.
        We must use DELETED (tombstone) to preserve probe chains.
        """
        for index in self._probe_sequence(key):
            state = self.states[index]

            if state == SlotState.EMPTY:
                return None  # Key not in table
            elif state == SlotState.DELETED:
                continue  # Keep looking
            elif state == SlotState.OCCUPIED:
                if self.keys[index] == key:
                    # Found it - mark as deleted (tombstone)
                    self.states[index] = SlotState.DELETED
                    value = self.values[index]
                    self.keys[index] = None  # Help GC
                    self.values[index] = None
                    self.size -= 1
                    return value

        return None  # Not found

    def debug_slots(self) -> None:
        """Print slot contents for visualization."""
        print(f"\nOpen Addressing Table ({self.size} items in {self.capacity} slots)")
        print("-" * 60)
        for i in range(self.capacity):
            state = self.states[i]
            if state == SlotState.EMPTY:
                print(f"  [{i}]: [EMPTY]")
            elif state == SlotState.DELETED:
                print(f"  [{i}]: [DELETED] (tombstone)")
            else:
                print(f"  [{i}]: [OCCUPIED] {self.keys[i]} → {self.values[i]}")


# Demonstrate tombstone behavior
if __name__ == "__main__":
    ht = OpenAddressingTable(capacity=8)

    # Insert two keys; if they collide, the second probes past the first
    ht.put("keyA", "valueA")
    ht.put("keyB", "valueB")

    print("Before deletion:")
    ht.debug_slots()

    # Delete keyA
    ht.remove("keyA")
    print("\nAfter deleting keyA:")
    ht.debug_slots()

    # Can we still find keyB?
    result = ht.get("keyB")
    print(f"\nLooking up keyB: {result}")  # Should find it!
```

Tombstones are necessary for correctness, but they accumulate over time. A table with many tombstones has poor performance: searches must probe through deleted slots. This is why open addressing tables periodically rehash: they rebuild the table, eliminating tombstones and improving probe sequences.
The number of buckets in a hash table is a critical design parameter. Too few buckets cause excessive collisions; too many waste memory. The load factor quantifies this relationship.
Load Factor Definition:
load_factor = n / m
Where:
n = number of stored entries
m = number of buckets (capacity)
For example:

- 12 entries in 16 buckets → load factor 12/16 = 0.75
- 8 entries in 8 buckets → load factor 1.0
- 20 entries in 8 buckets (chaining only) → load factor 2.5
Load Factor Impact:
| Load Factor | Chaining Performance | Open Addressing Performance | Memory Efficiency |
|---|---|---|---|
| 0.25 | Excellent (very short chains) | Excellent (rare probing) | Poor (75% empty space) |
| 0.50 | Very good (avg 0.5 items/bucket) | Very good (short probe sequences) | Moderate (50% empty) |
| 0.75 | Good (manageable chains) | Good (acceptable probing) | Good (25% empty) |
| 1.0 | Acceptable (avg 1 item/bucket) | Degraded (long probes) | Perfect (no waste) |
| 2.0 | Poor (chains averaging 2 items) | Impossible (no space) | Cannot achieve |
Typical Thresholds:
Chaining: Load factor of 0.75 is common before rehashing. Some implementations allow higher values since chaining degrades gracefully.
Open Addressing: Load factor of 0.5-0.7 is typical. Performance degrades sharply as the table fills, because probe sequences lengthen rapidly once empty slots become scarce.
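This degradation can be quantified. For linear probing under the uniform hashing assumption, the classical approximations due to Knuth estimate expected probe counts as a function of the load factor α. The small calculation below is illustrative and not tied to any particular implementation:

```python
def expected_probes_hit(alpha):
    """Approximate probes for a successful search at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha))


def expected_probes_miss(alpha):
    """Approximate probes for an unsuccessful search at load factor alpha."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


for alpha in (0.5, 0.7, 0.9):
    print(f"α={alpha}: hit ≈ {expected_probes_hit(alpha):.1f}, "
          f"miss ≈ {expected_probes_miss(alpha):.1f}")
```

At α = 0.5 an unsuccessful search averages about 2.5 probes; at α = 0.9 it averages about 50.5. This is why open addressing tables rehash well before they fill.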
Why Powers of Two for Capacity?
Many implementations use power-of-two bucket counts (16, 32, 64, ...) because:
- hash % capacity can be computed as hash & (capacity - 1) using a fast bitwise AND
- Doubling the capacity on resize means each entry either stays in place or moves by exactly the old capacity, simplifying rehashing

However, power-of-two sizes can interact poorly with hash functions that don't distribute lower bits well. Some implementations use prime capacities or add secondary hashing to mitigate this.
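The modulo-to-mask equivalence is easy to verify for a power-of-two capacity (a minimal sketch; the capacity of 16 is an arbitrary example):

```python
capacity = 16           # must be a power of two for the trick to work
mask = capacity - 1     # 0b1111: keeps only the low bits

for h in [0, 7, 16, 12345, 2**31 - 1]:
    # For non-negative h and power-of-two capacity, the two are identical
    assert h % capacity == h & mask

print("modulo and bitmask agree for power-of-two capacities")
```

The AND is a single-cycle operation, whereas integer division/modulo can cost tens of cycles, which is why this micro-optimization is so common in hash table hot paths.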
```python
def demonstrate_load_factor():
    """Show how load factor affects hash table behavior."""
    entries = [
        ("alice", 1), ("bob", 2), ("charlie", 3), ("diana", 4),
        ("eve", 5), ("frank", 6), ("grace", 7), ("henry", 8)
    ]

    print("=" * 60)
    print("LOAD FACTOR DEMONSTRATION")
    print("=" * 60)

    for capacity in [32, 16, 8, 4]:
        # Simple simulation: count collisions
        buckets = [[] for _ in range(capacity)]

        for key, value in entries:
            index = hash(key) % capacity
            buckets[index].append((key, value))

        # Calculate statistics
        load_factor = len(entries) / capacity
        non_empty = sum(1 for b in buckets if b)
        collisions = sum(1 for b in buckets if len(b) > 1)
        max_chain = max(len(b) for b in buckets)

        print(f"\nCapacity: {capacity} buckets, {len(entries)} entries")
        print(f"  Load Factor: {load_factor:.2f}")
        print(f"  Non-empty buckets: {non_empty}")
        print(f"  Buckets with collisions: {collisions}")
        print(f"  Longest chain: {max_chain}")

        # Visual
        print("  Bucket distribution:", end="")
        for b in buckets:
            print(f" [{len(b)}]", end="")
        print()


# Example output (exact counts vary with hash randomization):
# Capacity: 32 buckets, 8 entries
#   Load Factor: 0.25
#   Non-empty buckets: 8
#   Buckets with collisions: 0
#   Longest chain: 1
#
# Capacity: 8 buckets, 8 entries
#   Load Factor: 1.00
#   Non-empty buckets: 6
#   Buckets with collisions: 2
#   Longest chain: 2

demonstrate_load_factor()
```

The physical memory layout of buckets and slots has significant performance implications that go beyond algorithmic complexity. Understanding memory layout helps explain why modern implementations often prefer open addressing.
Cache Locality:
Modern CPUs are optimized for sequential memory access. When you access one memory location, nearby locations are loaded into cache. This makes sequential access much faster than random access.
Comparison: Probing vs Chain Traversal

Linear probing visits consecutive array slots, so several probes often land in the same cache line and the hardware prefetcher can anticipate the next access. Chain traversal dereferences a pointer to a separately allocated node at each step; each hop can land anywhere in the heap, so every node visited risks a fresh cache miss.
Real-World Measurements:

In practice, these differences matter substantially. Benchmarks of cache-resident tables consistently show open-addressing lookups outperforming chained lookups even when both perform the same number of key comparisons, simply because probed slots are contiguous while chain nodes are scattered across the heap.
Memory Overhead Analysis:
Chaining:

- One pointer per bucket in the array, even when the bucket is empty
- Per-node overhead: a next pointer plus allocator/object-header metadata for every entry
- Entries are allocated individually, fragmenting the heap

Open Addressing:

- Empty slots reserved to keep the load factor below 1.0
- A small state field (or sentinel value) per slot to distinguish empty, occupied, and deleted
- Tombstones occupy slots until the next rehash
Open addressing wastes space on empty slots; chaining wastes space on pointers and allocator overhead. The break-even depends on load factor and entry size.
Python's dict uses open addressing with a sparse design: the hash table contains indices into a dense keys/values array. This provides cache-friendly iteration (the dense array), efficient memory (empty slots are just integers), and good probe performance. This clever design, introduced in Python 3.6, demonstrates how careful memory layout can improve both performance and functionality.
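A heavily simplified sketch of the compact-dict idea follows. This illustrates the layout only: CompactDictSketch is a made-up name, it uses plain linear probing, and it omits resizing, deletion, and CPython's perturbation-based probe sequence.

```python
class CompactDictSketch:
    """Sparse index array pointing into a dense, insertion-ordered entry list."""

    def __init__(self, capacity=8):
        self.indices = [-1] * capacity  # sparse: small ints, -1 means empty
        self.entries = []               # dense: (hash, key, value) in insertion order
        self.capacity = capacity

    def put(self, key, value):
        h = hash(key)
        for i in range(self.capacity):
            slot = (h + i) % self.capacity  # linear probing over the index array
            idx = self.indices[slot]
            if idx == -1:
                # Empty index slot: append to the dense array, record its position
                self.indices[slot] = len(self.entries)
                self.entries.append((h, key, value))
                return
            if self.entries[idx][1] == key:
                # Existing key: update the dense entry in place
                self.entries[idx] = (h, key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        h = hash(key)
        for i in range(self.capacity):
            slot = (h + i) % self.capacity
            idx = self.indices[slot]
            if idx == -1:
                return None  # hit an empty index slot: key absent
            if self.entries[idx][1] == key:
                return self.entries[idx][2]
        return None

    def keys(self):
        # Iteration walks the dense array: cache-friendly and insertion-ordered
        return [k for (_, k, _) in self.entries]
```

The sparse indices array does all the probing; the dense entries list stores the actual data in insertion order, which is why iteration over such a table is both cache-friendly and ordered.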
We've developed a detailed understanding of how hash tables organize their internal storage. Let's consolidate the key concepts:

- A bucket is a hash-indexed position in the underlying array; a slot holds exactly one key-value pair
- Separate chaining stores entries in per-bucket linked lists; open addressing stores them directly in the array and probes on collision
- Open addressing needs tombstones so that deletions don't break probe sequences
- The load factor n / m drives both performance and the decision to rehash
- Contiguous layout gives open addressing better cache behavior; chaining pays for pointer chasing
What's Next:
Now that you understand the internal organization of hash tables, the next page walks through exactly how insertion works: the complete step-by-step process of adding a new key-value pair, handling collisions, and maintaining table invariants.
You now understand the internal architecture of hash tables: buckets and slots, chaining vs open addressing, load factor, and memory layout considerations. This structural knowledge prepares you to trace and reason about hash table operations in detail.