Every time you write map[key] = value or dict.put(key, value), you trigger a sophisticated sequence of operations that the hash table executes in microseconds. Understanding this sequence—from the moment the key enters the function to the moment the value is stored—is essential for reasoning about performance, debugging issues, and designing efficient systems.
Insertion is where theory meets practice. All the concepts we've covered—hash functions, buckets, slots, collision resolution—come together in a single, coordinated operation. Let's trace through this operation step by step, examining every decision point and potential outcome.
By the end of this page, you will be able to trace the complete insertion process in your mind, handle collision scenarios with both chaining and open addressing, understand when and why updates occur instead of insertions, and identify the performance implications of each step.
Before diving into details, let's establish the high-level flow of hash table insertion. This algorithm applies regardless of the collision resolution strategy:
The Universal Insertion Algorithm:
INSERT(table, key, value):
1. HASH: Compute hash_code = hash(key)
2. INDEX: Compute bucket_index = hash_code mod table.capacity
3. LOCATE: Navigate to table.buckets[bucket_index]
4. SEARCH: Check if key already exists at this location
5. STORE: Either update existing entry or create new one
6. MAINTAIN: Trigger rehashing if load factor exceeds threshold
Each of these steps has nuances that affect correctness and performance. Let's examine each in depth.
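As a preview, here is a minimal end-to-end sketch of all six steps using chaining. The class name and the list-of-pairs bucket layout are illustrative choices for this sketch, not a canonical implementation:

class SimpleChainingTable:
    """Minimal chaining table showing the six insertion steps."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]  # Bucket = list of [key, value] pairs

    def insert(self, key, value):
        hash_code = hash(key)                  # 1. HASH
        index = hash_code % self.capacity      # 2. INDEX
        bucket = self.buckets[index]           # 3. LOCATE
        for pair in bucket:                    # 4. SEARCH
            if pair[0] == key:
                pair[1] = value                # 5. STORE (update in place)
                return
        bucket.append([key, value])            # 5. STORE (new entry)
        self.size += 1
        if self.size / self.capacity > 0.75:   # 6. MAINTAIN
            self._rehash()

    def _rehash(self):
        old_items = [pair for bucket in self.buckets for pair in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        self.size = 0
        for key, value in old_items:
            self.insert(key, value)

table = SimpleChainingTable()
table.insert("alice", 100)
table.insert("alice", 150)  # Exercises the update path
print(table.size)           # 1 - the duplicate key updated, not inserted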
Hash tables enforce a fundamental invariant: each key appears at most once. Insertion with a duplicate key doesn't create a second entry—it replaces the existing value. This invariant simplifies reasoning and prevents inconsistencies.
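Python's built-in dict makes this invariant easy to observe:

inventory = {}
inventory["widget"] = 10
inventory["widget"] = 25    # Same key: the value is replaced

print(len(inventory))       # 1  - still a single entry
print(inventory["widget"])  # 25 - the old value is gone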
The first two steps transform an arbitrary key into a specific array index. This transformation is the "hash" in "hash table" and is central to achieving O(1) access.
Computing the Hash Code:
The hash function takes the key and produces an integer hash code. Different key types require different hashing approaches: strings are hashed by mixing the contribution of each character, integers often hash to themselves or to a simple transform of themselves, and composite keys (such as tuples) combine the hashes of their components.
The hash code is typically a large integer (potentially negative) that doesn't directly correspond to any array index.
def trace_hashing_step(key, table_capacity: int = 16):
    """
    Trace the hashing and indexing steps in detail.
    """
    print(f"\n{'='*60}")
    print(f"INSERTION TRACE: Steps 1-2 (Hashing and Indexing)")
    print(f"{'='*60}")
    print(f"Key: {repr(key)}")
    print(f"Table capacity: {table_capacity}")
    print()

    # STEP 1: Compute hash code
    print("STEP 1: Hashing")
    print("-" * 40)

    # Python's built-in hash
    hash_code = hash(key)
    print(f"  hash({repr(key)}) = {hash_code}")

    # Note on hash code range
    print(f"  Hash code is a potentially large integer")
    print(f"  Can be negative: {hash_code < 0}")
    print(f"  Bit length: {hash_code.bit_length()} bits")
    print()

    # STEP 2: Compute array index
    print("STEP 2: Indexing")
    print("-" * 40)

    # Standard modulo approach
    raw_index = hash_code % table_capacity
    print(f"  hash_code % capacity = {hash_code} % {table_capacity}")
    print(f"  Raw index = {raw_index}")

    # Handle negative results (Python % always returns non-negative for a positive divisor)
    final_index = raw_index if raw_index >= 0 else raw_index + table_capacity
    print(f"  Final index = {final_index}")
    print()

    # Demonstrate why the same key always gives the same index
    print("VERIFICATION: Determinism")
    print("-" * 40)
    repeat_hash = hash(key)
    repeat_index = repeat_hash % table_capacity
    print(f"  Re-hash: {repeat_hash}, Re-index: {repeat_index}")
    print(f"  Consistent: {repeat_index == final_index}")

    return final_index

# Examples with different key types
trace_hashing_step("alice")
trace_hashing_step(42)
trace_hashing_step(("composite", "key", 123))

# Sample output:
# ============================================================
# INSERTION TRACE: Steps 1-2 (Hashing and Indexing)
# ============================================================
# Key: 'alice'
# Table capacity: 16
#
# STEP 1: Hashing
# ----------------------------------------
#   hash('alice') = -6512894263891412622
#   Hash code is a potentially large integer
#   Can be negative: True
#   Bit length: 63 bits
#
# STEP 2: Indexing
# ----------------------------------------
#   hash_code % capacity = -6512894263891412622 % 16
#   Raw index = 2
#   Final index = 2

Index Computation Details:
The modulo operation hash_code % capacity constrains the hash code to the range [0, capacity-1]. There are important implementation details:
Positive Modulo: Some languages (C, C++, Java) can return negative remainders for negative dividends. Hash table implementations must ensure the result is non-negative:
int index = (hashCode % capacity + capacity) % capacity;
// or
int index = hashCode & (capacity - 1); // Only works for power-of-2 capacity
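In Python, no fix-up is needed: the % operator already returns a non-negative result for a positive divisor. math.fmod shows what C-style truncation would have produced instead:

import math

print(-7 % 16)                 # 9  (Python: remainder takes the divisor's sign)
print(int(math.fmod(-7, 16)))  # -7 (C-style: remainder takes the dividend's sign)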
Power-of-Two Optimization: When capacity is a power of 2, modulo can be replaced with bitwise AND:
hash % 16 = hash & 15 // 16 = 2^4, so mask is 1111 in binary
hash % 32 = hash & 31 // 32 = 2^5, so mask is 11111
This is faster because a bitwise AND is much cheaper than integer division. Many implementations, including Java's HashMap and CPython's dict, use power-of-two sizes for this reason.
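A quick sanity check of the mask equivalence, as a short sketch:

# Sanity check: for power-of-two capacities, masking and modulo agree
# on non-negative hash codes.
for capacity in (16, 32, 64):
    mask = capacity - 1
    for h in (0, 1, 7, 12345, 2**61 - 1):
        assert h % capacity == h & mask
print("h % capacity == h & (capacity - 1) holds for all tested values")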
If the hash function produces clustered values (many keys mapping to similar indices), insertion performance degrades. A good hash function spreads keys uniformly across all buckets, minimizing collisions and keeping insertion O(1) on average.
With the index computed, we access the corresponding bucket. This step is deceptively simple: the O(1) nature of this single array access is where the power of hash tables lies.
Array Indexing:
Array access by index is a constant-time operation because the memory address is computed directly:
address = base_address + (index × element_size)
There's no searching, no iteration, no comparison—just arithmetic followed by memory access.
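A worked example of that arithmetic, with an assumed base address and element size:

# Assumed values for illustration: the array starts at 0x1000 and
# each bucket reference occupies 8 bytes.
base_address = 0x1000
element_size = 8

def slot_address(index: int) -> int:
    # One multiply and one add - no loops, no comparisons.
    return base_address + index * element_size

print(hex(slot_address(0)))  # 0x1000
print(hex(slot_address(5)))  # 0x1028  (0x1000 + 5 * 8)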
What We Find at the Bucket:
Depending on the table's state and collision strategy, we may find: an empty bucket (insert directly), a bucket holding the same key (update in place), a bucket holding a different key (a collision to resolve), or a tombstone left by a deletion (a reusable slot, though the search must continue). The inspection routine below distinguishes these cases:
from typing import Optional, Any
from dataclasses import dataclass
from enum import Enum

class BucketState(Enum):
    EMPTY = "empty"        # Never occupied
    OCCUPIED = "occupied"  # Contains entry
    DELETED = "deleted"    # Tombstone

@dataclass
class BucketInspection:
    """Result of inspecting a bucket during insertion."""
    index: int
    state: BucketState
    existing_key: Optional[Any]
    action_needed: str

def trace_bucket_location(table, key, index: int) -> BucketInspection:
    """
    Step 3: Navigate to and inspect the computed bucket.
    Determines what action is needed.
    """
    print(f"\nSTEP 3: Locating Bucket")
    print("-" * 40)
    print(f"  Computed index: {index}")
    print(f"  Accessing table.buckets[{index}]...")

    # Access the bucket (O(1) operation)
    bucket = table.buckets[index]

    # Analyze bucket state
    if bucket is None or bucket.state == BucketState.EMPTY:
        print(f"  Bucket state: EMPTY")
        print(f"  Action: Can insert directly at this index")
        return BucketInspection(
            index=index,
            state=BucketState.EMPTY,
            existing_key=None,
            action_needed="INSERT_NEW"
        )
    elif bucket.state == BucketState.DELETED:
        print(f"  Bucket state: DELETED (tombstone)")
        print(f"  Action: Can reuse for insertion, but must continue search")
        return BucketInspection(
            index=index,
            state=BucketState.DELETED,
            existing_key=None,
            action_needed="CHECK_FURTHER_THEN_INSERT"
        )
    elif bucket.state == BucketState.OCCUPIED:
        existing_key = bucket.entry.key
        print(f"  Bucket state: OCCUPIED")
        print(f"  Existing key: {repr(existing_key)}")
        if existing_key == key:
            print(f"  Key match: YES")
            print(f"  Action: Update value in place")
            return BucketInspection(
                index=index,
                state=BucketState.OCCUPIED,
                existing_key=existing_key,
                action_needed="UPDATE_VALUE"
            )
        else:
            print(f"  Key match: NO (collision!)")
            print(f"  Action: Need collision resolution")
            return BucketInspection(
                index=index,
                state=BucketState.OCCUPIED,
                existing_key=existing_key,
                action_needed="RESOLVE_COLLISION"
            )

# The key insight: this step is always O(1).
# The complexity comes in what we find and what happens next.

Before storing a value, we must check if the key already exists in the table. This check enforces the unique-key invariant and determines whether we're performing an insertion or an update.
Why This Step Exists:
The unique-key invariant demands it: table[key] = value should update an existing entry, not create a duplicate. Without this check, repeated assignments to the same key would silently accumulate stale entries.

The Search Process Differs by Strategy:
Chaining: search the linked list at the bucket.
Open Addressing: follow the probe sequence until the key is found or an empty slot is reached.
In chaining, searching means traversing the linked list at the bucket. We compare each node's key until we find a match or reach the end:
SEARCH_CHAIN(bucket, key):
current = bucket.head
while current is not None:
if current.key == key:
return current // Found existing entry
current = current.next
return None // Key not in chain
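For comparison, a sketch of the open-addressing search with linear probing. The list-of-tuples slot layout, with None marking an empty slot, is an assumption of this sketch; tombstone handling appears in the fuller implementation later in this page.

def search_probe_sequence(slots, key):
    """Linear-probing search sketch.
    slots: list where each entry is None (empty) or a (key, value) tuple.
    Returns the index holding key, or None if the key is absent."""
    capacity = len(slots)
    start = hash(key) % capacity
    for i in range(capacity):
        index = (start + i) % capacity
        if slots[index] is None:
            return None        # Empty slot: the key cannot be further along
        if slots[index][0] == key:
            return index       # Found the existing entry
    return None                # Every slot probed without a match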
Performance: each node costs one key comparison, so the chain search is O(chain length). With a good hash function the expected chain length equals the load factor α, making the average search O(1 + α); in open addressing the analogous cost is the length of the probe sequence.
class ChainNode:
    def __init__(self, key, value, hash_code):
        self.key = key
        self.value = value
        self.hash_code = hash_code
        self.next = None

def search_chain(bucket_head, key, target_hash):
    """
    Search for a key in a chained bucket.
    Returns (found_node, predecessor) tuple.
    """
    predecessor = None
    current = bucket_head
    comparisons = 0

    print(f"\n  Searching chain for key: {repr(key)}")

    while current is not None:
        comparisons += 1
        # Optimization: compare hash first (fast integer comparison)
        if current.hash_code == target_hash:
            # Hash matches - now compare actual key
            if current.key == key:
                print(f"    [Node {comparisons}] Key {repr(current.key)}: MATCH!")
                print(f"    Found after {comparisons} comparison(s)")
                return current, predecessor
            else:
                print(f"    [Node {comparisons}] Key {repr(current.key)}: hash match but key differs")
        else:
            print(f"    [Node {comparisons}] Key {repr(current.key)}: hash mismatch")
        predecessor = current
        current = current.next

    print(f"    Reached end of chain after {comparisons} comparison(s)")
    print(f"    Key not found - this is a new entry")
    return None, predecessor  # predecessor useful for appending

# Example chain traversal
# bucket: ("alice", A) -> ("charlie", C) -> ("eve", E) -> None
# Searching for "charlie" requires 2 comparisons
# Searching for "new_key" requires 3 comparisons (full chain)

The search step determines what happens next. Based on the search result, we take one of several storage actions:
Case 1: Key Found (Update)
If the key already exists, we simply replace the old value with the new one. No structural changes to the table are needed—we're modifying an existing entry in place.
Case 2: Key Not Found, Empty Slot Available (Insert)
If the key doesn't exist and we found an empty slot (or, in open addressing, a tombstone), we create a new entry at that location.
Case 3: Key Not Found, Collision (Resolve and Insert)
If the key doesn't exist but the target slot is occupied by a different key, we must apply collision resolution before inserting.
from typing import Optional, Any
from dataclasses import dataclass
from enum import Enum

class SlotState(Enum):
    EMPTY = "empty"
    DELETED = "deleted"
    OCCUPIED = "occupied"

@dataclass
class Slot:
    """A single open-addressing slot."""
    state: SlotState = SlotState.EMPTY
    key: Any = None
    value: Any = None
    hash_code: int = 0

@dataclass
class InsertResult:
    """Describes what happened during insertion."""
    action: str               # "INSERTED" or "UPDATED"
    index: int                # Where the entry ended up
    probes: int               # Number of probes/comparisons needed
    old_value: Optional[Any]  # Previous value if updated

class ChainingHashTable:
    """Chaining implementation focusing on insertion.
    Uses the ChainNode class from the search example above."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [None] * capacity

    def insert(self, key, value) -> InsertResult:
        """
        Complete insertion implementation with detailed tracking.
        """
        hash_code = hash(key)
        index = hash_code % self.capacity

        # Search existing chain
        current = self.buckets[index]
        comparisons = 0
        while current is not None:
            comparisons += 1
            if current.hash_code == hash_code and current.key == key:
                # CASE 1: Key found - UPDATE
                old_value = current.value
                current.value = value
                return InsertResult(
                    action="UPDATED",
                    index=index,
                    probes=comparisons,
                    old_value=old_value
                )
            current = current.next

        # CASE 2: Key not found - INSERT at head of chain
        new_node = ChainNode(key, value, hash_code)
        new_node.next = self.buckets[index]  # Prepend to existing chain
        self.buckets[index] = new_node
        self.size += 1
        return InsertResult(
            action="INSERTED",
            index=index,
            probes=comparisons + 1,  # +1 for the insert itself
            old_value=None
        )

class OpenAddressingHashTable:
    """Open addressing implementation focusing on insertion."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.size = 0
        self.slots = [Slot() for _ in range(capacity)]

    def insert(self, key, value) -> InsertResult:
        """
        Complete insertion with linear probing.
        """
        if self.size >= self.capacity * 0.75:
            self._rehash()  # Prevent high load factor

        hash_code = hash(key)
        start_index = hash_code % self.capacity
        first_available = None
        probes = 0

        for i in range(self.capacity):
            probes += 1
            probe_index = (start_index + i) % self.capacity
            slot = self.slots[probe_index]

            if slot.state == SlotState.EMPTY:
                # CASE 2: Empty slot - INSERT here (or at an earlier tombstone)
                insert_index = first_available if first_available is not None else probe_index
                self._store_at(insert_index, key, value, hash_code)
                self.size += 1
                return InsertResult(
                    action="INSERTED",
                    index=insert_index,
                    probes=probes,
                    old_value=None
                )
            elif slot.state == SlotState.DELETED:
                # Remember first deleted slot as potential insertion point
                if first_available is None:
                    first_available = probe_index
                # Continue probing - key might exist ahead
            elif slot.state == SlotState.OCCUPIED:
                if slot.key == key:
                    # CASE 1: Key found - UPDATE
                    old_value = slot.value
                    slot.value = value
                    return InsertResult(
                        action="UPDATED",
                        index=probe_index,
                        probes=probes,
                        old_value=old_value
                    )
                # Different key - continue probing (CASE 3: collision)

        # Table is full - should never happen if we rehash properly
        raise RuntimeError("Hash table is full - rehash failed")

    def _store_at(self, index: int, key, value, hash_code: int):
        """Store a new entry at the given index."""
        slot = self.slots[index]
        slot.state = SlotState.OCCUPIED
        slot.key = key
        slot.value = value
        slot.hash_code = hash_code  # Cache for fast comparisons

    def _rehash(self):
        """Double capacity and re-insert all occupied slots."""
        old_slots = self.slots
        self.capacity *= 2
        self.slots = [Slot() for _ in range(self.capacity)]
        self.size = 0
        for slot in old_slots:
            if slot.state == SlotState.OCCUPIED:
                self.insert(slot.key, slot.value)

# Demonstration
def trace_full_insertion():
    """Trace a complete insertion sequence."""
    table = ChainingHashTable(capacity=8)

    insertions = [
        ("alice", 100),
        ("bob", 200),
        ("alice", 150),  # Update!
        ("charlie", 300),
    ]

    print("=" * 60)
    print("COMPLETE INSERTION TRACE")
    print("=" * 60)

    for key, value in insertions:
        print(f"\nInserting ({repr(key)}, {value}):")
        result = table.insert(key, value)
        print(f"  Action: {result.action}")
        print(f"  Final index: {result.index}")
        print(f"  Probes: {result.probes}")
        if result.old_value is not None:
            print(f"  Replaced value: {result.old_value} → {value}")

# trace_full_insertion()

In chaining, new entries are typically inserted at the head of the chain (prepending) rather than at the tail (appending). Prepending is O(1) because we already hold the head pointer; appending would require O(chain_length) to traverse to the end. It also means recently inserted keys sit at the front of the chain, where subsequent searches find them first.
After successful insertion, the hash table may need maintenance. The primary concern is the load factor—if too many entries crowd the table, performance degrades. The solution is rehashing: creating a larger table and re-inserting all entries.
Rehashing Trigger:
if (size / capacity) > load_factor_threshold:
rehash()
Typical thresholds: Java's HashMap defaults to 0.75, CPython's dict resizes when it is about 2/3 full, and open-addressing tables often use lower thresholds (around 0.5 to 0.7) because probe sequences lengthen quickly as the table fills.
The Rehashing Process:
class HashTableWithRehash:
    """
    Hash table demonstrating the rehashing process.
    Uses the ChainNode class from earlier examples.
    """
    LOAD_FACTOR_THRESHOLD = 0.75

    def __init__(self, initial_capacity: int = 16):
        self.capacity = initial_capacity
        self.size = 0
        self.buckets = [None] * self.capacity
        print(f"Created table with capacity {self.capacity}")

    def _load_factor(self) -> float:
        return self.size / self.capacity

    def _needs_rehash(self) -> bool:
        return self._load_factor() > self.LOAD_FACTOR_THRESHOLD

    def _rehash(self):
        """
        Resize the table and re-insert all entries.
        """
        old_capacity = self.capacity
        old_buckets = self.buckets

        # Double the capacity
        self.capacity *= 2
        self.buckets = [None] * self.capacity
        self.size = 0  # Reset - will be incremented during re-insertion

        print(f"\n{'='*50}")
        print(f"REHASHING: {old_capacity} → {self.capacity}")
        print(f"{'='*50}")

        entries_moved = 0
        # Re-insert all existing entries
        for bucket in old_buckets:
            if bucket is None:
                continue
            # For chaining: iterate through chain
            current = bucket
            while current is not None:
                old_index = hash(current.key) % old_capacity
                new_index = hash(current.key) % self.capacity
                print(f"  Moving ({current.key}): index {old_index} → {new_index}")
                # Perform insertion (simplified - direct store)
                self._insert_internal(current.key, current.value)
                entries_moved += 1
                current = current.next

        print(f"Rehashing complete: {entries_moved} entries moved")
        print(f"New load factor: {self._load_factor():.2%}")

    def insert(self, key, value):
        """Insert with automatic rehashing."""
        # Check BEFORE insert if we need to rehash
        if self._needs_rehash():
            print(f"\nLoad factor {self._load_factor():.2%} exceeds threshold "
                  f"{self.LOAD_FACTOR_THRESHOLD:.0%}")
            self._rehash()
        self._insert_internal(key, value)

    def _insert_internal(self, key, value):
        """Internal insertion (no rehash check)."""
        hash_code = hash(key)
        index = hash_code % self.capacity
        # Simplified: prepend to chain (ignoring update case)
        new_node = ChainNode(key, value, hash_code)
        new_node.next = self.buckets[index]
        self.buckets[index] = new_node
        self.size += 1

# Demonstrate rehashing trigger
def demo_rehashing():
    table = HashTableWithRehash(initial_capacity=4)
    keys = ["a", "b", "c", "d", "e", "f", "g", "h"]
    for key in keys:
        print(f"\nInserting '{key}'...")
        print(f"  Before: size={table.size}, capacity={table.capacity}, "
              f"load={table._load_factor():.2%}")
        table.insert(key, key.upper())
        print(f"  After: size={table.size}, capacity={table.capacity}")

# Output shows rehashing at the load factor threshold (indices are
# illustrative - Python randomizes string hashes between runs):
# Inserting 'a'...
#   Before: size=0, capacity=4, load=0.00%
#   After: size=1, capacity=4
# ...
# Inserting 'e'...
#   Before: size=4, capacity=4, load=100.00%
#
# Load factor 100.00% exceeds threshold 75%
# ==================================================
# REHASHING: 4 → 8
# ==================================================
#   Moving (d): index 3 → 3
#   Moving (c): index 2 → 2
#   Moving (b): index 1 → 1
#   Moving (a): index 1 → 5
# Rehashing complete: 4 entries moved
# New load factor: 50.00%
#   After: size=5, capacity=8

Rehashing is O(n)—we must re-insert every entry. But because the table doubles in size, the next rehash won't occur until n more insertions. Spreading the O(n) cost over those n insertions gives O(1) amortized time per insertion. This is the same principle that makes dynamic arrays (Python lists, Java ArrayList) efficient.
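A small counting experiment (illustrative and separate from the class above) shows the amortized bound directly: total entry moves stay within a small constant multiple of the number of insertions.

def total_rehash_moves(n_insertions: int, initial_capacity: int = 4,
                       threshold: float = 0.75) -> int:
    """Count how many entry moves rehashing causes over n insertions."""
    size, capacity, moves = 0, initial_capacity, 0
    for _ in range(n_insertions):
        if size / capacity > threshold:
            moves += size     # A rehash re-inserts every existing entry
            capacity *= 2
        size += 1
    return moves

for n in (100, 1_000, 10_000):
    print(f"{n:>6} insertions: {total_rehash_moves(n)} moves "
          f"({total_rehash_moves(n) / n:.2f} per insertion)")
# The moves-per-insertion ratio stays bounded (around 2 here),
# which is the amortized O(1) guarantee in action.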
Let's trace a complete insertion through all six steps, combining everything we've learned into one comprehensive example.
def complete_insertion_trace(key: str, value: int, table):
    """
    Trace every step of hash table insertion.
    Uses the ChainNode class from earlier examples.
    """
    print("\n" + "=" * 70)
    print(f"COMPLETE INSERTION TRACE: ({repr(key)}, {value})")
    print("=" * 70)

    # ========================================
    # STEP 1: Hash the key
    # ========================================
    print("\n[STEP 1] HASHING THE KEY")
    print("-" * 50)
    hash_code = hash(key)
    print(f"Input key: {repr(key)}")
    print(f"hash({repr(key)}) = {hash_code}")
    print(f"Hash computation: O(k) where k = len(key)")

    # ========================================
    # STEP 2: Compute bucket index
    # ========================================
    print("\n[STEP 2] COMPUTING BUCKET INDEX")
    print("-" * 50)
    index = hash_code % table.capacity
    print(f"Table capacity: {table.capacity}")
    print(f"Index = {hash_code} % {table.capacity} = {index}")
    print(f"Index computation: O(1)")

    # ========================================
    # STEP 3: Navigate to bucket
    # ========================================
    print("\n[STEP 3] NAVIGATING TO BUCKET")
    print("-" * 50)
    bucket = table.buckets[index]
    bucket_state = "EMPTY" if bucket is None else "OCCUPIED"
    print(f"Accessing table.buckets[{index}]")
    print(f"Bucket state: {bucket_state}")
    print(f"Array access: O(1)")

    # ========================================
    # STEP 4: Search for existing key
    # ========================================
    print("\n[STEP 4] SEARCHING FOR EXISTING KEY")
    print("-" * 50)
    found_node = None
    search_steps = 0
    current = bucket
    while current is not None:
        search_steps += 1
        print(f"  Comparing with node {search_steps}: key={repr(current.key)}")
        if current.hash_code == hash_code and current.key == key:
            found_node = current
            print(f"  → MATCH FOUND at node {search_steps}")
            break
        current = current.next
    if found_node is None:
        print(f"  → Key not found after {search_steps} comparison(s)")
    print(f"Search complexity: O(chain_length) = O({search_steps})")

    # ========================================
    # STEP 5: Store the entry
    # ========================================
    print("\n[STEP 5] STORING THE ENTRY")
    print("-" * 50)
    if found_node is not None:
        # UPDATE existing entry
        old_value = found_node.value
        found_node.value = value
        print(f"Action: UPDATE existing entry")
        print(f"Old value: {old_value}")
        print(f"New value: {value}")
        action = "UPDATED"
    else:
        # INSERT new entry at head
        new_node = ChainNode(key, value, hash_code)
        new_node.next = table.buckets[index]
        table.buckets[index] = new_node
        table.size += 1
        print(f"Action: INSERT new entry at head of chain")
        print(f"Table size: {table.size - 1} → {table.size}")
        action = "INSERTED"
    print(f"Storage: O(1)")

    # ========================================
    # STEP 6: Check for rehashing
    # ========================================
    print("\n[STEP 6] MAINTENANCE CHECK")
    print("-" * 50)
    load_factor = table.size / table.capacity
    threshold = 0.75
    print(f"Current load factor: {table.size}/{table.capacity} = {load_factor:.2%}")
    print(f"Threshold: {threshold:.0%}")
    if load_factor > threshold:
        print(f"→ REHASH NEEDED!")
        print(f"  This would trigger O(n) rehashing")
        # In real code: table._rehash()
    else:
        remaining = int(table.capacity * threshold) - table.size
        print(f"→ No rehash needed")
        print(f"  {remaining} more insertions before rehash")

    # ========================================
    # Summary
    # ========================================
    print("\n" + "=" * 70)
    print("INSERTION COMPLETE")
    print("=" * 70)
    print(f"Result: {action}")
    print(f"Final position: bucket[{index}]")
    print(f"Total comparisons: {search_steps}")
    print(f"Time complexity: O(1) average, O(n) worst case")
    return action, index

# Run the trace
class SimpleTable:
    def __init__(self):
        self.capacity = 8
        self.size = 2
        # Pre-populate one bucket with a two-node chain: bob -> charlie
        self.buckets = [None] * 8
        charlie = ChainNode("charlie", 300, hash("charlie"))
        bob = ChainNode("bob", 200, hash("bob"))
        bob.next = charlie
        self.buckets[hash("bob") % 8] = bob

# table = SimpleTable()
# complete_insertion_trace("alice", 100, table)  # Insert new
# complete_insertion_trace("bob", 250, table)    # Update existing

We've traced the complete journey of hash table insertion from key to stored value. Let's consolidate the essential knowledge:
The destination bucket is always computed as hash(key) % capacity. This must be deterministic for correct retrieval.

What's Next:
Now that you understand insertion, the next page explores the symmetric operation: lookup. We'll trace how the hash table finds values by key, handling the same collision scenarios but with different decision logic for not-found cases.
You can now trace hash table insertion step by step, from initial hashing through collision resolution to final storage. This detailed understanding of "what happens when" is essential for debugging, optimization, and designing efficient systems.