Every dynamic hash table eventually faces a critical moment: the load factor has crossed its threshold, performance is degrading, and the only solution is to rebuild the entire table from scratch. This operation—called rehashing (also known as resizing, growing, or table doubling)—is one of the most important behind-the-scenes operations in computer science.
Rehashing is counterintuitive at first. We're expending O(n) work just to maintain the O(1) operations we love. Why would an operation that processes every single element preserve constant-time efficiency? The answer lies in one of the most elegant concepts in algorithm analysis: amortized cost.
This page examines rehashing from every angle: when to trigger it, exactly how it works, the tradeoffs involved, and how to implement it correctly.
By the end of this page, you will understand: (1) The exact conditions that trigger rehashing, (2) The step-by-step rehashing algorithm, (3) Why elements must be re-inserted (not just moved), (4) Growth factor choices and their implications, (5) Shrinking strategies for memory reclamation, and (6) Implementation patterns for both chaining and open addressing.
Rehashing is triggered by specific conditions related to load factor. Understanding these triggers helps you predict when resizing will occur and plan accordingly.
Example trigger calculation:
HashMap with m = 16 buckets and α_max = 0.75: the resize threshold is 16 × 0.75 = 12 elements, so the insertion that would bring the count to 13 triggers a resize.
Secondary triggers:
Some implementations consider additional conditions (a combined check is sketched after this list):
Maximum chain length (chaining): If any bucket's chain exceeds a threshold (e.g., 8 in Java), restructure that bucket into a tree or trigger resize.
Probe sequence length (open addressing): If an insertion requires more than a threshold number of probes, resize proactively.
Memory pressure: External signals indicating memory scarcity might delay resizing or trigger shrinking.
Time-based amortization: Some systems schedule resizing during low-load periods rather than during peak insertions.
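To make these policies concrete, a resize decision that combines the primary load-factor check with a couple of the secondary signals above might look like the following sketch. The thresholds and the `longest_chain` / `memory_pressure` inputs are illustrative assumptions, not taken from any particular library.

```python
def should_resize(size: int, capacity: int, longest_chain: int,
                  memory_pressure: bool,
                  max_load: float = 0.75, max_chain: int = 8) -> bool:
    """Combine the primary load-factor trigger with secondary signals."""
    if memory_pressure:
        return False                          # external signal: postpone growth
    if (size + 1) / capacity > max_load:
        return True                           # primary trigger: load factor
    return longest_chain > max_chain          # secondary trigger: one chain grew too long


# Example: well under the load threshold, but one bucket has degenerated
print(should_resize(size=40, capacity=128, longest_chain=9, memory_pressure=False))  # True
```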
Implementations vary on whether they check the threshold before or after insertion. Checking before ensures the table never exceeds the threshold; checking after may briefly exceed it. For open addressing at high load factors, checking before is safer since even one insertion at near-capacity can cause expensive probing.
```python
class HashTableWithResize:
    """
    Demonstrates resize triggering logic.
    """
    MAX_LOAD_FACTOR = 0.75

    def __init__(self, initial_capacity: int = 16):
        self.capacity = initial_capacity
        self.size = 0
        self.buckets = [[] for _ in range(initial_capacity)]
        self.resize_count = 0

    @property
    def load_factor(self) -> float:
        return self.size / self.capacity

    @property
    def threshold(self) -> int:
        """Maximum elements before resize triggers."""
        return int(self.capacity * self.MAX_LOAD_FACTOR)

    def _should_resize(self) -> bool:
        """Check if insertion would exceed load factor threshold."""
        return self.size + 1 > self.threshold

    def insert(self, key, value):
        """Insert with automatic resize triggering."""
        # Check threshold BEFORE insertion
        if self._should_resize():
            print(f"⚡ Resize triggered! size={self.size}, capacity={self.capacity}, "
                  f"α={self.load_factor:.3f}")
            self._resize()

        # Now insert (simplified - ignoring collision handling)
        bucket_idx = hash(key) % self.capacity
        self.buckets[bucket_idx].append((key, value))
        self.size += 1

    def _resize(self):
        """Double capacity and rehash all elements."""
        old_buckets = self.buckets
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        self.size = 0
        self.resize_count += 1

        # Re-insert all elements
        for bucket in old_buckets:
            for key, value in bucket:
                bucket_idx = hash(key) % self.capacity
                self.buckets[bucket_idx].append((key, value))
                self.size += 1

        print(f" ✓ Resize complete: new capacity={self.capacity}, α={self.load_factor:.3f}")

    def status(self):
        print(f"Size: {self.size}, Capacity: {self.capacity}, "
              f"Load Factor: {self.load_factor:.3f}, Resizes: {self.resize_count}")


# Demonstrate resize triggers
table = HashTableWithResize(initial_capacity=8)
print("Inserting 30 elements into table with initial capacity 8...")
print(f"Threshold: {table.threshold} elements (α_max = 0.75)\n")

for i in range(30):
    table.insert(f"key_{i}", i)

print("\nFinal status:")
table.status()

# Output:
# Inserting 30 elements into table with initial capacity 8...
# Threshold: 6 elements (α_max = 0.75)
#
# ⚡ Resize triggered! size=6, capacity=8, α=0.750
#  ✓ Resize complete: new capacity=16, α=0.375
# ⚡ Resize triggered! size=12, capacity=16, α=0.750
#  ✓ Resize complete: new capacity=32, α=0.375
# ⚡ Resize triggered! size=24, capacity=32, α=0.750
#  ✓ Resize complete: new capacity=64, α=0.375
#
# Final status:
# Size: 30, Capacity: 64, Load Factor: 0.469, Resizes: 3
```

Rehashing appears simple in concept—make the table bigger—but the implementation requires careful attention to detail. Here's the complete algorithm:
A common misconception is that we can simply copy elements to corresponding positions in the larger array. This is wrong! Hash positions depend on capacity via hash % capacity. When capacity changes, elements hash to different buckets. An element in bucket 5 (hash = 21, capacity = 16, 21 % 16 = 5) moves to bucket 21 after doubling (hash = 21, capacity = 32, 21 % 32 = 21). Every element must be re-hashed and re-inserted.
Visualizing position changes:
Consider elements with hash values 5, 21, 37, 53 in a table of capacity 16:
| Hash | Old Position (hash % 16) | New Position (hash % 32) |
|---|---|---|
| 5 | 5 | 5 |
| 21 | 5 | 21 |
| 37 | 5 | 5 |
| 53 | 5 | 21 |
Notice: These four elements all collided at bucket 5 in the old table. After resize, they split between buckets 5 and 21. This redistribution is exactly the goal—reducing chain lengths by spreading elements across more buckets.
Elegant property of power-of-2 doubling:
When capacity doubles from m to 2m, each element either stays at its current index i or moves to index i + m, depending on whether the hash bit with value m is 0 or 1.
This means we can determine new positions with a single bit check, which some implementations exploit for efficiency.
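A quick way to verify this, reusing the hash values from the table above (a standalone check, assuming power-of-two capacities):

```python
m = 16                       # old capacity (a power of two)
for h in (5, 21, 37, 53):    # hash values from the table above
    old_idx = h % m
    new_idx = h % (2 * m)
    # The deciding bit is the one with value m: if it is 0, the element stays;
    # if it is 1, the element moves to old_idx + m.
    predicted = old_idx if (h & m) == 0 else old_idx + m
    print(h, old_idx, new_idx, predicted == new_idx)   # always True
```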
```typescript
interface Entry<K, V> {
  key: K;
  value: V;
  hash: number; // Cached hash for efficiency
}

class HashTableWithDetailedRehash<K, V> {
  private buckets: Array<Entry<K, V>[]>;
  private _capacity: number;
  private _size: number;
  private readonly maxLoadFactor = 0.75;

  constructor(initialCapacity: number = 16) {
    this._capacity = initialCapacity;
    this._size = 0;
    this.buckets = Array.from({ length: initialCapacity }, () => []);
  }

  get capacity(): number { return this._capacity; }
  get size(): number { return this._size; }
  get loadFactor(): number { return this._size / this._capacity; }

  private hash(key: K): number {
    // Simplified hash - real implementation would be more robust
    const str = String(key);
    let h = 0;
    for (let i = 0; i < str.length; i++) {
      h = (h * 31 + str.charCodeAt(i)) >>> 0;
    }
    return h;
  }

  private rehash(): void {
    console.log(`\n=== REHASH OPERATION ===`);
    console.log(`Before: capacity=${this._capacity}, size=${this._size}, α=${this.loadFactor.toFixed(3)}`);

    // Step 1: Store old buckets reference
    const oldBuckets = this.buckets;
    const oldCapacity = this._capacity;

    // Step 2: Allocate new bucket array (2x capacity)
    this._capacity *= 2;
    this.buckets = Array.from({ length: this._capacity }, () => []);

    // Step 3: Reset size (will recount)
    this._size = 0;

    // Step 4: Re-insert every element
    let movedCount = 0;
    let stayedCount = 0;

    for (let i = 0; i < oldCapacity; i++) {
      for (const entry of oldBuckets[i]) {
        // Compute NEW position (hash % newCapacity)
        const newBucketIdx = entry.hash % this._capacity;

        // Track if element moved
        if (newBucketIdx === i) {
          stayedCount++;
        } else {
          movedCount++;
        }

        // Insert into new bucket
        this.buckets[newBucketIdx].push(entry);
        this._size++;
      }
    }

    // Step 5: Old array released automatically (no reference)

    console.log(`After: capacity=${this._capacity}, size=${this._size}, α=${this.loadFactor.toFixed(3)}`);
    console.log(`Elements stayed: ${stayedCount}, moved: ${movedCount}`);
    console.log(`=== REHASH COMPLETE ===\n`);
  }

  set(key: K, value: V): void {
    // Check if resize needed before insertion
    if ((this._size + 1) / this._capacity > this.maxLoadFactor) {
      this.rehash();
    }

    const h = this.hash(key);
    const bucketIdx = h % this._capacity;

    // Check if key exists (update case)
    const bucket = this.buckets[bucketIdx];
    for (const entry of bucket) {
      if (entry.key === key) {
        entry.value = value;
        return;
      }
    }

    // New key - insert
    bucket.push({ key, value, hash: h });
    this._size++;
  }
}

// Demonstration
const table = new HashTableWithDetailedRehash<string, number>(8);

console.log("Inserting 15 elements...");
for (let i = 0; i < 15; i++) {
  table.set(`key${i}`, i);
}

console.log(`Final state: capacity=${table.capacity}, size=${table.size}, α=${table.loadFactor.toFixed(3)}`);
```

When resizing, by how much should we increase capacity? This growth factor decision has significant implications for both time efficiency and memory usage.
| Growth Factor | Pros | Cons | Used By |
|---|---|---|---|
| 2× (doubling) | Fewer resizes, simple bit operations, predictable | Memory doubles (+100%) | Java HashMap, Python dict |
| 1.5× | Less memory waste (+50%) | More frequent resizes, no bit tricks | Some C++ implementations |
| φ ≈ 1.618 (golden ratio) | Theoretical elegance, memory reuse possible | Complex, marginal benefits | Academic interest |
| Additive (+k) | Constant overhead | Too many resizes for growing tables | Rarely used alone |
Why doubling is dominant:
Amortized O(1) insertions: With doubling, the total work for n insertions is O(n), giving O(1) amortized per insert. Smaller growth factors also achieve O(1) amortized, but with higher constants (see the sketch after this list).
Power-of-2 capacity enables bit masking: hash % (2^k) = hash & (2^k - 1). Bit-AND is faster than modulo division on most processors.
Predictable half-empty tables: After doubling, load factor drops to α/2. If threshold is 0.75, post-resize α is ~0.375—comfortably far from threshold.
Simple implementation: Multiply by 2 or left-shift by 1. No floating-point or complex calculations.
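The sketch below illustrates the first two points: a counter-only simulation of the total copying work that doubling performs over n insertions, and the mask/modulo equivalence for a power-of-two capacity. The initial capacity and threshold are illustrative assumptions.

```python
def total_copies(n: int, initial_capacity: int = 8, max_load: float = 0.75) -> int:
    """Count how many element copies the resizes perform during n insertions."""
    size, capacity, copies = 0, initial_capacity, 0
    for _ in range(n):
        if (size + 1) / capacity > max_load:
            copies += size        # a resize re-inserts every existing element
            capacity *= 2
        size += 1
    return copies

n = 1_000_000
print(total_copies(n) / n)        # ≈ 1.57 copies per insertion: bounded, so O(1) amortized

capacity = 64                     # power of two, so modulo reduces to a bit mask
h = 123_456_789
print(h % capacity == h & (capacity - 1))   # True
```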
When non-doubling makes sense: when memory is at a premium, a smaller factor such as 1.5× trades more frequent resizes for a lower peak memory overhead, and growth factors below the golden ratio can, with some allocators, let a new array reuse memory freed by earlier, smaller arrays.
With α_max = 0.75 and doubling, hash tables typically use about 1.33× to 2.67× the memory of the elements themselves, depending on where in the growth cycle they are: just before a resize the table holds n ≈ 0.75m elements (m/n ≈ 1.33), and just after doubling it holds n ≈ 0.375m elements (m/n ≈ 2.67). This overhead is the price of O(1) operations. For applications where memory is more precious than CPU cycles, consider higher load factor thresholds or smaller growth factors.
Most discussions focus on growing hash tables, but shrinking is equally important for long-lived applications where tables grow and then shrink. Without shrinking, memory from deleted elements is never reclaimed.
When to shrink:
Common strategies for triggering shrink operations:
Load factor minimum threshold: If α drops below α_min (e.g., 0.25), halve the capacity.
Hysteresis window: Only shrink when α < α_min for a sustained period or multiple deletions, avoiding thrashing.
Explicit shrink-to-fit: Provide a method for users to explicitly request compaction.
Never shrink: Some implementations (early Python dicts) never shrink automatically—users must rebuild.
The thrashing problem:
If α_max = 0.75 and α_min = 0.375 (exactly half of the grow threshold), a freshly doubled table sits right at the shrink boundary: a workload that oscillates a couple of elements around this point forces repeated grow-shrink cycles, each one an O(n) rebuild. This is thrashing, and it destroys performance.
Solution: Hysteresis
Set α_min significantly below α_max / growth_factor: with α_max = 0.75 and doubling, a freshly grown table has α ≈ 0.375, so choosing α_min = 0.25 (or lower) leaves a comfortable gap before any shrink can follow a grow.
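To see the difference in numbers, here is a small counter-only simulation (no actual table, thresholds as above): it replays a workload that oscillates a couple of elements around the grow boundary, first with α_min = 0.375 and then with α_min = 0.25.

```python
def count_resizes(alpha_min: float, n_cycles: int = 1000,
                  alpha_max: float = 0.75, min_capacity: int = 16) -> int:
    """Count resizes for a workload oscillating around the grow boundary."""
    size, capacity, resizes = 13, 32, 0        # state just after a grow (13 of 32)
    for _ in range(n_cycles):
        for op in ("delete", "delete", "insert", "insert"):
            if op == "insert":
                if (size + 1) / capacity > alpha_max:   # grow check before insert
                    capacity *= 2
                    resizes += 1
                size += 1
            else:
                size -= 1
                if capacity > min_capacity and size / capacity < alpha_min:
                    capacity //= 2                      # shrink check after delete
                    resizes += 1
    return resizes

print(count_resizes(alpha_min=0.375))   # 2000 (a grow AND a shrink every 4 operations)
print(count_resizes(alpha_min=0.25))    # 0    (the oscillation never crosses a threshold)
```

The α_max = 0.75 / α_min = 0.25 gap used in this simulation is the same one the full implementation below applies.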
```python
class HashTableWithShrinking:
    """
    Hash table that both grows and shrinks to maintain memory efficiency.
    """
    MAX_LOAD_FACTOR = 0.75   # Grow threshold
    MIN_LOAD_FACTOR = 0.25   # Shrink threshold (with hysteresis gap)
    MIN_CAPACITY = 16        # Never shrink below this

    def __init__(self, initial_capacity: int = 16):
        self.capacity = max(initial_capacity, self.MIN_CAPACITY)
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    @property
    def load_factor(self) -> float:
        return self.size / self.capacity

    def _should_grow(self) -> bool:
        return (self.size + 1) / self.capacity > self.MAX_LOAD_FACTOR

    def _should_shrink(self) -> bool:
        # Only shrink if above minimum capacity and below minimum load factor
        return (self.capacity > self.MIN_CAPACITY and
                self.load_factor < self.MIN_LOAD_FACTOR)

    def _resize(self, new_capacity: int) -> None:
        action = "Growing" if new_capacity > self.capacity else "Shrinking"
        print(f"{action}: {self.capacity} → {new_capacity} "
              f"(α: {self.load_factor:.3f} → {self.size/new_capacity:.3f})")

        old_buckets = self.buckets
        self.capacity = new_capacity
        self.buckets = [[] for _ in range(new_capacity)]
        self.size = 0

        for bucket in old_buckets:
            for key, value in bucket:
                bucket_idx = hash(key) % self.capacity
                self.buckets[bucket_idx].append((key, value))
                self.size += 1

    def insert(self, key, value) -> None:
        if self._should_grow():
            self._resize(self.capacity * 2)

        bucket_idx = hash(key) % self.capacity
        self.buckets[bucket_idx].append((key, value))
        self.size += 1

    def delete(self, key) -> bool:
        bucket_idx = hash(key) % self.capacity
        bucket = self.buckets[bucket_idx]

        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1

                # Check shrink condition after deletion
                if self._should_shrink():
                    self._resize(self.capacity // 2)

                return True
        return False

    def status(self) -> str:
        return f"size={self.size}, capacity={self.capacity}, α={self.load_factor:.3f}"


# Demonstration
table = HashTableWithShrinking(16)

# Grow phase
print("=== GROWTH PHASE ===")
for i in range(50):
    table.insert(f"key_{i}", i)
print(f"After 50 insertions: {table.status()}")

# Shrink phase
print("\n=== SHRINK PHASE ===")
for i in range(45):
    table.delete(f"key_{i}")
print(f"After 45 deletions: {table.status()}")

# Output:
# === GROWTH PHASE ===
# Growing: 16 → 32 (α: 0.750 → 0.375)
# Growing: 32 → 64 (α: 0.750 → 0.375)
# Growing: 64 → 128 (α: 0.750 → 0.375)
# After 50 insertions: size=50, capacity=128, α=0.391
#
# === SHRINK PHASE ===
# Shrinking: 128 → 64 (α: 0.242 → 0.484)
# Shrinking: 64 → 32 (α: 0.234 → 0.469)
# Shrinking: 32 → 16 (α: 0.219 → 0.438)
# After 45 deletions: size=5, capacity=16, α=0.312
```

Many production implementations don't shrink automatically because: (1) Memory freed may not be returned to the OS immediately anyway, (2) Re-growth is likely in many applications, (3) Shrinking is O(n) work that may not be worth it. Consider providing explicit shrink_to_fit() methods instead of automatic shrinking.
Rehashing in open addressing systems has additional complexities compared to chaining, primarily around tombstones and probe sequence integrity.
The tombstone problem:
In open addressing, deletion can't simply remove an element—doing so would break probe sequences for other elements. Instead, deleted slots are marked with a tombstone (special marker indicating 'deleted but probeable').
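To see why plain removal breaks lookups, here is a toy, self-contained linear-probing example (the keys and slot numbers are chosen purely for illustration):

```python
# Keys "A" and "B" both hash to slot 3: "A" occupies slot 3, "B" probed on to slot 4.
slots = {3: "A", 4: "B"}           # slot index -> key; missing indices are EMPTY

def lookup(key, table, home=3, capacity=8):
    """Linear probing that stops at the first EMPTY slot."""
    i = home
    while i in table:              # a present index means the slot is not EMPTY
        if table[i] == key:
            return i
        i = (i + 1) % capacity
    return None                    # reached an EMPTY slot: give up

print(lookup("B", slots))          # 4: found by probing past "A"
del slots[3]                       # delete "A" by truly emptying its slot
print(lookup("B", slots))          # None: "B" is now unreachable!
# A tombstone in slot 3 would keep the probe going, so "B" would still be found.
```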
Over time, tombstones accumulate: they keep occupying slots, so probe sequences stay long and the effective load factor (occupied plus tombstoned slots) keeps climbing even though the number of living elements may be small or shrinking.
Rehashing clears tombstones:
When rehashing an open addressing table, only occupied slots are carried over: each living element is re-inserted into the new array, while tombstones are simply discarded. The rebuilt table starts with clean, minimal probe sequences.
This makes rehashing especially valuable in open addressing—it's not just about capacity, but about table health.
```python
from enum import Enum


class SlotState(Enum):
    EMPTY = "empty"
    OCCUPIED = "occupied"
    TOMBSTONE = "tombstone"


class OpenAddressedHashTable:
    """
    Open addressing hash table with tombstone handling during rehash.
    """
    MAX_LOAD_FACTOR = 0.6   # Lower threshold for open addressing

    def __init__(self, capacity: int = 16):
        self.capacity = capacity
        self.size = 0         # Living elements
        self.tombstones = 0   # Deleted elements (tombstones)
        self.slots = [(SlotState.EMPTY, None, None)] * capacity

    @property
    def load_factor(self) -> float:
        return self.size / self.capacity

    @property
    def effective_load(self) -> float:
        """Load including tombstones - affects probe sequence length."""
        return (self.size + self.tombstones) / self.capacity

    def _probe(self, key, for_insert: bool = False) -> int:
        """Linear probing. Returns slot index."""
        h = hash(key) % self.capacity
        first_tombstone = -1

        for i in range(self.capacity):
            idx = (h + i) % self.capacity
            state, k, v = self.slots[idx]

            if state == SlotState.EMPTY:
                # For insert: use first tombstone if found, else this empty slot
                return first_tombstone if first_tombstone >= 0 and for_insert else idx
            elif state == SlotState.TOMBSTONE:
                if first_tombstone < 0:
                    first_tombstone = idx
            elif k == key:
                return idx  # Found the key

        return first_tombstone if first_tombstone >= 0 else -1

    def _rehash(self) -> None:
        print(f"\n=== REHASH (Open Addressing) ===")
        print(f"Before: capacity={self.capacity}, size={self.size}, "
              f"tombstones={self.tombstones}, effective_α={self.effective_load:.3f}")

        old_slots = self.slots
        self.capacity *= 2
        self.slots = [(SlotState.EMPTY, None, None)] * self.capacity

        # Reset counters
        old_tombstones = self.tombstones
        self.size = 0
        self.tombstones = 0  # Tombstones are NOT transferred!

        # Re-insert only living elements
        transferred = 0
        for state, key, value in old_slots:
            if state == SlotState.OCCUPIED:
                idx = self._probe(key, for_insert=True)
                self.slots[idx] = (SlotState.OCCUPIED, key, value)
                self.size += 1
                transferred += 1

        print(f"After: capacity={self.capacity}, size={self.size}, "
              f"tombstones={self.tombstones}, effective_α={self.effective_load:.3f}")
        print(f"Transferred {transferred} elements, discarded {old_tombstones} tombstones")
        print(f"=== REHASH COMPLETE ===\n")

    def insert(self, key, value) -> None:
        # Simplified: assumes the key is not already present
        if (self.size + 1) / self.capacity > self.MAX_LOAD_FACTOR:
            self._rehash()

        idx = self._probe(key, for_insert=True)
        state, _, _ = self.slots[idx]
        if state == SlotState.TOMBSTONE:
            self.tombstones -= 1  # Reusing a tombstone

        self.slots[idx] = (SlotState.OCCUPIED, key, value)
        self.size += 1

    def delete(self, key) -> bool:
        idx = self._probe(key)
        state, k, _ = self.slots[idx]

        if state == SlotState.OCCUPIED and k == key:
            self.slots[idx] = (SlotState.TOMBSTONE, None, None)
            self.size -= 1
            self.tombstones += 1
            return True
        return False

    def status(self) -> str:
        return (f"size={self.size}, tombstones={self.tombstones}, "
                f"capacity={self.capacity}, α={self.load_factor:.3f}, "
                f"effective_α={self.effective_load:.3f}")


# Demonstration of tombstone cleanup
table = OpenAddressedHashTable(16)

# Insert then delete many elements (creates tombstones)
print("Creating tombstones through insert/delete cycles...")
for i in range(20):
    table.insert(f"temp_{i}", i)
for i in range(15):
    table.delete(f"temp_{i}")

print(f"Status after deletions: {table.status()}")
print("Notice: tombstones bloat effective load factor!")

# Trigger rehash
print("\nTriggering rehash by inserting more elements...")
for i in range(35):  # enough insertions to push size past 0.6 × 64 and trigger a rehash
    table.insert(f"new_{i}", i)

print(f"Final status: {table.status()}")
print("Notice: tombstones were cleared by rehash!")
```

Some implementations trigger rehashing based on tombstone density, not just load factor. If tombstones exceed a threshold (e.g., 25% of capacity), rehash even if the living element count is low. This prevents pathological probe sequences in delete-heavy workloads.
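A sketch of that combined trigger (the 25% figure is the illustrative threshold from the note above, not a standard constant):

```python
def should_rehash(size: int, tombstones: int, capacity: int,
                  max_load: float = 0.6, max_tombstone_ratio: float = 0.25) -> bool:
    """Rehash on high living load OR on high tombstone density."""
    if (size + 1) / capacity > max_load:
        return True                                       # primary: living elements
    return tombstones / capacity > max_tombstone_ratio    # secondary: tombstone bloat


# Few living elements, but a delete-heavy history left many tombstones:
print(should_rehash(size=5, tombstones=20, capacity=64))  # True: rehash to clean up
```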
In multi-threaded environments, rehashing introduces significant complexity. While deep concurrent hash table implementation is beyond our scope, understanding the key challenges is valuable: a full rehash is an O(n) pause during which the table effectively exists in two versions (old and new arrays), and concurrent readers and writers must see a consistent view without blocking for the entire rebuild.
Incremental rehashing example (Redis-style): instead of migrating everything at once, the table keeps both the old and the new bucket arrays while a resize is in progress. Each subsequent insert, lookup, or delete migrates a small batch of buckets from the old array to the new one, and lookups consult both arrays until the migration finishes.
This spreads O(n) work across O(n) operations, maintaining O(1) per-operation cost without blocking.
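Here is a minimal, single-threaded sketch of that idea for a chaining table (a simplified illustration, not Redis's actual C implementation; updates and deletes are omitted for brevity): during a resize it keeps both bucket arrays and migrates one old bucket per operation.

```python
class IncrementalRehashTable:
    """Chaining table that migrates one old bucket per operation during a resize."""

    MAX_LOAD_FACTOR = 0.75

    def __init__(self, capacity: int = 8):
        self.buckets = [[] for _ in range(capacity)]
        self.old_buckets = None     # non-None only while a rehash is in progress
        self.migrate_idx = 0        # next old bucket to migrate
        self.size = 0

    def _start_rehash(self) -> None:
        self.old_buckets = self.buckets
        self.buckets = [[] for _ in range(2 * len(self.old_buckets))]
        self.migrate_idx = 0

    def _step(self) -> None:
        """Migrate one bucket from the old array, if a rehash is in progress."""
        if self.old_buckets is None:
            return
        for key, value in self.old_buckets[self.migrate_idx]:
            self.buckets[hash(key) % len(self.buckets)].append((key, value))
        self.old_buckets[self.migrate_idx] = []
        self.migrate_idx += 1
        if self.migrate_idx == len(self.old_buckets):
            self.old_buckets = None              # migration finished

    def get(self, key):
        self._step()
        if self.old_buckets is not None:         # still migrating: check the old array too
            for k, v in self.old_buckets[hash(key) % len(self.old_buckets)]:
                if k == key:
                    return v
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

    def insert(self, key, value) -> None:
        self._step()
        if self.old_buckets is None and self.size + 1 > self.MAX_LOAD_FACTOR * len(self.buckets):
            self._start_rehash()                 # begin a resize without migrating everything now
        self.buckets[hash(key) % len(self.buckets)].append((key, value))  # new entries go to the new array
        self.size += 1


# Usage: no single insert ever pays for a full O(n) rebuild
t = IncrementalRehashTable()
for i in range(100):
    t.insert(f"key_{i}", i)
print(t.get("key_3"), t.get("key_99"))   # 3 99
```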
If you're building a multi-threaded application, you probably shouldn't implement your own concurrent hash table—use battle-tested implementations like Java's ConcurrentHashMap, Go's sync.Map, or concurrent hashmaps from your language's standard library. Understanding these challenges helps you choose the right tool and reason about performance characteristics.
We've comprehensively covered when and how rehashing works. Here are the essential takeaways:
Triggers: rehashing fires when the load factor crosses α_max (commonly 0.75 for chaining, lower for open addressing); some implementations also react to chain length, probe length, or tombstone density.
Mechanics: resizing allocates a new bucket array and re-inserts every element, because bucket positions depend on the current capacity.
Growth factor: doubling dominates because it keeps insertions O(1) amortized, enables bit-mask indexing, and leaves the new table comfortably below the threshold.
Shrinking: reclaiming memory after heavy deletion requires a hysteresis gap between α_min and α_max to avoid thrashing.
Open addressing: rehashing also discards tombstones, restoring short probe sequences.
What's next:
The final page addresses a crucial question: if rehashing is O(n), how can we claim hash table operations are O(1)? The answer—amortized analysis—reveals one of the most elegant arguments in algorithm analysis. We'll prove that the total cost of n insertions with dynamic resizing is O(n), yielding O(1) amortized cost per operation.
You now understand the complete mechanics of rehashing: when it triggers, how it works, and the design decisions involved. This operational understanding prepares you for the amortized analysis that justifies hash tables' O(1) claims despite expensive resizing operations.