Most discussions of dynamic hashing focus on growth—how to add buckets as data volumes increase. But real-world databases aren't monotonically growing. Data is deleted, archived, or purged. Tables that once held millions of records may shrink to thousands after cleanup operations.
The problem with growth-only approaches:
Imagine a hash index that grew to 10,000 buckets during peak usage. After a major data purge, only 1,000 records remain—roughly one record per 10 buckets. The index wastes 90% of its allocated space. Worse, the sparse distribution may cause unnecessary I/O: scanning an index with many empty buckets is inefficient.
Shrinkage handling addresses this by enabling hash indexes to consolidate underutilized buckets, reducing both space consumption and search overhead.
By the end of this page, you will understand the complete mechanics of shrinkage handling in dynamic hashing—from detecting underutilization through executing bucket merges to maintaining index integrity. You'll see why shrinkage is harder than growth and why many production systems don't implement it fully.
At first glance, shrinkage seems like growth in reverse: where growth splits buckets, shrinkage merges them. But several factors make shrinkage fundamentally more complex:
1. Asymmetric Triggering:
Growth has a clear trigger: bucket overflow. An insertion that can't fit forces immediate action. But when should shrinkage occur? A half-empty bucket isn't broken—it still works correctly. There's no urgent pressure to merge like there is to split.
2. Merge Partner Identification:
When splitting a bucket addressed by prefix P, records naturally separate into P0 and P1. The reverse isn't automatically available: to merge P0 and P1 back into P, we need to find and access both buckets. If they have different local depths or if one doesn't exist, merging becomes complex.
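The P0/P1 pairing can be made concrete with a tiny helper (a sketch; the function name is ours, not from any particular system). Treating a bucket's prefix as an integer whose low bit is the final prefix bit, the sibling is found by flipping that bit; a merge is only possible when both siblings actually exist at the same local depth.

```python
def sibling_prefix(prefix: int, local_depth: int) -> int:
    """Return the prefix of the bucket's merge partner.

    P0 and P1 differ only in the final prefix bit, which is the
    low bit when the prefix is stored as a local_depth-bit integer.
    """
    if local_depth == 0:
        raise ValueError("a depth-0 bucket covers everything and has no sibling")
    return prefix ^ 1

# With local depth 3, bucket 010 (P = 01, then 0) pairs with 011 (P = 01, then 1)
assert sibling_prefix(0b010, 3) == 0b011
assert sibling_prefix(0b011, 3) == 0b010
```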
3. Concurrency Complications:
Merging requires coordinating access to two buckets simultaneously, not just one. This doubles the locking complexity and increases deadlock potential.
4. Directory Halving Overhead:
In Extendible Hashing, directory doubling is relatively cheap—we just duplicate pointers. But directory halving requires verifying that all bucket pairs share the same target bucket before we can safely reduce the directory size.
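That validation can be sketched in a few lines (the helper name is hypothetical). This sketch assumes the common least-significant-bit convention, where doubling appends a copy of the directory, so halving requires entry `i` and entry `i + half` to point to the same bucket for every `i`; a most-significant-bit scheme would instead pair adjacent entries.

```python
def can_halve_directory(bucket_ids: list) -> bool:
    """Check whether an extendible-hashing directory can be halved.

    bucket_ids[i] is the id of the bucket that directory slot i points to.
    Under LSB indexing, slots i and i + half collapse into one slot when
    the directory shrinks, so they must already agree.
    """
    half = len(bucket_ids) // 2
    if half < 1:
        return False
    return all(bucket_ids[i] == bucket_ids[i + half] for i in range(half))

assert can_halve_directory([7, 9, 7, 9])      # slots (0,2) and (1,3) match
assert not can_halve_directory([7, 9, 7, 8])  # slot 3 disagrees: cannot halve
```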
| Aspect | Growth (Split) | Shrinkage (Merge) |
|---|---|---|
| Trigger clarity | Unambiguous (overflow) | Subjective (how empty?) |
| Urgency | Immediate (can't insert) | Deferrable (still works) |
| Scope | Single bucket | Two buckets must coordinate |
| Partner finding | Creates partner (new bucket) | Must locate existing partner |
| Directory change | Doubling is pointer copy | Halving requires validation |
| Failure mode | Must succeed for insert | Can skip without data loss |
Due to these complexities, many production hash index implementations don't support automatic shrinkage. Instead, they rely on periodic index rebuilding or administrative space reclamation commands. This is a practical trade-off that prioritizes implementation simplicity and predictable behavior.
Unlike overflow detection, which has an absolute threshold (bucket is full), underutilization detection requires choosing arbitrary thresholds. Different strategies offer different trade-offs:
Threshold-Based Detection:
The simplest approach: trigger merge consideration when a bucket falls below a certain fill percentage.
```python
from dataclasses import dataclass
from typing import List, Tuple
from enum import Enum


class ShrinkagePolicy(Enum):
    """Different policies for triggering shrinkage."""
    THRESHOLD = "threshold"      # Merge when below fill percentage
    COMBINED = "combined"        # Both buckets together below threshold
    LOAD_FACTOR = "load_factor"  # Global load factor too low
    HYSTERESIS = "hysteresis"    # Different thresholds for grow/shrink


@dataclass
class UnderutilizationDetector:
    """
    Detect when buckets should be considered for merging.

    Key insight: Shrinkage thresholds should be lower than split
    thresholds to prevent oscillation (split-merge-split cycles).
    """
    bucket_capacity: int = 100

    # Threshold for single bucket underutilization
    single_bucket_threshold: float = 0.25  # 25% or less triggers check

    # Threshold for combined (merged) bucket viability
    combined_threshold: float = 0.70  # Merged bucket must not exceed 70%

    # Global load factor floor
    global_load_floor: float = 0.30  # Shrink when overall < 30%

    def check_single_bucket(self, record_count: int) -> bool:
        """
        Check if a single bucket is underutilized.
        Returns True if the bucket is sparse enough to consider merging.
        """
        fill_ratio = record_count / self.bucket_capacity
        return fill_ratio <= self.single_bucket_threshold

    def check_merge_viability(self, bucket_a_count: int,
                              bucket_b_count: int) -> Tuple[bool, str]:
        """
        Check if two buckets can be safely merged.

        Returns (can_merge, reason)
        """
        combined_count = bucket_a_count + bucket_b_count
        if combined_count > self.bucket_capacity:
            ratio = combined_count / self.bucket_capacity
            return False, f"Would overflow ({ratio:.1%} of capacity)"
        combined_ratio = combined_count / self.bucket_capacity
        if combined_ratio > self.combined_threshold:
            return False, (f"Would be too full ({combined_ratio:.1%} > "
                           f"{self.combined_threshold:.1%})")
        return True, f"Safe to merge ({combined_ratio:.1%} of capacity)"

    def check_global_underutilization(self, total_records: int,
                                     total_buckets: int) -> bool:
        """
        Check if the entire index is underutilized.
        Used by Linear Hashing-style global shrinkage.
        """
        if total_buckets == 0:
            return False
        load_factor = total_records / (total_buckets * self.bucket_capacity)
        return load_factor < self.global_load_floor

    def analyze_shrinkage_potential(self, buckets: List[int],  # record counts
                                    bucket_depths: List[int]) -> dict:
        """
        Analyze an index to find shrinkage opportunities.
        Returns statistics about underutilization and merge candidates.
        """
        underutilized = []
        merge_candidates = []
        total_records = sum(buckets)
        total_capacity = len(buckets) * self.bucket_capacity

        # Find underutilized buckets
        for i, count in enumerate(buckets):
            if self.check_single_bucket(count):
                underutilized.append(i)

        # Find merge candidates (buckets that could merge with a sibling).
        # In practice, sibling identification depends on the scheme;
        # here we assume adjacent buckets with the same depth are siblings.
        for i in range(len(buckets) - 1):
            if bucket_depths[i] == bucket_depths[i + 1]:
                can_merge, _ = self.check_merge_viability(buckets[i], buckets[i + 1])
                if can_merge:
                    merge_candidates.append((i, i + 1))

        return {
            "total_buckets": len(buckets),
            "total_records": total_records,
            "overall_utilization": total_records / total_capacity if total_capacity else 0,
            "underutilized_buckets": len(underutilized),
            "underutilized_indices": underutilized,
            "merge_candidates": merge_candidates,
            "potential_space_savings": len(merge_candidates) * self.bucket_capacity,
            "global_shrinkage_recommended": self.check_global_underutilization(
                total_records, len(buckets)
            ),
        }


# Example usage
detector = UnderutilizationDetector(bucket_capacity=100)

# Analyze an underutilized index
bucket_counts = [15, 10, 5, 20, 25, 8, 12, 18]  # Very sparse
bucket_depths = [3, 3, 3, 3, 3, 3, 3, 3]        # All same depth

analysis = detector.analyze_shrinkage_potential(bucket_counts, bucket_depths)

print("Shrinkage Analysis:")
print(f"  Overall utilization: {analysis['overall_utilization']:.1%}")
print(f"  Underutilized buckets: {analysis['underutilized_buckets']} of {analysis['total_buckets']}")
print(f"  Merge candidates: {len(analysis['merge_candidates'])} pairs")
print(f"  Potential space savings: {analysis['potential_space_savings']} records worth")
print(f"  Global shrinkage recommended: {analysis['global_shrinkage_recommended']}")
```

A critical design principle: the threshold for merging should be significantly lower than the threshold for splitting. If we split at 80% full and merge when the combined bucket would be 80% full, we risk split-merge oscillation. Using 80% split / 50% merge thresholds creates a stable buffer zone.
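The buffer zone can be expressed directly (a sketch with hypothetical helper names): with an 80% split threshold and a 50% merge ceiling, a freshly split pair can never immediately re-qualify for merging.

```python
SPLIT_AT = 0.80   # split a bucket above this fill ratio
MERGE_AT = 0.50   # merge siblings only if the combined bucket stays at/below this

def should_split(count: int, capacity: int) -> bool:
    return count / capacity > SPLIT_AT

def should_merge(combined_count: int, capacity: int) -> bool:
    return combined_count / capacity <= MERGE_AT

# A bucket splits at 81/100. Its two halves together still hold 81 records,
# well above the 50% merge ceiling, so no immediate merge-back occurs.
assert should_split(81, 100)
assert not should_merge(81, 100)
# Only after deletions shrink the pair to <= 50 records does a merge qualify.
assert should_merge(50, 100)
```

With equal 80%/80% thresholds, the same pair would qualify for a merge the moment it split, and the next insertion would split it again; the 30-point gap is what breaks the cycle.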
Bucket merging is conceptually the reverse of splitting: two buckets with prefixes P0 and P1 consolidate into a single bucket with prefix P. But the implementation details differ significantly.
The Merge Algorithm:
1. Identify merge candidate (bucket below threshold)
2. Find sibling bucket (shares prefix P, differs only in last bit)
3. Verify combined records fit in one bucket
4. Copy all records to the surviving bucket
5. Update directory entries (point both prefixes to one bucket)
6. Deallocate the empty bucket
7. Optionally reduce local depth of merged bucket
```python
from typing import List, Tuple, Optional
from dataclasses import dataclass, field


@dataclass
class MergeableBucket:
    """Bucket supporting merge operations."""
    id: int
    local_depth: int
    records: List[tuple] = field(default_factory=list)
    capacity: int = 100

    def can_absorb(self, other: 'MergeableBucket') -> bool:
        """Check if this bucket can absorb another's records."""
        return len(self.records) + len(other.records) <= self.capacity

    def get_sibling_prefix(self) -> int:
        """
        Calculate the prefix of our sibling bucket.
        If our prefix is P with local_depth bits, the sibling has
        prefix P XOR 1 (flip the last bit of our prefix).
        """
        # Conceptual only - the actual prefix comes from directory position
        return -1  # Placeholder


class BucketMerger:
    """
    Implements bucket merge operations for dynamic hashing.

    Merging is more complex than splitting because we:
    1. Must find the sibling bucket
    2. Must verify both buckets exist and are compatible
    3. Must atomically update the directory and deallocate
    """

    def find_sibling(self, directory: List[MergeableBucket],
                     bucket_idx: int,
                     global_depth: int) -> Optional[Tuple[int, MergeableBucket]]:
        """
        Find the sibling bucket for a potential merge.

        The sibling shares all prefix bits except the last one
        used by the bucket's local depth.
        """
        bucket = directory[bucket_idx]
        if bucket.local_depth == 0:
            return None  # Can't merge at depth 0

        # Calculate the sibling's directory index: with most-significant-bit
        # indexing, flip the bit at position (local_depth - 1) from the left
        flip_bit = 1 << (global_depth - bucket.local_depth)
        sibling_idx = bucket_idx ^ flip_bit
        if sibling_idx >= len(directory):
            return None
        sibling = directory[sibling_idx]

        # Sibling must have the same local depth to merge
        if sibling.local_depth != bucket.local_depth:
            return None

        # Sibling must be a different bucket
        # (not the same bucket reached via shared directory pointers)
        if sibling.id == bucket.id:
            return None

        return sibling_idx, sibling

    def merge_buckets(self, bucket: MergeableBucket,
                      sibling: MergeableBucket) -> MergeableBucket:
        """
        Merge two sibling buckets into one.

        All records from sibling move to bucket. Sibling becomes
        empty and can be deallocated.
        """
        if not bucket.can_absorb(sibling):
            raise ValueError("Cannot merge: combined records exceed capacity")

        # Move all records from sibling to bucket
        bucket.records.extend(sibling.records)
        sibling.records.clear()

        # Decrease local depth (we're using one fewer bit)
        bucket.local_depth -= 1

        print(f"Merged bucket {sibling.id} into bucket {bucket.id}")
        print(f"  New local depth: {bucket.local_depth}")
        print(f"  Total records: {len(bucket.records)}")
        return bucket

    def update_directory_for_merge(self, directory: List[MergeableBucket],
                                   surviving_bucket: MergeableBucket,
                                   empty_bucket: MergeableBucket) -> int:
        """
        Update directory entries after a merge.

        All entries pointing to the empty bucket should now
        point to the surviving bucket.
        """
        updates = 0
        for i in range(len(directory)):
            if directory[i].id == empty_bucket.id:
                directory[i] = surviving_bucket
                updates += 1
        print(f"  Updated {updates} directory entries")
        return updates


class ExtendibleHashWithShrinkage:
    """
    Extendible Hash Index with shrinkage support.

    Extends basic Extendible Hashing to support bucket merging
    and directory halving.
    """

    def __init__(self, bucket_capacity: int = 4):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.next_bucket_id = 0
        self.directory: List[MergeableBucket] = []
        self.merger = BucketMerger()
        self.shrink_threshold = 0.25  # Consider merge at 25% full
        self._init_directory()

    def _init_directory(self):
        for _ in range(2 ** self.global_depth):
            bucket = MergeableBucket(
                id=self.next_bucket_id,
                local_depth=self.global_depth,
                capacity=self.bucket_capacity
            )
            self.next_bucket_id += 1
            self.directory.append(bucket)

    def delete(self, key: str, hash_value: int):
        """Delete a record and potentially trigger shrinkage."""
        idx = self._get_directory_index(hash_value)
        bucket = self.directory[idx]

        # Find and remove the record
        original_count = len(bucket.records)
        bucket.records = [r for r in bucket.records if r[0] != key]
        if len(bucket.records) == original_count:
            return  # Key not found

        # Check for a shrinkage opportunity
        fill_ratio = len(bucket.records) / self.bucket_capacity
        if fill_ratio <= self.shrink_threshold:
            self._attempt_merge(idx, bucket)

    def _get_directory_index(self, hash_value: int) -> int:
        # Index by the top global_depth bits of a 32-bit hash value
        shift = 32 - self.global_depth
        return hash_value >> shift

    def _attempt_merge(self, bucket_idx: int, bucket: MergeableBucket):
        """Attempt to merge an underutilized bucket with its sibling."""
        result = self.merger.find_sibling(
            self.directory, bucket_idx, self.global_depth
        )
        if result is None:
            print(f"No sibling found for bucket {bucket.id}")
            return
        sibling_idx, sibling = result

        # Check if the merge is viable
        if not bucket.can_absorb(sibling):
            combined = len(bucket.records) + len(sibling.records)
            print(f"Cannot merge: combined {combined} exceeds capacity "
                  f"{self.bucket_capacity}")
            return

        # Perform the merge
        print(f">>> Attempting merge of buckets {bucket.id} and {sibling.id}")
        self.merger.merge_buckets(bucket, sibling)
        self.merger.update_directory_for_merge(self.directory, bucket, sibling)

        # Check if the directory can be halved
        self._attempt_directory_halve()

    def _attempt_directory_halve(self):
        """
        Attempt to halve the directory size.

        Can only halve if every pair of entries points to the same
        bucket (i.e., all buckets have local_depth < global_depth).
        """
        if self.global_depth <= 1:
            return  # Can't go below 2 entries

        # All buckets must have depth less than the global depth
        for bucket in self.directory:
            if bucket.local_depth >= self.global_depth:
                return  # Can't halve yet

        # With most-significant-bit indexing, dropping the last index bit
        # collapses ADJACENT entries, so verify pairs (2i, 2i + 1)
        new_size = len(self.directory) // 2
        for i in range(new_size):
            if self.directory[2 * i].id != self.directory[2 * i + 1].id:
                return  # Pairs not identical

        # Safe to halve: keep one entry from each adjacent pair
        print(f"Halving directory: {len(self.directory)} -> {new_size}")
        self.directory = self.directory[::2]
        self.global_depth -= 1


# Demonstration
def demonstrate_shrinkage():
    """Show bucket merging in action."""
    index = ExtendibleHashWithShrinkage(bucket_capacity=4)

    # Insert some records. At global depth 1, the TOP hash bit picks the
    # bucket, so k1/k2 land in bucket 0 and k3/k4 in bucket 1.
    test_data = [
        ("k1", 0b00110000_00000000_00000000_00000000),
        ("k2", 0b00110001_00000000_00000000_00000000),
        ("k3", 0b10110000_00000000_00000000_00000000),
        ("k4", 0b10110001_00000000_00000000_00000000),
    ]
    for key, hash_val in test_data:
        index.directory[hash_val >> 31].records.append((key, hash_val, 0))

    print("After insertions:")
    for i, b in enumerate(index.directory):
        print(f"  Dir[{i}] -> Bucket {b.id} ({len(b.records)} records)")

    # Delete records to trigger shrinkage
    print("Deleting k1, k2, k3...")
    for key, hash_val in test_data[:3]:
        index.delete(key, hash_val)

    print("After deletions:")
    for i, b in enumerate(index.directory):
        print(f"  Dir[{i}] -> Bucket {b.id} ({len(b.records)} records)")


demonstrate_shrinkage()
```

Linear Hashing's round-robin approach extends naturally to shrinkage: just as it splits buckets in order, it can merge buckets in reverse order.
Contraction Algorithm:
1. Monitor global load factor
2. When load factor drops below threshold:
a. Decrement split pointer (or wrap to previous level)
b. Merge the last bucket into its partner
c. Deallocate the last bucket
3. Repeat until load factor is acceptable
The elegance of Linear Hashing is that contraction is exactly splitting in reverse. The most recently added bucket merges back into its original partner.
```python
from typing import List
from dataclasses import dataclass, field


@dataclass
class ContractibleBucket:
    """Bucket for Linear Hashing with contraction support."""
    records: List[tuple] = field(default_factory=list)
    capacity: int = 100


class LinearHashWithContraction:
    """
    Linear Hashing with both expansion and contraction.

    Key insight: Contraction is splitting in reverse.
    The most recently added bucket merges first.
    """

    def __init__(self, initial_buckets: int = 4,
                 bucket_capacity: int = 4,
                 expand_threshold: float = 0.80,
                 contract_threshold: float = 0.30):
        self.level = 0
        while (1 << self.level) < initial_buckets:
            self.level += 1
        initial = 1 << self.level
        self.buckets: List[ContractibleBucket] = [
            ContractibleBucket(capacity=bucket_capacity) for _ in range(initial)
        ]
        self.split_pointer = 0
        self.bucket_capacity = bucket_capacity
        self.expand_threshold = expand_threshold
        self.contract_threshold = contract_threshold
        self.total_records = 0
        self.min_buckets = 2  # Never shrink below this

    def _load_factor(self) -> float:
        """Current load factor."""
        capacity = len(self.buckets) * self.bucket_capacity
        return self.total_records / capacity if capacity else 0

    def delete(self, key: str, hash_value: int):
        """Delete a record and potentially trigger contraction."""
        idx = self._get_bucket_index(hash_value)
        bucket = self.buckets[idx]

        # Find and remove
        original_count = len(bucket.records)
        bucket.records = [r for r in bucket.records if r[0] != key]
        if len(bucket.records) < original_count:
            self.total_records -= 1

        # Check for contraction
        if self._load_factor() < self.contract_threshold:
            self._contract()

    def _get_bucket_index(self, hash_value: int) -> int:
        """Determine bucket for a key."""
        idx = hash_value & ((1 << self.level) - 1)
        if idx < self.split_pointer:
            idx = hash_value & ((1 << (self.level + 1)) - 1)
        return idx

    def _contract(self):
        """
        Remove the last bucket, merging its contents into its partner.

        The last bucket was created by the most recent split;
        contraction reverses exactly that split.
        """
        if len(self.buckets) <= self.min_buckets:
            print("Cannot contract: at minimum bucket count")
            return

        print(f"    Level: {self.level}, Split pointer before: {self.split_pointer}")

        # Step the split pointer back, reversing the most recent split
        if self.split_pointer > 0:
            self.split_pointer -= 1
        else:
            # At the start of a level: drop back to the previous level
            self.level -= 1
            self.split_pointer = (1 << self.level) - 1

        # The last bucket merges back into the bucket whose split created
        # it, which is the bucket now at the split pointer
        last_bucket_idx = len(self.buckets) - 1
        last_bucket = self.buckets[last_bucket_idx]
        partner_idx = self.split_pointer
        partner = self.buckets[partner_idx]

        print(f">>> Contracting: merging bucket {last_bucket_idx} into {partner_idx}")

        # Move all records from the last bucket to its partner, then
        # remove the last bucket
        partner.records.extend(last_bucket.records)
        self.buckets.pop()

        print(f"    Split pointer after: {self.split_pointer}")
        print(f"    Buckets remaining: {len(self.buckets)}")

    def display_state(self):
        """Display current state."""
        print("=== Linear Hash State ===")
        print(f"Level: {self.level}, Split pointer: {self.split_pointer}")
        print(f"Buckets: {len(self.buckets)}")
        print(f"Records: {self.total_records}")
        print(f"Load factor: {self._load_factor():.2%}")
        for i, b in enumerate(self.buckets):
            status = "(already split)" if i < self.split_pointer else ""
            print(f"  Bucket {i} {status}: {len(b.records)} records")


# Demonstration
def demonstrate_linear_contraction():
    """Show Linear Hashing expansion followed by contraction."""
    index = LinearHashWithContraction(
        initial_buckets=2,
        bucket_capacity=2,
        expand_threshold=0.75,
        contract_threshold=0.25
    )

    # Insert records to cause expansion
    print("=== Expansion Phase ===")
    import random
    random.seed(42)
    keys = []
    for i in range(8):
        key = f"key_{i}"
        hash_val = random.getrandbits(32)
        keys.append((key, hash_val))
        idx = index._get_bucket_index(hash_val)
        index.buckets[idx].records.append((key, hash_val, i))
        index.total_records += 1
        if index._load_factor() > index.expand_threshold:
            # Simulate expansion (simplified: records are not rehashed)
            index.buckets.append(ContractibleBucket(capacity=2))
            index.split_pointer += 1
            if index.split_pointer >= (1 << index.level):
                index.level += 1
                index.split_pointer = 0

    index.display_state()

    # Delete records to cause contraction
    print("=== Contraction Phase ===")
    for key, hash_val in keys[:6]:  # Delete most records
        index.delete(key, hash_val)

    index.display_state()


demonstrate_linear_contraction()
```

Linear Hashing's contraction perfectly mirrors its expansion. Bucket n was created when the split pointer reached position n - 2^level. Contraction reverses this: bucket n merges back when contraction brings us to that same point. This symmetry makes Linear Hashing's shrinkage more predictable than Extendible Hashing's.
Beyond bucket merging, databases employ various strategies to reclaim space from shrinking hash indexes:
1. Deferred Reclamation:
Instead of immediately deallocating merged buckets, the system adds them to a free list. Future splits can reuse these pages without disk allocation overhead.
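A free list of this kind is only a few lines (a sketch; `PageFreeList` and its methods are hypothetical names, not a real DBMS API). Pages freed by merges are stacked for reuse, and allocation falls back to fresh pages only when the list is empty.

```python
class PageFreeList:
    """Recycle deallocated bucket pages instead of releasing them."""

    def __init__(self) -> None:
        self._free: list = []  # page ids of merged-away buckets

    def release(self, page_id: int) -> None:
        """A merge deallocated this page; keep it for future splits."""
        self._free.append(page_id)

    def acquire(self, allocate_new) -> int:
        """Reuse a freed page if available, else allocate a fresh one."""
        return self._free.pop() if self._free else allocate_new()


free_list = PageFreeList()
free_list.release(42)                    # bucket page 42 freed by a merge
fresh = iter(range(100, 200))
assert free_list.acquire(lambda: next(fresh)) == 42   # split reuses page 42
assert free_list.acquire(lambda: next(fresh)) == 100  # then allocates fresh
```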
2. Compaction:
Periodically consolidate fragmented pages. Even if buckets aren't merged, records within a bucket may become fragmented after many deletions. Compaction rewrites pages contiguously.
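A minimal sketch of in-page compaction, under the assumption that deleted slots are marked with tombstones (modeled here as `None`): compaction drops the tombstones and packs the surviving records contiguously at the front of the page.

```python
def compact_page(slots: list) -> list:
    """Rewrite a bucket page so live records are contiguous.

    Deleted slots are modeled as None tombstones; compaction discards
    them, shrinking the page's used extent without merging any buckets.
    """
    return [record for record in slots if record is not None]


page = [("k1", 101), None, ("k4", 104), None, None, ("k7", 107)]
assert compact_page(page) == [("k1", 101), ("k4", 104), ("k7", 107)]
```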
3. Online Reorganization:
Some systems support reorganizing the hash index while it remains online. This typically involves building a compacted copy of the index in the background, applying concurrent updates to both copies, and atomically switching readers to the new index once it has caught up.
4. Scheduled Rebuilds:
Many production systems simply rebuild hash indexes periodically during maintenance windows. This is the most reliable way to achieve optimal space utilization.
| Strategy | Online? | Complexity | Effectiveness | Best For |
|---|---|---|---|---|
| Bucket merging | Yes | Medium | Moderate | Gradual shrinkage |
| Deferred reuse | Yes | Low | High | Volatile workloads |
| Page compaction | Yes | Medium | Moderate | Fragmentation |
| Online reorg | Yes | High | Complete | Major shrinkage |
| Scheduled rebuild | No | Low | Complete | Predictable workloads |
In practice, many DBAs prefer scheduled rebuilds over complex shrinkage algorithms. A weekly rebuild during low-traffic hours is simpler, more predictable, and just as effective for workloads with gradual changes. Save complex online shrinkage for systems that truly cannot afford any maintenance window.
Shrinkage under concurrent access introduces unique challenges beyond those faced during growth:
The Two-Bucket Problem:
Merging requires locking two buckets simultaneously. This creates deadlock potential:
Solutions for concurrent merging include imposing a global lock order (always latch the lower-numbered bucket first), using try-lock with backoff so a merge that cannot acquire both latches simply defers, and rechecking the merge condition after both latches are held.
Directory Halving Coordination:
In Extendible Hashing, directory halving affects all lookups. It requires blocking or draining concurrent directory readers, verifying that every entry pair still collapses to a single bucket, and publishing the smaller directory atomically so that no lookup ever observes a partially halved state.
This global synchronization is why many Extendible Hashing implementations don't automatically halve the directory, even when merges make it theoretically possible.
A subtle bug occurs when shrinkage and growth happen simultaneously. Thread A decides to merge buckets based on low utilization. Before the merge completes, Thread B inserts many records. The merge then violates capacity constraints. Proper implementations recheck conditions after acquiring all necessary locks.
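Both defenses, ordered locking and the post-lock recheck, fit in one small sketch (`LockableBucket` and `merge_if_still_viable` are hypothetical names, not a production API). Locks are always acquired in ascending bucket-id order to rule out deadlock, and the merge condition is re-evaluated only after both locks are held, so a racing insert causes the merge to be skipped rather than overfilling the bucket.

```python
import threading


class LockableBucket:
    """A bucket with its own latch, for two-bucket merge coordination."""

    def __init__(self, bucket_id: int, capacity: int = 4):
        self.id = bucket_id
        self.capacity = capacity
        self.records: list = []
        self.lock = threading.Lock()


def merge_if_still_viable(a: LockableBucket, b: LockableBucket,
                          merge_ceiling: float = 0.5) -> bool:
    """Merge b into a only if the condition still holds under both locks."""
    # Deadlock avoidance: always latch the lower-numbered bucket first
    first, second = (a, b) if a.id < b.id else (b, a)
    with first.lock, second.lock:
        # RE-CHECK: inserts may have raced in since the merge decision
        combined = len(a.records) + len(b.records)
        if combined > a.capacity * merge_ceiling:
            return False  # condition no longer holds; skip the merge
        a.records.extend(b.records)
        b.records.clear()
        return True


a, b = LockableBucket(1), LockableBucket(2)
a.records = [("k1",)]
b.records = [("k2",), ("k3",), ("k4",)]   # a racing insert filled b
assert not merge_if_still_viable(a, b)    # recheck fails: 4 > 50% of 4
b.records = [("k2",)]
assert merge_if_still_viable(a, b) and len(a.records) == 2
```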
Given the complexity of shrinkage, when should systems actually implement it? Consider these practical guidelines:
Cost-Benefit Analysis:
| Factor | Without Shrinkage | With Shrinkage |
|---|---|---|
| Space usage | Higher (up to 10x) | Optimal |
| Implementation | Simpler | Complex |
| Maintenance | Scheduled rebuilds | Automatic |
| Predictability | High | Lower |
| Concurrency overhead | None | Moderate |
| Debugging complexity | Low | High |
For most workloads, the overhead of shrinkage isn't worth the benefits. Modern storage is cheap, and scheduled rebuilds handle most scenarios. Reserve automatic shrinkage for specialized systems where space truly matters.
We've explored the often-overlooked challenge of hash index shrinkage—the algorithms, challenges, and practical considerations for reclaiming space from underutilized indexes. The key insights:
1. Shrinkage is fundamentally harder than growth: triggers are subjective, merge partners must be located, and two buckets must be coordinated at once.
2. Merge thresholds must sit well below split thresholds (hysteresis) to prevent split-merge oscillation.
3. Linear Hashing's contraction exactly mirrors its expansion, making its shrinkage more predictable than Extendible Hashing's merge-and-halve approach.
4. Many production systems skip automatic shrinkage entirely, relying on free lists, compaction, or scheduled rebuilds instead.
What's Next:
With an understanding of both growth and shrinkage, we're ready to examine how dynamic hashing maintains performance over time. The next page explores performance maintenance—how to keep hash indexes performing optimally as data patterns change, including load balancing, overflow management, and adaptation to skewed distributions.
You now understand the complete shrinkage lifecycle in dynamic hashing—from detecting underutilization through executing merges to practical deployment decisions. You can evaluate whether automatic shrinkage is appropriate for your workload.