Most discussions of dynamic hashing focus on growth—how to add buckets as data volumes increase. But real-world databases aren't monotonically growing. Data is deleted, archived, or purged. Tables that once held millions of records may shrink to thousands after cleanup operations.
The problem with growth-only approaches:
Imagine a hash index that grew to 10,000 buckets during peak usage. After a major data purge, only 1,000 records remain—roughly one record per 10 buckets. The index wastes 90% of its allocated space. Worse, the sparse distribution may cause unnecessary I/O: scanning an index with many empty buckets is inefficient.
Shrinkage handling addresses this by enabling hash indexes to consolidate underutilized buckets, reducing both space consumption and search overhead.
By the end of this page, you will understand the complete mechanics of shrinkage handling in dynamic hashing—from detecting underutilization through executing bucket merges to maintaining index integrity. You'll see why shrinkage is harder than growth and why many production systems don't implement it fully.
At first glance, shrinkage seems like growth in reverse: where growth splits buckets, shrinkage merges them. But several factors make shrinkage fundamentally more complex:
1. Asymmetric Triggering:
Growth has a clear trigger: bucket overflow. An insertion that can't fit forces immediate action. But when should shrinkage occur? A half-empty bucket isn't broken—it still works correctly. There's no urgent pressure to merge like there is to split.
2. Merge Partner Identification:
When splitting a bucket addressed by prefix P, records naturally separate into P0 and P1. The reverse isn't automatically available: to merge P0 and P1 back into P, we need to find and access both buckets. If they have different local depths or if one doesn't exist, merging becomes complex.
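The P0/P1 pairing can be made concrete with a tiny helper (a sketch; the function name is ours, not from any particular system). Treating a bucket's prefix as an integer whose low bit is the final prefix bit, the sibling is found by flipping that bit; a merge is only possible when both siblings actually exist at the same local depth.

```python
def sibling_prefix(prefix: int, local_depth: int) -> int:
    """Return the prefix of the bucket's merge partner.

    P0 and P1 differ only in the final prefix bit, which is the
    low bit when the prefix is stored as a local_depth-bit integer.
    """
    if local_depth == 0:
        raise ValueError("a depth-0 bucket covers everything and has no sibling")
    return prefix ^ 1

# With local depth 3, bucket 010 (P = 01, then 0) pairs with 011 (P = 01, then 1)
assert sibling_prefix(0b010, 3) == 0b011
assert sibling_prefix(0b011, 3) == 0b010
```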
3. Concurrency Complications:
Merging requires coordinating access to two buckets simultaneously, not just one. This doubles the locking complexity and increases deadlock potential.
4. Directory Halving Overhead:
In Extendible Hashing, directory doubling is relatively cheap—we just duplicate pointers. But directory halving requires verifying that all bucket pairs share the same target bucket before we can safely reduce the directory size.
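That validation can be sketched in a few lines (the helper name is hypothetical). This sketch assumes the common least-significant-bit convention, where doubling appends a copy of the directory, so halving requires entry `i` and entry `i + half` to point to the same bucket for every `i`; a most-significant-bit scheme would instead pair adjacent entries.

```python
def can_halve_directory(bucket_ids: list) -> bool:
    """Check whether an extendible-hashing directory can be halved.

    bucket_ids[i] is the id of the bucket that directory slot i points to.
    Under LSB indexing, slots i and i + half collapse into one slot when
    the directory shrinks, so they must already agree.
    """
    half = len(bucket_ids) // 2
    if half < 1:
        return False
    return all(bucket_ids[i] == bucket_ids[i + half] for i in range(half))

assert can_halve_directory([7, 9, 7, 9])      # slots (0,2) and (1,3) match
assert not can_halve_directory([7, 9, 7, 8])  # slot 3 disagrees: cannot halve
```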
| Aspect | Growth (Split) | Shrinkage (Merge) |
|---|---|---|
| Trigger clarity | Unambiguous (overflow) | Subjective (how empty?) |
| Urgency | Immediate (can't insert) | Deferrable (still works) |
| Scope | Single bucket | Two buckets must coordinate |
| Partner finding | Creates partner (new bucket) | Must locate existing partner |
| Directory change | Doubling is pointer copy | Halving requires validation |
| Failure mode | Must succeed for insert | Can skip without data loss |
Due to these complexities, many production hash index implementations don't support automatic shrinkage. Instead, they rely on periodic index rebuilding or administrative space reclamation commands. This is a practical trade-off that prioritizes implementation simplicity and predictable behavior.
Unlike overflow detection, which has an absolute threshold (bucket is full), underutilization detection requires choosing arbitrary thresholds. Different strategies offer different trade-offs:
Threshold-Based Detection:
The simplest approach: trigger merge consideration when a bucket falls below a certain fill percentage.
```python
from dataclasses import dataclass
from typing import List, Tuple
from enum import Enum


class ShrinkagePolicy(Enum):
    """Different policies for triggering shrinkage."""
    THRESHOLD = "threshold"      # Merge when below fill percentage
    COMBINED = "combined"        # Both buckets together below threshold
    LOAD_FACTOR = "load_factor"  # Global load factor too low
    HYSTERESIS = "hysteresis"    # Different thresholds for grow/shrink


@dataclass
class UnderutilizationDetector:
    """
    Detect when buckets should be considered for merging.

    Key insight: Shrinkage thresholds should be lower than split
    thresholds to prevent oscillation (split-merge-split cycles).
    """
    bucket_capacity: int = 100

    # Threshold for single bucket underutilization
    single_bucket_threshold: float = 0.25  # 25% or less triggers check

    # Threshold for combined (merged) bucket viability
    combined_threshold: float = 0.70  # Merged bucket must not exceed 70%

    # Global load factor floor
    global_load_floor: float = 0.30  # Shrink when overall < 30%

    def check_single_bucket(self, record_count: int) -> bool:
        """
        Check if a single bucket is underutilized.
        Returns True if the bucket is sparse enough to consider merging.
        """
        fill_ratio = record_count / self.bucket_capacity
        return fill_ratio <= self.single_bucket_threshold

    def check_merge_viability(self, bucket_a_count: int,
                              bucket_b_count: int) -> Tuple[bool, str]:
        """
        Check if two buckets can be safely merged.

        Returns (can_merge, reason)
        """
        combined_count = bucket_a_count + bucket_b_count
        if combined_count > self.bucket_capacity:
            ratio = combined_count / self.bucket_capacity
            return False, f"Would overflow ({ratio:.1%} of capacity)"
        combined_ratio = combined_count / self.bucket_capacity
        if combined_ratio > self.combined_threshold:
            return False, (f"Would be too full ({combined_ratio:.1%} > "
                           f"{self.combined_threshold:.1%})")
        return True, f"Safe to merge ({combined_ratio:.1%} of capacity)"

    def check_global_underutilization(self, total_records: int,
                                     total_buckets: int) -> bool:
        """
        Check if the entire index is underutilized.
        Used by Linear Hashing-style global shrinkage.
        """
        if total_buckets == 0:
            return False
        load_factor = total_records / (total_buckets * self.bucket_capacity)
        return load_factor < self.global_load_floor

    def analyze_shrinkage_potential(self, buckets: List[int],  # record counts
                                    bucket_depths: List[int]) -> dict:
        """
        Analyze an index to find shrinkage opportunities.
        Returns statistics about underutilization and merge candidates.
        """
        underutilized = []
        merge_candidates = []
        total_records = sum(buckets)
        total_capacity = len(buckets) * self.bucket_capacity

        # Find underutilized buckets
        for i, count in enumerate(buckets):
            if self.check_single_bucket(count):
                underutilized.append(i)

        # Find merge candidates (buckets that could merge with a sibling).
        # In practice, sibling identification depends on the scheme;
        # here we assume adjacent buckets with the same depth are siblings.
        for i in range(len(buckets) - 1):
            if bucket_depths[i] == bucket_depths[i + 1]:
                can_merge, _ = self.check_merge_viability(buckets[i], buckets[i + 1])
                if can_merge:
                    merge_candidates.append((i, i + 1))

        return {
            "total_buckets": len(buckets),
            "total_records": total_records,
            "overall_utilization": total_records / total_capacity if total_capacity else 0,
            "underutilized_buckets": len(underutilized),
            "underutilized_indices": underutilized,
            "merge_candidates": merge_candidates,
            "potential_space_savings": len(merge_candidates) * self.bucket_capacity,
            "global_shrinkage_recommended": self.check_global_underutilization(
                total_records, len(buckets)
            ),
        }


# Example usage
detector = UnderutilizationDetector(bucket_capacity=100)

# Analyze an underutilized index
bucket_counts = [15, 10, 5, 20, 25, 8, 12, 18]  # Very sparse
bucket_depths = [3, 3, 3, 3, 3, 3, 3, 3]        # All same depth

analysis = detector.analyze_shrinkage_potential(bucket_counts, bucket_depths)

print("Shrinkage Analysis:")
print(f"  Overall utilization: {analysis['overall_utilization']:.1%}")
print(f"  Underutilized buckets: {analysis['underutilized_buckets']} of {analysis['total_buckets']}")
print(f"  Merge candidates: {len(analysis['merge_candidates'])} pairs")
print(f"  Potential space savings: {analysis['potential_space_savings']} records worth")
print(f"  Global shrinkage recommended: {analysis['global_shrinkage_recommended']}")
```

A critical design principle: the threshold for merging should be significantly lower than the threshold for splitting. If we split at 80% full and merge when the combined bucket would be 80% full, we risk split-merge oscillation. Using 80% split / 50% merge thresholds creates a stable buffer zone.
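The buffer zone can be expressed directly (a sketch with hypothetical helper names): with an 80% split threshold and a 50% merge ceiling, a freshly split pair can never immediately re-qualify for merging.

```python
SPLIT_AT = 0.80   # split a bucket above this fill ratio
MERGE_AT = 0.50   # merge siblings only if the combined bucket stays at/below this

def should_split(count: int, capacity: int) -> bool:
    return count / capacity > SPLIT_AT

def should_merge(combined_count: int, capacity: int) -> bool:
    return combined_count / capacity <= MERGE_AT

# A bucket splits at 81/100. Its two halves together still hold 81 records,
# well above the 50% merge ceiling, so no immediate merge-back occurs.
assert should_split(81, 100)
assert not should_merge(81, 100)
# Only after deletions shrink the pair to <= 50 records does a merge qualify.
assert should_merge(50, 100)
```

With equal 80%/80% thresholds, the same pair would qualify for a merge the moment it split, and the next insertion would split it again; the 30-point gap is what breaks the cycle.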
Bucket merging is conceptually the reverse of splitting: two buckets with prefixes P0 and P1 consolidate into a single bucket with prefix P. But the implementation details differ significantly.
The Merge Algorithm:
1. Identify merge candidate (bucket below threshold)
2. Find sibling bucket (shares prefix P, differs only in last bit)
3. Verify combined records fit in one bucket
4. Copy all records to the surviving bucket
5. Update directory entries (point both prefixes to one bucket)
6. Deallocate the empty bucket
7. Optionally reduce local depth of merged bucket
```python
from typing import List, Tuple, Optional
from dataclasses import dataclass, field


@dataclass
class MergeableBucket:
    """Bucket supporting merge operations."""
    id: int
    local_depth: int
    records: List[tuple] = field(default_factory=list)
    capacity: int = 100

    def can_absorb(self, other: 'MergeableBucket') -> bool:
        """Check if this bucket can absorb another's records."""
        return len(self.records) + len(other.records) <= self.capacity

    def get_sibling_prefix(self) -> int:
        """
        Calculate the prefix of our sibling bucket.
        If our prefix is P with local_depth bits, the sibling has
        prefix P XOR 1 (flip the last bit of our prefix).
        """
        # Conceptual only - the actual prefix comes from directory position
        return -1  # Placeholder


class BucketMerger:
    """
    Implements bucket merge operations for dynamic hashing.

    Merging is more complex than splitting because we:
    1. Must find the sibling bucket
    2. Must verify both buckets exist and are compatible
    3. Must atomically update the directory and deallocate
    """

    def find_sibling(self, directory: List[MergeableBucket],
                     bucket_idx: int,
                     global_depth: int) -> Optional[Tuple[int, MergeableBucket]]:
        """
        Find the sibling bucket for a potential merge.

        The sibling shares all prefix bits except the last one
        used by the bucket's local depth.
        """
        bucket = directory[bucket_idx]
        if bucket.local_depth == 0:
            return None  # Can't merge at depth 0

        # Calculate the sibling's directory index: with most-significant-bit
        # indexing, flip the bit at position (local_depth - 1) from the left
        flip_bit = 1 << (global_depth - bucket.local_depth)
        sibling_idx = bucket_idx ^ flip_bit
        if sibling_idx >= len(directory):
            return None
        sibling = directory[sibling_idx]

        # Sibling must have the same local depth to merge
        if sibling.local_depth != bucket.local_depth:
            return None

        # Sibling must be a different bucket
        # (not the same bucket reached via shared directory pointers)
        if sibling.id == bucket.id:
            return None

        return sibling_idx, sibling

    def merge_buckets(self, bucket: MergeableBucket,
                      sibling: MergeableBucket) -> MergeableBucket:
        """
        Merge two sibling buckets into one.

        All records from sibling move to bucket. Sibling becomes
        empty and can be deallocated.
        """
        if not bucket.can_absorb(sibling):
            raise ValueError("Cannot merge: combined records exceed capacity")

        # Move all records from sibling to bucket
        bucket.records.extend(sibling.records)
        sibling.records.clear()

        # Decrease local depth (we're using one fewer bit)
        bucket.local_depth -= 1

        print(f"Merged bucket {sibling.id} into bucket {bucket.id}")
        print(f"  New local depth: {bucket.local_depth}")
        print(f"  Total records: {len(bucket.records)}")
        return bucket

    def update_directory_for_merge(self, directory: List[MergeableBucket],
                                   surviving_bucket: MergeableBucket,
                                   empty_bucket: MergeableBucket) -> int:
        """
        Update directory entries after a merge.

        All entries pointing to the empty bucket should now
        point to the surviving bucket.
        """
        updates = 0
        for i in range(len(directory)):
            if directory[i].id == empty_bucket.id:
                directory[i] = surviving_bucket
                updates += 1
        print(f"  Updated {updates} directory entries")
        return updates


class ExtendibleHashWithShrinkage:
    """
    Extendible Hash Index with shrinkage support.

    Extends basic Extendible Hashing to support bucket merging
    and directory halving.
    """

    def __init__(self, bucket_capacity: int = 4):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.next_bucket_id = 0
        self.directory: List[MergeableBucket] = []
        self.merger = BucketMerger()
        self.shrink_threshold = 0.25  # Consider merge at 25% full
        self._init_directory()

    def _init_directory(self):
        for _ in range(2 ** self.global_depth):
            bucket = MergeableBucket(
                id=self.next_bucket_id,
                local_depth=self.global_depth,
                capacity=self.bucket_capacity
            )
            self.next_bucket_id += 1
            self.directory.append(bucket)

    def delete(self, key: str, hash_value: int):
        """Delete a record and potentially trigger shrinkage."""
        idx = self._get_directory_index(hash_value)
        bucket = self.directory[idx]

        # Find and remove the record
        original_count = len(bucket.records)
        bucket.records = [r for r in bucket.records if r[0] != key]
        if len(bucket.records) == original_count:
            return  # Key not found

        # Check for a shrinkage opportunity
        fill_ratio = len(bucket.records) / self.bucket_capacity
        if fill_ratio <= self.shrink_threshold:
            self._attempt_merge(idx, bucket)

    def _get_directory_index(self, hash_value: int) -> int:
        # Index by the top global_depth bits of a 32-bit hash value
        shift = 32 - self.global_depth
        return hash_value >> shift

    def _attempt_merge(self, bucket_idx: int, bucket: MergeableBucket):
        """Attempt to merge an underutilized bucket with its sibling."""
        result = self.merger.find_sibling(
            self.directory, bucket_idx, self.global_depth
        )
        if result is None:
            print(f"No sibling found for bucket {bucket.id}")
            return
        sibling_idx, sibling = result

        # Check if the merge is viable
        if not bucket.can_absorb(sibling):
            combined = len(bucket.records) + len(sibling.records)
            print(f"Cannot merge: combined {combined} exceeds capacity "
                  f"{self.bucket_capacity}")
            return

        # Perform the merge
        print(f">>> Attempting merge of buckets {bucket.id} and {sibling.id}")
        self.merger.merge_buckets(bucket, sibling)
        self.merger.update_directory_for_merge(self.directory, bucket, sibling)

        # Check if the directory can be halved
        self._attempt_directory_halve()

    def _attempt_directory_halve(self):
        """
        Attempt to halve the directory size.

        Can only halve if every pair of entries points to the same
        bucket (i.e., all buckets have local_depth < global_depth).
        """
        if self.global_depth <= 1:
            return  # Can't go below 2 entries

        # All buckets must have depth less than the global depth
        for bucket in self.directory:
            if bucket.local_depth >= self.global_depth:
                return  # Can't halve yet

        # With most-significant-bit indexing, dropping the last index bit
        # collapses ADJACENT entries, so verify pairs (2i, 2i + 1)
        new_size = len(self.directory) // 2
        for i in range(new_size):
            if self.directory[2 * i].id != self.directory[2 * i + 1].id:
                return  # Pairs not identical

        # Safe to halve: keep one entry from each adjacent pair
        print(f"Halving directory: {len(self.directory)} -> {new_size}")
        self.directory = self.directory[::2]
        self.global_depth -= 1


# Demonstration
def demonstrate_shrinkage():
    """Show bucket merging in action."""
    index = ExtendibleHashWithShrinkage(bucket_capacity=4)

    # Insert some records. At global depth 1, the TOP hash bit picks the
    # bucket, so k1/k2 land in bucket 0 and k3/k4 in bucket 1.
    test_data = [
        ("k1", 0b00110000_00000000_00000000_00000000),
        ("k2", 0b00110001_00000000_00000000_00000000),
        ("k3", 0b10110000_00000000_00000000_00000000),
        ("k4", 0b10110001_00000000_00000000_00000000),
    ]
    for key, hash_val in test_data:
        index.directory[hash_val >> 31].records.append((key, hash_val, 0))

    print("After insertions:")
    for i, b in enumerate(index.directory):
        print(f"  Dir[{i}] -> Bucket {b.id} ({len(b.records)} records)")

    # Delete records to trigger shrinkage
    print("Deleting k1, k2, k3...")
    for key, hash_val in test_data[:3]:
        index.delete(key, hash_val)

    print("After deletions:")
    for i, b in enumerate(index.directory):
        print(f"  Dir[{i}] -> Bucket {b.id} ({len(b.records)} records)")


demonstrate_shrinkage()
```

Linear Hashing's round-robin approach extends naturally to shrinkage: just as it splits buckets in order, it can merge buckets in reverse order.
Contraction Algorithm:
1. Monitor global load factor
2. When load factor drops below threshold:
a. Decrement split pointer (or wrap to previous level)
b. Merge the last bucket into its partner
c. Deallocate the last bucket
3. Repeat until load factor is acceptable
The elegance of Linear Hashing is that contraction is exactly splitting in reverse. The most recently added bucket merges back into its original partner.
```python
from typing import List
from dataclasses import dataclass, field


@dataclass
class ContractibleBucket:
    """Bucket for Linear Hashing with contraction support."""
    records: List[tuple] = field(default_factory=list)
    capacity: int = 100


class LinearHashWithContraction:
    """
    Linear Hashing with both expansion and contraction.

    Key insight: Contraction is splitting in reverse.
    The most recently added bucket merges first.
    """

    def __init__(self, initial_buckets: int = 4,
                 bucket_capacity: int = 4,
                 expand_threshold: float = 0.80,
                 contract_threshold: float = 0.30):
        self.level = 0
        while (1 << self.level) < initial_buckets:
            self.level += 1
        initial = 1 << self.level
        self.buckets: List[ContractibleBucket] = [
            ContractibleBucket(capacity=bucket_capacity) for _ in range(initial)
        ]
        self.split_pointer = 0
        self.bucket_capacity = bucket_capacity
        self.expand_threshold = expand_threshold
        self.contract_threshold = contract_threshold
        self.total_records = 0
        self.min_buckets = 2  # Never shrink below this

    def _load_factor(self) -> float:
        """Current load factor."""
        capacity = len(self.buckets) * self.bucket_capacity
        return self.total_records / capacity if capacity else 0

    def delete(self, key: str, hash_value: int):
        """Delete a record and potentially trigger contraction."""
        idx = self._get_bucket_index(hash_value)
        bucket = self.buckets[idx]

        # Find and remove
        original_count = len(bucket.records)
        bucket.records = [r for r in bucket.records if r[0] != key]
        if len(bucket.records) < original_count:
            self.total_records -= 1

        # Check for contraction
        if self._load_factor() < self.contract_threshold:
            self._contract()

    def _get_bucket_index(self, hash_value: int) -> int:
        """Determine bucket for a key."""
        idx = hash_value & ((1 << self.level) - 1)
        if idx < self.split_pointer:
            idx = hash_value & ((1 << (self.level + 1)) - 1)
        return idx

    def _contract(self):
        """
        Remove the last bucket, merging its contents into its partner.

        The last bucket was created by the most recent split;
        contraction reverses exactly that split.
        """
        if len(self.buckets) <= self.min_buckets:
            print("Cannot contract: at minimum bucket count")
            return

        print(f"    Level: {self.level}, Split pointer before: {self.split_pointer}")

        # Step the split pointer back, reversing the most recent split
        if self.split_pointer > 0:
            self.split_pointer -= 1
        else:
            # At the start of a level: drop back to the previous level
            self.level -= 1
            self.split_pointer = (1 << self.level) - 1

        # The last bucket merges back into the bucket whose split created
        # it, which is the bucket now at the split pointer
        last_bucket_idx = len(self.buckets) - 1
        last_bucket = self.buckets[last_bucket_idx]
        partner_idx = self.split_pointer
        partner = self.buckets[partner_idx]

        print(f">>> Contracting: merging bucket {last_bucket_idx} into {partner_idx}")

        # Move all records from the last bucket to its partner, then
        # remove the last bucket
        partner.records.extend(last_bucket.records)
        self.buckets.pop()

        print(f"    Split pointer after: {self.split_pointer}")
        print(f"    Buckets remaining: {len(self.buckets)}")

    def display_state(self):
        """Display current state."""
        print("=== Linear Hash State ===")
        print(f"Level: {self.level}, Split pointer: {self.split_pointer}")
        print(f"Buckets: {len(self.buckets)}")
        print(f"Records: {self.total_records}")
        print(f"Load factor: {self._load_factor():.2%}")
        for i, b in enumerate(self.buckets):
            status = "(already split)" if i < self.split_pointer else ""
            print(f"  Bucket {i} {status}: {len(b.records)} records")


# Demonstration
def demonstrate_linear_contraction():
    """Show Linear Hashing expansion followed by contraction."""
    index = LinearHashWithContraction(
        initial_buckets=2,
        bucket_capacity=2,
        expand_threshold=0.75,
        contract_threshold=0.25
    )

    # Insert records to cause expansion
    print("=== Expansion Phase ===")
    import random
    random.seed(42)
    keys = []
    for i in range(8):
        key = f"key_{i}"
        hash_val = random.getrandbits(32)
        keys.append((key, hash_val))
        idx = index._get_bucket_index(hash_val)
        index.buckets[idx].records.append((key, hash_val, i))
        index.total_records += 1
        if index._load_factor() > index.expand_threshold:
            # Simulate expansion (simplified: records are not rehashed)
            index.buckets.append(ContractibleBucket(capacity=2))
            index.split_pointer += 1
            if index.split_pointer >= (1 << index.level):
                index.level += 1
                index.split_pointer = 0

    index.display_state()

    # Delete records to cause contraction
    print("=== Contraction Phase ===")
    for key, hash_val in keys[:6]:  # Delete most records
        index.delete(key, hash_val)

    index.display_state()


demonstrate_linear_contraction()
```

Linear Hashing's contraction perfectly mirrors its expansion. Bucket n was created when the split pointer reached position n - 2^level. Contraction reverses this: bucket n merges back when contraction brings us to that same point. This symmetry makes Linear Hashing's shrinkage more predictable than Extendible Hashing's.
Beyond bucket merging, databases employ various strategies to reclaim space from shrinking hash indexes:
1. Deferred Reclamation:
Instead of immediately deallocating merged buckets, the system adds them to a free list. Future splits can reuse these pages without disk allocation overhead.
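A free list of this kind is only a few lines (a sketch; `PageFreeList` and its methods are hypothetical names, not a real DBMS API). Pages freed by merges are stacked for reuse, and allocation falls back to fresh pages only when the list is empty.

```python
class PageFreeList:
    """Recycle deallocated bucket pages instead of releasing them."""

    def __init__(self) -> None:
        self._free: list = []  # page ids of merged-away buckets

    def release(self, page_id: int) -> None:
        """A merge deallocated this page; keep it for future splits."""
        self._free.append(page_id)

    def acquire(self, allocate_new) -> int:
        """Reuse a freed page if available, else allocate a fresh one."""
        return self._free.pop() if self._free else allocate_new()


free_list = PageFreeList()
free_list.release(42)                    # bucket page 42 freed by a merge
fresh = iter(range(100, 200))
assert free_list.acquire(lambda: next(fresh)) == 42   # split reuses page 42
assert free_list.acquire(lambda: next(fresh)) == 100  # then allocates fresh
```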
2. Compaction:
Periodically consolidate fragmented pages. Even if buckets aren't merged, records within a bucket may become fragmented after many deletions. Compaction rewrites pages contiguously.
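A minimal sketch of in-page compaction, under the assumption that deleted slots are marked with tombstones (modeled here as `None`): compaction drops the tombstones and packs the surviving records contiguously at the front of the page.

```python
def compact_page(slots: list) -> list:
    """Rewrite a bucket page so live records are contiguous.

    Deleted slots are modeled as None tombstones; compaction discards
    them, shrinking the page's used extent without merging any buckets.
    """
    return [record for record in slots if record is not None]


page = [("k1", 101), None, ("k4", 104), None, None, ("k7", 107)]
assert compact_page(page) == [("k1", 101), ("k4", 104), ("k7", 107)]
```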
3. Online Reorganization:
Some systems support reorganizing the hash index while it remains online. This typically involves building a compacted copy of the index in the background, applying concurrent updates to both copies, and atomically switching readers to the new index once it has caught up.
4. Scheduled Rebuilds:
Many production systems simply rebuild hash indexes periodically during maintenance windows. This is the most reliable way to achieve optimal space utilization.
| Strategy | Online? | Complexity | Effectiveness | Best For |
|---|---|---|---|---|
| Bucket merging | Yes | Medium | Moderate | Gradual shrinkage |
| Deferred reuse | Yes | Low | High | Volatile workloads |
| Page compaction | Yes | Medium | Moderate | Fragmentation |
| Online reorg | Yes | High | Complete | Major shrinkage |
| Scheduled rebuild | No | Low | Complete | Predictable workloads |
In practice, many DBAs prefer scheduled rebuilds over complex shrinkage algorithms. A weekly rebuild during low-traffic hours is simpler, more predictable, and just as effective for workloads with gradual changes. Save complex online shrinkage for systems that truly cannot afford any maintenance window.
Shrinkage under concurrent access introduces unique challenges beyond those faced during growth:
The Two-Bucket Problem:
Merging requires locking two buckets simultaneously. This creates deadlock potential:
Solutions for concurrent merging include imposing a global lock order (always latch the lower-numbered bucket first), using try-lock with backoff so a merge that cannot acquire both latches simply defers, and rechecking the merge condition after both latches are held.
Directory Halving Coordination:
In Extendible Hashing, directory halving affects all lookups. It requires blocking or draining concurrent directory readers, verifying that every entry pair still collapses to a single bucket, and publishing the smaller directory atomically so that no lookup ever observes a partially halved state.
This global synchronization is why many Extendible Hashing implementations don't automatically halve the directory, even when merges make it theoretically possible.
A subtle bug occurs when shrinkage and growth happen simultaneously. Thread A decides to merge buckets based on low utilization. Before the merge completes, Thread B inserts many records. The merge then violates capacity constraints. Proper implementations recheck conditions after acquiring all necessary locks.
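Both defenses, ordered locking and the post-lock recheck, fit in one small sketch (`LockableBucket` and `merge_if_still_viable` are hypothetical names, not a production API). Locks are always acquired in ascending bucket-id order to rule out deadlock, and the merge condition is re-evaluated only after both locks are held, so a racing insert causes the merge to be skipped rather than overfilling the bucket.

```python
import threading


class LockableBucket:
    """A bucket with its own latch, for two-bucket merge coordination."""

    def __init__(self, bucket_id: int, capacity: int = 4):
        self.id = bucket_id
        self.capacity = capacity
        self.records: list = []
        self.lock = threading.Lock()


def merge_if_still_viable(a: LockableBucket, b: LockableBucket,
                          merge_ceiling: float = 0.5) -> bool:
    """Merge b into a only if the condition still holds under both locks."""
    # Deadlock avoidance: always latch the lower-numbered bucket first
    first, second = (a, b) if a.id < b.id else (b, a)
    with first.lock, second.lock:
        # RE-CHECK: inserts may have raced in since the merge decision
        combined = len(a.records) + len(b.records)
        if combined > a.capacity * merge_ceiling:
            return False  # condition no longer holds; skip the merge
        a.records.extend(b.records)
        b.records.clear()
        return True


a, b = LockableBucket(1), LockableBucket(2)
a.records = [("k1",)]
b.records = [("k2",), ("k3",), ("k4",)]   # a racing insert filled b
assert not merge_if_still_viable(a, b)    # recheck fails: 4 > 50% of 4
b.records = [("k2",)]
assert merge_if_still_viable(a, b) and len(a.records) == 2
```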
Given the complexity of shrinkage, when should systems actually implement it? Consider these practical guidelines:
Cost-Benefit Analysis:
| Factor | Without Shrinkage | With Shrinkage |
|---|---|---|
| Space usage | Higher (up to 10x) | Optimal |
| Implementation | Simpler | Complex |
| Maintenance | Scheduled rebuilds | Automatic |
| Predictability | High | Lower |
| Concurrency overhead | None | Moderate |
| Debugging complexity | Low | High |
For most workloads, the overhead of shrinkage isn't worth the benefits. Modern storage is cheap, and scheduled rebuilds handle most scenarios. Reserve automatic shrinkage for specialized systems where space truly matters.
We've explored the often-overlooked challenge of hash index shrinkage—the algorithms, challenges, and practical considerations for reclaiming space from underutilized indexes. The key insights:
1. Shrinkage is fundamentally harder than growth: triggers are subjective, merge partners must be located, and two buckets must be coordinated at once.
2. Merge thresholds must sit well below split thresholds (hysteresis) to prevent split-merge oscillation.
3. Linear Hashing's contraction exactly mirrors its expansion, making its shrinkage more predictable than Extendible Hashing's merge-and-halve approach.
4. Many production systems skip automatic shrinkage entirely, relying on free lists, compaction, or scheduled rebuilds instead.
What's Next:
With an understanding of both growth and shrinkage, we're ready to examine how dynamic hashing maintains performance over time. The next page explores performance maintenance—how to keep hash indexes performing optimally as data patterns change, including load balancing, overflow management, and adaptation to skewed distributions.
You now understand the complete shrinkage lifecycle in dynamic hashing—from detecting underutilization through executing merges to practical deployment decisions. You can evaluate whether automatic shrinkage is appropriate for your workload.