When a database receives a flood of new data—perhaps from a viral product launch, a data migration, or a successful marketing campaign—the hash index must grow without grinding operations to a halt. Growth handling is the set of algorithms and strategies that enable this graceful expansion.
Consider the challenge: a hash index with 1,000 buckets is receiving 10,000 new records per second. If every bucket overflow required global reorganization, the system would spend more time restructuring itself than serving queries. Instead, dynamic hashing systems use sophisticated techniques to localize growth, amortize costs, and maintain query performance even during rapid expansion.
By the end of this page, you will master the complete mechanics of growth handling in dynamic hashing—from detecting overflow conditions through executing bucket splits to maintaining index integrity. You'll understand both Extendible Hashing's directory-based approach and Linear Hashing's round-robin strategy, including their trade-offs and implementation details.
Before growth can occur, the system must detect that growth is needed. This detection happens at different levels depending on the hashing scheme.
Bucket-Level Detection:
The most direct approach monitors individual buckets. When an insertion would exceed bucket capacity, growth is triggered. This is the primary mechanism in Extendible Hashing:
1. Compute hash(key) to find target bucket
2. Attempt to insert into bucket
3. If bucket is full:
- Trigger bucket split
- Redistribute records based on additional hash bits
- Complete the original insertion
Global-Level Detection:
Linear Hashing uses a global metric, the load factor, to trigger growth. The load factor is the fraction of the index's total record capacity currently in use:

Load Factor = N / (B × C)

where N = total records, B = total buckets, C = bucket capacity (records per bucket)
When the load factor exceeds a threshold (typically 0.75-0.85), a split is triggered regardless of which bucket overflowed.
| Scheme | Metric | Trigger | Advantages | Drawbacks |
|---|---|---|---|---|
| Extendible | Bucket fullness | Individual bucket overflow | Splits only when necessary | May cause uneven bucket sizes |
| Linear | Global load factor | Threshold exceeded | Predictable growth pattern | May split non-full buckets |
| Hybrid | Both metrics | Either condition met | Balances local/global | More complex implementation |
```python
class OverflowDetector:
    """
    Overflow detection strategies for dynamic hashing.

    Production systems combine multiple strategies:
    1. Hard overflow: Bucket physically cannot hold more
    2. Soft overflow: Performance threshold exceeded
    3. Load factor: Global capacity threshold
    """

    def __init__(self,
                 bucket_capacity: int = 100,
                 load_factor_threshold: float = 0.75,
                 overflow_chain_limit: int = 3):
        self.bucket_capacity = bucket_capacity
        self.load_factor_threshold = load_factor_threshold
        self.overflow_chain_limit = overflow_chain_limit

    def check_bucket_overflow(self, bucket_record_count: int) -> bool:
        """
        Check if a bucket has exceeded capacity.
        Used by Extendible Hashing to trigger immediate splits.
        """
        return bucket_record_count >= self.bucket_capacity

    def check_load_factor(self, total_records: int, total_buckets: int) -> bool:
        """
        Check if the global load factor exceeds the threshold.
        Used by Linear Hashing to trigger round-robin splits.
        """
        if total_buckets == 0:
            return True  # Must grow if no buckets
        # Load factor = fraction of total capacity in use: N / (B * C)
        load_factor = total_records / (total_buckets * self.bucket_capacity)
        return load_factor > self.load_factor_threshold

    def check_overflow_chain(self, overflow_page_count: int) -> bool:
        """
        Check if an overflow chain is too long.
        Long chains indicate poor distribution or an undersized index.
        """
        return overflow_page_count >= self.overflow_chain_limit

    def recommend_action(self,
                         bucket_count: int,
                         bucket_overflow: bool,
                         total_records: int,
                         total_buckets: int) -> str:
        """
        Recommend a growth action based on current state.

        Returns one of:
        - 'none': No action needed
        - 'split': Split overflowing bucket
        - 'expand': Expand hash table (add buckets)
        - 'reorganize': Full reorganization needed
        """
        load_factor_exceeded = self.check_load_factor(total_records, total_buckets)

        if bucket_overflow and load_factor_exceeded:
            return 'expand'  # Both signals indicate growth needed
        elif bucket_overflow:
            return 'split'   # Local hotspot, split this bucket
        elif load_factor_exceeded:
            return 'expand'  # Preventive expansion
        else:
            return 'none'


# Example usage demonstrating detection scenarios
detector = OverflowDetector(bucket_capacity=100, load_factor_threshold=0.8)

# Scenario 1: Single bucket nearly full, low global load
print(detector.recommend_action(
    bucket_count=95,        # 95 records in one bucket
    bucket_overflow=False,
    total_records=5000,
    total_buckets=100))
# Output: 'none' (load factor = 0.5, bucket not overflowing)

# Scenario 2: Bucket overflow triggered
print(detector.recommend_action(
    bucket_count=100,       # Full bucket
    bucket_overflow=True,
    total_records=5000,
    total_buckets=100))
# Output: 'split' (bucket overflow, but global load ok)

# Scenario 3: High global load
print(detector.recommend_action(
    bucket_count=80,
    bucket_overflow=False,
    total_records=8500,
    total_buckets=100))
# Output: 'expand' (load factor = 0.85 > 0.8)
```

The bucket split is the fundamental growth operation in dynamic hashing. When a bucket overflows, it divides into two buckets, each taking a portion of the original records based on an additional bit of the hash value.
The Split Algorithm:
Consider a bucket addressed by the k-bit hash prefix P. After splitting:
- Records whose (k+1)-bit prefix is P0 stay in the original bucket
- Records whose (k+1)-bit prefix is P1 move to a new bucket

This is remarkably elegant: no rehashing of keys is required. We simply examine one more bit of the existing hash values to determine record placement.
```python
from typing import List, Tuple, NamedTuple
from dataclasses import dataclass, field


@dataclass
class HashRecord:
    """A record in the hash index."""
    key: str
    hash_value: int    # Full 32-bit hash
    data_pointer: int  # Pointer to actual data


@dataclass
class Bucket:
    """A hash bucket that can be split."""
    id: int
    local_depth: int  # How many bits of hash we use
    records: List[HashRecord] = field(default_factory=list)
    capacity: int = 100

    def is_full(self) -> bool:
        return len(self.records) >= self.capacity

    def get_prefix(self) -> int:
        """Get the hash prefix that addresses this bucket."""
        # In practice, this comes from the directory
        # For demonstration, we derive it from the first record
        if self.records:
            return self._extract_prefix(self.records[0].hash_value)
        return 0

    def _extract_prefix(self, hash_value: int, depth: int = None) -> int:
        """Extract the prefix bits from a hash value."""
        if depth is None:
            depth = self.local_depth
        if depth == 0:
            return 0
        shift = 32 - depth
        return hash_value >> shift


class BucketSplitter:
    """
    Implements the bucket split algorithm for dynamic hashing.

    The split operation is the heart of dynamic hashing growth.
    It takes one overflowing bucket and produces two buckets,
    redistributing records based on an additional hash bit.
    """

    def split_bucket(self, bucket: Bucket, next_bucket_id: int) -> Tuple[Bucket, Bucket]:
        """
        Split a bucket into two based on the next hash bit.

        Args:
            bucket: The overflowing bucket to split
            next_bucket_id: ID to assign to the new bucket

        Returns:
            Tuple of (bucket_0, bucket_1) - the two resulting buckets
        """
        new_depth = bucket.local_depth + 1

        # Create two new buckets with increased depth
        bucket_0 = Bucket(
            id=bucket.id,  # Original bucket keeps its ID
            local_depth=new_depth,
            records=[],
            capacity=bucket.capacity
        )
        bucket_1 = Bucket(
            id=next_bucket_id,  # New bucket gets new ID
            local_depth=new_depth,
            records=[],
            capacity=bucket.capacity
        )

        # Redistribute records based on the new bit
        for record in bucket.records:
            # Extract the bit at position 'new_depth'
            # This is the leftmost bit we haven't examined yet
            bit_position = 32 - new_depth
            distinguishing_bit = (record.hash_value >> bit_position) & 1

            if distinguishing_bit == 0:
                bucket_0.records.append(record)
            else:
                bucket_1.records.append(record)

        # Log the split statistics
        print(f"Split bucket {bucket.id} (depth {bucket.local_depth} -> {new_depth})")
        print(f"  Original records: {len(bucket.records)}")
        print(f"  Bucket 0 records: {len(bucket_0.records)}")
        print(f"  Bucket 1 records: {len(bucket_1.records)}")

        return bucket_0, bucket_1

    def validate_split(self, original: Bucket, result_0: Bucket, result_1: Bucket) -> bool:
        """
        Validate that a split preserved all records correctly.

        This is crucial for correctness - no records can be
        lost or duplicated during a split.
        """
        original_keys = set(r.key for r in original.records)
        result_keys = set(r.key for r in result_0.records) | set(r.key for r in result_1.records)

        if original_keys != result_keys:
            missing = original_keys - result_keys
            extra = result_keys - original_keys
            print(f"ERROR: Split validation failed!")
            print(f"  Missing keys: {missing}")
            print(f"  Extra keys: {extra}")
            return False

        # Also verify no duplicates
        count_0 = len(result_0.records)
        count_1 = len(result_1.records)
        if count_0 + count_1 != len(original.records):
            print(f"ERROR: Record count mismatch!")
            return False

        return True


# Demonstration of bucket splitting
def demonstrate_split():
    """Show how bucket splitting redistributes records."""
    # Create a bucket with records
    bucket = Bucket(id=0, local_depth=2, capacity=4)

    # Add records with known hash values
    # Hash values designed to show split behavior:
    # Depth 2 prefix: all start with "01" (binary)
    # Depth 3 will split on the 3rd bit
    test_records = [
        # These will go to bucket_0 (3rd bit = 0): prefix 010
        HashRecord("key_a", 0b01001111_00000000_00000000_00000000, 1001),
        HashRecord("key_b", 0b01011111_00000000_00000000_00000000, 1002),
        # These will go to bucket_1 (3rd bit = 1): prefix 011
        HashRecord("key_c", 0b01101111_00000000_00000000_00000000, 1003),
        HashRecord("key_d", 0b01111111_00000000_00000000_00000000, 1004),
    ]
    for record in test_records:
        bucket.records.append(record)

    # Perform the split
    splitter = BucketSplitter()
    bucket_0, bucket_1 = splitter.split_bucket(bucket, next_bucket_id=1)

    # Validate
    assert splitter.validate_split(bucket, bucket_0, bucket_1)

    print(f"\nBucket 0 contains: {[r.key for r in bucket_0.records]}")
    print(f"Bucket 1 contains: {[r.key for r in bucket_1.records]}")

    # Output:
    # Split bucket 0 (depth 2 -> 3)
    #   Original records: 4
    #   Bucket 0 records: 2
    #   Bucket 1 records: 2
    # Bucket 0 contains: ['key_a', 'key_b']
    # Bucket 1 contains: ['key_c', 'key_d']


demonstrate_split()
```

A split might not evenly distribute records. If all records happen to have the same value for the distinguishing bit, one bucket gets everything and the other is empty. This triggers an immediate second split (or more) until records separate. Poorly designed hash functions exacerbate this problem.
Extendible Hashing uses a directory structure to map hash prefixes to buckets. This directory enables flexible growth where not all buckets need the same depth.
The Directory Structure:
The directory is an array of 2^d bucket pointers, where d is the global depth. Each directory entry points to a bucket. Multiple directory entries can point to the same bucket when that bucket uses fewer bits than the global depth.
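To make directory addressing concrete, here is a tiny illustrative snippet (the function name and bucket labels are invented for this example): the top global-depth bits of a 32-bit hash select a directory slot, and two slots can share one bucket whose local depth is smaller than the global depth.

```python
def directory_index(hash_value: int, global_depth: int) -> int:
    """Select a directory slot using the top global_depth bits of a 32-bit hash."""
    return hash_value >> (32 - global_depth)


# Global depth 2: four slots 00, 01, 10, 11. Slots 10 and 11 share one
# bucket whose local depth is only 1 (it distinguishes just the first bit).
directory = ["bucket_A", "bucket_B", "bucket_C", "bucket_C"]

h = 0b11010110_00000000_00000000_00000000
slot = directory_index(h, global_depth=2)
print(format(slot, '02b'))   # '11'
print(directory[slot])       # 'bucket_C' (a shared entry)
```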
Growth via Directory Doubling:
When a bucket with local depth equal to global depth overflows:

1. Double the directory: increment the global depth and copy each existing entry into two adjacent entries, both still pointing to their original bucket
2. Split the overflowing bucket, giving both halves a local depth one greater than before
3. Redirect half of the directory entries that pointed to the old bucket so they point to the new bucket
4. Retry the insertion

When a bucket with local depth less than global depth overflows:

1. Split the bucket without changing the directory size (several directory entries already point to it)
2. Redirect half of those entries to the new bucket
3. Retry the insertion
```python
from typing import List, Dict, Optional
from dataclasses import dataclass, field


@dataclass
class ExtendibleBucket:
    """Bucket for Extendible Hashing with local depth tracking."""
    id: int
    local_depth: int
    records: List[tuple] = field(default_factory=list)  # (key, hash_value, data_ptr)
    capacity: int = 4


class ExtendibleHashIndex:
    """
    Extendible Hashing implementation demonstrating growth handling.

    Key concepts:
    - Global depth: Number of bits used by directory
    - Local depth: Number of bits used by each bucket
    - Directory: 2^global_depth pointers to buckets
    - Multiple directory entries can point to same bucket
    """

    def __init__(self, initial_depth: int = 1, bucket_capacity: int = 4):
        self.global_depth = initial_depth
        self.bucket_capacity = bucket_capacity
        self.next_bucket_id = 0

        # Initialize directory with 2^depth entries
        self.directory: List[ExtendibleBucket] = []
        self._initialize_directory()

    def _initialize_directory(self):
        """Create initial directory with one bucket per entry."""
        num_entries = 2 ** self.global_depth
        # Initially, each entry points to its own bucket
        # In practice, we might start with fewer buckets
        for i in range(num_entries):
            bucket = ExtendibleBucket(
                id=self.next_bucket_id,
                local_depth=self.global_depth,
                capacity=self.bucket_capacity
            )
            self.next_bucket_id += 1
            self.directory.append(bucket)

    def _get_directory_index(self, hash_value: int) -> int:
        """Map hash value to directory index using global depth bits."""
        shift = 32 - self.global_depth
        return hash_value >> shift

    def insert(self, key: str, hash_value: int, data_ptr: int):
        """
        Insert a record, handling bucket overflow with splits.

        This is where growth handling happens:
        1. Find target bucket
        2. If bucket full, split (possibly doubling directory)
        3. Retry insertion
        """
        while True:
            idx = self._get_directory_index(hash_value)
            bucket = self.directory[idx]

            if len(bucket.records) < bucket.capacity:
                bucket.records.append((key, hash_value, data_ptr))
                return

            # Bucket is full - must split
            self._handle_overflow(idx, bucket)
            # Loop to retry insertion after split

    def _handle_overflow(self, dir_index: int, bucket: ExtendibleBucket):
        """
        Handle bucket overflow through splitting.

        Two cases:
        1. local_depth < global_depth: Split bucket, update pointers
        2. local_depth == global_depth: Double directory first, then split
        """
        print(f"\n>>> Handling overflow in bucket {bucket.id}")
        print(f"    Global depth: {self.global_depth}, Local depth: {bucket.local_depth}")

        if bucket.local_depth == self.global_depth:
            # Must double directory first
            self._double_directory()

        # Now split the bucket
        self._split_bucket(bucket)

    def _double_directory(self):
        """
        Double the directory size by increasing global depth.

        Each existing entry is duplicated. The entry at index i
        is copied to index 2i and 2i+1. Both point to same bucket
        until a split happens.
        """
        old_size = len(self.directory)
        self.global_depth += 1

        print(f"    Doubling directory: {old_size} -> {old_size * 2} entries")

        # Create new directory with doubled size
        new_directory = []
        for bucket in self.directory:
            # Each bucket now has two directory entries
            new_directory.append(bucket)
            new_directory.append(bucket)

        self.directory = new_directory

    def _split_bucket(self, bucket: ExtendibleBucket):
        """
        Split a bucket into two based on the next hash bit.

        Steps:
        1. Create new bucket with increased local depth
        2. Redistribute records based on distinguishing bit
        3. Update directory entries to point to correct bucket
        """
        new_depth = bucket.local_depth + 1

        # Create new bucket
        new_bucket = ExtendibleBucket(
            id=self.next_bucket_id,
            local_depth=new_depth,
            capacity=self.bucket_capacity
        )
        self.next_bucket_id += 1

        # Update original bucket's depth
        old_local_depth = bucket.local_depth
        bucket.local_depth = new_depth

        # Redistribute records
        records_to_redistribute = bucket.records[:]
        bucket.records = []

        for key, hash_val, data_ptr in records_to_redistribute:
            # Check the distinguishing bit
            bit_position = 32 - new_depth
            distinguishing_bit = (hash_val >> bit_position) & 1

            if distinguishing_bit == 0:
                bucket.records.append((key, hash_val, data_ptr))
            else:
                new_bucket.records.append((key, hash_val, data_ptr))

        print(f"    Split bucket {bucket.id}: {len(records_to_redistribute)} records")
        print(f"      -> Bucket {bucket.id}: {len(bucket.records)} records")
        print(f"      -> Bucket {new_bucket.id}: {len(new_bucket.records)} records")

        # Update directory entries
        self._update_directory_pointers(bucket, new_bucket, old_local_depth)

    def _update_directory_pointers(self, bucket: ExtendibleBucket,
                                   new_bucket: ExtendibleBucket,
                                   old_depth: int):
        """
        Update directory entries after a split.

        Find all entries that pointed to the old bucket and update
        half of them to point to the new bucket.
        """
        for i in range(len(self.directory)):
            if self.directory[i].id == bucket.id:
                # Check if this entry should point to new bucket
                # Based on the bit at the new depth position
                bit_position = 32 - bucket.local_depth
                entry_prefix = i << (32 - self.global_depth)
                distinguishing_bit = (entry_prefix >> bit_position) & 1

                if distinguishing_bit == 1:
                    self.directory[i] = new_bucket

    def display_state(self):
        """Display current state of the hash index."""
        print(f"\n=== Extendible Hash Index State ===")
        print(f"Global depth: {self.global_depth}")
        print(f"Directory size: {len(self.directory)}")
        print(f"\nDirectory entries:")
        seen_buckets = set()
        for i, bucket in enumerate(self.directory):
            prefix = format(i, f'0{self.global_depth}b')
            if bucket.id not in seen_buckets:
                print(f"  [{prefix}] -> Bucket {bucket.id} "
                      f"(depth={bucket.local_depth}, records={len(bucket.records)})")
                seen_buckets.add(bucket.id)
            else:
                print(f"  [{prefix}] -> Bucket {bucket.id} (shared)")


# Demonstration
def demonstrate_extendible_growth():
    """Show how Extendible Hashing grows with insertions."""
    index = ExtendibleHashIndex(initial_depth=1, bucket_capacity=2)

    print("Initial state:")
    index.display_state()

    # Insert records that will cause splits
    test_data = [
        ("emp1", 0b00101010_11001100_11110000_10101010),  # prefix 0
        ("emp2", 0b00110011_01010101_00001111_01010101),  # prefix 0
        ("emp3", 0b01010101_10101010_11110000_11001100),  # prefix 0 (overflow!)
        ("emp4", 0b10101010_00110011_00001111_10101010),  # prefix 1
        ("emp5", 0b11001100_11110000_10101010_01010101),  # prefix 1
    ]

    for key, hash_val in test_data:
        print(f"\nInserting {key} (hash prefix: {format(hash_val >> 30, '02b')})")
        index.insert(key, hash_val, id(key))

    print("\nFinal state:")
    index.display_state()


demonstrate_extendible_growth()
```

A key insight of Extendible Hashing: when the directory doubles, the number of buckets doesn't necessarily double. Only the overflowing bucket splits. Other buckets simply have more directory entries pointing to them. This minimizes data movement during growth.
Linear Hashing takes a fundamentally different approach to growth. Instead of a directory, it uses a split pointer that cycles through buckets in order, splitting them one at a time regardless of which bucket overflowed.
The Split Pointer Mechanism:
Linear Hashing maintains:

- A level L that defines the base hash function h_L(key) = hash(key) mod 2^L
- A split pointer marking the next bucket to be split (initially bucket 0)
- Two hash functions during expansion: h_L for buckets the pointer has not yet reached, and h_{L+1} for buckets already split this round

When a split is triggered:

1. A new bucket is added at index split_pointer + 2^L
2. The records in the bucket at split_pointer are redistributed between it and the new bucket using h_{L+1}
3. The split pointer advances by one
4. When the pointer reaches 2^L, the round is complete: the level becomes L+1 and the pointer resets to 0
This creates perfectly even growth—every bucket gets split exactly once per level.
```python
from typing import List, Optional
from dataclasses import dataclass, field


@dataclass
class LinearBucket:
    """Bucket for Linear Hashing."""
    records: List[tuple] = field(default_factory=list)  # (key, hash_value, data_ptr)
    overflow: Optional['LinearBucket'] = None  # Overflow chain
    capacity: int = 4

    def add(self, record: tuple) -> bool:
        """Add record, using overflow chain if needed."""
        if len(self.records) < self.capacity:
            self.records.append(record)
            return True
        # Add to overflow chain
        if self.overflow is None:
            self.overflow = LinearBucket(capacity=self.capacity)
        return self.overflow.add(record)

    def get_all_records(self) -> List[tuple]:
        """Get all records including overflow chain."""
        result = self.records[:]
        if self.overflow:
            result.extend(self.overflow.get_all_records())
        return result

    def clear(self):
        """Clear all records including overflow."""
        self.records = []
        self.overflow = None


class LinearHashIndex:
    """
    Linear Hashing implementation demonstrating growth handling.

    Key characteristics:
    - No directory structure (space efficient)
    - Predictable, round-robin bucket expansion
    - Two hash functions active during expansion
    - Overflow chains for temporary capacity
    """

    def __init__(self, initial_buckets: int = 4, bucket_capacity: int = 4,
                 load_factor_threshold: float = 0.75):
        # Initial state: 2^level buckets
        self.level = 0
        while (1 << self.level) < initial_buckets:
            self.level += 1

        initial_count = 1 << self.level  # 2^level
        self.buckets: List[LinearBucket] = [
            LinearBucket(capacity=bucket_capacity) for _ in range(initial_count)
        ]

        self.split_pointer = 0  # Next bucket to split
        self.bucket_capacity = bucket_capacity
        self.load_threshold = load_factor_threshold
        self.total_records = 0

    def _hash_level(self, hash_value: int, level: int) -> int:
        """Compute hash at given level: hash mod 2^level."""
        return hash_value & ((1 << level) - 1)

    def _get_bucket_index(self, hash_value: int) -> int:
        """
        Determine which bucket a key belongs to.

        Linear Hashing uses two hash functions during expansion:
        - h_L: For buckets not yet split this round
        - h_{L+1}: For buckets already split this round
        """
        # First, try hash at current level
        idx = self._hash_level(hash_value, self.level)

        if idx < self.split_pointer:
            # This bucket has been split - use next level hash
            idx = self._hash_level(hash_value, self.level + 1)

        return idx

    def _current_load_factor(self) -> float:
        """Calculate current load factor as a fraction of total capacity."""
        return self.total_records / (len(self.buckets) * self.bucket_capacity)

    def insert(self, key: str, hash_value: int, data_ptr: int):
        """
        Insert a record, triggering split if load factor exceeded.

        Unlike Extendible Hashing, the split happens to the bucket
        at split_pointer, not necessarily the bucket we're inserting into.
        """
        # Find bucket and insert
        idx = self._get_bucket_index(hash_value)
        self.buckets[idx].add((key, hash_value, data_ptr))
        self.total_records += 1

        # Check if split needed based on load factor
        if self._current_load_factor() > self.load_threshold:
            self._split_next_bucket()

    def _split_next_bucket(self):
        """
        Split the bucket at split_pointer (round-robin).

        This is the distinctive feature of Linear Hashing:
        splits happen in order, not based on which bucket overflowed.
        """
        bucket_to_split = self.split_pointer
        old_bucket = self.buckets[bucket_to_split]

        # Calculate the new bucket index
        # New bucket will have index: split_pointer + 2^level
        new_bucket_idx = self.split_pointer + (1 << self.level)

        print(f"\n>>> Linear Hashing split")
        print(f"    Level: {self.level}, Split pointer: {self.split_pointer}")
        print(f"    Splitting bucket {bucket_to_split} into {bucket_to_split} and {new_bucket_idx}")

        # Create new bucket
        new_bucket = LinearBucket(capacity=self.bucket_capacity)
        self.buckets.append(new_bucket)

        # Redistribute records
        all_records = old_bucket.get_all_records()
        old_bucket.clear()

        for key, hash_val, data_ptr in all_records:
            # Use next-level hash to determine placement
            new_idx = self._hash_level(hash_val, self.level + 1)
            if new_idx == bucket_to_split:
                old_bucket.add((key, hash_val, data_ptr))
            else:  # new_idx == new_bucket_idx
                new_bucket.add((key, hash_val, data_ptr))

        print(f"    Records: {len(all_records)} -> "
              f"{len(old_bucket.get_all_records())} + {len(new_bucket.get_all_records())}")

        # Advance split pointer
        self.split_pointer += 1

        # Check if round is complete
        if self.split_pointer >= (1 << self.level):
            print(f"    Round complete! Advancing level {self.level} -> {self.level + 1}")
            self.level += 1
            self.split_pointer = 0

    def display_state(self):
        """Display current state of the hash index."""
        print(f"\n=== Linear Hash Index State ===")
        print(f"Level: {self.level}, Split pointer: {self.split_pointer}")
        print(f"Buckets: {len(self.buckets)}, Records: {self.total_records}")
        print(f"Load factor: {self._current_load_factor():.2f}")
        print(f"\nBucket contents:")
        for i, bucket in enumerate(self.buckets):
            records = bucket.get_all_records()
            status = "(already split)" if i < self.split_pointer else "(pending split)"
            print(f"  Bucket {i} {status}: {len(records)} records")


# Demonstration
def demonstrate_linear_growth():
    """Show how Linear Hashing grows with insertions."""
    # Start with 2 buckets (level 1), capacity 2
    index = LinearHashIndex(initial_buckets=2, bucket_capacity=2,
                            load_factor_threshold=0.7)

    print("Initial state:")
    index.display_state()

    # Insert records
    import random
    random.seed(42)

    for i in range(10):
        key = f"key_{i}"
        hash_val = random.getrandbits(32)
        print(f"\nInserting {key}")
        index.insert(key, hash_val, i)

    print("\nFinal state:")
    index.display_state()


demonstrate_linear_growth()
```

Because Linear Hashing may split a bucket other than the one that overflowed, it must support overflow chains. These chains are temporary—the overflowing bucket will eventually be split when the pointer reaches it. This is acceptable because chain length is bounded by the time until the pointer cycles around.
Both Extendible and Linear Hashing achieve dynamic growth, but their strategies have distinct trade-offs:
| Aspect | Extendible Hashing | Linear Hashing |
|---|---|---|
| Trigger | Bucket overflow | Load factor threshold |
| Split target | Overflowing bucket | Next bucket in sequence |
| Directory | Required (2^d entries) | Not needed |
| Overflow chains | None (split eliminates) | Temporary (until split) |
| Lookup complexity | O(1) always | O(1) average, O(chain) worst |
| Space overhead | Directory (can be large) | Minimal |
| Split frequency | Only when needed | At regular intervals |
| Best for | Skewed distributions | Uniform distributions |
Choosing Between Strategies:
Use Extendible Hashing when lookup performance is critical and directory memory is acceptable. It handles skewed data better because only hot buckets split.
Use Linear Hashing when memory is constrained or concurrency requirements are demanding. Its predictable growth pattern simplifies implementation.
Use hybrid approaches in production systems that need the best of both—PostgreSQL's hash indexes, for instance, use variations that combine features of both schemes.
Not all splits are successful on the first attempt. Cascading splits occur when a bucket split doesn't adequately distribute records, requiring immediate additional splits.
The Degenerate Case:
Imagine inserting records where the hash function produces values with many leading zeros:
hash("key1") = 0b00000001_...
hash("key2") = 0b00000010_...
hash("key3") = 0b00000001_...
hash("key4") = 0b00000011_...
At depth 1, all records go to bucket 0. After splitting (depth 2), records still share prefix 00. This triggers another split. The process continues until records finally separate—potentially requiring many consecutive splits.
```python
def demonstrate_cascading_splits():
    """
    Show how poor hash distribution causes cascading splits.

    This is a pathological case that well-designed hash functions
    avoid, but it illustrates why hash function quality matters.
    """
    # Records with hash values sharing many prefix bits
    # All start with 0b00000000 (8 leading zeros)
    pathological_records = [
        ("user_001", 0b00000000_11111111_00001111_10101010),
        ("user_002", 0b00000000_11110000_00001111_10101010),
        ("user_003", 0b00000000_11001100_00001111_10101010),
        ("user_004", 0b00000000_10101010_00001111_10101010),
        ("user_005", 0b00000000_01010101_00001111_10101010),
    ]

    bucket_capacity = 2

    print("Demonstrating cascading splits with pathological data")
    print(f"Bucket capacity: {bucket_capacity}")
    print(f"Records to insert: {len(pathological_records)}")
    print()

    def count_shared_prefix_bits(records):
        """Count how many prefix bits all records share."""
        if not records:
            return 32
        # XOR all hash values - shared bits will be 0
        reference = records[0][1]
        xor_result = 0
        for _, hash_val in records:
            xor_result |= (reference ^ hash_val)

        # Count leading zeros in xor_result
        shared = 0
        for bit in range(31, -1, -1):
            if (xor_result >> bit) & 1:
                break
            shared += 1
        return shared

    shared_bits = count_shared_prefix_bits(pathological_records)
    print(f"All records share {shared_bits} prefix bits")
    print(f"First split will occur at depth 1")
    print(f"Minimum depth for separation: {shared_bits + 1}")
    print()

    # Simulate splits
    depth = 1
    bucket_records = pathological_records[:]

    while len(bucket_records) > bucket_capacity:
        print(f"Depth {depth}: Bucket has {len(bucket_records)} records (capacity {bucket_capacity})")
        print(f"  -> Splitting at depth {depth}")

        # Split based on bit at current depth
        bucket_0 = []
        bucket_1 = []
        bit_position = 32 - depth

        for key, hash_val in bucket_records:
            bit = (hash_val >> bit_position) & 1
            if bit == 0:
                bucket_0.append((key, hash_val))
            else:
                bucket_1.append((key, hash_val))

        print(f"  -> Bucket 0: {len(bucket_0)} records")
        print(f"  -> Bucket 1: {len(bucket_1)} records")

        # Check which bucket still overflows
        if len(bucket_0) > bucket_capacity:
            print(f"  -> Bucket 0 still overflows! Cascading split needed.")
            bucket_records = bucket_0
        elif len(bucket_1) > bucket_capacity:
            print(f"  -> Bucket 1 still overflows! Cascading split needed.")
            bucket_records = bucket_1
        else:
            print(f"  -> Split successful! Both buckets within capacity.")
            break

        depth += 1

    print(f"\nFinal depth after cascading: {depth}")
    print(f"Total splits required: {depth}")


demonstrate_cascading_splits()

# Output:
# All records share 8 prefix bits
# First split will occur at depth 1
# Minimum depth for separation: 9
#
# Depth 1: Bucket has 5 records (capacity 2)
#   -> Splitting at depth 1
#   -> Bucket 0: 5 records
#   -> Bucket 1: 0 records
#   -> Bucket 0 still overflows! Cascading split needed.
# ... (cascading continues until the records finally separate)
```

Cascading splits are symptoms of poor hash function design or adversarial input. Production systems mitigate this by: (1) using cryptographic-quality hash functions with uniform distribution, (2) implementing maximum depth limits with overflow to avoid unbounded cascading, and (3) monitoring for abnormal split patterns that might indicate hash collision attacks.
In production database systems, multiple transactions operate concurrently. Growth operations must not corrupt the index or cause incorrect query results.
Concurrency Challenges:
Readers during splits: A query might access a bucket while it's being split. The query must see either the pre-split or post-split state—never an inconsistent mix.
Writers during splits: Multiple inserts to the same bucket might both trigger splits. Only one split should occur.
Directory updates: In Extendible Hashing, directory doubling affects all lookups. This global operation requires careful synchronization.
Common Solutions:

- Bucket-level latches: each bucket carries its own short-term lock, so a split blocks access only to the buckets it touches
- Latch coupling: a lookup briefly latches the directory (or growth metadata) to compute the bucket address, then latches the bucket before releasing the first latch, so it never observes a half-updated mapping
- Copy-on-write directory doubling: the enlarged directory is built beside the old one and swapped in atomically, so readers see either the old or the new directory, never a mix
- Split serialization: a per-bucket flag ensures that concurrent inserts into the same full bucket trigger only one split; the others simply retry
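To make the bucket-latch idea concrete, here is a minimal sketch (not any particular system's implementation; the class names, the single metadata lock, and the omission of the split path itself are all simplifications for illustration). The insert path latches only the bucket it touches, so operations on other buckets proceed in parallel.

```python
import threading
from typing import List, Tuple


class LatchedBucket:
    """Hypothetical bucket that carries its own latch."""

    def __init__(self, capacity: int = 4) -> None:
        self.latch = threading.Lock()
        self.records: List[Tuple[str, int]] = []
        self.capacity = capacity


class ConcurrentIndexSketch:
    """
    Minimal sketch: a small metadata lock guards level/split_pointer,
    while each bucket has its own latch, so a split (not shown) would
    only need to latch the one or two buckets it touches.
    """

    def __init__(self, bucket_capacity: int = 4) -> None:
        self.meta = threading.Lock()  # guards level and split_pointer
        self.level = 1
        self.split_pointer = 0
        self.buckets: List[LatchedBucket] = [
            LatchedBucket(bucket_capacity), LatchedBucket(bucket_capacity)
        ]

    def _index(self, h: int) -> int:
        idx = h & ((1 << self.level) - 1)
        if idx < self.split_pointer:          # bucket already split this round
            idx = h & ((1 << (self.level + 1)) - 1)
        return idx

    def insert(self, key: str, h: int) -> None:
        with self.meta:                       # short critical section: compute the address
            bucket = self.buckets[self._index(h)]
        with bucket.latch:                    # only this bucket is blocked
            bucket.records.append((key, h))   # overflow/split handling omitted


# Concurrent inserts contend only when they hit the same bucket
index = ConcurrentIndexSketch()
threads = [threading.Thread(target=index.insert, args=(f"k{i}", hash(f"k{i}") & 0xFFFFFFFF))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(len(b.records) for b in index.buckets))  # 8
```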
Why Linear Hashing Simplifies Concurrency:
Linear Hashing's predictable split pattern provides concurrency advantages:
Known split target: The split pointer tells us exactly which bucket will split next. Other buckets can be accessed freely.
No directory contention: Without a shared directory structure, there's no global bottleneck.
Bounded overflow: Even if splits lag behind, overflow chain length is bounded by the split pointer's progress.
These properties make Linear Hashing popular in concurrent environments where predictability is valued over optimal space utilization.
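For a sense of why the read path stays simple, here is a brief illustrative sketch (the SplitState class is an invented stand-in for whatever growth metadata a real system keeps): a lookup only needs a consistent copy of two integers, the level and the split pointer, rather than access to any shared directory structure.

```python
import threading


class SplitState:
    """Invented stand-in for Linear Hashing growth metadata."""

    def __init__(self, level: int = 2, split_pointer: int = 1) -> None:
        self._lock = threading.Lock()
        self.level = level
        self.split_pointer = split_pointer

    def snapshot(self) -> tuple:
        # A reader copies two integers under a very short lock (or an
        # atomic read in lower-level languages) and then proceeds on its own.
        with self._lock:
            return self.level, self.split_pointer


def bucket_for(hash_value: int, level: int, split_pointer: int) -> int:
    """Address a bucket from the snapshot alone; no directory is consulted."""
    idx = hash_value & ((1 << level) - 1)
    if idx < split_pointer:                   # bucket already split this round
        idx = hash_value & ((1 << (level + 1)) - 1)
    return idx


state = SplitState()
level, pointer = state.snapshot()
print(bucket_for(0b1011, level, pointer))     # 3: computed without any shared directory
```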
PostgreSQL's hash index uses a variant that combines directory-style addressing with careful locking protocols. Oracle Database provides hash clusters that use Linear Hashing principles. Both implement sophisticated concurrency controls that go beyond simple locking—including buffer management integration and write-ahead logging for crash recovery.
We've examined the complete growth handling lifecycle in dynamic hashing—from detecting overflow to executing splits to managing concurrent access. Here are the key insights:

- Growth is triggered locally by bucket overflow (Extendible Hashing) or globally by a load factor threshold (Linear Hashing)
- A split redistributes records by examining one additional bit of the existing hash values; no keys are rehashed
- Extendible Hashing doubles its directory only when the overflowing bucket's local depth equals the global depth, and even then only that bucket splits
- Linear Hashing splits buckets in round-robin order under a split pointer, tolerating bounded, temporary overflow chains
- Skewed hash values can cause cascading splits, which good hash functions and depth limits keep in check
- Concurrent access requires careful latching so readers and writers always see a consistent pre-split or post-split state
What's Next:
Growth is only half the story. When data is deleted, indexes should reclaim space by shrinking. The next page explores shrinkage handling—how dynamic hashing systems merge underutilized buckets, reduce directory size, and maintain efficiency as data volumes decrease.
You now understand how dynamic hashing systems grow gracefully under insertion load. You can trace the complete split process for both Extendible and Linear Hashing, and you appreciate the concurrency considerations that production implementations must address.