We've established that linear hashing grows by splitting buckets one at a time, always following the split pointer. But a crucial question remains: when should a split occur?
Unlike static hash tables with fixed capacity, dynamic hash tables must decide when to expand. This decision—the split trigger—profoundly impacts performance. Trigger too often, and you waste space and split overhead. Trigger too rarely, and overflow chains grow, degrading lookup performance.
The split trigger is where theory meets engineering pragmatism. Different trigger policies offer different trade-offs, and understanding these trade-offs is essential for implementing high-performance linear hash tables.
By the end of this page, you will understand multiple split trigger strategies, their mathematical properties, performance implications, and implementation considerations. You'll learn how to select and tune trigger policies for specific workload characteristics.
A split trigger is a condition or policy that determines when the linear hash table should execute a split operation. When the trigger condition becomes true, a split is initiated on bucket S (the next bucket in sequence).
Key Observation: Deferred Splitting
In linear hashing, the bucket that triggers a split is not necessarily the bucket that gets split. If bucket 7 overflows while the split pointer S = 3, then bucket 3 splits, not bucket 7. Bucket 7's overflow persists until the split pointer eventually reaches it. The main families of trigger policies are summarized below:
| Category | Trigger Condition | Key Characteristic | Use Case |
|---|---|---|---|
| Load Factor | Global load exceeds threshold | Maintains average performance | General-purpose |
| Overflow-Based | Any overflow page created | Controls worst-case access | Latency-sensitive |
| Bucket Overflow | Specific bucket overflows | Immediate response to hot spots | Skewed workloads |
| Periodic | After N insertions | Predictable growth | Batch processing |
| Hybrid | Combination of above | Balanced trade-offs | Production systems |
A fundamental property of linear hashing is that the split trigger and split target are independent. The trigger decides WHEN to split; the split pointer decides WHICH bucket to split. This separation is what enables directory-free addressing—but it also means overflow in one bucket may persist for many splits before being addressed.
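This addressing rule can be sketched in a few lines (a minimal illustration; `N0` is the initial bucket count, `L` the current level, and `S` the split pointer, matching the symbols used throughout this page):

```python
def resolve_bucket(key, N0, L, S):
    """Map a key to its bucket given level L and split pointer S.

    Buckets below S have already been split, so they are addressed
    with the next-level hash h_{L+1}; all others still use h_L.
    """
    b = hash(key) % (N0 * 2 ** L)            # h_L(key)
    if b < S:
        b = hash(key) % (N0 * 2 ** (L + 1))  # h_{L+1}(key)
    return b
```

For example, with N0 = 4, L = 0, S = 2, a key whose hash is 5 maps first to bucket 1; since 1 < S, it is re-hashed modulo 8 and lands in bucket 5.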
The most common split trigger is based on load factor (α)—the ratio of stored records to bucket count. When the load factor exceeds a threshold, a split is triggered.
Definition:
α = n / N
Where n is the number of stored records and N is the current number of buckets.
Trigger Condition:
Split when α > α_max (typically 0.7 to 0.9)
```python
class LinearHashTable:
    def __init__(self, initial_buckets=4, max_load_factor=0.80):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
        self.record_count = 0
        self.max_load_factor = max_load_factor

    def _resolve_bucket(self, key) -> int:
        """Standard linear hash addressing: use h_L, upgrading to
        h_{L+1} for buckets already passed by the split pointer."""
        b = hash(key) % (self.N0 * (2 ** self.L))
        if b < self.S:
            b = hash(key) % (self.N0 * (2 ** (self.L + 1)))
        return b

    def _bucket_count(self) -> int:
        """Current number of buckets."""
        return self.N0 * (2 ** self.L) + self.S

    def _load_factor(self) -> float:
        """Current load factor α = n / N."""
        return self.record_count / self._bucket_count()

    def _should_split(self) -> bool:
        """
        Trigger split when load factor exceeds threshold.
        This is the classic load factor trigger policy.
        """
        return self._load_factor() > self.max_load_factor

    def insert(self, key, value):
        """Insert with load factor-based split trigger."""
        bucket = self._resolve_bucket(key)
        # Insert record
        self.buckets[bucket].append((key, value))
        self.record_count += 1
        # Check trigger AFTER insertion
        if self._should_split():
            self._perform_split()

    def _perform_split(self):
        """Execute split on bucket S."""
        # [Split implementation as before]
        sibling = self.S + self.N0 * (2 ** self.L)
        self.buckets.append([])
        old_records = self.buckets[self.S]
        self.buckets[self.S] = []
        next_range = self.N0 * (2 ** (self.L + 1))
        for k, v in old_records:
            # Records redistribute between bucket S and its sibling
            new_bucket = hash(k) % next_range
            self.buckets[new_bucket].append((k, v))
        self.S += 1
        if self.S >= self.N0 * (2 ** self.L):
            self.S = 0
            self.L += 1
```

Mathematical Analysis:
Let's analyze load factor behavior during linear growth:
Load Factor Oscillation:
Load Factor α
▲
α_max ┼─────●─────●─────●─────●
│ /│ /│ /│ /│
│ / │ / │ / │ / │
│ / │ / │ / │ / │
α_max/2─● ● ● ● ●
└──────────────────────────▶ Time / Insertions
L=0 L=1 L=2 L=3
The load factor saw-tooth pattern arises when splits lag behind insertions: if an entire level's worth of splits completes in a burst, doubling the bucket count drops the load factor to approximately half. With a per-insertion load factor trigger, each split adds only one bucket, so in practice α oscillates in a much narrower band just below α_max.
An alternative trigger strategy focuses on overflow control—triggering splits when buckets overflow their primary capacity. This approach directly controls worst-case access time rather than average performance.
Overflow Page Trigger:
When any bucket requires an overflow page (chaining beyond primary capacity), trigger a split.
```python
class LinearHashTableOverflowTrigger:
    def __init__(self, initial_buckets=4, bucket_capacity=10):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
        self.bucket_capacity = bucket_capacity
        self.overflow_count = 0  # Track how many buckets have overflow

    def _bucket_has_overflow(self, bucket_idx: int) -> bool:
        """Check if bucket exceeds primary capacity."""
        return len(self.buckets[bucket_idx]) > self.bucket_capacity

    def _should_split(self, inserted_bucket: int) -> bool:
        """
        Trigger split when an overflow condition occurs.

        Can be configured for different overflow policies:
        1. Any bucket overflows
        2. Total overflow pages exceed threshold
        3. Inserted bucket overflows
        """
        # Policy 3: the inserted bucket caused overflow
        if self._bucket_has_overflow(inserted_bucket):
            return True
        return False

    def _count_total_overflow(self) -> int:
        """Count total records in overflow (beyond primary capacity)."""
        total = 0
        for bucket in self.buckets:
            if len(bucket) > self.bucket_capacity:
                total += len(bucket) - self.bucket_capacity
        return total

    def insert(self, key, value):
        """Insert with overflow-based split trigger."""
        bucket = self._resolve_bucket(key)
        self.buckets[bucket].append((key, value))
        # Check for overflow trigger
        if self._should_split(bucket):
            self._perform_split()
            # May need multiple splits for severe overflow
            while self._count_total_overflow() > self.bucket_capacity:
                self._perform_split()
```

| Variant | Trigger Condition | Pros | Cons |
|---|---|---|---|
| Single Overflow | Any single bucket overflows | Controls worst case strictly | May trigger too aggressively |
| Overflow Count | Total overflow exceeds threshold | Tolerates some overflow | Needs global counting |
| Overflow Ratio | Overflow buckets > X% of total | Normalized across sizes | Complex to compute |
| Overflow Chain | Any chain exceeds K pages | Limits I/O for any lookup | May allow many short chains |
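The "Overflow Chain" variant from the table above can be sketched as a simple page-count check (the names `bucket_capacity` and `max_chain_pages` are illustrative, not part of any standard API):

```python
def chain_pages(records_in_bucket: int, bucket_capacity: int) -> int:
    """Number of overflow pages a bucket of this size needs,
    assuming each overflow page also holds bucket_capacity records."""
    excess = records_in_bucket - bucket_capacity
    if excess <= 0:
        return 0
    return -(-excess // bucket_capacity)  # ceiling division

def chain_trigger(records_in_bucket: int, bucket_capacity: int,
                  max_chain_pages: int = 2) -> bool:
    """Split when any bucket's overflow chain exceeds K pages."""
    return chain_pages(records_in_bucket, bucket_capacity) > max_chain_pages
```

Because the check only needs the length of the bucket that just received an insert, it costs O(1) per insertion while still bounding the I/O of any single lookup.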
With overflow triggers, splitting bucket S doesn't necessarily reduce overflow in the bucket that triggered the split. If bucket 7 overflows and S = 2, splitting bucket 2 reduces overall load but bucket 7's overflow remains until S reaches 7. This can lead to multiple consecutive splits when a single bucket severely overflows.
Controlling Cascade Splits:
When using overflow triggers, a single insertion might cause multiple splits: because splits target bucket S rather than the overflowing bucket, the overflow condition can persist through several rounds before it is relieved. To prevent excessive splitting, cap the number of cascade splits performed per insertion:
```python
def insert_with_cascade_limit(self, key, value, max_splits=3):
    bucket = self._resolve_bucket(key)
    self.buckets[bucket].append((key, value))
    splits_performed = 0
    while self._should_split(bucket) and splits_performed < max_splits:
        self._perform_split()
        splits_performed += 1
    if splits_performed == max_splits:
        # Log warning: cascade limit reached
        pass
```
Beyond reactive triggers (load factor, overflow), some systems use proactive or controlled splitting policies that provide more predictable resource consumption.
Periodic Splitting:
Split after every K insertions, regardless of load factor or overflow.
```python
class LinearHashTablePeriodicTrigger:
    def __init__(self, initial_buckets=4, split_interval=100):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
        self.split_interval = split_interval
        self.insertions_since_split = 0

    def _should_split(self) -> bool:
        """
        Trigger split after fixed number of insertions.

        Advantages:
        - Very predictable resource usage
        - Easy capacity planning
        - No need to track overflow or count records
        """
        return self.insertions_since_split >= self.split_interval

    def insert(self, key, value):
        """Insert with periodic split trigger."""
        bucket = self._resolve_bucket(key)
        self.buckets[bucket].append((key, value))
        self.insertions_since_split += 1
        if self._should_split():
            self._perform_split()
            self.insertions_since_split = 0

    def tune_interval(self, target_load_factor=0.75):
        """
        Dynamically adjust split interval to maintain target load.

        If load factor is consistently high, decrease interval.
        If load factor is consistently low, increase interval.
        """
        current_load = self._load_factor()
        if current_load > target_load_factor * 1.1:
            # Load too high: split more frequently
            self.split_interval = max(10, int(self.split_interval * 0.9))
        elif current_load < target_load_factor * 0.8:
            # Load too low: split less frequently
            self.split_interval = min(1000, int(self.split_interval * 1.1))
```

Production systems often decouple split execution from insertion. A background thread monitors load factor and performs splits during low-activity periods. This spreads split cost evenly and prevents insertion latency spikes. The trade-off is slightly higher average load factor during peak times.
Real-world systems often combine multiple trigger strategies to balance different requirements. A hybrid trigger uses multiple conditions in conjunction or disjunction.
Common Hybrid Approaches:
```python
class LinearHashTableHybridTrigger:
    def __init__(
        self,
        initial_buckets=4,
        soft_load_threshold=0.75,   # Soft trigger threshold
        hard_load_threshold=0.95,   # Hard trigger threshold
        max_overflow_pages=5,       # Overflow limit
        bucket_capacity=10
    ):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]

        # Trigger parameters
        self.soft_load = soft_load_threshold
        self.hard_load = hard_load_threshold
        self.max_overflow = max_overflow_pages
        self.bucket_capacity = bucket_capacity

        # Tracking
        self.record_count = 0

    def _load_factor(self) -> float:
        return self.record_count / len(self.buckets)

    def _total_overflow_pages(self) -> int:
        """Count overflow pages needed across all buckets."""
        overflow = 0
        for bucket in self.buckets:
            excess = len(bucket) - self.bucket_capacity
            if excess > 0:
                # Ceiling division: pages needed for the excess records
                overflow += (excess + self.bucket_capacity - 1) // self.bucket_capacity
        return overflow

    def _should_split(self, inserted_bucket: int) -> tuple[bool, str]:
        """
        Hybrid trigger with multiple conditions.
        Returns (should_split, reason) for logging/debugging.
        """
        load = self._load_factor()
        overflow = self._total_overflow_pages()
        bucket_overflow = len(self.buckets[inserted_bucket]) > self.bucket_capacity

        # Hard triggers: MUST split
        if load >= self.hard_load:
            return True, "hard_load_threshold"
        if overflow >= self.max_overflow:
            return True, "max_overflow_pages"

        # Soft trigger: split if convenient
        if load >= self.soft_load and bucket_overflow:
            return True, "soft_load_with_overflow"

        # No trigger
        return False, "none"

    def insert(self, key, value):
        """Insert with hybrid trigger."""
        bucket = self._resolve_bucket(key)
        self.buckets[bucket].append((key, value))
        self.record_count += 1
        should_split, reason = self._should_split(bucket)
        if should_split:
            self._perform_split()
            # Log: f"Split triggered by: {reason}"
```

| Pattern | Logic | Effect |
|---|---|---|
| Soft + Hard Load | Split at soft if overflow, always at hard | Responds to hot spots while bounding load |
| Load OR Overflow | Split if either threshold exceeded | Dual protection against degradation |
| Load AND Overflow | Split only if both conditions met | Conservative splitting, saves resources |
| Tiered Response | Single split at soft, multi-split at hard | Graduated response to load increase |
| Hysteresis | Split at α_high, stop at α_low | Prevents split/unsplit oscillation |
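The hysteresis pattern from the last row deserves a concrete sketch, since it is the one pattern not covered by the hybrid class above; the threshold values here are illustrative:

```python
class HysteresisTrigger:
    """Start splitting when α rises above alpha_high; keep answering
    'split' on subsequent checks until α falls below alpha_low.
    The gap between the two thresholds prevents rapid on/off
    oscillation around a single threshold."""

    def __init__(self, alpha_high=0.85, alpha_low=0.70):
        self.alpha_high = alpha_high
        self.alpha_low = alpha_low
        self.splitting = False  # currently in a splitting phase?

    def should_split(self, load_factor: float) -> bool:
        if self.splitting:
            if load_factor < self.alpha_low:
                self.splitting = False  # exit splitting phase
        elif load_factor > self.alpha_high:
            self.splitting = True       # enter splitting phase
        return self.splitting
```

Note that between α_low and α_high the answer depends on history: the table keeps splitting if it was already splitting, and stays idle otherwise.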
For most production systems, a hybrid trigger combining load factor (soft: 0.75, hard: 0.90) with overflow limit (max 3-5 pages) provides good balance. The soft trigger handles normal growth gracefully, while the hard trigger and overflow limit prevent pathological cases.
Selecting optimal trigger thresholds requires understanding the workload and performance requirements. There's no universally optimal setting—the best choice depends on access patterns, latency requirements, and memory constraints.
Key Tuning Parameters:
| Parameter | Low Value Effect | High Value Effect | Tuning Guidance |
|---|---|---|---|
| α_max (load threshold) | More splits, shorter chains, higher space | Fewer splits, longer chains, better space | Start at 0.75, adjust based on lookup latency |
| Bucket capacity | More overflow pages, more splits | Fewer overflow, delayed splits | Match to disk page size for disk-based |
| Overflow threshold | Aggressive splitting | Tolerates long chains | Lower if worst-case latency matters |
| Split interval (periodic) | Continuous splitting, low load | Burst splitting, variable load | Match to batch sizes if applicable |
Workload-Specific Recommendations:
Read-Heavy Workloads: favor a lower α_max (around 0.70) and strict overflow limits; shorter chains speed every lookup, and the extra split cost is amortized over many reads.
Write-Heavy Workloads: favor a higher α_max (0.80 to 0.85) to reduce split frequency on the insert path, accepting somewhat longer chains.
Mixed Workloads: start from balanced defaults (soft 0.75, hard 0.90) and adjust based on observed chain lengths and split frequency.
Latency-Sensitive Workloads: prefer overflow-based triggers with a low chain limit plus a cascade cap, since they bound worst-case lookup cost rather than the average.
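One way to encode workload-specific recommendations like these is as named parameter presets; the values below are illustrative starting points drawn from the tuning table above, not definitive settings:

```python
# Illustrative starting points per workload; tune against observed
# chain lengths and split frequency rather than treating as fixed.
TRIGGER_PRESETS = {
    "read_heavy":        {"max_load_factor": 0.70, "max_chain_pages": 1},
    "write_heavy":       {"max_load_factor": 0.85, "max_chain_pages": 3},
    "mixed":             {"max_load_factor": 0.75, "max_chain_pages": 2},
    "latency_sensitive": {"max_load_factor": 0.70, "max_chain_pages": 1},
}

def configure(workload: str) -> dict:
    """Look up a preset, defaulting to the mixed-workload settings."""
    return TRIGGER_PRESETS.get(workload, TRIGGER_PRESETS["mixed"])
```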
```python
from collections import deque

class AdaptiveTuningLinearHash:
    """
    Self-tuning linear hash table that adjusts thresholds
    based on observed performance.
    """
    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]

        # Adaptive parameters
        self.load_threshold = 0.75
        self.min_threshold = 0.50
        self.max_threshold = 0.95

        # Performance tracking: bounded window of recent lookups
        self.window_size = 1000
        self.recent_lookup_probes = deque(maxlen=self.window_size)
        self.target_avg_probes = 1.5

    def record_lookup_probes(self, probes: int):
        """Track probe count; the deque evicts the oldest entry itself."""
        self.recent_lookup_probes.append(probes)

    def _average_probes(self) -> float:
        if not self.recent_lookup_probes:
            return 1.0
        return sum(self.recent_lookup_probes) / len(self.recent_lookup_probes)

    def _tune_threshold(self):
        """Adjust load threshold based on performance."""
        avg_probes = self._average_probes()
        if avg_probes > self.target_avg_probes * 1.2:
            # Lookups too slow: lower threshold (split more)
            self.load_threshold = max(self.min_threshold,
                                      self.load_threshold - 0.05)
        elif avg_probes < self.target_avg_probes * 0.8:
            # Lookups fast: raise threshold (split less)
            self.load_threshold = min(self.max_threshold,
                                      self.load_threshold + 0.02)

    def lookup(self, key):
        """Lookup with probe counting."""
        bucket = self._resolve_bucket(key)
        probes = 0
        for k, v in self.buckets[bucket]:
            probes += 1
            if k == key:
                self.record_lookup_probes(probes)
                return v
        self.record_lookup_probes(probes)
        return None

    def periodic_tune(self):
        """Call periodically to adjust thresholds."""
        self._tune_threshold()
```

Effective tuning requires visibility into hash table behavior. Track: average/p99 chain length, split frequency, load factor distribution, overflow page count. These metrics enable data-driven threshold adjustment rather than guesswork.
The bucket capacity (how many records fit before overflow) interacts critically with split triggers. This is especially important for disk-based hash tables where bucket capacity is constrained by page size.
In-Memory vs. Disk-Based: in-memory tables can afford small bucket capacities and frequent, cheap splits, so triggers can be aggressive. Disk-based tables tie bucket capacity to the page size, and every overflow page costs an extra I/O, so triggers there focus on keeping chains short.
Calculating Bucket Capacity:
For disk-based systems with page size P and record size R:
Bucket capacity = floor((P - header_size) / R)
Example:
- Page size: 4096 bytes
- Page header: 64 bytes
- Record size: 100 bytes
- Capacity = floor((4096 - 64) / 100) = 40 records per bucket
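The capacity formula is easy to mechanize as a small helper for fixed-size records (parameter names here mirror the formula above):

```python
def bucket_capacity(page_size: int, header_size: int, record_size: int) -> int:
    """Records per bucket for fixed-size records on a paged store:
    floor((page_size - header_size) / record_size)."""
    return (page_size - header_size) // record_size

# Matches the worked example: 4 KiB page, 64-byte header, 100-byte records.
```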
Variable-Length Records:
With variable-length records, bucket capacity varies from bucket to bucket. Trigger strategies must adapt, for example by triggering on bytes used per page (space utilization) rather than on a fixed record count.
Each overflow page in a chain adds one I/O operation to lookups. If average lookup requires 1.5 I/O (primary bucket + 0.5 average overflow), query performance doubles compared to no overflow. Disk-based systems must aggressively control overflow chain length.
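The 1.5 I/O figure can be reproduced with a small helper that derives expected pages read per lookup from the bucket size distribution (a sketch assuming lookups hit buckets uniformly and scan whole chains):

```python
def avg_pages_per_lookup(bucket_sizes, bucket_capacity):
    """Expected pages read per lookup: 1 for the primary page,
    plus the average number of overflow pages per bucket."""
    def overflow_pages(n):
        excess = n - bucket_capacity
        return 0 if excess <= 0 else -(-excess // bucket_capacity)
    total_overflow = sum(overflow_pages(n) for n in bucket_sizes)
    return 1 + total_overflow / len(bucket_sizes)

# Four buckets with capacity 10, two of which need one overflow
# page each, average out to 1 + 2/4 = 1.5 pages per lookup.
```

Tracking this quantity over time is a cheap way to see whether a trigger policy is actually holding worst-case I/O where you want it.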
The split trigger determines when linear hashing expands. Choosing the right trigger strategy requires balancing lookup performance, space efficiency, and implementation complexity.
What's Next:
We've covered when to split. The next page explores a distinctive property of linear hashing: no directory needed. We'll examine how linear hashing achieves directory-free dynamic growth, the implications for memory efficiency and concurrency, and comparisons with directory-based schemes like extendible hashing.
You now understand the full spectrum of split trigger strategies for linear hashing. From simple load factor thresholds to sophisticated hybrid policies, you can select and tune triggers appropriate for any workload. This completes the 'when to grow' aspect of linear hashing.