Database Management SystemsHash Indexing

Linear Hashing

LevelAdvanced

Duration75 mins

TopicHash Indexing

2 / 5

Split Pointer

The Index That Tracks Expansion

At the heart of every linear hash table lies a deceptively simple variable: the split pointer (commonly denoted as S or next). This single integer tracks the progress of the table's expansion, determines which hash function applies to which bucket, and enables the directory-free addressing that makes linear hashing unique.

Understanding the split pointer is understanding linear hashing itself. It's the state machine that transforms the abstract concept of 'linear growth' into concrete algorithmic operations.

What You Will Learn

By the end of this page, you will master the split pointer mechanism in complete detail—how it advances, how it determines bucket addressing, how it signals level transitions, and how it interacts with both insertions and lookups. You'll understand why this single variable eliminates the need for complex directory structures.

Split Pointer Definition and Role

The split pointer S is an integer variable that indicates the next bucket to be split when an expansion occurs. Its value ranges from 0 to N₀ × 2^L - 1 within each level L, where N₀ is the initial bucket count.

Formal Definition:

Range: 0 ≤ S < N₀ × 2^L
Interpretation: Bucket S is the next bucket that will be split
Advancement: S ← S + 1 after each split
Reset Condition: When S reaches N₀ × 2^L, reset S ← 0 and L ← L + 1

Split Pointer State Interpretation
S Value	Meaning	Hash Function Usage
S = 0	No buckets split in current level yet	All buckets use h_L only
0 < S < N₀×2^L	Buckets 0 to S-1 have been split	Buckets 0..S-1 use h_{L+1}; buckets S..N₀×2^L-1 use h_L
S → 0, L++	Level complete, transitioning to next	All buckets now use h_{L+1} (which becomes h_L for the new level)

The Split Pointer as a Partition

Think of S as partitioning the bucket space into two regions: buckets 0 to S-1 (already split, using h_{L+1}) and buckets S to N₀×2^L-1 (not yet split in this level, using h_L). The new buckets from N₀×2^L to the current maximum always use h_{L+1} since they were created during splits.

The Dual Role of S:

Expansion Planning: S indicates which bucket splits next, ensuring the deterministic, round-robin expansion pattern that defines linear hashing.
Address Resolution: S determines which hash function to use for any given bucket address—enabling lookups to find records correctly despite ongoing expansion.

Split Pointer Advancement

The split pointer advances according to a strict, deterministic protocol. This predictability is essential—it's what allows lookups to determine which hash function applies to any bucket without consulting external metadata.

The Advancement Algorithm:

split_pointer_advance.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def advance_split_pointer(self):
    """
    Advance the split pointer after a split operation.
    Handles level transitions when a full round completes.
    """
    # Increment split pointer
    self.S += 1
    
    # Check for level completion
    level_bucket_count = self.N0 * (2 ** self.L)
    
    if self.S >= level_bucket_count:
        # Complete round: all original buckets have been split
        # Transition to next level
        self.S = 0
        self.L += 1
        
        # After transition:
        # - Bucket count has doubled (from N0*2^(L-1) to N0*2^L)
        # - All buckets now consistently use h_L (formerly h_{L+1})
        # - Ready to begin a new round of splits
 
 
def current_state_summary(self) -> dict:
    """Return current state of the linear hash table."""
    level_buckets = self.N0 * (2 ** self.L)
    return {
        "level": self.L,
        "split_pointer": self.S,
        "buckets_split_this_level": self.S,
        "buckets_remaining_this_level": level_buckets - self.S,
        "total_buckets": level_buckets + self.S,
        "percent_through_level": (self.S / level_buckets) * 100
    }

Tracing Split Pointer Through Multiple Levels:

Let's trace S through two complete levels with N₀ = 4:

Operation	S (before)	L (before)	Action	S (after)	L (after)	Buckets
Init	0	0	—	0	0	4
Split 1	0	0	Split bucket 0	1	0	5
Split 2	1	0	Split bucket 1	2	0	6
Split 3	2	0	Split bucket 2	3	0	7
Split 4	3	0	Split bucket 3	0	1	8
Split 5	0	1	Split bucket 0	1	1	9
Split 6	1	1	Split bucket 1	2	1	10
...	...	1	...	...	1	...
Split 12	7	1	Split bucket 7	0	2	16

Level Transition Observation

Notice how the level transition (L: 0→1) occurs precisely when S would exceed 3 (which is N₀ × 2^0 - 1 = 3). At that moment, S resets to 0 and L increments. The bucket count has exactly doubled from 4 to 8. The next transition (L: 1→2) occurs when S would exceed 7, at which point we've gone from 8 to 16 buckets.

Address Resolution Using the Split Pointer

The split pointer's most elegant function is enabling correct address resolution without any auxiliary data structure. The algorithm is remarkably simple yet handles all transitional states correctly.

The Core Address Resolution Algorithm:

address_resolution.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
def resolve_address(self, key) -> int:
    """
    Resolve the bucket address for a given key.
    
    This is the critical path for all lookups and insertions.
    Uses the split pointer to determine which hash function applies.
    
    Returns:
        int: The bucket index where this key belongs
    """
    # Step 1: Compute address using current level's hash function
    h_val = hash(key)
    level_range = self.N0 * (2 ** self.L)
    bucket = h_val % level_range
    
    # Step 2: Check if this bucket has already been split
    if bucket < self.S:
        # This bucket was split earlier in the current level.
        # The key might now belong to the sibling bucket.
        # Use the next level's hash function instead.
        next_level_range = self.N0 * (2 ** (self.L + 1))
        bucket = h_val % next_level_range
    
    # Step 3: Return the resolved bucket
    # Bucket is guaranteed to be a valid index in our bucket array
    return bucket
 
 
def explain_resolution(self, key) -> str:
    """Explain the resolution process for educational purposes."""
    h_val = hash(key)
    level_range = self.N0 * (2 ** self.L)
    initial_bucket = h_val % level_range
    
    explanation = f"Key: {key}, hash: {h_val}\n"
    explanation += f"h_L({key}) = {h_val} mod {level_range} = {initial_bucket}\n"
    
    if initial_bucket < self.S:
        next_level_range = self.N0 * (2 ** (self.L + 1))
        final_bucket = h_val % next_level_range
        explanation += f"Bucket {initial_bucket} < S={self.S}, so bucket was split.\n"
        explanation += f"h_{{L+1}}({key}) = {h_val} mod {next_level_range} = {final_bucket}\n"
        explanation += f"Final bucket: {final_bucket}"
    else:
        explanation += f"Bucket {initial_bucket} >= S={self.S}, not yet split.\n"
        explanation += f"Final bucket: {initial_bucket}"
    
    return explanation

Why This Algorithm Works

•Bucket index < S: This bucket has been split. Using h_L would incorrectly direct us to the original bucket, but the key might have moved to the sibling. h_{L+1} correctly resolves to either the original or sibling bucket.
•Bucket index ≥ S: This bucket hasn't been split yet. h_L gives the correct address. Using h_{L+1} would be incorrect—it might point to a bucket that doesn't exist yet!
•Sibling bucket addresses: When bucket b splits, it creates sibling b + N₀×2^L. This sibling's address is ≥ N₀×2^L, which is always > S (since S < N₀×2^L). So sibling buckets are never redirected—their h_L address is already their h_{L+1} address.
•Consistency guarantee: For any key, this algorithm returns the same bucket address whether called during lookup, insertion, or deletion—as long as no split occurs between calls.

Concurrency Consideration

The split pointer must be read consistently with bucket access. If S changes between reading S and accessing the bucket, lookups might fail. This is typically handled by either: (1) locking S and the bucket array together, (2) using version numbers for optimistic concurrency, or (3) employing lock-free atomic operations on S.

Split Pointer State Diagram

The split pointer and level together form a finite state machine that governs linear hashing behavior. Understanding this state machine clarifies all aspects of linear hashing operation.

State Definition:

The state is fully captured by the tuple (S, L). Valid states satisfy:

L ≥ 0
0 ≤ S < N₀ × 2^L

State Transitions:

Split Pointer State Transitions
Current State	Event	Condition	Next State	Side Effect
(S, L)	Split	S + 1 < N₀ × 2^L	(S + 1, L)	Create bucket N₀×2^L + S
(S, L)	Split	S + 1 = N₀ × 2^L	(0, L + 1)	Create bucket, complete level
(S, L)	Lookup	Always	(S, L)	Return resolved address
(S, L)	Insert	No trigger	(S, L)	Add to resolved bucket
(S, L)	Insert	Trigger	(S', L')	Add to bucket, then split

Visual State Diagram for N₀ = 4:

Level 0:                    Level 1:
┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
│S = 0│────▶│S = 1│────▶│S = 2│────▶│S = 3│────▶┬──▶ ┌─────┐────▶ ...
└─────┘     └─────┘     └─────┘     └─────┘     │    │S = 0│     
  4 bkt      5 bkt       6 bkt       7 bkt      │    └─────┘    Level 1
                                                │      8 bkt   continues...
                                                │
                                                └─── Level transition: 
                                                     S resets, L increments

Level 1 continues: S = 0 ──▶ S = 1 ──▶ ... ──▶ S = 7 ──▶ Level 2 (S = 0)
                   8 bkt     9 bkt          15 bkt       16 bkt

State Machine Invariants

Key invariants maintained by the state machine: (1) Total buckets = N₀ × 2^L + S, (2) Buckets 0..S-1 and N₀×2^L..total-1 use h_{L+1}, (3) Buckets S..N₀×2^L-1 use h_L, (4) S monotonically increases within a level. These invariants are preserved by every state transition.

Split Pointer and Hash Function Relationship

The relationship between the split pointer and hash function selection is the mathematical core of linear hashing. Understanding this relationship deeply explains why linear hashing works correctly.

Hash Function Selection Rule:

For key k with base hash value h(k):

if h(k) mod (N₀ × 2^L) < S:
    use bucket = h(k) mod (N₀ × 2^(L+1))
else:
    use bucket = h(k) mod (N₀ × 2^L)

Mathematical Properties

•Modular Arithmetic Property: For any h and M, h mod (2M) is either h mod M or (h mod M) + M. This binary outcome is what enables bucket splitting to work correctly.
•Bucket Partition: At any state (S, L), buckets form three groups: (a) buckets 0..S-1 (split, image of h_{L+1}), (b) buckets S..N₀×2^L-1 (unsplit, image of h_L), (c) buckets N₀×2^L..total-1 (new, siblings of group a).
•Consistent Mapping: A key k maps to the same bucket whether we're inserting, looking up, or deleting—as long as no split intervenes. The split pointer determines the mapping deterministically.
•No Key Migration: Once a key is placed in a bucket, it stays there until that bucket splits. After splitting, the key is in exactly one of the two sibling buckets. It never moves again until the next round.

Detailed Example:

Consider N₀ = 4, L = 0, S = 2. Let's trace several keys:

Key k	h(k)	h(k) mod 4	h(k) mod 8	Bucket Used	Reason
k1	17	1	1	1	1 < S=2, use h_{L+1}; result is 1
k2	21	1	5	5	1 < S=2, use h_{L+1}; result is 5
k3	10	2	2	2	2 ≥ S=2, use h_L
k4	14	2	6	2	2 ≥ S=2, use h_L (h_{L+1} ignored)
k5	23	3	7	3	3 ≥ S=2, use h_L

Notice:

k1 and k2 both have h_L = 1, but h_{L+1} differs (1 vs 5). Since bucket 1 was split (1 < S), they correctly go to different buckets.
k3 and k4 both have h_L = 2. Bucket 2 hasn't been split yet (2 ≥ S), so both go to bucket 2, even though their h_{L+1} values differ.
When bucket 2 eventually splits (when S becomes 3), k3 will stay in bucket 2 while k4 will move to bucket 6.

Future Split Prediction

The relationship h_{L+1}(k) ∈ {h_L(k), h_L(k) + N₀×2^L} means we can predict where a key will go when its bucket splits—even before the split happens. Keys with h_{L+1}(k) = h_L(k) stay; others move. This predictability enables optimizations like pre-computing split destinations.

Implementing the Split Pointer Correctly

Correctly implementing the split pointer requires attention to several subtle details. Common implementation errors can cause silent data loss or lookup failures.

Critical Implementation Considerations:

Common Implementation Pitfalls

•Off-by-one in level transition: The transition condition is S + 1 > N₀ × 2^L, not S ≥ N₀ × 2^L. After incrementing S, check if S equals (not exceeds) N₀ × 2^L.
•Incorrect hash function range: h_L uses modulo N₀ × 2^L, not modulo (total buckets). Using current bucket count would break address resolution.
•Forgetting to allocate new bucket: Before advancing S, ensure the new bucket N₀ × 2^L + S exists in the bucket array.
•Wrong sibling address calculation: The sibling of bucket S is at S + N₀ × 2^L, not S + S or 2*S.
•Incorrect comparison in address resolution: The check is bucket < S, not bucket ≤ S. Bucket S hasn't been split yet!

split_pointer_complete.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
class LinearHashTable:
    """
    Complete linear hash table implementation with correct split pointer handling.
    """
    
    def __init__(self, initial_buckets: int = 4):
        if initial_buckets < 1 or (initial_buckets & (initial_buckets - 1)) != 0:
            raise ValueError("initial_buckets must be a power of 2")
        
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
    
    def _level_range(self, level: int) -> int:
        """Compute N0 * 2^level safely."""
        return self.N0 << level  # Bit shift is 2^level multiplication
    
    def _resolve_bucket(self, key) -> int:
        """Resolve correct bucket for a key."""
        h = hash(key)
        level_range = self._level_range(self.L)
        bucket = h % level_range
        
        # CRITICAL: Use strict less-than
        if bucket < self.S:
            next_range = self._level_range(self.L + 1)
            bucket = h % next_range
        
        return bucket
    
    def _perform_split(self):
        """Execute a split on bucket S."""
        # 1. Calculate sibling address BEFORE modifying anything
        sibling_address = self.S + self._level_range(self.L)
        
        # 2. Allocate new bucket
        self.buckets.append([])  # This adds bucket at sibling_address
        
        # Verify allocation
        assert len(self.buckets) == sibling_address + 1,             f"Bucket allocation error: expected {sibling_address + 1}, got {len(self.buckets)}"
        
        # 3. Redistribute records from bucket S
        old_records = self.buckets[self.S]
        self.buckets[self.S] = []
        
        next_range = self._level_range(self.L + 1)
        for key, value in old_records:
            new_bucket = hash(key) % next_range
            self.buckets[new_bucket].append((key, value))
        
        # 4. Advance split pointer
        self.S += 1
        
        # 5. Check for level completion
        # CRITICAL: Check equality, not greater-than
        if self.S == self._level_range(self.L):
            self.S = 0
            self.L += 1
    
    def insert(self, key, value):
        """Insert key-value pair, triggering split if needed."""
        bucket = self._resolve_bucket(key)
        
        # Check for existing key (optional: update instead of duplicate)
        for i, (k, v) in enumerate(self.buckets[bucket]):
            if k == key:
                self.buckets[bucket][i] = (key, value)
                return
        
        self.buckets[bucket].append((key, value))
        
        # Check split trigger (covered in next page)
        if self._should_split():
            self._perform_split()
    
    def _should_split(self) -> bool:
        """Determine if split should be triggered."""
        # Placeholder - covered in detail in page 3
        total_records = sum(len(b) for b in self.buckets)
        load_factor = total_records / len(self.buckets)
        return load_factor > 0.75

Implementation Verification

To verify correct implementation: (1) After each split, confirm bucket count = N₀ × 2^L + S, (2) Verify every key maps to the same bucket for lookup and insert, (3) After a complete level, confirm all buckets split exactly once, (4) Insert and retrieve known keys to validate addressing.

Split Pointer Persistence and Recovery

In database systems, the split pointer must survive crashes. Unlike extendible hashing's directory (which is a complex structure), linear hashing's state is trivially small—making persistence and recovery remarkably simple.

Minimal Persistent State:

To fully recover a linear hash table, we need only:

N₀ (initial bucket count, often fixed at compile time)
L (current level)
S (split pointer)
The bucket array contents

That's it. Three integers plus the data.

Recovery Scenarios
Crash Point	State Lost	Recovery Action
Before split begins	None	Resume normal operation
During record redistribution	Partial split	Re-split bucket S (idempotent)
After split, before S increment	S not advanced	Verify bucket count, increment S
During level transition	L or S inconsistent	Recalculate from bucket count

Write-Ahead Logging (WAL) Integration:

For durability in database systems, splits are typically logged:

SPLIT_BEGIN: (S=2, L=0, sibling=6)
  - Records moved: [(k1,v1), (k5,v5), ...]
SPLIT_COMPLETE: (S=3, L=0)

During recovery, incomplete splits are either completed (if SPLIT_BEGIN found without SPLIT_COMPLETE) or ignored (if no SPLIT_BEGIN found). The deterministic nature of linear hashing makes recovery logic simple.

Checkpointing Strategy

Because the split pointer state is so compact, it can be included in every checkpoint or even every disk page header. Some implementations store (S, L) redundantly in the page header of each bucket page, enabling independent recovery of bucket pages without reading separate metadata.

Summary: The Split Pointer Mechanism

The split pointer is the elegant mechanism that makes linear hashing work. This single variable, combined with the level counter, replaces the entire directory structure of extendible hashing.

Key Takeaways

•S indicates the next bucket to split — Its value ranges from 0 to N₀ × 2^L - 1, advancing by 1 with each split.
•S partitions the hash function usage — Buckets 0..S-1 use h_{L+1}; buckets S..N₀×2^L-1 use h_L.
•Address resolution is a simple two-step process — Compute h_L(k); if result < S, recompute with h_{L+1}(k).
•Level transitions are implicit — When S reaches N₀ × 2^L, reset S to 0 and increment L.
•State is trivially small — Only L, S, and N₀ are needed; no directory structure required.
•Recovery is straightforward — The compact state enables simple persistence and crash recovery.

What's Next:

We've seen how the split pointer tracks expansion progress and enables correct addressing. But we haven't addressed when splits occur. The next page explores split triggers—the policies that decide when to execute a split, balancing performance, overflow chain length, and resource utilization.

Page Complete

You now have deep mastery of the split pointer mechanism. You understand its role in address resolution, its state transitions, its relationship to hash functions, and its implementation requirements. This understanding is essential for implementing and reasoning about linear hashing in database systems.

2 / 5

Loading learning content...

Database Management SystemsHash Indexing

Linear Hashing

LevelAdvanced

Duration75 mins

TopicHash Indexing

2 / 5

Split Pointer

The Index That Tracks Expansion

Understanding the split pointer is understanding linear hashing itself. It's the state machine that transforms the abstract concept of 'linear growth' into concrete algorithmic operations.

What You Will Learn

Split Pointer Definition and Role

Formal Definition:

Range: 0 ≤ S < N₀ × 2^L
Interpretation: Bucket S is the next bucket that will be split
Advancement: S ← S + 1 after each split
Reset Condition: When S reaches N₀ × 2^L, reset S ← 0 and L ← L + 1

Split Pointer State Interpretation
S Value	Meaning	Hash Function Usage
S = 0	No buckets split in current level yet	All buckets use h_L only
0 < S < N₀×2^L	Buckets 0 to S-1 have been split	Buckets 0..S-1 use h_{L+1}; buckets S..N₀×2^L-1 use h_L
S → 0, L++	Level complete, transitioning to next	All buckets now use h_{L+1} (which becomes h_L for the new level)

The Split Pointer as a Partition

The Dual Role of S:

Expansion Planning: S indicates which bucket splits next, ensuring the deterministic, round-robin expansion pattern that defines linear hashing.
Address Resolution: S determines which hash function to use for any given bucket address—enabling lookups to find records correctly despite ongoing expansion.

Split Pointer Advancement

The Advancement Algorithm:

split_pointer_advance.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def advance_split_pointer(self):
    """
    Advance the split pointer after a split operation.
    Handles level transitions when a full round completes.
    """
    # Increment split pointer
    self.S += 1
    
    # Check for level completion
    level_bucket_count = self.N0 * (2 ** self.L)
    
    if self.S >= level_bucket_count:
        # Complete round: all original buckets have been split
        # Transition to next level
        self.S = 0
        self.L += 1
        
        # After transition:
        # - Bucket count has doubled (from N0*2^(L-1) to N0*2^L)
        # - All buckets now consistently use h_L (formerly h_{L+1})
        # - Ready to begin a new round of splits
 
 
def current_state_summary(self) -> dict:
    """Return current state of the linear hash table."""
    level_buckets = self.N0 * (2 ** self.L)
    return {
        "level": self.L,
        "split_pointer": self.S,
        "buckets_split_this_level": self.S,
        "buckets_remaining_this_level": level_buckets - self.S,
        "total_buckets": level_buckets + self.S,
        "percent_through_level": (self.S / level_buckets) * 100
    }

Tracing Split Pointer Through Multiple Levels:

Let's trace S through two complete levels with N₀ = 4:

Operation	S (before)	L (before)	Action	S (after)	L (after)	Buckets
Init	0	0	—	0	0	4
Split 1	0	0	Split bucket 0	1	0	5
Split 2	1	0	Split bucket 1	2	0	6
Split 3	2	0	Split bucket 2	3	0	7
Split 4	3	0	Split bucket 3	0	1	8
Split 5	0	1	Split bucket 0	1	1	9
Split 6	1	1	Split bucket 1	2	1	10
...	...	1	...	...	1	...
Split 12	7	1	Split bucket 7	0	2	16

Level Transition Observation

Address Resolution Using the Split Pointer

The Core Address Resolution Algorithm:

address_resolution.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
def resolve_address(self, key) -> int:
    """
    Resolve the bucket address for a given key.
    
    This is the critical path for all lookups and insertions.
    Uses the split pointer to determine which hash function applies.
    
    Returns:
        int: The bucket index where this key belongs
    """
    # Step 1: Compute address using current level's hash function
    h_val = hash(key)
    level_range = self.N0 * (2 ** self.L)
    bucket = h_val % level_range
    
    # Step 2: Check if this bucket has already been split
    if bucket < self.S:
        # This bucket was split earlier in the current level.
        # The key might now belong to the sibling bucket.
        # Use the next level's hash function instead.
        next_level_range = self.N0 * (2 ** (self.L + 1))
        bucket = h_val % next_level_range
    
    # Step 3: Return the resolved bucket
    # Bucket is guaranteed to be a valid index in our bucket array
    return bucket
 
 
def explain_resolution(self, key) -> str:
    """Explain the resolution process for educational purposes."""
    h_val = hash(key)
    level_range = self.N0 * (2 ** self.L)
    initial_bucket = h_val % level_range
    
    explanation = f"Key: {key}, hash: {h_val}\n"
    explanation += f"h_L({key}) = {h_val} mod {level_range} = {initial_bucket}\n"
    
    if initial_bucket < self.S:
        next_level_range = self.N0 * (2 ** (self.L + 1))
        final_bucket = h_val % next_level_range
        explanation += f"Bucket {initial_bucket} < S={self.S}, so bucket was split.\n"
        explanation += f"h_{{L+1}}({key}) = {h_val} mod {next_level_range} = {final_bucket}\n"
        explanation += f"Final bucket: {final_bucket}"
    else:
        explanation += f"Bucket {initial_bucket} >= S={self.S}, not yet split.\n"
        explanation += f"Final bucket: {initial_bucket}"
    
    return explanation

Why This Algorithm Works

•Bucket index < S: This bucket has been split. Using h_L would incorrectly direct us to the original bucket, but the key might have moved to the sibling. h_{L+1} correctly resolves to either the original or sibling bucket.
•Bucket index ≥ S: This bucket hasn't been split yet. h_L gives the correct address. Using h_{L+1} would be incorrect—it might point to a bucket that doesn't exist yet!
•Sibling bucket addresses: When bucket b splits, it creates sibling b + N₀×2^L. This sibling's address is ≥ N₀×2^L, which is always > S (since S < N₀×2^L). So sibling buckets are never redirected—their h_L address is already their h_{L+1} address.
•Consistency guarantee: For any key, this algorithm returns the same bucket address whether called during lookup, insertion, or deletion—as long as no split occurs between calls.

Concurrency Consideration

Split Pointer State Diagram

The split pointer and level together form a finite state machine that governs linear hashing behavior. Understanding this state machine clarifies all aspects of linear hashing operation.

State Definition:

The state is fully captured by the tuple (S, L). Valid states satisfy:

L ≥ 0
0 ≤ S < N₀ × 2^L

State Transitions:

Split Pointer State Transitions
Current State	Event	Condition	Next State	Side Effect
(S, L)	Split	S + 1 < N₀ × 2^L	(S + 1, L)	Create bucket N₀×2^L + S
(S, L)	Split	S + 1 = N₀ × 2^L	(0, L + 1)	Create bucket, complete level
(S, L)	Lookup	Always	(S, L)	Return resolved address
(S, L)	Insert	No trigger	(S, L)	Add to resolved bucket
(S, L)	Insert	Trigger	(S', L')	Add to bucket, then split

Visual State Diagram for N₀ = 4:

Level 0:                    Level 1:
┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
│S = 0│────▶│S = 1│────▶│S = 2│────▶│S = 3│────▶┬──▶ ┌─────┐────▶ ...
└─────┘     └─────┘     └─────┘     └─────┘     │    │S = 0│     
  4 bkt      5 bkt       6 bkt       7 bkt      │    └─────┘    Level 1
                                                │      8 bkt   continues...
                                                │
                                                └─── Level transition: 
                                                     S resets, L increments

Level 1 continues: S = 0 ──▶ S = 1 ──▶ ... ──▶ S = 7 ──▶ Level 2 (S = 0)
                   8 bkt     9 bkt          15 bkt       16 bkt

State Machine Invariants

Split Pointer and Hash Function Relationship

The relationship between the split pointer and hash function selection is the mathematical core of linear hashing. Understanding this relationship deeply explains why linear hashing works correctly.

Hash Function Selection Rule:

For key k with base hash value h(k):

if h(k) mod (N₀ × 2^L) < S:
    use bucket = h(k) mod (N₀ × 2^(L+1))
else:
    use bucket = h(k) mod (N₀ × 2^L)

Mathematical Properties

•Modular Arithmetic Property: For any h and M, h mod (2M) is either h mod M or (h mod M) + M. This binary outcome is what enables bucket splitting to work correctly.
•Bucket Partition: At any state (S, L), buckets form three groups: (a) buckets 0..S-1 (split, image of h_{L+1}), (b) buckets S..N₀×2^L-1 (unsplit, image of h_L), (c) buckets N₀×2^L..total-1 (new, siblings of group a).
•Consistent Mapping: A key k maps to the same bucket whether we're inserting, looking up, or deleting—as long as no split intervenes. The split pointer determines the mapping deterministically.
•No Key Migration: Once a key is placed in a bucket, it stays there until that bucket splits. After splitting, the key is in exactly one of the two sibling buckets. It never moves again until the next round.

Detailed Example:

Consider N₀ = 4, L = 0, S = 2. Let's trace several keys:

Key k	h(k)	h(k) mod 4	h(k) mod 8	Bucket Used	Reason
k1	17	1	1	1	1 < S=2, use h_{L+1}; result is 1
k2	21	1	5	5	1 < S=2, use h_{L+1}; result is 5
k3	10	2	2	2	2 ≥ S=2, use h_L
k4	14	2	6	2	2 ≥ S=2, use h_L (h_{L+1} ignored)
k5	23	3	7	3	3 ≥ S=2, use h_L

Notice:

k1 and k2 both have h_L = 1, but h_{L+1} differs (1 vs 5). Since bucket 1 was split (1 < S), they correctly go to different buckets.
k3 and k4 both have h_L = 2. Bucket 2 hasn't been split yet (2 ≥ S), so both go to bucket 2, even though their h_{L+1} values differ.
When bucket 2 eventually splits (when S becomes 3), k3 will stay in bucket 2 while k4 will move to bucket 6.

Future Split Prediction

Implementing the Split Pointer Correctly

Correctly implementing the split pointer requires attention to several subtle details. Common implementation errors can cause silent data loss or lookup failures.

Critical Implementation Considerations:

Common Implementation Pitfalls

•Off-by-one in level transition: The transition condition is S + 1 > N₀ × 2^L, not S ≥ N₀ × 2^L. After incrementing S, check if S equals (not exceeds) N₀ × 2^L.
•Incorrect hash function range: h_L uses modulo N₀ × 2^L, not modulo (total buckets). Using current bucket count would break address resolution.
•Forgetting to allocate new bucket: Before advancing S, ensure the new bucket N₀ × 2^L + S exists in the bucket array.
•Wrong sibling address calculation: The sibling of bucket S is at S + N₀ × 2^L, not S + S or 2*S.
•Incorrect comparison in address resolution: The check is bucket < S, not bucket ≤ S. Bucket S hasn't been split yet!

split_pointer_complete.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
class LinearHashTable:
    """
    Complete linear hash table implementation with correct split pointer handling.
    """
    
    def __init__(self, initial_buckets: int = 4):
        if initial_buckets < 1 or (initial_buckets & (initial_buckets - 1)) != 0:
            raise ValueError("initial_buckets must be a power of 2")
        
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
    
    def _level_range(self, level: int) -> int:
        """Compute N0 * 2^level safely."""
        return self.N0 << level  # Bit shift is 2^level multiplication
    
    def _resolve_bucket(self, key) -> int:
        """Resolve correct bucket for a key."""
        h = hash(key)
        level_range = self._level_range(self.L)
        bucket = h % level_range
        
        # CRITICAL: Use strict less-than
        if bucket < self.S:
            next_range = self._level_range(self.L + 1)
            bucket = h % next_range
        
        return bucket
    
    def _perform_split(self):
        """Execute a split on bucket S."""
        # 1. Calculate sibling address BEFORE modifying anything
        sibling_address = self.S + self._level_range(self.L)
        
        # 2. Allocate new bucket
        self.buckets.append([])  # This adds bucket at sibling_address
        
        # Verify allocation
        assert len(self.buckets) == sibling_address + 1,             f"Bucket allocation error: expected {sibling_address + 1}, got {len(self.buckets)}"
        
        # 3. Redistribute records from bucket S
        old_records = self.buckets[self.S]
        self.buckets[self.S] = []
        
        next_range = self._level_range(self.L + 1)
        for key, value in old_records:
            new_bucket = hash(key) % next_range
            self.buckets[new_bucket].append((key, value))
        
        # 4. Advance split pointer
        self.S += 1
        
        # 5. Check for level completion
        # CRITICAL: Check equality, not greater-than
        if self.S == self._level_range(self.L):
            self.S = 0
            self.L += 1
    
    def insert(self, key, value):
        """Insert key-value pair, triggering split if needed."""
        bucket = self._resolve_bucket(key)
        
        # Check for existing key (optional: update instead of duplicate)
        for i, (k, v) in enumerate(self.buckets[bucket]):
            if k == key:
                self.buckets[bucket][i] = (key, value)
                return
        
        self.buckets[bucket].append((key, value))
        
        # Check split trigger (covered in next page)
        if self._should_split():
            self._perform_split()
    
    def _should_split(self) -> bool:
        """Determine if split should be triggered."""
        # Placeholder - covered in detail in page 3
        total_records = sum(len(b) for b in self.buckets)
        load_factor = total_records / len(self.buckets)
        return load_factor > 0.75

Implementation Verification

Split Pointer Persistence and Recovery

Minimal Persistent State:

To fully recover a linear hash table, we need only:

N₀ (initial bucket count, often fixed at compile time)
L (current level)
S (split pointer)
The bucket array contents

That's it. Three integers plus the data.

Recovery Scenarios
Crash Point	State Lost	Recovery Action
Before split begins	None	Resume normal operation
During record redistribution	Partial split	Re-split bucket S (idempotent)
After split, before S increment	S not advanced	Verify bucket count, increment S
During level transition	L or S inconsistent	Recalculate from bucket count

Write-Ahead Logging (WAL) Integration:

For durability in database systems, splits are typically logged:

SPLIT_BEGIN: (S=2, L=0, sibling=6)
  - Records moved: [(k1,v1), (k5,v5), ...]
SPLIT_COMPLETE: (S=3, L=0)

Checkpointing Strategy

Summary: The Split Pointer Mechanism

The split pointer is the elegant mechanism that makes linear hashing work. This single variable, combined with the level counter, replaces the entire directory structure of extendible hashing.

Key Takeaways

•S indicates the next bucket to split — Its value ranges from 0 to N₀ × 2^L - 1, advancing by 1 with each split.
•S partitions the hash function usage — Buckets 0..S-1 use h_{L+1}; buckets S..N₀×2^L-1 use h_L.
•Address resolution is a simple two-step process — Compute h_L(k); if result < S, recompute with h_{L+1}(k).
•Level transitions are implicit — When S reaches N₀ × 2^L, reset S to 0 and increment L.
•State is trivially small — Only L, S, and N₀ are needed; no directory structure required.
•Recovery is straightforward — The compact state enables simple persistence and crash recovery.

What's Next:

Page Complete

2 / 5