Loading learning content...
At the heart of every linear hash table lies a deceptively simple variable: the split pointer (commonly denoted as S or next). This single integer tracks the progress of the table's expansion, determines which hash function applies to which bucket, and enables the directory-free addressing that makes linear hashing unique.
Understanding the split pointer is understanding linear hashing itself. It's the state machine that transforms the abstract concept of 'linear growth' into concrete algorithmic operations.
By the end of this page, you will master the split pointer mechanism in complete detail—how it advances, how it determines bucket addressing, how it signals level transitions, and how it interacts with both insertions and lookups. You'll understand why this single variable eliminates the need for complex directory structures.
The split pointer S is an integer variable that indicates the next bucket to be split when an expansion occurs. Its value ranges from 0 to N₀ × 2^L - 1 within each level L, where N₀ is the initial bucket count.
Formal Definition:
| S Value | Meaning | Hash Function Usage |
|---|---|---|
| S = 0 | No buckets split in current level yet | All buckets use h_L only |
| 0 < S < N₀×2^L | Buckets 0 to S-1 have been split | Buckets 0..S-1 use h_{L+1}; buckets S..N₀×2^L-1 use h_L |
| S → 0, L++ | Level complete, transitioning to next | All buckets now use h_{L+1} (which becomes h_L for the new level) |
Think of S as partitioning the bucket space into two regions: buckets 0 to S-1 (already split, using h_{L+1}) and buckets S to N₀×2^L-1 (not yet split in this level, using h_L). The new buckets from N₀×2^L to the current maximum always use h_{L+1} since they were created during splits.
The Dual Role of S:
Expansion Planning: S indicates which bucket splits next, ensuring the deterministic, round-robin expansion pattern that defines linear hashing.
Address Resolution: S determines which hash function to use for any given bucket address—enabling lookups to find records correctly despite ongoing expansion.
The split pointer advances according to a strict, deterministic protocol. This predictability is essential—it's what allows lookups to determine which hash function applies to any bucket without consulting external metadata.
The Advancement Algorithm:
12345678910111213141516171819202122232425262728293031323334
def advance_split_pointer(self): """ Advance the split pointer after a split operation. Handles level transitions when a full round completes. """ # Increment split pointer self.S += 1 # Check for level completion level_bucket_count = self.N0 * (2 ** self.L) if self.S >= level_bucket_count: # Complete round: all original buckets have been split # Transition to next level self.S = 0 self.L += 1 # After transition: # - Bucket count has doubled (from N0*2^(L-1) to N0*2^L) # - All buckets now consistently use h_L (formerly h_{L+1}) # - Ready to begin a new round of splits def current_state_summary(self) -> dict: """Return current state of the linear hash table.""" level_buckets = self.N0 * (2 ** self.L) return { "level": self.L, "split_pointer": self.S, "buckets_split_this_level": self.S, "buckets_remaining_this_level": level_buckets - self.S, "total_buckets": level_buckets + self.S, "percent_through_level": (self.S / level_buckets) * 100 }Tracing Split Pointer Through Multiple Levels:
Let's trace S through two complete levels with N₀ = 4:
| Operation | S (before) | L (before) | Action | S (after) | L (after) | Buckets |
|---|---|---|---|---|---|---|
| Init | 0 | 0 | — | 0 | 0 | 4 |
| Split 1 | 0 | 0 | Split bucket 0 | 1 | 0 | 5 |
| Split 2 | 1 | 0 | Split bucket 1 | 2 | 0 | 6 |
| Split 3 | 2 | 0 | Split bucket 2 | 3 | 0 | 7 |
| Split 4 | 3 | 0 | Split bucket 3 | 0 | 1 | 8 |
| Split 5 | 0 | 1 | Split bucket 0 | 1 | 1 | 9 |
| Split 6 | 1 | 1 | Split bucket 1 | 2 | 1 | 10 |
| ... | ... | 1 | ... | ... | 1 | ... |
| Split 12 | 7 | 1 | Split bucket 7 | 0 | 2 | 16 |
Notice how the level transition (L: 0→1) occurs precisely when S would exceed 3 (which is N₀ × 2^0 - 1 = 3). At that moment, S resets to 0 and L increments. The bucket count has exactly doubled from 4 to 8. The next transition (L: 1→2) occurs when S would exceed 7, at which point we've gone from 8 to 16 buckets.
The split pointer's most elegant function is enabling correct address resolution without any auxiliary data structure. The algorithm is remarkably simple yet handles all transitional states correctly.
The Core Address Resolution Algorithm:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748
def resolve_address(self, key) -> int: """ Resolve the bucket address for a given key. This is the critical path for all lookups and insertions. Uses the split pointer to determine which hash function applies. Returns: int: The bucket index where this key belongs """ # Step 1: Compute address using current level's hash function h_val = hash(key) level_range = self.N0 * (2 ** self.L) bucket = h_val % level_range # Step 2: Check if this bucket has already been split if bucket < self.S: # This bucket was split earlier in the current level. # The key might now belong to the sibling bucket. # Use the next level's hash function instead. next_level_range = self.N0 * (2 ** (self.L + 1)) bucket = h_val % next_level_range # Step 3: Return the resolved bucket # Bucket is guaranteed to be a valid index in our bucket array return bucket def explain_resolution(self, key) -> str: """Explain the resolution process for educational purposes.""" h_val = hash(key) level_range = self.N0 * (2 ** self.L) initial_bucket = h_val % level_range explanation = f"Key: {key}, hash: {h_val}\n" explanation += f"h_L({key}) = {h_val} mod {level_range} = {initial_bucket}\n" if initial_bucket < self.S: next_level_range = self.N0 * (2 ** (self.L + 1)) final_bucket = h_val % next_level_range explanation += f"Bucket {initial_bucket} < S={self.S}, so bucket was split.\n" explanation += f"h_{{L+1}}({key}) = {h_val} mod {next_level_range} = {final_bucket}\n" explanation += f"Final bucket: {final_bucket}" else: explanation += f"Bucket {initial_bucket} >= S={self.S}, not yet split.\n" explanation += f"Final bucket: {initial_bucket}" return explanationThe split pointer must be read consistently with bucket access. If S changes between reading S and accessing the bucket, lookups might fail. This is typically handled by either: (1) locking S and the bucket array together, (2) using version numbers for optimistic concurrency, or (3) employing lock-free atomic operations on S.
The split pointer and level together form a finite state machine that governs linear hashing behavior. Understanding this state machine clarifies all aspects of linear hashing operation.
State Definition:
The state is fully captured by the tuple (S, L). Valid states satisfy:
State Transitions:
| Current State | Event | Condition | Next State | Side Effect |
|---|---|---|---|---|
| (S, L) | Split | S + 1 < N₀ × 2^L | (S + 1, L) | Create bucket N₀×2^L + S |
| (S, L) | Split | S + 1 = N₀ × 2^L | (0, L + 1) | Create bucket, complete level |
| (S, L) | Lookup | Always | (S, L) | Return resolved address |
| (S, L) | Insert | No trigger | (S, L) | Add to resolved bucket |
| (S, L) | Insert | Trigger | (S', L') | Add to bucket, then split |
Visual State Diagram for N₀ = 4:
Level 0: Level 1:
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│S = 0│────▶│S = 1│────▶│S = 2│────▶│S = 3│────▶┬──▶ ┌─────┐────▶ ...
└─────┘ └─────┘ └─────┘ └─────┘ │ │S = 0│
4 bkt 5 bkt 6 bkt 7 bkt │ └─────┘ Level 1
│ 8 bkt continues...
│
└─── Level transition:
S resets, L increments
Level 1 continues: S = 0 ──▶ S = 1 ──▶ ... ──▶ S = 7 ──▶ Level 2 (S = 0)
8 bkt 9 bkt 15 bkt 16 bkt
Key invariants maintained by the state machine: (1) Total buckets = N₀ × 2^L + S, (2) Buckets 0..S-1 and N₀×2^L..total-1 use h_{L+1}, (3) Buckets S..N₀×2^L-1 use h_L, (4) S monotonically increases within a level. These invariants are preserved by every state transition.
The relationship between the split pointer and hash function selection is the mathematical core of linear hashing. Understanding this relationship deeply explains why linear hashing works correctly.
Hash Function Selection Rule:
For key k with base hash value h(k):
if h(k) mod (N₀ × 2^L) < S:
use bucket = h(k) mod (N₀ × 2^(L+1))
else:
use bucket = h(k) mod (N₀ × 2^L)
Detailed Example:
Consider N₀ = 4, L = 0, S = 2. Let's trace several keys:
| Key k | h(k) | h(k) mod 4 | h(k) mod 8 | Bucket Used | Reason |
|---|---|---|---|---|---|
| k1 | 17 | 1 | 1 | 1 | 1 < S=2, use h_{L+1}; result is 1 |
| k2 | 21 | 1 | 5 | 5 | 1 < S=2, use h_{L+1}; result is 5 |
| k3 | 10 | 2 | 2 | 2 | 2 ≥ S=2, use h_L |
| k4 | 14 | 2 | 6 | 2 | 2 ≥ S=2, use h_L (h_{L+1} ignored) |
| k5 | 23 | 3 | 7 | 3 | 3 ≥ S=2, use h_L |
Notice:
The relationship h_{L+1}(k) ∈ {h_L(k), h_L(k) + N₀×2^L} means we can predict where a key will go when its bucket splits—even before the split happens. Keys with h_{L+1}(k) = h_L(k) stay; others move. This predictability enables optimizations like pre-computing split destinations.
Correctly implementing the split pointer requires attention to several subtle details. Common implementation errors can cause silent data loss or lookup failures.
Critical Implementation Considerations:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
class LinearHashTable: """ Complete linear hash table implementation with correct split pointer handling. """ def __init__(self, initial_buckets: int = 4): if initial_buckets < 1 or (initial_buckets & (initial_buckets - 1)) != 0: raise ValueError("initial_buckets must be a power of 2") self.N0 = initial_buckets self.L = 0 self.S = 0 self.buckets = [[] for _ in range(initial_buckets)] def _level_range(self, level: int) -> int: """Compute N0 * 2^level safely.""" return self.N0 << level # Bit shift is 2^level multiplication def _resolve_bucket(self, key) -> int: """Resolve correct bucket for a key.""" h = hash(key) level_range = self._level_range(self.L) bucket = h % level_range # CRITICAL: Use strict less-than if bucket < self.S: next_range = self._level_range(self.L + 1) bucket = h % next_range return bucket def _perform_split(self): """Execute a split on bucket S.""" # 1. Calculate sibling address BEFORE modifying anything sibling_address = self.S + self._level_range(self.L) # 2. Allocate new bucket self.buckets.append([]) # This adds bucket at sibling_address # Verify allocation assert len(self.buckets) == sibling_address + 1, f"Bucket allocation error: expected {sibling_address + 1}, got {len(self.buckets)}" # 3. Redistribute records from bucket S old_records = self.buckets[self.S] self.buckets[self.S] = [] next_range = self._level_range(self.L + 1) for key, value in old_records: new_bucket = hash(key) % next_range self.buckets[new_bucket].append((key, value)) # 4. Advance split pointer self.S += 1 # 5. Check for level completion # CRITICAL: Check equality, not greater-than if self.S == self._level_range(self.L): self.S = 0 self.L += 1 def insert(self, key, value): """Insert key-value pair, triggering split if needed.""" bucket = self._resolve_bucket(key) # Check for existing key (optional: update instead of duplicate) for i, (k, v) in enumerate(self.buckets[bucket]): if k == key: self.buckets[bucket][i] = (key, value) return self.buckets[bucket].append((key, value)) # Check split trigger (covered in next page) if self._should_split(): self._perform_split() def _should_split(self) -> bool: """Determine if split should be triggered.""" # Placeholder - covered in detail in page 3 total_records = sum(len(b) for b in self.buckets) load_factor = total_records / len(self.buckets) return load_factor > 0.75To verify correct implementation: (1) After each split, confirm bucket count = N₀ × 2^L + S, (2) Verify every key maps to the same bucket for lookup and insert, (3) After a complete level, confirm all buckets split exactly once, (4) Insert and retrieve known keys to validate addressing.
In database systems, the split pointer must survive crashes. Unlike extendible hashing's directory (which is a complex structure), linear hashing's state is trivially small—making persistence and recovery remarkably simple.
Minimal Persistent State:
To fully recover a linear hash table, we need only:
That's it. Three integers plus the data.
| Crash Point | State Lost | Recovery Action |
|---|---|---|
| Before split begins | None | Resume normal operation |
| During record redistribution | Partial split | Re-split bucket S (idempotent) |
| After split, before S increment | S not advanced | Verify bucket count, increment S |
| During level transition | L or S inconsistent | Recalculate from bucket count |
Write-Ahead Logging (WAL) Integration:
For durability in database systems, splits are typically logged:
SPLIT_BEGIN: (S=2, L=0, sibling=6)
- Records moved: [(k1,v1), (k5,v5), ...]
SPLIT_COMPLETE: (S=3, L=0)
During recovery, incomplete splits are either completed (if SPLIT_BEGIN found without SPLIT_COMPLETE) or ignored (if no SPLIT_BEGIN found). The deterministic nature of linear hashing makes recovery logic simple.
Because the split pointer state is so compact, it can be included in every checkpoint or even every disk page header. Some implementations store (S, L) redundantly in the page header of each bucket page, enabling independent recovery of bucket pages without reading separate metadata.
The split pointer is the elegant mechanism that makes linear hashing work. This single variable, combined with the level counter, replaces the entire directory structure of extendible hashing.
What's Next:
We've seen how the split pointer tracks expansion progress and enables correct addressing. But we haven't addressed when splits occur. The next page explores split triggers—the policies that decide when to execute a split, balancing performance, overflow chain length, and resource utilization.
You now have deep mastery of the split pointer mechanism. You understand its role in address resolution, its state transitions, its relationship to hash functions, and its implementation requirements. This understanding is essential for implementing and reasoning about linear hashing in database systems.