In the realm of dynamic hashing, linear hashing stands as one of the most elegant solutions to the fundamental challenge of growing hash structures. While extendible hashing achieves dynamic growth through directory manipulation, linear hashing takes a remarkably different approach: it expands the hash table one bucket at a time in a predetermined, linear sequence.
This seemingly simple idea—growing incrementally rather than exponentially—yields profound consequences for memory management, I/O efficiency, and system predictability. Linear hashing, first proposed by Witold Litwin in 1980, eliminates the need for directory structures entirely while maintaining O(1) average-case lookup performance.
By the end of this page, you will understand linear hashing's growth mechanism from first principles—how buckets are added one at a time, why this approach eliminates directories, and how the mathematical properties of hash function families enable seamless transitions during expansion. You'll grasp both the theoretical foundations and practical implementation considerations.
Before appreciating linear hashing's innovation, we must understand the limitations it addresses. Traditional dynamic hashing schemes, including extendible hashing, suffer from a fundamental characteristic: exponential growth in bucket count.
When an extendible hash table needs more capacity, it doubles its directory (from 2^d to 2^(d+1) entries). While individual bucket splits are local operations, the directory itself grows exponentially, and this creates several problems, most visibly large, all-at-once allocations:
Consider a hash table with 10 million buckets. When the next split requires directory expansion, the directory must grow from 10 million to 20 million entries—allocating 10 million new pointers simultaneously. This 'big bang' growth pattern is fundamentally at odds with gradual, predictable resource consumption.
The Linear Hashing Alternative:
Linear hashing addresses these problems by adopting a fundamentally different growth philosophy. Instead of asking "how can we double efficiently?", it asks "what if we never doubled at all?"
The answer lies in linear growth: adding exactly one bucket at a time, in a fixed sequence, regardless of where overflow actually occurs. This seemingly counterintuitive approach—splitting buckets that may not be overflowing—unlocks remarkable properties that make linear hashing particularly suitable for database systems.
Linear hashing's core insight is elegantly simple: grow the hash table one bucket at a time, always in the same predetermined order. This order is strictly sequential—bucket 0 splits first, then bucket 1, then bucket 2, and so on.
The key parameters that govern this process are:
| Parameter | Symbol | Description | Role in Growth |
|---|---|---|---|
| Initial bucket count | N₀ | Number of buckets at initialization | Typically a power of 2 (common: 2, 4, 8) |
| Current level | L | Number of complete rounds of expansion | Determines which hash functions are active |
| Split pointer | S | Next bucket to be split (0 ≤ S < N₀ × 2^L) | Tracks expansion progress within current level |
| Current bucket count | N | Total buckets = N₀ × 2^L + S | Grows by exactly 1 with each split |
The Expansion Cycle:
A complete expansion cycle (called a level or round) works as follows: the split pointer S starts at 0 and advances by one with each split; when S reaches N₀ × 2^L, the level is complete. Within each level, the bucket count grows linearly from N₀ × 2^L to N₀ × 2^(L+1). The transition between levels is seamless—just S resetting to 0 and L incrementing.
Think of levels as 'generations' of the hash table. Level 0 uses the initial N₀ buckets. Level 1 doubles to 2N₀. Level 2 doubles again to 4N₀. But unlike extendible hashing, these doublings happen gradually—one bucket at a time—across an entire level's worth of splits.
Concrete Example:
Let's trace a linear hash table with N₀ = 4 through its first several splits:
| Split # | Before Split | Bucket Split | After Split | S | L | Total Buckets |
|---|---|---|---|---|---|---|
| 0 | 0,1,2,3 | Bucket 0 → 0,4 | 0,1,2,3,4 | 1 | 0 | 5 |
| 1 | 0,1,2,3,4 | Bucket 1 → 1,5 | 0,1,2,3,4,5 | 2 | 0 | 6 |
| 2 | 0,1,2,3,4,5 | Bucket 2 → 2,6 | 0,1,2,3,4,5,6 | 3 | 0 | 7 |
| 3 | 0,1,2,3,4,5,6 | Bucket 3 → 3,7 | 0,1,2,3,4,5,6,7 | 0 | 1 | 8 |
| 4 | 0,1,2,3,4,5,6,7 | Bucket 0 → 0,8 | 0,1,...,7,8 | 1 | 1 | 9 |
Notice how the table grows by exactly one bucket per split. After 4 splits (one complete level), the table has doubled from 4 to 8 buckets. But this doubling is spread across 4 operations, not concentrated in one explosive growth event.
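The progression in the table can be reproduced in a few lines. This is a minimal sketch of the state transitions only (no actual buckets or records), assuming N₀ = 4 as in the example above:

```python
# Minimal sketch of linear hashing's state transitions, assuming N0 = 4.
def trace_splits(N0=4, num_splits=5):
    L, S = 0, 0
    states = []
    for _ in range(num_splits):
        S += 1                    # each split adds exactly one bucket
        if S >= N0 * (2 ** L):    # a full level is complete
            S = 0
            L += 1
        N = N0 * (2 ** L) + S     # total buckets after this split
        states.append((S, L, N))
    return states

print(trace_splits())
# [(1, 0, 5), (2, 0, 6), (3, 0, 7), (0, 1, 8), (1, 1, 9)]
```

The output matches the (S, L, Total Buckets) columns of the table row for row, including the level transition at split #3.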
Linear hashing's ability to grow one bucket at a time depends on a carefully designed family of hash functions that work together seamlessly. Unlike static hashing with a single hash function, linear hashing uses a sequence of related functions where each function in the sequence has double the range of its predecessor.
The hash function family is typically defined as:
h_L(k) = h(k) mod (N₀ × 2^L)
Where h(k) is a base hash function that maps keys uniformly onto a large integer range, N₀ is the initial bucket count, and L is the current level.
The critical property is that h_{L+1}(k) either equals h_L(k) or equals h_L(k) + N₀ × 2^L. When a bucket at position b splits, its records distribute between bucket b (which keeps h_{L+1}(k) = b) and bucket b + N₀ × 2^L (which gets h_{L+1}(k) = b + N₀ × 2^L). This is why splitting works correctly.
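This property can be verified numerically. The sketch below uses integer keys with an identity base hash h(k) = k, purely for illustration:

```python
# Illustrative check: h_{L+1}(k) is always h_L(k) or h_L(k) + N0 * 2^L.
# Integer keys with an identity base hash h(k) = k, for demonstration.
N0, L = 4, 0

def h_level(key, level):
    return key % (N0 * (2 ** level))   # h_L(k) = h(k) mod (N0 * 2^L)

for key in range(1000):
    lo = h_level(key, L)        # h_L(k)
    hi = h_level(key, L + 1)    # h_{L+1}(k)
    assert hi in (lo, lo + N0 * (2 ** L))
print("property holds for keys 0..999")
```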
Dual Hash Function Lookup:
During the gradual expansion within a level, some buckets have been split while others have not. This creates a transitional state where we must carefully select which hash function to use for each lookup.
The algorithm for finding the correct bucket is:
```
function findBucket(key k):
    b = h_L(k)          // Try current level's hash function
    if b < S:
        b = h_{L+1}(k)  // Bucket was already split, use next level
    return b
```
This two-step process is the heart of linear hashing's address calculation:
```python
class LinearHashTable:
    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets  # Initial bucket count
        self.L = 0                 # Current level
        self.S = 0                 # Split pointer
        self.buckets = [[] for _ in range(initial_buckets)]

    def _hash(self, key: int) -> int:
        """Base hash function h(k). It must be independent of the
        current table size, or addressing breaks as the table grows."""
        return hash(key)

    def _h_level(self, key: int, level: int) -> int:
        """Hash function for a specific level: h_L(k) = h(k) mod (N0 * 2^L)."""
        return self._hash(key) % (self.N0 * (2 ** level))

    def _find_bucket(self, key: int) -> int:
        """Find the correct bucket for a key."""
        # Step 1: Apply current level's hash function
        bucket = self._h_level(key, self.L)
        # Step 2: If bucket < S, it has already been split
        if bucket < self.S:
            # Step 3: Use next level's hash function
            bucket = self._h_level(key, self.L + 1)
        return bucket

    @property
    def total_buckets(self) -> int:
        """Current total number of buckets."""
        return self.N0 * (2 ** self.L) + self.S

    def lookup(self, key: int):
        """Look up a key in the hash table."""
        bucket_idx = self._find_bucket(key)
        for k, v in self.buckets[bucket_idx]:
            if k == key:
                return v
        return None  # Key not found
```

Understanding the mathematical properties of linear growth helps explain why this approach works and enables precise analysis of performance characteristics.
Bucket Count at Any State:
The total number of buckets at any point is:
N = N₀ × 2^L + S
Where N₀ is the initial bucket count, L is the number of completed levels, and S is the split pointer. For N₀ = 4, the state space evolves as follows:
| Level (L) | S Range | Bucket Count Range | Interpretation |
|---|---|---|---|
| 0 | 0 to 3 | 4 to 7 | First level: Growing from 4 to 8 buckets |
| 1 | 0 to 7 | 8 to 15 | Second level: Growing from 8 to 16 buckets |
| 2 | 0 to 15 | 16 to 31 | Third level: Growing from 16 to 32 buckets |
| 3 | 0 to 31 | 32 to 63 | Fourth level: Growing from 32 to 64 buckets |
| L | 0 to N₀×2^L - 1 | N₀×2^L to N₀×2^(L+1) - 1 | General formula for any level |
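The rows of this table can be recomputed mechanically from the general formula. A short sketch, using N₀ = 4 as in the table:

```python
# Recompute each row of the level table from N = N0 * 2^L + S, with N0 = 4.
N0 = 4
rows = []
for L in range(4):
    s_max = N0 * (2 ** L) - 1   # largest S within the level
    n_min = N0 * (2 ** L)       # bucket count when S = 0
    n_max = n_min + s_max       # bucket count when S = s_max
    rows.append((L, s_max, n_min, n_max))
    print(f"L={L}: S in [0, {s_max}], buckets in [{n_min}, {n_max}]")
```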
Hash Function Relationship:
The family of hash functions satisfies a crucial halving property:
h_L(k) = h_{L+1}(k) mod (N₀ × 2^L)
This means that for any key k, h_{L+1}(k) equals either h_L(k) or h_L(k) + N₀ × 2^L.
Proof: Let h = h(k) be the base hash value and let M = N₀ × 2^L, so that h_L(k) = h mod M and h_{L+1}(k) = h mod 2M. Since h mod (2M) equals either (h mod M) or (h mod M) + M (depending on whether ⌊h/M⌋ is even or odd), we get h_{L+1}(k) ∈ {h_L(k), h_L(k) + M}, and reducing either value mod M recovers h_L(k).
This halving property ensures that when bucket b splits into buckets b and b + N₀ × 2^L, every record in bucket b has a well-defined destination. A record either stays in bucket b (its h_{L+1} value equals b) or moves to the new bucket (its h_{L+1} value equals b + N₀ × 2^L). No record can end up in any other bucket.
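A quick numeric check of this guarantee. The sketch splits bucket b = 2 at level L = 1 (an 8-bucket table), using integer keys with an identity base hash purely for illustration:

```python
# Every record in bucket b lands in either b or b + N0 * 2^L under h_{L+1}.
# Integer keys with an identity base hash, for illustration only.
N0, L, b = 4, 1, 2
M = N0 * (2 ** L)                                  # = 8 buckets at level 1

keys_in_b = [k for k in range(200) if k % M == b]  # keys with h_L(k) == b
destinations = {k % (2 * M) for k in keys_in_b}    # their h_{L+1} values
assert destinations <= {b, b + M}                  # only two possible targets
print(sorted(destinations))
# [2, 10]
```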
Average Chain Length Analysis:
Assuming uniform hashing and a load factor α (records per bucket), the expected chain length at any bucket is:
E[chain length] = α = n / N
Where n is the total number of records and N is the current bucket count.
During linear growth, the load factor varies as buckets are added: each split increases N by one, nudging α downward, while insertions between splits push it back up.
The split trigger (discussed in a later page) controls this range. With typical triggers, α stays between 0.5 and 1.0, ensuring O(1) expected lookup time.
Space Overhead Analysis:
Linear hashing eliminates directory overhead entirely. The space overhead consists only of the three state variables (N₀, L, S) and the pointers needed for any overflow chains.
Compare this to extendible hashing, which requires O(2^d) directory entries where d is the global depth. Linear hashing's overhead is strictly O(N) in the number of buckets; there is no auxiliary structure that can grow faster than the data itself.
When a split occurs in linear hashing, the mechanics are straightforward but must be executed precisely. The bucket at position S is always the one that splits, regardless of which bucket triggered the split condition.
The Split Algorithm:
```python
def split(self):
    """
    Perform a split operation on bucket S.
    Creates a new bucket and redistributes records.
    """
    # Identify the bucket to split and its new sibling
    old_bucket_idx = self.S
    new_bucket_idx = self.S + self.N0 * (2 ** self.L)

    # Create the new bucket (lands at index new_bucket_idx)
    self.buckets.append([])

    # Get all records from the bucket being split
    old_records = self.buckets[old_bucket_idx]
    self.buckets[old_bucket_idx] = []  # Clear for redistribution

    # Redistribute records using h_{L+1}
    for key, value in old_records:
        # Use next level's hash function to determine placement
        new_bucket = self._h_level(key, self.L + 1)
        self.buckets[new_bucket].append((key, value))

    # Advance the split pointer
    self.S += 1

    # Check if we've completed a full round
    if self.S >= self.N0 * (2 ** self.L):
        # Move to next level: reset S, increment L
        self.S = 0
        self.L += 1
    # Note: No directory to update! This is the beauty of linear hashing.
```

A critical insight: when bucket 5 overflows, we split bucket S (say, bucket 2). Bucket 5's overflow is not immediately addressed! However, when S eventually reaches 5, it will split, resolving the overflow. This 'eventual resolution' is acceptable because overflow chains, while suboptimal, still provide O(1) expected access with good load factor management.
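To see the split in context, here is a condensed, self-contained variant of the LinearHashTable listing with an insert method added. The load-factor split trigger (0.8 here) is an illustrative assumption, not part of the original listing; trigger policy is discussed on a later page.

```python
class LinearHashTableWithInsert:
    """Condensed variant of the listing above, extended with insert().
    The 0.8 load-factor split trigger is an illustrative assumption."""

    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
        self.num_records = 0

    def _h_level(self, key, level):
        return hash(key) % (self.N0 * (2 ** level))

    @property
    def total_buckets(self):
        return self.N0 * (2 ** self.L) + self.S

    def _find_bucket(self, key):
        b = self._h_level(key, self.L)
        if b < self.S:                       # bucket already split this level
            b = self._h_level(key, self.L + 1)
        return b

    def split(self):
        # Always split bucket S; its sibling lands at index S + N0 * 2^L,
        # which is exactly the current end of the bucket list.
        old_records = self.buckets[self.S]
        self.buckets[self.S] = []
        self.buckets.append([])
        for k, v in old_records:
            self.buckets[self._h_level(k, self.L + 1)].append((k, v))
        self.S += 1
        if self.S >= self.N0 * (2 ** self.L):
            self.S = 0
            self.L += 1

    def insert(self, key, value, max_load=0.8):
        self.buckets[self._find_bucket(key)].append((key, value))
        self.num_records += 1
        # Uncontrolled split trigger: split whenever the overall load
        # factor exceeds max_load, regardless of where we just inserted.
        if self.num_records / self.total_buckets > max_load:
            self.split()

    def lookup(self, key):
        for k, v in self.buckets[self._find_bucket(key)]:
            if k == key:
                return v
        return None
```

Note the design choice this makes concrete: insert may trigger a split of a bucket entirely unrelated to the one that just received a record, which is precisely the 'eventual resolution' behavior described above.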
Visualizing a Split Sequence:
Consider a linear hash table with N₀ = 4 at level L = 0, S = 2:
Before split (6 buckets total):
┌─────┬─────┬─────┬─────┬─────┬─────┐
│ 0'  │ 1'  │  2  │  3  │  4  │  5  │
└─────┴─────┴─────┴─────┴─────┴─────┘
   ↑     ↑     ↑
   │     │     └── S = 2 (next to split)
   │     └── Already split (uses h₁)
   └── Already split (uses h₁)
After split (7 buckets total):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 0'  │ 1'  │ 2'  │  3  │  4  │  5  │  6  │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   ↑     ↑     ↑     ↑
   │     │     │     └── S = 3 (next to split)
   │     │     └── Just split into 2 and 6
   │     └── Previously split
   └── Previously split
The prime notation (0', 1', 2') indicates buckets already split in this level—they now use h₁ instead of h₀ for addressing.
Linear hashing and extendible hashing are often compared as the two major dynamic hashing techniques. Understanding their differences illuminates when each is preferable.
| Aspect | Linear Hashing | Extendible Hashing |
|---|---|---|
| Directory | None (O(1) state) | Required (O(2^d) entries) |
| Growth Pattern | One bucket at a time | Bucket + potential directory doubling |
| Split Target | Fixed (bucket S) | Flexible (overflowing bucket) |
| Address Calculation | Two hash computations (worst case) | One directory lookup + one hash |
| Memory Overhead | Minimal (L, S, N₀ only) | Directory can dominate for high depths |
| Latency Spikes | No directory doubling | Directory doubling causes spikes |
| Overflow Handling | Chains may persist longer | Immediate split of overflowing bucket |
| Implementation | Simpler state management | More complex directory management |
| Concurrency | Easier (localized updates) | Harder (directory coordination required) |
Linear hashing excels in disk-based database systems where I/O predictability matters more than perfect bucket allocation. It's also preferred in concurrent environments due to simpler synchronization requirements. Extendible hashing may be preferable for in-memory tables where directory overhead is less concerning and immediate overflow resolution is valuable.
We've explored the foundational concept of linear hashing: linear growth. This elegant approach to dynamic hash tables eliminates directory structures by growing one bucket at a time in a predetermined sequence.
What's Next:
Now that we understand how linear hashing grows, we'll examine the split pointer mechanism in detail. The split pointer is the state machine at the heart of linear hashing—tracking exactly which bucket splits next and managing the transition between levels. Understanding its behavior is essential for implementing linear hashing correctly and analyzing its performance characteristics.
You now understand the linear growth mechanism that distinguishes linear hashing from other dynamic hashing schemes. This foundation—growing one bucket at a time without directory structures—enables the predictable, efficient behavior that makes linear hashing valuable for database systems.