In the realm of dynamic hashing, linear hashing stands as one of the most elegant solutions to the fundamental challenge of growing hash structures. While extendible hashing achieves dynamic growth through directory manipulation, linear hashing takes a remarkably different approach: it expands the hash table one bucket at a time in a predetermined, linear sequence.
This seemingly simple idea—growing incrementally rather than exponentially—yields profound consequences for memory management, I/O efficiency, and system predictability. Linear hashing, first proposed by Witold Litwin in 1980, eliminates the need for directory structures entirely while maintaining O(1) average-case lookup performance.
By the end of this page, you will understand linear hashing's growth mechanism from first principles—how buckets are added one at a time, why this approach eliminates directories, and how the mathematical properties of hash function families enable seamless transitions during expansion. You'll grasp both the theoretical foundations and practical implementation considerations.
Before appreciating linear hashing's innovation, we must understand the limitations it addresses. Traditional dynamic hashing schemes, including extendible hashing, suffer from a fundamental characteristic: exponential growth in bucket count.
When an extendible hash table needs more capacity, it doubles its directory (from 2^d to 2^(d+1) entries). While individual bucket splits are local operations, the directory itself grows exponentially, and this creates several problems, most visibly large, all-at-once allocations:
Consider a hash table with 10 million buckets. When the next split requires directory expansion, the directory must grow from 10 million to 20 million entries—allocating 10 million new pointers simultaneously. This 'big bang' growth pattern is fundamentally at odds with gradual, predictable resource consumption.
The Linear Hashing Alternative:
Linear hashing addresses these problems by adopting a fundamentally different growth philosophy. Instead of asking "how can we double efficiently?", it asks "what if we never doubled at all?"
The answer lies in linear growth: adding exactly one bucket at a time, in a fixed sequence, regardless of where overflow actually occurs. This seemingly counterintuitive approach—splitting buckets that may not be overflowing—unlocks remarkable properties that make linear hashing particularly suitable for database systems.
Linear hashing's core insight is elegantly simple: grow the hash table one bucket at a time, always in the same predetermined order. This order is strictly sequential—bucket 0 splits first, then bucket 1, then bucket 2, and so on.
The key parameters that govern this process are:
| Parameter | Symbol | Description | Role in Growth |
|---|---|---|---|
| Initial bucket count | N₀ | Number of buckets at initialization | Typically a power of 2 (common: 2, 4, 8) |
| Current level | L | Number of complete rounds of expansion | Determines which hash functions are active |
| Split pointer | S | Next bucket to be split (0 ≤ S < N₀ × 2^L) | Tracks expansion progress within current level |
| Current bucket count | N | Total buckets = N₀ × 2^L + S | Grows by exactly 1 with each split |
The Expansion Cycle:
A complete expansion cycle (called a level or round) works as follows: the split pointer S starts at 0 and advances by one with each split; when S reaches N₀ × 2^L, the level is complete. Within each level, the bucket count grows linearly from N₀ × 2^L to N₀ × 2^(L+1). The transition between levels is seamless—just S resetting to 0 and L incrementing.
Think of levels as 'generations' of the hash table. Level 0 uses the initial N₀ buckets. Level 1 doubles to 2N₀. Level 2 doubles again to 4N₀. But unlike extendible hashing, these doublings happen gradually—one bucket at a time—across an entire level's worth of splits.
Concrete Example:
Let's trace a linear hash table with N₀ = 4 through its first several splits:
| Split # | Before Split | Bucket Split | After Split | S | L | Total Buckets |
|---|---|---|---|---|---|---|
| 0 | 0,1,2,3 | Bucket 0 → 0,4 | 0,1,2,3,4 | 1 | 0 | 5 |
| 1 | 0,1,2,3,4 | Bucket 1 → 1,5 | 0,1,2,3,4,5 | 2 | 0 | 6 |
| 2 | 0,1,2,3,4,5 | Bucket 2 → 2,6 | 0,1,2,3,4,5,6 | 3 | 0 | 7 |
| 3 | 0,1,2,3,4,5,6 | Bucket 3 → 3,7 | 0,1,2,3,4,5,6,7 | 0 | 1 | 8 |
| 4 | 0,1,2,3,4,5,6,7 | Bucket 0 → 0,8 | 0,1,...,7,8 | 1 | 1 | 9 |
Notice how the table grows by exactly one bucket per split. After 4 splits (one complete level), the table has doubled from 4 to 8 buckets. But this doubling is spread across 4 operations, not concentrated in one explosive growth event.
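The progression in the table can be reproduced in a few lines. This is a minimal sketch of the state transitions only (no actual buckets or records), assuming N₀ = 4 as in the example above:

```python
# Minimal sketch of linear hashing's state transitions, assuming N0 = 4.
def trace_splits(N0=4, num_splits=5):
    L, S = 0, 0
    states = []
    for _ in range(num_splits):
        S += 1                    # each split adds exactly one bucket
        if S >= N0 * (2 ** L):    # a full level is complete
            S = 0
            L += 1
        N = N0 * (2 ** L) + S     # total buckets after this split
        states.append((S, L, N))
    return states

print(trace_splits())
# [(1, 0, 5), (2, 0, 6), (3, 0, 7), (0, 1, 8), (1, 1, 9)]
```

The output matches the (S, L, Total Buckets) columns of the table row for row, including the level transition at split #3.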
Linear hashing's ability to grow one bucket at a time depends on a carefully designed family of hash functions that work together seamlessly. Unlike static hashing with a single hash function, linear hashing uses a sequence of related functions where each function in the sequence has double the range of its predecessor.
The hash function family is typically defined as:
h_L(k) = h(k) mod (N₀ × 2^L)
Where h(k) is a base hash function that maps keys uniformly onto a large integer range, N₀ is the initial bucket count, and L is the current level.
The critical property is that h_{L+1}(k) either equals h_L(k) or equals h_L(k) + N₀ × 2^L. When a bucket at position b splits, its records distribute between bucket b (which keeps h_{L+1}(k) = b) and bucket b + N₀ × 2^L (which gets h_{L+1}(k) = b + N₀ × 2^L). This is why splitting works correctly.
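This property can be verified numerically. The sketch below uses integer keys with an identity base hash h(k) = k, purely for illustration:

```python
# Illustrative check: h_{L+1}(k) is always h_L(k) or h_L(k) + N0 * 2^L.
# Integer keys with an identity base hash h(k) = k, for demonstration.
N0, L = 4, 0

def h_level(key, level):
    return key % (N0 * (2 ** level))   # h_L(k) = h(k) mod (N0 * 2^L)

for key in range(1000):
    lo = h_level(key, L)        # h_L(k)
    hi = h_level(key, L + 1)    # h_{L+1}(k)
    assert hi in (lo, lo + N0 * (2 ** L))
print("property holds for keys 0..999")
```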
Dual Hash Function Lookup:
During the gradual expansion within a level, some buckets have been split while others have not. This creates a transitional state where we must carefully select which hash function to use for each lookup.
The algorithm for finding the correct bucket is:
```
function findBucket(key k):
    b = h_L(k)          // Try current level's hash function
    if b < S:
        b = h_{L+1}(k)  // Bucket was already split, use next level
    return b
```
This two-step process is the heart of linear hashing's address calculation:
```python
class LinearHashTable:
    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets  # Initial bucket count
        self.L = 0                 # Current level
        self.S = 0                 # Split pointer
        self.buckets = [[] for _ in range(initial_buckets)]

    def _hash(self, key: int) -> int:
        """Base hash function h(k). It must be independent of the
        current table size, or addressing breaks as the table grows."""
        return hash(key)

    def _h_level(self, key: int, level: int) -> int:
        """Hash function for a specific level: h_L(k) = h(k) mod (N0 * 2^L)."""
        return self._hash(key) % (self.N0 * (2 ** level))

    def _find_bucket(self, key: int) -> int:
        """Find the correct bucket for a key."""
        # Step 1: Apply current level's hash function
        bucket = self._h_level(key, self.L)
        # Step 2: If bucket < S, it has already been split
        if bucket < self.S:
            # Step 3: Use next level's hash function
            bucket = self._h_level(key, self.L + 1)
        return bucket

    @property
    def total_buckets(self) -> int:
        """Current total number of buckets."""
        return self.N0 * (2 ** self.L) + self.S

    def lookup(self, key: int):
        """Look up a key in the hash table."""
        bucket_idx = self._find_bucket(key)
        for k, v in self.buckets[bucket_idx]:
            if k == key:
                return v
        return None  # Key not found
```

Understanding the mathematical properties of linear growth helps explain why this approach works and enables precise analysis of performance characteristics.
Bucket Count at Any State:
The total number of buckets at any point is:
N = N₀ × 2^L + S
Where N₀ is the initial bucket count, L is the number of completed levels, and S is the split pointer. For N₀ = 4, the state space evolves as follows:
| Level (L) | S Range | Bucket Count Range | Interpretation |
|---|---|---|---|
| 0 | 0 to 3 | 4 to 7 | First level: Growing from 4 to 8 buckets |
| 1 | 0 to 7 | 8 to 15 | Second level: Growing from 8 to 16 buckets |
| 2 | 0 to 15 | 16 to 31 | Third level: Growing from 16 to 32 buckets |
| 3 | 0 to 31 | 32 to 63 | Fourth level: Growing from 32 to 64 buckets |
| L | 0 to N₀×2^L - 1 | N₀×2^L to N₀×2^(L+1) - 1 | General formula for any level |
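The rows of this table can be recomputed mechanically from the general formula. A short sketch, using N₀ = 4 as in the table:

```python
# Recompute each row of the level table from N = N0 * 2^L + S, with N0 = 4.
N0 = 4
rows = []
for L in range(4):
    s_max = N0 * (2 ** L) - 1   # largest S within the level
    n_min = N0 * (2 ** L)       # bucket count when S = 0
    n_max = n_min + s_max       # bucket count when S = s_max
    rows.append((L, s_max, n_min, n_max))
    print(f"L={L}: S in [0, {s_max}], buckets in [{n_min}, {n_max}]")
```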
Hash Function Relationship:
The family of hash functions satisfies a crucial halving property:
h_L(k) = h_{L+1}(k) mod (N₀ × 2^L)
This means that for any key k, h_{L+1}(k) equals either h_L(k) or h_L(k) + N₀ × 2^L.
Proof: Let h = h(k) be the base hash value and let M = N₀ × 2^L, so that h_L(k) = h mod M and h_{L+1}(k) = h mod 2M. Since h mod (2M) equals either (h mod M) or (h mod M) + M (depending on whether ⌊h/M⌋ is even or odd), we get h_{L+1}(k) ∈ {h_L(k), h_L(k) + M}, and reducing either value mod M recovers h_L(k).
This halving property ensures that when bucket b splits into buckets b and b + N₀ × 2^L, every record in bucket b has a well-defined destination. A record either stays in bucket b (its h_{L+1} value equals b) or moves to the new bucket (its h_{L+1} value equals b + N₀ × 2^L). No record can end up in any other bucket.
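A quick numeric check of this guarantee. The sketch splits bucket b = 2 at level L = 1 (an 8-bucket table), using integer keys with an identity base hash purely for illustration:

```python
# Every record in bucket b lands in either b or b + N0 * 2^L under h_{L+1}.
# Integer keys with an identity base hash, for illustration only.
N0, L, b = 4, 1, 2
M = N0 * (2 ** L)                                  # = 8 buckets at level 1

keys_in_b = [k for k in range(200) if k % M == b]  # keys with h_L(k) == b
destinations = {k % (2 * M) for k in keys_in_b}    # their h_{L+1} values
assert destinations <= {b, b + M}                  # only two possible targets
print(sorted(destinations))
# [2, 10]
```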
Average Chain Length Analysis:
Assuming uniform hashing and a load factor α (records per bucket), the expected chain length at any bucket is:
E[chain length] = α = n / N
Where n is the total number of records and N is the current bucket count.
During linear growth, the load factor varies as buckets are added: each split increases N by one, nudging α downward, while insertions between splits push it back up.
The split trigger (discussed in a later page) controls this range. With typical triggers, α stays between 0.5 and 1.0, ensuring O(1) expected lookup time.
Space Overhead Analysis:
Linear hashing eliminates directory overhead entirely. The space overhead consists only of the three state variables (N₀, L, S) and the pointers needed for any overflow chains.
Compare this to extendible hashing, which requires O(2^d) directory entries where d is the global depth. Linear hashing's overhead is strictly O(N) in the number of buckets; there is no auxiliary structure that can grow faster than the data itself.
When a split occurs in linear hashing, the mechanics are straightforward but must be executed precisely. The bucket at position S is always the one that splits, regardless of which bucket triggered the split condition.
The Split Algorithm:
```python
def split(self):
    """
    Perform a split operation on bucket S.
    Creates a new bucket and redistributes records.
    """
    # Identify the bucket to split and its new sibling
    old_bucket_idx = self.S
    new_bucket_idx = self.S + self.N0 * (2 ** self.L)

    # Create the new bucket (lands at index new_bucket_idx)
    self.buckets.append([])

    # Get all records from the bucket being split
    old_records = self.buckets[old_bucket_idx]
    self.buckets[old_bucket_idx] = []  # Clear for redistribution

    # Redistribute records using h_{L+1}
    for key, value in old_records:
        # Use next level's hash function to determine placement
        new_bucket = self._h_level(key, self.L + 1)
        self.buckets[new_bucket].append((key, value))

    # Advance the split pointer
    self.S += 1

    # Check if we've completed a full round
    if self.S >= self.N0 * (2 ** self.L):
        # Move to next level: reset S, increment L
        self.S = 0
        self.L += 1
    # Note: No directory to update! This is the beauty of linear hashing.
```

A critical insight: when bucket 5 overflows, we split bucket S (say, bucket 2). Bucket 5's overflow is not immediately addressed! However, when S eventually reaches 5, it will split, resolving the overflow. This 'eventual resolution' is acceptable because overflow chains, while suboptimal, still provide O(1) expected access with good load factor management.
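To see the split in context, here is a condensed, self-contained variant of the LinearHashTable listing with an insert method added. The load-factor split trigger (0.8 here) is an illustrative assumption, not part of the original listing; trigger policy is discussed on a later page.

```python
class LinearHashTableWithInsert:
    """Condensed variant of the listing above, extended with insert().
    The 0.8 load-factor split trigger is an illustrative assumption."""

    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]
        self.num_records = 0

    def _h_level(self, key, level):
        return hash(key) % (self.N0 * (2 ** level))

    @property
    def total_buckets(self):
        return self.N0 * (2 ** self.L) + self.S

    def _find_bucket(self, key):
        b = self._h_level(key, self.L)
        if b < self.S:                       # bucket already split this level
            b = self._h_level(key, self.L + 1)
        return b

    def split(self):
        # Always split bucket S; its sibling lands at index S + N0 * 2^L,
        # which is exactly the current end of the bucket list.
        old_records = self.buckets[self.S]
        self.buckets[self.S] = []
        self.buckets.append([])
        for k, v in old_records:
            self.buckets[self._h_level(k, self.L + 1)].append((k, v))
        self.S += 1
        if self.S >= self.N0 * (2 ** self.L):
            self.S = 0
            self.L += 1

    def insert(self, key, value, max_load=0.8):
        self.buckets[self._find_bucket(key)].append((key, value))
        self.num_records += 1
        # Uncontrolled split trigger: split whenever the overall load
        # factor exceeds max_load, regardless of where we just inserted.
        if self.num_records / self.total_buckets > max_load:
            self.split()

    def lookup(self, key):
        for k, v in self.buckets[self._find_bucket(key)]:
            if k == key:
                return v
        return None
```

Note the design choice this makes concrete: insert may trigger a split of a bucket entirely unrelated to the one that just received a record, which is precisely the 'eventual resolution' behavior described above.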
Visualizing a Split Sequence:
Consider a linear hash table with N₀ = 4 at level L = 0, S = 2:
Before split (6 buckets total):
┌─────┬─────┬─────┬─────┬─────┬─────┐
│ 0'  │ 1'  │  2  │  3  │  4  │  5  │
└─────┴─────┴─────┴─────┴─────┴─────┘
   ↑     ↑     ↑
   │     │     └── S = 2 (next to split)
   │     └── Already split (uses h₁)
   └── Already split (uses h₁)
After split (7 buckets total):
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ 0'  │ 1'  │ 2'  │  3  │  4  │  5  │  6  │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   ↑     ↑     ↑     ↑
   │     │     │     └── S = 3 (next to split)
   │     │     └── Just split into 2 and 6
   │     └── Previously split
   └── Previously split
The prime notation (0', 1', 2') indicates buckets already split in this level—they now use h₁ instead of h₀ for addressing.
Linear hashing and extendible hashing are often compared as the two major dynamic hashing techniques. Understanding their differences illuminates when each is preferable.
| Aspect | Linear Hashing | Extendible Hashing |
|---|---|---|
| Directory | None (O(1) state) | Required (O(2^d) entries) |
| Growth Pattern | One bucket at a time | Bucket + potential directory doubling |
| Split Target | Fixed (bucket S) | Flexible (overflowing bucket) |
| Address Calculation | Two hash computations (worst case) | One directory lookup + one hash |
| Memory Overhead | Minimal (L, S, N₀ only) | Directory can dominate for high depths |
| Latency Spikes | No directory doubling | Directory doubling causes spikes |
| Overflow Handling | Chains may persist longer | Immediate split of overflowing bucket |
| Implementation | Simpler state management | More complex directory management |
| Concurrency | Easier (localized updates) | Harder (directory coordination required) |
Linear hashing excels in disk-based database systems where I/O predictability matters more than perfect bucket allocation. It's also preferred in concurrent environments due to simpler synchronization requirements. Extendible hashing may be preferable for in-memory tables where directory overhead is less concerning and immediate overflow resolution is valuable.
We've explored the foundational concept of linear hashing: linear growth. This elegant approach to dynamic hash tables eliminates directory structures by growing one bucket at a time in a predetermined sequence.
What's Next:
Now that we understand how linear hashing grows, we'll examine the split pointer mechanism in detail. The split pointer is the state machine at the heart of linear hashing—tracking exactly which bucket splits next and managing the transition between levels. Understanding its behavior is essential for implementing linear hashing correctly and analyzing its performance characteristics.
You now understand the linear growth mechanism that distinguishes linear hashing from other dynamic hashing schemes. This foundation—growing one bucket at a time without directory structures—enables the predictable, efficient behavior that makes linear hashing valuable for database systems.