Perhaps the most remarkable property of linear hashing is what it doesn't have: a directory. While extendible hashing requires an auxiliary data structure that maps hash prefixes to bucket addresses—and this directory can grow exponentially—linear hashing computes bucket addresses using only three integers: N₀, L, and S.
This elimination of directory overhead isn't merely an implementation simplification. It fundamentally changes the memory characteristics, concurrency properties, and scalability behavior of the hash table. Understanding why linear hashing doesn't need a directory reveals the elegance of its design.
By the end of this page, you will understand why linear hashing's algorithmic design eliminates directory requirements, the significant advantages this provides for memory efficiency and concurrent access, and the trade-offs compared to directory-based schemes.
Before appreciating directory-free operation, we must understand what directories provide in other dynamic hashing schemes.
In Extendible Hashing:
A directory is an array of bucket pointers, indexed by the first d bits (global depth) of the hash value. The directory serves as an indirection layer:
hash(key) → extract d bits → directory[bits] → bucket address
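That chain's directory step is just a bit mask plus an array index. A minimal sketch (this uses the low-order bits; implementations differ on which end of the hash they take):

```python
def directory_index(h: int, global_depth: int) -> int:
    """Index into the 2^global_depth-entry directory using d bits of the hash."""
    return h & ((1 << global_depth) - 1)

print(directory_index(0b101101, 3))  # low 3 bits: 0b101 = 5
```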
Key characteristics of the directory:
| Property | Description | Implications |
|---|---|---|
| Size | 2^d entries (d = global depth) | Doubles with each depth increase |
| Entry Content | Pointer/address to a bucket | 4-8 bytes per entry |
| Depth 10 example | 1,024 entries (8 KB) | Manageable in-memory |
| Depth 20 example | 1,048,576 entries (8 MB) | Significant memory footprint |
| Depth 30 example | 1,073,741,824 entries (8 GB) | Larger than many datasets! |
The exponential growth of directories is problematic. If your dataset requires 1 million buckets, the directory might need 2^20 = 1 million entries (8MB). But with skewed data causing many splits, depth might reach 25 or 30, exploding the directory to hundreds of megabytes or gigabytes—potentially larger than the data itself!
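The doubling is easy to see in code. A small sketch, assuming the 8-byte pointers used in the table above:

```python
POINTER_SIZE = 8  # bytes per directory entry (assumed, matching the table)

def directory_bytes(global_depth: int) -> int:
    """An extendible-hashing directory holds 2^d pointers."""
    return (2 ** global_depth) * POINTER_SIZE

LINEAR_METADATA = 24  # N0, L, S as three 8-byte integers

for d in (10, 20, 30):
    print(f"depth {d}: directory {directory_bytes(d):>13,} bytes "
          f"vs {LINEAR_METADATA} bytes of linear-hash metadata")
```

At depth 10 the directory is 8 KB; at depth 30 it is 8 GB, while linear hashing's metadata stays at 24 bytes regardless of scale.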
The Directory Bottleneck:
Directories create several operational challenges:
Memory Overhead: Directory consumes memory proportional to 2^d, not to actual bucket count.
Doubling Operations: When global depth increases, the entire directory must be allocated and copied.
Concurrency Complexity: Directory updates affect all concurrent operations; locking or versioning required.
Cache Disruption: Large directories exceed cache, requiring memory accesses for every lookup.
Persistence Complexity: Directory must be persisted consistently with bucket data.
Linear hashing replaces the directory with a computational formula. Instead of looking up the bucket address in a table, we calculate it using the split pointer and level.
The Core Insight:
In extendible hashing, the directory exists because buckets can be at different 'local depths'—some buckets have been split more times than others. The directory maps hash prefixes to the correct bucket regardless of local depth.
In linear hashing, the split pointer S partitions the buckets addressed by h_L into exactly two groups:

- Buckets in [0, S): already split this round; their keys are addressed with the finer function h_{L+1}.
- Buckets in [S, N₀·2^L): not yet split this round; addressed with h_L.

This binary partition is captured by a single integer S, eliminating the need for per-bucket depth tracking.
```python
# Extendible Hashing: Directory-based addressing
class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [None, None]  # 2^1 = 2 entries
        # Directory MUST be maintained and can grow to 2^d entries

    def find_bucket(self, key) -> int:
        h = hash(key)
        # Extract global_depth bits from hash
        index = h & ((1 << self.global_depth) - 1)
        # Look up in directory - O(1) but requires directory memory
        return self.directory[index]

    def directory_memory(self) -> int:
        """Memory used by directory alone."""
        return len(self.directory) * 8  # 8 bytes per pointer


# Linear Hashing: Computation-based addressing
class LinearHash:
    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets
        self.L = 0  # Level
        self.S = 0  # Split pointer
        # That's it! No directory needed.

    def find_bucket(self, key) -> int:
        h = hash(key)
        level_range = self.N0 * (2 ** self.L)
        bucket = h % level_range
        # Use split pointer to determine hash function
        if bucket < self.S:
            next_range = self.N0 * (2 ** (self.L + 1))
            bucket = h % next_range
        return bucket

    def metadata_memory(self) -> int:
        """Memory used by addressing metadata."""
        return 3 * 8  # Just 3 integers: N0, L, S = 24 bytes total!
```

The requirement that buckets split sequentially (not based on which bucket overflows) is what enables directory-free operation. If we allowed arbitrary split order (like extendible hashing), we'd need per-bucket metadata to track which buckets have split—essentially recreating the directory.
The elimination of directories provides dramatic memory savings, especially at scale. Let's quantify the difference.
Metadata Overhead Comparison:
| Bucket Count | Linear Hashing | Extendible Hashing (balanced) | Extendible Hashing (skewed) |
|---|---|---|---|
| 1,000 | 24 bytes | 8 KB (2^10 directory) | 64 KB (2^13) |
| 10,000 | 24 bytes | 128 KB (2^14) | 1 MB (2^17) |
| 100,000 | 24 bytes | 1 MB (2^17) | 32 MB (2^22) |
| 1,000,000 | 24 bytes | 8 MB (2^20) | 256 MB (2^25) |
| 10,000,000 | 24 bytes | 128 MB (2^24) | 2 GB (2^28) |
Memory Ratio Analysis:
The memory savings are substantial:
Memory Ratio = (Directory Size) / (Linear Hash Metadata)
= (2^d × 8 bytes) / 24 bytes
= (2^d) / 3
For d = 20: Ratio = 1,048,576 / 3 ≈ 350,000x more memory for directory!
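A one-line sanity check of that arithmetic:

```python
def memory_ratio(global_depth: int, pointer_size: int = 8, metadata: int = 24) -> float:
    """Directory bytes divided by linear hashing's fixed metadata bytes."""
    return (2 ** global_depth * pointer_size) / metadata

# (2^20 × 8) / 24 = 2^20 / 3 ≈ 349,525 — roughly the 350,000x quoted above
print(round(memory_ratio(20)))
```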
Where It Matters Most:
Embedded Systems: Limited memory makes directory overhead prohibitive.
High Fan-out Indexes: When each node is a small hash table, directory overhead per node accumulates.
Large Datasets: At billions of records, directory memory competes with working set.
Memory-Mapped Files: Reducing metadata pages reduces file size and I/O.
Consider a database server with 64GB RAM hosting tables with 100 million records each. With extendible hashing requiring 8MB-256MB per table's directory, you might consume 1-16GB just for hash directories. With linear hashing, metadata consumption is negligible—kilobytes instead of gigabytes.
Cache Behavior:
Beyond raw memory, cache behavior differs significantly:
Linear Hashing: The addressing state (N₀, L, S) fits in a single cache line—often in registers—so resolving a bucket address touches no memory beyond the bucket itself.

Extendible Hashing: A directory with millions of entries cannot stay cache-resident, so each lookup risks an extra cache or memory miss on the directory access before the bucket is even read.
Lookup Latency:
Linear:     hash(k) → compute  → bucket access
            ~1 ns (ALU)          ~50 ns (L3) or ~100 ns (RAM)

Extendible: hash(k) → directory access → bucket access
            ~1 ns     ~10-100 ns          ~50-100 ns
The absence of a directory dramatically simplifies concurrent access to linear hash tables. Let's examine the concurrency properties in detail.
Extendible Hashing Concurrency Challenges:

Directory doubling is a global event: while the directory is reallocated and copied, concurrent operations must be blocked or redirected. Even an ordinary split can update several directory entries at once, and those updates must appear atomic to readers, so practical implementations need directory-wide latches or versioning schemes.
Linear Hashing Concurrency Model:
Without a directory, linear hashing's concurrency story is much simpler:
```python
import threading
from typing import Optional, Any


class ConcurrentLinearHash:
    """
    Thread-safe linear hash table with minimal locking.

    Key insight: Only (S, L) and individual buckets need protection.
    No directory lock needed!
    """

    def __init__(self, initial_buckets=4):
        self.N0 = initial_buckets
        self.L = 0
        self.S = 0
        self.buckets = [[] for _ in range(initial_buckets)]

        # Locking strategy:
        # - state_lock: protects L, S, and bucket array growth
        # - Per-bucket locks for bucket contents
        self.state_lock = threading.RLock()
        self.bucket_locks = [threading.Lock() for _ in range(initial_buckets)]

    def _get_bucket_lock(self, bucket_idx: int) -> threading.Lock:
        """Get lock for a bucket, creating if needed."""
        with self.state_lock:
            while bucket_idx >= len(self.bucket_locks):
                self.bucket_locks.append(threading.Lock())
            return self.bucket_locks[bucket_idx]

    def _resolve_bucket_snapshot(self, key) -> int:
        """
        Resolve bucket with consistent state snapshot.
        Read S and L atomically, then compute.
        """
        with self.state_lock:
            L = self.L
            S = self.S

        # Computation outside lock - no directory access!
        h = hash(key)
        level_range = self.N0 * (2 ** L)
        bucket = h % level_range
        if bucket < S:
            next_range = self.N0 * (2 ** (L + 1))
            bucket = h % next_range
        return bucket

    def lookup(self, key) -> Optional[Any]:
        """
        Thread-safe lookup.
        Lock order: state_lock (briefly) → bucket_lock
        """
        bucket_idx = self._resolve_bucket_snapshot(key)
        bucket_lock = self._get_bucket_lock(bucket_idx)

        with bucket_lock:
            for k, v in self.buckets[bucket_idx]:
                if k == key:
                    return v
        return None

    def insert(self, key, value):
        """Thread-safe insert with split handling."""
        bucket_idx = self._resolve_bucket_snapshot(key)
        bucket_lock = self._get_bucket_lock(bucket_idx)

        with bucket_lock:
            self.buckets[bucket_idx].append((key, value))

        # Check if split needed, after releasing the bucket lock
        # (might also be triggered by another thread)
        self._maybe_split()

    def _should_split(self) -> bool:
        """Split trigger (minimal stub): average occupancy above 2 per bucket."""
        total = sum(len(b) for b in self.buckets)
        return total > 2 * len(self.buckets)

    def _redistribute(self, old_idx: int, new_idx: int):
        """Rehash bucket old_idx with h_{L+1} (minimal stub); movers go to new_idx."""
        next_range = self.N0 * (2 ** (self.L + 1))
        items = self.buckets[old_idx]
        self.buckets[old_idx] = [e for e in items if hash(e[0]) % next_range == old_idx]
        self.buckets[new_idx] = [e for e in items if hash(e[0]) % next_range != old_idx]

    def _maybe_split(self):
        """
        Perform split if threshold exceeded.
        Split only touches bucket S and new sibling - localized!
        """
        with self.state_lock:
            if not self._should_split():
                return

            # Identify buckets involved
            old_idx = self.S
            new_idx = self.S + self.N0 * (2 ** self.L)

            # Add new bucket
            self.buckets.append([])
            self.bucket_locks.append(threading.Lock())

        # Lock both buckets for redistribution
        old_lock = self.bucket_locks[old_idx]
        new_lock = self.bucket_locks[new_idx]

        # Redistribution with bucket locks only
        with old_lock:
            with new_lock:
                with self.state_lock:
                    # Re-read state under lock for safety
                    self._redistribute(old_idx, new_idx)
                    self.S += 1
                    if self.S >= self.N0 * (2 ** self.L):
                        self.S = 0
                        self.L += 1
```

With careful implementation using atomic variables for S and L, lookups can be completely lock-free. The address calculation is deterministic given S and L; if a split occurs mid-lookup, the worst case is accessing an empty new bucket and retrying. This enables extremely high read throughput.
For database systems, persistence is critical. Linear hashing's lack of directory structure dramatically simplifies durability and recovery.
Extendible Hashing Persistence Challenges:

The directory and the buckets it points to must be kept consistent on disk: a crash during a directory doubling or a multi-entry update can leave directory slots pointing at stale or missing buckets, and a large directory means many dirty pages to flush on every resize.
Linear Hashing Persistence Model:
Linear hashing's persistent state is trivially small:
Linear Hash Persistence:
┌─────────────────────────────┐
│ Header (24 bytes): │
│ N0: 4 bytes │
│ L: 4 bytes │
│ S: 4 bytes │
│ (padding/checksums) │
├─────────────────────────────┤
│ Bucket Array: │
│ [Bucket 0 pointer/data] │
│ [Bucket 1 pointer/data] │
│ ... │
│ [Bucket N-1] │
└─────────────────────────────┘
Compare to extendible hashing's directory spanning potentially thousands of pages.
```python
import struct
import os
import hashlib


class PersistentLinearHash:
    """
    Disk-persistent linear hash table.

    File format:
    - Header (32 bytes): N0, L, S, record_count, checksum
    - Bucket directory: list of bucket page offsets
    - Data pages: bucket contents
    """

    HEADER_SIZE = 32
    HEADER_FORMAT = '<IIII16s'  # N0, L, S, count, checksum

    def __init__(self, filepath: str, initial_buckets=4, page_size=4096):
        self.filepath = filepath
        self.page_size = page_size

        if os.path.exists(filepath):
            self._load_header()
        else:
            self.N0 = initial_buckets
            self.L = 0
            self.S = 0
            self.record_count = 0
            self._initialize_file()

    def _initialize_file(self):
        """Create header plus one empty page per initial bucket (sketch stub)."""
        with open(self.filepath, 'wb') as f:
            f.write(b'\x00' * self.HEADER_SIZE)
            f.write(b'\x00' * self.page_size * self.N0)
        self._persist_header()

    def _count_bucket_pages(self) -> int:
        """Derive bucket count from file size (sketch stub: one page per bucket)."""
        return (os.path.getsize(self.filepath) - self.HEADER_SIZE) // self.page_size

    def _load_header(self):
        """Load minimal state from file header."""
        with open(self.filepath, 'rb') as f:
            header = f.read(self.HEADER_SIZE)

        n0, l, s, count, checksum = struct.unpack(self.HEADER_FORMAT, header)

        # Validate checksum
        expected_checksum = self._compute_checksum(n0, l, s, count)
        if checksum != expected_checksum:
            raise ValueError("Header checksum mismatch - file corrupted")

        self.N0 = n0
        self.L = l
        self.S = s
        self.record_count = count

    def _persist_header(self):
        """
        Write header to file - just 32 bytes!
        This is all the metadata linear hashing needs.
        """
        checksum = self._compute_checksum(
            self.N0, self.L, self.S, self.record_count
        )
        header = struct.pack(
            self.HEADER_FORMAT,
            self.N0, self.L, self.S, self.record_count, checksum
        )
        with open(self.filepath, 'r+b') as f:
            f.seek(0)
            f.write(header)
            f.flush()
            os.fsync(f.fileno())  # Ensure durability

    def _compute_checksum(self, n0, l, s, count) -> bytes:
        """Simple checksum for header validation."""
        data = struct.pack('<IIII', n0, l, s, count)
        return hashlib.md5(data).digest()

    def recover(self):
        """
        Recovery after crash is trivial.

        1. Load header (N0, L, S)
        2. Bucket addresses computed from formula
        3. No directory reconstruction needed!
        """
        self._load_header()

        # Validate bucket count matches expected
        expected_buckets = self.N0 * (2 ** self.L) + self.S
        actual_buckets = self._count_bucket_pages()

        if actual_buckets != expected_buckets:
            # Handle split-in-progress recovery
            if actual_buckets == expected_buckets + 1:
                # Split created new bucket but didn't update S
                self.S += 1
                if self.S >= self.N0 * (2 ** self.L):
                    self.S = 0
                    self.L += 1
                self._persist_header()
            else:
                raise ValueError(
                    f"Bucket count mismatch: {actual_buckets} vs {expected_buckets}"
                )
```

Linear hashing integrates elegantly with WAL. A split operation logs: (1) new bucket creation, (2) record movements, (3) S increment. Recovery replays the log, recomputing bucket addresses from (N₀, L, S). No directory reconstruction is ever needed.
While directory-free operation provides significant advantages, it comes with trade-offs. Understanding these helps in choosing between linear and extendible hashing.
What We Lose Without a Directory:
| Aspect | With Directory (Extendible) | Without Directory (Linear) | Trade-off Assessment |
|---|---|---|---|
| Split flexibility | Split overflowing bucket | Split bucket S (fixed order) | Linear may split non-overflowing buckets |
| Overflow resolution | Immediate | Deferred until S catches up | Linear may have longer chains temporarily |
| Hash computation | One hash, one lookup | One or two hash computations | Linear slightly more CPU; extendible slightly more I/O |
| Bucket address lookup | O(1) array access | O(1) computation | Equivalent complexity, different constants |
| Space mapping | Flexible (multiple entries → one bucket) | Rigid (formula determines all) | Extendible can represent sparse distributions |
The Split Order Constraint:
The most significant trade-off is the fixed split order. Consider a scenario: the split pointer sits at S = 5 while bucket 100 overflows badly. Linear hashing cannot split bucket 100 directly; it must advance the pointer one bucket at a time.
This means 95 splits before the problematic bucket is addressed! In extendible hashing, we'd split bucket 100 immediately.
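The cost of the fixed order can be simulated. A sketch, assuming the pointer starts at bucket 5 with N₀ = 4 and L = 5 (so 128 buckets exist this round — illustrative values consistent with the 95-split figure):

```python
def splits_until_reached(s: int, target: int, n0: int, level: int) -> int:
    """Count splits performed before bucket `target` becomes the split victim."""
    splits = 0
    level_range = n0 * (2 ** level)
    while s != target:
        s += 1
        splits += 1
        if s >= level_range:  # round finished: wrap pointer, advance level
            s = 0
            level += 1
            level_range = n0 * (2 ** level)
    return splits

print(splits_until_reached(s=5, target=100, n0=4, level=5))  # → 95
```

If the pointer has already passed the hot bucket, it must wrap through a full level change before returning, making the delay even longer.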
Mitigation Strategies:
Aggressive Splitting: Lower load factor threshold so overflow is rare.
Larger Buckets: More capacity per bucket reduces overflow probability.
Overflow Tolerance: Accept some overflow chains as a trade-off for directory freedom.
Hybrid Approaches: Use extendible hashing for hot-spot-prone tables, linear for general use.
In many production database systems, hash indexes use linear hashing by default. The memory savings and concurrency benefits outweigh the deferred split trade-off for typical workloads. PostgreSQL's hash indexes, for example, use linear hashing.
Let's visualize the structural differences between directory-based and directory-free hashing.
Extendible Hashing Structure:
                     Directory (2^d entries)
                 ┌───┬───┬───┬───┬───┬───┬───┬───┐
hash(key) ──────▶│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │  (depth = 3)
                 └─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┴─┬─┘
                   │   │   │   │   │   │   │   │
                   ▼   └─┬─┘   ▼   └───▼───┘   ▼
                 ┌───┐   ▼   ┌───┐   ┌───┐   ┌───┐
                 │ A │ ┌───┐ │ C │   │ D │   │ E │   ← Buckets
                 └───┘ │ B │ └───┘   └───┘   └───┘
                       └───┘

Directory entries: 8
Actual buckets:    5
Overhead:          3 wasted pointers
Linear Hashing Structure:
State: N₀ = 4, L = 1, S = 2

                   h_L(key) if bucket ≥ S
hash(key) ─────┬──────────────────────────▶ bucket
               │
               │   h_{L+1}(key) if bucket < S
               └──────────────────────────▶ bucket

Buckets: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9]
         └──┬──┘ └──────────┬──────────┘ └──┬──┘
           Split      Not yet split     New siblings
       (use h_{L+1})    (use h_L)      (use h_{L+1})

Metadata: Just 3 integers!
No directory, no wasted pointers
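To tie the picture to the formula, here is a small sketch that resolves a few hash values under the diagram's state (N₀ = 4, L = 1, S = 2):

```python
N0, L, S = 4, 1, 2  # the state shown in the diagram

def resolve(h: int) -> int:
    """Directory-free address computation from (N0, L, S) alone."""
    bucket = h % (N0 * 2 ** L)            # h_L: mod 8
    if bucket < S:                        # bucket already split this round
        bucket = h % (N0 * 2 ** (L + 1))  # h_{L+1}: mod 16
    return bucket

print(resolve(12))  # 12 mod 8 = 4 ≥ S → bucket 4 (not yet split)
print(resolve(9))   # 9 mod 8 = 1 < S → 9 mod 16 = 9 (a new sibling)
```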
| Component | Extendible Hashing | Linear Hashing |
|---|---|---|
| Address computation input | hash → directory → bucket | hash → formula → bucket |
| Auxiliary structure | Directory array (2^d entries) | None (3 integers) |
| Growth granularity | Bucket + possible directory double | One bucket at a time |
| Per-bucket metadata | Local depth stored | None needed |
| Bucket identity | Directory entry points to bucket | Sequential bucket numbers |
| Pointer indirection | One level (directory → bucket) | Zero levels (direct computation) |
Linear hashing's elimination of directory structures is its defining characteristic. By using sequential splitting and a simple address formula, it achieves dynamic growth with minimal overhead.
What's Next:
We've now covered linear hashing's core mechanisms: growth, split pointer, triggers, and directory-free operation. The final page examines performance characteristics—analyzing average and worst-case lookup costs, comparison with other index structures, and practical performance tuning for production systems.
You now understand why linear hashing doesn't need a directory and the profound implications this has for memory efficiency, concurrency, and persistence. This directory-free property is what makes linear hashing particularly suitable for database systems where resource efficiency and concurrent access are paramount.