Understanding performance characteristics is essential for using linear hashing effectively in database systems. While the O(1) expected lookup time is well-known, the complete picture involves subtleties around load factor variation, overflow chain behavior, I/O costs, and comparison with alternative index structures.
This page provides a rigorous analysis of linear hashing performance, equipping you to predict behavior, tune parameters, and make informed choices about when to use linear hashing versus alternatives.
By the end of this page, you will understand average-case and worst-case performance of linear hashing operations, the impact of load factor on chain length, I/O cost analysis for disk-based implementations, comparison with B+-trees and extendible hashing, and practical tuning strategies.
Let's establish the fundamental complexity bounds for linear hashing operations.
Operation Complexities:
| Operation | Average Case | Worst Case | Amortized | Notes |
|---|---|---|---|---|
| Lookup | O(1) | O(n) | O(1) | Worst case = all keys in one bucket |
| Insert | O(1) | O(n) | O(1) | Includes split amortization |
| Delete | O(1) | O(n) | O(1) | Chain traversal for delete |
| Split | O(b) | O(b) | — | b = records in bucket being split |
| Range query | O(n) | O(n) | — | Hash indexes don't support ranges! |
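The amortized O(1) insert bound in the table follows from a simple accounting argument: split work is spread evenly across insertions. A back-of-envelope sketch (the threshold-based trigger and the symbol names here are illustrative assumptions):

```python
def amortized_split_work(bucket_capacity: int, alpha: float) -> float:
    """Average number of records rehashed per insert due to splits.

    If splits fire once the load factor exceeds alpha, one split occurs
    roughly every alpha * b insertions, and each split rehashes about
    b records, so bucket capacity cancels out: the answer is ~1/alpha.
    """
    inserts_per_split = alpha * bucket_capacity
    records_moved_per_split = bucket_capacity
    return records_moved_per_split / inserts_per_split

# Constant per insert, independent of table size n
print(round(amortized_split_work(50, 0.75), 2))  # → 1.33
```

Because the result is independent of n, splits add only a constant factor to insertion cost, which is exactly the amortized O(1) claim above.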
Expected Chain Length Analysis:
For a hash table with n records and N buckets, the load factor is α = n/N.
Assuming uniform hashing (each key equally likely to hash to any bucket), the expected number of records in each bucket follows a Poisson distribution with parameter α.
Expected chain length = α

For example, with n = 8,000 records in N = 10,000 buckets, α = 0.8, so each bucket holds 0.8 records on average.

Probability of Finding a Key:

For a successful lookup, the target key sits at a random position within its chain, so the expected number of records examined is approximately 1 + α/2 (about 1.4 at α = 0.8).

For an unsuccessful lookup, the entire chain must be scanned, so the expected number of records examined is α.
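These expectations are easy to check empirically. The sketch below (illustrative code with arbitrary sizes, not from the text) inserts random keys into a fixed-size chained table, then measures the average number of records examined per successful lookup:

```python
import random

def avg_successful_probes(n: int, num_buckets: int) -> float:
    """Insert n random keys, look each one up, and count records examined."""
    buckets = [[] for _ in range(num_buckets)]
    keys = [random.getrandbits(64) for _ in range(n)]
    for k in keys:
        buckets[k % num_buckets].append(k)
    # Position in chain (1-based) = records examined for that key
    probes = sum(buckets[k % num_buckets].index(k) + 1 for k in keys)
    return probes / n

random.seed(42)
# alpha = 80,000 / 100,000 = 0.8; theory predicts 1 + alpha/2 = 1.40
print(f"{avg_successful_probes(80_000, 100_000):.3f}")
```

The measured value lands very close to 1.40, confirming the 1 + α/2 estimate for successful lookups.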
In linear hashing, the load factor oscillates. At the start of a level, α might be 0.8 (triggering splits). At the end of the level, after doubling buckets, α drops to about 0.4. This saw-tooth pattern means actual performance varies—but stays within predictable bounds.
Worst-Case Considerations:
The O(n) worst case occurs with pathological hash collisions, where all n records hash to the same bucket. This is extremely unlikely with a well-designed hash function on non-adversarial input, but it motivates using randomized or keyed hash functions when inputs may be attacker-controlled.
Secondary Clustering:
Linear hashing with chaining doesn't suffer from primary clustering (like linear probing), but can experience secondary clustering if the hash function produces non-uniform distributions.
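A small illustration of this effect (the stride-64 key pattern is an assumed example of a non-uniform source): with a power-of-two bucket count and a plain `key mod N` mapping, keys that share low-order bits pile into a small subset of buckets.

```python
def max_chain(keys, num_buckets: int) -> int:
    """Longest chain produced by mapping keys with key mod num_buckets."""
    counts = [0] * num_buckets
    for k in keys:
        counts[k % num_buckets] += 1
    return max(counts)

N = 1024
uniform_keys = range(10_000)                   # consecutive keys spread evenly
skewed_keys = [i * 64 for i in range(10_000)]  # stride-64 keys hit only 1024/64 = 16 buckets

print(max_chain(uniform_keys, N))  # → 10
print(max_chain(skewed_keys, N))   # → 625
```

The same number of records produces chains sixty times longer under the skewed source, which is why production hash indexes mix the key bits (e.g., multiplicative or cryptographic hashing) before taking the modulus.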
The load factor α is the primary determinant of linear hashing performance. Let's analyze its impact quantitatively.
Chain Length Distribution:
With uniform hashing, bucket sizes follow a Poisson distribution:
P(bucket has k records) = (α^k × e^(-α)) / k!
| α | P(0) | P(1) | P(2) | P(3) | P(≥4) | E[probes] |
|---|---|---|---|---|---|---|
| 0.5 | 60.7% | 30.3% | 7.6% | 1.3% | 0.1% | 1.25 |
| 0.7 | 49.7% | 34.8% | 12.2% | 2.8% | 0.5% | 1.35 |
| 0.8 | 44.9% | 35.9% | 14.4% | 3.8% | 1.0% | 1.40 |
| 1.0 | 36.8% | 36.8% | 18.4% | 6.1% | 1.9% | 1.50 |
| 1.5 | 22.3% | 33.5% | 25.1% | 12.6% | 6.5% | 1.75 |
| 2.0 | 13.5% | 27.1% | 27.1% | 18.0% | 14.3% | 2.00 |
Performance Recommendations by Load Factor:
```python
import math

def analyze_load_factor(alpha: float) -> dict:
    """Analyze expected performance metrics for a given load factor."""
    # Expected records examined for a successful lookup (key sits mid-chain)
    successful_probes = 1 + alpha / 2
    # Expected records examined for an unsuccessful lookup (full chain)
    unsuccessful_probes = alpha

    # Probability of a bucket holding more than k records (Poisson tail)
    def prob_overflow(k: int) -> float:
        cumulative = sum(
            (alpha**i * math.exp(-alpha)) / math.factorial(i)
            for i in range(k + 1)
        )
        return 1 - cumulative

    return {
        "load_factor": alpha,
        "expected_chain_length": alpha,
        "successful_lookup_probes": successful_probes,
        "unsuccessful_lookup_probes": unsuccessful_probes,
        "prob_empty_bucket": math.exp(-alpha),
        "prob_overflow_3": prob_overflow(3),
        "prob_overflow_5": prob_overflow(5),
        "space_utilization": min(alpha, 1.0),  # Approximate
    }

# Example analysis
for alpha in [0.5, 0.7, 0.8, 1.0, 1.5]:
    metrics = analyze_load_factor(alpha)
    print(f"α={alpha}: {metrics['successful_lookup_probes']:.2f} probes, "
          f"{metrics['prob_overflow_3']*100:.1f}% overflow >3")
```

A load factor of 0.75 (75% full) is commonly recommended as the default split threshold. At α=0.75, expected probes are ~1.38, only about 0.7% of buckets exceed 3 records, and space utilization is reasonable. This value is used by Java's HashMap and many database hash indexes.
For database systems, I/O cost often dominates. Let's analyze linear hashing's disk I/O characteristics.
I/O Model Assumptions: each page access (primary bucket or overflow page) costs one I/O, and the linear hashing metadata (level, split pointer, bucket count) is small enough to stay in memory.
| Operation | Best Case | Expected | Worst Case | Conditions |
|---|---|---|---|---|
| Lookup (point) | 1 I/O | 1 + O/N I/Os | 1 + O I/O | O = total overflow pages |
| Insert | 1 I/O | 1 + O/N I/Os | 1 + O + split I/O | May trigger split |
| Delete | 1 I/O | 1 + O/N I/Os | 1 + O I/O | May traverse chain |
| Split | 2 I/Os | 2-4 I/Os | Many I/Os | Depends on bucket size |
Overflow Chain I/O:
The critical factor for disk-based linear hashing is overflow chain management. Each overflow page traversal costs one I/O.
Total I/O per lookup = 1 (primary bucket) + (c − 1) overflow pages, where c is the number of pages in the bucket's chain.

If each page holds b records, a bucket containing r records needs c = ⌈r / b⌉ pages. With an expected α × b records per bucket:

Expected I/O = expected chain length in pages
             = ⌈(α × b) / b⌉ = ⌈α⌉
             ≈ 1 for α ≤ 1.0
With load factor α ≤ 0.8 and reasonable bucket size, expected I/O stays close to 1.
```python
def analyze_io_cost(
    num_records: int,
    page_size: int = 4096,
    record_size: int = 100,
    load_factor: float = 0.75,
) -> dict:
    """Analyze expected I/O costs for disk-based linear hashing."""
    # Bucket capacity (records per primary page)
    header_size = 64  # Page header
    records_per_page = (page_size - header_size) // record_size

    # Number of buckets needed
    num_buckets = int(num_records / (load_factor * records_per_page))

    # Expected records per bucket
    records_per_bucket = num_records / num_buckets

    # Expected overflow pages per bucket
    if records_per_bucket <= records_per_page:
        overflow_per_bucket = 0
    else:
        overflow_per_bucket = (records_per_bucket - records_per_page) / records_per_page

    # I/O costs
    lookup_io = 1 + overflow_per_bucket
    insert_io = lookup_io  # Find bucket + possible write
    # Split I/O: read old bucket + write old + write new
    split_io = 3 + 2 * max(0, overflow_per_bucket)

    return {
        "num_buckets": num_buckets,
        "records_per_bucket": records_per_bucket,
        "records_per_page": records_per_page,
        "overflow_pages_per_bucket": overflow_per_bucket,
        "expected_lookup_io": lookup_io,
        "expected_insert_io": insert_io,
        "split_io": split_io,
        "total_pages": num_buckets * (1 + overflow_per_bucket),
        "space_utilization": num_records * record_size
            / (num_buckets * (1 + overflow_per_bucket) * page_size),
    }

# Example: 1 million records
analysis = analyze_io_cost(1_000_000)
print(f"1M records: {analysis['expected_lookup_io']:.2f} I/Os per lookup")
```

On HDDs, minimizing I/O count is paramount (~10ms per I/O). SSDs change the calculus: individual I/O latency is ~100x lower, but throughput benefits from sequential access. For SSD-based hash indexes, slightly higher load factors (trading more chain traversals for fewer pages) can be acceptable.
B+-trees are the dominant index structure in databases. Understanding when linear hashing outperforms B+-trees (and vice versa) is crucial for index selection.
Performance Comparison:
| Operation | Linear Hashing | B+-Tree | Winner |
|---|---|---|---|
| Point lookup | O(1) expected, ~1-2 I/Os | O(log n), 2-4 I/Os | Linear Hashing |
| Range query | O(n) — must scan all | O(log n + k) | B+-Tree (by far) |
| Insert | O(1) amortized | O(log n) | Linear Hashing |
| Delete | O(1) amortized | O(log n) | Linear Hashing |
| Ordered scan | Not supported | O(n) | B+-Tree |
| Min/Max | O(n) | O(log n) | B+-Tree |
| Space overhead | ~25% (chaining) | ~30-50% | Similar |
Concrete I/O Comparison:
Assuming roughly 100 entries per B+-tree node and a hash load factor of 0.75, point-lookup I/O compares as follows:
| Dataset Size | B+-Tree Height | B+-Tree I/Os | Linear Hash I/Os |
|---|---|---|---|
| 1,000 | 2 | 2 | 1 |
| 100,000 | 3 | 3 | 1-2 |
| 10,000,000 | 4 | 4 | 1-2 |
| 1,000,000,000 | 5 | 5 | 1-2 |
Linear hashing's advantage grows with data size—while B+-tree depth increases logarithmically, hash lookup I/O stays constant.
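The heights in the table follow directly from the fanout. A quick sketch (the fanout of 100 is the same assumption used in the table above):

```python
import math

def btree_height(num_records: int, fanout: int = 100) -> int:
    """Levels traversed root-to-leaf in a B+-tree with the given fanout."""
    return max(1, math.ceil(math.log(num_records, fanout)))

for n in [1_000, 100_000, 10_000_000, 1_000_000_000]:
    # Hash lookup stays at ~1-2 I/Os regardless of n
    print(f"{n:>13,} records: B+-tree {btree_height(n)} I/Os vs hash ~1-2 I/Os")
```

In practice the top one or two B+-tree levels are usually cached, so the real-world gap is smaller than the raw heights suggest, but the asymptotic trend favors hashing for pure point lookups.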
Most databases use B+-trees as the default index and hash indexes for specific cases. PostgreSQL offers both; MySQL/InnoDB uses B+-trees exclusively (though its adaptive hash index uses hashing internally). Hash indexes shine for primary key lookups and equality-based joins on large tables.
Both linear and extendible hashing are dynamic hash schemes. Let's compare their performance characteristics in detail.
| Aspect | Linear Hashing | Extendible Hashing | Notes |
|---|---|---|---|
| Lookup I/O | 1-2 I/Os | 2 I/Os (directory + bucket) | Linear wins slightly |
| Overflow handling | May accumulate chains | Immediate split | Extendible better worst-case |
| Space overhead | O(1) metadata | O(2^d) directory | Linear wins significantly |
| Split cost | Local (1 bucket) | Local (1 bucket + directory update) | Similar |
| Doubling cost | Distributed over N splits | Concentrated in 1 operation | Linear smoother |
| Concurrency | Simple locks | Directory coordination needed | Linear easier |
Performance Under Skewed Distributions:
Skewed key distributions (hot keys) expose a key difference:

Linear Hashing: a hot bucket simply grows a longer overflow chain until the round-robin split pointer eventually reaches it; splits happen in fixed order regardless of where the pressure is.

Extendible Hashing: the hot bucket splits immediately, but repeated splits of the same key region keep deepening the directory.
Scenario: 80% of accesses to 20% of keys (Zipf distribution)
Linear Hashing:
- Hot buckets have longer chains (3-5 records)
- Cold buckets mostly empty
- But: only 24 bytes metadata
- I/O: 1.5-2 average
Extendible Hashing:
- Hot buckets split, depth increases
- Directory may reach depth 25-30
- Directory alone: 100+ MB
- I/O: 2 (guaranteed: directory + bucket)
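The directory cost in this scenario is easy to quantify. A sketch (the 8-byte bucket pointer size is an assumption; 4-byte pointers would halve these figures):

```python
def extendible_directory_bytes(global_depth: int, pointer_size: int = 8) -> int:
    """An extendible hashing directory holds 2^d bucket pointers at global depth d."""
    return (2 ** global_depth) * pointer_size

for d in [20, 25, 30]:
    mb = extendible_directory_bytes(d) / (1024 * 1024)
    print(f"depth {d}: {mb:,.0f} MB directory")  # linear hashing: ~24 bytes of metadata
```

At depth 25 the directory alone occupies 256 MB, and each additional level doubles it, while linear hashing's metadata (level, split pointer, bucket count) stays constant.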
Linear hashing is generally preferred for database systems due to memory efficiency and simpler concurrency. Extendible hashing may be better for: (1) in-memory hash tables where directory overhead is less concerning, (2) workloads with extreme skew where immediate split response matters, (3) systems requiring guaranteed worst-case bounds.
Achieving optimal linear hashing performance requires careful tuning based on workload characteristics. Here are key optimization strategies:
1. Load Factor Tuning:
```python
class TunedLinearHash:
    """
    Linear hash table with tunable parameters.

    Load factor is measured as records / (buckets × bucket_capacity),
    i.e. the fraction of total capacity in use.
    """

    def __init__(
        self,
        initial_buckets: int = 64,         # Start larger for known large datasets
        target_load_factor: float = 0.75,
        min_load_factor: float = 0.25,     # Trigger contraction if too sparse
        bucket_capacity: int = 50,         # Match to page size
    ):
        self.N0 = initial_buckets          # Buckets at level 0
        self.L = 0                         # Current level
        self.S = 0                         # Split pointer
        self.buckets = [[] for _ in range(initial_buckets)]
        self.target_load = target_load_factor
        self.min_load = min_load_factor
        self.capacity = bucket_capacity

        # Performance tracking
        self.total_probes = 0
        self.total_lookups = 0

    def _hash(self, key, level: int) -> int:
        return hash(key) % (self.N0 * (2 ** level))

    def _bucket_index(self, key) -> int:
        idx = self._hash(key, self.L)
        if idx < self.S:                   # Bucket already split at this level
            idx = self._hash(key, self.L + 1)
        return idx

    def _record_count(self) -> int:
        return sum(len(b) for b in self.buckets)

    def _load_factor(self) -> float:
        return self._record_count() / (len(self.buckets) * self.capacity)

    def insert(self, key, value) -> None:
        self.buckets[self._bucket_index(key)].append((key, value))
        if self._should_split():
            self._perform_split()

    def lookup(self, key):
        self.total_lookups += 1
        bucket = self.buckets[self._bucket_index(key)]
        for probes, (k, v) in enumerate(bucket, start=1):
            if k == key:
                self.total_probes += probes
                return v
        self.total_probes += len(bucket)
        return None

    def _should_split(self) -> bool:
        """Trigger a split based on the global load factor."""
        return self._load_factor() > self.target_load

    def _should_contract(self) -> bool:
        """
        Consider contraction if highly underutilized.
        Note: linear hashing doesn't traditionally contract,
        but some implementations support it.
        """
        if self.L == 0 and self.S == 0:
            return False                   # Can't contract below initial size
        return self._load_factor() < self.min_load

    def _perform_split(self) -> None:
        """Split bucket S using the next-level hash, then advance S."""
        self.buckets.append([])
        old, self.buckets[self.S] = self.buckets[self.S], []
        for key, value in old:
            self.buckets[self._hash(key, self.L + 1)].append((key, value))
        self.S += 1
        if self.S == self.N0 * (2 ** self.L):
            self.S, self.L = 0, self.L + 1

    def tune_for_read_heavy(self) -> None:
        """Optimize for a read-heavy workload."""
        self.target_load = 0.60            # Lower load = shorter chains
        # Proactively split to reduce chains
        while self._load_factor() > self.target_load:
            self._perform_split()

    def tune_for_write_heavy(self) -> None:
        """Optimize for a write-heavy workload."""
        self.target_load = 0.85            # Higher load = fewer splits
        # Accept slightly longer chains for fewer split operations

    def tune_for_memory(self) -> None:
        """Optimize for memory efficiency."""
        self.target_load = 0.95            # Maximize utilization
        self.min_load = 0.40               # Enable contraction

    def get_performance_metrics(self) -> dict:
        """Return current performance statistics."""
        bucket_sizes = [len(b) for b in self.buckets]
        return {
            "load_factor": self._load_factor(),
            "bucket_count": len(self.buckets),
            "avg_chain_length": sum(bucket_sizes) / len(bucket_sizes),
            "max_chain_length": max(bucket_sizes),
            "empty_buckets": sum(1 for s in bucket_sizes if s == 0),
            "overflow_buckets": sum(1 for s in bucket_sizes if s > self.capacity),
            "avg_probes_per_lookup": self.total_probes / max(1, self.total_lookups),
        }
```

Before deploying linear hashing: (1) Profile the workload's read/write ratio, (2) Estimate dataset size growth, (3) Set the initial bucket count to avoid many early splits, (4) Choose the load factor based on the latency vs space trade-off, (5) Implement monitoring for chain length and split frequency.
Accurate benchmarking of linear hashing requires attention to methodology. Here's how to measure and interpret performance correctly.
Key Metrics to Measure:
| Metric | What It Measures | How to Measure | Target Range |
|---|---|---|---|
| Lookup latency (p50/p99) | Typical and tail performance | Histogram of operation times | p99 < 2× p50 |
| Throughput (ops/sec) | Aggregate performance | Operations completed / time | Workload-dependent |
| Average probe count | Hash table efficiency | Total probes / total lookups | < 1.5 for α < 0.8 |
| Split frequency | Growth overhead | Splits / insertions | ~1 split per α × b inserts (b = bucket capacity) |
| Memory per record | Space efficiency | Total memory / record count | Record size + ~30% |
```python
import time
import random
import statistics

def benchmark_linear_hash(
    hash_table,
    num_records: int = 1_000_000,
    read_ratio: float = 0.8,
    num_operations: int = 100_000,
) -> dict:
    """
    Comprehensive benchmark of a linear hash table.
    Returns a performance metrics dictionary.
    """
    # Phase 1: Bulk load
    start_load = time.perf_counter()
    for i in range(num_records):
        hash_table.insert(f"key_{i}", f"value_{i}")
    load_time = time.perf_counter() - start_load

    # Phase 2: Mixed workload
    keys = [f"key_{random.randint(0, num_records - 1)}"
            for _ in range(num_operations)]
    operations = [random.random() < read_ratio for _ in range(num_operations)]

    latencies = []
    for i, (key, is_read) in enumerate(zip(keys, operations)):
        start = time.perf_counter()
        if is_read:
            hash_table.lookup(key)
        else:
            hash_table.insert(f"new_key_{i}", f"new_value_{i}")
        latencies.append(time.perf_counter() - start)

    # Compute statistics
    latencies.sort()
    return {
        "bulk_load_time_sec": load_time,
        "load_throughput_ops_sec": num_records / load_time,
        "mixed_throughput_ops_sec": num_operations / sum(latencies),
        "latency_p50_us": latencies[int(0.50 * len(latencies))] * 1e6,
        "latency_p90_us": latencies[int(0.90 * len(latencies))] * 1e6,
        "latency_p99_us": latencies[int(0.99 * len(latencies))] * 1e6,
        "latency_avg_us": statistics.mean(latencies) * 1e6,
        "latency_stdev_us": statistics.stdev(latencies) * 1e6,
        "final_load_factor": hash_table._load_factor(),
        "final_bucket_count": len(hash_table.buckets),
    }
```

Common benchmarking mistakes: (1) Not warming up the hash table, (2) Ignoring cache effects (the first access is slower), (3) Using sequential keys (a best case for hashing), (4) Ignoring garbage collection pauses, (5) Not measuring tail latency (p99/p999). Always use production-like data and access patterns.
We've analyzed linear hashing performance from theoretical complexity to practical benchmarking. The key insight is that linear hashing provides excellent O(1) expected performance for point operations while trading off range query capability.
Module Complete:
This concludes our comprehensive exploration of linear hashing, from theoretical foundations and complexity analysis through implementation details to performance tuning.
Congratulations! You have completed the Linear Hashing module. You now understand this elegant dynamic hashing technique that eliminates directory overhead while maintaining O(1) expected performance. This knowledge is directly applicable to database system design and optimization.