No matter how carefully we size our hash index or how well our hash function distributes keys, overflow is inevitable. The mathematics of random distribution guarantee that some buckets will receive more entries than they can hold in a single page. When this happens, the elegant O(1) property of hash indexing begins to erode.
Overflow handling is the set of strategies for managing entries that cannot fit in their designated primary bucket. These strategies range from simple linked overflow pages to sophisticated schemes like open addressing. Each approach involves tradeoffs between lookup performance, insertion efficiency, space utilization, and implementation complexity.
Understanding overflow is critical because it represents the failure mode of static hashing. When overflow becomes excessive, the hash index degenerates from a direct-access structure to something more like a linear search. Recognizing when this threshold is crossed—and knowing how to respond—separates competent database engineers from those who merely hope their systems will scale.
By the end of this page, you will understand: (1) Why overflow is mathematically inevitable in static hashing, (2) The primary strategies for handling overflow (chaining vs. open addressing), (3) How overflow impacts query performance, (4) Space management for overflow pages, and (5) Detection and monitoring of overflow conditions.
Consider a perfectly designed hash index: ideal hash function, optimal bucket count, and exact foreknowledge of data volume. Even in this theoretical paradise, overflow will occur. The reason lies in fundamental probability—the same mathematics that govern the birthday paradox.
When n records are distributed across N buckets by a perfect uniform hash function, the number of records in each bucket is well approximated by a Poisson distribution with parameter λ = n/N (the load factor). For any individual bucket:

P(X = k) = e^(−λ) × λ^k / k!

If our bucket capacity is c records per page, we care about the probability that a bucket receives more records than it can hold:

P(overflow) = P(X > c) = 1 − Σ (k = 0 to c) P(X = k)
Let's examine concrete numbers:
| Load Factor (λ) | Bucket Capacity (c) | P(overflow per bucket) | Expected Overflow Buckets (N=10,000) |
|---|---|---|---|
| 0.5 | 5 | 0.0002% | ~0 buckets |
| 0.5 | 2 | 1.44% | 144 buckets |
| 0.7 | 5 | 0.009% | ~1 bucket |
| 0.7 | 3 | 0.66% | 66 buckets |
| 1.0 | 5 | 0.14% | 14 buckets |
| 1.0 | 3 | 1.90% | 190 buckets |
| 1.5 | 5 | 1.85% | 185 buckets |
| 1.5 | 3 | 9.34% | 934 buckets |
| 2.0 | 5 | 5.26% | 526 buckets |
| 2.0 | 3 | 32.33% | 3,233 buckets |
Key Insights:
Low load factors with high capacity are safest: At λ=0.5 with c=5, overflow is virtually nonexistent.
Load factor has exponential impact: Doubling λ from 1.0 to 2.0 increases overflow probability ~37× (for c=5).
Bucket capacity is a powerful lever: Increasing bucket size from 3 to 5 reduces overflow probability dramatically.
At practical settings, overflow is common: Even at the recommended λ=0.7 with c=3, ~66 buckets in 10,000 will overflow.
Recall from probability theory: when hashing n items into n buckets, the maximum bucket load is approximately O(log n / log log n) with high probability. For practical database sizes this maximum is small in absolute terms (for n = 1,000,000, ln n / ln ln n ≈ 5), but the implication matters: even if the average bucket load is well below capacity, some buckets will far exceed it.
Data distribution in real systems is rarely uniform. Certain keys (popular user IDs, common timestamps, default values) appear disproportionately often. These 'hot keys' can cause specific buckets to overflow catastrophically even when overall load factor is low. Monitoring must focus on maximum bucket load, not just average.
The most common approach to handling overflow in database hash indexes is closed addressing (also called separate chaining or overflow chaining). In this approach, each bucket can link to additional overflow pages that hold excess entries.
When a bucket's primary page fills, an overflow page is allocated and linked from the primary page. If this overflow page also fills, another is linked, forming a chain:
Primary Bucket Page → Overflow Page 1 → Overflow Page 2 → ... → NULL
Each page contains:
```c
/* Bucket page structure with overflow chaining */
typedef struct BucketPage {
    /* Header */
    uint32_t page_id;
    uint32_t bucket_id;
    uint16_t entry_count;
    uint16_t free_space;
    uint32_t overflow_page_id;  /* KEY: pointer to next page, or NULL_PAGE */
    uint32_t checksum;

    /* Slot directory and entry data */
    uint16_t slots[MAX_SLOTS];
    char data[PAGE_SIZE - HEADER_SIZE];
} BucketPage;

#define NULL_PAGE 0xFFFFFFFF

/* Check if bucket has overflow chain */
static inline bool has_overflow(BucketPage *page) {
    return page->overflow_page_id != NULL_PAGE;
}

/* Insert entry into bucket, creating overflow if needed */
InsertResult insert_entry(HashIndex *index, uint32_t bucket_id,
                          const void *key, size_t key_len,
                          const void *value, size_t value_len) {
    BucketPage *page = read_bucket(index, bucket_id);
    BucketPage *current = page;

    /* Traverse to end of chain or find space */
    while (!has_space(current, key_len + value_len)) {
        if (!has_overflow(current)) {
            /* Allocate new overflow page */
            uint32_t new_page_id = allocate_overflow_page(index);
            current->overflow_page_id = new_page_id;
            mark_dirty(current);

            BucketPage *new_page = read_page(index, new_page_id);
            init_overflow_page(new_page, bucket_id);
            current = new_page;
            break;
        }
        /* Move to next page in chain */
        current = read_page(index, current->overflow_page_id);
    }

    /* Insert into current page */
    insert_into_page(current, key, key_len, value, value_len);
    mark_dirty(current);
    return INSERT_SUCCESS;
}
```

Searching for a key requires traversing the chain until the key is found or the chain ends:
```c
/* Lookup entry in bucket with overflow chain */
LookupResult lookup_entry(HashIndex *index, const void *key, size_t key_len,
                          void *value_out, size_t *value_len_out) {
    /* Compute bucket */
    uint32_t bucket_id = compute_bucket_id(index, key, key_len);

    /* Read primary bucket page */
    BucketPage *page = read_bucket(index, bucket_id);
    uint32_t pages_read = 1;

    while (page != NULL) {
        /* Search within this page */
        int slot = find_entry_in_page(page, key, key_len);
        if (slot >= 0) {
            /* Found! Extract value */
            extract_value(page, slot, value_out, value_len_out);
            return (LookupResult){ .found = true, .pages_read = pages_read };
        }

        /* Move to overflow page if exists */
        if (has_overflow(page)) {
            page = read_page(index, page->overflow_page_id);
            pages_read++;
        } else {
            page = NULL;  /* End of chain, key not found */
        }
    }

    return (LookupResult){ .found = false, .pages_read = pages_read };
}
```

While chaining dominates database implementations, open addressing offers an alternative approach in which overflow entries are stored in other buckets rather than in linked overflow pages. This technique is more common in in-memory hash tables but has disk-based applications in certain scenarios.
When a bucket is full, probe subsequent buckets until an empty slot is found:
Insert(k): Try bucket h(k), h(k)+1, h(k)+2, ... until space found
Lookup(k): Check bucket h(k), h(k)+1, h(k)+2, ... until key found or empty bucket
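The probe loops above can be sketched as a minimal in-memory table. This is a toy sketch: the table size, hash function, and the convention that key 0 marks an empty slot are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TABLE_SIZE 8
#define EMPTY_KEY  0   /* assumption of this sketch: key 0 marks an empty slot */

typedef struct { unsigned key; int value; } Slot;

static unsigned hash(unsigned key) { return key % TABLE_SIZE; }

/* Insert(k): try bucket h(k), h(k)+1, h(k)+2, ... (wrapping) until space */
bool lp_insert(Slot table[], unsigned key, int value) {
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned b = (hash(key) + i) % TABLE_SIZE;
        if (table[b].key == EMPTY_KEY || table[b].key == key) {
            table[b].key = key;
            table[b].value = value;
            return true;
        }
    }
    return false;  /* table full */
}

/* Lookup(k): probe until the key or the first empty bucket is found */
bool lp_lookup(const Slot table[], unsigned key, int *value_out) {
    for (unsigned i = 0; i < TABLE_SIZE; i++) {
        unsigned b = (hash(key) + i) % TABLE_SIZE;
        if (table[b].key == EMPTY_KEY) return false;  /* empty => key absent */
        if (table[b].key == key) { *value_out = table[b].value; return true; }
    }
    return false;
}
```

Note the lookup rule: an empty bucket terminates the probe sequence, which is exactly why deletions in open addressing require tombstones rather than simply clearing the slot.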
Advantages:
No overflow pages or pointers: every entry lives in the primary bucket array, so there is no separate allocation path.
Good locality: probing consecutive buckets is sequential access, which caches and prefetchers handle well.
Lower space overhead: no per-page chain pointers and no half-empty overflow pages.
Disadvantages:
Primary clustering: runs of occupied buckets grow and merge, lengthening probe sequences for many keys at once.
Awkward deletion: removed entries must be replaced with tombstones so later probes do not terminate early.
Hard capacity ceiling: performance collapses as the table approaches full, and the structure can never hold more than N × c entries.
Expected probe counts for linear probing grow steeply with load factor α:

| Load Factor (α) | Successful Search | Unsuccessful Search | Insert |
|---|---|---|---|
| 0.50 | 1.5 probes | 2.5 probes | 2.5 probes |
| 0.75 | 2.5 probes | 8.5 probes | 8.5 probes |
| 0.90 | 5.5 probes | 50.5 probes | 50.5 probes |
| 0.95 | 10.5 probes | 200.5 probes | 200.5 probes |
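The probe counts in the table follow Knuth's standard approximations for linear probing. A small sketch for computing them at any load factor (function names are illustrative):

```c
#include <assert.h>
#include <math.h>

/* Knuth's classic approximations for linear probing at load factor a,
 * 0 < a < 1 (probe counts, not page I/Os):
 *   successful search:   (1/2) * (1 + 1/(1-a))
 *   unsuccessful search: (1/2) * (1 + 1/(1-a)^2)  -- also approximates insert */
double lp_success_probes(double a) {
    return 0.5 * (1.0 + 1.0 / (1.0 - a));
}

double lp_fail_probes(double a) {
    double d = 1.0 - a;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

The quadratic denominator in the unsuccessful case is why the right-hand columns explode: at α = 0.95 a miss costs roughly 200 probes while a hit still costs about 10.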
To reduce primary clustering, probe at quadratically increasing intervals:
Insert(k): Try bucket h(k), h(k)+1², h(k)+2², h(k)+3², ...
This spreads probes more widely but introduces secondary clustering (keys with the same hash value follow the same probe sequence).
Use a second hash function to determine probe intervals:
Insert(k): Try bucket h₁(k), h₁(k)+h₂(k), h₁(k)+2×h₂(k), ...
This eliminates both primary and secondary clustering but requires computing two hash functions.
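The double-hashing probe sequence can be sketched as follows, assuming a power-of-two bucket count and toy hash functions chosen so that h₂ is always odd (which makes gcd(h₂, N) = 1, so the sequence visits every bucket):

```c
#include <assert.h>

#define NUM_BUCKETS 16   /* power of two, for this sketch */

/* Toy hash pair for illustration. h2 must be nonzero and, with a
 * power-of-two table, must be odd so the probe sequence covers all buckets. */
static unsigned h1(unsigned key) { return key % NUM_BUCKETS; }
static unsigned h2(unsigned key) { return ((key / NUM_BUCKETS) * 2 + 1) % NUM_BUCKETS; }

/* i-th bucket probed for `key` under double hashing:
 * h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... (mod N) */
unsigned dh_probe(unsigned key, unsigned i) {
    return (h1(key) + i * h2(key)) % NUM_BUCKETS;
}
```

Because the step size h₂(k) differs between keys that share the same h₁(k), two colliding keys follow different probe sequences, which is what eliminates secondary clustering.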
Despite open addressing's advantages in memory, chaining is strongly preferred for disk-based indexes: each probe can land on a different page, turning one lookup into several random I/Os; tombstones must be managed across pages; and the hard capacity ceiling forces a rebuild the moment the primary area fills. Chaining confines each key to one known chain and degrades gracefully instead.
Open addressing is used in hash tables within database engines (e.g., hash join intermediates, query plan caches) where data is in-memory. For persistent hash indexes, chaining overwhelmingly dominates. Some specialized systems (LMDB, certain key-value stores) use hybrid approaches.
Managing overflow pages efficiently is critical for maintaining hash index performance over time. Poor overflow management leads to fragmentation, wasted space, and degraded I/O patterns.
Strategy 1: Global Free List
Maintain a linked list of free pages that any bucket can claim:
```c
/* Global free list for overflow page management */
typedef struct OverflowManager {
    uint32_t free_list_head;   /* Page ID of first free page */
    uint32_t total_overflow;   /* Count of allocated overflow pages */
    uint32_t free_count;       /* Count of free pages in list */
    pthread_mutex_t lock;      /* Protect concurrent access */
} OverflowManager;

/* Allocate an overflow page */
uint32_t alloc_overflow_page(HashIndex *index) {
    OverflowManager *mgr = &index->overflow_mgr;
    pthread_mutex_lock(&mgr->lock);

    uint32_t page_id;
    if (mgr->free_list_head != NULL_PAGE) {
        /* Reuse a free page */
        page_id = mgr->free_list_head;
        BucketPage *page = read_page(index, page_id);
        mgr->free_list_head = page->overflow_page_id;  /* Next in list */
        mgr->free_count--;
    } else {
        /* Extend file to create new page */
        page_id = extend_index_file(index);
        mgr->total_overflow++;
    }

    pthread_mutex_unlock(&mgr->lock);
    return page_id;
}

/* Free an overflow page */
void free_overflow_page(HashIndex *index, uint32_t page_id) {
    OverflowManager *mgr = &index->overflow_mgr;
    pthread_mutex_lock(&mgr->lock);

    /* Add to head of free list */
    BucketPage *page = read_page(index, page_id);
    page->overflow_page_id = mgr->free_list_head;
    page->entry_count = 0;  /* Mark as free */
    mark_dirty(page);

    mgr->free_list_head = page_id;
    mgr->free_count++;

    pthread_mutex_unlock(&mgr->lock);
}
```

Strategy 2: Dedicated Overflow Region
Allocate overflow pages from a separate region of the index file:
[Primary Bucket Region (pages 0 to N-1)] [Overflow Region (pages N onwards)]
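A sketch of the page-ID arithmetic this layout implies (struct and function names are illustrative): primary bucket b maps to page b, and overflow pages are bump-allocated from page N upward, so a page's role is recoverable from its ID alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Primary bucket b lives at page b (0..N-1); overflow pages start at N. */
typedef struct {
    uint32_t num_buckets;    /* N: size of the primary region */
    uint32_t next_overflow;  /* next unallocated overflow page (starts at N) */
} RegionLayout;

static bool is_primary(const RegionLayout *r, uint32_t page_id) {
    return page_id < r->num_buckets;
}

static uint32_t primary_page(uint32_t bucket) {
    return bucket;  /* identity mapping: bucket b lives at page b */
}

/* Bump allocation only; a real system would also reclaim freed pages */
static uint32_t alloc_from_region(RegionLayout *r) {
    return r->next_overflow++;
}
```

The appeal of this layout is that the primary region stays physically contiguous, so full-bucket scans over primaries remain sequential I/O even as overflow accumulates.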
Strategy 3: Per-Bucket Overflow Areas
Pre-allocate a small overflow area adjacent to each bucket:
Bucket 0: [Primary Page] [Overflow 0a]
Bucket 1: [Primary Page] [Overflow 1a]
...
Most production systems use LAZY allocation—overflow pages are created only when needed. EAGER pre-allocation wastes space but can reduce insertion latency spikes. Choose based on workload: high-frequency insertions may benefit from pre-allocated overflow pools.
Overflow directly impacts the I/O cost of hash index operations. Understanding this impact helps in capacity planning and performance tuning.
Let L be the average chain length (including primary page):
Point Lookup Cost: a successful lookup reads (L+1)/2 pages on average, since the key is found halfway down the chain; an unsuccessful lookup must read all L pages before concluding the key is absent.
Insertion Cost: with a uniqueness check, the entire chain (L reads) must be scanned before the entry is written, for roughly L+1 I/Os; allocating a new overflow page adds further writes.
Deletion Cost: the entry must be located within the chain and its page written back.
| Operation | L=1 (no overflow) | L=2 | L=3 | L=5 |
|---|---|---|---|---|
| Point lookup (success) | 1.0 | 1.5 | 2.0 | 3.0 |
| Point lookup (failure) | 1.0 | 2.0 | 3.0 | 5.0 |
| Insert (unique check) | 2.0 | 3.0 | 4.0 | 6.0 |
| Delete (entry exists) | 1.5 | 2.5 | 3.5 | 5.5 |
| Full bucket scan | 1.0 | 2.0 | 3.0 | 5.0 |
The relationship between load factor and expected chain length follows this approximate formula:
E[L] ≈ 1 + λ/c for low to moderate overflow
Where λ is load factor and c is bucket capacity.
At higher overflow rates, the relationship becomes more complex, but the key insight is: doubling the load factor more than doubles lookup cost due to non-linear overflow growth.
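The cost model above can be written down directly. A sketch of the formulas only (function names are illustrative): a hit scans half the chain on average, a miss scans all of it, and an insert with a uniqueness check scans the chain and then writes one page.

```c
#include <assert.h>

/* Expected page I/Os as a function of average chain length L
 * (primary page included), matching the cost table above. */
double lookup_hit_cost(double L)  { return (L + 1.0) / 2.0; }  /* found halfway */
double lookup_miss_cost(double L) { return L; }                /* scan whole chain */
double insert_cost(double L)      { return L + 1.0; }          /* scan + 1 write */

/* E[L] ~ 1 + lambda/c for low-to-moderate overflow */
double est_chain_length(double lambda, double capacity) {
    return 1.0 + lambda / capacity;
}
```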
When overflow becomes significant, hash index advantages erode:
| Scenario | Hash Index I/Os | B+-Tree I/Os (height=4) | Winner |
|---|---|---|---|
| No overflow | 1 | 4 | Hash |
| L=2 (light overflow) | 1.5 | 4 | Hash |
| L=3 (moderate overflow) | 2 | 4 | Hash |
| L=4 (heavy overflow) | 2.5 | 4 | Similar |
| L=5+ (severe overflow) | 3+ | 4 | B+-Tree |
When average chain length exceeds B+-Tree height (typically 3-4 for large tables), the hash index loses its primary advantage. At this point, the index should be rebuilt with more buckets or replaced with a B+-Tree. Monitor chain length as a critical health metric.
Production database systems must actively monitor overflow conditions to maintain performance and trigger maintenance actions when thresholds are exceeded.
```c
/* Hash index overflow statistics */
typedef struct OverflowStats {
    uint32_t num_buckets;            /* Total primary buckets */
    uint32_t buckets_with_overflow;  /* Buckets that have overflow pages */
    uint32_t total_overflow_pages;   /* Total overflow pages allocated */
    uint32_t max_chain_length;       /* Longest chain encountered */
    double   avg_chain_length;       /* Average chain length */
    uint32_t chain_histogram[16];    /* Distribution: count of chains by length */
} OverflowStats;

/* Collect overflow statistics (expensive, run periodically) */
OverflowStats compute_overflow_stats(HashIndex *index) {
    OverflowStats stats = {0};
    stats.num_buckets = index->metadata.bucket_count;
    uint64_t total_chain_length = 0;

    for (uint32_t b = 0; b < stats.num_buckets; b++) {
        uint32_t chain_len = 1;
        BucketPage *page = read_bucket(index, b);

        while (has_overflow(page)) {
            chain_len++;
            stats.total_overflow_pages++;
            page = read_page(index, page->overflow_page_id);
        }

        if (chain_len > 1) {
            stats.buckets_with_overflow++;
        }
        total_chain_length += chain_len;
        stats.max_chain_length = MAX(stats.max_chain_length, chain_len);

        /* Update histogram (clamp to 15+) */
        uint32_t hist_idx = MIN(chain_len - 1, 15);
        stats.chain_histogram[hist_idx]++;
    }

    stats.avg_chain_length = (double)total_chain_length / stats.num_buckets;
    return stats;
}

/* Example alert logic */
void check_overflow_health(const OverflowStats *stats) {
    double overflow_pct = 100.0 * stats->buckets_with_overflow / stats->num_buckets;

    if (overflow_pct > 15.0) {
        log_alert("CRITICAL: %.1f%% of buckets have overflow - rebuild needed",
                  overflow_pct);
    } else if (overflow_pct > 10.0) {
        log_warning("WARNING: %.1f%% overflow - approaching rebuild threshold",
                    overflow_pct);
    }

    if (stats->max_chain_length > 10) {
        log_alert("CRITICAL: Max chain length %u - severe clustering detected",
                  stats->max_chain_length);
    }

    if (stats->avg_chain_length > 2.0) {
        log_warning("WARNING: Avg chain length %.2f - performance degrading",
                    stats->avg_chain_length);
    }
}
```

For very large hash indexes, computing exact statistics is expensive. Sample a random subset of buckets (e.g., 1%) and extrapolate. Ensure the sample is truly random—sequential buckets may share clustering patterns.
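The sampling approach reduces to a pure extrapolation step, sketched below (struct and field names are illustrative); the chain length of each sampled bucket would be gathered by walking its chain exactly as the full statistics pass does.

```c
#include <assert.h>
#include <stdint.h>

/* Extrapolate index-wide overflow statistics from a random bucket sample
 * by scaling sample counts up to the full bucket count. */
typedef struct {
    uint32_t sampled;            /* buckets examined */
    uint32_t sampled_overflow;   /* of those, buckets that had overflow */
    uint64_t sampled_chain_sum;  /* sum of chain lengths in the sample */
} SampleStats;

double est_overflow_pct(const SampleStats *s) {
    return 100.0 * s->sampled_overflow / s->sampled;
}

double est_avg_chain(const SampleStats *s) {
    return (double)s->sampled_chain_sum / s->sampled;
}

uint32_t est_overflow_buckets(const SampleStats *s, uint32_t num_buckets) {
    /* round to nearest when scaling up */
    return (uint32_t)((double)s->sampled_overflow / s->sampled * num_buckets + 0.5);
}
```

One caveat the extrapolation cannot fix: the sample estimates averages well, but the maximum chain length is dominated by rare outliers, so hot-key detection still needs a full scan or targeted instrumentation.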
When overflow becomes problematic, several strategies can reduce it—ranging from quick mitigations to complete index rebuilds.
The most effective solution: create a new index with more buckets and rehash all entries.
New bucket count formula:
N_new = current_records / (target_fill_factor × bucket_capacity)
A typical approach doubles the bucket count or targets a specific fill factor (e.g., 60% for read-heavy workloads).
Rebuild process:
1. Create a new bucket region with the computed number of primary pages.
2. Scan every page of the old index, primary and overflow alike, and rehash each entry into the new structure.
3. Atomically switch the catalog to point at the new index, then reclaim the old file.
Considerations: the rebuild needs roughly double the index's disk space while both copies exist; writes must be blocked, or logged and replayed, during the scan; and the operation should be scheduled in a low-traffic window.
Without changing bucket count, consolidate entries to reduce wasted space: compact each chain by packing its entries into the fewest pages that will hold them, then return the emptied overflow pages to the free list. This shortens chains without a full rebuild, though it cannot help when pages are genuinely full.
If overflow is caused by poor hash distribution (clustering), changing the hash function or seed may help: rehashing with a new seed redistributes the keys that collided under the old function. This still requires rewriting every entry, so it is only worthwhile when diagnostics show clustering rather than genuine over-capacity.
If data volume is unpredictable, static hashing may be fundamentally unsuitable: consider a dynamic scheme such as extendible or linear hashing, which grows the bucket count incrementally, or fall back to a B+-Tree, which handles growth natively.
The best overflow strategy is prevention: size indexes with growth headroom, monitor continuously, and schedule proactive rebuilds before performance degrades. Reactive rebuilds during production issues are stressful and error-prone.
We have examined the critical challenge of overflow handling in static hashing—how to manage entries when buckets fill beyond their capacity and how to maintain performance as overflow accumulates.
What's Next:
With overflow handling understood, we now examine the specific case of chaining—the dominant overflow strategy in database systems. We'll explore implementation details, optimization techniques, and production considerations.
You now understand why overflow occurs in static hashing, the strategies to handle it, and how to monitor and mitigate overflow in production systems. This knowledge is essential for sizing hash indexes correctly and maintaining their performance over time.