When designing a hash-based index structure, one of the most consequential decisions occurs before any data is inserted: How many buckets should we allocate?
This seemingly simple question encapsulates a fundamental tension in database systems engineering. Allocate too few buckets, and you create overflow chains that degrade O(1) lookups to O(n) scans. Allocate too many, and you waste precious memory and disk space on empty buckets that may never hold data. Static hashing confronts this challenge by requiring the database designer to make this decision upfront, at index creation time, and live with the consequences.
Understanding fixed bucket count architecture is essential because it reveals the core constraints that motivated decades of research into dynamic hashing schemes. Before we can appreciate solutions like extendible hashing or linear hashing, we must deeply understand the problem they solve—and that problem begins here, with the fixed bucket.
By the end of this page, you will understand: (1) Why static hashing requires a fixed number of buckets at creation time, (2) How bucket count is determined mathematically, (3) The space-time tradeoffs involved in bucket sizing, (4) How fixed bucket architectures interact with disk I/O patterns, and (5) The fundamental limitations that drive the need for dynamic hashing alternatives.
Before we can discuss bucket counts, we need a precise understanding of what a bucket actually is in the context of database hash indexing. A bucket is the fundamental unit of storage in a hash-based index structure. It serves as a container that holds one or more index entries, where each entry typically consists of a search key and a pointer to the corresponding data record (or the record itself, in some implementations).
In disk-based database systems, a bucket is typically mapped to one or more disk pages (also called blocks). A disk page is the smallest unit of I/O transfer between disk and memory, commonly ranging from 4KB to 16KB depending on the system configuration. When the buffer manager requests a bucket, it must read at least one full disk page—even if the bucket contains only a single entry.
| System | Page Size | Bucket Capacity (Est.) | Key-Pointer Size |
|---|---|---|---|
| PostgreSQL | 8 KB | ~200-400 entries | ~20-40 bytes/entry |
| MySQL InnoDB | 16 KB | ~400-800 entries | ~20-40 bytes/entry |
| Oracle | 8 KB (default) | ~200-400 entries | ~20-40 bytes/entry |
| SQL Server | 8 KB | ~200-400 entries | ~20-40 bytes/entry |
| Berkeley DB | 4-64 KB (configurable) | Varies | Application-defined |
While buckets are often implemented as single disk pages for efficiency, the logical bucket and physical page are conceptually separate. A bucket is a logical construct defined by the hash function—it's the destination for a particular range of hash values. A page is a physical storage unit defined by the I/O subsystem.
In static hashing, the simplest configuration maps each bucket to exactly one page. This one-to-one mapping provides predictable I/O behavior: finding any record requires exactly one disk access (assuming the bucket doesn't overflow). However, this rigidity also creates the first constraint—the number of buckets is bounded by available storage space, since each bucket consumes at least one page.
A well-designed bucket page contains several components:
```c
/* Typical bucket page structure in a database system */
typedef struct BucketPage {
    /* Page header - typically 24-32 bytes */
    uint32_t page_id;            /* Unique identifier for this page */
    uint32_t bucket_id;          /* Logical bucket number */
    uint16_t entry_count;        /* Number of entries currently stored */
    uint16_t free_space_offset;  /* Offset where free space begins */
    uint32_t overflow_page_id;   /* Pointer to overflow page, or NULL_PAGE */
    uint32_t checksum;           /* CRC32 for corruption detection */

    /* Slot directory grows downward from header */
    uint16_t slots[MAX_SLOTS];   /* Offsets to each entry */

    /* Entry data grows upward from end of page */
    /* Structure: [key_len][key_data][pointer] for variable-length keys */
    /* Or simply: [fixed_key][pointer] for fixed-length keys */
    char data[PAGE_SIZE - HEADER_SIZE];
} BucketPage;

/* Example: 8KB page with 32-byte header and 20-byte entries */
/* Theoretical capacity: (8192 - 32) / (20 + 2) ≈ 370 entries */
/* Practical capacity (80% fill): ~296 entries */
```

Production systems rarely fill buckets to 100% capacity. A typical fill factor of 70-80% is maintained to leave room for insertions without immediate overflow. This means a bucket that could theoretically hold 400 entries might be considered 'full' at 280-320 entries.
In static hashing, the total number of buckets N is determined at index creation time and remains constant throughout the index's lifetime (short of dropping and recreating the index). This fixed allocation is the defining characteristic that distinguishes static hashing from its dynamic counterparts.
The fixed bucket count simplifies many aspects of the hash index implementation: every bucket's location can be computed directly from its ID, the index file has a known, fixed size, no directory or bucket-splitting machinery is needed, and the persistent metadata stays minimal.
Determining the optimal bucket count requires balancing several factors. The fundamental sizing equation is:
N = ⌈r / (f × c)⌉
Where:

- N = the number of buckets to allocate
- r = the expected number of records
- f = the target fill factor (typically 0.7-0.8)
- c = the capacity of each bucket, in entries
This formula seeks to distribute records across buckets such that each bucket reaches approximately the target fill factor without overflowing.
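As a minimal sketch, the calculation looks like this in C (the names and types are illustrative, not from any particular system):

```c
#include <stdint.h>

/* N = ceil(r / (f * c)): buckets needed to hold the expected records
 * at the target fill factor. */
uint32_t compute_bucket_count(uint64_t expected_records,   /* r */
                              double   fill_factor,        /* f, e.g. 0.75 */
                              uint32_t entries_per_bucket) /* c */
{
    double effective_capacity = fill_factor * (double)entries_per_bucket;
    /* Integer ceiling; allocate at least one bucket. */
    uint64_t n = (uint64_t)((double)expected_records / effective_capacity);
    if ((double)n * effective_capacity < (double)expected_records)
        n += 1;
    return n < 1 ? 1 : (uint32_t)n;
}

/* Example: r = 1,000,000 records, f = 0.75, c = 370 entries/page
 * -> N = ceil(1,000,000 / 277.5) = 3,604 buckets. */
```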
The bucket count formula requires knowing 'r' — the expected number of records. But what if your estimates are wrong? Underestimate and you'll have rampant overflow. Overestimate and you waste storage on empty buckets. This uncertainty is the Achilles' heel of static hashing.
Even with a perfect hash function that distributes keys uniformly at random, some buckets will receive more entries than others due to the probabilistic nature of hashing. Understanding this distribution is crucial for capacity planning and performance prediction.
Hash bucket distribution follows the same mathematics as the classic balls-into-bins problem (a close relative of the famous birthday problem). When hashing n records into N buckets, we're effectively asking: how many buckets will end up with k records?
Assuming uniform random hashing, the number of records in each bucket follows a Binomial distribution B(n, 1/N), which for large N approximates a Poisson distribution with parameter λ = n/N (the expected number of records per bucket).
Let λ = n/N be the average load factor (records per bucket). The probability that a specific bucket contains exactly k records is:
P(X = k) = (λ^k × e^(-λ)) / k!
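The following minimal C sketch evaluates this pmf iteratively (avoiding overflow in λ^k and k!) and reproduces the expected-bucket counts shown in the table below, assuming λ = 1 and 10,000 buckets:

```c
#include <math.h>
#include <stdio.h>

/* P(X = k) for Poisson(lambda), built up term by term:
 * P(X = i) = P(X = i-1) * lambda / i, starting from P(X = 0) = e^-lambda. */
double poisson_pmf(double lambda, int k) {
    double p = exp(-lambda);
    for (int i = 1; i <= k; i++)
        p *= lambda / i;
    return p;
}

int main(void) {
    double lambda  = 1.0;     /* n = N: one record per bucket on average */
    int    buckets = 10000;
    double tail = 1.0;
    for (int k = 0; k <= 4; k++) {
        double p = poisson_pmf(lambda, k);
        tail -= p;
        printf("k=%d   P=%.4f  expected buckets=%.0f\n", k, p, p * buckets);
    }
    printf("k>=5  P=%.4f  expected buckets=%.0f\n", tail, tail * buckets);
    return 0;
}
```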
| Records in Bucket (k) | Probability P(X=k) at λ = 1 | Expected Buckets (of 10,000) |
|---|---|---|
| 0 (empty) | 36.79% | 3,679 |
| 1 | 36.79% | 3,679 |
| 2 | 18.39% | 1,839 |
| 3 | 6.13% | 613 |
| 4 | 1.53% | 153 |
| 5+ | 0.37% | 37 |
Critical Insight: Even when n = N (one record per bucket on average), over 36% of buckets remain empty while some buckets hold 4+ records. This uneven distribution is unavoidable with random hashing.
As the load factor λ increases, the distribution shifts dramatically:
| Load Factor (λ) | Empty Buckets | Buckets with 2+ Records | Expected Maximum Load |
|---|---|---|---|
| 0.5 | 60.65% | 9.02% | ~5 records |
| 1.0 | 36.79% | 26.42% | ~7 records |
| 2.0 | 13.53% | 59.40% | ~10 records |
| 3.0 | 4.98% | 80.09% | ~12 records |
| 4.0 | 1.83% | 90.84% | ~14 records |
The commonly recommended 70-80% fill factor for hash indexes isn't arbitrary—it's derived from this Poisson analysis. At λ ≈ 0.7-0.8, overflow is rare (<5% of buckets significantly overloaded), empty buckets aren't excessive (~50%), and space utilization is acceptable.
A classic result from probability theory states that when throwing n balls uniformly at random into n bins, the maximum load in any bin is:
Max Load ≈ O(log n / log log n) with high probability
For practical database sizes, this bound stays in the single digits: the ratio evaluates to roughly 4 for n = 10^4, 5 for n = 10^6, and 7 for n = 10^9 (the constant hidden by the O(·) pushes observed maxima somewhat higher; a quick evaluation follows). This means even with perfect hash functions, some buckets will have significantly more records than others, necessitating overflow handling strategies.
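A tiny C sketch evaluating the raw ratio for these sizes (the figures above round these values):

```c
#include <math.h>
#include <stdio.h>

/* Rough scale of the ln(n)/ln(ln(n)) bound. The constant hidden by
 * the O(...) means real observed maxima run a bit higher. */
int main(void) {
    double sizes[] = { 1e4, 1e6, 1e9 };
    for (int i = 0; i < 3; i++) {
        double n = sizes[i];
        printf("n = %.0e: ln(n)/ln(ln(n)) = %.1f\n",
               n, log(n) / log(log(n)));
    }
    return 0;
}
/* n = 1e+04: 4.1    n = 1e+06: 5.3    n = 1e+09: 6.8 */
```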
The fixed bucket count directly impacts how the hash index interacts with the storage subsystem. Understanding these implications is essential for production database tuning.
With a fixed number of buckets, the database can allocate storage in several ways: the entire bucket file can be preallocated as one contiguous region at creation time, or individual pages can be allocated lazily on first write. Contiguous preallocation is the natural fit for static hashing, since it makes every bucket's file offset a direct function of its ID and avoids allocation stalls on the insert path; a sketch follows.
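As a concrete illustration, here is a minimal sketch of contiguous preallocation using standard POSIX calls; HEADER_PAGE_SIZE, the permissions, and the error handling are illustrative choices:

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define HEADER_PAGE_SIZE 8192   /* metadata page precedes the buckets */

/* Create and fully reserve the bucket file at index creation time. */
int create_bucket_file(const char *path, uint32_t bucket_count,
                       uint32_t page_size)
{
    int fd = open(path, O_CREAT | O_RDWR | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    off_t total = (off_t)HEADER_PAGE_SIZE +
                  (off_t)bucket_count * page_size;
    /* posix_fallocate guarantees the blocks are really reserved,
     * unlike ftruncate, which may leave a sparse file. */
    if (posix_fallocate(fd, 0, total) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```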
For a static hash index with N buckets, bucket capacity c, and n records:
Point Lookup (without overflow): exactly 1 I/O. Hash the key, compute the bucket's page offset, and read that single page.
Point Lookup (with overflow chain of length L): up to 1 + L I/Os. The primary page and, in the worst case, every overflow page must be read before the key is found or declared absent.
Full Index Scan: N I/Os for the primary bucket pages, plus one additional I/O per overflow page.
| Operation | Static Hash (no overflow) | Static Hash (with overflow) | B+-Tree (height h) |
|---|---|---|---|
| Point lookup | 1 I/O | 1 + L I/Os | h I/Os (typically 3-4) |
| Range scan (k records) | k I/Os (scattered) | k × (1 + avg L) I/Os | h + ⌈k/c⌉ I/Os (sequential) |
| Insert (no split/overflow) | 2 I/Os (read + write) | 2 + 2L I/Os | h + 1 I/Os minimum |
| Full scan | N I/Os | N + overflow-page I/Os | Sequential leaf-page reads |
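The "1 + L" penalty only applies when a bucket actually spills. Under the Poisson model, the probability that a bucket exceeds its physical capacity can be computed directly. A minimal sketch, assuming the ~370-entry page capacity from the earlier 8KB example:

```c
#include <math.h>
#include <stdio.h>

/* Poisson pmf in log space, which stays numerically stable even for
 * large lambda (where e^-lambda alone would underflow). */
double poisson_pmf(double lambda, int k) {
    return exp(k * log(lambda) - lambda - lgamma(k + 1.0));
}

/* P(X > c): probability a bucket spills into overflow pages. */
double prob_overflow(double lambda, int c) {
    double cdf = 0.0;
    for (int k = 0; k <= c; k++)
        cdf += poisson_pmf(lambda, k);
    return 1.0 - cdf;
}

int main(void) {
    int c = 370;   /* theoretical entries per 8KB page, from earlier */
    printf("80%% fill (lambda = 296):  P(overflow) = %.6f\n",
           prob_overflow(296.0, c));
    printf("100%% fill (lambda = 370): P(overflow) = %.6f\n",
           prob_overflow(370.0, c));
    return 0;
}
```

At 80% fill the overflow probability is vanishingly small, while at 100% fill roughly half the buckets spill, which is the quantitative reason behind the fill-factor recommendation above.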
When properly sized with minimal overflow, static hashing achieves what no tree structure can: guaranteed single-I/O lookups. This makes it ideal for high-frequency point queries where every millisecond counts. However, this advantage evaporates quickly with overflow chains.
The fixed bucket architecture is particularly amenable to memory-mapped I/O (mmap). Since the bucket file has a known, fixed size:
bucket_addr = base + (bucket_id × page_size)

Every bucket's address within the mapping is computable in constant time, with no buffer-pool lookup or indirection on the read path. Memory-mapped embedded storage engines such as LMDB rely on exactly this kind of fixed-layout address arithmetic.
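A minimal sketch of this approach using POSIX mmap; reading page_size and bucket count from the header page is omitted, and the struct layout is illustrative:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HEADER_PAGE_SIZE 8192   /* metadata page precedes the buckets */

typedef struct {
    uint8_t *base;       /* start of the mapping */
    size_t   length;     /* total mapped bytes */
    uint32_t page_size;  /* in practice, read from the header page */
} MappedIndex;

/* Map the fixed-size bucket file once; the mapping never needs to
 * grow because N is fixed. */
int map_index(MappedIndex *idx, const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    idx->length = (size_t)st.st_size;
    idx->base = mmap(NULL, idx->length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);   /* the mapping remains valid after close */
    return idx->base == MAP_FAILED ? -1 : 0;
}

/* bucket_addr = base + header + bucket_id * page_size: no buffer-pool
 * lookup, no system call, just arithmetic plus a possible page fault. */
static inline void *bucket_addr(const MappedIndex *idx, uint32_t id) {
    return idx->base + HEADER_PAGE_SIZE + (size_t)id * idx->page_size;
}
```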
The optimal bucket count depends not just on data volume but on workload patterns. Different access patterns favor different sizing strategies.
For workloads dominated by point lookups (e.g., session caches, user profile access), minimize overflow at the cost of space: size N generously, targeting a fill factor at the low end (60% or below), so that virtually every lookup keeps the single-I/O guarantee.
For workloads with frequent insertions (e.g., log tables, event streams), balance overflow tolerance with space: size for the record count you expect after growth rather than at creation, and accept short overflow chains late in the table's life as the cheaper alternative to repeated rebuilds.
For balanced read/write patterns (e.g., OLTP user tables), the conventional 70-80% fill factor is the usual target: as the Poisson analysis showed, it keeps overflow rare without leaving most of the file empty. The sketch below folds these guidelines into the earlier sizing formula.
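A hedged sketch of that translation; the fill-factor cutoffs are illustrative defaults rather than hard rules, and compute_bucket_count is the helper sketched earlier:

```c
typedef enum {
    WORKLOAD_READ_HEAVY,
    WORKLOAD_INSERT_HEAVY,
    WORKLOAD_MIXED
} Workload;

/* Illustrative targets: lower fill buys fewer overflow chains (reads)
 * or more insertion headroom (writes); 0.75 is the conventional middle. */
static double target_fill_factor(Workload w) {
    switch (w) {
    case WORKLOAD_READ_HEAVY:   return 0.60;
    case WORKLOAD_INSERT_HEAVY: return 0.65;
    default:                    return 0.75;
    }
}

/* Usage:
 *   uint32_t n = compute_bucket_count(expected_records,
 *                                     target_fill_factor(WORKLOAD_MIXED),
 *                                     entries_per_bucket);
 */
```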
A common recommendation is to use a prime number for N (the bucket count). Prime numbers reduce clustering when hash function outputs have periodic patterns. For example, if many keys share a common factor with N, they crowd into only the buckets whose numbers are multiples of that factor. A prime N has no divisors other than 1 and itself, so no such pattern can emerge, as the demonstration below shows.
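A self-contained demonstration: keys that are all multiples of 8 (e.g., 8-byte-aligned addresses used directly as hash values) cluster under a composite bucket count but spread evenly under a prime one:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int hit64[64] = {0}, hit61[61] = {0};
    int used64 = 0, used61 = 0;

    for (uint32_t key = 0; key < 8 * 1000; key += 8) {
        hit64[key % 64]++;   /* gcd(8, 64) = 8: only 1/8 of buckets used */
        hit61[key % 61]++;   /* gcd(8, 61) = 1: every bucket reachable  */
    }
    for (int i = 0; i < 64; i++) used64 += hit64[i] > 0;
    for (int i = 0; i < 61; i++) used61 += hit61[i] > 0;

    printf("N = 64: %d of 64 buckets used\n", used64);   /* prints 8  */
    printf("N = 61: %d of 61 buckets used\n", used61);   /* prints 61 */
    return 0;
}
```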
Despite its simplicity and performance advantages, the fixed bucket count architecture has fundamental limitations that restrict its applicability in modern database systems.
The core limitation is straightforward: static hashing cannot adapt to changing data volumes.
Consider a table that grows from 100,000 records to 10 million over its lifetime. A hash index sized for 100,000 records would have: 100× more records per bucket than it was designed for, overflow chains running to dozens of pages behind every primary bucket, and point lookups that cost tens of I/Os instead of one.
Conversely, sizing for 10 million records when only 100,000 exist means: buckets that are on average about 1% full, an index file roughly 100× larger than the data requires, and full scans and backups that read mostly empty pages.
When a static hash index becomes pathological (too much overflow or too much wasted space), the only remedy is complete reconstruction: create a new index with a recomputed bucket count, rehash and reinsert every record, then swap the new index in and drop the old one (see the sketch after the next paragraph).
This operation is O(n) in the number of records, temporarily doubles index storage while both copies exist, and disrupts concurrent writes, which must be blocked or carefully coordinated for the duration.
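As a rough illustration of the rebuild procedure, here is a hedged sketch; every API in it (hash_index_create, table_scan_begin/next/end, hash_index_insert, catalog_swap_index, hash_index_drop) is a hypothetical name rather than a real library call, and compute_bucket_count is the sizing helper sketched earlier:

```c
/* Hedged sketch of a full index rebuild. All APIs are hypothetical. */
int rebuild_hash_index(Catalog *cat, Table *tbl, HashIndex *old_idx,
                       uint64_t new_expected_records)
{
    /* 1. Size a fresh index for the *current* data volume. */
    uint32_t new_n = compute_bucket_count(new_expected_records, 0.75,
                                          370 /* entries/page, as earlier */);
    HashIndex *new_idx = hash_index_create(cat, tbl, new_n);
    if (new_idx == NULL)
        return -1;

    /* 2. O(n): rehash and reinsert every record. Storage is doubled
     *    from here until the old index is dropped. */
    TableScan *scan = table_scan_begin(tbl);
    Record rec;
    while (table_scan_next(scan, &rec))
        hash_index_insert(new_idx, rec.key, rec.key_len, rec.rid);
    table_scan_end(scan);

    /* 3. Atomically swap the catalog entry, then reclaim the old file.
     *    Writes must be blocked (or logged and replayed) during 1-3. */
    catalog_swap_index(cat, old_idx, new_idx);
    hash_index_drop(old_idx);
    return 0;
}
```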
In contrast, dynamic hashing schemes (extendible hashing, linear hashing) incrementally adapt without full reconstruction.
Despite these limitations, static hashing remains valuable for: (1) Tables with predictable, stable sizes, (2) High-frequency point lookup workloads where single-I/O guarantee matters, (3) Embedded systems with constrained resources where dynamic restructuring is prohibitive, (4) Hash joins on intermediate results with known cardinality.
Implementing a fixed-bucket hash index requires attention to several practical details that affect correctness, performance, and maintainability.
The hash index requires persistent metadata to survive system restarts:
```c
/* Hash index metadata stored in catalog or header page */
typedef struct HashIndexMetadata {
    uint32_t magic;             /* Magic number for validation */
    uint32_t version;           /* Format version for compatibility */
    uint32_t bucket_count;      /* N - total number of buckets */
    uint32_t page_size;         /* Size of each bucket page */
    uint32_t key_size;          /* Fixed key size, or 0 for variable */
    uint32_t entry_size;        /* Size of each index entry */
    uint32_t fill_factor;       /* Target fill factor × 100 */
    uint64_t record_count;      /* Current number of entries */
    uint64_t overflow_pages;    /* Number of overflow pages allocated */
    uint32_t hash_function_id;  /* Identifier for hash function used */
    uint32_t seed;              /* Seed for hash function (if applicable) */
    uint64_t created_timestamp; /* Index creation time */
    uint64_t last_rebuild;      /* Last rebuild timestamp */
    char     index_name[64];    /* Human-readable name */
} HashIndexMetadata;
```
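When the index is opened, this metadata should be validated before any bucket arithmetic trusts it. A minimal sketch, assuming a hypothetical magic constant and format version:

```c
#define HASH_INDEX_MAGIC       0x48534849u  /* arbitrary illustrative value */
#define CURRENT_FORMAT_VERSION 1

/* Validate persisted metadata at open time; returns 0 on success. */
int validate_metadata(const HashIndexMetadata *meta)
{
    if (meta->magic != HASH_INDEX_MAGIC)
        return -1;   /* not a hash index file, or a corrupted header */
    if (meta->version > CURRENT_FORMAT_VERSION)
        return -2;   /* written by a newer, incompatible release */
    if (meta->bucket_count == 0 || meta->page_size == 0)
        return -3;   /* torn write or corruption */
    return 0;
}
```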
The core operation—locating the correct bucket for a key—must be highly optimized:

```c
/* Locate bucket for a given key */
static inline uint32_t compute_bucket_id(const HashIndex *index,
                                         const void *key, size_t key_len)
{
    /* Compute hash value using configured hash function */
    uint64_t hash = hash_function(index->hash_fn_id, key, key_len,
                                  index->metadata.seed);

    /* Map to bucket using modulo.
     * Note: If bucket_count is a power of 2, use bitwise AND for speed:
     *     return hash & (index->metadata.bucket_count - 1);
     * Otherwise, use modulo (handles prime bucket counts): */
    return (uint32_t)(hash % index->metadata.bucket_count);
}

/* Read bucket page from storage or buffer pool */
BucketPage* read_bucket(HashIndex *index, uint32_t bucket_id)
{
    /* Calculate page offset in hash file (skip the metadata page) */
    uint64_t page_offset = (uint64_t)bucket_id * index->metadata.page_size
                         + HEADER_PAGE_SIZE;

    /* Request page from buffer manager */
    PageHandle handle = buffer_fetch_page(index->buffer_pool,
                                          index->file_handle, page_offset);
    return (BucketPage*)page_get_data(handle);
}
```

If the bucket count is a power of 2, the modulo operation can be replaced with a bitwise AND: bucket_id = hash & (N - 1). This is significantly faster because it avoids an expensive division, but power-of-two counts can amplify clustering with weak hash functions. Choose based on your hash function's quality.
We have explored the foundational concept of static hashing: the fixed bucket count. This architectural decision shapes every aspect of the hash index's behavior, from I/O patterns to scaling limitations.
What's Next:
With the bucket foundation established, the next page examines hash function design — the algorithms that determine which bucket receives each key. We'll explore distribution requirements, collision resistance, and the tradeoffs between computation cost and hash quality.
You now understand the fixed bucket count architecture that defines static hashing. This foundational knowledge is essential for evaluating when static hashing is appropriate and understanding why dynamic alternatives were developed. Next, we'll dive into the hash functions that make this architecture work.