When designing a hash-based index structure, one of the most consequential decisions occurs before any data is inserted: How many buckets should we allocate?
This seemingly simple question encapsulates a fundamental tension in database systems engineering. Allocate too few buckets, and you create overflow chains that degrade O(1) lookups to O(n) scans. Allocate too many, and you waste precious memory and disk space on empty buckets that may never hold data. Static hashing confronts this challenge by requiring the database designer to make this decision upfront, at index creation time, and live with the consequences.
Understanding fixed bucket count architecture is essential because it reveals the core constraints that motivated decades of research into dynamic hashing schemes. Before we can appreciate solutions like extendible hashing or linear hashing, we must deeply understand the problem they solve—and that problem begins here, with the fixed bucket.
By the end of this page, you will understand: (1) Why static hashing requires a fixed number of buckets at creation time, (2) How bucket count is determined mathematically, (3) The space-time tradeoffs involved in bucket sizing, (4) How fixed bucket architectures interact with disk I/O patterns, and (5) The fundamental limitations that drive the need for dynamic hashing alternatives.
Before we can discuss bucket counts, we need a precise understanding of what a bucket actually is in the context of database hash indexing. A bucket is the fundamental unit of storage in a hash-based index structure. It serves as a container that holds one or more index entries, where each entry typically consists of a search key and a pointer to the corresponding data record (or the record itself, in some implementations).
In disk-based database systems, a bucket is typically mapped to one or more disk pages (also called blocks). A disk page is the smallest unit of I/O transfer between disk and memory, commonly ranging from 4KB to 16KB depending on the system configuration. When the buffer manager requests a bucket, it must read at least one full disk page—even if the bucket contains only a single entry.
| System | Page Size | Bucket Capacity (Est.) | Key-Pointer Size |
|---|---|---|---|
| PostgreSQL | 8 KB | ~200-400 entries | ~20-40 bytes/entry |
| MySQL InnoDB | 16 KB | ~400-800 entries | ~20-40 bytes/entry |
| Oracle | 8 KB (default) | ~200-400 entries | ~20-40 bytes/entry |
| SQL Server | 8 KB | ~200-400 entries | ~20-40 bytes/entry |
| Berkeley DB | 4-64 KB (configurable) | Varies | Application-defined |
While buckets are often implemented as single disk pages for efficiency, the logical bucket and physical page are conceptually separate. A bucket is a logical construct defined by the hash function—it's the destination for a particular range of hash values. A page is a physical storage unit defined by the I/O subsystem.
In static hashing, the simplest configuration maps each bucket to exactly one page. This one-to-one mapping provides predictable I/O behavior: finding any record requires exactly one disk access (assuming the bucket doesn't overflow). However, this rigidity also creates the first constraint—the number of buckets is bounded by available storage space, since each bucket consumes at least one page.
A well-designed bucket page contains several components:
```c
/* Typical bucket page structure in a database system */
typedef struct BucketPage {
    /* Page header - typically 24-32 bytes */
    uint32_t page_id;            /* Unique identifier for this page */
    uint32_t bucket_id;          /* Logical bucket number */
    uint16_t entry_count;        /* Number of entries currently stored */
    uint16_t free_space_offset;  /* Offset where free space begins */
    uint32_t overflow_page_id;   /* Pointer to overflow page, or NULL_PAGE */
    uint32_t checksum;           /* CRC32 for corruption detection */

    /* Slot directory grows downward from header */
    uint16_t slots[MAX_SLOTS];   /* Offsets to each entry */

    /* Entry data grows upward from end of page */
    /* Structure: [key_len][key_data][pointer] for variable-length keys */
    /* Or simply: [fixed_key][pointer] for fixed-length keys */
    char data[PAGE_SIZE - HEADER_SIZE];
} BucketPage;

/* Example: 8KB page with 32-byte header and 20-byte entries */
/* Theoretical capacity: (8192 - 32) / (20 + 2) ≈ 370 entries */
/* Practical capacity (80% fill): ~296 entries */
```

Production systems rarely fill buckets to 100% capacity. A typical fill factor of 70-80% is maintained to leave room for insertions without immediate overflow. This means a bucket that could theoretically hold 400 entries might be considered 'full' at 280-320 entries.
In static hashing, the total number of buckets N is determined at index creation time and remains constant throughout the index's lifetime (short of dropping and recreating the index). This fixed allocation is the defining characteristic that distinguishes static hashing from its dynamic counterparts.
The fixed bucket count simplifies many aspects of the hash index implementation: every bucket's location can be computed directly from its ID, the index file has a known, fixed size, no directory or bucket-splitting machinery is needed, and the persistent metadata stays minimal.
Determining the optimal bucket count requires balancing several factors. The fundamental sizing equation is:
N = ⌈r / (f × c)⌉
Where:

- N = the number of buckets to allocate
- r = the expected number of records
- f = the target fill factor (typically 0.7-0.8)
- c = the capacity of each bucket, in entries
This formula seeks to distribute records across buckets such that each bucket reaches approximately the target fill factor without overflowing.
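As a minimal sketch, the calculation looks like this in C (the names and types are illustrative, not from any particular system):

```c
#include <stdint.h>

/* N = ceil(r / (f * c)): buckets needed to hold the expected records
 * at the target fill factor. */
uint32_t compute_bucket_count(uint64_t expected_records,   /* r */
                              double   fill_factor,        /* f, e.g. 0.75 */
                              uint32_t entries_per_bucket) /* c */
{
    double effective_capacity = fill_factor * (double)entries_per_bucket;
    /* Integer ceiling; allocate at least one bucket. */
    uint64_t n = (uint64_t)((double)expected_records / effective_capacity);
    if ((double)n * effective_capacity < (double)expected_records)
        n += 1;
    return n < 1 ? 1 : (uint32_t)n;
}

/* Example: r = 1,000,000 records, f = 0.75, c = 370 entries/page
 * -> N = ceil(1,000,000 / 277.5) = 3,604 buckets. */
```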
The bucket count formula requires knowing 'r' — the expected number of records. But what if your estimates are wrong? Underestimate and you'll have rampant overflow. Overestimate and you waste storage on empty buckets. This uncertainty is the Achilles' heel of static hashing.
Even with a perfect hash function that distributes keys uniformly at random, some buckets will receive more entries than others due to the probabilistic nature of hashing. Understanding this distribution is crucial for capacity planning and performance prediction.
Hash bucket distribution follows the same mathematics as the classic balls-into-bins problem (a close relative of the famous birthday problem). When hashing n records into N buckets, we're effectively asking: how many buckets will end up with k records?
Assuming uniform random hashing, the number of records in each bucket follows a Binomial distribution B(n, 1/N), which for large N approximates a Poisson distribution with parameter λ = n/N (the expected number of records per bucket).
Let λ = n/N be the average load factor (records per bucket). The probability that a specific bucket contains exactly k records is:
P(X = k) = (λ^k × e^(-λ)) / k!
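The following minimal C sketch evaluates this pmf iteratively (avoiding overflow in λ^k and k!) and reproduces the expected-bucket counts shown in the table below, assuming λ = 1 and 10,000 buckets:

```c
#include <math.h>
#include <stdio.h>

/* P(X = k) for Poisson(lambda), built up term by term:
 * P(X = i) = P(X = i-1) * lambda / i, starting from P(X = 0) = e^-lambda. */
double poisson_pmf(double lambda, int k) {
    double p = exp(-lambda);
    for (int i = 1; i <= k; i++)
        p *= lambda / i;
    return p;
}

int main(void) {
    double lambda  = 1.0;     /* n = N: one record per bucket on average */
    int    buckets = 10000;
    double tail = 1.0;
    for (int k = 0; k <= 4; k++) {
        double p = poisson_pmf(lambda, k);
        tail -= p;
        printf("k=%d   P=%.4f  expected buckets=%.0f\n", k, p, p * buckets);
    }
    printf("k>=5  P=%.4f  expected buckets=%.0f\n", tail, tail * buckets);
    return 0;
}
```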
| Records in Bucket (k) | Probability P(X=k) at λ = 1 | Expected Buckets (of 10,000) |
|---|---|---|
| 0 (empty) | 36.79% | 3,679 |
| 1 | 36.79% | 3,679 |
| 2 | 18.39% | 1,839 |
| 3 | 6.13% | 613 |
| 4 | 1.53% | 153 |
| 5+ | 0.37% | 37 |
Critical Insight: Even when n = N (one record per bucket on average), over 36% of buckets remain empty while some buckets hold 4+ records. This uneven distribution is unavoidable with random hashing.
As the load factor λ increases, the distribution shifts dramatically:
| Load Factor (λ) | Empty Buckets | Buckets with 2+ Records | Expected Maximum Load |
|---|---|---|---|
| 0.5 | 60.65% | 9.02% | ~5 records |
| 1.0 | 36.79% | 26.42% | ~7 records |
| 2.0 | 13.53% | 59.40% | ~10 records |
| 3.0 | 4.98% | 80.09% | ~12 records |
| 4.0 | 1.83% | 90.84% | ~14 records |
The commonly recommended 70-80% fill factor for hash indexes isn't arbitrary—it's derived from this Poisson analysis. At λ ≈ 0.7-0.8, overflow is rare (<5% of buckets significantly overloaded), empty buckets aren't excessive (~50%), and space utilization is acceptable.
A classic result from probability theory states that when throwing n balls uniformly at random into n bins, the maximum load in any bin is:
Max Load ≈ O(log n / log log n) with high probability
For practical database sizes, this bound stays in the single digits: the ratio evaluates to roughly 4 for n = 10^4, 5 for n = 10^6, and 7 for n = 10^9 (the constant hidden by the O(·) pushes observed maxima somewhat higher; a quick evaluation follows). This means even with perfect hash functions, some buckets will have significantly more records than others, necessitating overflow handling strategies.
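A tiny C sketch evaluating the raw ratio for these sizes (the figures above round these values):

```c
#include <math.h>
#include <stdio.h>

/* Rough scale of the ln(n)/ln(ln(n)) bound. The constant hidden by
 * the O(...) means real observed maxima run a bit higher. */
int main(void) {
    double sizes[] = { 1e4, 1e6, 1e9 };
    for (int i = 0; i < 3; i++) {
        double n = sizes[i];
        printf("n = %.0e: ln(n)/ln(ln(n)) = %.1f\n",
               n, log(n) / log(log(n)));
    }
    return 0;
}
/* n = 1e+04: 4.1    n = 1e+06: 5.3    n = 1e+09: 6.8 */
```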
The fixed bucket count directly impacts how the hash index interacts with the storage subsystem. Understanding these implications is essential for production database tuning.
With a fixed number of buckets, the database can allocate storage in several ways: the entire bucket file can be preallocated as one contiguous region at creation time, or individual pages can be allocated lazily on first write. Contiguous preallocation is the natural fit for static hashing, since it makes every bucket's file offset a direct function of its ID and avoids allocation stalls on the insert path; a sketch follows.
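As a concrete illustration, here is a minimal sketch of contiguous preallocation using standard POSIX calls; HEADER_PAGE_SIZE, the permissions, and the error handling are illustrative choices:

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define HEADER_PAGE_SIZE 8192   /* metadata page precedes the buckets */

/* Create and fully reserve the bucket file at index creation time. */
int create_bucket_file(const char *path, uint32_t bucket_count,
                       uint32_t page_size)
{
    int fd = open(path, O_CREAT | O_RDWR | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    off_t total = (off_t)HEADER_PAGE_SIZE +
                  (off_t)bucket_count * page_size;
    /* posix_fallocate guarantees the blocks are really reserved,
     * unlike ftruncate, which may leave a sparse file. */
    if (posix_fallocate(fd, 0, total) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```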
For a static hash index with N buckets, bucket capacity c, and n records:
Point Lookup (without overflow): exactly 1 I/O. Hash the key, compute the bucket's page offset, and read that single page.
Point Lookup (with overflow chain of length L): up to 1 + L I/Os. The primary page and, in the worst case, every overflow page must be read before the key is found or declared absent.
Full Index Scan: N I/Os for the primary bucket pages, plus one additional I/O per overflow page.
| Operation | Static Hash (no overflow) | Static Hash (with overflow) | B+-Tree (height h) |
|---|---|---|---|
| Point lookup | 1 I/O | 1 + L I/Os | h I/Os (typically 3-4) |
| Range scan (k records) | k I/Os (scattered) | k × (1 + avg L) I/Os | h + ⌈k/c⌉ I/Os (sequential) |
| Insert (no split/overflow) | 2 I/Os (read + write) | 2 + 2L I/Os | h + 1 I/Os minimum |
| Full scan | N I/Os | N + overflow-page I/Os | Sequential leaf-page reads |
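The "1 + L" penalty only applies when a bucket actually spills. Under the Poisson model, the probability that a bucket exceeds its physical capacity can be computed directly. A minimal sketch, assuming the ~370-entry page capacity from the earlier 8KB example:

```c
#include <math.h>
#include <stdio.h>

/* Poisson pmf in log space, which stays numerically stable even for
 * large lambda (where e^-lambda alone would underflow). */
double poisson_pmf(double lambda, int k) {
    return exp(k * log(lambda) - lambda - lgamma(k + 1.0));
}

/* P(X > c): probability a bucket spills into overflow pages. */
double prob_overflow(double lambda, int c) {
    double cdf = 0.0;
    for (int k = 0; k <= c; k++)
        cdf += poisson_pmf(lambda, k);
    return 1.0 - cdf;
}

int main(void) {
    int c = 370;   /* theoretical entries per 8KB page, from earlier */
    printf("80%% fill (lambda = 296):  P(overflow) = %.6f\n",
           prob_overflow(296.0, c));
    printf("100%% fill (lambda = 370): P(overflow) = %.6f\n",
           prob_overflow(370.0, c));
    return 0;
}
```

At 80% fill the overflow probability is vanishingly small, while at 100% fill roughly half the buckets spill, which is the quantitative reason behind the fill-factor recommendation above.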
When properly sized with minimal overflow, static hashing achieves what no tree structure can: guaranteed single-I/O lookups. This makes it ideal for high-frequency point queries where every millisecond counts. However, this advantage evaporates quickly with overflow chains.
The fixed bucket architecture is particularly amenable to memory-mapped I/O (mmap). Since the bucket file has a known, fixed size:
bucket_addr = base + (bucket_id × page_size)

Every bucket's address within the mapping is computable in constant time, with no buffer-pool lookup or indirection on the read path. Memory-mapped embedded storage engines such as LMDB rely on exactly this kind of fixed-layout address arithmetic.
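A minimal sketch of this approach using POSIX mmap; reading page_size and bucket count from the header page is omitted, and the struct layout is illustrative:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HEADER_PAGE_SIZE 8192   /* metadata page precedes the buckets */

typedef struct {
    uint8_t *base;       /* start of the mapping */
    size_t   length;     /* total mapped bytes */
    uint32_t page_size;  /* in practice, read from the header page */
} MappedIndex;

/* Map the fixed-size bucket file once; the mapping never needs to
 * grow because N is fixed. */
int map_index(MappedIndex *idx, const char *path) {
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    idx->length = (size_t)st.st_size;
    idx->base = mmap(NULL, idx->length, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);   /* the mapping remains valid after close */
    return idx->base == MAP_FAILED ? -1 : 0;
}

/* bucket_addr = base + header + bucket_id * page_size: no buffer-pool
 * lookup, no system call, just arithmetic plus a possible page fault. */
static inline void *bucket_addr(const MappedIndex *idx, uint32_t id) {
    return idx->base + HEADER_PAGE_SIZE + (size_t)id * idx->page_size;
}
```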
The optimal bucket count depends not just on data volume but on workload patterns. Different access patterns favor different sizing strategies.
For workloads dominated by point lookups (e.g., session caches, user profile access), minimize overflow at the cost of space: size N generously, targeting a fill factor at the low end (60% or below), so that virtually every lookup keeps the single-I/O guarantee.
For workloads with frequent insertions (e.g., log tables, event streams), balance overflow tolerance with space: size for the record count you expect after growth rather than at creation, and accept short overflow chains late in the table's life as the cheaper alternative to repeated rebuilds.
For balanced read/write patterns (e.g., OLTP user tables), the conventional 70-80% fill factor is the usual target: as the Poisson analysis showed, it keeps overflow rare without leaving most of the file empty. The sketch below folds these guidelines into the earlier sizing formula.
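A hedged sketch of that translation; the fill-factor cutoffs are illustrative defaults rather than hard rules, and compute_bucket_count is the helper sketched earlier:

```c
typedef enum {
    WORKLOAD_READ_HEAVY,
    WORKLOAD_INSERT_HEAVY,
    WORKLOAD_MIXED
} Workload;

/* Illustrative targets: lower fill buys fewer overflow chains (reads)
 * or more insertion headroom (writes); 0.75 is the conventional middle. */
static double target_fill_factor(Workload w) {
    switch (w) {
    case WORKLOAD_READ_HEAVY:   return 0.60;
    case WORKLOAD_INSERT_HEAVY: return 0.65;
    default:                    return 0.75;
    }
}

/* Usage:
 *   uint32_t n = compute_bucket_count(expected_records,
 *                                     target_fill_factor(WORKLOAD_MIXED),
 *                                     entries_per_bucket);
 */
```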
A common recommendation is to use a prime number for N (the bucket count). Prime numbers reduce clustering when hash function outputs have periodic patterns. For example, if many keys share a common factor with N, they crowd into only the buckets whose numbers are multiples of that factor. A prime N has no divisors other than 1 and itself, so no such pattern can emerge, as the demonstration below shows.
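A self-contained demonstration: keys that are all multiples of 8 (e.g., 8-byte-aligned addresses used directly as hash values) cluster under a composite bucket count but spread evenly under a prime one:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int hit64[64] = {0}, hit61[61] = {0};
    int used64 = 0, used61 = 0;

    for (uint32_t key = 0; key < 8 * 1000; key += 8) {
        hit64[key % 64]++;   /* gcd(8, 64) = 8: only 1/8 of buckets used */
        hit61[key % 61]++;   /* gcd(8, 61) = 1: every bucket reachable  */
    }
    for (int i = 0; i < 64; i++) used64 += hit64[i] > 0;
    for (int i = 0; i < 61; i++) used61 += hit61[i] > 0;

    printf("N = 64: %d of 64 buckets used\n", used64);   /* prints 8  */
    printf("N = 61: %d of 61 buckets used\n", used61);   /* prints 61 */
    return 0;
}
```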
Despite its simplicity and performance advantages, the fixed bucket count architecture has fundamental limitations that restrict its applicability in modern database systems.
The core limitation is straightforward: static hashing cannot adapt to changing data volumes.
Consider a table that grows from 100,000 records to 10 million over its lifetime. A hash index sized for 100,000 records would have: 100× more records per bucket than it was designed for, overflow chains running to dozens of pages behind every primary bucket, and point lookups that cost tens of I/Os instead of one.
Conversely, sizing for 10 million records when only 100,000 exist means: buckets that are on average about 1% full, an index file roughly 100× larger than the data requires, and full scans and backups that read mostly empty pages.
When a static hash index becomes pathological (too much overflow or too much wasted space), the only remedy is complete reconstruction: create a new index with a recomputed bucket count, rehash and reinsert every record, then swap the new index in and drop the old one (see the sketch after the next paragraph).
This operation is O(n) in the number of records, temporarily doubles index storage while both copies exist, and disrupts concurrent writes, which must be blocked or carefully coordinated for the duration.
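As a rough illustration of the rebuild procedure, here is a hedged sketch; every API in it (hash_index_create, table_scan_begin/next/end, hash_index_insert, catalog_swap_index, hash_index_drop) is a hypothetical name rather than a real library call, and compute_bucket_count is the sizing helper sketched earlier:

```c
/* Hedged sketch of a full index rebuild. All APIs are hypothetical. */
int rebuild_hash_index(Catalog *cat, Table *tbl, HashIndex *old_idx,
                       uint64_t new_expected_records)
{
    /* 1. Size a fresh index for the *current* data volume. */
    uint32_t new_n = compute_bucket_count(new_expected_records, 0.75,
                                          370 /* entries/page, as earlier */);
    HashIndex *new_idx = hash_index_create(cat, tbl, new_n);
    if (new_idx == NULL)
        return -1;

    /* 2. O(n): rehash and reinsert every record. Storage is doubled
     *    from here until the old index is dropped. */
    TableScan *scan = table_scan_begin(tbl);
    Record rec;
    while (table_scan_next(scan, &rec))
        hash_index_insert(new_idx, rec.key, rec.key_len, rec.rid);
    table_scan_end(scan);

    /* 3. Atomically swap the catalog entry, then reclaim the old file.
     *    Writes must be blocked (or logged and replayed) during 1-3. */
    catalog_swap_index(cat, old_idx, new_idx);
    hash_index_drop(old_idx);
    return 0;
}
```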
In contrast, dynamic hashing schemes (extendible hashing, linear hashing) incrementally adapt without full reconstruction.
Despite these limitations, static hashing remains valuable for: (1) Tables with predictable, stable sizes, (2) High-frequency point lookup workloads where single-I/O guarantee matters, (3) Embedded systems with constrained resources where dynamic restructuring is prohibitive, (4) Hash joins on intermediate results with known cardinality.
Implementing a fixed-bucket hash index requires attention to several practical details that affect correctness, performance, and maintainability.
The hash index requires persistent metadata to survive system restarts:
```c
/* Hash index metadata stored in catalog or header page */
typedef struct HashIndexMetadata {
    uint32_t magic;             /* Magic number for validation */
    uint32_t version;           /* Format version for compatibility */
    uint32_t bucket_count;      /* N - total number of buckets */
    uint32_t page_size;         /* Size of each bucket page */
    uint32_t key_size;          /* Fixed key size, or 0 for variable */
    uint32_t entry_size;        /* Size of each index entry */
    uint32_t fill_factor;       /* Target fill factor × 100 */
    uint64_t record_count;      /* Current number of entries */
    uint64_t overflow_pages;    /* Number of overflow pages allocated */
    uint32_t hash_function_id;  /* Identifier for hash function used */
    uint32_t seed;              /* Seed for hash function (if applicable) */
    uint64_t created_timestamp; /* Index creation time */
    uint64_t last_rebuild;      /* Last rebuild timestamp */
    char     index_name[64];    /* Human-readable name */
} HashIndexMetadata;
```
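When the index is opened, this metadata should be validated before any bucket arithmetic trusts it. A minimal sketch, assuming a hypothetical magic constant and format version:

```c
#define HASH_INDEX_MAGIC       0x48534849u  /* arbitrary illustrative value */
#define CURRENT_FORMAT_VERSION 1

/* Validate persisted metadata at open time; returns 0 on success. */
int validate_metadata(const HashIndexMetadata *meta)
{
    if (meta->magic != HASH_INDEX_MAGIC)
        return -1;   /* not a hash index file, or a corrupted header */
    if (meta->version > CURRENT_FORMAT_VERSION)
        return -2;   /* written by a newer, incompatible release */
    if (meta->bucket_count == 0 || meta->page_size == 0)
        return -3;   /* torn write or corruption */
    return 0;
}
```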
The core operation—locating the correct bucket for a key—must be highly optimized:

```c
/* Locate bucket for a given key */
static inline uint32_t compute_bucket_id(const HashIndex *index,
                                         const void *key, size_t key_len)
{
    /* Compute hash value using configured hash function */
    uint64_t hash = hash_function(index->hash_fn_id, key, key_len,
                                  index->metadata.seed);

    /* Map to bucket using modulo.
     * Note: If bucket_count is a power of 2, use bitwise AND for speed:
     *     return hash & (index->metadata.bucket_count - 1);
     * Otherwise, use modulo (handles prime bucket counts): */
    return (uint32_t)(hash % index->metadata.bucket_count);
}

/* Read bucket page from storage or buffer pool */
BucketPage* read_bucket(HashIndex *index, uint32_t bucket_id)
{
    /* Calculate page offset in hash file (skip the metadata page) */
    uint64_t page_offset = (uint64_t)bucket_id * index->metadata.page_size
                         + HEADER_PAGE_SIZE;

    /* Request page from buffer manager */
    PageHandle handle = buffer_fetch_page(index->buffer_pool,
                                          index->file_handle, page_offset);
    return (BucketPage*)page_get_data(handle);
}
```

If the bucket count is a power of 2, the modulo operation can be replaced with a bitwise AND: bucket_id = hash & (N - 1). This is significantly faster because it avoids an expensive division, but power-of-two counts can amplify clustering with weak hash functions. Choose based on your hash function's quality.
We have explored the foundational concept of static hashing: the fixed bucket count. This architectural decision shapes every aspect of the hash index's behavior, from I/O patterns to scaling limitations.
What's Next:
With the bucket foundation established, the next page examines hash function design — the algorithms that determine which bucket receives each key. We'll explore distribution requirements, collision resistance, and the tradeoffs between computation cost and hash quality.
You now understand the fixed bucket count architecture that defines static hashing. This foundational knowledge is essential for evaluating when static hashing is appropriate and understanding why dynamic alternatives were developed. Next, we'll dive into the hash functions that make this architecture work.