Traditional static hashing faces a fundamental dilemma: choose a fixed number of buckets and live with either wasted space or overflow chains. Extendible hashing resolves this through an elegant architectural innovation—a directory structure that acts as an indirection layer between hash values and physical buckets.
Imagine a library where instead of fixed shelves, you have a dynamic card catalog that can reference any number of storage locations. As the library grows, you don't rebuild the entire shelving system—you simply update the catalog and add new storage compartments where needed. This is precisely the philosophy behind extendible hashing's directory structure.
By the end of this page, you will understand how the directory structure forms the backbone of extendible hashing, enabling O(1) lookups while supporting graceful growth without full reorganization. You'll see why this design has become fundamental to adaptive database indexing.
To appreciate extendible hashing's directory structure, we must first understand the limitations of static hashing—the approach it was designed to overcome.
Static hashing's core limitation:
In static hashing, you define N buckets at table creation time. The hash function h(k) maps each key to one of these N buckets. This seems straightforward, but it creates a serious problem: changing the bucket count changes the mapping for almost every key, forcing a full rehash of the table.
Consider a 10 million row table that needs to double its bucket count. With static hashing, every row must be read, rehashed with the new modulus, and written to its new location. At 10,000 rows/second, this takes over 15 minutes of exclusive processing—an eternity for a production database.
The insight that led to extendible hashing:
What if we could add buckets one at a time, only when and where they're needed? What if we could reorganize just the overflowing bucket instead of the entire table? What if we could decouple the logical addressing of buckets from their physical existence?
This is the breakthrough that extendible hashing achieves through its directory structure.
The directory is the central innovation of extendible hashing. It's an array of bucket pointers that serves as a level of indirection between hash values and actual data storage.
Core components:
| Component | Description | Purpose |
|---|---|---|
| Directory Array | Array of 2^d pointers, where d is the global depth | Maps hash prefixes to bucket addresses |
| Directory Entry | A single pointer to a bucket's physical location | Enables multiple hash prefixes to share one bucket |
| Global Depth (d) | Number of hash bits used to index the directory | Determines directory size as 2^d |
| Bucket | Physical storage unit holding data records | Fixed-size page storing actual key-value pairs |
| Local Depth (d') | Number of hash bits distinguishing a bucket's entries | Indicates how many times this bucket has split |
How the directory works:
When you want to look up a key k:
1. h(k) produces a binary string (e.g., 32 bits).
2. The first d bits of the hash (where d is the global depth) index the directory.
3. The directory entry points to the bucket, which is then searched for the key.

The key insight is that the directory size (2^d) can grow independently of the number of buckets. Multiple directory entries can point to the same bucket when that bucket hasn't needed to split yet.
```
// Extendible Hash Lookup Algorithm
function lookup(key):
    // Step 1: Compute full hash value
    hash_value = hash(key)   // e.g., 0b10110011...

    // Step 2: Extract first 'd' bits (global depth)
    directory_index = extractBits(hash_value, global_depth)

    // Step 3: Follow directory pointer to bucket
    bucket = directory[directory_index]

    // Step 4: Search within bucket (small, fits in memory)
    for each entry in bucket:
        if entry.key == key:
            return entry.value
    return NOT_FOUND

// Example with global_depth = 3:
// hash("Alice") = 0b10110011...
// First 3 bits = 101 (binary) = 5 (decimal)
// bucket = directory[5]
```

Understanding the physical layout of the directory is crucial for grasping how extendible hashing achieves its efficiency.
Directory structure in memory:
The directory is typically kept in memory (it's small compared to the data) and consists of a power-of-2 sized array. Each entry in the directory is simply a pointer (typically 4-8 bytes) to a bucket stored on disk.
Key observations from this layout (global depth 2, four directory entries):
Directory entries 10 and 11 both point to Bucket C — This bucket has local depth 1, meaning only the first hash bit matters. Both 10 and 11 start with 1, so they share the same bucket.
Bucket C has local depth 1, but global depth is 2 — This asymmetry is the heart of extendible hashing. When Bucket C fills up, it can split without the directory growing.
The directory size is 2² = 4 — With global depth d=2, we have exactly 4 directory entries, regardless of how many buckets actually exist.
Bit prefix interpretation:
The directory index is derived from the most significant bits of the hash value. This is a critical design choice:
With most-significant-bit indexing, directory index i splits into indices 2i and 2i + 1 when the global depth increases: the shared prefix simply gains one more bit on the right.
```
// Extract the first 'd' bits from a hash value
function extractBits(hash_value, d):
    // Shift right to move the first d bits to the rightmost position
    // Then mask to keep only those d bits
    hash_bits = 32   // Assuming 32-bit hash
    shift_amount = hash_bits - d
    return (hash_value >> shift_amount) & ((1 << d) - 1)

// Example:
// hash_value   = 0b10110011_00000000_00000000_00000000
// d = 3
// shift_amount = 32 - 3 = 29
// Shifted: 0b00000000_00000000_00000000_00000101 (= 5)
// Mask:    0b111 = 7
// Result:  5 & 7 = 5 (binary: 101)
```

A distinguishing feature of extendible hashing is that multiple directory entries can point to the same physical bucket. This property is what enables incremental growth without wasting storage.
When does this happen?
When a bucket's local depth is less than the global depth, multiple directory entries map to that bucket. The relationship is precise:
If a bucket has local_depth = d' and the global depth is d, then exactly 2^(d-d') directory entries point to that bucket.

Visual example:
| Bucket | Local Depth | Pointers | Directory Indices |
|---|---|---|---|
| A | 3 | 1 (2^(3-3)) | 000 |
| B | 3 | 1 (2^(3-3)) | 001 |
| C | 2 | 2 (2^(3-2)) | 010, 011 |
| D | 2 | 2 (2^(3-2)) | 100, 101 |
| E | 2 | 2 (2^(3-2)) | 110, 111 |
Why this matters for performance:
This sharing mechanism means:
Space efficiency — You don't allocate buckets until data demands it. A nearly-empty table might have hundreds of directory entries but only a few actual buckets.
Localized splitting — When a bucket overflows, you split only that bucket and update only its directory pointers. The rest of the table is unaffected.
Graceful degradation — Even if the hash function has some clustering, only the affected buckets grow; others remain compact.
At any point in time, the total number of directory pointers equals 2^(global_depth). The sum of (2^(global_depth - local_depth)) across all buckets must also equal 2^(global_depth). This invariant is maintained by split and merge operations.
```
// Verify the pointer invariant holds
function verifyPointerInvariant(directory, buckets, global_depth):
    // Count pointers from the directory perspective
    directory_pointers = 2 ^ global_depth

    // Count pointers from the bucket perspective
    bucket_pointers = 0
    for each bucket in buckets:
        pointers_to_bucket = 2 ^ (global_depth - bucket.local_depth)
        bucket_pointers += pointers_to_bucket

    // These must match
    assert directory_pointers == bucket_pointers

    // Also verify each bucket is correctly referenced
    for i = 0 to directory_pointers - 1:
        bucket = directory[i]
        // With MSB indexing, the first 'local_depth' bits of index i
        // must match the bucket's identifier
        prefix = i >> (global_depth - bucket.local_depth)
        assert prefix == bucket.identifier
```

In practice, the directory's storage location and size have important implications for system performance.
Directory size analysis:
With global depth d, the directory contains 2^d pointers. Each pointer is typically 4-8 bytes (depending on address space). So:
| Global Depth | Directory Entries | Size (8-byte ptrs) | Typical Storage |
|---|---|---|---|
| 10 | 1,024 | 8 KB | Always in memory |
| 16 | 65,536 | 512 KB | In memory (L3 cache) |
| 20 | ~1 million | 8 MB | In memory (RAM) |
| 24 | ~16 million | 128 MB | RAM or disk with caching |
| 28+ | ~256 million+ | 2+ GB | Disk-based with buffer management |
Memory-resident directories:
For most database applications, the directory remains memory-resident because:

- It is small: even at global depth 20 it occupies only about 8 MB (see the table above)
- It sits on every lookup path, so paging it from disk would add an I/O to every single operation
Disk-resident directories:
For extremely large datasets or embedded systems with limited memory, the directory itself may spill to disk. In this case:

- Directory pages are cached through the buffer pool like ordinary data pages
- A lookup may incur one additional disk read to fetch the relevant directory page before the bucket page is touched
Modern systems such as Oracle's hash clusters and PostgreSQL's hash access method keep their hash structures in shared memory buffers with careful concurrency control.
In practice, global depth rarely exceeds 20-24 for even the largest tables. With a 16 KB bucket size and average 100-byte records, each bucket holds ~160 records. At global depth 24, extendible hashing can address 16 million buckets × 160 records ≈ 2.5 billion records—sufficient for most applications.
The physical separation between directory and buckets is a fundamental architectural decision in extendible hashing.
Why separate them? The directory is small and hot, so it can live in memory; buckets are large and accessed one at a time, so they live on disk. Keeping them apart lets each side be sized, cached, and grown independently.
Physical layout in a database file:
A typical implementation stores extendible hash structures as follows:
```
┌─────────────────────────────────────────────────────────┐
│ Header Page                                             │
│  ─ Global depth: 3                                      │
│  ─ Directory page location: Page 1                      │
│  ─ Bucket page count: 5                                 │
│  ─ Free page list: Page 8, Page 12, ...                 │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ Directory Page(s)                                       │
│  Page 1: [ptr→P2] [ptr→P3] [ptr→P4] [ptr→P4]            │
│          [ptr→P5] [ptr→P5] [ptr→P6] [ptr→P6]            │
│  (8 entries for global depth 3)                         │
└─────────────────────────────────────────────────────────┘
       │
  ┌──────────┬───────────┼───────────┬──────────┐
  ▼          ▼           ▼           ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Page 2 │ │ Page 3 │ │ Page 4 │ │ Page 5 │ │ Page 6 │
│Bucket A│ │Bucket B│ │Bucket C│ │Bucket D│ │Bucket E│
│ LD: 3  │ │ LD: 3  │ │ LD: 2  │ │ LD: 2  │ │ LD: 2  │
│[data..]│ │[data..]│ │[data..]│ │[data..]│ │[data..]│
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```

The indirection enables flexibility:
Because the directory stores pointers (not embedded data), several powerful capabilities emerge:

- A bucket can be moved, compacted, or rewritten by updating only the directory entries that reference it
- The directory can double without touching a single bucket page
- Bucket pages can be cached, evicted, and laid out on disk independently of the directory
Implementing an extendible hash directory requires careful attention to several practical concerns.
Directory structure in code:
```cpp
// C++ Directory Implementation
struct Bucket {
    uint32_t local_depth;
    uint32_t page_id;        // Physical location on disk
    uint32_t record_count;
    // Actual records are stored separately in the page
};

class ExtendibleHashDirectory {
private:
    uint32_t global_depth;
    std::vector<Bucket*> directory;   // 2^global_depth entries
    BufferPool* buffer_pool;

public:
    ExtendibleHashDirectory(BufferPool* bp)
        : global_depth(1), buffer_pool(bp) {
        // Start with 2 entries pointing to 2 buckets
        directory.resize(2);
        directory[0] = createNewBucket(1);   // local_depth = 1
        directory[1] = createNewBucket(1);
    }

    Bucket* getBucket(uint32_t key) {
        uint32_t hash = hashFunction(key);
        uint32_t index = extractMSBits(hash, global_depth);
        return directory[index];
    }

    uint32_t extractMSBits(uint32_t hash, uint32_t num_bits) {
        // Extract the most significant 'num_bits' bits
        return hash >> (32 - num_bits);
    }

    void doubleDirectory() {
        uint32_t old_size = directory.size();
        directory.resize(old_size * 2);
        // With MSB indexing, old entry i is duplicated to positions
        // 2i and 2i + 1. Iterate backwards so each entry is read
        // before its slot is overwritten.
        for (uint32_t i = old_size; i-- > 0; ) {
            directory[2 * i]     = directory[i];
            directory[2 * i + 1] = directory[i];
        }
        global_depth++;
    }
};
```

Critical implementation considerations:
While doubling the directory doesn't move data, it does require allocating a new array of twice the size and copying all pointers. With global depth 20, that's 8 MB of memory to copy. Systems often pre-allocate directory space or use multi-level directories to amortize this cost.
The directory structure is the architectural cornerstone of extendible hashing. Let's consolidate the key concepts:

- The directory is an array of 2^d bucket pointers, indexed by the first d bits of the hash (d = global depth)
- Multiple directory entries may share one bucket; a bucket with local depth d' is referenced by exactly 2^(d-d') entries
- Lookups cost one directory indexing step plus one bucket access, preserving O(1) behavior
- Growth is incremental: buckets split individually, and the directory doubles only when a split demands it
What's next:
With the directory structure understood, we'll explore the concepts of global depth and local depth—the dual counters that control when the directory grows and when individual buckets split. These depth values are the control mechanism that makes extendible hashing adaptive.
You now understand the directory structure—the dynamic indirection layer that enables extendible hashing to grow gracefully. The directory's ability to have multiple pointers to a single bucket is the key to avoiding full table reorganization during growth.