Traditional static hashing faces a fundamental dilemma: choose a fixed number of buckets and live with either wasted space or overflow chains. Extendible hashing resolves this through an elegant architectural innovation—a directory structure that acts as an indirection layer between hash values and physical buckets.
Imagine a library where instead of fixed shelves, you have a dynamic card catalog that can reference any number of storage locations. As the library grows, you don't rebuild the entire shelving system—you simply update the catalog and add new storage compartments where needed. This is precisely the philosophy behind extendible hashing's directory structure.
By the end of this page, you will understand how the directory structure forms the backbone of extendible hashing, enabling O(1) lookups while supporting graceful growth without full reorganization. You'll see why this design has become fundamental to adaptive database indexing.
To appreciate extendible hashing's directory structure, we must first understand the limitations of static hashing—the approach it was designed to overcome.
Static hashing's core limitation:
In static hashing, you define N buckets at table creation time. The hash function h(k) maps each key to one of these N buckets. This seems straightforward, but it creates a serious problem: changing the bucket count changes the mapping for almost every key, forcing a full rehash of the table.
Consider a 10 million row table that needs to double its bucket count. With static hashing, every row must be read, rehashed with the new modulus, and written to its new location. At 10,000 rows/second, this takes over 15 minutes of exclusive processing—an eternity for a production database.
The insight that led to extendible hashing:
What if we could add buckets one at a time, only when and where they're needed? What if we could reorganize just the overflowing bucket instead of the entire table? What if we could decouple the logical addressing of buckets from their physical existence?
This is the breakthrough that extendible hashing achieves through its directory structure.
The directory is the central innovation of extendible hashing. It's an array of bucket pointers that serves as a level of indirection between hash values and actual data storage.
Core components:
| Component | Description | Purpose |
|---|---|---|
| Directory Array | Array of 2^d pointers, where d is the global depth | Maps hash prefixes to bucket addresses |
| Directory Entry | A single pointer to a bucket's physical location | Enables multiple hash prefixes to share one bucket |
| Global Depth (d) | Number of hash bits used to index the directory | Determines directory size as 2^d |
| Bucket | Physical storage unit holding data records | Fixed-size page storing actual key-value pairs |
| Local Depth (d') | Number of hash bits distinguishing a bucket's entries | Indicates how many times this bucket has split |
How the directory works:
When you want to look up a key k:
1. h(k) produces a binary string (e.g., 32 bits).
2. The first d bits of the hash (where d is the global depth) index the directory.
3. The directory entry points to the bucket, which is then searched for the key.

The key insight is that the directory size (2^d) can grow independently of the number of buckets. Multiple directory entries can point to the same bucket when that bucket hasn't needed to split yet.
```
// Extendible Hash Lookup Algorithm
function lookup(key):
    // Step 1: Compute full hash value
    hash_value = hash(key)   // e.g., 0b10110011...

    // Step 2: Extract first 'd' bits (global depth)
    directory_index = extractBits(hash_value, global_depth)

    // Step 3: Follow directory pointer to bucket
    bucket = directory[directory_index]

    // Step 4: Search within bucket (small, fits in memory)
    for each entry in bucket:
        if entry.key == key:
            return entry.value
    return NOT_FOUND

// Example with global_depth = 3:
// hash("Alice") = 0b10110011...
// First 3 bits = 101 (binary) = 5 (decimal)
// bucket = directory[5]
```

Understanding the physical layout of the directory is crucial for grasping how extendible hashing achieves its efficiency.
Directory structure in memory:
The directory is typically kept in memory (it's small compared to the data) and consists of a power-of-2 sized array. Each entry in the directory is simply a pointer (typically 4-8 bytes) to a bucket stored on disk.
Key observations from this layout (global depth 2, four directory entries):
Directory entries 10 and 11 both point to Bucket C — This bucket has local depth 1, meaning only the first hash bit matters. Both 10 and 11 start with 1, so they share the same bucket.
Bucket C has local depth 1, but global depth is 2 — This asymmetry is the heart of extendible hashing. When Bucket C fills up, it can split without the directory growing.
The directory size is 2² = 4 — With global depth d=2, we have exactly 4 directory entries, regardless of how many buckets actually exist.
Bit prefix interpretation:
The directory index is derived from the most significant bits of the hash value. This is a critical design choice:
With most-significant-bit indexing, directory index i splits into indices 2i and 2i + 1 when the global depth increases: the shared prefix simply gains one more bit on the right.
```
// Extract the first 'd' bits from a hash value
function extractBits(hash_value, d):
    // Shift right to move the first d bits to the rightmost position
    // Then mask to keep only those d bits
    hash_bits = 32   // Assuming 32-bit hash
    shift_amount = hash_bits - d
    return (hash_value >> shift_amount) & ((1 << d) - 1)

// Example:
// hash_value   = 0b10110011_00000000_00000000_00000000
// d = 3
// shift_amount = 32 - 3 = 29
// Shifted: 0b00000000_00000000_00000000_00000101 (= 5)
// Mask:    0b111 = 7
// Result:  5 & 7 = 5 (binary: 101)
```

A distinguishing feature of extendible hashing is that multiple directory entries can point to the same physical bucket. This property is what enables incremental growth without wasting storage.
When does this happen?
When a bucket's local depth is less than the global depth, multiple directory entries map to that bucket. The relationship is precise:
If a bucket has local_depth = d' and the global depth is d, then exactly 2^(d-d') directory entries point to that bucket.

Visual example:
| Bucket | Local Depth | Pointers | Directory Indices |
|---|---|---|---|
| A | 3 | 1 (2^(3-3)) | 000 |
| B | 3 | 1 (2^(3-3)) | 001 |
| C | 2 | 2 (2^(3-2)) | 010, 011 |
| D | 2 | 2 (2^(3-2)) | 100, 101 |
| E | 2 | 2 (2^(3-2)) | 110, 111 |
Why this matters for performance:
This sharing mechanism means:
Space efficiency — You don't allocate buckets until data demands it. A nearly-empty table might have hundreds of directory entries but only a few actual buckets.
Localized splitting — When a bucket overflows, you split only that bucket and update only its directory pointers. The rest of the table is unaffected.
Graceful degradation — Even if the hash function has some clustering, only the affected buckets grow; others remain compact.
At any point in time, the total number of directory pointers equals 2^(global_depth). The sum of (2^(global_depth - local_depth)) across all buckets must also equal 2^(global_depth). This invariant is maintained by split and merge operations.
```
// Verify the pointer invariant holds
function verifyPointerInvariant(directory, buckets, global_depth):
    // Count pointers from the directory perspective
    directory_pointers = 2 ^ global_depth

    // Count pointers from the bucket perspective
    bucket_pointers = 0
    for each bucket in buckets:
        pointers_to_bucket = 2 ^ (global_depth - bucket.local_depth)
        bucket_pointers += pointers_to_bucket

    // These must match
    assert directory_pointers == bucket_pointers

    // Also verify each bucket is correctly referenced
    for i = 0 to directory_pointers - 1:
        bucket = directory[i]
        // With MSB indexing, the first 'local_depth' bits of index i
        // must match the bucket's identifier
        prefix = i >> (global_depth - bucket.local_depth)
        assert prefix == bucket.identifier
```

In practice, the directory's storage location and size have important implications for system performance.
Directory size analysis:
With global depth d, the directory contains 2^d pointers. Each pointer is typically 4-8 bytes (depending on address space). So:
| Global Depth | Directory Entries | Size (8-byte ptrs) | Typical Storage |
|---|---|---|---|
| 10 | 1,024 | 8 KB | Always in memory |
| 16 | 65,536 | 512 KB | In memory (L3 cache) |
| 20 | ~1 million | 8 MB | In memory (RAM) |
| 24 | ~16 million | 128 MB | RAM or disk with caching |
| 28+ | ~256 million+ | 2+ GB | Disk-based with buffer management |
Memory-resident directories:
For most database applications, the directory remains memory-resident because:

- It is small: even at global depth 20 it occupies only about 8 MB (see the table above)
- It sits on every lookup path, so paging it from disk would add an I/O to every single operation
Disk-resident directories:
For extremely large datasets or embedded systems with limited memory, the directory itself may spill to disk. In this case:

- Directory pages are cached through the buffer pool like ordinary data pages
- A lookup may incur one additional disk read to fetch the relevant directory page before the bucket page is touched
Modern systems such as Oracle's hash clusters and PostgreSQL's hash access method keep their hash structures in shared memory buffers with careful concurrency control.
In practice, global depth rarely exceeds 20-24 for even the largest tables. With a 16 KB bucket size and average 100-byte records, each bucket holds ~160 records. At global depth 24, extendible hashing can address 16 million buckets × 160 records ≈ 2.5 billion records—sufficient for most applications.
The physical separation between directory and buckets is a fundamental architectural decision in extendible hashing.
Why separate them? The directory is small and hot, so it can live in memory; buckets are large and accessed one at a time, so they live on disk. Keeping them apart lets each side be sized, cached, and grown independently.
Physical layout in a database file:
A typical implementation stores extendible hash structures as follows:
```
┌─────────────────────────────────────────────────────────┐
│ Header Page                                             │
│  ─ Global depth: 3                                      │
│  ─ Directory page location: Page 1                      │
│  ─ Bucket page count: 5                                 │
│  ─ Free page list: Page 8, Page 12, ...                 │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│ Directory Page(s)                                       │
│  Page 1: [ptr→P2] [ptr→P3] [ptr→P4] [ptr→P4]            │
│          [ptr→P5] [ptr→P5] [ptr→P6] [ptr→P6]            │
│  (8 entries for global depth 3)                         │
└─────────────────────────────────────────────────────────┘
       │
  ┌──────────┬───────────┼───────────┬──────────┐
  ▼          ▼           ▼           ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Page 2 │ │ Page 3 │ │ Page 4 │ │ Page 5 │ │ Page 6 │
│Bucket A│ │Bucket B│ │Bucket C│ │Bucket D│ │Bucket E│
│ LD: 3  │ │ LD: 3  │ │ LD: 2  │ │ LD: 2  │ │ LD: 2  │
│[data..]│ │[data..]│ │[data..]│ │[data..]│ │[data..]│
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```

The indirection enables flexibility:
Because the directory stores pointers (not embedded data), several powerful capabilities emerge:

- A bucket can be moved, compacted, or rewritten by updating only the directory entries that reference it
- The directory can double without touching a single bucket page
- Bucket pages can be cached, evicted, and laid out on disk independently of the directory
Implementing an extendible hash directory requires careful attention to several practical concerns.
Directory structure in code:
```cpp
// C++ Directory Implementation
struct Bucket {
    uint32_t local_depth;
    uint32_t page_id;        // Physical location on disk
    uint32_t record_count;
    // Actual records are stored separately in the page
};

class ExtendibleHashDirectory {
private:
    uint32_t global_depth;
    std::vector<Bucket*> directory;   // 2^global_depth entries
    BufferPool* buffer_pool;

public:
    ExtendibleHashDirectory(BufferPool* bp)
        : global_depth(1), buffer_pool(bp) {
        // Start with 2 entries pointing to 2 buckets
        directory.resize(2);
        directory[0] = createNewBucket(1);   // local_depth = 1
        directory[1] = createNewBucket(1);
    }

    Bucket* getBucket(uint32_t key) {
        uint32_t hash = hashFunction(key);
        uint32_t index = extractMSBits(hash, global_depth);
        return directory[index];
    }

    uint32_t extractMSBits(uint32_t hash, uint32_t num_bits) {
        // Extract the most significant 'num_bits' bits
        return hash >> (32 - num_bits);
    }

    void doubleDirectory() {
        uint32_t old_size = directory.size();
        directory.resize(old_size * 2);
        // With MSB indexing, old entry i is duplicated to positions
        // 2i and 2i + 1. Iterate backwards so each entry is read
        // before its slot is overwritten.
        for (uint32_t i = old_size; i-- > 0; ) {
            directory[2 * i]     = directory[i];
            directory[2 * i + 1] = directory[i];
        }
        global_depth++;
    }
};
```

Critical implementation considerations:
While doubling the directory doesn't move data, it does require allocating a new array of twice the size and copying all pointers. With global depth 20, that's 8 MB of memory to copy. Systems often pre-allocate directory space or use multi-level directories to amortize this cost.
The directory structure is the architectural cornerstone of extendible hashing. Let's consolidate the key concepts:

- The directory is an array of 2^d bucket pointers, indexed by the first d bits of the hash (d = global depth)
- Multiple directory entries may share one bucket; a bucket with local depth d' is referenced by exactly 2^(d-d') entries
- Lookups cost one directory indexing step plus one bucket access, preserving O(1) behavior
- Growth is incremental: buckets split individually, and the directory doubles only when a split demands it
What's next:
With the directory structure understood, we'll explore the concepts of global depth and local depth—the dual counters that control when the directory grows and when individual buckets split. These depth values are the control mechanism that makes extendible hashing adaptive.
You now understand the directory structure—the dynamic indirection layer that enables extendible hashing to grow gracefully. The directory's ability to have multiple pointers to a single bucket is the key to avoiding full table reorganization during growth.