Directory doubling is the most structurally significant operation in extendible hashing. It occurs when the current directory structure cannot accommodate a bucket split—specifically, when a bucket with local_depth = global_depth overflows.
What makes directory doubling remarkable is that no data moves. Unlike full rehashing in static hashing (which reads and rewrites every record), directory doubling only copies pointers. The actual bucket pages, containing potentially millions of records, remain exactly where they are.
By the end of this page, you will understand when directory doubling is triggered, the exact mechanics of the doubling operation, its time and space complexity, concurrency considerations, and optimization techniques used in production systems.
Directory doubling is triggered by a precise condition:
The Trigger:
An insertion attempts to add a record to a bucket that:
- Is full (at capacity)
- Has local_depth = global_depth (already at maximum specificity)
When both conditions are true, the bucket cannot split within the current directory structure. Splitting would require halving the pointer count, but a bucket with local_depth = global_depth has exactly 1 pointer—you cannot have half a pointer.
Decision flowchart:
```
function insert(key, value):
    bucket = findBucket(key)

    if bucket.hasSpace():
        bucket.add(key, value)
        return SUCCESS

    // Bucket is full - need to split
    if bucket.local_depth < global_depth:
        // CASE A: Local split only (no directory change)
        splitBucket(bucket)
        return insert(key, value)  // Retry
    else:
        // CASE B: bucket.local_depth == global_depth
        // Directory doubling required before split
        doubleDirectory()
        splitBucket(bucket)
        return insert(key, value)  // Retry

// Key insight:
// - CASE A handles most overflow situations (increasingly common as table grows)
// - CASE B is rare but necessary for continued growth
```

Why can't we just use overflow chains?
Some systems do allow overflow chains as a fallback, but chains defeat the purpose of extendible hashing: once a lookup may have to walk a chain of overflow pages, the bounded-cost guarantee of one directory access plus one bucket access is lost.
Directory doubling preserves the O(1) guarantee by ensuring every bucket can always split when needed.
With global_depth d, roughly 1/2^(d-1) of buckets have local_depth = d (the maximum). As d grows, the probability that a random overflow triggers doubling approaches zero: a mature table with d = 20 has less than a 1-in-500,000 chance per overflow (1/2^19 = 1/524,288).
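As a quick back-of-the-envelope check, the heuristic above can be tabulated directly. This snippet is illustrative only; it just evaluates the 1/2^(d-1) figure stated in the text, it is not tied to any particular implementation:

```cpp
#include <cstdio>
#include <cstdint>

// Illustrative only: tabulate the heuristic 1/2^(d-1) fraction of buckets
// at maximum local depth, i.e. the rough chance that a random bucket
// overflow forces a directory doubling at global_depth d.
int main() {
    const uint32_t depths[] = {4, 8, 12, 16, 20};
    for (uint32_t d : depths) {
        unsigned long long one_in = 1ull << (d - 1);
        std::printf("d = %2u: roughly 1 in %llu overflows (%.2e)\n",
                    (unsigned)d, one_in, 1.0 / (double)one_in);
    }
    return 0;
}
```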
The directory doubling algorithm is elegantly simple:
Algorithm: Double Directory
```
function doubleDirectory():
    old_size = 2^global_depth
    new_size = 2 * old_size

    // Step 1: Allocate new directory array
    new_directory = allocateArray(new_size, type=BucketPointer)

    // Step 2: Copy and duplicate all pointers
    for i = 0 to old_size - 1:
        // Original entry at position i
        new_directory[i] = directory[i]
        // Duplicate at position i + old_size
        new_directory[i + old_size] = directory[i]

    // Step 3: Atomically swap directories
    old_directory = directory
    directory = new_directory

    // Step 4: Increment global depth
    global_depth = global_depth + 1

    // Step 5: Deallocate old directory (deferred if concurrent readers)
    scheduleDeallocation(old_directory)

// After doubling:
// - directory size: 2^(global_depth)
// - Every bucket now has 2× pointers pointing to it
// - No bucket has changed (local depths unchanged)
// - No data has moved
```
Key observations:
- The directory grows from 2^d to 2^(d+1) entries and global_depth increases by 1.
- Every existing bucket now has twice as many directory pointers referring to it.
- No bucket's local_depth changes, and no record moves.
- Immediately after doubling, the upper half of the directory is an exact copy of the lower half; only the subsequent bucket split differentiates the two copies.
Understanding the costs of directory doubling is essential for performance analysis.
Time complexity:
| Operation | Complexity | Explanation |
|---|---|---|
| Memory allocation | O(2^d) | Allocate 2^(d+1) pointer slots |
| Pointer copying | O(2^d) | Copy 2^d pointers twice each |
| Directory swap | O(1) | Single pointer assignment |
| Bucket operations | 0 | No bucket I/O during doubling |
| Total | O(2^d) | Linear in current directory size |
Space complexity:
- During the doubling itself, the old and new directories coexist, so peak usage is 2^d + 2^(d+1) = 3 × 2^d pointers.
- Once the old directory is deallocated, steady-state usage is 2^(d+1) pointers (8 bytes each with typical pointer sizes).
Amortized analysis:
While each doubling costs O(2^d), doublings are exponentially rare:
```
Amortized Cost Analysis for N insertions:

Assumptions:
  - Start with d = 1 (2 buckets)
  - Each bucket holds B records
  - Grow to N records total

Number of doublings:
  - Need approximately N/B buckets
  - Maximum global depth: d_max ≈ log₂(N/B)
  - Total doublings: d_max - 1 = O(log(N/B))

Total doubling work:
  Σ (2^i for i = 1 to log₂(N/B))
  = 2 + 4 + 8 + ... + (N/B)
  = 2 × (N/B) - 2
  = O(N/B)

Amortized cost per insertion:
  = O(N/B) / N
  = O(1/B)
  ≈ O(1) for fixed bucket size B

Conclusion:
  Despite O(2^d) per doubling, the amortized cost per insertion is O(1)
  because doublings become exponentially less frequent.
```

With 8-byte pointers and d = 20, copying 1 million pointers takes ~8ms assuming an effective copy throughput of roughly 1 GB/s; for d = 24, it's ~130ms. These are rare operations—the 24th doubling happens only once in the table's lifetime, after handling billions of insertions.
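The geometric-sum argument can also be checked numerically. The sketch below is illustrative only: it assumes a hypothetical bucket capacity B = 64, grows a table to N = 100 million records, counts the pointer slots written by every doubling, and divides by N:

```cpp
#include <cstdio>
#include <cstdint>

// Illustrative check of the amortized argument: count pointer slots written
// by every doubling while growing to N records, assuming buckets of capacity
// B and that the directory doubles whenever the needed bucket count would
// exceed 2^global_depth.
int main() {
    const uint64_t N = 100'000'000;  // records inserted
    const uint64_t B = 64;           // hypothetical bucket capacity

    uint64_t global_depth = 1;
    uint64_t copies = 0;             // total pointer slots written by doublings

    uint64_t buckets_needed = N / B; // roughly one new bucket every B insertions
    while ((1ull << global_depth) < buckets_needed) {
        copies += 2ull << global_depth;  // copy 2^d pointers into 2^(d+1) slots
        global_depth++;
    }

    std::printf("reached depth %llu, %llu pointer copies total\n",
                (unsigned long long)global_depth, (unsigned long long)copies);
    std::printf("amortized copies per insertion: %.4f\n",
                (double)copies / (double)N);
    return 0;
}
```

With these assumptions the program reports well under one pointer copy per insertion, matching the O(1/B) result above.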
The pointer duplication during doubling follows a precise pattern that maintains correctness.
The pattern:
- The entry at index i (in the old directory) is copied to indices i and i + 2^d (in the new directory)
- In binary: index 0bXXX is duplicated to 0b0XXX and 0b1XXX

This works because the most significant bit of the new index is the "new" bit added by increasing global depth.
Detailed example:
```
Before doubling (d = 3, 8 entries):
  Index 0 (000) → Bucket A
  Index 1 (001) → Bucket A   // A has local_depth 2, shared
  Index 2 (010) → Bucket B
  Index 3 (011) → Bucket C
  Index 4 (100) → Bucket D
  Index 5 (101) → Bucket D   // D has local_depth 2, shared
  Index 6 (110) → Bucket E
  Index 7 (111) → Bucket E   // E has local_depth 2, shared

After doubling (d = 4, 16 entries):
  Index 0  (0000) → Bucket A   // From old index 0
  Index 1  (0001) → Bucket A   // From old index 1
  Index 2  (0010) → Bucket B   // From old index 2
  Index 3  (0011) → Bucket C   // From old index 3
  Index 4  (0100) → Bucket D   // From old index 4
  Index 5  (0101) → Bucket D   // From old index 5
  Index 6  (0110) → Bucket E   // From old index 6
  Index 7  (0111) → Bucket E   // From old index 7
  ────────────────────────────────────────────────
  Index 8  (1000) → Bucket A   // Copy of old index 0
  Index 9  (1001) → Bucket A   // Copy of old index 1
  Index 10 (1010) → Bucket B   // Copy of old index 2
  Index 11 (1011) → Bucket C   // Copy of old index 3
  Index 12 (1100) → Bucket D   // Copy of old index 4
  Index 13 (1101) → Bucket D   // Copy of old index 5
  Index 14 (1110) → Bucket E   // Copy of old index 6
  Index 15 (1111) → Bucket E   // Copy of old index 7

Pointer counts after doubling:
  Bucket A: was 2, now 4 (indices 0,1,8,9)
  Bucket B: was 1, now 2 (indices 2,10)
  Bucket C: was 1, now 2 (indices 3,11)
  Bucket D: was 2, now 4 (indices 4,5,12,13)
  Bucket E: was 2, now 4 (indices 6,7,14,15)
```

Why this pattern is correct:
The pattern assumes the directory is indexed by the least significant global_depth bits of the hash, so the bit gained by doubling becomes the most significant bit of the index.

Consider a key whose hash ends in ...1011:

- Before doubling (d = 3): the low 3 bits are 011 → directory index 3 → Bucket C
- After doubling (d = 4): the low 4 bits are 1011 → directory index 11 → still Bucket C, because new index 11 (1011) is a copy of old index 3 (011)

Every key resolves to the same bucket before and after doubling: a key that used old index i now lands on either i (new bit = 0) or i + 2^d (new bit = 1), and both slots hold exactly the pointer that old index i held. Lookups therefore remain correct without touching a single record; it is the subsequent bucket split that differentiates the two copies.
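To convince yourself mechanically, here is a minimal sketch (in-memory, with integer bucket IDs standing in for bucket pages, and low-bit indexing as assumed above) that checks every hash value resolves to the same bucket through the old and new directories:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal sketch: verify that doubling by copying entry i to i and
// i + old_size preserves every lookup, assuming the directory is
// indexed by the low global_depth bits of the hash.
int main() {
    uint32_t d = 3;
    std::vector<int> old_dir = {0, 0, 1, 2, 3, 3, 4, 4};  // buckets A..E as 0..4

    // Double: copy each entry to positions i and i + old_size.
    std::vector<int> new_dir(old_dir.size() * 2);
    for (size_t i = 0; i < old_dir.size(); i++) {
        new_dir[i] = old_dir[i];
        new_dir[i + old_dir.size()] = old_dir[i];
    }

    // Every hash value resolves to the same bucket before and after.
    for (uint32_t h = 0; h < (1u << 16); h++) {
        uint32_t old_index = h & ((1u << d) - 1);
        uint32_t new_index = h & ((1u << (d + 1)) - 1);
        assert(old_dir[old_index] == new_dir[new_index]);
    }
    return 0;
}
```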
Directory doubling is never performed in isolation—it always immediately precedes a bucket split. The sequence is:
1. An insertion finds its target bucket B full, with B.local_depth = global_depth.
2. The directory is doubled; global_depth increases by 1 and B now has two pointers.
3. Bucket B is split, and the new sibling takes one of the two pointers.
4. The insertion is retried and succeeds.
After this sequence:
| Property | Before | After Doubling | After Split |
|---|---|---|---|
| global_depth | d | d + 1 | d + 1 |
| directory_size | 2^d | 2^(d+1) | 2^(d+1) |
| B.local_depth | d | d | d + 1 |
| B.pointers | 1 | 2 | 1 |
| Sibling.local_depth | — | — | d + 1 |
| Sibling.pointers | — | — | 1 |
Step-by-step walkthrough:
```
Initial state:
  global_depth = 2
  Bucket B at index 11, local_depth = 2, pointers = 1
  B is full and receives new insertion

Step 1: Double directory
  global_depth becomes 3
  Directory grows from 4 to 8 entries
  Entry 11 (0b11) copied to entries 011 and 111
  B now has 2 pointers (indices 3 and 7)
  B.local_depth still = 2

Step 2: Split bucket B
  B.local_depth becomes 3
  Create sibling S with local_depth = 3
  Redistribute B's records:
    - Records with bit 3 = 0 stay in B
    - Records with bit 3 = 1 move to S
  Update directory:
    - Index 011 (3) points to B
    - Index 111 (7) points to S
  B now has 1 pointer, S has 1 pointer

Step 3: Retry insertion
  Recalculate target bucket using new global_depth = 3
  Insert into B or S depending on key's bit 3
  Insertion succeeds (buckets now have space)
```

In practice, the doubling and split are often treated as a single atomic operation for crash recovery purposes. The write-ahead log records the combined operation, ensuring either both complete or neither does.
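To make Step 2 concrete, here is a hedged sketch of the split on simple in-memory structures. The `Bucket` layout, the `shared_ptr` directory, and the 0-indexed bit numbering are illustrative assumptions, not a specific engine's types:

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Bucket {
    uint32_t local_depth = 0;
    std::vector<std::pair<uint64_t, uint64_t>> records;  // (hash, value) pairs
};

// Illustrative sketch of Step 2 only. Assumes the directory is indexed by
// the low global_depth bits of the hash and that the caller has already
// ensured b->local_depth < global_depth (doubling first if necessary).
void splitBucket(std::vector<std::shared_ptr<Bucket>>& directory,
                 std::shared_ptr<Bucket> b) {
    uint32_t bit = b->local_depth;            // newly significant bit (0-indexed)
    b->local_depth++;
    auto sibling = std::make_shared<Bucket>();
    sibling->local_depth = b->local_depth;

    // Redistribute: records whose hash has the new bit set move to the sibling.
    std::vector<std::pair<uint64_t, uint64_t>> keep;
    for (const auto& rec : b->records) {
        if ((rec.first >> bit) & 1) sibling->records.push_back(rec);
        else                        keep.push_back(rec);
    }
    b->records = std::move(keep);

    // Repoint the directory entries for b whose index has the new bit set.
    for (size_t i = 0; i < directory.size(); i++) {
        if (directory[i] == b && ((i >> bit) & 1)) directory[i] = sibling;
    }
}
```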
Directory doubling presents unique concurrency challenges because it affects the global structure used by all operations.
The challenge:
During doubling, the relationship between hash values and bucket locations temporarily changes. Readers using the old global_depth will compute different indices than readers using the new depth. Without synchronization, a lookup could pair the new global_depth with the old (smaller) directory and index past its end, observe a partially copied new directory, or dereference an old directory that has already been freed.
Synchronization strategies:
```
// RCU-style directory doubling
class HashIndex {
    atomic<DirectoryState*> current_state

    struct DirectoryState {
        uint32_t global_depth
        BucketPointer* directory
        uint64_t version
    }
}

function doubleDirectoryRCU():
    old_state = current_state.load()

    // Build complete new state
    new_directory = allocateAndCopyDoubled(old_state.directory,
                                           old_state.global_depth)
    new_state = new DirectoryState {
        global_depth: old_state.global_depth + 1,
        directory: new_directory,
        version: old_state.version + 1
    }

    // Atomic swap - no locks needed for readers
    while !current_state.compare_and_swap(old_state, new_state):
        // Another writer beat us - retry from scratch
        deallocate(new_directory)
        old_state = current_state.load()
        new_directory = allocateAndCopyDoubled(old_state.directory,
                                               old_state.global_depth)
        new_state.directory = new_directory
        new_state.global_depth = old_state.global_depth + 1
        new_state.version = old_state.version + 1

    // Schedule old state for deferred cleanup
    rcu_free(old_state)

function lookup(key):
    state = current_state.load()  // Consistent snapshot
    index = extractBits(hash(key), state.global_depth)
    return state.directory[index]
    // No validation needed - state is immutable once published
```

Since lookups vastly outnumber directory modifications, RCU's zero-cost reads are ideal. Writers pay the cost of copying the directory, but this happens rarely. PostgreSQL's hash index uses a variant of this approach.
Production systems employ several techniques to optimize directory doubling.
Pre-allocation:
Instead of allocating exactly 2^d entries, allocate extra capacity:
```cpp
// Pre-allocate directory with growth margin
class OptimizedDirectory {
private:
    BucketPointer* storage;    // Actual memory
    size_t allocated_size;     // May exceed 2^global_depth
    uint32_t global_depth;

public:
    OptimizedDirectory() {
        global_depth = 1;
        // Allocate for depth 4 even though starting at 1
        allocated_size = 16;   // 2^4
        storage = new BucketPointer[allocated_size];
        // Only first 2 entries used initially
    }

    void doubleDirectory() {
        size_t needed = 2 * (1 << global_depth);

        if (needed <= allocated_size) {
            // Just duplicate pointers in-place
            size_t old_size = 1 << global_depth;
            for (size_t i = 0; i < old_size; i++) {
                storage[i + old_size] = storage[i];
            }
            global_depth++;
            return;  // No allocation!
        }

        // Need to reallocate
        size_t new_allocated = std::max(needed, allocated_size * 2);
        // ... proceed with standard reallocation
    }
};
```

Multi-level directories:
For very large directories, use a tree of directory pages instead of a flat array:
```
Multi-level Directory Structure (2-level example):

Level 0 (Root): Array of pointers to Level 1 pages
  - Fixed size throughout table lifetime
  - Each entry covers 2^(d - k) buckets (k = bits for L0 indexing)

Level 1 (Leaves): Pages containing bucket pointers
  - Grows by adding new L1 pages
  - Each page covers a range of bucket addresses

Example with k = 10:
  L0: 1024 entries, each pointing to an L1 page
  L1: Each page has up to 2^14 = 16K entries
  Maximum addressable: 2^24 = 16M buckets

Doubling behavior:
  - If d < 10:  Only L0 doubles (like regular doubling)
  - If d >= 10: L1 pages split, L0 unchanged

Advantage:
  - L0 fits in cache (8KB)
  - L1 pages can be demand-paged
  - Doubling never copies entire directory
```

Pre-allocation trades memory for lower doubling latency. Multi-level directories trade an extra memory access per lookup for bounded doubling cost. The right choice depends on workload characteristics—read-heavy vs write-heavy, memory constraints, latency requirements.
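For concreteness, a lookup against a two-level layout like the one above might split the bucket index as sketched below. The exact bit split (top bits for L0), the 16K-entry leaf pages, and the types are assumptions for illustration, not a production page format:

```cpp
#include <cstdint>

// Sketch of a two-level directory lookup, following the example above:
// the top K bits of the bucket index select an L1 page through the L0
// root; the remaining bits select an entry within that page.
constexpr uint32_t K = 10;                          // bits resolved by L0

struct L1Page   { uint64_t bucket_ptr[1u << 14]; }; // 16K entries per page
struct Directory {
    uint32_t global_depth;                          // assumed >= K here
    L1Page*  root[1u << K];                         // L0: fixed size, cache-resident
};

uint64_t lookupBucket(const Directory& dir, uint64_t hash) {
    uint64_t index   = hash & ((1ull << dir.global_depth) - 1);
    uint64_t l1_bits = dir.global_depth - K;        // bits resolved inside a page
    uint64_t l0_slot = index >> l1_bits;            // which L1 page
    uint64_t l1_slot = index & ((1ull << l1_bits) - 1);
    return dir.root[l0_slot]->bucket_ptr[l1_slot];
}
```

The cost is one extra pointer dereference per lookup, in exchange for a doubling step that never has to copy the full pointer array at once.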
Directory doubling is the mechanism that allows extendible hashing to scale without bounds (within system limits). Let's consolidate the key concepts:
- Trigger: a full bucket whose local_depth equals global_depth.
- Mechanics: allocate a directory twice the size, copy every pointer twice, swap, and increment global_depth; no records move.
- Cost: O(2^d) per doubling, but O(1) amortized per insertion because doublings become exponentially rarer.
- Concurrency: an atomic swap of an immutable directory snapshot (RCU-style) keeps lookups lock-free.
- Optimizations: pre-allocation avoids reallocating on every doubling; multi-level directories bound the copying cost.
What's next:
With directory doubling understood, we'll explore bucket splitting—the complementary operation that redistributes records when a bucket overflows. While doubling expands capacity, splitting actually uses that capacity by creating new buckets.
You now understand directory doubling—the global expansion operation that enables unbounded growth. The key insight is that doubling is metadata-only; no records move, making it orders of magnitude faster than static hashing's full rehash.