While global depth is the system-wide counter controlling directory size, local depth is the per-bucket counter that remembers each bucket's individual split history. Every bucket carries its local depth as metadata, and this single integer encodes crucial information about the bucket's creation, its relationship to the directory, and whether it can split independently.
If global depth is the orchestra conductor, local depth is each musician's individual tempo marking—together they coordinate to produce a harmonious, adaptive data structure.
By the end of this page, you will understand how local depth tracks bucket-specific history, why the difference (global_depth - local_depth) determines pointer count, how local depth governs split behavior, and the invariants that local depth must satisfy.
Formal Definition:
The local depth of a bucket B (denoted as d' or LD) is a non-negative integer that specifies how many leading bits of the hash value are shared by every key stored in B, and therefore how many bits the directory needs in order to address B.
The fundamental relationships:
```
Given:
- Global depth: d
- Local depth of bucket B: d'

Then:
- Number of directory entries pointing to B = 2^(d - d')
- The first d' bits of the hash of every key in B are identical
- B was created by (d' - initial_depth) splits from the original bucket

Examples (assuming d = 4, initial_depth = 1):

Bucket with d' = 4:
- Directory entries pointing here: 2^(4-4) = 1
- Keys share first 4 bits
- Created by 3 splits

Bucket with d' = 2:
- Directory entries pointing here: 2^(4-2) = 4
- Keys share first 2 bits
- Created by 1 split

Bucket with d' = 1:
- Directory entries pointing here: 2^(4-1) = 8
- Keys share first 1 bit
- This is an original bucket (never split)
```

What local depth means intuitively:
Local depth answers the question: "How specific is this bucket about which keys it accepts?"
As buckets split, they become more specific—each split doubles the specificity by considering one more bit of the hash value.
If you start with global_depth = 1 (two buckets, each with local_depth = 1), then after n splits on a particular bucket lineage, the resulting bucket has local_depth = 1 + n. Local depth literally counts the split history.
The most operationally important property of local depth is how it determines the number of directory entries pointing to a bucket.
The pointer count formula:
pointer_count(bucket) = 2^(global_depth - local_depth)
This formula always yields a whole number of pointers (at least one) because of the invariant 0 ≤ d' ≤ d: the exponent is never negative.
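A minimal sketch in Python, using the depths from the examples above, makes the arithmetic concrete (the function name is illustrative):

```python
def pointer_count(global_depth: int, local_depth: int) -> int:
    """Number of directory entries pointing at a bucket with the given local depth."""
    assert 0 <= local_depth <= global_depth, "invariant: local_depth <= global_depth"
    return 2 ** (global_depth - local_depth)

# Values from the d = 4 examples above
assert pointer_count(4, 4) == 1   # fully split bucket: exactly one pointer
assert pointer_count(4, 2) == 4   # four directory entries share this bucket
assert pointer_count(4, 1) == 8   # an original, never-split bucket
```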
Visual demonstration:
Picture a directory with global_depth = 3 (eight entries) pointing to three buckets: Bucket A with d' = 2, Bucket B with d' = 2, and Bucket C with d' = 1. The key observation is that each bucket's pointer count follows directly from the formula: A and B each receive 2^(3-2) = 2 directory entries, while C, having split less, receives 2^(3-1) = 4.
Verifying the invariant:
The total pointer count across all buckets must equal the directory size:
2^(3-2) + 2^(3-2) + 2^(3-1) = 2 + 2 + 4 = 8 = 2^3 ✓
For a bucket with local_depth d', the directory entries pointing to it are all indices where the first d' bits match the bucket's hash prefix. For Bucket C (d'=1, prefix '1'), entries 100, 101, 110, 111 all start with '1', so all four point to C.
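To make the prefix-matching rule concrete, here is a small illustrative sketch (the helper name and MSB-first indexing are assumptions carried over from this page, not a specific implementation's API) that enumerates the directory indices pointing to a bucket:

```python
def entries_pointing_to(global_depth: int, local_depth: int, prefix: int) -> list[int]:
    """Directory indices whose first `local_depth` bits equal `prefix`.

    `prefix` is the bucket's local_depth-bit hash prefix, e.g. 0b1 for Bucket C.
    """
    suffix_bits = global_depth - local_depth
    # Every combination of the remaining suffix bits maps to this bucket.
    return [(prefix << suffix_bits) | suffix for suffix in range(2 ** suffix_bits)]

# Bucket C from the example: global_depth = 3, local_depth = 1, prefix '1'
print([format(i, "03b") for i in entries_pointing_to(3, 1, 0b1)])
# ['100', '101', '110', '111']  -> all four entries point to Bucket C
```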
Local depth not only affects directory pointers but also constrains which keys can reside in a bucket.
The key constraint:
All keys in a bucket with local_depth d' must have the same first d' bits in their hash values.
This is what makes the bucket addressable—any directory entry that points to this bucket shares those d' leading bits.
Example walkthrough:
```
Scenario: Bucket with local_depth = 3, hash prefix = "010"

Valid keys (their hashes start with 010):
- hash("alice")   = 0b01001101... → prefix 010 ✓
- hash("zebra")   = 0b01011111... → prefix 010 ✓
- hash("data")    = 0b01000001... → prefix 010 ✓

Invalid keys (wrong prefix):
- hash("bob")     = 0b11001101... → prefix 110 ✗
- hash("cat")     = 0b00011111... → prefix 000 ✗
- hash("epsilon") = 0b01100001... → prefix 011 ✗

Key insight: After the first 3 bits (which must be 010), the remaining
hash bits can be anything. The bucket may contain keys with hashes
01000000..., 01001111..., 01010101..., etc.

When this bucket splits, bit 4 will differentiate:
- Bit 4 = 0: Keys stay in the bucket with prefix 0100
- Bit 4 = 1: Keys move to the new sibling with prefix 0101
```

Why this matters for correctness:
The local depth ensures that every key is findable. When searching for a key, the hash's first global_depth bits select a directory entry, and that entry's pointer leads to the bucket whose d'-bit prefix matches the start of the key's hash. Because every key was placed in the bucket matching its own prefix, the lookup always lands on the bucket that holds it.
If the invariant were violated (a key with wrong prefix in a bucket), lookups would fail to find existing keys.
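A lookup sketch, assuming a 32-bit hash and a directory stored as a plain list of dictionary-like buckets, shows why the prefix invariant guarantees the key is reachable:

```python
HASH_BITS = 32  # assumed hash width for this sketch

def lookup(directory: list, global_depth: int, key, hash_fn):
    """Find the value for `key` by indexing the directory with the first global_depth hash bits."""
    h = hash_fn(key)
    index = h >> (HASH_BITS - global_depth)      # first global_depth bits of the hash
    bucket = directory[index]
    # Invariant: every key in `bucket` shares its first local_depth bits with `index`,
    # so if the key exists anywhere, it must be in exactly this bucket.
    return bucket.get(key)
```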
A bucket containing a key whose hash prefix doesn't match the bucket's expected prefix indicates data corruption. Database systems can validate this invariant during consistency checks (e.g., DBCC CHECKDB in SQL Server, pg_amcheck in PostgreSQL).
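A generic version of such a check might look like the following sketch (purely illustrative; this is not the actual logic of DBCC CHECKDB or pg_amcheck, and the bucket attributes are assumed):

```python
def check_bucket_invariant(bucket, hash_fn, hash_bits: int = 32) -> bool:
    """Verify that every key's hash starts with the bucket's local_depth-bit prefix."""
    for key in bucket.keys():
        actual_prefix = hash_fn(key) >> (hash_bits - bucket.local_depth)
        if actual_prefix != bucket.prefix:
            return False   # corruption: this key is not addressable through this bucket
    return True
```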
Local depth increases when a bucket splits. This is the fundamental mechanism of extendible hashing evolution.
The split operation and local depth:
```
function splitBucket(bucket):
    old_local_depth = bucket.local_depth
    new_local_depth = old_local_depth + 1

    // Create sibling bucket with the same new depth
    sibling = createNewBucket(local_depth = new_local_depth)
    bucket.local_depth = new_local_depth

    // Redistribute keys based on the (old_local_depth + 1)th bit
    discriminating_bit = new_local_depth
    for each key in a copy of bucket.keys:
        h = hash(key)
        bit = extractBit(h, discriminating_bit)   // 0 or 1
        if bit == 1:
            sibling.insert(key, bucket.getValue(key))
            bucket.remove(key)

    // Update directory pointers
    updateDirectoryPointers(bucket, sibling)

// After split:
// - Original bucket has local_depth = old + 1
// - Sibling bucket has local_depth = old + 1
// - Each has half the original pointer count
//   (2^(d - (old+1)) = 2^(d - old) / 2)

// Example:
// Before: Bucket B with d' = 2, global d = 4
// - Pointers: 2^(4-2) = 4 entries point to B
//
// After split: Bucket B and Sibling S, both with d' = 3
// - Pointers to B: 2^(4-3) = 2
// - Pointers to S: 2^(4-3) = 2
// - Total: 2 + 2 = 4 (same as before) ✓
```

Pointer redistribution during split:
When a bucket with local_depth d' splits, both resulting buckets get local_depth d' + 1, and the 2^(d - d') directory entries that pointed to the original are divided evenly: half keep pointing to it and half are redirected to the new sibling.
The total pointer count is preserved: 2^(d - d') = 2^(d - d' - 1) + 2^(d - d' - 1)
| Before Split | Operation | After Split |
|---|---|---|
| Bucket A, d'=2 | Split triggered by overflow | Bucket A: d'=3, Bucket A': d'=3 |
| Entries 0100, 0101, 0110, 0111 → A | Bit 3 partitions entries | 0100,0101 → A; 0110,0111 → A' |
| 4 pointers to A | Each bucket gets half | 2 pointers to A, 2 pointers to A' |
When local_depth equals global_depth, the bucket has only 1 pointer. Splitting would require 0.5 pointers each—impossible! This is why d' = d overflow triggers directory doubling first, which increases d, creating room for the split.
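The resulting decision logic on overflow can be sketched as follows (the table and bucket methods are hypothetical names mirroring the pseudocode above, not a specific library's API):

```python
def handle_overflow(table, bucket) -> None:
    """When an insert overflows `bucket`, split it, doubling the directory first if needed."""
    if bucket.local_depth == table.global_depth:
        # Only one directory entry points here; make room by doubling the directory,
        # which increments global_depth and duplicates every existing pointer.
        table.double_directory()
    # Now local_depth < global_depth, so at least two entries point to the bucket
    # and the split can redistribute them between the bucket and its new sibling.
    table.split_bucket(bucket)
```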
A key feature of extendible hashing is that different buckets can have different local depths. This asymmetry reflects non-uniform data distribution.
Why asymmetry occurs:
Consider a scenario where keys hash non-uniformly, with most hash values starting with '1':
The '1...' bucket fills and splits repeatedly, increasing its local depth. The '0...' bucket remains small and never splits, keeping its original local depth.
Real-world example:
```
Scenario: Customer ID hashing where most customers are recent

Hash function tends to cluster recent IDs:
- Old customer IDs    → hash prefixes 00..., 01...
- Recent customer IDs → hash prefixes 10..., 11...

Initial state (d=2):
  Bucket 00: d'=2,  20 records
  Bucket 01: d'=2,  25 records
  Bucket 10: d'=2, 150 records → overflow!
  Bucket 11: d'=2, 180 records → overflow!

After multiple splits (d=4):
  Bucket 00:   d'=2, 20 records (4 directory ptrs)
  Bucket 01:   d'=2, 25 records (4 directory ptrs)
  Bucket 1000: d'=4, 35 records (1 directory ptr)
  Bucket 1001: d'=4, 38 records (1 directory ptr)
  Bucket 1010: d'=4, 40 records (1 directory ptr)
  Bucket 1011: d'=4, 37 records (1 directory ptr)
  Bucket 1100: d'=4, 42 records (1 directory ptr)
  Bucket 1101: d'=4, 45 records (1 directory ptr)
  Bucket 1110: d'=4, 48 records (1 directory ptr)
  Bucket 1111: d'=4, 43 records (1 directory ptr)

Observations:
- Sparse regions (00, 01) share pointers, saving buckets
- Dense regions (10.., 11..) are fully split
- Directory has 16 entries but only 10 buckets exist
- Space adapts to the actual data distribution
```

Benefits of asymmetric local depths:
The variance in local depths across buckets indicates data skew. In a perfectly uniform distribution, all buckets would have similar local depths. Large variance suggests hash function issues or inherently skewed data that might benefit from a different hash strategy.
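One way to surface this signal is to compute the spread of local depths across live buckets. A rough sketch, assuming an iterable of bucket objects that expose a local_depth attribute:

```python
from statistics import pvariance

def local_depth_skew(buckets) -> tuple[float, dict[int, int]]:
    """Return the variance of local depths and a histogram of depth -> bucket count."""
    depths = [b.local_depth for b in buckets]
    histogram: dict[int, int] = {}
    for d in depths:
        histogram[d] = histogram.get(d, 0) + 1
    return pvariance(depths), histogram

# A large variance (e.g. buckets at d' = 2 alongside buckets at d' = 4, as in the
# customer-ID example above) suggests skewed data or a weak hash function.
```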
Just as splitting increases local depth, merging (the inverse operation) decreases it. Merging is used when buckets become sparse after many deletions.
When can buckets merge?
Two buckets can merge if they have the same local depth, they are siblings (their prefixes differ only in the final bit), and their combined contents fit within a single bucket's capacity.
Identifying siblings:
Two buckets are siblings if their hash prefixes differ only in the last bit (the bit at position local_depth). Bucket with prefix 01010 and bucket with prefix 01011 are siblings if both have local_depth = 5.
```
function considerMerge(bucket):
    if bucket.local_depth == 0:
        return   // Cannot merge original bucket

    // Find sibling by flipping the last bit of prefix
    sibling_prefix = bucket.prefix XOR (1 << (total_bits - bucket.local_depth))
    sibling = findBucketByPrefix(sibling_prefix)

    if sibling == null:
        return   // Sibling doesn't exist (maybe already merged)
    if sibling.local_depth != bucket.local_depth:
        return   // Not actually siblings
    if bucket.size + sibling.size > BUCKET_CAPACITY:
        return   // Merged bucket would overflow

    // Perform merge
    mergeBuckets(bucket, sibling)

function mergeBuckets(bucket, sibling):
    // Move all entries from sibling to bucket
    for each entry in sibling:
        bucket.insert(entry)

    // Decrease local depth
    bucket.local_depth = bucket.local_depth - 1

    // Update directory: pointers that went to sibling now go to bucket
    for each directory_entry pointing to sibling:
        directory_entry = bucket

    // Free sibling bucket
    deallocate(sibling)

// After merge:
// - Bucket has local_depth decreased by 1
// - Bucket has 2x the pointer count
// - Sibling is deallocated

// Note: Directory shrinking is less common
// Even if all buckets could merge, the directory usually stays large
```

Directory shrinking (halving):
Theoretically, when every bucket has local_depth < global_depth, the directory could shrink. However, most systems don't implement automatic directory shrinking: the directory is already small relative to the data it indexes, checking eligibility means inspecting every bucket's local depth, and a workload that just shrank often grows right back.
Unlike splits (which are triggered by overflow), merges are often deferred or triggered by explicit compaction commands. Aggressive merging can lead to thrashing if data patterns fluctuate. Many systems skip automatic merging entirely.
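For systems that do shrink, the eligibility test is simple. A hedged sketch, assuming the MSB-first directory indexing used on this page and a directory held as a Python list of bucket references:

```python
def try_shrink_directory(table) -> bool:
    """Halve the directory if no bucket uses all global_depth bits."""
    if table.global_depth == 0:
        return False
    if any(b.local_depth == table.global_depth for b in table.buckets()):
        return False   # some bucket needs every directory bit; cannot shrink
    # With MSB-first indexing, entries 2i and 2i+1 differ only in the last bit,
    # so they point to the same bucket and keeping every other entry is safe.
    table.directory = table.directory[::2]
    table.global_depth -= 1
    return True
```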
In a physical implementation, local depth must be stored somewhere persistent. There are two common approaches:
Approach 1: Store in bucket header
Each bucket page contains a header with metadata including local depth:
```
┌──────────────────────────────────────────────────┐
│ Bucket Page Header                                │
│   ─ page_id: 4 bytes                              │
│   ─ local_depth: 1 byte      ← Local depth here   │
│   ─ record_count: 2 bytes                         │
│   ─ free_space: 2 bytes                           │
│   ─ next_overflow: 4 bytes (if using chains)      │
│   ─ checksum: 4 bytes                             │
├──────────────────────────────────────────────────┤
│ Slot Directory                                    │
│   [offset1, len1] [offset2, len2] ...             │
├──────────────────────────────────────────────────┤
│                                                   │
│ Record Storage Area                               │
│ (grows from bottom, slots from top)               │
│                                                   │
└──────────────────────────────────────────────────┘

Advantages:
- Local depth always available when reading bucket
- No additional metadata lookup required
- Self-describing pages for recovery

Disadvantages:
- Extra I/O to read local depth if only checking metadata
- Slight space overhead per page
```

Approach 2: Store in directory structure
The directory entry can store both the bucket pointer and local depth:
```c
#include <stdint.h>   // for uint32_t / uint8_t

// Directory entry structure with embedded local depth
struct DirectoryEntry {
    uint32_t bucket_page_id;   // 4 bytes: physical page location
    uint8_t  local_depth;      // 1 byte: bucket's local depth
    uint8_t  padding[3];       // 3 bytes: alignment
};
// Total: 8 bytes per entry

// Advantages:
// - Quick access to local depth without bucket I/O
// - Useful for split planning without reading bucket contents
//
// Disadvantages:
// - Redundant if local depth also in bucket header
// - Must keep both copies in sync

// Alternative: Compute from directory pattern
// If entries i and i+1 both point to same bucket, local_depth < global_depth
// Count consecutive identical pointers to derive local_depth
// But this is O(log n) lookup vs O(1) with stored depth
```

Most implementations store local depth in the bucket header and optionally cache it in the directory structure. The bucket header is authoritative; directory cache is for optimization. During recovery, local depths are verified by reading bucket headers.
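The "compute from the directory pattern" alternative mentioned in the comments can be sketched as follows (a linear-scan version for clarity; it assumes MSB-first indexing and a directory stored as a list of bucket object references):

```python
def derive_local_depth(directory: list, global_depth: int, index: int) -> int:
    """Infer a bucket's local depth by counting how many entries share its pointer."""
    bucket = directory[index]
    span = 1
    # The entries pointing to one bucket form a contiguous, power-of-two-sized, aligned run.
    while span < len(directory):
        start = (index // (span * 2)) * (span * 2)   # align to the next larger run
        if all(directory[i] is bucket for i in range(start, start + span * 2)):
            span *= 2
        else:
            break
    return global_depth - span.bit_length() + 1      # span = 2^(d - d')
```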
Local depth is the per-bucket counter that enables extendible hashing's adaptive behavior. The key concepts to carry forward:
- A bucket's local depth d' records how many hash bits it is specific about; all of its keys share those d' leading bits.
- Exactly 2^(global_depth - local_depth) directory entries point to the bucket, and 0 ≤ d' ≤ global_depth always holds.
- Splitting increments local depth and halves the pointer count; merging does the reverse.
- When d' = global_depth, an overflow forces directory doubling before the bucket can split.
- Buckets can have different local depths, letting dense hash regions split deeply while sparse regions stay coarse.
What's next:
Now that we understand both global depth and local depth, we'll explore directory doubling—the operation that increases global depth when a maximally-split bucket overflows. This is the most structurally significant operation in extendible hashing.
You now understand local depth—the per-bucket counter that enables adaptive, localized growth. The interplay between global_depth and local_depth is what makes extendible hashing simultaneously dynamic and efficient.