Chaining is the unsung hero of hash-based indexing. While the hash function gets the glory for its O(1) lookup promise, chaining quietly handles the messy reality of buckets that exceed capacity. Every production hash index relies on chaining to gracefully degrade rather than catastrophically fail when the ideal case breaks down.
In this deep dive, we move beyond the conceptual introduction to examine the engineering details that make chaining work in real database systems. We'll explore how chains are structured at the byte level, how concurrent access is managed, how chains interact with the buffer pool, and what optimizations distinguish amateur implementations from production-grade ones.
Mastering chaining implementation is essential for anyone working on storage engines, key-value stores, or database internals. Even if you never implement a hash index yourself, understanding these details helps you reason about performance characteristics and troubleshoot production issues.
By the end of this page, you will understand: (1) Detailed chain structure and page linkage, (2) Insertion, deletion, and lookup algorithms with chains, (3) Chain compaction and reorganization strategies, (4) Concurrency control for chained buckets, and (5) Production optimizations for chain performance.
A chain in a hash index is a singly-linked list of pages, all belonging to the same logical bucket. Understanding the precise structure is essential for implementing correct and efficient operations.
A chain comprises two types of pages:
Primary Page — The first page of the bucket, whose location is determined by the bucket ID. Always exists (even if empty).
Overflow Pages — Additional pages linked from the primary page when it fills. May be zero or more.
Each page in the chain has identical structure; the only difference is how they're located:
Primary pages are located by direct computation (page_offset = bucket_id × page_size); overflow pages are located by following the overflow_page_id link stored in the previous page's header.
```c
/* Detailed bucket page layout for chained hash index */

#define PAGE_SIZE        8192        /* 8 KB pages */
#define PAGE_HEADER_SIZE 32          /* Fixed header overhead */
#define SLOT_SIZE        2           /* 2 bytes per slot (offset) */
#define NULL_PAGE_ID     0xFFFFFFFF

/* Page header structure - exactly 32 bytes */
typedef struct ChainPageHeader {
    uint32_t page_id;            /* This page's unique identifier */
    uint32_t bucket_id;          /* Logical bucket this page belongs to */
    uint32_t overflow_page_id;   /* Next page in chain, or NULL_PAGE_ID */
    uint16_t entry_count;        /* Number of valid entries */
    uint16_t free_space_offset;  /* Offset where free space begins */
    uint16_t slot_count;         /* Total slots (may exceed entry_count after deletes) */
    uint16_t flags;              /* Page state flags */
    uint32_t checksum;           /* CRC32 for integrity */
    uint32_t reserved[2];        /* Alignment padding / future use (pads header to 32 bytes) */
} ChainPageHeader;

/* Flag definitions */
#define PAGE_FLAG_PRIMARY  0x0001   /* This is the primary bucket page */
#define PAGE_FLAG_OVERFLOW 0x0002   /* This is an overflow page */
#define PAGE_FLAG_LEAF     0x0004   /* Has no overflow (end of chain) */
#define PAGE_FLAG_SORTED   0x0008   /* Entries are sorted (optimization) */

/* Complete page structure */
typedef struct ChainPage {
    ChainPageHeader header;

    /* Slot directory: grows downward from after header.
     * Each slot is 2 bytes containing the offset to an entry.
     * Slot value 0x0000 indicates a deleted slot (tombstone). */
    uint16_t slots[(PAGE_SIZE - PAGE_HEADER_SIZE) / SLOT_SIZE];

    /* Entry data: grows upward from end of page.
     * Format: [key_len:2][key_data:variable][value_len:2][value_data:variable]
     * Or for fixed-size: [key:key_size][rid:rid_size] */
} ChainPage;

/* Calculate available space for entries */
static inline size_t page_usable_space(void) {
    return PAGE_SIZE - PAGE_HEADER_SIZE;
}

/* Check if page is end of chain */
static inline bool is_chain_end(const ChainPage *page) {
    return page->header.overflow_page_id == NULL_PAGE_ID;
}
```

Each page must track available space to determine when overflow is needed:
Usable Space = PAGE_SIZE - HEADER_SIZE
Used Space = (slot_count × 2) + Σ(entry_sizes)
Free Space = Usable Space - Used Space
The slot directory and entry region grow toward each other:
| Header | Slots → | ← Free Space → | ← Entries |
| 32B | ... | | ... |
When free space is insufficient for a new entry plus its slot, the page is considered full.
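The insertion code later on this page relies on a page_has_space check that the source never defines. The following is a minimal sketch that applies the formulas above directly; load_entry and compute_entry_size are the same helpers the page's other code assumes, and a production page would cache a running used-bytes counter instead of rescanning the slot directory:

```c
/* Sketch: does this page have room for `required` more bytes (new slot + entry)?
 * Walks the slot directory and sums live entry sizes per the formulas above.
 * Simplification: bytes still owned by tombstoned entries are not counted,
 * even though they remain allocated until the page is compacted. */
static bool page_has_space(ChainPage *page, size_t required) {
    size_t used = (size_t)page->header.slot_count * SLOT_SIZE;   /* Slot directory */

    for (uint16_t i = 0; i < page->header.slot_count; i++) {
        if (page->slots[i] == 0) continue;                       /* Skip tombstones */
        Entry entry = load_entry(page, page->slots[i]);
        used += compute_entry_size(&entry);                      /* Σ(entry_sizes) */
    }
    return page_usable_space() >= used + required;
}
```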
Deletions can create internal fragmentation—slots marked as deleted leave gaps in the entry region. Page compaction can reclaim this space without allocating overflow. Track fragmented space separately and compact when fragmentation exceeds a threshold (e.g., 25%).
Insertion into a chained bucket involves finding appropriate space for the new entry, potentially allocating overflow pages, and maintaining correct linkage.
```c
/* Insert entry into chained bucket */
typedef enum {
    INSERT_SUCCESS,
    INSERT_DUPLICATE,
    INSERT_NO_SPACE,
    INSERT_ERROR
} InsertStatus;

InsertStatus chain_insert(HashIndex *index,
                          const void *key, size_t key_len,
                          const void *value, size_t value_len,
                          bool allow_duplicates) {
    /* Calculate required space: slot + key header + key + value header + value */
    size_t entry_size = 2 + key_len + 2 + value_len;   /* Simplified */
    size_t required = SLOT_SIZE + entry_size;

    /* Compute target bucket */
    uint32_t bucket_id = compute_bucket(index, key, key_len);

    /* Acquire bucket lock (see concurrency section) */
    acquire_bucket_lock(index, bucket_id, LOCK_EXCLUSIVE);

    /* Load primary page */
    ChainPage *primary = load_primary_page(index, bucket_id);
    ChainPage *current = primary;
    ChainPage *last_with_space = NULL;

    /* Traverse chain */
    while (current != NULL) {
        /* Check for duplicate key if not allowed */
        if (!allow_duplicates) {
            int found = search_page_for_key(current, key, key_len);
            if (found >= 0) {
                release_bucket_lock(index, bucket_id);
                return INSERT_DUPLICATE;
            }
        }

        /* Track last page with sufficient space */
        if (page_has_space(current, required)) {
            last_with_space = current;
        }

        /* Move to next page in chain */
        if (!is_chain_end(current)) {
            current = load_page(index, current->header.overflow_page_id);
        } else {
            current = NULL;
        }
    }

    /* Determine insertion target */
    ChainPage *target;
    if (last_with_space != NULL) {
        /* Insert into existing page with space */
        target = last_with_space;
    } else {
        /* All pages full - need overflow */
        target = allocate_overflow_page(index, bucket_id);

        /* Find end of chain and link new page */
        ChainPage *tail = find_chain_tail(index, primary);
        tail->header.overflow_page_id = target->header.page_id;
        tail->header.flags &= ~PAGE_FLAG_LEAF;
        mark_dirty(tail);
    }

    /* Insert entry into target page */
    insert_entry_into_page(target, key, key_len, value, value_len);
    mark_dirty(target);

    /* Update statistics */
    index->metadata.record_count++;
    mark_metadata_dirty(index);

    release_bucket_lock(index, bucket_id);
    return INSERT_SUCCESS;
}
```

Several strategies exist for choosing where in the chain to insert:
Strategy 1: First-Fit (Insert at first page with space)
Strategy 2: Tail-Insert (Always insert at end of chain)
Strategy 3: Sorted-Insert (Maintain sorted order within pages)
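To make the difference between the first two strategies concrete, here is a minimal sketch of a policy-driven target selector. PlacementPolicy and choose_insert_target are hypothetical names, and the helpers are the ones used elsewhere on this page; chain_insert above effectively implements a "last page with space" variant.

```c
/* Sketch: pick the target page for an insert according to a placement policy.
 * Returns NULL if no existing page has room, in which case the caller
 * allocates an overflow page exactly as chain_insert does. */
typedef enum { PLACE_FIRST_FIT, PLACE_TAIL_INSERT } PlacementPolicy;

static ChainPage *choose_insert_target(HashIndex *index, ChainPage *primary,
                                       size_t required, PlacementPolicy policy) {
    ChainPage *current = primary;
    ChainPage *tail = primary;

    while (current != NULL) {
        if (policy == PLACE_FIRST_FIT && page_has_space(current, required)) {
            return current;                     /* First page with enough room */
        }
        tail = current;
        current = is_chain_end(current) ? NULL
                : load_page(index, current->header.overflow_page_id);
    }

    /* Tail-insert only ever considers the last page in the chain */
    if (policy == PLACE_TAIL_INSERT && page_has_space(tail, required)) {
        return tail;
    }
    return NULL;
}
```

Tail-insert avoids checking earlier pages for space but tends to grow chains faster; first-fit keeps pages fuller at the cost of a longer scan.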
If duplicates are allowed (e.g., a secondary index on a non-unique column), skip the duplicate-check traversal and insert directly. This reduces the insertion itself from O(chain_length) to O(1), though finding a page with space or allocating overflow may still traverse the chain.
Lookup is the most frequent operation on hash indexes, and chain traversal is its critical path. Optimizing lookup through chains directly impacts query performance.
```c
/* Lookup entry in chained bucket */
typedef struct {
    bool found;
    void *value;
    size_t value_len;
    uint32_t pages_scanned;
    uint32_t comparisons;
} LookupResult;

LookupResult chain_lookup(HashIndex *index, const void *key, size_t key_len) {
    LookupResult result = {0};

    /* Compute target bucket */
    uint32_t bucket_id = compute_bucket(index, key, key_len);

    /* Acquire shared lock (concurrent reads allowed) */
    acquire_bucket_lock(index, bucket_id, LOCK_SHARED);

    /* Load primary page */
    ChainPage *current = load_primary_page(index, bucket_id);

    while (current != NULL) {
        result.pages_scanned++;

        /* Search within current page */
        for (uint16_t i = 0; i < current->header.slot_count; i++) {
            uint16_t slot = current->slots[i];

            /* Skip deleted slots */
            if (slot == 0) continue;

            /* Load entry */
            Entry entry = load_entry(current, slot);
            result.comparisons++;

            /* Compare key */
            if (entry.key_len == key_len &&
                memcmp(entry.key, key, key_len) == 0) {
                /* Found! Copy value to result */
                result.found = true;
                result.value = malloc(entry.value_len);
                memcpy(result.value, entry.value, entry.value_len);
                result.value_len = entry.value_len;

                release_bucket_lock(index, bucket_id);
                return result;
            }
        }

        /* Move to next page */
        if (!is_chain_end(current)) {
            current = load_page(index, current->header.overflow_page_id);
        } else {
            current = NULL;
        }
    }

    /* Key not found */
    release_bucket_lock(index, bucket_id);
    return result;
}
```

Bloom Filter Pre-check:
Before traversing the chain, check a Bloom filter to determine if the key might exist:
```c
if (!bloom_might_contain(bucket->bloom, key, key_len)) {
    return NOT_FOUND;   /* Guaranteed not present */
}
/* Proceed with chain traversal */
```
Bloom filters have false positives but no false negatives—a negative result saves the entire chain traversal.
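A per-bucket filter can be as small as a few hundred bytes. The sketch below is illustrative only and not from the source; BucketBloom, hash1, and hash2 are assumed names (any two independent hash functions work). Note that a plain Bloom filter cannot unset bits, so deletions require periodic rebuilds or a counting variant.

```c
#define BLOOM_BITS 1024                      /* 128 bytes of filter per bucket */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} BucketBloom;

/* Called on every insert into the bucket */
static void bloom_add(BucketBloom *b, const void *key, size_t key_len) {
    uint32_t h1 = hash1(key, key_len) % BLOOM_BITS;
    uint32_t h2 = hash2(key, key_len) % BLOOM_BITS;
    b->bits[h1 / 8] |= (uint8_t)(1u << (h1 % 8));
    b->bits[h2 / 8] |= (uint8_t)(1u << (h2 % 8));
}

/* Called before chain traversal; false means the key is definitely absent */
static bool bloom_might_contain(const BucketBloom *b, const void *key, size_t key_len) {
    uint32_t h1 = hash1(key, key_len) % BLOOM_BITS;
    uint32_t h2 = hash2(key, key_len) % BLOOM_BITS;
    return (b->bits[h1 / 8] & (1u << (h1 % 8))) &&
           (b->bits[h2 / 8] & (1u << (h2 % 8)));
}
```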
Hash Signature Comparison:
Store a truncated hash signature with each entry. Compare signatures before full key comparison:
```c
uint32_t looking_for = compute_signature(key, key_len);   /* e.g. truncated hash */

for (uint16_t i = 0; i < page->header.slot_count; i++) {
    if (page->slots[i] == 0) continue;                     /* Skip deleted slots */
    Entry entry = load_entry(page, page->slots[i]);
    if (entry.signature == looking_for &&
        full_key_match(entry, key, key_len)) {
        return FOUND;
    }
}
```
Signature comparison (4-byte integer) is much faster than variable-length key comparison.
Binary Search Within Sorted Pages:
If pages maintain sorted order:
```c
int slot = binary_search(page, key, key_len);
if (slot >= 0) {
    return &page->entries[slot];
}
```
Reduces per-page comparisons from O(entries) to O(log entries).
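A sketch of that per-page search, assuming PAGE_FLAG_SORTED pages keep their slot directory free of tombstones (otherwise the ordering breaks) and that compare_entry_key is a hypothetical comparator over the stored key bytes:

```c
/* Sketch: binary search over a sorted slot directory; returns the slot index
 * of the matching entry or -1 if the key is not on this page. */
static int page_binary_search(ChainPage *page, const void *key, size_t key_len) {
    int lo = 0, hi = (int)page->header.slot_count - 1;

    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        Entry entry = load_entry(page, page->slots[mid]);
        int cmp = compare_entry_key(&entry, key, key_len);   /* <0, 0, >0 */

        if (cmp == 0) return mid;
        if (cmp < 0)  lo = mid + 1;   /* Stored key sorts before search key */
        else          hi = mid - 1;
    }
    return -1;
}
```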
When searching for a unique key (e.g., primary key index), return immediately upon finding the first match. Don't traverse the rest of the chain. For non-unique indexes searching for all matches, the entire chain must be traversed.
Deletion from chains is more complex than insertion because it must locate the entry anywhere in the chain, remove it from its page (by tombstone or immediate compaction), and unlink overflow pages that become empty.
```c
/* Delete entry from chained bucket */
typedef enum {
    DELETE_SUCCESS,
    DELETE_NOT_FOUND,
    DELETE_ERROR
} DeleteStatus;

DeleteStatus chain_delete(HashIndex *index, const void *key, size_t key_len) {
    /* Compute target bucket */
    uint32_t bucket_id = compute_bucket(index, key, key_len);

    /* Acquire exclusive lock */
    acquire_bucket_lock(index, bucket_id, LOCK_EXCLUSIVE);

    /* Traverse chain to find entry */
    ChainPage *current = load_primary_page(index, bucket_id);
    ChainPage *prev = NULL;

    while (current != NULL) {
        for (uint16_t i = 0; i < current->header.slot_count; i++) {
            uint16_t slot = current->slots[i];
            if (slot == 0) continue;   /* Skip deleted */

            Entry entry = load_entry(current, slot);

            if (entry.key_len == key_len &&
                memcmp(entry.key, key, key_len) == 0) {
                /* Found - delete it */
                delete_entry_from_page(current, i);
                mark_dirty(current);

                /* Check if page is now empty (and not primary) */
                if (current->header.entry_count == 0 &&
                    !(current->header.flags & PAGE_FLAG_PRIMARY)) {
                    /* Unlink and deallocate this page */
                    unlink_overflow_page(index, prev, current);
                }

                /* Update statistics */
                index->metadata.record_count--;
                mark_metadata_dirty(index);

                release_bucket_lock(index, bucket_id);
                return DELETE_SUCCESS;
            }
        }

        prev = current;
        if (!is_chain_end(current)) {
            current = load_page(index, current->header.overflow_page_id);
        } else {
            current = NULL;
        }
    }

    release_bucket_lock(index, bucket_id);
    return DELETE_NOT_FOUND;
}

/* Remove entry from page using tombstone or compaction */
void delete_entry_from_page(ChainPage *page, uint16_t slot_index) {
    /* Option 1: Tombstone (mark slot as deleted) */
    page->slots[slot_index] = 0;   /* 0 indicates deleted */
    page->header.entry_count--;

    /* Note: Entry space is not reclaimed until compaction.
     * Track fragmented space: page->fragmented += entry_size
     * Trigger compaction when fragmented > threshold */
}

/* Unlink overflow page from chain */
void unlink_overflow_page(HashIndex *index, ChainPage *prev, ChainPage *to_remove) {
    if (prev != NULL) {
        /* Update prev to skip removed page */
        prev->header.overflow_page_id = to_remove->header.overflow_page_id;
        if (prev->header.overflow_page_id == NULL_PAGE_ID) {
            prev->header.flags |= PAGE_FLAG_LEAF;
        }
        mark_dirty(prev);
    }

    /* Return page to free list */
    free_overflow_page(index, to_remove->header.page_id);
}
```

Tombstone Deletion (Lazy): Mark the slot as deleted (as delete_entry_from_page does above) and leave the entry bytes in place; space is reclaimed only when the page is later compacted.
Immediate Compaction: Slide the remaining entries on every delete so the freed bytes are reclaimed right away. Each delete costs more, but pages never accumulate dead space.
Hybrid Approach: Tombstone by default and compact a page once its fragmented space crosses a threshold, trading a little wasted space for consistently cheap deletes.
For non-unique indexes, decide whether DELETE removes one matching entry or all matching entries. The algorithm above removes the first match; to remove all, continue traversing after each deletion and track the number removed (a sketch follows). This choice affects both semantics and performance.
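A minimal sketch of the delete-all variant, reusing the helpers above; for brevity it leaves empty overflow pages in place instead of unlinking them as chain_delete does, and it reports how many entries were removed:

```c
/* Sketch: remove every entry matching `key` and return the number removed */
size_t chain_delete_all(HashIndex *index, const void *key, size_t key_len) {
    uint32_t bucket_id = compute_bucket(index, key, key_len);
    size_t removed = 0;

    acquire_bucket_lock(index, bucket_id, LOCK_EXCLUSIVE);

    ChainPage *current = load_primary_page(index, bucket_id);
    while (current != NULL) {
        size_t removed_here = 0;

        for (uint16_t i = 0; i < current->header.slot_count; i++) {
            if (current->slots[i] == 0) continue;            /* Skip tombstones */
            Entry entry = load_entry(current, current->slots[i]);
            if (entry.key_len == key_len &&
                memcmp(entry.key, key, key_len) == 0) {
                delete_entry_from_page(current, i);          /* Tombstone it */
                removed_here++;                              /* Keep scanning */
            }
        }

        if (removed_here > 0) mark_dirty(current);
        removed += removed_here;
        current = is_chain_end(current) ? NULL
                : load_page(index, current->header.overflow_page_id);
    }

    if (removed > 0) {
        index->metadata.record_count -= removed;
        mark_metadata_dirty(index);
    }
    release_bucket_lock(index, bucket_id);
    return removed;
}
```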
Over time, deletion operations create fragmented chains with wasted space. Compaction consolidates entries to reclaim space and potentially reduce chain length.
Within a single page, slide entries to eliminate gaps left by deletions:
```c
/* Compact entries within a single page */
void compact_page(ChainPage *page) {
    /* Skip if already compact */
    if (!page_needs_compaction(page)) return;

    /* Temporary buffer for rebuilding entry region */
    char temp_entries[PAGE_SIZE];
    size_t write_offset = PAGE_SIZE;   /* Build from end backward */
    uint16_t new_slot_count = 0;

    /* Copy live entries to temp buffer */
    for (uint16_t i = 0; i < page->header.slot_count; i++) {
        if (page->slots[i] == 0) continue;   /* Skip deleted */

        Entry entry = load_entry(page, page->slots[i]);
        size_t entry_size = compute_entry_size(&entry);

        write_offset -= entry_size;
        memcpy(temp_entries + write_offset, &entry, entry_size);
        page->slots[new_slot_count++] = write_offset;
    }

    /* Copy compacted entries back to page */
    memcpy(page->data + write_offset,
           temp_entries + write_offset,
           PAGE_SIZE - write_offset);

    /* Update page header */
    page->header.slot_count = new_slot_count;
    page->header.free_space_offset = PAGE_HEADER_SIZE +
                                     (new_slot_count * SLOT_SIZE);

    /* Clear fragmentation tracking */
    page->fragmented_space = 0;

    mark_dirty(page);
}
```

Move entries between pages in a chain to reduce the number of overflow pages:
```c
/* Compact entire chain, merging underfull pages */
void compact_chain(HashIndex *index, uint32_t bucket_id) {
    acquire_bucket_lock(index, bucket_id, LOCK_EXCLUSIVE);

    ChainPage *primary = load_primary_page(index, bucket_id);

    /* First pass: compact individual pages */
    ChainPage *current = primary;
    while (current != NULL) {
        compact_page(current);
        current = is_chain_end(current) ? NULL :
                  load_page(index, current->header.overflow_page_id);
    }

    /* Second pass: merge underfull pages */
    current = primary;
    ChainPage *prev = NULL;

    while (current != NULL && !is_chain_end(current)) {
        ChainPage *next = load_page(index, current->header.overflow_page_id);

        /* Try to merge next into current */
        size_t current_used = page_used_space(current);
        size_t next_used = page_used_space(next);
        size_t available = page_usable_space() - current_used;

        if (next_used <= available) {
            /* Merge next into current */
            move_all_entries(next, current);

            /* Unlink and free next */
            current->header.overflow_page_id = next->header.overflow_page_id;
            if (is_chain_end(current)) {
                current->header.flags |= PAGE_FLAG_LEAF;
            }
            mark_dirty(current);
            free_overflow_page(index, next->header.page_id);

            /* Don't advance current - try to merge more */
            continue;
        }

        prev = current;
        current = next;
    }

    release_bucket_lock(index, bucket_id);
}
```

| Strategy | When to Use | Cost | Benefit |
|---|---|---|---|
| Intra-page only | High delete rate, low space pressure | O(entries) | Reclaims space within pages |
| Full chain compact | Periodic maintenance | O(chain × entries) | Reduces chain length |
| Background compact | Continuous, low priority | Distributed over time | Minimal foreground impact |
| Deferred (never) | Read-heavy, few deletes | None | Maximum read performance |
Typical compaction thresholds: (1) page fragmentation above 30% triggers an intra-page compact, (2) a potential chain-length reduction of more than 20% triggers a chain merge, (3) avoid compacting during peak load—schedule maintenance windows or use background threads.
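compact_page above guards on a page_needs_compaction check the source does not define. Here is a minimal sketch, assuming each page tracks fragmented_space (bytes owned by tombstoned entries), as the deletion notes and compact_page itself imply:

```c
#define FRAG_THRESHOLD_PCT 30   /* Compact once >30% of usable space is dead */

/* Sketch: trigger intra-page compaction when tombstoned bytes cross the threshold */
static bool page_needs_compaction(const ChainPage *page) {
    return (page->fragmented_space * 100) >= (page_usable_space() * FRAG_THRESHOLD_PCT);
}
```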
In multi-threaded or multi-process database systems, concurrent access to chained buckets requires careful synchronization to prevent race conditions and maintain consistency.
The simplest approach: one lock per bucket controlling access to the entire chain.
```c
/* Bucket-level read-write locks */
typedef struct {
    pthread_rwlock_t *bucket_locks;   /* Array of N locks, one per bucket */
    uint32_t num_buckets;
} BucketLockManager;

void init_bucket_locks(BucketLockManager *mgr, uint32_t num_buckets) {
    mgr->num_buckets = num_buckets;
    mgr->bucket_locks = calloc(num_buckets, sizeof(pthread_rwlock_t));
    for (uint32_t i = 0; i < num_buckets; i++) {
        pthread_rwlock_init(&mgr->bucket_locks[i], NULL);
    }
}

void acquire_bucket_lock(BucketLockManager *mgr, uint32_t bucket_id, LockMode mode) {
    pthread_rwlock_t *lock = &mgr->bucket_locks[bucket_id];
    if (mode == LOCK_SHARED) {
        pthread_rwlock_rdlock(lock);   /* Multiple readers OK */
    } else {
        pthread_rwlock_wrlock(lock);   /* Exclusive for writes */
    }
}

void release_bucket_lock(BucketLockManager *mgr, uint32_t bucket_id) {
    pthread_rwlock_unlock(&mgr->bucket_locks[bucket_id]);
}

/* Usage pattern */
LookupResult safe_lookup(HashIndex *index, const void *key, size_t key_len) {
    uint32_t bucket = compute_bucket(index, key, key_len);

    acquire_bucket_lock(&index->lock_mgr, bucket, LOCK_SHARED);
    LookupResult result = chain_lookup(index, key, key_len);
    release_bucket_lock(&index->lock_mgr, bucket);

    return result;
}
```

With many buckets, per-bucket locks consume significant memory. Striped locking maps multiple buckets to fewer locks:
```c
#define NUM_LOCK_STRIPES 1024

uint32_t get_lock_stripe(uint32_t bucket_id) {
    return bucket_id % NUM_LOCK_STRIPES;
}
```
Tradeoff: Fewer locks = less memory but more false contention (unrelated buckets share locks).
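A sketch of how the acquire path changes under striping, assuming the lock array from init_bucket_locks is sized to NUM_LOCK_STRIPES rather than the bucket count:

```c
/* Sketch: striped acquire; unrelated buckets that hash to the same stripe
 * serialize on the same rwlock (the false contention noted above). */
void acquire_bucket_lock_striped(BucketLockManager *mgr, uint32_t bucket_id, LockMode mode) {
    pthread_rwlock_t *lock = &mgr->bucket_locks[get_lock_stripe(bucket_id)];
    if (mode == LOCK_SHARED) pthread_rwlock_rdlock(lock);
    else                     pthread_rwlock_wrlock(lock);
}
```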
For finer-grained concurrency, latch individual pages rather than whole buckets:
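The source does not spell out the latching protocol, so the following is a minimal sketch of latch coupling ("crabbing") along a chain: hold the current page's latch while acquiring the next page's, then release the first. latch_page_shared and unlatch_page are hypothetical buffer-pool calls.

```c
/* Sketch: read-only chain traversal with per-page shared latches */
bool lookup_with_page_latches(HashIndex *index, const void *key, size_t key_len) {
    uint32_t bucket_id = compute_bucket(index, key, key_len);

    ChainPage *current = load_primary_page(index, bucket_id);
    latch_page_shared(index->buffer_pool, current);

    while (current != NULL) {
        if (search_page_for_key(current, key, key_len) >= 0) {
            unlatch_page(index->buffer_pool, current);
            return true;
        }

        ChainPage *next = NULL;
        if (!is_chain_end(current)) {
            next = load_page(index, current->header.overflow_page_id);
            latch_page_shared(index->buffer_pool, next);   /* Latch next page first... */
        }
        unlatch_page(index->buffer_pool, current);          /* ...then release the current one */
        current = next;
    }
    return false;   /* Key not found; the last latch was released in the loop */
}
```

Writers take exclusive latches the same way; the benefit is that readers and writers working on different pages of the same long chain no longer block each other.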
When multiple locks are held simultaneously (e.g., bucket lock + overflow allocation lock), always acquire in a consistent order: (1) Bucket lock first, (2) Overflow manager lock second. Violating this order risks deadlock.
Production hash indexes employ numerous optimizations beyond the basic algorithms. These optimizations can provide order-of-magnitude improvements for specific workloads.
When traversing a chain, prefetch subsequent pages while processing the current one:
```c
/* Prefetch-optimized chain traversal */
LookupResult lookup_with_prefetch(HashIndex *index, const void *key, size_t key_len) {
    uint32_t bucket_id = compute_bucket(index, key, key_len);
    ChainPage *current = load_primary_page(index, bucket_id);

    while (current != NULL) {
        /* Issue async prefetch for next page while processing current */
        if (!is_chain_end(current)) {
            buffer_prefetch_async(index->buffer_pool,
                                  current->header.overflow_page_id);
        }

        /* Search current page */
        int found = search_page_for_key(current, key, key_len);
        if (found >= 0) {
            return make_result(current, found);
        }

        /* Next page should now be in buffer pool */
        current = is_chain_end(current) ? NULL :
                  load_page(index, current->header.overflow_page_id);
    }

    return NOT_FOUND_RESULT;
}
```

For frequently accessed entries, maintain a small LRU cache in front of the hash index:
```c
LookupResult cached_lookup(HashIndex *index, const void *key, size_t key_len) {
    Entry *cached = lru_cache_get(index->hot_cache, key, key_len);
    if (cached != NULL) {
        return make_result_from_entry(cached);
    }

    LookupResult result = chain_lookup(index, key, key_len);
    if (result.found) {
        lru_cache_put(index->hot_cache, key, key_len, result.value);
    }
    return result;
}
```
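One caveat the snippet leaves implicit: the hot cache must be invalidated whenever the underlying entry changes, or reads can return stale values. A minimal sketch, assuming an lru_cache_remove call exists alongside lru_cache_get/put (it is not shown in the source):

```c
/* Sketch: invalidate the hot cache before mutating the index */
DeleteStatus cached_delete(HashIndex *index, const void *key, size_t key_len) {
    lru_cache_remove(index->hot_cache, key, key_len);   /* Hypothetical invalidation call */
    return chain_delete(index, key, key_len);
}
```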
For bulk inserts, sort entries by bucket first, then process each bucket once:
```c
/* Bulk insert optimization */
void bulk_insert_sorted(HashIndex *index, Entry *entries, size_t count) {
    /* Compute bucket for each entry */
    for (size_t i = 0; i < count; i++) {
        entries[i].bucket_id = compute_bucket(index,
                                              entries[i].key,
                                              entries[i].key_len);
    }

    /* Sort by bucket ID */
    qsort(entries, count, sizeof(Entry), compare_by_bucket);

    /* Process each bucket's entries together */
    size_t i = 0;
    while (i < count) {
        uint32_t bucket_id = entries[i].bucket_id;

        acquire_bucket_lock(&index->lock_mgr, bucket_id, LOCK_EXCLUSIVE);

        /* Insert all entries for this bucket */
        while (i < count && entries[i].bucket_id == bucket_id) {
            insert_entry_no_lock(index, &entries[i]);
            i++;
        }

        release_bucket_lock(&index->lock_mgr, bucket_id);
    }
}
```

We have taken a deep dive into chaining—the dominant overflow handling mechanism in database hash indexes. From page structure to production optimizations, these details form the engineering foundation for high-performance hash-based indexing.
What's Next:
With the complete picture of static hashing—buckets, hash functions, overflow handling, and chaining—we now turn to performance analysis. We'll examine how to model, measure, and predict the performance characteristics of static hash indexes across different workloads.
You now possess deep knowledge of chaining implementation in hash indexes. This understanding enables you to work on storage engine internals, diagnose performance issues, and make informed decisions about hash index design and tuning.