B+-trees present a fascinating duality: from above, they appear as a tree structure optimized for logarithmic search; from the bottom, they appear as a linked list of sorted data pages. This dual nature is not accidental—it's the source of B+-trees' power for database workloads.
The key mechanism enabling this duality is leaf node linking: each leaf maintains a pointer to its next sibling (and often to its previous sibling as well). This seemingly simple addition transforms the B+-tree from a point lookup structure into a sequential access powerhouse, capable of efficiently handling range queries, sorted output, and aggregate operations.
By the end of this page, you will understand exactly how leaf node linking works, why it's essential for database operations beyond point lookups, and how it affects both query execution and tree maintenance. You'll see how this structure enables the range queries and ORDER BY optimizations that databases rely upon.
In a B+-tree, all leaf nodes are connected through sibling pointers, forming a chain that preserves key ordering. The structure can be visualized as:
```text
                  [Root]
                 /      \
        [Internal]      [Internal]
        /    |    \       /    \
[L1] ↔ [L2] ↔ [L3] ↔ [L4] ↔ [L5]
  ↑                           ↑
 Head                        Tail
```
Invariant: The leaves form a sorted sequence. If you traverse from the leftmost leaf to the rightmost leaf following sibling pointers, you visit all keys in sorted order.

Each leaf node contains:
| Component | Size (typical) | Purpose |
|---|---|---|
| Page Header | 16-64 bytes | Page type, LSN, checksum, entry count |
| Next Pointer | 8 bytes | Link to right sibling leaf page |
| Prev Pointer | 8 bytes (optional) | Link to left sibling leaf page |
| High Key (optional) | Variable | Highest key in this leaf (for searches) |
| Key-Value Entries | Page size - overhead | Actual data records |
Some implementations use singly-linked leaves (next pointer only), while others use doubly-linked (next and prev). Doubly-linked leaves enable efficient backward scans (ORDER BY DESC) but add slight overhead during splits. Most modern databases use doubly-linked leaves.
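To make the layout concrete, here is a minimal Python sketch of a doubly-linked leaf. The field names mirror the table above, but this is an illustrative in-memory model, not any particular database's page format; the later sketches on this page reuse this class:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Leaf:
    """Minimal in-memory stand-in for a leaf page (not a real on-disk layout)."""
    keys: list = field(default_factory=list)     # sorted keys
    values: list = field(default_factory=list)   # payloads, parallel to keys
    next: Optional["Leaf"] = None                # right sibling (the next pointer)
    prev: Optional["Leaf"] = None                # left sibling (doubly-linked variant)
    high_key: Optional[int] = None               # optional upper bound; see B-link trees below
```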
The primary motivation for leaf linking is efficient range query execution. Consider the query:
```sql
SELECT * FROM employees WHERE salary BETWEEN 50000 AND 75000;
```
Without Leaf Linking (B-Tree approach): locate the first key >= 50000 via a root-to-leaf descent, then find each successor by climbing back up through internal nodes and descending again. This yo-yo traversal is expensive: each record might require navigating multiple levels.

With Leaf Linking (B+-Tree approach): descend once to the leaf holding the first key >= 50000, then follow sibling pointers rightward until a key exceeds 75000. The entire scan stays at the leaf level: no tree navigation during the range portion.
Algorithm: Range Scan
```text
RangeScan(tree, low_key, high_key):
    // Phase 1: Find starting point (logarithmic)
    node = tree.root
    while node is not a leaf:
        node = find_child(node, low_key)
    leaf = node

    // Find first key >= low_key within leaf
    position = binary_search(leaf, low_key)

    // Phase 2: Sequential scan (linear in result size)
    results = []
    while true:
        while position < leaf.entry_count:
            if leaf.keys[position] > high_key:
                return results            // done: passed the upper bound
            results.append(leaf.entries[position])
            position++
        // Move to next leaf
        if leaf.next == null:
            return results                // no more leaves
        leaf = fetch_page(leaf.next)
        position = 0
```
Time Complexity: Phase 1 costs O(log n) page accesses for the root-to-leaf descent; Phase 2 costs O(k/B) page accesses, where k is the number of matching records and B is the number of entries per page.

This algorithm is I/O-optimal for range queries: you cannot report k contiguous records from a tree of n keys with fewer than O(log n + k/B) page accesses, and the leaf-linked scan achieves this theoretical optimum.
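The pseudocode translates almost directly into Python. Below is a runnable sketch of Phase 2 against a hand-built leaf chain, using the `Leaf` class from the structure sketch earlier; Phase 1's tree descent is stood in for by `bisect` on a known starting leaf:

```python
import bisect
# Uses the Leaf dataclass from the structure sketch above.

def range_scan(leaf: "Leaf", position: int, high_key) -> list:
    """Phase 2: walk sibling pointers, collecting entries until high_key is passed."""
    results = []
    while leaf is not None:
        while position < len(leaf.keys):
            if leaf.keys[position] > high_key:
                return results
            results.append(leaf.values[position])
            position += 1
        leaf, position = leaf.next, 0   # hop right; no tree navigation needed
    return results

# Three linked leaves holding keys 10..90
a = Leaf(keys=[10, 20, 30], values=["v10", "v20", "v30"])
b = Leaf(keys=[40, 50, 60], values=["v40", "v50", "v60"])
c = Leaf(keys=[70, 80, 90], values=["v70", "v80", "v90"])
a.next, b.next = b, c

# Phase 1 would descend to leaf `a`; find the first key >= 25 inside it
start = bisect.bisect_left(a.keys, 25)
print(range_scan(a, start, 65))   # ['v30', 'v40', 'v50', 'v60']
```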
Range queries are just one instance of sequential access. Leaf linking enables efficient execution of many common database operations:
- Date ranges: `WHERE date BETWEEN '2024-01-01' AND '2024-12-31'`
- Open-ended ranges: `WHERE price > 100` scans from the first match to the end
- Prefix matches: `WHERE name LIKE 'Smith%'` scans contiguous keys
- Sorted output: `SELECT * FROM t ORDER BY indexed_col` - just scan leaves in order
- Min/Max: `SELECT MIN(col)` - fetch the leftmost leaf; `SELECT MAX(col)` - fetch the rightmost leaf
- Index-only scans: `SELECT indexed_col FROM t` - scan all leaves without touching the heap
- Aggregates: `SELECT COUNT(*), SUM(col)` - aggregate while scanning leaves

ORDER BY Optimization:
Without leaf linking, a query like:
```sql
SELECT * FROM employees ORDER BY last_name;
```

...would require either: (1) reading all rows and performing an explicit sort, expensive for large tables, or (2) repeatedly re-descending the tree to find each successive key, which is even worse.

With leaf linking on a last_name index: descend once to the leftmost leaf, then follow sibling pointers, emitting rows that are already in sorted order. No sort step is needed.
This is why database query optimizers often choose index scans even when most rows are selected: the free ordering avoids an expensive sort operation.
Reverse Scans (ORDER BY DESC):

With doubly-linked leaves, the same technique runs in reverse:

```sql
SELECT * FROM employees ORDER BY last_name DESC;
```

...is answered by starting at the rightmost leaf and following prev pointers leftward.
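A sketch of the mirror-image backward scan, again using the doubly-linked `Leaf` class from earlier; the descent locates the last key <= the upper bound, then the walk follows prev pointers:

```python
def range_scan_desc(leaf: "Leaf", position: int, low_key) -> list:
    """Backward scan: emits matching entries in descending key order."""
    results = []
    while leaf is not None:
        while position >= 0:
            if leaf.keys[position] < low_key:
                return results
            results.append(leaf.values[position])
            position -= 1
        leaf = leaf.prev                               # hop left via the prev pointer
        position = len(leaf.keys) - 1 if leaf else -1
    return results
```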
If all needed columns are in the index, the scan never touches the base table—this is an 'index-only scan' or 'covering index scan'. Leaf linking makes this extremely efficient since all data is accessed sequentially in the leaves.
The true power of leaf linking is revealed when considering physical disk characteristics. Modern storage devices have drastically different performance for sequential vs. random access:
| Storage Type | Sequential Read | Random Read | Ratio |
|---|---|---|---|
| HDD (7200 RPM) | 200 MB/s | 100 IOPS × 4KB = 0.4 MB/s | 500:1 |
| SATA SSD | 550 MB/s | 50,000 IOPS × 4KB = 200 MB/s | 2.75:1 |
| NVMe SSD | 3,500 MB/s | 500,000 IOPS × 4KB = 2,000 MB/s | 1.75:1 |
Why Sequential Matters:
Disk Head Movement (HDD): Random reads require moving the disk head to different tracks, taking ~5-10ms each. Sequential reads keep the head in place, limited only by rotational speed.
Read Prefetching: Operating systems and disk controllers detect sequential patterns and read ahead, fetching multiple pages before they're requested.
SSD Parallelism: SSDs read faster when accessing consecutive blocks—internal parallelism is better utilized.
Prefetch Optimization:
Because leaves are linked, databases can issue reads for upcoming sibling pages while still processing the current one:

```text
Current read:               Leaf 47
Asynchronously prefetching: Leaves 48, 49, 50, 51
By the time we need Leaf 48, it's already in the buffer pool
```
This prefetching is only possible because leaf linking reveals the access pattern predictably.
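Here is a sketch of how a scan can overlap I/O with processing. The buffer-pool object and its blocking `fetch(page_id)` method are hypothetical stand-ins for whatever page cache the engine provides; the point is that the sibling pointer reveals the next page id early enough to request it asynchronously:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_with_prefetch(buffer_pool, start_page_id):
    """Single-step read-ahead: request leaf N+1 while still processing leaf N.
    buffer_pool.fetch(page_id) is a hypothetical blocking, caching page read."""
    executor = ThreadPoolExecutor(max_workers=1)
    pending = None
    page_id = start_page_id
    while page_id is not None:
        # If a prefetch for this page is in flight, wait on it; else read directly
        leaf = pending.result() if pending is not None else buffer_pool.fetch(page_id)
        pending = None
        if leaf.next_page_id is not None:
            # The sibling pointer tells us the next page before we need it
            pending = executor.submit(buffer_pool.fetch, leaf.next_page_id)
        yield from leaf.entries
        page_id = leaf.next_page_id
```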
While SSDs have much better random access than HDDs, they still benefit significantly from sequential access patterns. Internal parallelism, wear leveling efficiency, and prefetch buffering all work better with sequential access. Don't assume SSDs make leaf linking irrelevant.
Leaf links must be maintained during insertions and deletions. Understanding this maintenance is crucial for both implementing B+-trees and understanding their performance characteristics.
Leaf Split (during insert):
```text
Before split:
... ↔ [A] ↔ [B overflowing] ↔ [C] ↔ ...
                ↓
       (insert causes split)
                ↓
After split:
... ↔ [A] ↔ [B left half] ↔ [B right half] ↔ [C] ↔ ...
```
Algorithm (sketched in code below):

1. Allocate a new page R for the right half.
2. Move the upper half of B's entries into R.
3. Set R.next = B.next and R.prev = B.
4. If B's old right sibling exists, point its prev at R.
5. Set B.next = R.
6. Insert a separator key for R into the parent.

Link updates are local: only adjacent leaves are affected.
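A sketch of the pointer surgery in Python, using the `Leaf` class from earlier (the parent update of step 6 is left to the caller):

```python
def split_leaf(full: "Leaf") -> "Leaf":
    """Split an overflowing leaf; splice the new right half into the sibling chain."""
    mid = len(full.keys) // 2
    right = Leaf(keys=full.keys[mid:], values=full.values[mid:])
    full.keys, full.values = full.keys[:mid], full.values[:mid]

    # Only the two halves and the old right neighbor are touched
    right.next, right.prev = full.next, full
    if full.next is not None:
        full.next.prev = right
    full.next = right
    return right   # caller inserts a separator key for `right` into the parent
```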
Leaf Merge (during delete):
```text
Before merge:
... ↔ [A] ↔ [B underfull] ↔ [C underfull] ↔ [D] ↔ ...
                ↓
       (delete causes merge)
                ↓
After merge:
... ↔ [A] ↔ [B + C combined] ↔ [D] ↔ ...
```
Algorithm (sketched in code below):

1. Move all entries from C into B.
2. Set B.next = C.next.
3. If C's old right sibling exists, point its prev at B.
4. Remove the separator key for C from the parent.
5. Free C's page.
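And the inverse operation, again as a sketch over the same `Leaf` class (parent-key removal and page deallocation are left to the caller):

```python
def merge_leaves(left: "Leaf", right: "Leaf") -> None:
    """Absorb an underfull right sibling into `left` and unlink it from the chain."""
    left.keys.extend(right.keys)
    left.values.extend(right.values)

    # Unlink `right`: again, only adjacent leaves are touched
    left.next = right.next
    if right.next is not None:
        right.next.prev = left
    # caller removes the separator key for `right` from the parent and frees the page
```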
Key Insight: Sibling pointer updates are part of the atomic split/merge operation. If the operation fails midway, crash recovery must restore consistent sibling links.
Sibling pointer updates must be logged for crash recovery. If the system crashes after updating one link but before updating the reciprocal link, recovery must fix the inconsistency. This is why B+-tree implementations use write-ahead logging (WAL) for split operations.
Leaf linking introduces interesting concurrency challenges. When a reader is traversing leaves via sibling pointers, a concurrent writer might split or merge a leaf in the reader's path.
The B-Link Tree Protocol:

Lehman and Yao's B-link tree (1981) solves these problems elegantly:

1. Every node stores a high key: an upper bound on the keys it may contain.
2. Every node (internal nodes included) stores a right-link to its right sibling.
3. A split first creates the right sibling and sets its links, and only then updates the parent.
4. A reader that encounters a search key greater than a node's high key knows a concurrent split moved its target right, and simply follows the right-link.
Example:

1. A reader searching for key K = 50 arrives at a leaf with keys [30, 35, 40] and high_key = 45.
2. Since 50 > 45, this leaf must have been split by a concurrent writer.
3. Follow the right-link to the sibling [45, 50, 55, 60].
4. Find K = 50 there.
This protocol allows readers to traverse leaves without blocking on splits—they may take an extra hop but will always find the correct data.
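The reader's side of the protocol is tiny. A sketch using the `high_key` field from the leaf structure sketch earlier:

```python
def blink_lookup(leaf: "Leaf", key):
    """Move-right protocol: if key exceeds this leaf's high key, a concurrent
    split has moved it; chase right-links until the responsible leaf is found."""
    # The rightmost leaf has high_key = None (conceptually unbounded), so this terminates
    while leaf.high_key is not None and key > leaf.high_key:
        leaf = leaf.next   # usually zero or one extra hop
    if key in leaf.keys:
        return leaf.values[leaf.keys.index(key)]
    return None            # key genuinely absent
```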
Lock-free or latch-optimistic leaf traversal is critical for OLTP workloads where thousands of concurrent transactions perform range scans. Without the B-link optimization, range scans would need to hold locks across multiple leaves, creating severe contention.
Logical sibling links and physical disk layout interact in important ways. Ideally, logically adjacent leaves would be physically adjacent on disk, maximizing sequential I/O benefits.
The Fragmentation Problem:
Initially, B+-tree leaves may be written sequentially. But after many splits:
| State | Physical Locations of Leaves 1→2→3→4→5 | I/O Pattern |
|---|---|---|
| Initial creation | Disk blocks 100, 101, 102, 103, 104 | Sequential |
| After some splits | Disk blocks 100, 101, 250, 103, 251 | Mixed |
| Heavily fragmented | Disk blocks 100, 350, 42, 217, 89 | Random |
Mitigation Strategies:
1. Pre-allocation Pools: Allocate pages in large contiguous chunks. When splits occur, the new page comes from the same chunk, likely near its sibling.
2. Periodic Reorganization (REORG): Periodically rebuild the index from scratch, writing leaves in sequential order:
```sql
-- PostgreSQL
REINDEX INDEX my_index;

-- MySQL
ALTER TABLE my_table ENGINE=InnoDB;  -- rebuilds the table and all its indexes

-- SQL Server
ALTER INDEX my_index ON my_table REBUILD;
```
3. Online Defragmentation: Some databases can relocate pages incrementally without locking the entire index.
4. Bulk Loading: When initially loading data, build leaves in sorted order: sort the input first, pack entries into leaves left to right (typically at a target fill factor), link each new leaf to its predecessor, then build the internal levels bottom-up; see the sketch after this list.

This produces a perfectly sequential layout.
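A sketch of bottom-up leaf construction over the same `Leaf` class; the capacity and fill factor are illustrative knobs, not any engine's defaults:

```python
def bulk_load_leaves(sorted_items, capacity=64, fill_factor=0.9):
    """Pack pre-sorted (key, value) pairs into a fresh, fully linked leaf chain."""
    per_leaf = max(1, int(capacity * fill_factor))  # leave headroom for future inserts
    head = prev = None
    for i in range(0, len(sorted_items), per_leaf):
        chunk = sorted_items[i:i + per_leaf]
        leaf = Leaf(keys=[k for k, _ in chunk], values=[v for _, v in chunk])
        if prev is None:
            head = leaf                              # leftmost leaf
        else:
            prev.next, leaf.prev = leaf, prev        # splice into the chain
        prev = leaf
    return head  # internal levels are then built bottom-up over this chain
```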
SSDs are less affected by fragmentation than HDDs since random access is faster. However, they still benefit from sequential access patterns due to prefetching and internal parallelism. Don't ignore fragmentation just because you use SSDs.
Leaf node linking is the feature that transforms B+-trees from a dictionary structure into a sequential access powerhouse. Let's consolidate the key insights:

- Sibling pointers turn the leaf level into a sorted linked list, so a range scan costs one descent plus a linear walk: O(log n + k/B) page accesses.
- Sequential leaf access matches the strengths of both HDDs and SSDs, and makes read-ahead prefetching possible.
- Splits and merges update only adjacent leaves' links, but those updates must be WAL-logged for crash recovery.
- The B-link protocol (high key plus right-link) lets readers tolerate concurrent splits without blocking.
- Physical fragmentation erodes the sequential advantage; pre-allocation, rebuilds, and bulk loading restore it.
What's Next:
We've examined leaf nodes in depth. The next page explores internal nodes—the navigation layer that routes searches to the correct leaves. We'll examine their structure, how they're maintained during splits and merges, and various optimization techniques used in production databases.
You now understand how leaf node linking enables the range query and sequential access capabilities that make B+-trees ideal for database workloads. The linked leaf structure is what distinguishes B+-trees as a practical indexing solution rather than just a theoretical data structure. Next, we'll explore internal nodes in detail.