B+-trees present a fascinating duality: from above, they appear as a tree structure optimized for logarithmic search; from the bottom, they appear as a linked list of sorted data pages. This dual nature is not accidental—it's the source of B+-trees' power for database workloads.
The key mechanism enabling this duality is leaf node linking: each leaf maintains a pointer to its next sibling (and often to its previous sibling as well). This seemingly simple addition transforms the B+-tree from a point lookup structure into a sequential access powerhouse, capable of efficiently handling range queries, sorted output, and aggregate operations.
By the end of this page, you will understand exactly how leaf node linking works, why it's essential for database operations beyond point lookups, and how it affects both query execution and tree maintenance. You'll see how this structure enables the range queries and ORDER BY optimizations that databases rely upon.
In a B+-tree, all leaf nodes are connected through sibling pointers, forming a chain that preserves key ordering. The structure can be visualized as:
```text
                  [Root]
                 /      \
        [Internal]      [Internal]
        /    |    \       /    \
[L1] ↔ [L2] ↔ [L3] ↔ [L4] ↔ [L5]
  ↑                           ↑
 Head                        Tail
```
Invariant: The leaves form a sorted sequence. If you traverse from the leftmost leaf to the rightmost leaf following sibling pointers, you visit all keys in sorted order.

Each leaf node contains:
| Component | Size (typical) | Purpose |
|---|---|---|
| Page Header | 16-64 bytes | Page type, LSN, checksum, entry count |
| Next Pointer | 8 bytes | Link to right sibling leaf page |
| Prev Pointer | 8 bytes (optional) | Link to left sibling leaf page |
| High Key (optional) | Variable | Highest key in this leaf (for searches) |
| Key-Value Entries | Page size - overhead | Actual data records |
Some implementations use singly-linked leaves (next pointer only), while others use doubly-linked (next and prev). Doubly-linked leaves enable efficient backward scans (ORDER BY DESC) but add slight overhead during splits. Most modern databases use doubly-linked leaves.
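To make the layout concrete, here is a minimal Python sketch of a doubly-linked leaf. The field names mirror the table above, but this is an illustrative in-memory model, not any particular database's page format; the later sketches on this page reuse this class:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Leaf:
    """Minimal in-memory stand-in for a leaf page (not a real on-disk layout)."""
    keys: list = field(default_factory=list)     # sorted keys
    values: list = field(default_factory=list)   # payloads, parallel to keys
    next: Optional["Leaf"] = None                # right sibling (the next pointer)
    prev: Optional["Leaf"] = None                # left sibling (doubly-linked variant)
    high_key: Optional[int] = None               # optional upper bound; see B-link trees below
```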
The primary motivation for leaf linking is efficient range query execution. Consider the query:
```sql
SELECT * FROM employees WHERE salary BETWEEN 50000 AND 75000;
```
Without Leaf Linking (B-Tree approach): locate the first key >= 50000 via a root-to-leaf descent, then find each successor by climbing back up through internal nodes and descending again. This yo-yo traversal is expensive: each record might require navigating multiple levels.

With Leaf Linking (B+-Tree approach): descend once to the leaf holding the first key >= 50000, then follow sibling pointers rightward until a key exceeds 75000. The entire scan stays at the leaf level: no tree navigation during the range portion.
Algorithm: Range Scan
```text
RangeScan(tree, low_key, high_key):
    // Phase 1: Find starting point (logarithmic)
    node = tree.root
    while node is not a leaf:
        node = find_child(node, low_key)
    leaf = node

    // Find first key >= low_key within leaf
    position = binary_search(leaf, low_key)

    // Phase 2: Sequential scan (linear in result size)
    results = []
    while true:
        while position < leaf.entry_count:
            if leaf.keys[position] > high_key:
                return results            // done: passed the upper bound
            results.append(leaf.entries[position])
            position++
        // Move to next leaf
        if leaf.next == null:
            return results                // no more leaves
        leaf = fetch_page(leaf.next)
        position = 0
```
Time Complexity: Phase 1 costs O(log n) page accesses for the root-to-leaf descent; Phase 2 costs O(k/B) page accesses, where k is the number of matching records and B is the number of entries per page.

This algorithm is I/O-optimal for range queries: you cannot report k contiguous records from a tree of n keys with fewer than O(log n + k/B) page accesses, and the leaf-linked scan achieves this theoretical optimum.
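The pseudocode translates almost directly into Python. Below is a runnable sketch of Phase 2 against a hand-built leaf chain, using the `Leaf` class from the structure sketch earlier; Phase 1's tree descent is stood in for by `bisect` on a known starting leaf:

```python
import bisect
# Uses the Leaf dataclass from the structure sketch above.

def range_scan(leaf: "Leaf", position: int, high_key) -> list:
    """Phase 2: walk sibling pointers, collecting entries until high_key is passed."""
    results = []
    while leaf is not None:
        while position < len(leaf.keys):
            if leaf.keys[position] > high_key:
                return results
            results.append(leaf.values[position])
            position += 1
        leaf, position = leaf.next, 0   # hop right; no tree navigation needed
    return results

# Three linked leaves holding keys 10..90
a = Leaf(keys=[10, 20, 30], values=["v10", "v20", "v30"])
b = Leaf(keys=[40, 50, 60], values=["v40", "v50", "v60"])
c = Leaf(keys=[70, 80, 90], values=["v70", "v80", "v90"])
a.next, b.next = b, c

# Phase 1 would descend to leaf `a`; find the first key >= 25 inside it
start = bisect.bisect_left(a.keys, 25)
print(range_scan(a, start, 65))   # ['v30', 'v40', 'v50', 'v60']
```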
Range queries are just one instance of sequential access. Leaf linking enables efficient execution of many common database operations:
- Date ranges: `WHERE date BETWEEN '2024-01-01' AND '2024-12-31'`
- Open-ended ranges: `WHERE price > 100` scans from the first match to the end
- Prefix matches: `WHERE name LIKE 'Smith%'` scans contiguous keys
- Sorted output: `SELECT * FROM t ORDER BY indexed_col` - just scan leaves in order
- Min/Max: `SELECT MIN(col)` - fetch the leftmost leaf; `SELECT MAX(col)` - fetch the rightmost leaf
- Index-only scans: `SELECT indexed_col FROM t` - scan all leaves without touching the heap
- Aggregates: `SELECT COUNT(*), SUM(col)` - aggregate while scanning leaves

ORDER BY Optimization:
Without leaf linking, a query like:
```sql
SELECT * FROM employees ORDER BY last_name;
```

...would require either: (1) reading all rows and performing an explicit sort, expensive for large tables, or (2) repeatedly re-descending the tree to find each successive key, which is even worse.

With leaf linking on a last_name index: descend once to the leftmost leaf, then follow sibling pointers, emitting rows that are already in sorted order. No sort step is needed.
This is why database query optimizers often choose index scans even when most rows are selected: the free ordering avoids an expensive sort operation.
Reverse Scans (ORDER BY DESC):

With doubly-linked leaves, the same technique runs in reverse:

```sql
SELECT * FROM employees ORDER BY last_name DESC;
```

...is answered by starting at the rightmost leaf and following prev pointers leftward.
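A sketch of the mirror-image backward scan, again using the doubly-linked `Leaf` class from earlier; the descent locates the last key <= the upper bound, then the walk follows prev pointers:

```python
def range_scan_desc(leaf: "Leaf", position: int, low_key) -> list:
    """Backward scan: emits matching entries in descending key order."""
    results = []
    while leaf is not None:
        while position >= 0:
            if leaf.keys[position] < low_key:
                return results
            results.append(leaf.values[position])
            position -= 1
        leaf = leaf.prev                               # hop left via the prev pointer
        position = len(leaf.keys) - 1 if leaf else -1
    return results
```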
If all needed columns are in the index, the scan never touches the base table—this is an 'index-only scan' or 'covering index scan'. Leaf linking makes this extremely efficient since all data is accessed sequentially in the leaves.
The true power of leaf linking is revealed when considering physical disk characteristics. Modern storage devices have drastically different performance for sequential vs. random access:
| Storage Type | Sequential Read | Random Read | Ratio |
|---|---|---|---|
| HDD (7200 RPM) | 200 MB/s | 100 IOPS × 4KB = 0.4 MB/s | 500:1 |
| SATA SSD | 550 MB/s | 50,000 IOPS × 4KB = 200 MB/s | 2.75:1 |
| NVMe SSD | 3,500 MB/s | 500,000 IOPS × 4KB = 2,000 MB/s | 1.75:1 |
Why Sequential Matters:
Disk Head Movement (HDD): Random reads require moving the disk head to different tracks, taking ~5-10ms each. Sequential reads keep the head in place, limited only by rotational speed.
Read Prefetching: Operating systems and disk controllers detect sequential patterns and read ahead, fetching multiple pages before they're requested.
SSD Parallelism: SSDs read faster when accessing consecutive blocks—internal parallelism is better utilized.
Prefetch Optimization:
Because leaves are linked, databases can issue reads for upcoming sibling pages while still processing the current one:

```text
Current read:               Leaf 47
Asynchronously prefetching: Leaves 48, 49, 50, 51
By the time we need Leaf 48, it's already in the buffer pool
```
This prefetching is only possible because leaf linking reveals the access pattern predictably.
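Here is a sketch of how a scan can overlap I/O with processing. The buffer-pool object and its blocking `fetch(page_id)` method are hypothetical stand-ins for whatever page cache the engine provides; the point is that the sibling pointer reveals the next page id early enough to request it asynchronously:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_with_prefetch(buffer_pool, start_page_id):
    """Single-step read-ahead: request leaf N+1 while still processing leaf N.
    buffer_pool.fetch(page_id) is a hypothetical blocking, caching page read."""
    executor = ThreadPoolExecutor(max_workers=1)
    pending = None
    page_id = start_page_id
    while page_id is not None:
        # If a prefetch for this page is in flight, wait on it; else read directly
        leaf = pending.result() if pending is not None else buffer_pool.fetch(page_id)
        pending = None
        if leaf.next_page_id is not None:
            # The sibling pointer tells us the next page before we need it
            pending = executor.submit(buffer_pool.fetch, leaf.next_page_id)
        yield from leaf.entries
        page_id = leaf.next_page_id
```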
While SSDs have much better random access than HDDs, they still benefit significantly from sequential access patterns. Internal parallelism, wear leveling efficiency, and prefetch buffering all work better with sequential access. Don't assume SSDs make leaf linking irrelevant.
Leaf links must be maintained during insertions and deletions. Understanding this maintenance is crucial for both implementing B+-trees and understanding their performance characteristics.
Leaf Split (during insert):
```text
Before split:
... ↔ [A] ↔ [B overflowing] ↔ [C] ↔ ...
                ↓
       (insert causes split)
                ↓
After split:
... ↔ [A] ↔ [B left half] ↔ [B right half] ↔ [C] ↔ ...
```
Algorithm (sketched in code below):

1. Allocate a new page R for the right half.
2. Move the upper half of B's entries into R.
3. Set R.next = B.next and R.prev = B.
4. If B's old right sibling exists, point its prev at R.
5. Set B.next = R.
6. Insert a separator key for R into the parent.

Link updates are local: only adjacent leaves are affected.
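A sketch of the pointer surgery in Python, using the `Leaf` class from earlier (the parent update of step 6 is left to the caller):

```python
def split_leaf(full: "Leaf") -> "Leaf":
    """Split an overflowing leaf; splice the new right half into the sibling chain."""
    mid = len(full.keys) // 2
    right = Leaf(keys=full.keys[mid:], values=full.values[mid:])
    full.keys, full.values = full.keys[:mid], full.values[:mid]

    # Only the two halves and the old right neighbor are touched
    right.next, right.prev = full.next, full
    if full.next is not None:
        full.next.prev = right
    full.next = right
    return right   # caller inserts a separator key for `right` into the parent
```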
Leaf Merge (during delete):
```text
Before merge:
... ↔ [A] ↔ [B underfull] ↔ [C underfull] ↔ [D] ↔ ...
                ↓
       (delete causes merge)
                ↓
After merge:
... ↔ [A] ↔ [B + C combined] ↔ [D] ↔ ...
```
Algorithm (sketched in code below):

1. Move all entries from C into B.
2. Set B.next = C.next.
3. If C's old right sibling exists, point its prev at B.
4. Remove the separator key for C from the parent.
5. Free C's page.
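And the inverse operation, again as a sketch over the same `Leaf` class (parent-key removal and page deallocation are left to the caller):

```python
def merge_leaves(left: "Leaf", right: "Leaf") -> None:
    """Absorb an underfull right sibling into `left` and unlink it from the chain."""
    left.keys.extend(right.keys)
    left.values.extend(right.values)

    # Unlink `right`: again, only adjacent leaves are touched
    left.next = right.next
    if right.next is not None:
        right.next.prev = left
    # caller removes the separator key for `right` from the parent and frees the page
```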
Key Insight: Sibling pointer updates are part of the atomic split/merge operation. If the operation fails midway, crash recovery must restore consistent sibling links.
Sibling pointer updates must be logged for crash recovery. If the system crashes after updating one link but before updating the reciprocal link, recovery must fix the inconsistency. This is why B+-tree implementations use write-ahead logging (WAL) for split operations.
Leaf linking introduces interesting concurrency challenges. When a reader is traversing leaves via sibling pointers, a concurrent writer might split or merge a leaf in the reader's path.
The B-Link Tree Protocol:

Lehman and Yao's B-link tree (1981) solves these problems elegantly:

1. Every node stores a high key: an upper bound on the keys it may contain.
2. Every node (internal nodes included) stores a right-link to its right sibling.
3. A split first creates the right sibling and sets its links, and only then updates the parent.
4. A reader that encounters a search key greater than a node's high key knows a concurrent split moved its target right, and simply follows the right-link.
Example:

1. A reader searching for key K = 50 arrives at a leaf with keys [30, 35, 40] and high_key = 45.
2. Since 50 > 45, this leaf must have been split by a concurrent writer.
3. Follow the right-link to the sibling [45, 50, 55, 60].
4. Find K = 50 there.
This protocol allows readers to traverse leaves without blocking on splits—they may take an extra hop but will always find the correct data.
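The reader's side of the protocol is tiny. A sketch using the `high_key` field from the leaf structure sketch earlier:

```python
def blink_lookup(leaf: "Leaf", key):
    """Move-right protocol: if key exceeds this leaf's high key, a concurrent
    split has moved it; chase right-links until the responsible leaf is found."""
    # The rightmost leaf has high_key = None (conceptually unbounded), so this terminates
    while leaf.high_key is not None and key > leaf.high_key:
        leaf = leaf.next   # usually zero or one extra hop
    if key in leaf.keys:
        return leaf.values[leaf.keys.index(key)]
    return None            # key genuinely absent
```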
Lock-free or latch-optimistic leaf traversal is critical for OLTP workloads where thousands of concurrent transactions perform range scans. Without the B-link optimization, range scans would need to hold locks across multiple leaves, creating severe contention.
Logical sibling links and physical disk layout interact in important ways. Ideally, logically adjacent leaves would be physically adjacent on disk, maximizing sequential I/O benefits.
The Fragmentation Problem:
Initially, B+-tree leaves may be written sequentially. But after many splits:
| State | Physical Locations of Leaves 1→2→3→4→5 | I/O Pattern |
|---|---|---|
| Initial creation | Disk blocks 100, 101, 102, 103, 104 | Sequential |
| After some splits | Disk blocks 100, 101, 250, 103, 251 | Mixed |
| Heavily fragmented | Disk blocks 100, 350, 42, 217, 89 | Random |
Mitigation Strategies:
1. Pre-allocation Pools: Allocate pages in large contiguous chunks. When splits occur, the new page comes from the same chunk, likely near its sibling.
2. Periodic Reorganization (REORG): Periodically rebuild the index from scratch, writing leaves in sequential order:
```sql
-- PostgreSQL
REINDEX INDEX my_index;

-- MySQL
ALTER TABLE my_table ENGINE=InnoDB;  -- rebuilds the table and all its indexes

-- SQL Server
ALTER INDEX my_index ON my_table REBUILD;
```
3. Online Defragmentation: Some databases can relocate pages incrementally without locking the entire index.
4. Bulk Loading: When initially loading data, build leaves in sorted order: sort the input first, pack entries into leaves left to right (typically at a target fill factor), link each new leaf to its predecessor, then build the internal levels bottom-up; see the sketch after this list.

This produces a perfectly sequential layout.
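A sketch of bottom-up leaf construction over the same `Leaf` class; the capacity and fill factor are illustrative knobs, not any engine's defaults:

```python
def bulk_load_leaves(sorted_items, capacity=64, fill_factor=0.9):
    """Pack pre-sorted (key, value) pairs into a fresh, fully linked leaf chain."""
    per_leaf = max(1, int(capacity * fill_factor))  # leave headroom for future inserts
    head = prev = None
    for i in range(0, len(sorted_items), per_leaf):
        chunk = sorted_items[i:i + per_leaf]
        leaf = Leaf(keys=[k for k, _ in chunk], values=[v for _, v in chunk])
        if prev is None:
            head = leaf                              # leftmost leaf
        else:
            prev.next, leaf.prev = leaf, prev        # splice into the chain
        prev = leaf
    return head  # internal levels are then built bottom-up over this chain
```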
SSDs are less affected by fragmentation than HDDs since random access is faster. However, they still benefit from sequential access patterns due to prefetching and internal parallelism. Don't ignore fragmentation just because you use SSDs.
Leaf node linking is the feature that transforms B+-trees from a dictionary structure into a sequential access powerhouse. Let's consolidate the key insights:

- Sibling pointers turn the leaf level into a sorted linked list, so a range scan costs one descent plus a linear walk: O(log n + k/B) page accesses.
- Sequential leaf access matches the strengths of both HDDs and SSDs, and makes read-ahead prefetching possible.
- Splits and merges update only adjacent leaves' links, but those updates must be WAL-logged for crash recovery.
- The B-link protocol (high key plus right-link) lets readers tolerate concurrent splits without blocking.
- Physical fragmentation erodes the sequential advantage; pre-allocation, rebuilds, and bulk loading restore it.
What's Next:
We've examined leaf nodes in depth. The next page explores internal nodes—the navigation layer that routes searches to the correct leaves. We'll examine their structure, how they're maintained during splits and merges, and various optimization techniques used in production databases.
You now understand how leaf node linking enables the range query and sequential access capabilities that make B+-trees ideal for database workloads. The linked leaf structure is what distinguishes B+-trees as a practical indexing solution rather than just a theoretical data structure. Next, we'll explore internal nodes in detail.