Of all the differences between B-trees and B+-trees, one stands above the rest in importance: in a B+-tree, all data records reside exclusively in leaf nodes. Internal nodes contain only keys (for routing) and pointers (to children). No data. Ever.
This seemingly simple rule is the single most impactful design decision in B+-tree architecture. It fundamentally changes how the tree behaves, how memory is utilized, how range queries perform, and how the structure can be optimized. Understanding why this matters—not just what it means—separates those who merely know B+-trees from those who truly understand them.
By the end of this page, you will deeply understand why storing data only in leaves is the cornerstone of B+-tree performance. You'll see how this enables massive fanout improvements, predictable search depths, efficient memory utilization, and optimal range query performance.
To appreciate the leaves-only storage design, let's first examine what each node type contains in a B+-tree versus a B-tree.
B-Tree Node Structure:
In a standard B-tree, both internal nodes and leaf nodes can contain complete key-value pairs. A node might look like:
[ptr₀ | k₁ | data₁ | ptr₁ | k₂ | data₂ | ptr₂ | k₃ | data₃ | ptr₃]
Here, the data (or data pointers) are interspersed with navigation pointers. The node serves dual purposes: routing searches AND storing information.
B+-Tree Node Structure:
Internal Node:
[ptr₀ | k₁ | ptr₁ | k₂ | ptr₂ | k₃ | ptr₃]
Leaf Node:
[k₁ | data₁ | k₂ | data₂ | k₃ | data₃ | sibling_ptr]
The separation is absolute. Internal nodes never hold data; leaf nodes hold all data.
| Content Type | B-Tree Internal | B-Tree Leaf | B+-Tree Internal | B+-Tree Leaf |
|---|---|---|---|---|
| Search Keys | ✓ | ✓ | ✓ (separators only) | ✓ (actual keys) |
| Data Records | ✓ | ✓ | ✗ Never | ✓ Always |
| Child Pointers | ✓ | ✗ | ✓ | ✗ |
| Sibling Pointers | ✗ | ✗ | ✗ | ✓ |
| Entries per Node | Lower | Standard | Maximized | Standard |
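As a sketch, the two roles map naturally onto two distinct record types (Python, with illustrative field names rather than any particular engine's layout):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InternalNode:
    """Routing only: n separator keys and n + 1 child pointers. No data."""
    keys: list        # sorted separator keys [k1, ..., kn]
    children: list    # child pointers [ptr0, ptr1, ..., ptrn]

@dataclass
class LeafNode:
    """All records live here, plus a sibling pointer for sequential scans."""
    keys: list                          # actual keys
    values: list                        # data records, parallel to keys
    next: Optional["LeafNode"] = None   # the sibling_ptr

leaf = LeafNode(keys=[1, 2], values=["a", "b"])
root = InternalNode(keys=[3], children=[leaf, LeafNode(keys=[3], values=["c"])])
```

Because an `InternalNode` carries no `values` field at all, the separation described above is enforced by the types themselves.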
The most immediate benefit of leaves-only storage is the dramatic increase in fanout—the number of children each internal node can have. Since internal nodes don't store data records, they can hold far more keys and pointers.
Concrete Example:
Consider a database page of 8,192 bytes (8KB), a common page size. We're indexing integer keys (8 bytes) with data records that are 200 bytes each. Pointers are 8 bytes.
B-Tree Internal Node:
Each key travels with its 200-byte record, so one entry costs 8 + 200 + 8 = 216 bytes (key, data, pointer). An 8KB page holds about 37 such entries, giving a fanout of roughly 38.
B+-Tree Internal Node:
Each entry is just an 8-byte key and an 8-byte child pointer, 16 bytes in total. After page-header overhead, an 8KB page holds about 500 keys, giving a fanout of roughly 501.
That's a 13× improvement in fanout!
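A back-of-the-envelope sketch of the fanout arithmetic (page-header overhead is ignored here for simplicity, which is why the B+-tree figure lands slightly above the ~501 used in the surrounding text):

```python
PAGE = 8192          # page size in bytes
KEY, PTR, DATA = 8, 8, 200

# A node with n keys has n + 1 child pointers; find the largest n that fits.
# B-tree internal node: every key also drags its 200-byte record along.
btree_keys = (PAGE - PTR) // (KEY + DATA + PTR)   # 8184 // 216 = 37 keys
# B+-tree internal node: keys and child pointers only.
bplus_keys = (PAGE - PTR) // (KEY + PTR)          # 8184 // 16  = 511 keys

print(btree_keys + 1, bplus_keys + 1)   # fanout = keys + 1: 38 vs 512
```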
Tree height = log_fanout(n). With fanout 38, a million records need ~4 levels. With fanout 501, they need only ~3 levels. That's one fewer disk read for EVERY search operation. Over millions of queries, this difference is transformative.
Height Comparison for 1 Billion Records:
| Fanout | Tree Height | Disk Reads per Search |
|---|---|---|
| 38 | 6 levels | 6 I/Os |
| 501 | 4 levels | 4 I/Os |
With millions of queries per second, this 33% reduction in I/O translates directly into lower query latency, higher throughput per server, and reduced hardware requirements.
The Fanout Effect Compounds:
The improvement isn't just about saving one level. Higher fanout also means fewer internal nodes overall, so a larger fraction of the tree fits in cache, and each level eliminated is removed from every single search, insert, and delete.
Modern databases operate across a dramatic memory hierarchy: CPU caches (nanoseconds), RAM (100+ nanoseconds), SSD (50+ microseconds), and HDD (5+ milliseconds). The leaves-only design allows B+-trees to optimally exploit this hierarchy.
Why Data-Free Internals Enable This:
In a B+-tree of order 500 with 1 billion records, the internal levels are tiny: 1 root node, ~500 second-level nodes, and ~250,000 third-level nodes.
Memory for all internal nodes: (1 + 500 + 250,000) × 8KB ≈ 2 GB
This is manageable! Most production databases can keep all internal nodes in RAM. The result: most queries require only one disk I/O—the final leaf fetch.
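The arithmetic behind that ~2 GB figure, as a quick sketch of the three internal levels of an order-500 tree:

```python
PAGE_KB = 8
fanout = 500

# Internal levels: 1 root, then its children, then their children.
internal_nodes = 1 + fanout + fanout**2           # 1 + 500 + 250,000
ram_gb = internal_nodes * PAGE_KB / (1024 * 1024)
print(f"{internal_nodes:,} internal pages ≈ {ram_gb:.2f} GB")  # ≈ 1.91 GB
```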
Contrast with B-Trees:
If internal nodes contained 200-byte data records, fanout would fall to roughly 38, and far more internal nodes would be needed to cover the same 1 billion records. The internal nodes alone might require 26 GB of RAM to cache—often impractical.
Database buffer pools have finite size. By keeping internal nodes lean, B+-trees ensure these critical navigation structures always fit in memory. The buffer pool can then dedicate remaining space to frequently-accessed leaf pages, maximizing cache effectiveness.
In a B-tree, search can terminate at any level—wherever the key is found. In a B+-tree, search always proceeds from root to leaf. Every search, successful or not, traverses the same number of levels.
Why Predictability Matters:
Because every lookup terminates at the same depth, the cost of an index probe is a known constant for a given tree. That matters both for tail-latency guarantees and for query planning.
The Query Optimizer Perspective:
Query optimizers must estimate the cost of different execution plans. For index lookups, the key question is: how many disk I/Os will this require?
With B+-trees, the answer is simple and exact:
I/O cost = height of tree = ⌈log_fanout(n)⌉
For a tree with n=1M records and fanout=500:
I/O cost = ⌈log₅₀₀(1,000,000)⌉ = ⌈2.22⌉ = 3
The optimizer knows with certainty that this index lookup costs 3 I/Os. This precision enables accurate plan comparisons.
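This cost formula is a one-liner; a small sketch reproducing the figure above and those in the earlier height table:

```python
import math

def index_lookup_ios(n: int, fanout: int) -> int:
    """I/Os per point lookup in a B+-tree = tree height = ceil(log_fanout(n))."""
    return math.ceil(math.log(n, fanout))

print(index_lookup_ios(1_000_000, 500))        # 3
print(index_lookup_ios(1_000_000_000, 38))     # 6
print(index_lookup_ios(1_000_000_000, 501))    # 4
```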
With B-trees, the cost depends on where keys appear—potentially requiring expensive statistics about key distributions across levels, which change with updates.
One might argue B-trees are sometimes faster because search can terminate early. In practice, this 'benefit' is minimal: with fanout of 500, 99.8% of keys are in leaves anyway. The 0.2% that might be found earlier doesn't compensate for the reduced fanout and increased complexity.
Range queries—finding all records where key BETWEEN x AND y—are fundamental to database operations. The leaves-only design makes B+-trees dramatically superior for range queries.
The Range Query Process:
1. Descend from root to the leaf containing the lower bound x: O(log n) node reads.
2. Scan within that leaf to the first key ≥ x.
3. Follow sibling pointers, reading leaves sequentially, until a key exceeds y: O(k) for k matching records.
Total cost: O(log n + k)
This is optimal—you can't do better than finding the start point plus reading the actual results.
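A minimal sketch of the scan half of this process, with the O(log n) root-to-leaf descent stubbed out by a direct leaf lookup (all names here are illustrative):

```python
class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys          # sorted keys (data records elided)
        self.next = next_leaf     # sibling pointer

# Build a chain of four leaves: [1..5], [6..10], [11..15], [16..20].
leaves = [Leaf(list(range(i, i + 5))) for i in range(1, 21, 5)]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

def range_query(find_leaf, lo, hi):
    """Descend once to the starting leaf, then follow sibling pointers."""
    leaf = find_leaf(lo)                  # stands in for the O(log n) descent
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out                # past the range: stop scanning
            if k >= lo:
                out.append(k)
        leaf = leaf.next                  # sequential step, no re-descent
    return out

# Stand-in for the root-to-leaf search: pick the leaf whose range covers lo.
def find_leaf(lo):
    return leaves[min(max(0, (lo - 1) // 5), len(leaves) - 1)]

print(range_query(find_leaf, 7, 13))   # [7, 8, 9, 10, 11, 12, 13]
```

Note that after the single descent, the scan only ever moves rightward along the leaf chain; internal nodes are never revisited.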
Why B-Trees Suffer:
In a B-tree, matching records are scattered throughout the tree at various levels. A range query must perform a full in-order traversal of the affected subtree: descend to the first match, then repeatedly climb back to internal nodes and descend into sibling subtrees to collect the rest.
The traversal constantly moves up and down the tree, potentially reading the same internal nodes multiple times.
| Metric | B+-Tree | B-Tree |
|---|---|---|
| I/O Pattern | Sequential leaf read | Random tree traversal |
| Pages read for k results | ⌈k/entries_per_leaf⌉ | Up to k × height pages |
| Cache efficiency | Excellent (sequential) | Poor (scattered access) |
| Prefetching | Highly effective | Difficult to predict |
| Disk head movement | Minimal | Extensive seeking |
Practical Example:
Query: Find all orders placed between January 1 and January 31 (10,000 matching records, 100 records per leaf page).
B+-Tree: one root-to-leaf descent (a handful of I/Os) plus ⌈10,000 / 100⌉ = 100 sequential leaf reads, roughly 104 pages in total, read in order.
B-Tree (worst case): matching records are scattered across levels, so the traversal can touch on the order of k × height pages, tens of thousands of reads, in random order.
For disk-based storage, this difference is catastrophic. Sequential reads are 100× faster than random reads on HDDs and 10× faster on SSDs.
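Plugging the example's numbers into the formulas from the comparison table above (a tree height of 4 is an assumption for a dataset of this size):

```python
import math

k = 10_000        # matching records in the January range
per_leaf = 100    # records per leaf page
height = 4        # root-to-leaf path length (assumed)

bplus_pages = height + math.ceil(k / per_leaf)   # one descent + sequential leaves
btree_worst = k * height                         # upper bound from the table
print(bplus_pages, btree_worst)                  # 104 vs 40000
```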
Database workloads are dominated by range queries: date ranges, price ranges, alphabetical listings, TOP N queries. By concentrating all data in linked leaves, B+-trees convert these expensive random access patterns into cheap sequential scans.
Modern databases handle thousands of concurrent queries. The leaves-only design significantly simplifies the challenging problem of concurrent B+-tree access.
The Core Insight:
Data modifications (INSERT, UPDATE, DELETE) only affect leaf nodes. Internal nodes change only when leaves split or merge—relatively rare events. This asymmetry enables sophisticated locking protocols.
B-Link Trees:
The B-link tree, a popular B+-tree variant, adds sibling pointers at all levels to further improve concurrency. When a split occurs, the new right sibling is linked via the sibling pointer before the parent is updated, so a concurrent reader that arrives at the old node can simply follow the link rightward to find the moved keys instead of blocking.
This optimization is only practical because data is in leaves—internal node changes are infrequent enough that the additional sibling links don't significantly impact space.
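The split ordering can be sketched in a few lines (single-threaded and simplified; real B-link implementations add per-node latches and a "high key" so readers know when to follow the link):

```python
class Node:
    def __init__(self, keys, next_node=None):
        self.keys = keys
        self.next = next_node     # sibling pointer

def split(node):
    """B-link-style split: link the new right sibling BEFORE the parent
    learns about it, so a concurrent reader overshooting `node` can still
    reach the moved keys via the sibling pointer."""
    mid = len(node.keys) // 2
    right = Node(node.keys[mid:], next_node=node.next)  # step 1: new node
    node.next = right                                   # step 2: side link
    node.keys = node.keys[:mid]                         # step 3: shrink left
    return right.keys[0], right   # separator for the (later) parent update

full = Node([10, 20, 30, 40])
sep, right = split(full)
print(full.keys, sep, right.keys)   # [10, 20] 30 [30, 40]
```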
Why B-Trees Are Harder to Optimize:
If internal nodes contain data, they become write hotspots. Modifying an internal node entry requires exclusive access to a node that's also used for navigation. This creates bottlenecks, especially for frequently-accessed keys near the root.
PostgreSQL's B-tree implementation (which is structurally a B+-tree) uses Lehman and Yao's B-link tree algorithm, allowing many transactions to read and write the index concurrently with minimal blocking. This scalability is fundamentally enabled by the leaves-only data storage pattern.
The leaves-only design affects how B+-trees are implemented in practice. Understanding these implications helps when tuning database performance or debugging index issues.
| Aspect | Implementation Detail | Performance Impact |
|---|---|---|
| Search algorithm | Always descend to leaf, even for exact match | Predictable cost, simple code |
| Insert algorithm | Insert always at leaf level; may propagate split upward | Lower contention on internals |
| Delete algorithm | Delete from leaf; may propagate merge upward | Simpler tombstone handling |
| Node format | Two distinct formats: internal and leaf | Optimized storage for each role |
| Recovery | Only leaf changes logged with full data | Smaller log for internal changes |
| Compression | Different compression suited to each node type | Better overall compression ratio |
Two Node Type Formats:
Production databases typically implement different physical layouts for internal and leaf nodes:
Internal Node Layout:
[header | key₁ | ptr₁ | key₂ | ptr₂ | ... | keyₙ | ptrₙ | ptr₀]
Leaf Node Layout:
[header | slot_directory | free_space | ... | record₃ | record₂ | record₁]
This specialization is only possible because nodes serve different purposes—navigation vs. storage.
Since internal node keys are separators (not actual data), they can be aggressively compressed. If adjacent keys are 'Johnson' and 'Jones', the separator need only store 'Jon'—enough to distinguish the branches. This further increases internal node fanout.
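A sketch of this separator (suffix) truncation, assuming plain lexicographic string keys:

```python
def shortest_separator(left: str, right: str) -> str:
    """Shortest prefix of `right` that still sorts strictly above `left`.

    Any such prefix s satisfies left < s <= right, so it routes `left`
    into the left subtree and `right` into the right subtree. Assumes
    left < right, which the tree's sort order guarantees.
    """
    for i in range(1, len(right) + 1):
        candidate = right[:i]
        if candidate > left:      # first prefix that clears the left key
            return candidate
    return right                  # fallback: the full right key

print(shortest_separator("Johnson", "Jones"))   # Jon
```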
The decision to store data exclusively in leaf nodes is the defining architectural choice of B+-trees. To consolidate why it matters: data-free internal nodes maximize fanout and keep trees shallow; the slim internal levels fit comfortably in RAM; every search costs the same, predictable number of I/Os; range queries become cheap sequential leaf scans; and concentrating modifications in leaves simplifies concurrency control.
What's Next:
The leaves-only design enables another crucial B+-tree feature: linking leaf nodes together for sequential access. The next page explores leaf node linking in depth—how it works, why it's essential for range queries, and how it interacts with concurrency control.
You now deeply understand why storing data only in leaves is the cornerstone of B+-tree design. This single decision enables the fanout, memory efficiency, predictability, and range query performance that make B+-trees the dominant database index structure. Next, we'll explore how leaf node linking enables efficient sequential access.