Of all the differences between B-trees and B+-trees, one stands above the rest in importance: in a B+-tree, all data records reside exclusively in leaf nodes. Internal nodes contain only keys (for routing) and pointers (to children). No data. Ever.
This seemingly simple rule is the single most impactful design decision in B+-tree architecture. It fundamentally changes how the tree behaves, how memory is utilized, how range queries perform, and how the structure can be optimized. Understanding why this matters—not just what it means—separates those who merely know B+-trees from those who truly understand them.
By the end of this page, you will deeply understand why storing data only in leaves is the cornerstone of B+-tree performance. You'll see how this enables massive fanout improvements, predictable search depths, efficient memory utilization, and optimal range query performance.
To appreciate the leaves-only storage design, let's first examine what each node type contains in a B+-tree versus a B-tree.
B-Tree Node Structure:
In a standard B-tree, both internal nodes and leaf nodes can contain complete key-value pairs. A node might look like:
[ptr₀ | k₁ | data₁ | ptr₁ | k₂ | data₂ | ptr₂ | k₃ | data₃ | ptr₃]
Here, the data (or data pointers) are interspersed with navigation pointers. The node serves dual purposes: routing searches AND storing information.
B+-Tree Node Structure:
Internal Node:
[ptr₀ | k₁ | ptr₁ | k₂ | ptr₂ | k₃ | ptr₃]
Leaf Node:
[k₁ | data₁ | k₂ | data₂ | k₃ | data₃ | sibling_ptr]
The separation is absolute. Internal nodes never hold data; leaf nodes hold all data.
| Content Type | B-Tree Internal | B-Tree Leaf | B+-Tree Internal | B+-Tree Leaf |
|---|---|---|---|---|
| Search Keys | ✓ | ✓ | ✓ (separators only) | ✓ (actual keys) |
| Data Records | ✓ | ✓ | ✗ Never | ✓ Always |
| Child Pointers | ✓ | ✗ | ✓ | ✗ |
| Sibling Pointers | ✗ | ✗ | ✗ | ✓ |
| Entries per Node | Lower | Standard | Maximized | Standard |
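As a sketch, the two roles map naturally onto two distinct record types (Python, with illustrative field names rather than any particular engine's layout):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InternalNode:
    """Routing only: n separator keys and n + 1 child pointers. No data."""
    keys: list        # sorted separator keys [k1, ..., kn]
    children: list    # child pointers [ptr0, ptr1, ..., ptrn]

@dataclass
class LeafNode:
    """All records live here, plus a sibling pointer for sequential scans."""
    keys: list                          # actual keys
    values: list                        # data records, parallel to keys
    next: Optional["LeafNode"] = None   # the sibling_ptr

leaf = LeafNode(keys=[1, 2], values=["a", "b"])
root = InternalNode(keys=[3], children=[leaf, LeafNode(keys=[3], values=["c"])])
```

Because an `InternalNode` carries no `values` field at all, the separation described above is enforced by the types themselves.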
The most immediate benefit of leaves-only storage is the dramatic increase in fanout—the number of children each internal node can have. Since internal nodes don't store data records, they can hold far more keys and pointers.
Concrete Example:
Consider a database page of 8,192 bytes (8KB), a common page size. We're indexing integer keys (8 bytes) with data records that are 200 bytes each. Pointers are 8 bytes.
B-Tree Internal Node:
Each key travels with its 200-byte record, so one entry costs 8 + 200 + 8 = 216 bytes (key, data, pointer). An 8KB page holds about 37 such entries, giving a fanout of roughly 38.
B+-Tree Internal Node:
Each entry is just an 8-byte key and an 8-byte child pointer, 16 bytes in total. After page-header overhead, an 8KB page holds about 500 keys, giving a fanout of roughly 501.
That's a 13× improvement in fanout!
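A back-of-the-envelope sketch of the fanout arithmetic (page-header overhead is ignored here for simplicity, which is why the B+-tree figure lands slightly above the ~501 used in the surrounding text):

```python
PAGE = 8192          # page size in bytes
KEY, PTR, DATA = 8, 8, 200

# A node with n keys has n + 1 child pointers; find the largest n that fits.
# B-tree internal node: every key also drags its 200-byte record along.
btree_keys = (PAGE - PTR) // (KEY + DATA + PTR)   # 8184 // 216 = 37 keys
# B+-tree internal node: keys and child pointers only.
bplus_keys = (PAGE - PTR) // (KEY + PTR)          # 8184 // 16  = 511 keys

print(btree_keys + 1, bplus_keys + 1)   # fanout = keys + 1: 38 vs 512
```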
Tree height = log_fanout(n). With fanout 38, a million records need ~4 levels. With fanout 501, they need only ~3 levels. That's one fewer disk read for EVERY search operation. Over millions of queries, this difference is transformative.
Height Comparison for 1 Billion Records:
| Fanout | Tree Height | Disk Reads per Search |
|---|---|---|
| 38 | 6 levels | 6 I/Os |
| 501 | 4 levels | 4 I/Os |
With millions of queries per second, this 33% reduction in I/O translates directly into lower query latency, higher throughput per server, and reduced hardware requirements.
The Fanout Effect Compounds:
The improvement isn't just about saving one level. Higher fanout also means fewer internal nodes overall, so a larger fraction of the tree fits in cache, and each level eliminated is removed from every single search, insert, and delete.
Modern databases operate across a dramatic memory hierarchy: CPU caches (nanoseconds), RAM (100+ nanoseconds), SSD (50+ microseconds), and HDD (5+ milliseconds). The leaves-only design allows B+-trees to optimally exploit this hierarchy.
Why Data-Free Internals Enable This:
In a B+-tree of order 500 with 1 billion records, the internal levels are tiny: 1 root node, ~500 second-level nodes, and ~250,000 third-level nodes.
Memory for all internal nodes: (1 + 500 + 250,000) × 8KB ≈ 2 GB
This is manageable! Most production databases can keep all internal nodes in RAM. The result: most queries require only one disk I/O—the final leaf fetch.
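The arithmetic behind that ~2 GB figure, as a quick sketch of the three internal levels of an order-500 tree:

```python
PAGE_KB = 8
fanout = 500

# Internal levels: 1 root, then its children, then their children.
internal_nodes = 1 + fanout + fanout**2           # 1 + 500 + 250,000
ram_gb = internal_nodes * PAGE_KB / (1024 * 1024)
print(f"{internal_nodes:,} internal pages ≈ {ram_gb:.2f} GB")  # ≈ 1.91 GB
```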
Contrast with B-Trees:
If internal nodes contained 200-byte data records, fanout would fall to roughly 38, and far more internal nodes would be needed to cover the same 1 billion records. The internal nodes alone might require 26 GB of RAM to cache—often impractical.
Database buffer pools have finite size. By keeping internal nodes lean, B+-trees ensure these critical navigation structures always fit in memory. The buffer pool can then dedicate remaining space to frequently-accessed leaf pages, maximizing cache effectiveness.
In a B-tree, search can terminate at any level—wherever the key is found. In a B+-tree, search always proceeds from root to leaf. Every search, successful or not, traverses the same number of levels.
Why Predictability Matters:
Because every lookup terminates at the same depth, the cost of an index probe is a known constant for a given tree. That matters both for tail-latency guarantees and for query planning.
The Query Optimizer Perspective:
Query optimizers must estimate the cost of different execution plans. For index lookups, the key question is: how many disk I/Os will this require?
With B+-trees, the answer is simple and exact:
I/O cost = height of tree = ⌈log_fanout(n)⌉
For a tree with n=1M records and fanout=500:
I/O cost = ⌈log₅₀₀(1,000,000)⌉ = ⌈2.22⌉ = 3
The optimizer knows with certainty that this index lookup costs 3 I/Os. This precision enables accurate plan comparisons.
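This cost formula is a one-liner; a small sketch reproducing the figure above and those in the earlier height table:

```python
import math

def index_lookup_ios(n: int, fanout: int) -> int:
    """I/Os per point lookup in a B+-tree = tree height = ceil(log_fanout(n))."""
    return math.ceil(math.log(n, fanout))

print(index_lookup_ios(1_000_000, 500))        # 3
print(index_lookup_ios(1_000_000_000, 38))     # 6
print(index_lookup_ios(1_000_000_000, 501))    # 4
```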
With B-trees, the cost depends on where keys appear—potentially requiring expensive statistics about key distributions across levels, which change with updates.
One might argue B-trees are sometimes faster because search can terminate early. In practice, this 'benefit' is minimal: with fanout of 500, 99.8% of keys are in leaves anyway. The 0.2% that might be found earlier doesn't compensate for the reduced fanout and increased complexity.
Range queries—finding all records where key BETWEEN x AND y—are fundamental to database operations. The leaves-only design makes B+-trees dramatically superior for range queries.
The Range Query Process:
1. Descend from root to the leaf containing the lower bound x: O(log n) node reads.
2. Scan within that leaf to the first key ≥ x.
3. Follow sibling pointers, reading leaves sequentially, until a key exceeds y: O(k) for k matching records.
Total cost: O(log n + k)
This is optimal—you can't do better than finding the start point plus reading the actual results.
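A minimal sketch of the scan half of this process, with the O(log n) root-to-leaf descent stubbed out by a direct leaf lookup (all names here are illustrative):

```python
class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys          # sorted keys (data records elided)
        self.next = next_leaf     # sibling pointer

# Build a chain of four leaves: [1..5], [6..10], [11..15], [16..20].
leaves = [Leaf(list(range(i, i + 5))) for i in range(1, 21, 5)]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

def range_query(find_leaf, lo, hi):
    """Descend once to the starting leaf, then follow sibling pointers."""
    leaf = find_leaf(lo)                  # stands in for the O(log n) descent
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out                # past the range: stop scanning
            if k >= lo:
                out.append(k)
        leaf = leaf.next                  # sequential step, no re-descent
    return out

# Stand-in for the root-to-leaf search: pick the leaf whose range covers lo.
def find_leaf(lo):
    return leaves[min(max(0, (lo - 1) // 5), len(leaves) - 1)]

print(range_query(find_leaf, 7, 13))   # [7, 8, 9, 10, 11, 12, 13]
```

Note that after the single descent, the scan only ever moves rightward along the leaf chain; internal nodes are never revisited.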
Why B-Trees Suffer:
In a B-tree, matching records are scattered throughout the tree at various levels. A range query must perform a full in-order traversal of the affected subtree: descend to the first match, then repeatedly climb back to internal nodes and descend into sibling subtrees to collect the rest.
The traversal constantly moves up and down the tree, potentially reading the same internal nodes multiple times.
| Metric | B+-Tree | B-Tree |
|---|---|---|
| I/O Pattern | Sequential leaf read | Random tree traversal |
| Pages read for k results | ⌈k/entries_per_leaf⌉ | Up to k × height pages |
| Cache efficiency | Excellent (sequential) | Poor (scattered access) |
| Prefetching | Highly effective | Difficult to predict |
| Disk head movement | Minimal | Extensive seeking |
Practical Example:
Query: Find all orders placed between January 1 and January 31 (10,000 matching records, 100 records per leaf page).
B+-Tree: one root-to-leaf descent (a handful of I/Os) plus ⌈10,000 / 100⌉ = 100 sequential leaf reads, roughly 104 pages in total, read in order.
B-Tree (worst case): matching records are scattered across levels, so the traversal can touch on the order of k × height pages, tens of thousands of reads, in random order.
For disk-based storage, this difference is catastrophic. Sequential reads are 100× faster than random reads on HDDs and 10× faster on SSDs.
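Plugging the example's numbers into the formulas from the comparison table above (a tree height of 4 is an assumption for a dataset of this size):

```python
import math

k = 10_000        # matching records in the January range
per_leaf = 100    # records per leaf page
height = 4        # root-to-leaf path length (assumed)

bplus_pages = height + math.ceil(k / per_leaf)   # one descent + sequential leaves
btree_worst = k * height                         # upper bound from the table
print(bplus_pages, btree_worst)                  # 104 vs 40000
```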
Database workloads are dominated by range queries: date ranges, price ranges, alphabetical listings, TOP N queries. By concentrating all data in linked leaves, B+-trees convert these expensive random access patterns into cheap sequential scans.
Modern databases handle thousands of concurrent queries. The leaves-only design significantly simplifies the challenging problem of concurrent B+-tree access.
The Core Insight:
Data modifications (INSERT, UPDATE, DELETE) only affect leaf nodes. Internal nodes change only when leaves split or merge—relatively rare events. This asymmetry enables sophisticated locking protocols.
B-Link Trees:
The B-link tree, a popular B+-tree variant, adds sibling pointers at all levels to further improve concurrency. When a split occurs, the new right sibling is linked via the sibling pointer before the parent is updated, so a concurrent reader that arrives at the old node can simply follow the link rightward to find the moved keys instead of blocking.
This optimization is only practical because data is in leaves—internal node changes are infrequent enough that the additional sibling links don't significantly impact space.
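The split ordering can be sketched in a few lines (single-threaded and simplified; real B-link implementations add per-node latches and a "high key" so readers know when to follow the link):

```python
class Node:
    def __init__(self, keys, next_node=None):
        self.keys = keys
        self.next = next_node     # sibling pointer

def split(node):
    """B-link-style split: link the new right sibling BEFORE the parent
    learns about it, so a concurrent reader overshooting `node` can still
    reach the moved keys via the sibling pointer."""
    mid = len(node.keys) // 2
    right = Node(node.keys[mid:], next_node=node.next)  # step 1: new node
    node.next = right                                   # step 2: side link
    node.keys = node.keys[:mid]                         # step 3: shrink left
    return right.keys[0], right   # separator for the (later) parent update

full = Node([10, 20, 30, 40])
sep, right = split(full)
print(full.keys, sep, right.keys)   # [10, 20] 30 [30, 40]
```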
Why B-Trees Are Harder to Optimize:
If internal nodes contain data, they become write hotspots. Modifying an internal node entry requires exclusive access to a node that's also used for navigation. This creates bottlenecks, especially for frequently-accessed keys near the root.
PostgreSQL's B-tree implementation (which is structurally a B+-tree) uses Lehman and Yao's B-link tree algorithm, allowing many transactions to read and write the index concurrently with minimal blocking. This scalability is fundamentally enabled by the leaves-only data storage pattern.
The leaves-only design affects how B+-trees are implemented in practice. Understanding these implications helps when tuning database performance or debugging index issues.
| Aspect | Implementation Detail | Performance Impact |
|---|---|---|
| Search algorithm | Always descend to leaf, even for exact match | Predictable cost, simple code |
| Insert algorithm | Insert always at leaf level; may propagate split upward | Lower contention on internals |
| Delete algorithm | Delete from leaf; may propagate merge upward | Simpler tombstone handling |
| Node format | Two distinct formats: internal and leaf | Optimized storage for each role |
| Recovery | Only leaf changes logged with full data | Smaller log for internal changes |
| Compression | Different compression suited to each node type | Better overall compression ratio |
Two Node Type Formats:
Production databases typically implement different physical layouts for internal and leaf nodes:
Internal Node Layout:
[header | key₁ | ptr₁ | key₂ | ptr₂ | ... | keyₙ | ptrₙ | ptr₀]
Leaf Node Layout:
[header | slot_directory | free_space | ... | record₃ | record₂ | record₁]
This specialization is only possible because nodes serve different purposes—navigation vs. storage.
Since internal node keys are separators (not actual data), they can be aggressively compressed. If adjacent keys are 'Johnson' and 'Jones', the separator need only store 'Jon'—enough to distinguish the branches. This further increases internal node fanout.
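A sketch of this separator (suffix) truncation, assuming plain lexicographic string keys:

```python
def shortest_separator(left: str, right: str) -> str:
    """Shortest prefix of `right` that still sorts strictly above `left`.

    Any such prefix s satisfies left < s <= right, so it routes `left`
    into the left subtree and `right` into the right subtree. Assumes
    left < right, which the tree's sort order guarantees.
    """
    for i in range(1, len(right) + 1):
        candidate = right[:i]
        if candidate > left:      # first prefix that clears the left key
            return candidate
    return right                  # fallback: the full right key

print(shortest_separator("Johnson", "Jones"))   # Jon
```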
The decision to store data exclusively in leaf nodes is the defining architectural choice of B+-trees. To consolidate why it matters: data-free internal nodes maximize fanout and keep trees shallow; the slim internal levels fit comfortably in RAM; every search costs the same, predictable number of I/Os; range queries become cheap sequential leaf scans; and concentrating modifications in leaves simplifies concurrency control.
What's Next:
The leaves-only design enables another crucial B+-tree feature: linking leaf nodes together for sequential access. The next page explores leaf node linking in depth—how it works, why it's essential for range queries, and how it interacts with concurrency control.
You now deeply understand why storing data only in leaves is the cornerstone of B+-tree design. This single decision enables the fanout, memory efficiency, predictability, and range query performance that make B+-trees the dominant database index structure. Next, we'll explore how leaf node linking enables efficient sequential access.