If there's one data structure that powers the modern database world, it's the B-tree. Invented by Rudolf Bayer and Edward McCreight in 1970 at Boeing Scientific Research Laboratories, the B-tree has become the de facto standard index structure in virtually every relational database system: PostgreSQL, MySQL, SQL Server, Oracle, SQLite, and countless others.
The B-tree's dominance isn't accidental. It was specifically designed to minimize disk I/O, making it ideally suited for databases where data far exceeds available memory. More than 50 years later, despite dramatic changes in storage technology, B-trees remain the default choice for most indexing workloads.
In this page, we'll explore B-trees from their mathematical foundations through their practical behavior in production database systems, giving you the deep understanding needed to reason about B-tree index performance and design effective indexing strategies.
By the end of this page, you'll understand the B-tree data structure at a fundamental level, including node organization, search algorithms, insertion with splitting, deletion with merging, the B+ tree variant used in databases, and the factors that determine B-tree performance in practice.
To understand why B-trees dominate database indexing, we must first understand the problem they solve: minimizing disk I/O while maintaining efficient search, insert, and delete operations.
The Memory Hierarchy Reality:
Modern computer systems have a dramatic performance gap between memory and disk: a RAM access costs on the order of 100 nanoseconds, an SSD random read around 0.1 milliseconds, and an HDD seek around 10 milliseconds. Each tier is orders of magnitude slower than the one above it.
When data doesn't fit in memory, database performance is dominated by disk I/O. The goal of any disk-based data structure is to minimize the number of disk reads per operation.
Why Not Binary Search Trees?
A balanced binary search tree (like an AVL tree or Red-Black tree) provides O(log₂ N) search performance. For a billion elements, that's about 30 comparisons—which sounds excellent. But there's a critical problem:
Each comparison might require a disk read.
In a binary tree, each node has only 2 children. To store 1 billion elements, you need a tree of height ~30. If each node is on a different disk page, you need 30 disk reads per lookup—that's 30 × 10ms = 300ms per search on an HDD, or about 3 queries per second.
| Tree Type | Fanout | Height (1B elements) | Disk Reads/Lookup | Time (HDD) | Time (SSD) |
|---|---|---|---|---|---|
| Binary tree | 2 | 30 | 30 | 300ms | 3ms |
| 4-way tree | 4 | 15 | 15 | 150ms | 1.5ms |
| 100-way tree | 100 | 5 | 5 | 50ms | 0.5ms |
| B-tree (500-way) | 500 | 3-4 | 3-4 | 30-40ms | 0.3-0.4ms |
The B-tree Solution:
B-trees solve the disk I/O problem by maximizing the fanout—the number of children per node. Instead of 2 children (binary), B-trees have hundreds or thousands of children per node.
The key insight is that a single disk read retrieves an entire disk page (typically 4KB-16KB). Rather than wasting most of that page on a single tree node with 2 children, B-trees pack as many keys and pointers as possible into each page.
With a fanout of 500, a billion keys fit in a tree of just three or four levels (log₅₀₀(10⁹) ≈ 3.3), so a lookup costs at most three or four disk reads.
The Logarithmic Base Matters:
Note that log₅₀₀(N) = log₂(N) / log₂(500) ≈ log₂(N) / 9. Increasing the fanout from 2 to 500 therefore reduces the tree height by roughly a factor of 9.
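This arithmetic is easy to verify. A quick sketch (illustrative only, using the one-billion-key figure from the table above):

```python
import math

def tree_height(n_keys: int, fanout: int) -> int:
    """Number of levels needed to index n_keys with the given fanout."""
    return math.ceil(math.log(n_keys, fanout))

N = 1_000_000_000
for fanout in (2, 4, 100, 500):
    print(f"fanout {fanout:>3}: height {tree_height(N, fanout)}")
```

Running this reproduces the height column of the comparison table: 30 levels for a binary tree, 15 for 4-way, 5 for 100-way, and 4 for 500-way.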
B-trees trade off in-memory comparison efficiency for disk I/O efficiency. It's faster to do 100 comparisons within a single page in RAM than to do 2 comparisons across 50 pages on disk. This trade-off is wildly profitable—the disk I/O saved dwarfs the extra in-memory comparisons.
A B-tree of order m (also called minimum degree t in some definitions) satisfies the following properties:
Structural Properties:
- Every node has at most m children (and thus at most m-1 keys).
- Every non-root internal node has at least ⌈m/2⌉ children (at least ⌈m/2⌉-1 keys).
- The root has at least 2 children, unless it is a leaf.
- All leaves appear at the same depth: the tree is perfectly balanced.
Node Structure:
Each node contains:
- a header with the node type (internal or leaf) and the current key count n
- a sorted array of n keys
- for internal nodes, n+1 child pointers; for leaf nodes, n values (row pointers or the rows themselves)
The Search Invariant:
For each key kᵢ in an internal node: every key in the subtree rooted at childᵢ is less than kᵢ, and every key in the subtree rooted at childᵢ₊₁ is greater than kᵢ.
This invariant enables binary search within nodes to find the correct child pointer.
```text
B-tree Node Structure (Order m=5, so 2-4 keys per node)

┌────────────────────────────────────────────┐
│ INTERNAL NODE                              │
├────────────────────────────────────────────┤
│ Header: [type=INTERNAL, n=3]               │
├────────────────────────────────────────────┤
│ Keys:     │ 25 │ 50 │ 75 │ -- │            │
├────────────────────────────────────────────┤
│ Pointers: │ p0 │ p1 │ p2 │ p3 │            │
└────────────────────────────────────────────┘
              │    │    │    │
              ▼    ▼    ▼    ▼
             <25 25-50 50-75  >75

┌────────────────────────────────────────────┐
│ LEAF NODE                                  │
├────────────────────────────────────────────┤
│ Header: [type=LEAF, n=4]                   │
├────────────────────────────────────────────┤
│ Keys:   │ 10 │ 15 │ 20 │ 23 │              │
├────────────────────────────────────────────┤
│ Values: │ v1 │ v2 │ v3 │ v4 │              │ ← Row pointers or data
├────────────────────────────────────────────┤
│ Next Leaf: ──────────────────────────────→ │ ← For range scans (B+ tree)
└────────────────────────────────────────────┘
```

Why the Minimum Fill Requirement?
The requirement that non-root nodes be at least half-full serves critical purposes: it bounds the worst-case height, since the effective fanout never drops below ⌈m/2⌉, and it guarantees at least 50% space utilization across the index.
Height Calculation:
For a B-tree of order m with N keys, the height ranges from about log_m N (every node full) to about log_{⌈m/2⌉} N (every node minimally full). For m = 500 and N = 10⁹, that is roughly 3.3 to 3.8 levels, matching the 3-4 figure in the fanout table above.
This height guarantee is what makes B-trees so powerful—regardless of the data distribution, lookups require at most 4 disk reads for billion-element indexes.
Different sources use different terminology. 'Order m' typically means max m children (Knuth's definition). 'Minimum degree t' means non-root nodes have t to 2t children (CLRS). When reading about B-trees, always verify which definition is being used.
Searching a B-tree combines traversal from root to leaf with binary search within each node. The algorithm is elegant and efficient.
Search Algorithm:
```python
def btree_search(node, key):
    # Find the smallest index i where key <= node.keys[i]
    i = 0
    while i < node.n and key > node.keys[i]:
        i += 1

    # Case 1: exact match in this node
    if i < node.n and key == node.keys[i]:
        return (node, i)            # return node and position

    # Case 2: not found in this node
    if node.is_leaf:
        return None                 # key doesn't exist

    # Recursively search the appropriate child
    return btree_search(node.children[i], key)
```
Searching Within a Node:
The inner search (finding position i) can be done via: a linear scan, O(m) comparisons per node, or a binary search, O(log₂ m) comparisons per node.
In practice, for typical B-tree orders (m < 500), linear search is often faster due to better cache behavior and simpler code.
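As a sketch, the binary-search variant maps directly onto Python's standard bisect module (the helper name is hypothetical; it assumes the node's keys are sorted):

```python
import bisect

def find_child_index(keys, key):
    """Index of the child subtree to descend into: the smallest i
    with key <= keys[i], or len(keys) if key exceeds every key."""
    return bisect.bisect_left(keys, key)

# With keys [30, 60, 90], a lookup for 67 descends into child 2
print(find_child_index([30, 60, 90], 67))
```

The same index i works for both cases: if keys[i] equals the search key, it is the match position; otherwise it is the child pointer to follow.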
```text
Search for key 67:

Step 1: Read root
┌───────────────────────┐
│ [30, 60, 90]          │  67 > 60 and 67 < 90 → follow child[2]
└───────────────────────┘
            │
            ▼
Step 2: Read internal node
┌───────────────────────┐
│ [65, 70, 80, 85]      │  67 > 65 and 67 < 70 → follow child[1]
└───────────────────────┘
            │
            ▼
Step 3: Read leaf node
┌───────────────────────┐
│ [66, 67, 68, 69]      │  67 == keys[1] → FOUND
│ [p1, p2, p3, p4]      │  return pointer p2
└───────────────────────┘

Total disk reads: 3 (root often cached → 2)
```

Time Complexity Analysis:

A search visits O(log_m N) nodes, one per level, performing O(log₂ m) comparisons (binary search) or O(m) comparisons (linear scan) within each. Disk I/O, the dominant cost, is O(log_m N).
Practical Performance:
In practice, B-tree search performance is dominated by disk I/O. The CPU comparisons within each node are nearly instantaneous compared to disk access. This is why we optimize for minimizing tree height, even at the cost of more comparisons per node.
Caching Effects:
The root node and upper internal nodes are accessed by every lookup. Databases keep these nodes cached in the buffer pool, often resulting in: the root being effectively always in memory, the first internal level cached the vast majority of the time, and only the lowest levels regularly requiring disk reads.
For a height-4 B-tree with good caching, a typical lookup might require only 1-2 actual disk reads, not 4.
Database buffer pool sizing directly affects B-tree performance. A larger buffer pool means more internal nodes stay cached. The ideal buffer pool can hold the entire index's internal nodes, leaving only leaf nodes requiring disk access.
Insertion in a B-tree is more complex than search because it must maintain all B-tree invariants, particularly the balanced height property and the minimum/maximum key constraints.
Insertion Algorithm Overview:
1. Traverse from the root to the leaf where the key belongs, exactly as in search.
2. If the leaf has room, insert the key in sorted position. Done.
3. If the leaf is full, split it and promote the median key into the parent.
4. If the parent is also full, split it too; splits can cascade all the way to the root.
The Splitting Operation:
When a node with m-1 keys needs to accept a new key:
1. Allocate a new sibling node.
2. Keep the lower half of the keys in the original node.
3. Move the upper half into the new sibling.
4. Promote the median key into the parent as the separator between the two halves.
```text
Splitting a Full Node (Order m=5, max 4 keys per node)

Before: insert key 35 into a full leaf
┌─────────────────────────────────────────┐
│ Parent: [30, 50, 70]                    │
│        p0   p1   p2   p3                │
└─────────────────────────────────────────┘
              │ (p1 points to this full leaf)
              ▼
┌────────────────────────────────┐
│ Leaf: [32, 34, 36, 38] ← FULL  │  insert 35 → overflow!
└────────────────────────────────┘

Step 1: split at the median (35)
┌────────────────────┐   ┌────────────────────┐
│ Left: [32, 34]     │   │ Right: [36, 38]    │
└────────────────────┘   └────────────────────┘
  stays in place           new sibling created
        median key 35 is promoted to the parent

After the split:
┌─────────────────────────────────────────┐
│ Parent: [30, 35, 50, 70]                │ ← 35 promoted here
│        p0   p1   p2   p3   p4           │
└─────────────────────────────────────────┘
              │    │
              ▼    ▼
        [32, 34]  [36, 38]
        original   new sibling
```

Root Splitting (Tree Height Increase):
When a split propagates all the way to the root and the root itself splits, the tree grows taller: a new root is allocated holding only the promoted median key, with the two halves of the old root as its children.
This is the only way a B-tree increases in height. It ensures that all paths from root to leaves remain the same length (perfect balance).
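The leaf-level split itself is mechanical. A minimal sketch on a bare key list (illustrative helper; real code also moves values and child pointers):

```python
def split_leaf(keys):
    """Split an overfull sorted key list at the median.
    Returns (left_keys, median, right_keys); the median is
    promoted into the parent as the new separator key."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# The example above: [32, 34, 36, 38] plus the new key 35
print(split_leaf([32, 34, 35, 36, 38]))  # → ([32, 34], 35, [36, 38])
```

This follows the classic B-tree convention shown in the diagram, where the median leaves the leaf entirely; a B+ tree would instead keep a copy of the separator in the right leaf.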
Insertion Complexity: locating the leaf costs O(log_m N) disk reads; splits are rare enough that their cost amortizes to O(1) extra work per insert.
Page Splits and I/O:
In database terms, each split involves: writing the new right-sibling page, rewriting the original (now half-empty) page, updating the parent page with the promoted separator, and updating allocation metadata for the new page.
This is 4 I/O operations per split level. In the worst case (splits to root), a single insert might trigger 4 × height = 12-16 I/O operations. However, this is rare—most inserts don't cause any splits.
Inserting keys in random order causes more splits than sequential insertion. Random inserts fill nodes unpredictably, while sequential inserts (like auto-increment IDs) append to the rightmost leaf, minimizing splits. This is why auto-increment primary keys are performance-friendly.
Deletion in a B-tree must maintain the minimum fill constraint: non-root nodes must have at least ⌈m/2⌉-1 keys. When deletion causes underflow, the tree must be rebalanced.
Deletion Algorithm Overview:
Case 1: Key is in a leaf node. Remove it directly; if the leaf falls below the minimum fill, rebalance as described below.
Case 2: Key is in an internal node. Replace it with its in-order predecessor (or successor) from a leaf, then delete that key from the leaf, reducing the problem to Case 1.
Rebalancing on Underflow:
When a node has too few keys after deletion:
Option A: Borrow from sibling (Rotation). If an adjacent sibling has more than the minimum number of keys, rotate through the parent: the separator key moves down into the underflowing node, and the sibling's nearest key moves up to become the new separator.
Option B: Merge with sibling. If no sibling can spare a key, merge the underflowing node with an adjacent sibling, pulling the separator key down from the parent. This removes a key from the parent, which may itself underflow and cascade upward.
```text
Rebalancing After Deletion (Order m=5, min 2 keys per node)

Scenario 1: underflow, borrow from right sibling

Before: delete key 18 from the left leaf
┌───────────────────────────┐
│ Parent: [20, 40]          │
└───────────────────────────┘
     │            │
     ▼            ▼
┌────────────┐  ┌─────────────────┐
│ [15, 18]   │  │ [25, 30, 35]    │ ← right sibling has 3 keys
└────────────┘  └─────────────────┘
  delete 18 → underflow (only 1 key left)

After: borrow via rotation
┌───────────────────────────┐
│ Parent: [25, 40]          │ ← parent's 20 moved down, sibling's 25 moved up
└───────────────────────────┘
     │            │
     ▼            ▼
┌────────────┐  ┌────────────┐
│ [15, 20]   │  │ [30, 35]   │ ← balanced: both have ≥ 2 keys
└────────────┘  └────────────┘

Scenario 2: merge when siblings can't spare a key

Before: delete key 42, all siblings at minimum
┌───────────────────────────────┐
│ Parent: [40, 50]              │
└───────────────────────────────┘
    │          │          │
    ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐
│[35,38] │ │[42,45] │ │[52,55] │  all at minimum (2 keys)
└────────┘ └────────┘ └────────┘
  delete 42 → underflow, can't borrow

After: merge with left sibling, pull down separator
┌─────────────────────┐
│ Parent: [50]        │ ← 40 pulled down into merged node
└─────────────────────┘
     │            │
     ▼            ▼
┌─────────────────┐ ┌────────┐
│[35, 38, 40, 45] │ │[52,55] │ ← merged node
└─────────────────┘ └────────┘
```

Root Shrinking (Tree Height Decrease):
If the root becomes empty after a merge (its only children merged into one), the merged node becomes the new root. This is the only way a B-tree decreases in height.
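Both rebalancing moves can be sketched on bare key lists (hypothetical helpers, ignoring child pointers and values):

```python
def borrow_from_right(left, sep, right):
    """Rotation: the parent separator moves down into the left node,
    and the right sibling's smallest key moves up as the new separator."""
    left = left + [sep]
    new_sep, right = right[0], right[1:]
    return left, new_sep, right

def merge_siblings(left, sep, right):
    """Merge two siblings, pulling the parent separator down between them."""
    return left + [sep] + right

# The two scenarios from the diagram above:
print(borrow_from_right([15], 20, [25, 30, 35]))  # → ([15, 20], 25, [30, 35])
print(merge_siblings([35, 38], 40, [45]))         # → [35, 38, 40, 45]
```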
Deletion Complexity: O(log_m N) disk reads, matching insertion; borrows and merges are rare, and their cost amortizes to O(1) per delete.
Lazy Deletion in Databases:
Many database systems don't immediately perform deletion rebalancing. Instead, they use lazy deletion: entries are marked dead (or removed without rebalancing), underfull pages are simply tolerated, and a background process (such as PostgreSQL's VACUUM) later reclaims and compacts the space.
This approach reduces immediate deletion overhead at the cost of some space inefficiency and occasional expensive cleanup operations.
Deletion complexity is why many systems avoid it or defer it. Some databases mark rows as deleted but leave index entries until vacuum. Others use version-based approaches (MVCC) where 'deletion' is just creating a new version marked as deleted. Understanding this helps explain index bloat and vacuum importance.
While we've been discussing B-trees, virtually all database systems actually use a variant called the B+ tree. The B+ tree modifies the basic B-tree structure in ways that significantly improve database performance.
Key Differences from B-trees:
| Characteristic | B-tree | B+ tree |
|---|---|---|
| Data location | Internal and leaf nodes | Leaf nodes only |
| Internal node contents | Keys + data + child pointers | Keys + child pointers only |
| Leaf node linking | Not linked | Doubly-linked list |
| Duplicate keys in internal nodes | No | Yes (separator copies) |
| Fanout for same node size | Lower (data takes space) | Higher (more keys per node) |
| Range query efficiency | Tree traversal needed | Sequential leaf scan |
Why Databases Use B+ Trees:
1. Higher Fanout: Since internal nodes don't store data (only keys and pointers), they can fit more keys per node. Higher fanout means shorter trees, fewer disk reads, and faster lookups.
2. Efficient Range Scans: Leaf nodes are linked into a doubly-linked list. Range queries can: descend the tree once to the leaf holding the start key, then follow sibling links forward (or backward) until the end of the range, with no further tree traversal.
In a pure B-tree, range scans require tree traversal for each key, which is much less efficient.
3. Predictable Performance: All data accesses follow the same path length (root → internal → ... → leaf). In a B-tree, internal node hits might be faster than leaf hits, causing variable performance.
4. Easier Caching: With data only in leaves, internal nodes are purely structural. They can be cached more aggressively since they're smaller and used more frequently.
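The fanout gain in point 1 is easy to quantify. A rough sketch with illustrative sizes only (a 16 KB page, 8-byte keys, 8-byte pointers, 100-byte rows; real pages also carry headers and per-slot overhead):

```python
PAGE_BYTES = 16 * 1024

def fanout(key_bytes, ptr_bytes, row_bytes=0):
    """Children per node when each slot holds key + pointer (+ inline row)."""
    return PAGE_BYTES // (key_bytes + ptr_bytes + row_bytes)

btree_fanout = fanout(8, 8, row_bytes=100)  # B-tree: rows live in every node
bplus_fanout = fanout(8, 8)                 # B+ tree: internal nodes hold keys + pointers only
print(btree_fanout, bplus_fanout)           # → 141 1024
```

Moving the rows out of internal nodes raises the fanout from ~141 to ~1024 here, which is exactly what shortens the tree.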
```text
B+ Tree Structure

                 ┌─────────────────────┐
                 │ ROOT (keys only)    │
                 │     [30]  [60]      │
                 └─────────────────────┘
                  /         |         \
       ┌───────────┐  ┌───────────┐  ┌───────────┐
       │ [10, 20]  │  │ [40, 50]  │  │ [70, 80]  │
       └───────────┘  └───────────┘  └───────────┘
         /      \       /      \       /      \
     ┌────┐  ┌────┐  ┌────┐  ┌────┐  ┌────┐  ┌────┐
     │5,9 │↔│12, │↔│32, │↔│45, │↔│65, │↔│75, │
     │    │  │18  │  │38  │  │55  │  │68  │  │85  │
     │data│  │data│  │data│  │data│  │data│  │data│
     └────┘  └────┘  └────┘  └────┘  └────┘  └────┘
       LEAF NODES hold the actual data/pointers,
       LINKED for efficient range scans

Range query: WHERE key BETWEEN 15 AND 50
1. Navigate to the leaf containing 15 (tree traversal)
2. Scan right through linked leaves: 18 → 32 → 38 → 45
3. Stop when key > 50
```

When database documentation mentions 'B-tree indexes,' it almost always means B+ trees. The term 'B-tree' is commonly used as a generic name for this family of self-balancing tree data structures. PostgreSQL, MySQL, SQLite, SQL Server: all use B+ tree variants for their primary index structures.
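The leaf-level range scan described above can be written in a few lines (a hypothetical minimal Leaf class; illustrative keys):

```python
class Leaf:
    """A B+ tree leaf: sorted keys plus a link to the next leaf."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys
        self.next = next_leaf

def range_scan(start_leaf, low, high):
    """Walk the linked leaf list, yielding every key in [low, high]."""
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return          # past the range: stop, no backtracking needed
            if k >= low:
                yield k
        leaf = leaf.next        # follow the sibling link

# Leaves [12, 18] → [32, 38] → [45, 55]; scanning 15..50 yields 18, 32, 38, 45
l3 = Leaf([45, 55])
l2 = Leaf([32, 38], l3)
l1 = Leaf([12, 18], l2)
print(list(range_scan(l1, 15, 50)))
```

Note that the tree is traversed only once, to find the starting leaf; everything after that is a sequential walk.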
Understanding B-tree performance characteristics helps you predict query costs and design effective indexes.
Operation Complexity Summary:
| Operation | Disk I/O (Worst) | CPU (Worst) | Notes |
|---|---|---|---|
| Point lookup | O(log_m N) | O(m × log_m N) | Usually 2-4 I/O for billions of rows |
| Range scan | O(log_m N + K) | O(m × log_m N + K) | K = number of results |
| Insert | O(log_m N) | O(m × log_m N) | Amortized; splits are rare |
| Delete | O(log_m N) | O(m × log_m N) | Amortized; often lazy in databases |
| Min/Max | O(log_m N) | O(m × log_m N) | Navigate to leftmost/rightmost leaf |
| Full scan | O(N/B) | O(N) | B = records per leaf page |
Factors Affecting Real-World Performance:
1. Cache Hit Rate: the fraction of node reads served from the buffer pool. Upper levels are nearly always cached, so real lookups typically cost 1-2 disk reads rather than the full tree height.
2. Fill Factor: how full pages are kept when the index is built (commonly around 90%). Lower fill factors leave room for future inserts and reduce page splits, at the cost of a slightly larger and taller index.
3. Key Size: larger keys mean fewer entries per page, lower fanout, and a taller tree. Compact keys (integers rather than long strings) keep the tree shallow.
4. Index Fragmentation: after many random inserts and deletes, pages end up partly empty and scattered on disk, hurting range-scan locality. Rebuilding the index restores density.
B-tree indexes handle 80% of database indexing needs excellently. Understanding B-trees deeply—their structure, performance characteristics, and limitations—provides the foundation for effective index design. The remaining 20% of cases may benefit from specialized indexes (hash, GiST, full-text, etc.).
The B-tree family includes several important variants optimized for specific scenarios.
B* Trees:
B* trees require nodes to be at least 2/3 full (instead of 1/2). They achieve this by redistributing keys among siblings before splitting: when a node overflows, keys are first shifted into a non-full adjacent sibling; only when both are full are the two nodes split into three, each about 2/3 full.
B-link Tree:
B-link trees add right-sibling pointers to internal nodes (not just leaves). This enables lock-free concurrent access: a reader that arrives at a node mid-split simply follows the right-sibling link to find keys that have moved, so searches proceed without waiting on structural modification locks.
Copy-on-Write B-trees (CoW-B-trees):
Instead of modifying nodes in place, CoW-B-trees create new versions of modified nodes: only the path from the changed leaf up to the root is copied, old versions remain readable for concurrent readers and snapshots, and crash consistency is simplified because existing pages are never overwritten.
```text
Copy-on-Write B-tree Update

Original tree (version 1):
        ┌───────┐
   ┌────│ ROOT1 │────┐
   │    └───────┘    │
   ▼                 ▼
┌───────┐        ┌───────┐
│ NODE2 │        │ NODE3 │
└───────┘        └───────┘
   │                 │
   ▼                 ▼
┌───────┐        ┌───────┐
│ LEAF4 │        │ LEAF5 │ ← modify this leaf
└───────┘        └───────┘

After modification (version 2): copy the path from the modified leaf to the root
        ┌───────┐
   ┌────│ ROOT1'│────┐    ← new root (copy)
   │    └───────┘    │
   ▼                 ▼
┌───────┐        ┌───────┐
│ NODE2 │        │ NODE3'│ ← new copy
└───────┘        └───────┘
   │                 │
   ▼                 ▼
┌───────┐        ┌───────┐
│ LEAF4 │        │ LEAF5'│ ← new copy with the modification
└───────┘        └───────┘
(NODE2 and LEAF4 are shared with version 1)

Version 1 (ROOT1) still works; version 2 (ROOT1') has the update.
Only 3 nodes were written, not the entire tree.
```

Prefix-Compressed B-trees:
For string keys with common prefixes, prefix compression stores only the distinguishing suffix:
This significantly reduces space for sorted string keys.
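Prefix compression can be illustrated on a sorted run of string keys (a simplified sketch; real implementations typically store the shared prefix once in the page header and track per-key suffix lengths):

```python
import os

def prefix_compress(keys):
    """Store the common prefix once; keep only the distinguishing suffixes."""
    prefix = os.path.commonprefix(keys)
    return prefix, [k[len(prefix):] for k in keys]

# Three 13-character keys compress to one prefix plus 1-character suffixes
print(prefix_compress(["customer_1001", "customer_1002", "customer_1003"]))
```

Here 39 characters of key data shrink to a 12-character prefix plus three single-character suffixes, which is why the technique pays off so well on sorted string keys.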
Write-Optimized B-trees (Bε-trees, LSM-trees):
Traditional B-trees require random I/O for each insert. Write-optimized variants batch writes: Bε-trees buffer pending updates inside internal nodes and flush them downward in bulk, while LSM-trees accumulate writes in memory and merge sorted runs to disk sequentially.
These variants are covered in detail when we discuss NoSQL databases and write-heavy workloads.
Each database implements B-trees differently. PostgreSQL uses B-link trees with MVCC. MySQL/InnoDB uses clustered B+ trees with page-level locking. Understanding your specific database's implementation helps with performance tuning and debugging.
We've explored B-tree indexes from their foundational motivation through their practical implementation in database systems. Let's consolidate the essential insights:
- B-trees minimize disk I/O by maximizing fanout: even billion-row indexes stay 3-4 levels tall.
- Databases actually use B+ trees, which keep data only in linked leaf nodes, raising fanout and turning range scans into sequential leaf walks.
- Splits and merges keep every root-to-leaf path the same length; they are rare, and their cost amortizes to O(1) per operation.
- Real-world performance is governed by cache hit rate, fill factor, key size, and fragmentation as much as by asymptotic complexity.
- Variants (B*, B-link, copy-on-write, prefix-compressed, write-optimized) tune the same structure for concurrency, space, or write throughput.
What's Next:
With B-tree mastery established, we'll explore an alternative index structure in the next page: Hash indexes. You'll understand when hash indexes outperform B-trees, their fundamental limitations, and why they're used sparingly despite their O(1) lookup promise.
You now possess deep knowledge of the most important data structure in database technology. B-tree understanding is foundational to query optimization, index design, and performance debugging. This knowledge serves you across every relational database system.