If you've ever used a database—and you have, whether you realized it or not—you've relied on B-trees. These remarkable structures are the backbone of virtually every database system in production today, from the SQLite database on your phone to the massive PostgreSQL clusters powering major web applications.
The B-tree isn't just another balanced tree variant. It's a purpose-built structure designed from the ground up for disk-based storage. Every design decision, from branching factor to node layout, reflects deep understanding of how magnetic and solid-state storage actually works. Understanding B-trees means understanding how databases achieve their remarkable performance.
By the end of this page, you will understand B-tree structure and invariants, how they're tailored for disk block access patterns, the insertion and deletion algorithms with their splitting and merging mechanics, and variations like B+ trees that further optimize for database workloads.
A B-tree of order m (also called minimum degree t in some literature, where m = 2t) is a self-balancing tree that satisfies the following properties:
Structural Properties:
- Every node has at most m children and at most m-1 keys
- Every internal node other than the root has at least ⌈m/2⌉ children
- The root, if not a leaf, has at least 2 children
- A node with k children stores exactly k-1 keys
- All leaves appear at the same depth
Ordering Properties:
- Keys within each node are stored in ascending order
- For an internal node, every key in the subtree at children[i] lies between keys[i-1] and keys[i] (treating the missing boundary keys as -∞ and +∞)
| Node Type | Minimum Keys | Maximum Keys | Minimum Children | Maximum Children |
|---|---|---|---|---|
| Root (not leaf) | 1 | m-1 | 2 | m |
| Root (is leaf) | 0 | m-1 | 0 (leaf) | 0 (leaf) |
| Internal Node | ⌈m/2⌉-1 | m-1 | ⌈m/2⌉ | m |
| Leaf Node | ⌈m/2⌉-1 | m-1 | 0 (leaf) | 0 (leaf) |
Different textbooks define B-tree 'order' differently. Some use order = maximum children (our definition), others use order = minimum degree. Always check which convention a source uses. The minimum degree convention sets t, then max children = 2t and min children (non-root) = t.
Example: B-tree of order 5
This means every internal node holds between 2 and 4 keys, and between 3 and 5 children. This "half-full" minimum ensures good space utilization while maintaining balance.
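As a quick check on the table and the order-5 example, the bounds for any order m can be computed directly. This is a small Python sketch; the function name `btree_bounds` is mine:

```python
from math import ceil

def btree_bounds(m):
    """Key/child bounds for a non-root internal node in a
    B-tree of order m, where order = maximum number of children."""
    return {
        "min_keys": ceil(m / 2) - 1,
        "max_keys": m - 1,
        "min_children": ceil(m / 2),
        "max_children": m,
    }

# Order 5: between 2 and 4 keys, between 3 and 5 children
print(btree_bounds(5))
```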
B-trees aren't just theoretically interesting—they're engineered to exploit how storage hardware actually works.
Understanding block-based I/O:
Storage devices (HDDs, SSDs, networked storage) transfer data in fixed-size blocks or pages, typically 4KB to 64KB. When you request any byte within a block, the entire block is read:
Request 1 byte at offset 1000:
→ System reads entire 4KB block containing offset 1000
→ Block transferred: bytes 0-4095
→ Your 1 byte extracted from the block
→ The other 4,095 bytes: wasted bandwidth (if unused)
B-tree node sizing strategy:
The key insight: size each B-tree node to exactly fill one disk block. If blocks are 4KB:
Node layout (first attempt):
- Node metadata: 16 bytes
- Key count: 4 bytes
- Keys: 8 bytes each × 500 keys = 4,000 bytes
- Child pointers: 8 bytes each × 501 pointers = 4,008 bytes
(Total: 8,028 bytes → overflows a 4KB block, so we must shrink the node until it fits in one block)
Practical node sizing calculation:
For a 4KB block with a 16-byte header, 8-byte keys, and 8-byte pointers:
Available space: 4096 - 16 = 4080 bytes
Each key-pointer pair: 8 + 8 = 16 bytes
Maximum pairs: 4080 ÷ 16 = 255
So order m ≈ 256 (255 keys, 256 children)
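The same arithmetic can be scripted for other block sizes. This sketch assumes the sizes above (16-byte header, 8-byte keys and pointers); the function name is mine, and the exact bound it computes comes out at 255, close to the m ≈ 256 estimate:

```python
def max_order(block_size, header=16, key_size=8, ptr_size=8):
    """Largest order m such that a full node (m-1 keys, m child
    pointers, plus a header) fits in one block, i.e. the largest m
    with header + (m-1)*key_size + m*ptr_size <= block_size."""
    return (block_size - header + key_size) // (key_size + ptr_size)

print(max_order(4096))   # 255
print(max_order(16384))  # larger blocks give a wider fan-out: 1023
```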
The resulting height:
For n keys with order 256, the height is roughly log₂₅₆ n; with n = 4 × 10⁹, that is log₂₅₆(4 × 10⁹) ≈ 4 levels.
A B-tree indexing 4 billion records requires only 4 disk reads for any search!
With binary trees, searching 4 billion records requires ~32 disk reads (log₂(4×10⁹)). With a typical B-tree, it requires only 4 disk reads. This 8× reduction in I/O is why B-trees dominate disk-based applications.
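The 8× figure is easy to verify with a two-line calculation (a Python sketch):

```python
from math import ceil, log

n = 4_000_000_000
binary_height = ceil(log(n, 2))    # one disk read per level of a binary tree
btree_height = ceil(log(n, 256))   # one disk read per level of an order-256 B-tree

print(binary_height, btree_height)  # 32 4
```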
Additional disk-friendly properties:
Sequential access within nodes: After loading a node, searching through its keys happens in memory—fast sequential access
Predictable I/O: The maximum number of disk reads is exactly the tree height, enabling precise performance prediction
Cache efficiency: Frequently accessed nodes (especially near the root) stay in memory buffers, further reducing disk I/O
Write optimization: Changes are localized to specific nodes; we rarely need to rewrite large portions of the tree
Search in a B-tree generalizes binary search tree search to handle multiple keys per node.
Algorithm:
BTreeSearch(node, target):
if node is null:
return NOT_FOUND
// Binary search within the node's keys
i = 0
while i < node.keyCount and target > node.keys[i]:
i = i + 1
// Check if we found the key
if i < node.keyCount and target == node.keys[i]:
return (node, i) // Found at position i in this node
// If this is a leaf, key doesn't exist
if node.isLeaf:
return NOT_FOUND
// Read child from disk if not in memory
DiskRead(node.children[i])
// Recursively search in appropriate child
return BTreeSearch(node.children[i], target)
Complexity Analysis:
Disk I/O: O(log_m n) = O(height) block reads
CPU operations: O(m × height) with linear scans within nodes, or O(log m × height) = O(log n) with binary search per node
Overall: O(log n) CPU operations, O(log_m n) disk I/Os
Although our pseudocode shows linear search within nodes for clarity, real implementations use binary search. With 255 keys per node, binary search requires at most 8 comparisons (log₂ 255 ≈ 8), all happening in fast in-memory operations after a single disk read.
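The same algorithm in runnable form, using the binary search the tip describes. This is a sketch with an in-memory `Node` type of my own; in a real system, following a child pointer is where `DiskRead` would happen:

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty for leaves
        self.leaf = not self.children

def btree_search(node, target):
    while node is not None:
        i = bisect_left(node.keys, target)        # binary search within the node
        if i < len(node.keys) and node.keys[i] == target:
            return (node, i)                      # found at position i
        if node.leaf:
            return None                           # key doesn't exist
        node = node.children[i]                   # one "disk read" per level

# A two-level example tree: root [10, 20] over three leaves
root = Node([10, 20], [Node([5, 7]), Node([12, 15]), Node([25, 30])])
print(btree_search(root, 15) is not None)  # True: found in the middle leaf
print(btree_search(root, 8))               # None: not present
```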
B-tree insertion follows the same principles as 2-3 tree insertion but handles larger nodes and uses proactive splitting to simplify implementation.
Core Principle:
Keys are always inserted into leaf nodes. When a node has no room left, it splits into two nodes that each take roughly half of its keys, and the median key is promoted to the parent.
Two approaches: Reactive vs. Proactive Splitting
Reactive splitting inserts into the leaf first, then splits any node that overflows, propagating splits back up toward the root. Proactive splitting instead splits every full node encountered on the way down, so the target leaf is guaranteed to have room and no upward pass is needed.
Proactive Insertion Algorithm (preferred for disk):
BTreeInsert(tree, key):
root = tree.root
// If root is full, split it first (creates new root)
if root.keyCount == m - 1:
newRoot = AllocateNode()
newRoot.isLeaf = false
newRoot.children[0] = root
SplitChild(newRoot, 0, root)
tree.root = newRoot
InsertNonFull(newRoot, key)
else:
InsertNonFull(root, key)
InsertNonFull(node, key):
i = node.keyCount - 1
if node.isLeaf:
// Make room and insert key
while i >= 0 and key < node.keys[i]:
node.keys[i + 1] = node.keys[i]
i = i - 1
node.keys[i + 1] = key
node.keyCount = node.keyCount + 1
DiskWrite(node)
else:
// Find child to descend into
while i >= 0 and key < node.keys[i]:
i = i - 1
i = i + 1
DiskRead(node.children[i])
// Proactively split if child is full
if node.children[i].keyCount == m - 1:
SplitChild(node, i, node.children[i])
if key > node.keys[i]:
i = i + 1
InsertNonFull(node.children[i], key)
The Split Operation:
SplitChild(parent, childIndex, fullChild):
// fullChild has m-1 keys; the median is promoted and the rest split between the two halves
medianIndex = floor((m - 1) / 2) // index of median key
// Create new sibling node
newNode = AllocateNode()
newNode.isLeaf = fullChild.isLeaf
// Move upper half of keys to new node
newNode.keyCount = (m - 1) - medianIndex - 1 // keys above the median
for j = 0 to newNode.keyCount - 1:
newNode.keys[j] = fullChild.keys[medianIndex + 1 + j]
// If not leaf, move corresponding children
if not fullChild.isLeaf:
for j = 0 to newNode.keyCount:
newNode.children[j] = fullChild.children[medianIndex + 1 + j]
// Adjust fullChild to keep only lower half
fullChild.keyCount = medianIndex // keeps keys 0 to medianIndex-1
// Make room in parent for the promoted key and new child
for j = parent.keyCount downto childIndex + 1:
parent.children[j + 1] = parent.children[j]
parent.children[childIndex + 1] = newNode
for j = parent.keyCount - 1 downto childIndex:
parent.keys[j + 1] = parent.keys[j]
parent.keys[childIndex] = fullChild.keys[medianIndex] // promote median
parent.keyCount = parent.keyCount + 1
DiskWrite(fullChild)
DiskWrite(newNode)
DiskWrite(parent)
Proactive splitting requires only one downward pass through the tree. We never need to backtrack upward, which would require re-reading nodes from disk. The trade-off is occasional unnecessary splits, but the I/O savings outweigh this cost. Note that proactive splitting works out most cleanly when m is even (m = 2t in the minimum-degree convention): a full node then has 2t-1 keys, and splitting yields two halves of t-1 keys each, exactly the required minimum.
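The insertion path above, translated into a compact runnable sketch. The in-memory `Node` and the class names are mine; the order m = 4 is chosen even so that a split leaves both halves exactly at the minimum key count:

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

class BTree:
    def __init__(self, m):
        self.m = m                                # order = maximum children
        self.root = Node(leaf=True)

    def insert(self, key):
        if len(self.root.keys) == self.m - 1:     # full root: split it first —
            new_root = Node(leaf=False)           # the only way height grows
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        full = parent.children[i]
        mid = (self.m - 1) // 2                   # index of the median key
        right = Node(leaf=full.leaf)
        right.keys = full.keys[mid + 1:]          # upper keys to the new node
        if not full.leaf:
            right.children = full.children[mid + 1:]
            full.children = full.children[:mid + 1]
        parent.keys.insert(i, full.keys[mid])     # promote the median
        parent.children.insert(i + 1, right)
        full.keys = full.keys[:mid]               # lower keys stay behind

    def _insert_nonfull(self, node, key):
        while not node.leaf:
            i = bisect_left(node.keys, key)
            if len(node.children[i].keys) == self.m - 1:
                self._split_child(node, i)        # proactive split on the way down
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        insort(node.keys, key)                    # leaf: insert in sorted position

def inorder(node):
    """Collect all keys in sorted order (a correctness check)."""
    if node.leaf:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out.extend(inorder(child))
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

t = BTree(m=4)
for k in [7, 3, 18, 11, 30, 1, 15]:
    t.insert(k)
print(inorder(t.root))  # [1, 3, 7, 11, 15, 18, 30]
```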
Deletion is more complex than insertion because we must handle underflow while maintaining the minimum key requirement.
Cases to handle:
The target key may live in a leaf (Case 1) or in an internal node (Case 2). Either way, removing a key can leave a node underfull (fewer than ⌈m/2⌉-1 keys), which we repair by borrowing from a sibling or merging with one.
Case 1: Key k in leaf node x
Simply remove k from x.
If x now has too few keys:
If immediate sibling has spare keys:
Borrow one key via rotation through parent
Else:
Merge x with a sibling, pulling down separator from parent
Case 2: Key k in internal node x
Let y = child preceding k, z = child following k
Case 2a: If y has at least ⌈m/2⌉ keys:
Find predecessor k' of k (rightmost key in y's subtree)
Replace k with k' in x
Recursively delete k' from y's subtree
Case 2b: If z has at least ⌈m/2⌉ keys:
Find successor k' of k (leftmost key in z's subtree)
Replace k with k' in x
Recursively delete k' from z's subtree
Case 2c: If both y and z have exactly ⌈m/2⌉-1 keys:
Merge k and all of z into y
Child y now contains k and all keys from z
Recursively delete k from y
Proactive approach (ensuring minimum keys during descent):
Just as proactive splitting prevents backtracking during insertion, we can ensure nodes have at least t keys (where t = ⌈m/2⌉) during descent:
BTreeDelete(node, key):
// Ensure node has at least t keys (unless root)
i = FindKeyPosition(node, key)
if KeyFoundAt(node, i):
if node.isLeaf:
RemoveFromLeaf(node, i)
else:
DeleteFromInternalNode(node, i)
else:
if node.isLeaf:
// Key not in tree
return NOT_FOUND
// Ensure child has enough keys before descending
EnsureMinimumKeys(node, i)
// Recurse into child
BTreeDelete(node.children[i], key)
EnsureMinimumKeys(node, childIndex):
child = node.children[childIndex]
if child.keyCount >= t:
return // Already has enough keys
// Try borrowing from left sibling
if childIndex > 0 and node.children[childIndex - 1].keyCount >= t:
BorrowFromLeft(node, childIndex)
// Try borrowing from right sibling
else if childIndex < node.keyCount and node.children[childIndex + 1].keyCount >= t:
BorrowFromRight(node, childIndex)
// Must merge
else:
if childIndex > 0:
MergeWithLeft(node, childIndex)
else:
MergeWithRight(node, childIndex)
If merging causes the root to become empty (it loses its only key), the merged child becomes the new root, reducing tree height by 1. This is the only operation that decreases B-tree height—just as root splitting is the only operation that increases it.
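The borrowing step (the rotation through the parent used in Case 1 and in `BorrowFromLeft`) is compact enough to sketch on its own. The `Node` shape here is my own in-memory stand-in:

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []
        self.leaf = not self.children

def borrow_from_left(parent, i):
    """Fix an underfull child at index i by rotating one key through
    the parent: the separator drops down into the child, and the left
    sibling's largest key moves up to become the new separator."""
    child, left = parent.children[i], parent.children[i - 1]
    child.keys.insert(0, parent.keys[i - 1])   # separator moves down
    parent.keys[i - 1] = left.keys.pop()       # left sibling's max moves up
    if not left.leaf:                          # keep child pointers aligned
        child.children.insert(0, left.children.pop())

# Order-5 example: the right leaf has dropped below the 2-key minimum
parent = Node([20], [Node([5, 10, 15]), Node([25])])
borrow_from_left(parent, 1)
print(parent.keys, parent.children[0].keys, parent.children[1].keys)
# [15] [5, 10] [20, 25]
```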
While B-trees are powerful, most database systems actually use a variant called the B+ tree. This modification optimizes for the specific access patterns databases need.
Key differences:
All data in leaves: B+ trees store actual data (or pointers to data) only in leaf nodes. Internal nodes contain only keys for navigation.
Leaf nodes are linked: Leaves form a linked list, enabling efficient range scans.
Duplicated keys: Keys from internal nodes are repeated in leaves (they guide navigation but the actual key-value pairs are in leaves).
Higher branching factor: Since internal nodes don't store values, they can fit more keys per block.
Why B+ trees dominate databases:
Range queries: With linked leaves, scanning all records from 30 to 60 just follows leaf links—no tree traversal needed after finding 30.
Better cache utilization: Internal nodes (more frequently accessed) contain only keys and fit more easily in memory. Data-heavy leaves can stay on disk.
Consistent access patterns: Every search traverses to a leaf, making performance predictable.
Full scans: Scanning the entire table just traverses the leaf linked list—no internal nodes needed.
Simpler deletion: Deleting from a leaf doesn't affect navigation (internal keys can remain as "signpost" values).
| Aspect | B-Tree | B+ Tree |
|---|---|---|
| Data location | All nodes | Leaves only |
| Search terminates | Any level | Always at leaf |
| Leaf linking | Typically none | Linked list |
| Range queries | Tree traversal | Leaf scan |
| Internal node size | Keys + values | Keys only |
| Branching factor | Lower | Higher |
| Database usage | Less common | Standard choice |
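The range-query advantage is easy to see in code: once the starting leaf has been located (by a normal root-to-leaf descent, omitted here), the scan just follows leaf links. A B+-style sketch with hand-built, pre-linked leaves:

```python
class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys          # sorted within the leaf
        self.next = next_leaf     # B+ trees link leaves left to right

def range_scan(leaf, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain."""
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out        # past the range: stop early
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

# Three linked leaves covering keys 10..70
l3 = Leaf([50, 60, 70])
l2 = Leaf([30, 40], l3)
l1 = Leaf([10, 20], l2)

print(range_scan(l1, 30, 60))  # [30, 40, 50, 60]
```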
Let's consolidate the time and space complexity for B-tree operations. For a B-tree with n keys and order m:
| Operation | Disk I/O | CPU Time | Notes |
|---|---|---|---|
| Search | O(log_m n) | O(log n) | Height traversal + binary search per node |
| Insert | O(log_m n) | O(m × log_m n) | May split O(height) nodes |
| Delete | O(log_m n) | O(m × log_m n) | May merge/borrow O(height) times |
| Range query (k items) | O(log_m n + k/m) | O(log n + k) | Find start, then scan leaves (B+) |
| Build from n items | O(n × log_m n) | O(n × log n) | n insertions |
| Bulk load (sorted) | O(n/m) | O(n) | Sequential writes, fill leaves |
Space complexity:
O(n) overall: each key is stored once, in one of O(n/m) nodes of size Θ(m).
Space utilization:
B-trees guarantee at least 50% space utilization (minimum ⌈m/2⌉ children). In practice, with random insertions, utilization is typically around 69% (ln 2 = 0.693). Bulk loading from sorted data can achieve near-100% utilization.
In the external memory model where block transfers dominate, B-trees achieve O(log_B n) I/Os per operation, where B is the block size. This is provably optimal for comparison-based searching, making B-trees the theoretical best for disk-based search structures.
Implementing a production B-tree involves numerous practical considerations beyond the core algorithms:
1. Node layout and serialization:
Typical node layout on disk:
┌─────────────────────────────────────────────────┐
│ Header (16 bytes) │
│ - Node type (leaf/internal) │
│ - Key count │
│ - Pointer to parent (optional) │
├─────────────────────────────────────────────────┤
│ Keys array (variable) │
│ key[0], key[1], ..., key[n-1] │
├─────────────────────────────────────────────────┤
│ Children/Values array (variable) │
│ - For internal: child node pointers │
│ - For leaves: value data or value pointers │
├─────────────────────────────────────────────────┤
│ Leaf: next/prev leaf pointers (B+ trees) │
└─────────────────────────────────────────────────┘
2. Buffer pool management:
Production systems don't read/write directly to disk. A buffer pool caches frequently-used pages in memory: pages are pinned while in use, evicted under a replacement policy such as LRU, and dirty pages are flushed back to disk before their frames are reused.
3. Concurrency control:
Multiple threads accessing the B-tree need coordination: per-node latches acquired root-to-leaf ("latch crabbing," releasing an ancestor's latch once the child is known to be safe), or variants such as B-link trees that tolerate concurrent splits with less locking.
4. Recovery and durability:
Database B-trees must survive crashes: modifications are recorded in a write-ahead log (WAL) before pages are updated in place, and periodic checkpoints bound how much of the log must be replayed after a failure.
5. Variable-length keys:
Not all keys are fixed-size (e.g., strings): nodes then typically use a slotted-page layout (an array of offsets pointing to key data), and techniques like prefix compression and suffix truncation pack more keys into each block.
6. Bulk loading optimization:
When building a B-tree from sorted data, direct construction is faster than repeated inserts:
BulkLoad(sorted_data):
1. Fill leaf pages sequentially (near-100% utilization)
2. Create internal pages bottom-up, one level at a time
3. No rebalancing needed—tree is perfectly balanced by construction
This achieves O(n/B) I/O complexity (where B is the number of keys per block) versus O(n × log_B n) I/Os for n individual insertions.
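The three bulk-load steps can be sketched directly. The node representation and names below are mine, using B+-style "signpost" separators (each later child's smallest key):

```python
def chunk(xs, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def smallest(node):
    """Smallest key stored anywhere under a node."""
    return node[1][0] if node[0] == "leaf" else smallest(node[2][0])

def bulk_load(sorted_keys, fanout):
    """Fill leaves left to right at full capacity, then build each
    internal level bottom-up. Nodes are ('leaf', keys) or
    ('internal', separators, children). No rebalancing is needed:
    the tree is balanced by construction."""
    nodes = [("leaf", keys) for keys in chunk(sorted_keys, fanout)]
    while len(nodes) > 1:
        nodes = [("internal", [smallest(c) for c in kids[1:]], kids)
                 for kids in chunk(nodes, fanout)]
    return nodes[0]

root = bulk_load(list(range(1, 17)), fanout=4)
print(root[1])  # separators [5, 9, 13] over four completely full leaves
```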
We've explored B-trees in depth: their structure, their operations, and the practical considerations behind production implementations.
What's next:
In the final page of this module, we'll connect B-trees to real database systems, exploring how PostgreSQL, MySQL, and other systems use B-trees for indexing, why database designers choose B-trees over alternatives, and how to design effective indexes using your B-tree knowledge.
You now understand B-trees at a conceptual and algorithmic level: their structure optimized for disk access, their insertion and deletion mechanisms with splitting and merging, and their B+ tree evolution for database workloads. This knowledge is directly applicable to database index design and performance optimization.