Every index entry you see in a database arrived there through B-tree insertion. When you execute INSERT INTO orders VALUES (...), the database must update every index on that table—each update invoking the B-tree insertion algorithm to place the new key in exactly the right location while preserving all structural invariants.
B-tree insertion is more complex than search because it must maintain balance. Unlike a naive binary search tree that can degenerate into a linked list with skewed insertions, a B-tree guarantees logarithmic height regardless of insertion order. This guarantee comes from a beautiful mechanism: node splitting.
By completing this page, you will understand: (1) The complete B-tree insertion algorithm with formal correctness, (2) The mechanics of inserting into nodes at various fill levels, (3) When and how node splitting maintains balance, (4) Proactive vs. reactive splitting strategies, (5) How insertion affects tree height, and (6) Production implementation patterns including concurrent insertion handling.
B-tree insertion follows a two-phase approach: first search down the tree for the leaf where the key belongs, then insert the key, splitting nodes as needed to keep every invariant intact.

The Core Invariants Insertion Must Preserve:

- Occupancy: every node except the root holds between t-1 and 2t-1 keys (the root may hold as few as 1)
- Ordering: keys within each node remain sorted, and each subtree's key range respects its separators
- Balance: all leaves remain at the same depth
The challenge is that inserting a key into a full node would violate the maximum occupancy invariant. The solution is splitting: dividing an overfull node into two nodes of acceptable size.
High-Level Algorithm:
```
BTREE-INSERT(tree, key, value):
    // Case 1: Root is full - split it first (tree grows taller)
    IF tree.root.numKeys = 2t - 1:
        newRoot ← AllocateNode()
        newRoot.isLeaf ← FALSE
        newRoot.children[0] ← tree.root
        SPLIT-CHILD(newRoot, 0)    // Split old root
        tree.root ← newRoot

    // Now root is guaranteed to have space
    INSERT-NONFULL(tree.root, key, value)
```
This elegantly simple structure handles the special case of a full root before descending, ensuring we never need to propagate splits upward after the insertion completes—a technique called proactive splitting or preemptive splitting.
The root is the only node allowed to have fewer than t-1 keys (it can have just 1 key). When the root splits, the tree's height increases by 1. This is the only operation that increases tree height—splits at lower levels simply redistribute keys horizontally.
The core insertion logic handles two cases: inserting into a leaf (base case) and recursively descending through internal nodes.
Algorithm: INSERT-NONFULL
```
INSERT-NONFULL(node, key, value):
    i ← node.numKeys - 1
    IF node.isLeaf:
        // Case A: Insert into leaf node
        // Shift keys right to make room
        WHILE i >= 0 AND key < node.keys[i]:
            node.keys[i+1] ← node.keys[i]
            node.values[i+1] ← node.values[i]
            i ← i - 1
        // Insert new key at position i+1
        node.keys[i+1] ← key
        node.values[i+1] ← value
        node.numKeys ← node.numKeys + 1
        DISK-WRITE(node)
    ELSE:
        // Case B: Internal node - find correct child
        WHILE i >= 0 AND key < node.keys[i]:
            i ← i - 1
        i ← i + 1    // i is now the index of child to descend into
        DISK-READ(node.children[i])
        // Proactive splitting: if child is full, split it now
        IF node.children[i].numKeys = 2t - 1:
            SPLIT-CHILD(node, i)
            // After split, determine which half to insert into
            IF key > node.keys[i]:
                i ← i + 1
        INSERT-NONFULL(node.children[i], key, value)
```
By splitting full nodes on the way down, we guarantee that when we reach the leaf, there is space to insert. This single-pass approach is more efficient than the alternative (insert, then propagate splits upward) because we never re-traverse the path or maintain a stack of ancestors.
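Case A (the leaf insert with its shifting loop) can be sketched in TypeScript on plain parallel arrays. The function name and array representation here are illustrative assumptions, not part of any real library; capacity checking and disk I/O are omitted.

```typescript
// Sketch of Case A: insert into a sorted leaf by shifting keys right.
// Parallel key/value arrays stand in for a real node structure.
function insertIntoLeaf(keys: number[], values: string[], key: number, value: string): void {
  let i = keys.length - 1;
  keys.push(0);        // grow both arrays by one slot
  values.push("");
  while (i >= 0 && key < keys[i]) {
    keys[i + 1] = keys[i];       // shift right to make room
    values[i + 1] = values[i];
    i--;
  }
  keys[i + 1] = key;             // new key lands at position i+1
  values[i + 1] = value;
}
```

This mirrors the WHILE loop in INSERT-NONFULL exactly: inserting 30 into keys [10, 20, 40] shifts 40 right and places 30 at index 2.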
Node splitting is the heart of B-tree balancing. When a node becomes full (has 2t-1 keys), we cannot insert another key without violating the size invariant. Splitting divides the node into two nodes, each with acceptable key counts.
The SPLIT-CHILD Algorithm:
```
SPLIT-CHILD(parent, i):
    // parent.children[i] is full (has 2t-1 keys)
    fullChild ← parent.children[i]
    newChild ← AllocateNode()
    newChild.isLeaf ← fullChild.isLeaf

    // New child gets the upper half of keys (t-1 keys)
    newChild.numKeys ← t - 1
    FOR j ← 0 TO t - 2:
        newChild.keys[j] ← fullChild.keys[t + j]
        newChild.values[j] ← fullChild.values[t + j]    // For leaves

    // If not a leaf, also copy child pointers
    IF NOT fullChild.isLeaf:
        FOR j ← 0 TO t - 1:
            newChild.children[j] ← fullChild.children[t + j]

    // Full child keeps the lower half (t-1 keys)
    fullChild.numKeys ← t - 1

    // Make room in parent for new child pointer
    FOR j ← parent.numKeys DOWNTO i + 1:
        parent.children[j + 1] ← parent.children[j]
    parent.children[i + 1] ← newChild

    // Make room in parent for median key
    FOR j ← parent.numKeys - 1 DOWNTO i:
        parent.keys[j + 1] ← parent.keys[j]
        parent.values[j + 1] ← parent.values[j]

    // Median key moves up to parent
    parent.keys[i] ← fullChild.keys[t - 1]    // The median
    parent.numKeys ← parent.numKeys + 1

    // Write all modified nodes to disk
    DISK-WRITE(fullChild)
    DISK-WRITE(newChild)
    DISK-WRITE(parent)
```
Visual Example: Splitting a Node with t=3
Before split (node with 5 keys is full):

```
Parent:  [... | 30 | ...]
             /
   [10, 20, 25, 28, 29]   ← Full child
```

After split (median 25 moves up):

```
Parent:  [... | 25 | 30 | ...]
             /    \
     [10, 20]      [28, 29]   ← Two half-full children
```
A key observation on I/O cost: a single split requires writing 3 nodes—the original child (now half-sized), the new sibling, and the parent (which gains a key and a child pointer). In the worst case of cascading splits up to the root, we write O(h) nodes, where h is the tree height. This is rare—most insertions cause no splits at all.
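To make SPLIT-CHILD concrete, here is a minimal in-memory TypeScript sketch (keys only, no values or disk I/O); the node shape and names are assumptions for illustration. With t = 3 it reproduces the figure above.

```typescript
const T = 3; // minimum degree: a full node holds 2T - 1 = 5 keys

interface BNode {
  keys: number[];
  children: BNode[]; // empty array for leaves
  leaf: boolean;
}

// Split the full child at parent.children[i]; the median key moves up.
function splitChild(parent: BNode, i: number): void {
  const full = parent.children[i];
  const right: BNode = {
    keys: full.keys.splice(T),                          // upper T-1 keys
    children: full.leaf ? [] : full.children.splice(T), // upper T children
    leaf: full.leaf,
  };
  const median = full.keys.pop()!;                      // keys[T-1] moves up
  parent.keys.splice(i, 0, median);
  parent.children.splice(i + 1, 0, right);
}
```

Splitting [10, 20, 25, 28, 29] under a parent containing 30 yields parent keys [25, 30] over children [10, 20] and [28, 29], exactly as in the figure.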
Let's trace through a series of insertions into a B-tree with minimum degree t=2 (each node holds 1-3 keys). Starting from an empty tree, we'll insert: 10, 20, 30, 40, 25, 15, 35.
Insert 10 (empty tree): create the root as a leaf: [10]

Insert 20: root has space: [10, 20]

Insert 30: root has space: [10, 20, 30] ← root is now full

Insert 40: root is full! Must split before inserting:

```
Step 1: Create a new root
Step 2: Split the old root at the median (20)
Step 3: Tree structure:
            [20]
           /    \
       [10]      [30]
Step 4: Insert 40 into the right leaf:
            [20]
           /    \
       [10]      [30, 40]
```

Insert 25: descend (25 > 20, go right); [30, 40] has space:

```
        [20]
       /    \
   [10]      [25, 30, 40]   ← right leaf now full
```

Insert 15: descend (15 < 20, go left); [10] has space:

```
           [20]
          /    \
  [10, 15]      [25, 30, 40]
```

Insert 35: descend (35 > 20), must go right. But wait—the right child [25, 30, 40] is full! Proactive split of [25, 30, 40]: median 30 moves to the root; left becomes [25], right becomes [40]:

```
         [20, 30]
        /    |    \
[10, 15]   [25]    [40]
```

Now 35 goes into the rightmost child (35 > 30):

```
         [20, 30]
        /    |    \
[10, 15]   [25]    [35, 40]
```
| Operation | Split Required? | Nodes Modified | Height Change |
|---|---|---|---|
| Insert 10 | No (new tree) | 1 (new root) | 0→1 |
| Insert 20 | No | 1 (root) | Same |
| Insert 30 | No | 1 (root) | Same |
| Insert 40 | Yes (root split) | 3 (old root, new root, sibling) | 1→2 |
| Insert 25 | No | 1 (leaf) | Same |
| Insert 15 | No | 1 (leaf) | Same |
| Insert 35 | Yes (leaf split) | 3 (leaf, sibling, parent) | Same |
In this example, only 2 of 7 insertions required splits. As t increases (typical values: 50-200), nodes hold more keys before filling, making splits even rarer. A node with t=100 needs 199 insertions before splitting, and most will be distributed across many nodes.
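The entire trace can be reproduced with a compact in-memory implementation of BTREE-INSERT, INSERT-NONFULL, and SPLIT-CHILD (t = 2, keys only). This is a sketch under assumed names and node shapes, with disk I/O omitted.

```typescript
const T = 2; // minimum degree: nodes hold 1..3 keys

interface BNode {
  keys: number[];
  children: BNode[]; // empty for leaves
  leaf: boolean;
}

function splitChild(parent: BNode, i: number): void {
  const full = parent.children[i];
  const right: BNode = {
    keys: full.keys.splice(T),                          // upper T-1 keys
    children: full.leaf ? [] : full.children.splice(T),
    leaf: full.leaf,
  };
  const median = full.keys.pop()!;                      // keys[T-1] moves up
  parent.keys.splice(i, 0, median);
  parent.children.splice(i + 1, 0, right);
}

function insertNonFull(node: BNode, key: number): void {
  let i = node.keys.length;
  while (i > 0 && key < node.keys[i - 1]) i--;
  if (node.leaf) {
    node.keys.splice(i, 0, key);
    return;
  }
  if (node.children[i].keys.length === 2 * T - 1) {
    splitChild(node, i);           // proactive split on the way down
    if (key > node.keys[i]) i++;   // decide which half to descend into
  }
  insertNonFull(node.children[i], key);
}

function insert(root: BNode, key: number): BNode {
  if (root.keys.length === 2 * T - 1) {
    // Full root: the only operation that grows the tree's height
    const newRoot: BNode = { keys: [], children: [root], leaf: false };
    splitChild(newRoot, 0);
    insertNonFull(newRoot, key);
    return newRoot;
  }
  insertNonFull(root, key);
  return root;
}
```

Inserting 10, 20, 30, 40, 25, 15, 35 in order yields root [20, 30] over leaves [10, 15], [25], [35, 40], matching the final tree in the trace.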
B+-tree insertion differs from B-tree insertion in several important ways due to the structural difference: data resides only in leaves, and internal nodes contain separator keys only.
Key Differences:
Insertion always targets a leaf: We always descend to a leaf node before inserting, regardless of whether we encounter the key as a separator in internal nodes.
Separator key handling during splits: when a leaf splits, the first key of the new right leaf is copied up to the parent as a separator (the key also remains in the leaf). When an internal node splits, the median key is promoted (it moves up and does not remain in either half).
Leaf linking: After a leaf split, we must update sibling pointers to maintain the linked list structure.
```typescript
// B+-Tree leaf split: median is COPIED up
function splitLeafNode(parent: InternalNode, childIndex: number): void {
  const fullLeaf = parent.children[childIndex] as LeafNode;
  const newLeaf = createLeafNode();

  // Transfer upper half of keys AND values to new leaf
  const midpoint = Math.floor(fullLeaf.keys.length / 2);
  newLeaf.keys = fullLeaf.keys.splice(midpoint);
  newLeaf.values = fullLeaf.values.splice(midpoint);

  // COPY (not move) the first key of new leaf to parent
  // This key remains in the new leaf!
  const separatorKey = newLeaf.keys[0];
  insertIntoParent(parent, childIndex, separatorKey, newLeaf);

  // Maintain leaf linked list
  newLeaf.nextLeaf = fullLeaf.nextLeaf;
  fullLeaf.nextLeaf = newLeaf;
  if (newLeaf.nextLeaf) {
    newLeaf.nextLeaf.prevLeaf = newLeaf;
  }
  newLeaf.prevLeaf = fullLeaf;
}

// B+-Tree internal node split: median is PROMOTED up
function splitInternalNode(parent: InternalNode, childIndex: number): void {
  const fullNode = parent.children[childIndex] as InternalNode;
  const newNode = createInternalNode();

  // Find the median key to promote
  const midpoint = Math.floor(fullNode.keys.length / 2);
  const promotedKey = fullNode.keys[midpoint]; // This moves UP

  // Upper half of keys goes to new node (excluding the promoted key)
  newNode.keys = fullNode.keys.splice(midpoint + 1);
  newNode.children = fullNode.children.splice(midpoint + 1);

  // Remove the promoted key from fullNode
  fullNode.keys.pop(); // Remove the median

  // Insert promoted key and new child into parent
  insertIntoParent(parent, childIndex, promotedKey, newNode);
}
```

Getting copy vs. promote wrong leads to subtle bugs: if you promote (move) a key out of a leaf, range queries will skip that key; if you copy a key from an internal node, you get duplicate separators and incorrect routing. This is a common implementation error.
Understanding insertion complexity requires analyzing both the common case (no split) and worst case (cascading splits to root).
Time Complexity:

- Search phase: O(t · log_t n) comparisons (O(log_t n) nodes on the path, up to t comparisons per node with linear scan, or O(log t) with binary search)
- Insert phase (no split): O(t) to shift keys within the leaf
- Insert phase (with splits): O(t · log_t n) in the worst case (a split at every level of the path)
Amortized Analysis:
While a single insertion can trigger O(log_t n) splits, this is rare. Using aggregate analysis:
| Metric | Best Case | Worst Case | Amortized |
|---|---|---|---|
| Disk Reads | O(log_t n) | O(log_t n) | O(log_t n) |
| Disk Writes | O(1) | O(log_t n) | O(1) |
| CPU Comparisons | O(log_t n · log t) | O(log_t n · t) | O(log n) |
| Nodes Modified | 1 (leaf only) | O(log_t n) | O(1) |
| New Nodes Created | 0 | O(log_t n) | O(1/t) |
Space Complexity: O(log_t n) for the recursion stack (or O(1) iteratively). Each split allocates exactly one new node, plus one more when the root splits.
Disk I/O Analysis:
For a B-tree with t=100 containing 1 billion keys, the height is at most 4, so an insertion touches only a handful of nodes on the way down. With the upper levels cached in memory, a typical insertion requires 1-2 disk reads and 1-3 writes.
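The 1-billion-key figure follows from the standard worst-case height bound h ≤ log_t((n+1)/2) (CLRS Theorem 18.1). A quick sketch to check the arithmetic; the function name is illustrative:

```typescript
// Worst-case height bound for a B-tree of minimum degree t holding n keys:
// h <= log_t((n + 1) / 2)
function maxHeight(t: number, n: number): number {
  return Math.floor(Math.log((n + 1) / 2) / Math.log(t));
}

// t = 100, n = 1e9 gives a height of at most 4 below the root,
// which is why an uncached insertion reads only a few nodes.
```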
Insertion can have 'write amplification'—one logical insert causing multiple physical writes. Understanding this is crucial for SSD-based systems where write cycles are limited. Some systems batch insertions or use write-ahead logging to reduce write amplification.
There are two fundamental approaches to handling full nodes during insertion:
Proactive (Preemptive) Splitting:
Split full nodes on the way down, before they're needed:
```
descend(node, key):
    FOR each child on path:
        IF child is full:
            split child now
        descend into appropriate child
    insert into leaf (guaranteed non-full)
```
Advantages:

- Single downward pass; no need to re-traverse the path or track ancestors
- Latches on ancestors can be released early, improving concurrency
- Simpler implementation

Disadvantages:

- May split a full node that the current insertion would not actually overflow, so some splits are unnecessary
- Slightly lower average space utilization
Reactive Splitting:
Descend to leaf, insert, then propagate splits upward:
```
descend(node, key):
    path ← []               // Stack of ancestors
    WHILE not at leaf:
        path.push(node)
        descend to appropriate child
    insert into leaf
    IF leaf overflows:
        propagate splits up using path stack
```
Advantages:

- Splits happen only when a node actually overflows, never unnecessarily
- Optimal space utilization

Disadvantages:

- Requires a stack of ancestors (or parent pointers) to propagate splits upward
- In a concurrent setting, latches on the whole path must be held until all splits finish
| Aspect | Proactive | Reactive |
|---|---|---|
| Passes | Single | One down, possibly one up |
| Ancestor Tracking | Not needed | Required (stack) |
| Latch Holding | Parent + child only | All ancestors until done |
| Unnecessary Splits | Possible | Never |
| Space Utilization | Slightly lower | Optimal |
| Implementation | Simpler | More complex |
| Concurrency | Better (shorter latch holding) | Worse (longer latch chains) |
Most production databases use proactive-style splitting due to its concurrency advantages. The slightly higher space usage is acceptable, and the simplicity reduces bug potential.
Production databases handle hundreds or thousands of concurrent insertions. Without proper synchronization, concurrent modifications corrupt the tree. Several strategies exist:
1. Latch Crabbing (Latch Coupling):
Basic protocol for safe concurrent access:
```cpp
void insertWithCrabbing(Key key, Value value) {
    root->writeLatch();               // Start with a write latch on the root
    Node* current = root;
    while (!current->isLeaf) {
        Node* child = current->getChild(key);
        child->writeLatch();
        // Safe release: if the child cannot split, no ancestor will be
        // modified, so release the parent (and, in full crabbing, any
        // other ancestors still latched)
        if (!child->willSplitOnInsert()) {
            current->writeUnlatch();
        }
        // If the child might split, keep the parent latched
        current = child;
    }
    // Now holding latches on the path from the last potentially-splitting
    // node down to the leaf
    performInsertAndSplits();
    releaseAllLatches();
}
```
This protocol ensures that if a split propagates upward, we still hold the necessary latches.
2. Optimistic Latch Coupling:
Assume splits are rare—use read latches on the way down, then verify:
```cpp
void optimisticInsert(Key key, Value value) {
    // Phase 1: Optimistic descent with read latches
    readLatchPath();
    Node* leaf = findLeaf(key);
    if (leaf->hasSpace()) {
        // Common case: upgrade the leaf latch, insert, done
        leaf->upgradeToWriteLatch();
        insert(leaf, key, value);
        return;
    }
    // Leaf is full - release everything and restart pessimistically
    releaseAllLatches();
    pessimisticInsertWithCrabbing(key, value);
}
```
This dramatically reduces latch contention because most insertions don't cause splits.
3. B-Link Trees:
A structural modification that adds a right-sibling pointer (and a "high key") to every node. A reader that lands on a node just as it splits can follow the sibling pointer to find keys that migrated right, so writers need not hold parent and child latches simultaneously. This design, due to Lehman and Yao, is used in PostgreSQL and several other production systems.
B-tree latch protocols must prevent deadlock. The standard rule: always acquire latches in root-to-leaf order. Never hold a latch on a child and then try to acquire one on its parent. This total ordering prevents cycles.
Inserting millions of keys one at a time is inefficient—every insertion traverses the tree from root. Bulk loading builds a B-tree from sorted data in O(n) time with optimal space utilization.
Algorithm: Bottom-Up Bulk Loading
```
BULK-LOAD(sortedKeys, sortedValues):
    // Step 1: Create all leaf nodes, packed nearly full
    //         (production loaders often target ~90% to leave room for later inserts)
    leaves ← []
    FOR each chunk of (2t-1) keys:
        leaf ← CreateLeafNode(chunk)
        leaves.append(leaf)
        LinkLeafToList(leaf)

    // Step 2: Build internal levels bottom-up
    currentLevel ← leaves
    WHILE len(currentLevel) > 1:
        parentLevel ← []
        FOR each chunk of 2t children:
            parent ← CreateInternalNode()
            parent.children ← chunk
            parent.keys ← [rightmost key of each child except last]
            parentLevel.append(parent)
        currentLevel ← parentLevel

    // Step 3: currentLevel[0] is the root
    RETURN currentLevel[0]
```
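The bottom-up algorithm can be sketched as runnable TypeScript (keys only, t = 2, illustrative node shape). This is a simplification: a production loader must also rebalance the under-full final group that can occur at each level.

```typescript
const T = 2;

interface BNode {
  keys: number[];
  children: BNode[]; // empty for leaves
  leaf: boolean;
}

// Build a B+-tree-style index bottom-up from sorted keys.
function bulkLoad(sortedKeys: number[]): BNode {
  // Step 1: pack leaves left to right, 2T-1 keys per leaf
  let level: BNode[] = [];
  for (let i = 0; i < sortedKeys.length; i += 2 * T - 1) {
    level.push({ keys: sortedKeys.slice(i, i + 2 * T - 1), children: [], leaf: true });
  }
  if (level.length === 0) return { keys: [], children: [], leaf: true };

  // Step 2: build internal levels until one node remains
  while (level.length > 1) {
    const parents: BNode[] = [];
    for (let i = 0; i < level.length; i += 2 * T) {
      const group = level.slice(i, i + 2 * T);
      parents.push({
        // Separator: rightmost key of each child except the last
        keys: group.slice(0, -1).map(c => c.keys[c.keys.length - 1]),
        children: group,
        leaf: false,
      });
    }
    level = parents;
  }
  return level[0]; // Step 3: the single remaining node is the root
}
```

Loading the sorted keys 1..9 produces three packed leaves under a root with separators [3, 6], with no per-key tree traversal at all.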
Advantages of Bulk Loading:
| Metric | Repeated Insertion | Bulk Load |
|---|---|---|
| CPU Operations | ~20 million | ~1 million |
| Disk Writes | ~5 million | ~5,000 |
| Final Tree Height | 3-4 | 3 |
| Space Utilization | ~69% | ~95%+ |
| Time (SSD) | ~10 minutes | ~5 seconds |
Use bulk loading for: initial index creation, rebuilding corrupted indexes, data warehouse loads, and any scenario with >10,000 sorted insertions. Some databases also fast-path appends when they detect sorted insert patterns targeting the rightmost leaf.
B-tree insertion combines elegant algorithms with practical engineering considerations. The essentials: insertion always lands in a leaf; splitting full nodes on the way down guarantees space and keeps the operation a single pass; only a root split increases tree height; splits are rare in practice, and rarer still as t grows; concurrent insertion relies on careful latch protocols; and sorted data should be bulk loaded rather than inserted one key at a time.
You now understand the B-tree insertion algorithm comprehensively. The next page examines node splitting in greater detail—exploring edge cases, optimization techniques, and the mathematical properties that guarantee balance.