If you've ever used a database—and you have, whether you realized it or not—you've relied on B-trees. These remarkable structures are the backbone of virtually every database system in production today, from the SQLite database on your phone to the massive PostgreSQL clusters powering major web applications.
The B-tree isn't just another balanced tree variant. It's a purpose-built structure designed from the ground up for disk-based storage. Every design decision, from branching factor to node layout, reflects deep understanding of how magnetic and solid-state storage actually works. Understanding B-trees means understanding how databases achieve their remarkable performance.
By the end of this page, you will understand B-tree structure and invariants, how they're tailored for disk block access patterns, the insertion and deletion algorithms with their splitting and merging mechanics, and variations like B+ trees that further optimize for database workloads.
A B-tree of order m (also called minimum degree t in some literature, where m = 2t) is a self-balancing tree that satisfies the following properties:
Structural Properties:
- Every node has at most m children and at most m-1 keys
- Every internal node other than the root has at least ⌈m/2⌉ children
- The root, if not a leaf, has at least 2 children
- A node with k children stores exactly k-1 keys
- All leaves appear at the same depth
Ordering Properties:
- Keys within each node are stored in ascending order
- For an internal node, every key in the subtree at children[i] lies between keys[i-1] and keys[i] (treating the missing boundary keys as -∞ and +∞)
| Node Type | Minimum Keys | Maximum Keys | Minimum Children | Maximum Children |
|---|---|---|---|---|
| Root (not leaf) | 1 | m-1 | 2 | m |
| Root (is leaf) | 0 | m-1 | 0 (leaf) | 0 (leaf) |
| Internal Node | ⌈m/2⌉-1 | m-1 | ⌈m/2⌉ | m |
| Leaf Node | ⌈m/2⌉-1 | m-1 | 0 (leaf) | 0 (leaf) |
Different textbooks define B-tree 'order' differently. Some use order = maximum children (our definition), others use order = minimum degree. Always check which convention a source uses. The minimum degree convention sets t, then max children = 2t and min children (non-root) = t.
Example: B-tree of order 5
This means every internal node holds between 2 and 4 keys, and between 3 and 5 children. This "half-full" minimum ensures good space utilization while maintaining balance.
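As a quick check on the table and the order-5 example, the bounds for any order m can be computed directly. This is a small Python sketch; the function name `btree_bounds` is mine:

```python
from math import ceil

def btree_bounds(m):
    """Key/child bounds for a non-root internal node in a
    B-tree of order m, where order = maximum number of children."""
    return {
        "min_keys": ceil(m / 2) - 1,
        "max_keys": m - 1,
        "min_children": ceil(m / 2),
        "max_children": m,
    }

# Order 5: between 2 and 4 keys, between 3 and 5 children
print(btree_bounds(5))
```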
B-trees aren't just theoretically interesting—they're engineered to exploit how storage hardware actually works.
Understanding block-based I/O:
Storage devices (HDDs, SSDs, networked storage) transfer data in fixed-size blocks or pages, typically 4KB to 64KB. When you request any byte within a block, the entire block is read:
Request 1 byte at offset 1000:
→ System reads entire 4KB block containing offset 1000
→ Block transferred: bytes 0-4095
→ Your 1 byte extracted from the block
→ The other 4,095 bytes: wasted bandwidth (if unused)
B-tree node sizing strategy:
The key insight: size each B-tree node to exactly fill one disk block. If blocks are 4KB:
Node layout (first attempt):
- Node metadata: 16 bytes
- Key count: 4 bytes
- Keys: 8 bytes each × 500 keys = 4,000 bytes
- Child pointers: 8 bytes each × 501 pointers = 4,008 bytes
(Total: 8,028 bytes → overflows a 4KB block, so we must shrink the node until it fits in one block)
Practical node sizing calculation:
For a 4KB block with a 16-byte header, 8-byte keys, and 8-byte pointers:
Available space: 4096 - 16 = 4080 bytes
Each key-pointer pair: 8 + 8 = 16 bytes
Maximum pairs: 4080 ÷ 16 = 255
So order m ≈ 256 (255 keys, 256 children)
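The same arithmetic can be scripted for other block sizes. This sketch assumes the sizes above (16-byte header, 8-byte keys and pointers); the function name is mine, and the exact bound it computes comes out at 255, close to the m ≈ 256 estimate:

```python
def max_order(block_size, header=16, key_size=8, ptr_size=8):
    """Largest order m such that a full node (m-1 keys, m child
    pointers, plus a header) fits in one block, i.e. the largest m
    with header + (m-1)*key_size + m*ptr_size <= block_size."""
    return (block_size - header + key_size) // (key_size + ptr_size)

print(max_order(4096))   # 255
print(max_order(16384))  # larger blocks give a wider fan-out: 1023
```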
The resulting height:
For n keys with order 256, the height is roughly log₂₅₆ n; with n = 4 × 10⁹, that is log₂₅₆(4 × 10⁹) ≈ 4 levels.
A B-tree indexing 4 billion records requires only 4 disk reads for any search!
With binary trees, searching 4 billion records requires ~32 disk reads (log₂(4×10⁹)). With a typical B-tree, it requires only 4 disk reads. This 8× reduction in I/O is why B-trees dominate disk-based applications.
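The 8× figure is easy to verify with a two-line calculation (a Python sketch):

```python
from math import ceil, log

n = 4_000_000_000
binary_height = ceil(log(n, 2))    # one disk read per level of a binary tree
btree_height = ceil(log(n, 256))   # one disk read per level of an order-256 B-tree

print(binary_height, btree_height)  # 32 4
```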
Additional disk-friendly properties:
Sequential access within nodes: After loading a node, searching through its keys happens in memory—fast sequential access
Predictable I/O: The maximum number of disk reads is exactly the tree height, enabling precise performance prediction
Cache efficiency: Frequently accessed nodes (especially near the root) stay in memory buffers, further reducing disk I/O
Write optimization: Changes are localized to specific nodes; we rarely need to rewrite large portions of the tree
Search in a B-tree generalizes binary search tree search to handle multiple keys per node.
Algorithm:
BTreeSearch(node, target):
if node is null:
return NOT_FOUND
// Binary search within the node's keys
i = 0
while i < node.keyCount and target > node.keys[i]:
i = i + 1
// Check if we found the key
if i < node.keyCount and target == node.keys[i]:
return (node, i) // Found at position i in this node
// If this is a leaf, key doesn't exist
if node.isLeaf:
return NOT_FOUND
// Read child from disk if not in memory
DiskRead(node.children[i])
// Recursively search in appropriate child
return BTreeSearch(node.children[i], target)
Complexity Analysis:
Disk I/O: O(log_m n) = O(height) block reads
CPU operations: O(m × height) with linear scans within nodes, or O(log m × height) = O(log n) with binary search per node
Overall: O(log n) CPU operations, O(log_m n) disk I/Os
Although our pseudocode shows linear search within nodes for clarity, real implementations use binary search. With 255 keys per node, binary search requires at most 8 comparisons (log₂ 255 ≈ 8), all happening in fast in-memory operations after a single disk read.
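The same algorithm in runnable form, using the binary search the tip describes. This is a sketch with an in-memory `Node` type of my own; in a real system, following a child pointer is where `DiskRead` would happen:

```python
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty for leaves
        self.leaf = not self.children

def btree_search(node, target):
    while node is not None:
        i = bisect_left(node.keys, target)        # binary search within the node
        if i < len(node.keys) and node.keys[i] == target:
            return (node, i)                      # found at position i
        if node.leaf:
            return None                           # key doesn't exist
        node = node.children[i]                   # one "disk read" per level

# A two-level example tree: root [10, 20] over three leaves
root = Node([10, 20], [Node([5, 7]), Node([12, 15]), Node([25, 30])])
print(btree_search(root, 15) is not None)  # True: found in the middle leaf
print(btree_search(root, 8))               # None: not present
```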
B-tree insertion follows the same principles as 2-3 tree insertion but handles larger nodes and uses proactive splitting to simplify implementation.
Core Principle:
Keys are always inserted into leaf nodes. When a node has no room left, it splits into two nodes that each take roughly half of its keys, and the median key is promoted to the parent.
Two approaches: Reactive vs. Proactive Splitting
Reactive splitting inserts into the leaf first, then splits any node that overflows, propagating splits back up toward the root. Proactive splitting instead splits every full node encountered on the way down, so the target leaf is guaranteed to have room and no upward pass is needed.
Proactive Insertion Algorithm (preferred for disk):
BTreeInsert(tree, key):
root = tree.root
// If root is full, split it first (creates new root)
if root.keyCount == m - 1:
newRoot = AllocateNode()
newRoot.isLeaf = false
newRoot.children[0] = root
SplitChild(newRoot, 0, root)
tree.root = newRoot
InsertNonFull(newRoot, key)
else:
InsertNonFull(root, key)
InsertNonFull(node, key):
i = node.keyCount - 1
if node.isLeaf:
// Make room and insert key
while i >= 0 and key < node.keys[i]:
node.keys[i + 1] = node.keys[i]
i = i - 1
node.keys[i + 1] = key
node.keyCount = node.keyCount + 1
DiskWrite(node)
else:
// Find child to descend into
while i >= 0 and key < node.keys[i]:
i = i - 1
i = i + 1
DiskRead(node.children[i])
// Proactively split if child is full
if node.children[i].keyCount == m - 1:
SplitChild(node, i, node.children[i])
if key > node.keys[i]:
i = i + 1
InsertNonFull(node.children[i], key)
The Split Operation:
SplitChild(parent, childIndex, fullChild):
// fullChild has m-1 keys; the median is promoted and the rest split between the two halves
medianIndex = floor((m - 1) / 2) // index of median key
// Create new sibling node
newNode = AllocateNode()
newNode.isLeaf = fullChild.isLeaf
// Move upper half of keys to new node
newNode.keyCount = (m - 1) - medianIndex - 1 // keys above the median
for j = 0 to newNode.keyCount - 1:
newNode.keys[j] = fullChild.keys[medianIndex + 1 + j]
// If not leaf, move corresponding children
if not fullChild.isLeaf:
for j = 0 to newNode.keyCount:
newNode.children[j] = fullChild.children[medianIndex + 1 + j]
// Adjust fullChild to keep only lower half
fullChild.keyCount = medianIndex // keeps keys 0 to medianIndex-1
// Make room in parent for the promoted key and new child
for j = parent.keyCount downto childIndex + 1:
parent.children[j + 1] = parent.children[j]
parent.children[childIndex + 1] = newNode
for j = parent.keyCount - 1 downto childIndex:
parent.keys[j + 1] = parent.keys[j]
parent.keys[childIndex] = fullChild.keys[medianIndex] // promote median
parent.keyCount = parent.keyCount + 1
DiskWrite(fullChild)
DiskWrite(newNode)
DiskWrite(parent)
Proactive splitting requires only one downward pass through the tree. We never need to backtrack upward, which would require re-reading nodes from disk. The trade-off is occasional unnecessary splits, but the I/O savings outweigh this cost. Note that proactive splitting works out most cleanly when m is even (m = 2t in the minimum-degree convention): a full node then has 2t-1 keys, and splitting yields two halves of t-1 keys each, exactly the required minimum.
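The insertion path above, translated into a compact runnable sketch. The in-memory `Node` and the class names are mine; the order m = 4 is chosen even so that a split leaves both halves exactly at the minimum key count:

```python
from bisect import bisect_left, insort

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

class BTree:
    def __init__(self, m):
        self.m = m                                # order = maximum children
        self.root = Node(leaf=True)

    def insert(self, key):
        if len(self.root.keys) == self.m - 1:     # full root: split it first —
            new_root = Node(leaf=False)           # the only way height grows
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        full = parent.children[i]
        mid = (self.m - 1) // 2                   # index of the median key
        right = Node(leaf=full.leaf)
        right.keys = full.keys[mid + 1:]          # upper keys to the new node
        if not full.leaf:
            right.children = full.children[mid + 1:]
            full.children = full.children[:mid + 1]
        parent.keys.insert(i, full.keys[mid])     # promote the median
        parent.children.insert(i + 1, right)
        full.keys = full.keys[:mid]               # lower keys stay behind

    def _insert_nonfull(self, node, key):
        while not node.leaf:
            i = bisect_left(node.keys, key)
            if len(node.children[i].keys) == self.m - 1:
                self._split_child(node, i)        # proactive split on the way down
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]
        insort(node.keys, key)                    # leaf: insert in sorted position

def inorder(node):
    """Collect all keys in sorted order (a correctness check)."""
    if node.leaf:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out.extend(inorder(child))
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

t = BTree(m=4)
for k in [7, 3, 18, 11, 30, 1, 15]:
    t.insert(k)
print(inorder(t.root))  # [1, 3, 7, 11, 15, 18, 30]
```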
Deletion is more complex than insertion because we must handle underflow while maintaining the minimum key requirement.
Cases to handle:
The target key may live in a leaf (Case 1) or in an internal node (Case 2). Either way, removing a key can leave a node underfull (fewer than ⌈m/2⌉-1 keys), which we repair by borrowing from a sibling or merging with one.
Case 1: Key k in leaf node x
Simply remove k from x.
If x now has too few keys:
If immediate sibling has spare keys:
Borrow one key via rotation through parent
Else:
Merge x with a sibling, pulling down separator from parent
Case 2: Key k in internal node x
Let y = child preceding k, z = child following k
Case 2a: If y has at least ⌈m/2⌉ keys:
Find predecessor k' of k (rightmost key in y's subtree)
Replace k with k' in x
Recursively delete k' from y's subtree
Case 2b: If z has at least ⌈m/2⌉ keys:
Find successor k' of k (leftmost key in z's subtree)
Replace k with k' in x
Recursively delete k' from z's subtree
Case 2c: If both y and z have exactly ⌈m/2⌉-1 keys:
Merge k and all of z into y
Child y now contains k and all keys from z
Recursively delete k from y
Proactive approach (ensuring minimum keys during descent):
Just as proactive splitting prevents backtracking during insertion, we can ensure nodes have at least t keys (where t = ⌈m/2⌉) during descent:
BTreeDelete(node, key):
// Ensure node has at least t keys (unless root)
i = FindKeyPosition(node, key)
if KeyFoundAt(node, i):
if node.isLeaf:
RemoveFromLeaf(node, i)
else:
DeleteFromInternalNode(node, i)
else:
if node.isLeaf:
// Key not in tree
return NOT_FOUND
// Ensure child has enough keys before descending
EnsureMinimumKeys(node, i)
// Recurse into child
BTreeDelete(node.children[i], key)
EnsureMinimumKeys(node, childIndex):
child = node.children[childIndex]
if child.keyCount >= t:
return // Already has enough keys
// Try borrowing from left sibling
if childIndex > 0 and node.children[childIndex - 1].keyCount >= t:
BorrowFromLeft(node, childIndex)
// Try borrowing from right sibling
else if childIndex < node.keyCount and node.children[childIndex + 1].keyCount >= t:
BorrowFromRight(node, childIndex)
// Must merge
else:
if childIndex > 0:
MergeWithLeft(node, childIndex)
else:
MergeWithRight(node, childIndex)
If merging causes the root to become empty (it loses its only key), the merged child becomes the new root, reducing tree height by 1. This is the only operation that decreases B-tree height—just as root splitting is the only operation that increases it.
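The borrowing step (the rotation through the parent used in Case 1 and in `BorrowFromLeft`) is compact enough to sketch on its own. The `Node` shape here is my own in-memory stand-in:

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []
        self.leaf = not self.children

def borrow_from_left(parent, i):
    """Fix an underfull child at index i by rotating one key through
    the parent: the separator drops down into the child, and the left
    sibling's largest key moves up to become the new separator."""
    child, left = parent.children[i], parent.children[i - 1]
    child.keys.insert(0, parent.keys[i - 1])   # separator moves down
    parent.keys[i - 1] = left.keys.pop()       # left sibling's max moves up
    if not left.leaf:                          # keep child pointers aligned
        child.children.insert(0, left.children.pop())

# Order-5 example: the right leaf has dropped below the 2-key minimum
parent = Node([20], [Node([5, 10, 15]), Node([25])])
borrow_from_left(parent, 1)
print(parent.keys, parent.children[0].keys, parent.children[1].keys)
# [15] [5, 10] [20, 25]
```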
While B-trees are powerful, most database systems actually use a variant called the B+ tree. This modification optimizes for the specific access patterns databases need.
Key differences:
All data in leaves: B+ trees store actual data (or pointers to data) only in leaf nodes. Internal nodes contain only keys for navigation.
Leaf nodes are linked: Leaves form a linked list, enabling efficient range scans.
Duplicated keys: Keys from internal nodes are repeated in leaves (they guide navigation but the actual key-value pairs are in leaves).
Higher branching factor: Since internal nodes don't store values, they can fit more keys per block.
Why B+ trees dominate databases:
Range queries: With linked leaves, scanning all records from 30 to 60 just follows leaf links—no tree traversal needed after finding 30.
Better cache utilization: Internal nodes (more frequently accessed) contain only keys and fit more easily in memory. Data-heavy leaves can stay on disk.
Consistent access patterns: Every search traverses to a leaf, making performance predictable.
Full scans: Scanning the entire table just traverses the leaf linked list—no internal nodes needed.
Simpler deletion: Deleting from a leaf doesn't affect navigation (internal keys can remain as "signpost" values).
| Aspect | B-Tree | B+ Tree |
|---|---|---|
| Data location | All nodes | Leaves only |
| Search terminates | Any level | Always at leaf |
| Leaf linking | Typically none | Linked list |
| Range queries | Tree traversal | Leaf scan |
| Internal node size | Keys + values | Keys only |
| Branching factor | Lower | Higher |
| Database usage | Less common | Standard choice |
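The range-query advantage is easy to see in code: once the starting leaf has been located (by a normal root-to-leaf descent, omitted here), the scan just follows leaf links. A B+-style sketch with hand-built, pre-linked leaves:

```python
class Leaf:
    def __init__(self, keys, next_leaf=None):
        self.keys = keys          # sorted within the leaf
        self.next = next_leaf     # B+ trees link leaves left to right

def range_scan(leaf, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain."""
    out = []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out        # past the range: stop early
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

# Three linked leaves covering keys 10..70
l3 = Leaf([50, 60, 70])
l2 = Leaf([30, 40], l3)
l1 = Leaf([10, 20], l2)

print(range_scan(l1, 30, 60))  # [30, 40, 50, 60]
```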
Let's consolidate the time and space complexity for B-tree operations. For a B-tree with n keys and order m:
| Operation | Disk I/O | CPU Time | Notes |
|---|---|---|---|
| Search | O(log_m n) | O(log n) | Height traversal + binary search per node |
| Insert | O(log_m n) | O(m × log_m n) | May split O(height) nodes |
| Delete | O(log_m n) | O(m × log_m n) | May merge/borrow O(height) times |
| Range query (k items) | O(log_m n + k/m) | O(log n + k) | Find start, then scan leaves (B+) |
| Build from n items | O(n × log_m n) | O(n × log n) | n insertions |
| Bulk load (sorted) | O(n/m) | O(n) | Sequential writes, fill leaves |
Space complexity:
O(n) overall: each key is stored once, in one of O(n/m) nodes of size Θ(m).
Space utilization:
B-trees guarantee at least 50% space utilization (minimum ⌈m/2⌉ children). In practice, with random insertions, utilization is typically around 69% (ln 2 = 0.693). Bulk loading from sorted data can achieve near-100% utilization.
In the external memory model where block transfers dominate, B-trees achieve O(log_B n) I/Os per operation, where B is the block size. This is provably optimal for comparison-based searching, making B-trees the theoretical best for disk-based search structures.
Implementing a production B-tree involves numerous practical considerations beyond the core algorithms:
1. Node layout and serialization:
Typical node layout on disk:
┌─────────────────────────────────────────────────┐
│ Header (16 bytes) │
│ - Node type (leaf/internal) │
│ - Key count │
│ - Pointer to parent (optional) │
├─────────────────────────────────────────────────┤
│ Keys array (variable) │
│ key[0], key[1], ..., key[n-1] │
├─────────────────────────────────────────────────┤
│ Children/Values array (variable) │
│ - For internal: child node pointers │
│ - For leaves: value data or value pointers │
├─────────────────────────────────────────────────┤
│ Leaf: next/prev leaf pointers (B+ trees) │
└─────────────────────────────────────────────────┘
2. Buffer pool management:
Production systems don't read/write directly to disk. A buffer pool caches frequently-used pages in memory: pages are pinned while in use, evicted under a replacement policy such as LRU, and dirty pages are flushed back to disk before their frames are reused.
3. Concurrency control:
Multiple threads accessing the B-tree need coordination: per-node latches acquired root-to-leaf ("latch crabbing," releasing an ancestor's latch once the child is known to be safe), or variants such as B-link trees that tolerate concurrent splits with less locking.
4. Recovery and durability:
Database B-trees must survive crashes: modifications are recorded in a write-ahead log (WAL) before pages are updated in place, and periodic checkpoints bound how much of the log must be replayed after a failure.
5. Variable-length keys:
Not all keys are fixed-size (e.g., strings): nodes then typically use a slotted-page layout (an array of offsets pointing to key data), and techniques like prefix compression and suffix truncation pack more keys into each block.
6. Bulk loading optimization:
When building a B-tree from sorted data, direct construction is faster than repeated inserts:
BulkLoad(sorted_data):
1. Fill leaf pages sequentially (near-100% utilization)
2. Create internal pages bottom-up, one level at a time
3. No rebalancing needed—tree is perfectly balanced by construction
This achieves O(n/B) I/O complexity (where B is the number of keys per block) versus O(n × log_B n) I/Os for n individual insertions.
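The three bulk-load steps can be sketched directly. The node representation and names below are mine, using B+-style "signpost" separators (each later child's smallest key):

```python
def chunk(xs, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def smallest(node):
    """Smallest key stored anywhere under a node."""
    return node[1][0] if node[0] == "leaf" else smallest(node[2][0])

def bulk_load(sorted_keys, fanout):
    """Fill leaves left to right at full capacity, then build each
    internal level bottom-up. Nodes are ('leaf', keys) or
    ('internal', separators, children). No rebalancing is needed:
    the tree is balanced by construction."""
    nodes = [("leaf", keys) for keys in chunk(sorted_keys, fanout)]
    while len(nodes) > 1:
        nodes = [("internal", [smallest(c) for c in kids[1:]], kids)
                 for kids in chunk(nodes, fanout)]
    return nodes[0]

root = bulk_load(list(range(1, 17)), fanout=4)
print(root[1])  # separators [5, 9, 13] over four completely full leaves
```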
We've explored B-trees in depth: their structure, their operations, and the practical considerations behind production implementations.
What's next:
In the final page of this module, we'll connect B-trees to real database systems, exploring how PostgreSQL, MySQL, and other systems use B-trees for indexing, why database designers choose B-trees over alternatives, and how to design effective indexes using your B-tree knowledge.
You now understand B-trees at a conceptual and algorithmic level: their structure optimized for disk access, their insertion and deletion mechanisms with splitting and merging, and their B+ tree evolution for database workloads. This knowledge is directly applicable to database index design and performance optimization.