Every database query that benefits from an index ultimately relies on B-tree search. When you execute a query like SELECT * FROM orders WHERE order_id = 12345, the database doesn't scan millions of rows—it navigates a B-tree, making precisely calculated decisions at each node to locate the target record with logarithmic efficiency.
The B-tree search algorithm is deceptively simple in concept yet profound in its implications. Understanding it deeply reveals why B-trees dominate database indexing and how seemingly minor implementation choices cascade into massive performance differences at scale.
By completing this page, you will understand: (1) The complete B-tree search algorithm with formal correctness properties, (2) How search translates to disk I/O in real systems, (3) The mathematical foundations of logarithmic search performance, (4) Variants and optimizations used in production databases, and (5) Common pitfalls and debugging strategies for B-tree search issues.
B-tree search is a generalized binary search extended to multi-way branching. While a binary search tree compares a key against a single node value and chooses left or right, a B-tree compares against an ordered list of keys within each node and selects the appropriate child pointer from among multiple options.
The Core Invariant:
A B-tree maintains the following ordering invariant that makes search possible:
For every internal node with keys K₁, K₂, ..., Kₙ and child pointers P₀, P₁, ..., Pₙ:
- All keys in the subtree pointed to by P₀ are less than K₁
- All keys in the subtree pointed to by Pᵢ (for 1 ≤ i < n) are greater than or equal to Kᵢ and less than Kᵢ₊₁
- All keys in the subtree pointed to by Pₙ are greater than or equal to Kₙ
This invariant is the foundation upon which all B-tree operations rest. Search exploits it to eliminate entire subtrees from consideration at each level.
Different B-tree implementations use slightly different boundary semantics (strict less-than vs. less-than-or-equal). The key insight is that there must be a consistent, well-defined rule that partitions the key space. The exact choice affects edge cases in insertion and deletion but not the fundamental search algorithm.
Visual Understanding:
Consider a B-tree node with keys [20, 40, 60] and four child pointers:
```
        [20 | 40 | 60]
       /    |    |     \
     P₀    P₁    P₂    P₃
      ↓     ↓     ↓     ↓
    keys  keys  keys  keys
    < 20  20-39 40-59 ≥ 60
```
At each node, we perform a within-node search (typically binary search for efficiency) to determine which path to follow.
Let's formalize the B-tree search algorithm with precise pseudocode and then analyze each component:
Algorithm: B-Tree Search
```
BTREE-SEARCH(node, key):
  Input:  node - current B-tree node being examined
          key  - the search key we're looking for
  Output: (node, index) pair if key found, or NULL if not found

  1.  i ← 0
  2.  // Find the smallest index where key ≤ node.keys[i]
  3.  WHILE i < node.numKeys AND key > node.keys[i]:
  4.      i ← i + 1
  5.
  6.  // Check if we found the key at position i
  7.  IF i < node.numKeys AND key = node.keys[i]:
  8.      RETURN (node, i)      // Key found at position i
  9.
  10. // Key not found in this node
  11. IF node.isLeaf:
  12.     RETURN NULL           // Key doesn't exist in tree
  13. ELSE:
  14.     // Read child from disk if not in memory
  15.     DISK-READ(node.children[i])
  16.     RETURN BTREE-SEARCH(node.children[i], key)
```
This recursive algorithm captures the essence of B-tree search. Let's examine each phase:
Line 15 (DISK-READ) is where the real cost lies. Each recursive call potentially triggers a disk read, which takes milliseconds—orders of magnitude slower than in-memory operations. The entire design of B-trees is optimized to minimize the number of these disk reads.
The within-node search (finding the correct key position within a single node) is a crucial optimization point. While the number of disk I/Os dominates overall cost, the CPU time for within-node searches becomes significant with large fan-outs or under high query loads.
Linear Search:
The naive approach scans keys sequentially:
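A minimal sketch (an illustrative helper, not any particular engine's code) of linear within-node search, mirroring the WHILE loop in BTREE-SEARCH:

```typescript
// Return the index of the first key >= searchKey, scanning left to right.
// This index doubles as the child-pointer index to follow when descending.
function linearSearchNode(keys: number[], searchKey: number): number {
  let i = 0;
  while (i < keys.length && searchKey > keys[i]) {
    i++;
  }
  return i; // keys[i] >= searchKey, or i === keys.length if all keys are smaller
}
```

For small nodes this beats binary search in practice: the branch is predictable and the keys stream through the cache line by line.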
Binary Search:
For larger nodes, binary search reduces comparisons:
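A lower-bound binary search sketch (same contract as the linear version above, assumed helper name):

```typescript
// Lower-bound binary search: index of the first key >= searchKey.
// Equivalent result to the linear scan, but O(log t) comparisons per node.
function binarySearchNode(keys: number[], searchKey: number): number {
  let lo = 0;
  let hi = keys.length; // search invariant: answer lies in [lo, hi]
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (keys[mid] < searchKey) {
      lo = mid + 1; // first key >= searchKey must be right of mid
    } else {
      hi = mid;     // keys[mid] >= searchKey: mid is a candidate
    }
  }
  return lo;
}
```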
Interpolation Search:
When keys are uniformly distributed:
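A sketch of interpolation search within a node (illustrative only; it assumes roughly uniform numeric keys and degrades badly on skewed data, which is why production B-trees rarely use it):

```typescript
// Interpolation search: estimate the target position from the key's value
// relative to the node's smallest and largest keys, then narrow as usual.
function interpolationSearchNode(keys: number[], searchKey: number): number {
  let lo = 0;
  let hi = keys.length - 1;
  while (lo <= hi && searchKey >= keys[lo] && searchKey <= keys[hi]) {
    // Linear interpolation between keys[lo] and keys[hi] (guard against /0)
    const pos = lo + Math.floor(
      ((searchKey - keys[lo]) * (hi - lo)) / (keys[hi] - keys[lo] || 1)
    );
    if (keys[pos] === searchKey) return pos;
    if (keys[pos] < searchKey) lo = pos + 1;
    else hi = pos - 1;
  }
  // Not hit exactly: return the lower-bound position (first key >= searchKey)
  while (lo < keys.length && keys[lo] < searchKey) lo++;
  return lo;
}
```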
| Node Size | Linear Search | Binary Search | Recommendation |
|---|---|---|---|
| 4-8 keys | 4-8 comparisons | 2-3 comparisons | Either works; linear often faster due to simplicity |
| 16-32 keys | 16-32 comparisons | 4-5 comparisons | Binary search preferred |
| 64-128 keys | 64-128 comparisons | 6-7 comparisons | Binary search essential |
| 256+ keys | 256+ comparisons | 8+ comparisons | Binary search mandatory |
SIMD-Accelerated Search:
Modern database systems exploit Single Instruction, Multiple Data (SIMD) instructions for within-node search:
```c
// Conceptual SIMD search within a node (AVX2, 32-bit keys)
__m256i key_vec    = _mm256_set1_epi32(search_key);            // broadcast search key to 8 lanes
__m256i node_keys  = _mm256_loadu_si256((__m256i*)node->keys); // load 8 keys at once
__m256i cmp_result = _mm256_cmpgt_epi32(key_vec, node_keys);   // lanes where search_key > key
int mask = _mm256_movemask_ps(_mm256_castsi256_ps(cmp_result)); // cast, then extract lane mask
int pos  = __builtin_popcount(mask); // count of keys < search_key = child index to follow
```
This compares 8 keys simultaneously, dramatically reducing within-node search time for fixed-size integer keys. Vectorized comparisons of this kind appear in research prototypes and in some production engines' hot paths.
A CPU cache line is typically 64 bytes. If keys are 8-byte integers, a cache line holds 8 keys. Accessing the first key in a group brings the entire cache line into L1 cache. Smart implementations align nodes to cache line boundaries and search in cache-line-sized chunks.
Understanding B-tree search complexity requires separating two cost models: disk I/O (the dominant factor) and CPU operations (significant under high load).
Height of a B-Tree:
For a B-tree with minimum degree t (each node has at least t-1 keys and at most 2t-1 keys):
$$h \leq \log_t \left( \frac{n+1}{2} \right)$$
This means a B-tree with t = 100 (typical for disk-based systems) containing 1 billion keys has height at most:
$$h \leq \log_{100}(500,000,000) \approx 4.35$$
So at most 5 disk reads locate any key among 1 billion!
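The bound can be checked numerically; a small sketch (helper name is illustrative):

```typescript
// Maximum height of a B-tree with minimum degree t holding n keys:
// h <= log_t((n + 1) / 2), via change of base.
function maxBTreeHeight(t: number, n: number): number {
  return Math.log((n + 1) / 2) / Math.log(t);
}

const h = maxBTreeHeight(100, 1_000_000_000);
// h is about 4.35, so at most 5 node reads reach any of a billion keys
```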
| Cost Metric | Complexity | Practical Interpretation |
|---|---|---|
| Disk I/O | O(log_t n) | Number of nodes read from disk; typically 3-5 for billion-key trees |
| CPU comparisons (linear within-node) | O(t · log_t n) | Linear scan of each visited node |
| CPU comparisons (binary within-node) | O(log t · log_t n) = O(log n) | Binary search within each visited node |
| Space for search | O(log_t n) | Recursion stack depth or path storage |
Why Disk I/O Dominates:
Consider typical access times (order-of-magnitude figures):
- CPU comparison: ~1 ns
- RAM access: ~100 ns
- SSD random read: ~100 µs
- HDD random read (seek + rotation): ~1-10 ms
A single disk read is roughly 1,000,000× slower than a CPU comparison. This is why B-trees optimize for fewer disk reads (by maximizing node size) even at the cost of more CPU work per node.
Practical Example:
Searching a B-tree with t = 200 for a key among 10⁹ keys:
- Height: h ≤ log₂₀₀(5 × 10⁸) ≈ 3.8, so at most 4 levels
- Cost: at most 4 disk reads—tens of milliseconds even on a spinning disk
- Contrast: an unindexed scan touches up to 10⁹ records
The B-tree transforms an impossible problem into a trivial one.
Most database systems keep the root node (and often the first 2-3 levels) permanently in memory. For a billion-key B-tree, this means only 1-2 disk reads are actually needed, not the theoretical 4-5. This optimization is critical for high-performance systems.
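A back-of-the-envelope estimate of what pinning those upper levels costs in RAM, assuming a uniform fanout of 200 and 8 KiB pages (both hypothetical values):

```typescript
// Memory needed to pin the top `levels` of a B-tree in RAM, assuming a
// uniform fanout and fixed page size (illustrative numbers, not a real engine).
function memoryToPinTopLevels(fanout: number, pageBytes: number, levels: number): number {
  let pages = 0;
  let nodesAtLevel = 1; // level 0 is the root
  for (let l = 0; l < levels; l++) {
    pages += nodesAtLevel;
    nodesAtLevel *= fanout;
  }
  return pages * pageBytes;
}

// Root plus two levels at fanout 200 with 8 KiB pages:
// 1 + 200 + 40,000 = 40,201 pages, roughly 314 MiB — cheap for a server.
const bytes = memoryToPinTopLevels(200, 8192, 3);
```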
While the basic search algorithm is similar for B-trees and B+-trees, there are critical differences that affect implementation and performance:
B-Tree (Classic):
[P₀, K₁, D₁, P₁, K₂, D₂, P₂, ...] where Dᵢ is a data pointer
B+-Tree (Database Standard):
[P₀, K₁, P₁, K₂, P₂, ...] (no data pointers in internal nodes)
B+-Tree Search Modification:
```
BPLUS-TREE-SEARCH(node, key):
  WHILE NOT node.isLeaf:
      i ← FindChildIndex(node, key)   // Binary search for position
      DISK-READ(node.children[i])
      node ← node.children[i]

  // Now at leaf level - search for exact key
  i ← BinarySearch(node.keys, key)
  IF i < node.numKeys AND node.keys[i] = key:
      RETURN (node, i)   // Found: return leaf and index
  ELSE:
      RETURN NULL        // Not found
```
Note that B+-tree search is naturally written iteratively rather than recursively—we descend until reaching a leaf, then perform a final lookup. This simplifies implementation and avoids growing a recursion stack.
B+-trees have higher fanout (more keys per node, thus shorter trees) because internal nodes don't store data pointers. For a node that fits in one disk page, a B+-tree fits more keys, reducing tree height and disk I/Os. This is why virtually all modern database indexes use B+-trees, not classic B-trees.
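The fanout difference is easy to quantify. A sketch with illustrative sizes (8 KiB page, 8-byte keys and pointers—assumptions, not any engine's real layout):

```typescript
// Keys per internal node for a given page size, assuming fixed-width keys,
// child pointers, and (for classic B-trees) inline data pointers.
// n keys need n*(key + childPtr + dataPtr) bytes plus one extra child pointer.
function internalFanout(pageBytes: number, keyBytes: number,
                        ptrBytes: number, dataPtrBytes: number): number {
  const perKey = keyBytes + ptrBytes + dataPtrBytes;
  return Math.floor((pageBytes - ptrBytes) / perKey);
}

const bTreeKeys = internalFanout(8192, 8, 8, 8); // classic: data pointers inline -> 341 keys
const bPlusKeys = internalFanout(8192, 8, 8, 0); // B+: keys and child pointers only -> 511 keys
// The B+-tree node holds ~50% more keys, so the tree is shallower for the same n.
```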
While point queries (finding a single key) are important, database workloads often involve range queries: WHERE age BETWEEN 25 AND 35 or WHERE date >= '2024-01-01'. B+-trees excel at range queries due to their leaf-level structure.
Range Query Algorithm:
```
BPLUS-TREE-RANGE-SEARCH(tree, lowKey, highKey):
  // Step 1: Find the starting leaf
  leaf ← FindLeaf(tree.root, lowKey)

  // Step 2: Collect results by scanning leaves sequentially
  results ← []
  WHILE leaf ≠ NULL:
      FOR i ← 0 TO leaf.numKeys - 1:
          IF leaf.keys[i] > highKey:
              RETURN results            // Past range, done
          IF leaf.keys[i] >= lowKey:
              results.append((leaf.keys[i], leaf.data[i]))
      leaf ← leaf.nextLeaf              // Follow sibling pointer
  RETURN results
```
The key insight: once we find the starting position, we traverse linked leaves without ever ascending the tree. This turns range queries into sequential I/O, which is dramatically faster than random I/O.
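The leaf-scanning phase can be sketched concretely (a minimal leaf type with hypothetical field names; finding the starting leaf is assumed done by the tree descent):

```typescript
// Minimal B+-tree leaf: sorted keys, a parallel data array, and a sibling link.
interface Leaf {
  keys: number[];
  data: string[];
  nextLeaf: Leaf | null;
}

// Scan linked leaves, collecting entries whose key lies in [lowKey, highKey].
// Assumes startLeaf is the leaf the descent for lowKey would land on.
function rangeScan(startLeaf: Leaf | null, lowKey: number, highKey: number): [number, string][] {
  const results: [number, string][] = [];
  let leaf = startLeaf;
  while (leaf !== null) {
    for (let i = 0; i < leaf.keys.length; i++) {
      if (leaf.keys[i] > highKey) return results;        // past the range: done
      if (leaf.keys[i] >= lowKey) results.push([leaf.keys[i], leaf.data[i]]);
    }
    leaf = leaf.nextLeaf; // sequential hop along the sibling chain, never re-ascending
  }
  return results;
}
```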
| Operation | I/O Type | Cost Characteristic |
|---|---|---|
| Find starting leaf | Random (tree descent) | O(log_t n) disk reads |
| Scan matching leaves | Sequential (linked list) | O(k / B) where k = result count, B = keys per leaf |
| Fetch actual records | Random (pointer following) | O(k) in worst case; clustered index avoids this |
Sequential I/O Advantage:
Hard drives and SSDs have vastly different performance characteristics for sequential vs. random access:
| Storage Type | Sequential Read | Random Read | Ratio |
|---|---|---|---|
| HDD | 200 MB/s | 1 MB/s | 200:1 |
| SATA SSD | 550 MB/s | 50 MB/s | 11:1 |
| NVMe SSD | 3500 MB/s | 500 MB/s | 7:1 |
Even on fast NVMe storage, sequential reads are 7× faster. For HDDs, the difference is 200×. B+-tree leaf linking transforms range queries from random access nightmares into fast sequential scans.
Bidirectional Pointers:
Advanced implementations maintain both nextLeaf and prevLeaf pointers, enabling efficient backward scans for:
- ORDER BY ... DESC queries

If all columns needed by a query exist in the index (a 'covering index'), the entire result can be produced by scanning leaves—no record fetches needed. This eliminates the random I/O phase, making range queries exceptionally fast.
Real database systems implement B-tree search with numerous practical considerations that textbook algorithms omit. Let's examine production-grade patterns:
1. Buffer Pool Integration:
Search doesn't directly perform disk I/O—it requests pages from a buffer pool manager:
```cpp
Node* getNode(PageID pid) {
    // Check if page is already in memory
    Frame* frame = bufferPool.lookup(pid);
    if (frame != nullptr) {
        frame->pinCount++;  // Prevent eviction during use
        return frame->page;
    }
    // Page miss: evict a victim and fetch from disk
    frame = bufferPool.evictAndFetch(pid);
    frame->pinCount++;      // Pin the freshly fetched page as well
    return frame->page;
}
```
This allows the upper levels of the tree (frequently accessed) to remain in memory while leaves are fetched on demand.
2. Latching for Concurrency:
Concurrent access requires protecting nodes during traversal:
```cpp
Record* search(Key key) {
    Node* current = root;
    current->latch.lockShared();       // Shared latch - multiple readers OK
    while (!current->isLeaf) {
        int i = findChildIndex(current, key);
        Node* child = getNode(current->children[i]);
        child->latch.lockShared();
        current->latch.unlock();       // Release parent before descending (latch coupling)
        current = child;
    }
    // Search within the leaf while still holding its latch
    Record* result = findInLeaf(current, key);  // hypothetical helper; nullptr if absent
    current->latch.unlock();
    return result;
}
```
This latch coupling (or crabbing) pattern ensures we hold latches on at most two nodes at any time.
Production B-tree nodes often use a slot array (indirection layer) rather than storing keys contiguously. This enables efficient insertion/deletion within a node without moving large amounts of data—just update the slot pointers. Binary search operates on the slot array, then follows the indirection to access actual keys.
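A sketch of binary search over such a slot array (field names are illustrative, not a real page format):

```typescript
// Slot-array indirection: key bytes live wherever they were written in the
// page's heap area; the slot array stores heap indices in key order.
// Inserts shift only small slot entries, never the key bytes themselves.
interface SlottedNode {
  heap: number[];   // keys in arbitrary (insertion) order
  slots: number[];  // indices into heap, sorted by the key they point at
}

// Lower-bound binary search over the slots, following the indirection
// to reach the actual key for each comparison.
function slottedLowerBound(node: SlottedNode, searchKey: number): number {
  let lo = 0, hi = node.slots.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (node.heap[node.slots[mid]] < searchKey) lo = mid + 1;
    else hi = mid;
  }
  return lo; // slot index of the first key >= searchKey
}
```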
When B-tree search behaves unexpectedly, systematic debugging is essential. Common issues include corrupted invariants, incorrect boundary handling, and concurrency bugs.
Invariant Verification:
A correct B-tree satisfies these properties; violations indicate bugs:
- Keys within every node are in strictly sorted order
- Every key in a subtree falls within the bounds implied by its parent's separator keys
- All leaves sit at the same depth
- Every non-root node respects the minimum/maximum occupancy bounds
Implement a verifyTree() function that recursively checks these properties during development.
```typescript
interface VerifyResult {
  valid: boolean;
  error?: string;
  path?: number[]; // Path to problematic node
}

function verifyBTree(
  node: BTreeNode,
  minKey: Key | null,
  maxKey: Key | null,
  depth: number,
  expectedDepth: number | null
): VerifyResult {
  // Check ordering within node
  for (let i = 1; i < node.keys.length; i++) {
    if (node.keys[i] <= node.keys[i - 1]) {
      return { valid: false, error: `Keys not sorted at depth ${depth}`, path: [i] };
    }
  }

  // Check boundary constraints
  if (minKey !== null && node.keys[0] < minKey) {
    return { valid: false, error: `Key ${node.keys[0]} violates lower bound ${minKey}` };
  }
  if (maxKey !== null && node.keys[node.keys.length - 1] >= maxKey) {
    return { valid: false, error: `Key ${node.keys[node.keys.length - 1]} violates upper bound ${maxKey}` };
  }

  // Leaf depth consistency
  if (node.isLeaf) {
    if (expectedDepth !== null && depth !== expectedDepth) {
      return { valid: false, error: `Leaf at depth ${depth}, expected ${expectedDepth}` };
    }
    return { valid: true };
  }

  // Verify children recursively
  for (let i = 0; i <= node.keys.length; i++) {
    const childMin = i === 0 ? minKey : node.keys[i - 1];
    const childMax = i === node.keys.length ? maxKey : node.keys[i];
    const childResult = verifyBTree(node.children[i], childMin, childMax, depth + 1, expectedDepth);
    if (!childResult.valid) {
      return { ...childResult, path: [i, ...(childResult.path || [])] };
    }
  }
  return { valid: true };
}
```
Common Search Bugs and Fixes:
| Symptom | Likely Cause | Debugging Approach |
|---|---|---|
| Key exists but search returns NULL | Off-by-one in boundary comparison | Log visited nodes and compare paths |
| Wrong value returned | Key duplicates with wrong handling | Verify uniqueness or duplicate semantics |
| Infinite loop | Cycle in tree structure | Add visited-node tracking, verify parent/child consistency |
| Performance degradation | Tree becoming unbalanced | Measure actual height vs. theoretical minimum |
| Inconsistent results | Concurrent modification bug | Enable latch debugging, review latch coupling logic |
Implement a debug mode that logs: (1) Each node visited with its key range, (2) Which child pointer is chosen and why, (3) The final decision (found/not found). This trace makes it trivial to identify where search diverges from expected behavior.
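The tip above can be sketched as a traced search (minimal node type and field names are illustrative):

```typescript
// Search with a trace log: record each node visited, the child chosen,
// and the final outcome, so a diverging search is easy to localize.
interface TraceNode {
  keys: number[];
  children: TraceNode[]; // empty for leaves
  isLeaf: boolean;
}

function tracedSearch(node: TraceNode, key: number, log: string[]): boolean {
  let current: TraceNode | null = node;
  while (current !== null) {
    log.push(`visit node [${current.keys.join(", ")}]`);
    let i = 0;
    while (i < current.keys.length && key > current.keys[i]) i++;
    if (i < current.keys.length && current.keys[i] === key) {
      log.push(`found ${key} at index ${i}`);
      return true;
    }
    if (current.isLeaf) {
      log.push(`not found: ${key}`);
      return false;
    }
    log.push(`descend to child ${i}`);
    current = current.children[i];
  }
  return false;
}
```

Diffing the logs of a failing search against a known-good run pinpoints exactly where the paths diverge.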
B-tree search is the foundation upon which all B-tree operations and database indexing rest. The key insights: the ordering invariant lets each node eliminate all but one subtree; large fan-out keeps the tree only a few levels deep, so any key is reachable in a handful of disk reads; B+-tree leaf chaining turns range queries into fast sequential I/O; and production systems layer buffer pools, latch coupling, and within-node search optimizations on top of the textbook algorithm.
You now understand B-tree search at a depth suitable for production implementation and optimization. The next page explores the insertion algorithm, where we'll see how B-trees maintain their balance invariants while growing—including the critical concept of node splitting.