In a B+-tree, internal nodes serve a singular purpose: to direct searches toward the correct leaf node as efficiently as possible. They contain no actual data—only keys (separators) and pointers that form a roadmap through the tree.
This design philosophy—separating navigation from data storage—is fundamental to B+-tree efficiency. Internal nodes are small, few in number, and almost always resident in the buffer cache. They're the 'fast path' that makes billion-row lookups complete in mere milliseconds.
Understanding internal nodes reveals why B+-trees achieve such high fanout, how tree traversal actually works, and why the root and upper levels can be treated as essentially 'free' in I/O cost calculations.
By the end of this page, you will understand the structure of internal nodes, how separator keys guide searches, why internal nodes are almost always cached, and the relationship between internal node fanout and tree efficiency. You'll be able to reason about tree traversal at a structural level.
Internal nodes (also called inner nodes, branch nodes, or directory nodes) are all non-leaf nodes in the B+-tree. They form the hierarchical structure that enables logarithmic-time search.
Formal Definition:
An internal node is a tree node that contains only separator keys and child pointers. Each separator key K partitions the key space between two adjacent children: keys less than K belong to the subtree on its left, and keys greater than or equal to K belong to the subtree on its right.
Key Properties:
| Term | Definition | Significance |
|---|---|---|
| Root Node | The single internal node at tree top | Every search starts here; always cached |
| Branch Node | Any internal node between root and leaves | Guides descent; high fanout |
| Separator Key | Key value marking child subtree boundary | Enables binary search within node |
| Child Pointer | Reference to next-level node | Typically 6-8 bytes per pointer |
| Fence Keys | Lowest and highest keys in subtree | Some implementations store for validation |
Internal nodes have a simpler structure than leaves—no sibling links, no data payload, just keys and pointers arranged for efficient binary search.
```
┌────────────────────────────────────────────────────────────────┐
│ PAGE HEADER                                                    │
│ ┌─────────────┬─────────────┬─────────────┬───────────────┐    │
│ │ Page ID     │ Page Type   │ Entry Count │ Level         │    │
│ │ (8 bytes)   │ (INTERNAL)  │ (2 bytes)   │ (distance to  │    │
│ │             │             │             │  leaves)      │    │
│ └─────────────┴─────────────┴─────────────┴───────────────┘    │
├────────────────────────────────────────────────────────────────┤
│ KEY-POINTER PAIRS                                              │
│                                                                │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐     ┌─────────┐            │
│ │ Ptr₀    │ │ Ptr₁    │ │ Ptr₂    │ ... │ Ptrₙ    │            │
│ │ (6-8 B) │ │ (6-8 B) │ │ (6-8 B) │     │ (6-8 B) │            │
│ └────┬────┘ └────┬────┘ └────┬────┘     └────┬────┘            │
│      │           │           │               │                 │
│      ▼           ▼           ▼               ▼                 │
│  Keys < K₁  K₁ ≤ Keys < K₂  K₂ ≤ Keys < K₃   Keys ≥ Kₙ         │
│                                                                │
│     ┌─────────┐ ┌─────────┐     ┌─────────┐                    │
│     │ Key₁    │ │ Key₂    │ ... │ Keyₙ    │                    │
│     │(4-var B)│ │(4-var B)│     │(4-var B)│                    │
│     └─────────┘ └─────────┘     └─────────┘                    │
│                                                                │
├────────────────────────────────────────────────────────────────┤
│ ALTERNATIVE LAYOUT                                             │
│ Some systems interleave:  [Ptr₀|Key₁|Ptr₁|Key₂|Ptr₂|...]       │
│ Others store separately:  [Ptr₀|Ptr₁|...|Ptrₙ][Key₁|...|Keyₙ]  │
└────────────────────────────────────────────────────────────────┘

SEARCH ALGORITHM (within a single internal node):
1. Binary search for the smallest Key_i where Key_i > search_key
2. If such a Key_i exists at position i: follow Ptr_{i-1}
3. If no such Key_i exists: follow Ptr_n (rightmost)

Example: searching for 'M' with keys [D, H, P, T]
- M > D, M > H, M < P
- Follow Ptr₂ (child containing H ≤ keys < P)
```

A fundamental property: an internal node with n keys has n+1 child pointers. Each key acts as a 'separator' between two adjacent children. This means fanout (child count) = key count + 1. When we say a node has fanout 500, it holds 499 keys and 500 pointers.
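To make the within-node rule concrete, here is a minimal Python sketch (not any particular engine's implementation) that picks the child pointer to follow, using the same keys [D, H, P, T] as the example above. The function name and layout are illustrative only.

```python
from bisect import bisect_right

def find_child_index(separator_keys, search_key):
    """Return the index of the child pointer to follow.

    With n separator keys there are n + 1 children:
      children[0]  holds keys <  separator_keys[0]
      children[i]  holds keys >= separator_keys[i-1] and < separator_keys[i]
      children[n]  holds keys >= separator_keys[n-1]
    """
    # The number of separators <= search_key is exactly the child index.
    return bisect_right(separator_keys, search_key)

keys = ["D", "H", "P", "T"]           # 4 separators -> 5 children (fanout 5)
print(find_child_index(keys, "M"))    # 2 -> child covering H <= key < P
print(find_child_index(keys, "A"))    # 0 -> leftmost child (keys < D)
print(find_child_index(keys, "T"))    # 4 -> rightmost child (keys >= T)
```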
The keys stored in internal nodes aren't arbitrary—they're carefully chosen separator keys that efficiently partition the key space.
Separator Key Properties:
Delimiting function — Key Kᵢ separates child subtree i-1 (containing keys < Kᵢ) from subtree i (containing keys ≥ Kᵢ)
Need not exist in data — Separators must correctly partition the key space but needn't be actual data values. Some systems use 'fence keys' or 'suffix truncation' to create minimal separators.
Sorted within node — Keys are maintained in sorted order to enable O(log fanout) binary search within each node.
How Separator Keys Are Chosen:
| Strategy | Description | Advantages |
|---|---|---|
| Copy-up | Copy first key of right child as separator | Simple, always valid separator |
| Push-up | Move first key of right child (removing from child) | Saves space; used in B-trees (not B+) |
| Truncated | Use shortest prefix distinguishing children | Higher fanout with long keys |
| Fence keys | Store min and max of each child subtree | Enables validation and optimization |
Truncation for Higher Fanout:
Consider two adjacent leaf nodes with these key ranges:
```
['algorithm', 'apple', 'application']   ← left leaf
['database', 'design', 'developer']     ← right leaf
```

A naive separator would be 'database' (8 bytes). But 'd' (1 byte) is sufficient to distinguish any key in the left child from any key in the right!
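The following Python sketch shows one simple way such a minimal separator could be derived from the largest key of the left child and the smallest key of the right child. It assumes plain byte-wise string comparison of distinct keys; real engines (for example, PostgreSQL's pivot-tuple truncation) handle collations, multi-column keys, and duplicates with more care.

```python
def minimal_separator(left_max, right_min):
    """Shortest prefix of right_min that still sorts strictly above left_max.

    The result is a valid separator: every key in the left child compares
    below it, every key in the right child compares at or above it.
    Simplified sketch assuming byte-wise comparison of distinct keys.
    """
    for length in range(1, len(right_min) + 1):
        candidate = right_min[:length]
        if candidate > left_max:
            return candidate
    return right_min  # no shorter prefix works; keep the full key

# Left leaf ends at 'application', right leaf starts at 'database'
print(minimal_separator("application", "database"))  # 'd' -> 1 byte instead of 8
```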
This suffix truncation can dramatically improve internal node fanout when indexing string columns:
| Separator Type | Key Size | Fanout (8KB page) |
|---|---|---|
| Full key (avg 30 bytes) | 30 bytes | ~230 |
| Truncated (avg 5 bytes) | 5 bytes | ~740 |
That's 3x higher fanout, potentially reducing tree height by one level.
PostgreSQL 12+ implements B-tree suffix truncation automatically for text columns. The 'pivot tuple' (separator) stores only enough of the key to distinguish children. This is particularly effective for URLs, file paths, or any data with common prefixes, significantly reducing index size and improving performance.
Because internal nodes contain no data payload, they achieve higher fanout than leaf nodes in most scenarios. This difference is crucial for tree efficiency.
| Scenario | Internal Node Fanout | Leaf Node Fanout | Ratio |
|---|---|---|---|
| INT key, non-clustered index | ~800 | ~600 | 1.3x |
| BIGINT key, clustered index | ~500 | ~100* | 5x |
| VARCHAR(100) key, non-clustered | ~250 | ~200 | 1.25x |
| VARCHAR(100) with truncation | ~600 | ~200 | 3x |
| Composite 3-column key | ~200 | ~150 | 1.3x |
*Clustered index leaf fanout is low because leaves contain entire rows, not just keys.
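The same arithmetic explains the footnote: leaf fanout collapses when leaves store whole rows. A small sketch with assumed sizes (8 KB pages, 50-byte header, hypothetical entry sizes):

```python
def leaf_fanout(entry_size, page_size=8192, header=50):
    """Approximate entries per leaf page for a given entry size in bytes."""
    return (page_size - header) // entry_size

print(leaf_fanout(entry_size=8 + 6))   # ~581: non-clustered leaf (BIGINT key + row pointer)
print(leaf_fanout(entry_size=80))      # ~101: clustered leaf storing full 80-byte rows
```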
Fanout Calculation for Internal Nodes:
Internal Fanout = ⌊(Page Size - Header) / (Key Size + Pointer Size)⌋ + 1
For 8KB page, 50B header, 8B key, 6B pointer:
Fanout = ⌊(8192 - 50) / (8 + 6)⌋ + 1 = 581 + 1 = 582
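Expressed as code, the same back-of-the-envelope calculation looks like this (the sizes are the assumptions used above, not measured values):

```python
def internal_fanout(page_size=8192, header=50, key_size=8, pointer_size=6):
    """Maximum children per internal node: n keys separate n + 1 pointers."""
    keys = (page_size - header) // (key_size + pointer_size)
    return keys + 1

print(internal_fanout())              # 582 for the 8 KB / 8-byte-key example
print(internal_fanout(key_size=30))   # ~227 with 30-byte string separators
print(internal_fanout(key_size=5))    # ~741 with truncated separators
```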
Why Internal Fanout Matters Most:
Tree height is determined by internal node fanout, since descent traverses only internal nodes. Leaf fanout affects storage size and range-scan performance, but not descent cost. Therefore, keeping separator keys small, and internal fanout correspondingly high, is the most direct way to minimize tree height.
This is why INCLUDE columns (PostgreSQL 11+, SQL Server) are stored only in leaves—they add query-covering ability without impacting internal fanout.
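A rough height estimate makes the point concrete: for a fixed number of leaf pages, only the internal fanout appears in the logarithm. The leaf-page count and fanout values below are illustrative, reusing the truncation figures from earlier.

```python
import math

def tree_height(leaf_pages, internal_fanout):
    """Levels from root to leaf, assuming internal nodes are packed full."""
    if leaf_pages <= 1:
        return 1
    return 1 + math.ceil(math.log(leaf_pages, internal_fanout))

leaf_pages = 200_000                   # e.g. ~40M rows at ~200 entries per leaf
print(tree_height(leaf_pages, 230))    # 4 levels with full-key separators
print(tree_height(leaf_pages, 740))    # 3 levels with truncated separators
```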
The root node is an internal node with special properties. It's the entry point for every index operation and receives unique treatment by the database system.
Root Node Properties:
• Every index operation begins its descent at the root, so it is effectively always cached (some systems pin it in memory permanently).
• Its page ID is recorded in the index metadata (system catalogs), so it can be located without any search.
• It is the only node whose split increases tree height.
• In a small index, the root may itself be a leaf, making the entire index a single page.
Root Split: The Height Increase Event:
Tree height increases only when the root splits. Here is the mechanism:

Before (root is full, height = 2; assume a node holds at most four keys in this toy example):

```
           [Root: D | H | M | T]
          /      |     |     |    \
     [Leaf] [Leaf] [Leaf] [Leaf] [Leaf]
```

After inserting 'P': a leaf splits, pushing a fifth separator into the already-full root, which forces the root itself to split (height = 3):

```
                   [New Root: M]
                  /             \
       [Left: D | H]         [Right: P | T]
       /     |     \          /     |     \
  [Leaf] [Leaf] [Leaf]   [Leaf] [Leaf] [Leaf]
```
Implications:
Height increases are rare — The root must first fill with separator entries before it can split. With fanout 500, the root splits only once level 2 already holds 500 children, as the capacity sketch after this list illustrates.
Height never decreases during operations — Deleting data does not shrink the tree; the upper levels remain even as pages empty out. Only a rebuild such as REINDEX restores optimal height.
Root-only trees are common — Small tables fit in one page, making root = only leaf. This single-page index has O(1) lookup!
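To see why height increases are so rare, here is a rough capacity estimate. It assumes fanout 500 at every internal level and roughly 500 entries per leaf; real trees are less uniform, but the orders of magnitude hold.

```python
def max_entries(height, internal_fanout=500, leaf_entries=500):
    """Upper bound on indexed entries for a tree of the given height.

    height = 1 means the root is itself a leaf (single-page index).
    """
    return internal_fanout ** (height - 1) * leaf_entries

for h in range(1, 5):
    print(h, max_entries(h))
# height 1: 500             (root is the only page)
# height 2: 250,000
# height 3: 125,000,000
# height 4: 62,500,000,000  (the root splits again only beyond this)
```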
The root page ID is stored in index metadata (system catalogs). When queries access the index, they first read metadata to find the root, then begin descent. Since both metadata and root are always cached, this adds no I/O to the operation.
One of the most important practical properties of internal nodes is their cache behavior. Understanding this explains why indexes are faster than raw I/O arithmetic suggests.
| Tree Level | Node Count | Total Size | Access Frequency | Cache Status |
|---|---|---|---|---|
| Level 1 (Root) | 1 | 8 KB | Every query | 100% cached (pinned) |
| Level 2 | 500 | 4 MB | ~Every query | 99%+ cached |
| Level 3 | 250,000 | 2 GB | Hit by subset of queries | Partially cached |
| Leaves | 125M | ~1 TB | Each query hits 1-few | Rarely cached |
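The level-by-level sizes in the table come from simple arithmetic. The sketch below assumes a uniform fanout of 500 and 8 KB pages, matching the table's scenario.

```python
PAGE_SIZE = 8 * 1024   # bytes per page
FANOUT = 500           # children per internal node (assumed uniform)

def pages_per_level(leaf_pages, fanout=FANOUT):
    """Page counts from the root (level 1) down to the leaves."""
    levels = [leaf_pages]
    while levels[-1] > 1:
        levels.append(-(-levels[-1] // fanout))   # ceiling division
    return levels[::-1]                           # root first

for depth, pages in enumerate(pages_per_level(125_000_000), start=1):
    print(f"level {depth}: {pages:>12,} pages, {pages * PAGE_SIZE / 2**30:8.2f} GiB")
# level 1:            1 pages,     0.00 GiB   (the root: one 8 KB page)
# level 2:          500 pages,     0.00 GiB   (~4 MB)
# level 3:      250,000 pages,     1.91 GiB   (~2 GB)
# level 4:  125,000,000 pages,   953.67 GiB   (~1 TB of leaves)
```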
Why Internal Nodes Stay Cached:
Small total size — Internal nodes make up well under 1% of the index. Even for the very large tree in the table above, all internal levels total only ~2 GB; for a typical 100-million-row table they amount to just a few megabytes. Either way, they fit comfortably in a modest buffer pool.
Frequent access — Every index lookup traverses the same upper levels. Even with key skew, the root and level-2 nodes are hit by virtually all queries.
LRU behavior — Frequently accessed pages remain in cache. Internal nodes are the most frequently accessed pages in the entire database.
Explicit pinning — Some databases pin root pages in memory, never evicting them.
Effective I/O Cost:
For a height-4 B+-tree with well-cached internal nodes:
| Metric | Value |
|---|---|
| Theoretical I/O | 4 pages (root → leaf) |
| Actual I/O | 1-2 pages (typically just the leaf) |
| Why | Levels 1-2 always cached; level 3 often cached |
This is why database performance often appears better than worst-case analysis predicts.
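The same observation can be written as an expected-value calculation. The per-level cache hit rates below are illustrative assumptions, not measurements from any particular system.

```python
# Probability that each level of a height-4 tree is already in the buffer pool:
# root and level 2 effectively always resident, level 3 mostly, leaves rarely.
hit_rates = [1.00, 0.99, 0.80, 0.05]   # root, level 2, level 3, leaf

expected_physical_reads = sum(1 - p for p in hit_rates)
print(f"expected physical reads per lookup: {expected_physical_reads:.2f}")  # ~1.16
```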
A practical rule: ensure your buffer pool is large enough to hold all internal nodes of your most critical indexes, plus the working set of leaf pages. For most workloads, this means sizing the buffer pool to hold:
• All internal nodes (~0.1-1% of total index size)
• Hot leaf pages (varies by access pattern)
• Data pages for queries the index does not cover
The pg_statio_user_indexes and pg_buffercache views (PostgreSQL) or sys.dm_os_buffer_descriptors (SQL Server) can show actual cache residency.
Understanding how searches actually traverse internal nodes reveals the elegance of B+-tree design and enables reasoning about access patterns.
```
FUNCTION b_plus_tree_search(tree, search_key):
    // Start at root
    current_node = tree.root_page

    // Descend through internal nodes
    WHILE current_node.type == INTERNAL:
        // Binary search within node for correct child
        child_index = binary_search_for_child(current_node, search_key)
        // Follow pointer to next level
        current_node = current_node.children[child_index]

    // Now at leaf level - linear or binary search for key
    RETURN leaf_search(current_node, search_key)

FUNCTION binary_search_for_child(internal_node, search_key):
    // Find smallest index i where keys[i] > search_key
    // If no such i exists, this returns the rightmost child index
    keys = internal_node.keys          // [K₁, K₂, ..., Kₙ]
    low, high = 0, keys.length - 1

    WHILE low <= high:
        mid = (low + high) / 2         // integer division
        IF keys[mid] > search_key:
            high = mid - 1
        ELSE:
            low = mid + 1

    // 'low' is now the index of the first key > search_key
    // children[low] holds keys in [keys[low-1], keys[low])
    RETURN low

// Example trace: search for key 'M' in a node with keys [D, H, P, T]
//
// Iteration 1: mid=1 (H), M > H → low=2
// Iteration 2: mid=2 (P), M < P → high=1
// Iteration 3: low > high, exit
// Result: low=2, follow children[2] (contains keys H ≤ k < P)
```

Traversal Cost Analysis:
| Operation | CPU Cost | I/O Cost |
|---|---|---|
| Binary search in node | O(log₂ fanout) = ~9 comparisons | 0 (node already in memory) |
| Page fetch | O(1) | 0-1 (cached or not) |
| Per-level total | ~9 key comparisons | 0-1 page reads |
| Full traversal | O(h × log₂ F) = ~36 comparisons | 0-h page reads |
For a typical tree (h = 4, F = 500), that works out to roughly 36 key comparisons and at most 4 page reads per lookup, of which usually only 1-2 are physical I/O once the upper levels are cached.
Comparison Operations:
Key comparison is the CPU-bound aspect of traversal, and its cost depends on the key type:
• Integer keys (INT, BIGINT) compare in a single CPU instruction.
• String keys compare byte by byte, and collation-aware comparisons are considerably more expensive.
• Composite keys compare column by column until the first difference.
This is another reason to prefer simple key types—not just for fanout, but for comparison speed during traversal.
Modern databases are beginning to use SIMD (Single Instruction Multiple Data) instructions to parallelize key comparisons within internal nodes. Instead of binary search visiting keys sequentially, SIMD can compare against 8+ keys simultaneously, reducing the ~9 comparisons per node to ~2-3. This optimization is transparent but explains why newer database versions show improved point query performance.
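As a rough illustration of the idea (not how any specific engine implements it), a vectorized comparison can stand in for SIMD: instead of probing separators one at a time, the child index falls out of a single data-parallel comparison against every key in the node. The key values below are made up.

```python
import numpy as np

separators = np.array([110, 220, 330, 440, 550, 660, 770, 880])  # hypothetical node keys

def find_child_vectorized(separators, search_key):
    """Compare the search key against every separator in one data-parallel step.

    The count of separators <= search_key is the child index to follow,
    exactly as in the binary-search version shown earlier.
    """
    return int(np.count_nonzero(separators <= search_key))

print(find_child_vectorized(separators, 400))  # 3 -> child covering [330, 440)
print(find_child_vectorized(separators, 50))   # 0 -> leftmost child
```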
Internal nodes are the silent workhorses of B+-tree indexes — pure navigation structures that guide every search with minimal overhead. Their design enables the remarkable efficiency that makes B+-trees the backbone of database indexing.
What's Next:
With internal nodes understood, we'll explore performance metrics — the measurements and calculations that quantify index efficiency. You'll learn how to analyze I/O costs, estimate search times, and compare index configurations using concrete metrics.
You now understand internal nodes — the navigation layer that makes B+-tree searches efficient. This knowledge enables you to reason about tree traversal, understand caching behavior, and appreciate why B+-trees achieve such remarkable performance. Next, we examine the performance metrics that quantify index efficiency.