A B-tree's efficiency comes not just from its tree structure, but from the careful organization of individual nodes. Each node is designed to maximize the useful information retrieved in a single disk read while maintaining the ability to efficiently search within the node.
Understanding node structure is essential because it connects abstract B-tree theory to concrete implementation. How many keys can fit in a node? Where are child pointers stored? How do we handle variable-length keys? These questions determine the practical performance of any B-tree implementation.
By the end of this page, you will understand the internal layout of B-tree nodes, the distinction between internal and leaf nodes, how nodes are organized on disk pages, and practical considerations for node sizing and key storage in production database systems.
At the conceptual level, every B-tree node contains:

- A count of the keys currently stored (n)
- A flag indicating whether the node is a leaf
- An array of keys, kept in sorted order
- An array of child pointers, one more than the number of keys (for internal nodes)
- Optionally, data associated with each key (in classic B-trees)
Let's visualize this structure:
```c
// Conceptual B-tree node (order m)
struct BTreeNode {
    int     n;             // Number of keys currently in node
    bool    isLeaf;        // True if node is a leaf
    Key     keys[m-1];     // Array of keys (at most m-1)
    Pointer children[m];   // Array of child pointers (at most m)
    Data    data[m-1];     // Associated data (for classic B-trees)
};

// Example: Node with 3 keys (order 5)
// n = 3
// keys = [10, 20, 30, _]          (1 empty slot)
// children = [C0, C1, C2, C3, _]  (1 empty slot)
// isLeaf = false

// Visual representation:
//      [10 | 20 | 30]
//     /    |    |    \
//   C0    C1   C2    C3
```

Key Array Organization
The keys array holds keys in sorted order, with the first n positions occupied and remaining positions empty (or containing placeholder values). This fixed-size allocation makes index calculations simple:
- `node.keys[i]`: the i-th smallest key in the node
- `node.children[i]`: the subtree containing keys less than `node.keys[i]`
- `node.children[i+1]`: the subtree containing keys greater than `node.keys[i]`

The Interleaving Pattern
Keys and child pointers interleave in a specific pattern:
```
Position:   0    1    2    3    4    5    6
           C₀   K₁   C₁   K₂   C₂   K₃   C₃
            ↓    ↓    ↓    ↓    ↓    ↓    ↓
         child  key child key child key child
```

Or equivalently:

```
C₀ | K₁ | C₁ | K₂ | C₂ | K₃ | C₃
```
This interleaving ensures that for any key Kᵢ, the left child (containing smaller keys) is immediately before it and the right child (containing larger keys) is immediately after.
Some implementations store keys in one array and children in another (as shown in the struct above). Others interleave them in a single array. Both work; separate arrays are conceptually cleaner while interleaving can improve cache locality during search.
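As a concrete illustration, here is a minimal sketch in C of the search-within-node step, assuming a simplified in-memory node with fixed-size integer keys (the layout mirrors the conceptual struct above; nothing here is tied to any particular database):

```c
#include <stdbool.h>

#define MAX_KEYS 4   /* order 5: at most 4 keys, 5 children */

struct Node {
    int  n;                          /* keys currently in the node */
    bool isLeaf;
    int  keys[MAX_KEYS];             /* sorted ascending */
    struct Node *children[MAX_KEYS + 1];
};

/* Return the index of the child to descend into for `key`:
 * the smallest i with key < keys[i], or n if key >= every key.
 * Binary search keeps the in-node cost logarithmic. */
int find_child_index(const struct Node *node, int key) {
    int lo = 0, hi = node->n;        /* search the range [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (key < node->keys[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;                       /* descend into children[lo] */
}
```

For the example node above ([10 | 20 | 30]), a search for 15 returns index 1, directing the descent into C1, the subtree between keys 10 and 20.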
B-tree nodes fall into two categories with distinctly different roles and structures:
Internal Nodes (Non-leaf Nodes)
Internal nodes serve as routing nodes. Their purpose is to direct searches toward the correct subtree. They contain:

- Separator keys that partition the key space among children
- Child pointers, one more than the number of keys
- Little or no actual data (in most variants)
Leaf Nodes
Leaf nodes are where data resides. Depending on the B-tree variant:

- Classic B-trees: leaves store keys plus their associated data, just like internal nodes
- B⁺-trees: leaves store all keys and data (or pointers to data); internal nodes hold only separators
- In either case, leaves have no children; child pointers are null or absent entirely
| Characteristic | Internal Node | Leaf Node |
|---|---|---|
| Role | Navigation/routing | Data storage |
| Contains keys | Yes (as separators) | Yes (as data keys) |
| Contains children | Yes (m pointers max) | No (null or absent) |
| Contains data | Usually no | Yes (or data pointers) |
| Can be root | Yes | Yes (when tree has only root) |
| Minimum keys | ⌈m/2⌉ - 1 | ⌈m/2⌉ - 1 (varies by implementation) |
Space Allocation Differences
Because internal and leaf nodes store different content, they often have different effective capacities:
```
Internal Node (4KB page):
  Keys:           4 bytes × 200 keys = 800 bytes
  Child pointers: 8 bytes × 201      = 1,608 bytes
  Overhead:       100 bytes (header, etc.)
  Total:          ~2,508 bytes → fits in 4KB

Leaf Node (4KB page, with data):
  Keys:           4 bytes × 100 keys = 400 bytes
  Data records:   30 bytes × 100     = 3,000 bytes
  Overhead:       100 bytes
  Total:          ~3,500 bytes → fits in 4KB

Result: Internal nodes hold 200 keys, leaves hold 100 keys
```
This is why B⁺-trees store data only in leaves—it allows internal nodes to have maximum fanout while leaves optimize for data storage.
The internal/leaf distinction affects tree height and I/O patterns. High fanout (many separators) in internal nodes minimizes height. Dense data packing in leaves minimizes total leaf count. B⁺-trees separate these concerns for optimal performance.
When B-tree nodes are persisted to disk, they must be serialized into a fixed-size block (page). The layout must balance several requirements:

- Fast location of keys and child pointers within the page
- Efficient use of the fixed page budget as keys are inserted and deleted
- Metadata for integrity and recovery (checksums, LSN)
Typical Page Layout
```
+--------------------------------------------------+
| Page Header                                      |
| [PageID | PageType | NumKeys | FreeSpace | LSN]  |
| (24 bytes)                                       |
+--------------------------------------------------+
| Key Section                                      |
| [Key₁ | Key₂ | Key₃ | ... | Keyₙ | (unused)]     |
| (variable: n × key_size)                         |
+--------------------------------------------------+
| Child Pointer Section                            |
| [Child₀ | Child₁ | ... | Childₙ | (unused)]      |
| (variable: (n+1) × pointer_size)                 |
+--------------------------------------------------+
| Free Space                                       |
| (grows/shrinks with content)                     |
+--------------------------------------------------+
| Page Footer                                      |
| [Checksum | Padding]                             |
| (8 bytes)                                        |
+--------------------------------------------------+
```

Page Header Components
| Field | Size | Description |
|---|---|---|
| PageID | 4-8 bytes | Unique identifier for this page |
| PageType | 1 byte | Internal, leaf, root, overflow, etc. |
| NumKeys | 2 bytes | Count of keys currently stored |
| FreeSpace | 2 bytes | Offset to start of free space |
| LSN | 8 bytes | Log Sequence Number for recovery |
| Parent | 4-8 bytes | Optional: pointer to parent page |
| Prev/Next | 8-16 bytes | Optional: sibling links (for B⁺-trees) |
Fixed vs Variable Layout
For fixed-size keys (integers, fixed char), layout is straightforward:
```
key_offset(i)     = header_size + i × key_size
pointer_offset(i) = key_section_end + i × pointer_size
```

For variable-size keys (varchar, text), a more complex layout is needed; the slot directory described below provides it.
Modern CPUs access memory most efficiently when data is aligned to natural boundaries (4-byte, 8-byte). Misaligned access can be 2-10× slower. Database page layouts typically pad fields to maintain alignment, trading some space for access speed.
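A quick C illustration of the space cost of alignment (the 16-byte result assumes a typical 64-bit ABI where `uint64_t` is 8-byte aligned; the struct itself is hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

struct HeaderField {
    uint8_t  page_type;   /* 1 byte */
    uint64_t lsn;         /* 8 bytes, must start on an 8-byte boundary */
};

int main(void) {
    /* 1 + 7 bytes of compiler padding + 8 = 16 bytes, not 9 */
    printf("sizeof(struct HeaderField) = %zu\n", sizeof(struct HeaderField));
    return 0;
}
```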
Most production databases use a slot directory (or slotted page) layout that elegantly handles variable-length keys and enables efficient reorganization.
Structure Overview
```
+--------------------------------------------------+
| Page Header (fixed size)                         |
+--------------------------------------------------+
| Slot Directory (grows →)                         |
| [Slot₀ | Slot₁ | Slot₂ | ... ]                   |
| Each slot: [offset | length] or [offset | flags] |
+--------------------------------------------------+
| Free Space                                       |
| (shrinks from both ends)                         |
+--------------------------------------------------+
|                        ← Records/Keys (grows ←)  |
| [Record₃ | Record₂ | Record₁ | Record₀]          |
| (stored in reverse order from page end)          |
+--------------------------------------------------+
| Page Footer                                      |
+--------------------------------------------------+

Slot 0 → points to → Record 0 (at offset X)
Slot 1 → points to → Record 1 (at offset Y)
...
```

How It Works

1. Records are written into free space starting from the page end, growing backward.
2. For each record, a fixed-size slot holding its offset (and length) is appended to the slot directory, which grows forward from the header.
3. To access record i, read slot i and follow its offset; the free space sits between the two growing regions.
Benefits of This Layout
| Benefit | Explanation |
|---|---|
| Variable-length support | Records can be any size; slots are uniform |
| Stable slot IDs | External references can use (page, slot#) as record ID |
| Easy compaction | Move records, update slot pointers; external refs unchanged |
| Search efficiency | Binary search on sorted slots, indirect access to records |
| Deletion simplicity | Mark slot as deleted; physical removal during compaction |
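To make the mechanics concrete, here is a minimal slotted-page sketch in C. The field names, sizes, and page size are illustrative assumptions, not any particular database's on-disk format:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

struct PageHeader {
    uint16_t num_slots;    /* entries in the slot directory */
    uint16_t free_start;   /* offset where free space begins */
    uint16_t free_end;     /* offset where free space ends; records live below */
};

struct Slot {
    uint16_t offset;       /* record's byte offset within the page */
    uint16_t length;       /* record's length in bytes */
};

void page_init(uint8_t page[PAGE_SIZE]) {
    struct PageHeader *h = (struct PageHeader *)page;
    h->num_slots  = 0;
    h->free_start = sizeof *h;   /* directory grows forward from here */
    h->free_end   = PAGE_SIZE;   /* records grow backward from here */
}

/* Insert a record; returns its slot number, or -1 if the page is full. */
int page_insert(uint8_t page[PAGE_SIZE], const void *rec, uint16_t len) {
    struct PageHeader *h = (struct PageHeader *)page;
    struct Slot *slots = (struct Slot *)(page + sizeof *h);
    if ((uint16_t)(h->free_end - h->free_start) < len + sizeof(struct Slot))
        return -1;                          /* not enough free space */
    h->free_end -= len;                     /* carve space at the back */
    memcpy(page + h->free_end, rec, len);
    slots[h->num_slots].offset = h->free_end;  /* record the indirection */
    slots[h->num_slots].length = len;
    h->free_start += sizeof(struct Slot);
    return h->num_slots++;
}

/* Fetch record i by indirecting through its slot. */
const void *page_get(const uint8_t page[PAGE_SIZE], int i, uint16_t *len) {
    const struct Slot *slots =
        (const struct Slot *)(page + sizeof(struct PageHeader));
    *len = slots[i].length;
    return page + slots[i].offset;
}
```

The single level of indirection in `page_get` is what keeps slot numbers stable: compaction can shuffle record bytes freely as long as the slot offsets are updated.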
For B-tree Nodes
In a B-tree context:

- Each slot points to one index entry: a key plus a child pointer (internal nodes) or a key plus a data pointer (leaves)
- Slots are kept in key order, so binary search runs over the small, uniform-size slot array rather than the variable-length records
- Compaction and splits move record bytes while slot entries are simply updated
PostgreSQL uses exactly this slotted page format. Each heap page and B-tree page contains a line pointer array (slot directory) and tuple data. The ItemIdData structure (4 bytes per slot) contains offset, length, and flags. This design has been refined over decades.
Child pointers in B-tree nodes need to identify the disk location of child pages. Different systems use different representations:
Option 1: Page Numbers (Block IDs)
Simplest approach—child pointer is just a page number:
```
Child pointer = 4 bytes (32-bit page number)
Addressable pages = 2³² ≈ 4 billion pages
At 4KB per page = 16 TB maximum file size
```

The storage engine translates a page number to a physical disk offset via:

```
disk_offset = page_number × page_size
```
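As a sketch, reading a child page by 32-bit page number might look like this in C, using POSIX `pread` (the function name, file descriptor, and page size are assumptions for illustration):

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Read page `page_no` into `buf`; returns 0 on success, -1 on failure.
 * The offset computation is exactly: disk_offset = page_number * page_size. */
int read_page(int fd, uint32_t page_no, uint8_t buf[PAGE_SIZE]) {
    off_t offset = (off_t)page_no * PAGE_SIZE;
    return pread(fd, buf, PAGE_SIZE, offset) == PAGE_SIZE ? 0 : -1;
}
```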
Option 2: File ID + Page Number
For systems with multiple files per database:
```
Child pointer = [file_id (2 bytes) | page_number (4 bytes)]
Total = 6 bytes
```
Option 3: Full Disk Address
For complex storage layouts:
```
Child pointer = [tablespace | file | segment | page]
Total = 8+ bytes
```
| Approach | Size | Pros | Cons |
|---|---|---|---|
| 32-bit page number | 4 bytes | Simple, compact | 16TB limit at 4KB pages |
| 64-bit page number | 8 bytes | Huge address space | Wastes space for small DBs |
| Page + slot | 6 bytes | Points to specific tuple | More complex, larger |
| TID (PostgreSQL) | 6 bytes | (block, offset) pair | Standard PostgreSQL format |
Fanout Impact
Pointer size directly affects fanout:
```
Page size:    4096 bytes
Key size:     8 bytes
Overhead:     100 bytes
Usable space: 3996 bytes

With 4-byte pointers:
  Entry size = 8 + 4 = 12 bytes
  Fanout ≈ 3996 / 12 = 333 children

With 8-byte pointers:
  Entry size = 8 + 8 = 16 bytes
  Fanout ≈ 3996 / 16 = 249 children

Difference: 333 vs 249 (a 25% reduction in fanout)
```
Lower fanout means taller trees and more I/O. This is why compact pointer representations matter.
NULL Pointers
Leaf nodes often use special NULL pointer values:
- `0xFFFFFFFF` or a similar reserved sentinel value

Some implementations save space by not storing child pointers at all in leaf nodes, using the page type field to know the node is a leaf.
In B⁺-trees, leaf pointers don't point to child B-tree nodes—they point to data records or data pages. The 'pointer' might be a heap tuple ID, a row address, or even the actual data (for index-organized tables).
How keys are stored within nodes significantly impacts both space efficiency and search performance.
Fixed-Length Keys
For simple types (integers, fixed-size char), keys are stored directly:
```
INTEGER key:   [4 bytes of value]
BIGINT key:    [8 bytes of value]
CHAR(10) key:  [10 bytes, null-padded if shorter]
UUID key:      [16 bytes of value]
```

Fixed-length keys enable direct offset calculation:

```
key_offset = base_offset + (key_index × key_size)
```
Variable-Length Keys
For VARCHAR, TEXT, or composite keys:
```
Storage format: [length (1-4 bytes) | actual key bytes]

Example 'hello' (VARCHAR):
[05 | 68 65 6C 6C 6F]
  ↑    ↑
 len  'h' 'e' 'l' 'l' 'o'
```

Variable-length keys require indirection (a slot directory) or an offset table for random access.
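A minimal decoding sketch in C, assuming a 1-byte length prefix as in the 'hello' example above (real formats often use 1-4 byte variable-width lengths):

```c
#include <stddef.h>
#include <stdint.h>

/* Parse a [length | bytes] entry at *p: returns a pointer to the key
 * bytes, stores the key length, and advances *p past the entry. */
const uint8_t *decode_key(const uint8_t **p, size_t *len) {
    *len = **p;                      /* 1-byte length prefix */
    const uint8_t *key = *p + 1;
    *p = key + *len;                 /* next entry starts here */
    return key;
}
```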
Key Compression Techniques
Production B-trees often compress keys to increase fanout:
1. Prefix Compression. Store only the distinguishing prefix of each key:
```
Full keys: 'application', 'application_id', 'application_name'

With prefix compression:
  'application'  (full)
  '_id'          (prefix 'application' implied)
  '_name'        (prefix 'application' implied)

Savings: 11 + 3 + 5 = 19 bytes vs 11 + 14 + 16 = 41 bytes
```
2. Suffix Truncation. In internal nodes, store only enough of the key to distinguish subtrees:
```
Separating 'alexander' from 'benjamin':
Full separator: 'benjamin' (8 bytes)
Truncated:      'b' (1 byte), sufficient to separate 'a*' from 'b*'
```
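A simplified suffix-truncation sketch in C: it returns the length of the shortest prefix of the right key that still sorts strictly above the left key. Real implementations must also handle details this sketch glosses over, such as collation rules:

```c
#include <stddef.h>

/* Length of the shortest prefix of `right` that still sorts strictly
 * above `left`, assuming byte-wise comparison and left < right.
 * Example: ("alexander", "benjamin") -> 1, i.e. the separator "b". */
size_t shortest_separator_len(const char *left, const char *right) {
    size_t i = 0;
    while (left[i] != '\0' && left[i] == right[i])
        i++;                         /* skip the shared prefix */
    return i + 1;                    /* one distinguishing byte suffices */
}
```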
3. Dictionary Encoding. For repeated key patterns (especially composite keys), store a dictionary plus codes:
```
Keys: (USA, NY, 10001), (USA, NY, 10002), (USA, CA, 90210)
Dictionary: {0: USA, 1: NY, 2: CA}
Encoded: (0, 1, 10001), (0, 1, 10002), (0, 2, 90210)
```
| Technique | Space Savings | Search Overhead | Best For |
|---|---|---|---|
| Prefix compression | 20-50% | Low (decode at compare) | Strings with common prefixes |
| Suffix truncation | 30-60% | None | Internal nodes only |
| Dictionary encoding | 50-80% | Medium (dict lookup) | Low-cardinality key parts |
| PFOR (integer) | 40-70% | Low (SIMD decode) | Sorted integer sequences |
Database systems like PostgreSQL's 'deduplication' for B-trees and Oracle's 'key compression' implement these techniques at the storage layer. More advanced systems use learned indexes that predict key positions, potentially reducing storage dramatically.
Beyond keys and pointers, B-tree nodes contain metadata that supports operations, concurrency, and recovery.
Essential Metadata Fields

Every node carries at least:

- The current key count
- The page type (internal, leaf, root, meta)
- Its level in the tree
- A free-space offset for in-page allocation
- A log sequence number (LSN) recording the last modification
- A checksum for integrity verification
Concurrency Control Metadata
For concurrent access, nodes track locking state:
```c
#include <stdint.h>

struct BTreePageHeader {
    // Standard fields
    uint32_t page_id;
    uint8_t  page_type;
    uint16_t num_keys;

    // Concurrency control
    uint32_t btpo_prev;    // Left sibling (for deadlock-free locking)
    uint32_t btpo_next;    // Right sibling
    uint16_t btpo_level;   // Level in tree
    uint16_t btpo_flags;   // State flags:
                           //   BTP_LEAF      (0x01)
                           //   BTP_ROOT      (0x02)
                           //   BTP_DELETED   (0x04)
                           //   BTP_META      (0x08)
                           //   BTP_HALF_DEAD (0x10)
                           //   BTP_SPLIT_END (0x20)

    // Recovery
    uint64_t btpo_lsn;     // Last modification LSN
};
```
Flags Explained

- BTP_LEAF / BTP_ROOT: identify the page's role in the tree
- BTP_META: marks the metadata page that locates the root
- BTP_DELETED: the page has been removed from the tree and awaits reuse
- BTP_HALF_DEAD: a page deletion is in progress but not yet complete
- BTP_SPLIT_END: marks the boundary of an in-progress split, so recovery knows where a multi-page split stopped
The 'half-dead' and split-end flags are critical for crash recovery. If a crash occurs mid-split, these flags allow the recovery process to either complete or undo the split atomically. Without them, the tree could end up in an inconsistent state.
Choosing the right node (page) size is a critical design decision. It involves trade-offs between I/O efficiency, memory usage, and concurrency.
Common Page Sizes
| Database | Default Page Size | Configurable? |
|---|---|---|
| PostgreSQL | 8 KB | Compile-time (1-32 KB) |
| MySQL (InnoDB) | 16 KB | Config (4-64 KB) |
| Oracle | 8 KB | Tablespace-level (2-32 KB) |
| SQL Server | 8 KB | Fixed |
| SQLite | 4 KB | Per-database PRAGMA (512 B-64 KB) |
Factors Influencing Page Size
Larger Pages (16-64 KB)

- Higher fanout, so shorter trees and fewer I/Os per lookup
- Better throughput for sequential scans
- More wasted bandwidth and buffer-pool memory when only one record is needed
- Coarser-grained latching, which can hurt concurrency
Smaller Pages (4-8 KB)

- Less read and write amplification for point queries
- Finer-grained concurrency
- More efficient use of buffer-pool memory
- Lower fanout, so taller trees and potentially more I/Os
Matching Hardware Characteristics
Optimal page size depends on storage:
| Storage Type | Optimal Page Size | Why |
|---|---|---|
| HDD | 8-16 KB | Amortize seek time with larger reads |
| SSD (consumer) | 4-8 KB | Matches flash page/block size |
| NVMe SSD | 4-16 KB | Flexible; larger may improve throughput |
| PMEM (Optane) | 256-4096 bytes | Byte-addressable; can use small nodes |
8 KB pages work well for most OLTP workloads on modern hardware. 16 KB suits data warehousing with large sequential scans. Only consider 4 KB if memory is severely constrained or you're on very slow storage where minimizing read amplification matters.
Calculating Expected Fanout
Given your key and pointer sizes, estimate fanout:
```
Fanout = (page_size - header_overhead - footer) / (key_size + pointer_size + slot_overhead)
```

Example:

```
page_size     = 8192 bytes
overhead      = 100 bytes
key_size      = 8 bytes (BIGINT)
pointer_size  = 6 bytes (TID)
slot_overhead = 4 bytes

Fanout = (8192 - 100) / (8 + 6 + 4) = 8092 / 18 ≈ 449

For 1 billion keys:
Height = ⌈log₄₄₉(10⁹)⌉ = ⌈9 / 2.65⌉ = 4 levels
```
This calculation helps predict tree height and I/O requirements before implementation.
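This arithmetic is easy to automate; a small C sketch using the same assumed sizes as the example above:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double page_size = 8192, overhead = 100;        /* bytes */
    double key = 8, pointer = 6, slot = 4;          /* BIGINT key, TID, slot entry */
    double fanout = floor((page_size - overhead) / (key + pointer + slot));
    double height = ceil(log(1e9) / log(fanout));   /* levels for 1 billion keys */
    printf("fanout ~ %.0f, height = %.0f levels\n", fanout, height);
    return 0;
}
```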
We've examined B-tree node structure from concept to implementation. Let's consolidate the key insights:

- A node packs sorted keys and interleaved child pointers into a single disk page
- Internal nodes are pure routing structures; leaves hold the data (especially in B⁺-trees)
- Slotted page layouts handle variable-length keys while keeping stable record references
- Fanout is roughly usable page space divided by entry size, and it directly determines tree height
- Key compression and compact pointer formats raise fanout; page size tunes the I/O trade-off
- Header metadata (type, level, LSN, flags) supports concurrency control and crash recovery
What's Next
With node structure understood, we move to the balanced property—how B-trees maintain perfect balance through splits and merges, and why this guarantees logarithmic height regardless of insertion patterns.
You now understand how B-tree nodes are organized at both conceptual and physical levels. You can analyze node capacity, calculate expected fanout, and evaluate trade-offs in page sizing and key representation. Next, we'll see how the balanced property emerges from B-tree operations.