A B-tree's efficiency comes not just from its tree structure, but from the careful organization of individual nodes. Each node is designed to maximize the useful information retrieved in a single disk read while maintaining the ability to efficiently search within the node.
Understanding node structure is essential because it connects abstract B-tree theory to concrete implementation. How many keys can fit in a node? Where are child pointers stored? How do we handle variable-length keys? These questions determine the practical performance of any B-tree implementation.
By the end of this page, you will understand the internal layout of B-tree nodes, the distinction between internal and leaf nodes, how nodes are organized on disk pages, and practical considerations for node sizing and key storage in production database systems.
At the conceptual level, every B-tree node contains:

- A count of the keys currently stored (n)
- A flag indicating whether the node is a leaf
- An array of keys, kept in sorted order
- An array of child pointers, one more than the number of keys (for internal nodes)
- Optionally, data associated with each key (in classic B-trees)
Let's visualize this structure:
```c
// Conceptual B-tree node (order m)
struct BTreeNode {
    int     n;             // Number of keys currently in node
    bool    isLeaf;        // True if node is a leaf
    Key     keys[m-1];     // Array of keys (at most m-1)
    Pointer children[m];   // Array of child pointers (at most m)
    Data    data[m-1];     // Associated data (for classic B-trees)
};

// Example: Node with 3 keys (order 5)
// n = 3
// keys = [10, 20, 30, _]          (1 empty slot)
// children = [C0, C1, C2, C3, _]  (1 empty slot)
// isLeaf = false

// Visual representation:
//      [10 | 20 | 30]
//     /    |    |    \
//   C0    C1   C2    C3
```

Key Array Organization
The keys array holds keys in sorted order, with the first n positions occupied and remaining positions empty (or containing placeholder values). This fixed-size allocation makes index calculations simple:
- `node.keys[i]`: the i-th smallest key in the node
- `node.children[i]`: the subtree containing keys less than `node.keys[i]`
- `node.children[i+1]`: the subtree containing keys greater than `node.keys[i]`

The Interleaving Pattern
Keys and child pointers interleave in a specific pattern:
```
Position:   0    1    2    3    4    5    6
           C₀   K₁   C₁   K₂   C₂   K₃   C₃
            ↓    ↓    ↓    ↓    ↓    ↓    ↓
         child  key child key child key child
```

Or equivalently:

```
C₀ | K₁ | C₁ | K₂ | C₂ | K₃ | C₃
```
This interleaving ensures that for any key Kᵢ, the left child (containing smaller keys) is immediately before it and the right child (containing larger keys) is immediately after.
Some implementations store keys in one array and children in another (as shown in the struct above). Others interleave them in a single array. Both work; separate arrays are conceptually cleaner while interleaving can improve cache locality during search.
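As a concrete illustration, here is a minimal sketch in C of the search-within-node step, assuming a simplified in-memory node with fixed-size integer keys (the layout mirrors the conceptual struct above; nothing here is tied to any particular database):

```c
#include <stdbool.h>

#define MAX_KEYS 4   /* order 5: at most 4 keys, 5 children */

struct Node {
    int  n;                          /* keys currently in the node */
    bool isLeaf;
    int  keys[MAX_KEYS];             /* sorted ascending */
    struct Node *children[MAX_KEYS + 1];
};

/* Return the index of the child to descend into for `key`:
 * the smallest i with key < keys[i], or n if key >= every key.
 * Binary search keeps the in-node cost logarithmic. */
int find_child_index(const struct Node *node, int key) {
    int lo = 0, hi = node->n;        /* search the range [lo, hi) */
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (key < node->keys[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;                       /* descend into children[lo] */
}
```

For the example node above ([10 | 20 | 30]), a search for 15 returns index 1, directing the descent into C1, the subtree between keys 10 and 20.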
B-tree nodes fall into two categories with distinctly different roles and structures:
Internal Nodes (Non-leaf Nodes)
Internal nodes serve as routing nodes. Their purpose is to direct searches toward the correct subtree. They contain:

- Separator keys that partition the key space among children
- Child pointers, one more than the number of keys
- Little or no actual data (in most variants)
Leaf Nodes
Leaf nodes are where data resides. Depending on the B-tree variant:

- Classic B-trees: leaves store keys plus their associated data, just like internal nodes
- B⁺-trees: leaves store all keys and data (or pointers to data); internal nodes hold only separators
- In either case, leaves have no children; child pointers are null or absent entirely
| Characteristic | Internal Node | Leaf Node |
|---|---|---|
| Role | Navigation/routing | Data storage |
| Contains keys | Yes (as separators) | Yes (as data keys) |
| Contains children | Yes (m pointers max) | No (null or absent) |
| Contains data | Usually no | Yes (or data pointers) |
| Can be root | Yes | Yes (when tree has only root) |
| Minimum keys | ⌈m/2⌉ - 1 | ⌈m/2⌉ - 1 (varies by implementation) |
Space Allocation Differences
Because internal and leaf nodes store different content, they often have different effective capacities:
```
Internal Node (4KB page):
  Keys:           4 bytes × 200 keys = 800 bytes
  Child pointers: 8 bytes × 201      = 1,608 bytes
  Overhead:       100 bytes (header, etc.)
  Total:          ~2,508 bytes → fits in 4KB

Leaf Node (4KB page, with data):
  Keys:           4 bytes × 100 keys = 400 bytes
  Data records:   30 bytes × 100     = 3,000 bytes
  Overhead:       100 bytes
  Total:          ~3,500 bytes → fits in 4KB

Result: Internal nodes hold 200 keys, leaves hold 100 keys
```
This is why B⁺-trees store data only in leaves—it allows internal nodes to have maximum fanout while leaves optimize for data storage.
The internal/leaf distinction affects tree height and I/O patterns. High fanout (many separators) in internal nodes minimizes height. Dense data packing in leaves minimizes total leaf count. B⁺-trees separate these concerns for optimal performance.
When B-tree nodes are persisted to disk, they must be serialized into a fixed-size block (page). The layout must balance several requirements:

- Fast location of keys and child pointers within the page
- Efficient use of the fixed page budget as keys are inserted and deleted
- Metadata for integrity and recovery (checksums, LSN)
Typical Page Layout
```
+--------------------------------------------------+
| Page Header                                      |
| [PageID | PageType | NumKeys | FreeSpace | LSN]  |
| (24 bytes)                                       |
+--------------------------------------------------+
| Key Section                                      |
| [Key₁ | Key₂ | Key₃ | ... | Keyₙ | (unused)]     |
| (variable: n × key_size)                         |
+--------------------------------------------------+
| Child Pointer Section                            |
| [Child₀ | Child₁ | ... | Childₙ | (unused)]      |
| (variable: (n+1) × pointer_size)                 |
+--------------------------------------------------+
| Free Space                                       |
| (grows/shrinks with content)                     |
+--------------------------------------------------+
| Page Footer                                      |
| [Checksum | Padding]                             |
| (8 bytes)                                        |
+--------------------------------------------------+
```

Page Header Components
| Field | Size | Description |
|---|---|---|
| PageID | 4-8 bytes | Unique identifier for this page |
| PageType | 1 byte | Internal, leaf, root, overflow, etc. |
| NumKeys | 2 bytes | Count of keys currently stored |
| FreeSpace | 2 bytes | Offset to start of free space |
| LSN | 8 bytes | Log Sequence Number for recovery |
| Parent | 4-8 bytes | Optional: pointer to parent page |
| Prev/Next | 8-16 bytes | Optional: sibling links (for B⁺-trees) |
Fixed vs Variable Layout
For fixed-size keys (integers, fixed char), layout is straightforward:
```
key_offset(i)     = header_size + i × key_size
pointer_offset(i) = key_section_end + i × pointer_size
```

For variable-size keys (varchar, text), a more complex layout is needed; the slot directory described below provides it.
Modern CPUs access memory most efficiently when data is aligned to natural boundaries (4-byte, 8-byte). Misaligned access can be 2-10× slower. Database page layouts typically pad fields to maintain alignment, trading some space for access speed.
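A quick C illustration of the space cost of alignment (the 16-byte result assumes a typical 64-bit ABI where `uint64_t` is 8-byte aligned; the struct itself is hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

struct HeaderField {
    uint8_t  page_type;   /* 1 byte */
    uint64_t lsn;         /* 8 bytes, must start on an 8-byte boundary */
};

int main(void) {
    /* 1 + 7 bytes of compiler padding + 8 = 16 bytes, not 9 */
    printf("sizeof(struct HeaderField) = %zu\n", sizeof(struct HeaderField));
    return 0;
}
```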
Most production databases use a slot directory (or slotted page) layout that elegantly handles variable-length keys and enables efficient reorganization.
Structure Overview
```
+--------------------------------------------------+
| Page Header (fixed size)                         |
+--------------------------------------------------+
| Slot Directory (grows →)                         |
| [Slot₀ | Slot₁ | Slot₂ | ... ]                   |
| Each slot: [offset | length] or [offset | flags] |
+--------------------------------------------------+
| Free Space                                       |
| (shrinks from both ends)                         |
+--------------------------------------------------+
|                        ← Records/Keys (grows ←)  |
| [Record₃ | Record₂ | Record₁ | Record₀]          |
| (stored in reverse order from page end)          |
+--------------------------------------------------+
| Page Footer                                      |
+--------------------------------------------------+

Slot 0 → points to → Record 0 (at offset X)
Slot 1 → points to → Record 1 (at offset Y)
...
```

How It Works

1. Records are written into free space starting from the page end, growing backward.
2. For each record, a fixed-size slot holding its offset (and length) is appended to the slot directory, which grows forward from the header.
3. To access record i, read slot i and follow its offset; the free space sits between the two growing regions.
Benefits of This Layout
| Benefit | Explanation |
|---|---|
| Variable-length support | Records can be any size; slots are uniform |
| Stable slot IDs | External references can use (page, slot#) as record ID |
| Easy compaction | Move records, update slot pointers; external refs unchanged |
| Search efficiency | Binary search on sorted slots, indirect access to records |
| Deletion simplicity | Mark slot as deleted; physical removal during compaction |
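To make the mechanics concrete, here is a minimal slotted-page sketch in C. The field names, sizes, and page size are illustrative assumptions, not any particular database's on-disk format:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

struct PageHeader {
    uint16_t num_slots;    /* entries in the slot directory */
    uint16_t free_start;   /* offset where free space begins */
    uint16_t free_end;     /* offset where free space ends; records live below */
};

struct Slot {
    uint16_t offset;       /* record's byte offset within the page */
    uint16_t length;       /* record's length in bytes */
};

void page_init(uint8_t page[PAGE_SIZE]) {
    struct PageHeader *h = (struct PageHeader *)page;
    h->num_slots  = 0;
    h->free_start = sizeof *h;   /* directory grows forward from here */
    h->free_end   = PAGE_SIZE;   /* records grow backward from here */
}

/* Insert a record; returns its slot number, or -1 if the page is full. */
int page_insert(uint8_t page[PAGE_SIZE], const void *rec, uint16_t len) {
    struct PageHeader *h = (struct PageHeader *)page;
    struct Slot *slots = (struct Slot *)(page + sizeof *h);
    if ((uint16_t)(h->free_end - h->free_start) < len + sizeof(struct Slot))
        return -1;                          /* not enough free space */
    h->free_end -= len;                     /* carve space at the back */
    memcpy(page + h->free_end, rec, len);
    slots[h->num_slots].offset = h->free_end;  /* record the indirection */
    slots[h->num_slots].length = len;
    h->free_start += sizeof(struct Slot);
    return h->num_slots++;
}

/* Fetch record i by indirecting through its slot. */
const void *page_get(const uint8_t page[PAGE_SIZE], int i, uint16_t *len) {
    const struct Slot *slots =
        (const struct Slot *)(page + sizeof(struct PageHeader));
    *len = slots[i].length;
    return page + slots[i].offset;
}
```

The single level of indirection in `page_get` is what keeps slot numbers stable: compaction can shuffle record bytes freely as long as the slot offsets are updated.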
For B-tree Nodes
In a B-tree context:

- Each slot points to one index entry: a key plus a child pointer (internal nodes) or a key plus a data pointer (leaves)
- Slots are kept in key order, so binary search runs over the small, uniform-size slot array rather than the variable-length records
- Compaction and splits move record bytes while slot entries are simply updated
PostgreSQL uses exactly this slotted page format. Each heap page and B-tree page contains a line pointer array (slot directory) and tuple data. The ItemIdData structure (4 bytes per slot) contains offset, length, and flags. This design has been refined over decades.
Child pointers in B-tree nodes need to identify the disk location of child pages. Different systems use different representations:
Option 1: Page Numbers (Block IDs)
Simplest approach—child pointer is just a page number:
```
Child pointer = 4 bytes (32-bit page number)
Addressable pages = 2³² ≈ 4 billion pages
At 4KB per page = 16 TB maximum file size
```

The storage engine translates a page number to a physical disk offset via:

```
disk_offset = page_number × page_size
```
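As a sketch, reading a child page by 32-bit page number might look like this in C, using POSIX `pread` (the function name, file descriptor, and page size are assumptions for illustration):

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Read page `page_no` into `buf`; returns 0 on success, -1 on failure.
 * The offset computation is exactly: disk_offset = page_number * page_size. */
int read_page(int fd, uint32_t page_no, uint8_t buf[PAGE_SIZE]) {
    off_t offset = (off_t)page_no * PAGE_SIZE;
    return pread(fd, buf, PAGE_SIZE, offset) == PAGE_SIZE ? 0 : -1;
}
```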
Option 2: File ID + Page Number
For systems with multiple files per database:
```
Child pointer = [file_id (2 bytes) | page_number (4 bytes)]
Total = 6 bytes
```
Option 3: Full Disk Address
For complex storage layouts:
```
Child pointer = [tablespace | file | segment | page]
Total = 8+ bytes
```
| Approach | Size | Pros | Cons |
|---|---|---|---|
| 32-bit page number | 4 bytes | Simple, compact | 16TB limit at 4KB pages |
| 64-bit page number | 8 bytes | Huge address space | Wastes space for small DBs |
| Page + slot | 6 bytes | Points to specific tuple | More complex, larger |
| TID (PostgreSQL) | 6 bytes | (block, offset) pair | Standard PostgreSQL format |
Fanout Impact
Pointer size directly affects fanout:
```
Page size:    4096 bytes
Key size:     8 bytes
Overhead:     100 bytes
Usable space: 3996 bytes

With 4-byte pointers:
  Entry size = 8 + 4 = 12 bytes
  Fanout ≈ 3996 / 12 = 333 children

With 8-byte pointers:
  Entry size = 8 + 8 = 16 bytes
  Fanout ≈ 3996 / 16 = 249 children

Difference: 333 vs 249 (a 25% reduction in fanout)
```
Lower fanout means taller trees and more I/O. This is why compact pointer representations matter.
NULL Pointers
Leaf nodes often use special NULL pointer values:
- `0xFFFFFFFF` or a similar reserved sentinel value

Some implementations save space by not storing child pointers at all in leaf nodes, using the page type field to know the node is a leaf.
In B⁺-trees, leaf pointers don't point to child B-tree nodes—they point to data records or data pages. The 'pointer' might be a heap tuple ID, a row address, or even the actual data (for index-organized tables).
How keys are stored within nodes significantly impacts both space efficiency and search performance.
Fixed-Length Keys
For simple types (integers, fixed-size char), keys are stored directly:
```
INTEGER key:   [4 bytes of value]
BIGINT key:    [8 bytes of value]
CHAR(10) key:  [10 bytes, null-padded if shorter]
UUID key:      [16 bytes of value]
```

Fixed-length keys enable direct offset calculation:

```
key_offset = base_offset + (key_index × key_size)
```
Variable-Length Keys
For VARCHAR, TEXT, or composite keys:
```
Storage format: [length (1-4 bytes) | actual key bytes]

Example 'hello' (VARCHAR):
[05 | 68 65 6C 6C 6F]
  ↑    ↑
 len  'h' 'e' 'l' 'l' 'o'
```

Variable-length keys require indirection (a slot directory) or an offset table for random access.
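A minimal decoding sketch in C, assuming a 1-byte length prefix as in the 'hello' example above (real formats often use 1-4 byte variable-width lengths):

```c
#include <stddef.h>
#include <stdint.h>

/* Parse a [length | bytes] entry at *p: returns a pointer to the key
 * bytes, stores the key length, and advances *p past the entry. */
const uint8_t *decode_key(const uint8_t **p, size_t *len) {
    *len = **p;                      /* 1-byte length prefix */
    const uint8_t *key = *p + 1;
    *p = key + *len;                 /* next entry starts here */
    return key;
}
```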
Key Compression Techniques
Production B-trees often compress keys to increase fanout:
1. Prefix Compression. Store only the distinguishing prefix of each key:
```
Full keys: 'application', 'application_id', 'application_name'

With prefix compression:
  'application'  (full)
  '_id'          (prefix 'application' implied)
  '_name'        (prefix 'application' implied)

Savings: 11 + 3 + 5 = 19 bytes vs 11 + 14 + 16 = 41 bytes
```
2. Suffix Truncation. In internal nodes, store only enough of the key to distinguish subtrees:
```
Separating 'alexander' from 'benjamin':
Full separator: 'benjamin' (8 bytes)
Truncated:      'b' (1 byte), sufficient to separate 'a*' from 'b*'
```
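A simplified suffix-truncation sketch in C: it returns the length of the shortest prefix of the right key that still sorts strictly above the left key. Real implementations must also handle details this sketch glosses over, such as collation rules:

```c
#include <stddef.h>

/* Length of the shortest prefix of `right` that still sorts strictly
 * above `left`, assuming byte-wise comparison and left < right.
 * Example: ("alexander", "benjamin") -> 1, i.e. the separator "b". */
size_t shortest_separator_len(const char *left, const char *right) {
    size_t i = 0;
    while (left[i] != '\0' && left[i] == right[i])
        i++;                         /* skip the shared prefix */
    return i + 1;                    /* one distinguishing byte suffices */
}
```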
3. Dictionary Encoding. For repeated key patterns (especially composite keys), store a dictionary plus codes:
```
Keys: (USA, NY, 10001), (USA, NY, 10002), (USA, CA, 90210)
Dictionary: {0: USA, 1: NY, 2: CA}
Encoded: (0, 1, 10001), (0, 1, 10002), (0, 2, 90210)
```
| Technique | Space Savings | Search Overhead | Best For |
|---|---|---|---|
| Prefix compression | 20-50% | Low (decode at compare) | Strings with common prefixes |
| Suffix truncation | 30-60% | None | Internal nodes only |
| Dictionary encoding | 50-80% | Medium (dict lookup) | Low-cardinality key parts |
| PFOR (integer) | 40-70% | Low (SIMD decode) | Sorted integer sequences |
Database systems like PostgreSQL's 'deduplication' for B-trees and Oracle's 'key compression' implement these techniques at the storage layer. More advanced systems use learned indexes that predict key positions, potentially reducing storage dramatically.
Beyond keys and pointers, B-tree nodes contain metadata that supports operations, concurrency, and recovery.
Essential Metadata Fields

Every node carries at least:

- The current key count
- The page type (internal, leaf, root, meta)
- Its level in the tree
- A free-space offset for in-page allocation
- A log sequence number (LSN) recording the last modification
- A checksum for integrity verification
Concurrency Control Metadata
For concurrent access, nodes track locking state:
```c
#include <stdint.h>

struct BTreePageHeader {
    // Standard fields
    uint32_t page_id;
    uint8_t  page_type;
    uint16_t num_keys;

    // Concurrency control
    uint32_t btpo_prev;    // Left sibling (for deadlock-free locking)
    uint32_t btpo_next;    // Right sibling
    uint16_t btpo_level;   // Level in tree
    uint16_t btpo_flags;   // State flags:
                           //   BTP_LEAF      (0x01)
                           //   BTP_ROOT      (0x02)
                           //   BTP_DELETED   (0x04)
                           //   BTP_META      (0x08)
                           //   BTP_HALF_DEAD (0x10)
                           //   BTP_SPLIT_END (0x20)

    // Recovery
    uint64_t btpo_lsn;     // Last modification LSN
};
```
Flags Explained

- BTP_LEAF / BTP_ROOT: identify the page's role in the tree
- BTP_META: marks the metadata page that locates the root
- BTP_DELETED: the page has been removed from the tree and awaits reuse
- BTP_HALF_DEAD: a page deletion is in progress but not yet complete
- BTP_SPLIT_END: marks the boundary of an in-progress split, so recovery knows where a multi-page split stopped
The 'half-dead' and split-end flags are critical for crash recovery. If a crash occurs mid-split, these flags allow the recovery process to either complete or undo the split atomically. Without them, the tree could end up in an inconsistent state.
Choosing the right node (page) size is a critical design decision. It involves trade-offs between I/O efficiency, memory usage, and concurrency.
Common Page Sizes
| Database | Default Page Size | Configurable? |
|---|---|---|
| PostgreSQL | 8 KB | Compile-time (1-32 KB) |
| MySQL (InnoDB) | 16 KB | Config (4-64 KB) |
| Oracle | 8 KB | Tablespace-level (2-32 KB) |
| SQL Server | 8 KB | Fixed |
| SQLite | 4 KB | Per-database PRAGMA (512 B-64 KB) |
Factors Influencing Page Size
Larger Pages (16-64 KB)

- Higher fanout, so shorter trees and fewer I/Os per lookup
- Better throughput for sequential scans
- More wasted bandwidth and buffer-pool memory when only one record is needed
- Coarser-grained latching, which can hurt concurrency
Smaller Pages (4-8 KB)

- Less read and write amplification for point queries
- Finer-grained concurrency
- More efficient use of buffer-pool memory
- Lower fanout, so taller trees and potentially more I/Os
Matching Hardware Characteristics
Optimal page size depends on storage:
| Storage Type | Optimal Page Size | Why |
|---|---|---|
| HDD | 8-16 KB | Amortize seek time with larger reads |
| SSD (consumer) | 4-8 KB | Matches flash page/block size |
| NVMe SSD | 4-16 KB | Flexible; larger may improve throughput |
| PMEM (Optane) | 256-4096 bytes | Byte-addressable; can use small nodes |
8 KB pages work well for most OLTP workloads on modern hardware. 16 KB suits data warehousing with large sequential scans. Only consider 4 KB if memory is severely constrained or you're on very slow storage where minimizing read amplification matters.
Calculating Expected Fanout
Given your key and pointer sizes, estimate fanout:
```
Fanout = (page_size - header_overhead - footer) / (key_size + pointer_size + slot_overhead)
```

Example:

```
page_size     = 8192 bytes
overhead      = 100 bytes
key_size      = 8 bytes (BIGINT)
pointer_size  = 6 bytes (TID)
slot_overhead = 4 bytes

Fanout = (8192 - 100) / (8 + 6 + 4) = 8092 / 18 ≈ 449

For 1 billion keys:
Height = ⌈log₄₄₉(10⁹)⌉ = ⌈9 / 2.65⌉ = 4 levels
```
This calculation helps predict tree height and I/O requirements before implementation.
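This arithmetic is easy to automate; a small C sketch using the same assumed sizes as the example above:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double page_size = 8192, overhead = 100;        /* bytes */
    double key = 8, pointer = 6, slot = 4;          /* BIGINT key, TID, slot entry */
    double fanout = floor((page_size - overhead) / (key + pointer + slot));
    double height = ceil(log(1e9) / log(fanout));   /* levels for 1 billion keys */
    printf("fanout ~ %.0f, height = %.0f levels\n", fanout, height);
    return 0;
}
```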
We've examined B-tree node structure from concept to implementation. Let's consolidate the key insights:

- A node packs sorted keys and interleaved child pointers into a single disk page
- Internal nodes are pure routing structures; leaves hold the data (especially in B⁺-trees)
- Slotted page layouts handle variable-length keys while keeping stable record references
- Fanout is roughly usable page space divided by entry size, and it directly determines tree height
- Key compression and compact pointer formats raise fanout; page size tunes the I/O trade-off
- Header metadata (type, level, LSN, flags) supports concurrency control and crash recovery
What's Next
With node structure understood, we move to the balanced property—how B-trees maintain perfect balance through splits and merges, and why this guarantees logarithmic height regardless of insertion patterns.
You now understand how B-tree nodes are organized at both conceptual and physical levels. You can analyze node capacity, calculate expected fanout, and evaluate trade-offs in page sizing and key representation. Next, we'll see how the balanced property emerges from B-tree operations.