The multi-level index concept we explored in the previous page—building indexes on indexes to achieve logarithmic-base-fanout search—is elegant in theory. But a concept alone doesn't run on a database server. We need a concrete data structure that embodies this hierarchical organization while efficiently supporting not just searches, but also insertions, deletions, and modifications.
The answer is tree structures—specifically, balanced search trees designed for block-based storage. This page bridges the gap between the multi-level indexing concept and the B-trees and B+-trees that implement it in practice. Understanding this connection is essential: the tree isn't just an implementation detail—it's what makes multi-level indexing practical, maintainable, and robust.
By the end of this page, you will understand: (1) how multi-level indexes naturally form tree structures, (2) the key properties of balanced search trees for databases, (3) node organization and the role of internal vs. leaf nodes, (4) why balance is critical and how it's maintained, and (5) the fundamental differences between binary search trees and disk-optimized trees.
Let's formalize the relationship between multi-level indexes and tree structures. The mapping is straightforward and reveals why trees are the natural representation for hierarchical indexes.
The correspondence:
| Multi-Level Index Term | Tree Term | Description |
|---|---|---|
| Index block | Node | A unit of storage containing keys and pointers |
| Top-level index block | Root node | Single entry point; accessed first in every search |
| Bottom-level index block | Leaf node | Contains pointers to actual data records |
| Intermediate-level block | Internal node | Contains pointers to nodes at the next level down |
| Fanout | Branching factor | Number of child pointers per node |
| Number of levels | Height | Distance from root to leaves |
This correspondence isn't coincidental—a multi-level index is a tree, whether we call it that or not.
```
Multi-Level Index → Tree Structure Mapping
============================================

MULTI-LEVEL INDEX VIEW:                    TREE VIEW:

Level 3 (Root):  [Block]                   Root Node:      [1 block = 1 node]
        ↓                                        │
Level 2:  [Block] [Block] ...              Internal Nodes: [118 blocks = 118 nodes]
        ↓       ↓                                │
Level 1:  [B][B][B] [B][B][B] ...          Leaf Nodes:     [34,247 blocks = 34,247 nodes]
        ↓                                        │
Level 0:  DATA RECORDS                     DATA RECORDS

TREE TERMINOLOGY APPLIED:

Height (h)  = Number of levels = 3 (L3 → L2 → L1)
              Note: Height counts edges or levels depending on definition

Fanout (f)  = Max children per node = 292 (entries per block)

Degree (d)  = Typically: min children per internal node
              For B-trees: d = ⌈f/2⌉ (ensures at least half-full)

Root        = Single node at top; accessed first in every query

Internal    = Nodes between root and leaves; contain keys + child pointers
Nodes         Entry format: (key, child_pointer)

Leaf        = Bottom-level nodes; contain keys + data pointers
Nodes         Entry format: (key, data_pointer) OR (key, data_value)

Balance     = All paths from root to any leaf have the same length
              This is CRITICAL for predictable performance
```

Viewing multi-level indexes as trees isn't just naming—it unlocks a rich body of knowledge about tree algorithms, balance maintenance, and complexity analysis. When you understand index operations as tree operations, you can reason about performance, predict behavior, and troubleshoot issues using the extensive theory of balanced trees.
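To see the mapping computationally, the short sketch below builds the index bottom-up: it repeatedly divides by the fanout until a single root block remains. The figures it uses (10,000,000 data records, fanout 292) are taken from the examples on this page; treat it as a minimal sketch of the idea, not any particular database's algorithm.

```python
import math

def index_levels(num_records: int, fanout: int) -> list[int]:
    """Build a multi-level index bottom-up and return blocks per level.

    Level 1 indexes the data records; each higher level indexes the
    blocks of the level below, until a single root block remains.
    """
    levels = []
    entries = num_records
    while True:
        blocks = math.ceil(entries / fanout)
        levels.append(blocks)
        if blocks == 1:        # a single block: this is the root node
            break
        entries = blocks       # the next level indexes these blocks
    return levels

# Assumed figures from the running example: 10,000,000 records, fanout 292
for depth, blocks in enumerate(index_levels(10_000_000, 292), start=1):
    print(f"Level {depth}: {blocks:,} block(s)")
```

Running it reproduces the block counts in the diagram above: 34,247 leaf nodes, 118 internal nodes, and 1 root node, for a height of 3 levels.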
Each node in the index tree corresponds to a disk block. Understanding the internal organization of these nodes is essential for grasping how searches, insertions, and deletions work.
Internal node structure:
Internal nodes (including the root, unless it's also a leaf) contain a sorted sequence of key values that act as separators, plus child pointers to nodes at the next level down.
The relationship between keys and pointers follows a specific pattern: if a node has n keys, it has n+1 child pointers (or n pointers if using a slightly different convention).
```
Internal Node Structure
========================

Standard format (n keys, n+1 pointers):

┌─────────────────────────────────────────────────────────────────────┐
│ P₀ │ K₁ │ P₁ │ K₂ │ P₂ │ K₃ │ P₃ │ ... │ K_n │ P_n │ unused │
└─────────────────────────────────────────────────────────────────────┘

Where: K₁ < K₂ < K₃ < ... < K_n   (keys in sorted order)

Pointer meaning:
  P₀ points to subtree with all keys < K₁
  P₁ points to subtree with keys where K₁ ≤ key < K₂
  P₂ points to subtree with keys where K₂ ≤ key < K₃
  ...
  P_n points to subtree with all keys ≥ K_n

EXAMPLE: Internal node with 4 keys (5 pointers)

┌──────────────────────────────────────────────────────────────┐
│ P₀ │ 100 │ P₁ │ 200 │ P₂ │ 300 │ P₃ │ 400 │ P₄ │ unused │
└──────────────────────────────────────────────────────────────┘
   │          │          │          │          │
   ↓          ↓          ↓          ↓          ↓
  keys       keys       keys       keys       keys
  < 100     100-199    200-299    300-399    ≥ 400

Memory layout (assuming 4-byte keys, 6-byte pointers):
  - Each (pointer, key) pair: 6 + 4 = 10 bytes
  - First pointer alone: 6 bytes
  - 4 keys, 5 pointers: 6 + (4 × 10) = 46 bytes for entries
  - Block size 4096 bytes → up to ~400 entries possible
  - Actual capacity depends on header/metadata overhead

Leaf Node Structure (different from internal!)
==============================================

Leaf nodes store actual data pointers (or data values):

┌────────────────────────────────────────────────────────────────────────┐
│ K₁ │ DP₁ │ K₂ │ DP₂ │ K₃ │ DP₃ │ ... │ K_n │ DP_n │ next │ unused │
└────────────────────────────────────────────────────────────────────────┘

Where:
  K₁, K₂, ... K_n    = key values (sorted)
  DP₁, DP₂, ... DP_n = data pointers (to actual records)
  next               = pointer to next leaf node (for range scans) - B+-tree feature

Key differences from internal nodes:
  1. No "between keys" semantics - each key has exactly one data pointer
  2. May include sibling pointers for efficient range queries
  3. In B+-tree: ALL data pointers are in leaves, none in internal nodes
```

| Property | Internal Node | Leaf Node |
|---|---|---|
| Content | Keys + child pointers | Keys + data pointers (or data values) |
| Key count to pointer count | n keys → n+1 child pointers | n keys → n data pointers |
| Purpose | Navigate to correct subtree | Provide access to actual data |
| Sibling links | Not typically needed | Often linked for range scans |
| Key semantics | Separator/router values | Actual indexed values from data |
In internal nodes, keys serve as 'road signs' directing search traffic. They may or may not correspond to actual keys in the data. In leaf nodes, keys are the actual indexed values from the database records. This distinction becomes important when understanding B-tree vs. B+-tree designs.
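To make the "road sign" role concrete, here is a minimal Python sketch that picks which child pointer to follow for a given search key; `bisect_right` implements the n-keys-to-n+1-pointers rule directly. The node contents are hypothetical, mirroring the four-key example node above.

```python
from bisect import bisect_right

def choose_child(keys: list[int], children: list[str], search_key: int) -> str:
    """Pick the child pointer to follow in an internal node.

    With n sorted separator keys and n+1 child pointers, child i covers
    the range keys[i-1] <= key < keys[i]; bisect_right finds that slot.
    """
    assert len(children) == len(keys) + 1
    return children[bisect_right(keys, search_key)]

# Hypothetical internal node: 4 separator keys, 5 child pointers
keys = [100, 200, 300, 400]
children = ["P0", "P1", "P2", "P3", "P4"]

print(choose_child(keys, children, 57))    # P0  (keys < 100)
print(choose_child(keys, children, 250))   # P2  (200 <= key < 300)
print(choose_child(keys, children, 400))   # P4  (keys >= 400)
```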
The single most important property of database index trees is balance: all leaf nodes must be at the same depth from the root. This property guarantees predictable performance—every search follows a path of exactly the same length.
Why balance is non-negotiable:
Consider an unbalanced tree where some leaves are at depth 3 and others at depth 10. Searching for a key in a deep branch requires 10 I/O operations, while searching in a shallow branch requires only 3. This creates unpredictable performance: identical point queries can differ severalfold in cost, and as the imbalance grows, search degrades toward the O(n) worst case illustrated below.
```
Balance vs. Imbalance: A Concrete Example
==========================================

Inserting keys 1, 2, 3, 4, 5, 6, 7 in order:

UNBALANCED BST (simple binary search tree):

  1
   ╲
    2
     ╲
      3
       ╲
        4
         ╲
          5
           ╲
            6
             ╲
              7

  Height: 7 (equals number of nodes - worst case!)
  Search for key 7: 7 comparisons, 7 node accesses
  This is O(n), not O(log n)!

BALANCED TREE (e.g., B-tree, AVL, Red-Black):

        4
      ╱   ╲
     2     6
    ╱ ╲   ╱ ╲
   1   3 5   7

  Height: 3 (logarithmic in number of nodes)
  Search for ANY key: ≤3 comparisons, ≤3 node accesses
  This is O(log n), regardless of insertion order

FOR DATABASE INDEXES (with fanout f ≈ 300):

  Same 7 keys in a B-tree-style node:
  ┌───────────────────────────────────┐
  │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ ... │
  └───────────────────────────────────┘

  All 7 keys fit in a single node!
  Height: 1 (root is also a leaf)
  Search for any key: 1 node access

This is why high fanout is so powerful:
  - 7 keys             → 1 level (single block)
  - 7,000 keys         → 2 levels
  - 7,000,000 keys     → 3 levels
  - 7,000,000,000 keys → 4 levels

ALL of these are balanced, and the height grows very slowly.
```

Balance doesn't come for free. Maintaining balance during insertions and deletions requires extra work: node splits, node merges, key redistributions. But this maintenance cost is O(log n) per operation—well worth paying for guaranteed O(log n) search. The alternative (unbalanced trees) makes search O(n) in the worst case.
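The height figures in this example are easy to check. The sketch below is a rough calculation, assuming a fanout of 300 for the high-fanout case, that compares the chain produced by inserting sorted keys into a plain BST with the heights of a balanced binary tree and a B-tree-style node layout.

```python
import math

def bst_height_sequential(n: int) -> int:
    """Height of a plain BST after inserting 1..n in sorted order: a chain of n nodes."""
    return n

def balanced_binary_height(n: int) -> int:
    """Height (in levels) of a perfectly balanced binary tree holding n keys."""
    return math.ceil(math.log2(n + 1))

def high_fanout_height(n: int, fanout: int) -> int:
    """Levels needed when every node can hold up to `fanout` entries."""
    return max(1, math.ceil(math.log(n, fanout)))

for n in (7, 7_000, 7_000_000, 7_000_000_000):
    print(f"{n:>13,} keys: "
          f"chain BST = {bst_height_sequential(n):,} levels, "
          f"balanced binary = {balanced_binary_height(n)}, "
          f"fanout-300 tree = {high_fanout_height(n, 300)}")

# For 7 keys:             chain 7, balanced binary 3, fanout-300 tree 1
# For 7,000,000,000 keys: chain 7,000,000,000, balanced binary 33, fanout-300 tree 4
```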
If balance is the key property, why not use well-known balanced binary trees like AVL trees or Red-Black trees? These are proven data structures with guaranteed O(log n) operations. Yet no major database uses them for disk-based indexes. Understanding why reveals the fundamental optimization that makes B-trees suitable for databases.
The mismatch between binary trees and disk I/O:
| Characteristic | Binary Tree (f=2) | B-Tree (f≈300) | Impact |
|---|---|---|---|
| Keys per node | 1 | ~300 | Binary wastes 99.7% of block |
| Height for 1M keys | 20 levels | 3 levels | 7x more I/O for binary |
| Block utilization | ~20 bytes used of 4KB | ~4KB fully used | Binary is ~200x less efficient |
| Cache efficiency | Poor - many small nodes | Excellent - few large nodes | Binary thrashes cache |
| I/O per search | 20 random reads | 3 random reads | Binary is impractical |
```
Binary Tree vs. B-Tree: Quantitative Analysis
==============================================

Dataset:      10,000,000 records
Key size:     8 bytes
Pointer size: 6 bytes
Block size:   4,096 bytes

BINARY TREE (AVL/Red-Black):
----------------------------
Node content: 1 key + 2 child pointers = 8 + 6 + 6 = 20 bytes
Block utilization: 20 / 4096 = 0.5% (wasting 99.5% of block!)

Since we read whole blocks anyway, ONE BLOCK could hold:
  4096 / 20 = 204 nodes
But binary tree structure means we visit each node in sequence,
not all nodes in a block together.

Height of balanced binary tree with 10M nodes:
  log₂(10,000,000) = 23.25 → 24 levels

With tree nodes scattered across blocks:
  Worst case: 24 block reads per search
  (Each level might be in a different block)

Even if nodes are clustered:
  - Navigating the tree still requires following pointers
  - Each pointer might go to a different block
  - Expected I/O: ~20 block reads

B-TREE (fanout ≈ 300):
----------------------
Node content: ~300 keys + ~301 pointers = 300×8 + 301×6 = 4,206 bytes
Block utilization: ~100% (node sized to match block)

Height of B-tree with 10M keys:
  log₃₀₀(10,000,000) = 2.83 → 3 levels

Block reads per search: exactly 3 (one per level)

COMPARISON SUMMARY:
                     Binary Tree    B-Tree      Ratio
                     -----------    ------      -----
Height               24             3           8x
I/O per search       ~20            3           6.7x
Block utilization    0.5%           ~100%       200x
Cache friendliness   Very poor      Excellent

The B-tree is designed for block-based storage.
The binary tree is designed for pointer-based (RAM) storage.
Using binary trees on disk is a fundamental mismatch.
```

B-trees optimize for the actual cost function of disk storage: minimize block reads. Since each block read costs the same whether it holds 10 bytes or 4,000 bytes of useful data, we should maximize data per block. This means maximum fanout, which means nodes with hundreds of children—the opposite of binary trees.
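The same arithmetic can be reproduced in a few lines of Python. This sketch uses the block, key, and pointer sizes stated in the analysis above; the exact fanout and ratios differ slightly from the rounded figures because the node is constrained to fit the block.

```python
import math

BLOCK = 4096          # block size in bytes (as in the analysis above)
KEY, PTR = 8, 6       # key and pointer sizes in bytes
N = 10_000_000        # records in the dataset

# Binary tree: one key + two child pointers per node
binary_node = KEY + 2 * PTR                     # 20 bytes
binary_util = binary_node / BLOCK               # ~0.5% of the block
binary_height = math.ceil(math.log2(N))         # 24 levels

# B-tree: node sized to fill the block, so fanout ≈ BLOCK / (KEY + PTR)
fanout = BLOCK // (KEY + PTR)                   # ~292 children per node
btree_height = math.ceil(math.log(N, fanout))   # 3 levels

print(f"Binary node uses {binary_node} B of a {BLOCK} B block "
      f"({binary_util:.1%} utilization), height {binary_height}")
print(f"B-tree fanout {fanout}, height {btree_height}, "
      f"height ratio ~{binary_height / btree_height:.0f}x")
```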
Database index trees (B-trees and variants) maintain several invariants that ensure balance, efficiency, and correctness. These properties are non-negotiable—violating any of them breaks the structure.
The essential properties:
```python
class BTreeNode:
    """
    Representation of a B-tree node with invariant enforcement.

    Order (m): Maximum number of children per node

    For a B-tree of order m:
    - Each internal node has at most m children
    - Each internal node (except root) has at least ⌈m/2⌉ children
    - Root has at least 2 children (if not a leaf)
    - Each node has at most m-1 keys
    - Each node (except root) has at least ⌈m/2⌉-1 keys
    - All leaves are at the same depth
    """

    def __init__(self, order: int, is_leaf: bool = False):
        self.order = order              # Maximum children
        self.is_leaf = is_leaf          # Is this a leaf node?
        self.keys = []                  # Search keys (sorted)
        self.children = []              # Child pointers (one more than keys)
        self.data_pointers = []         # For leaves: pointers to actual data

    @property
    def min_keys(self) -> int:
        """Minimum keys required (except root)."""
        return (self.order + 1) // 2 - 1    # ⌈m/2⌉ - 1

    @property
    def max_keys(self) -> int:
        """Maximum keys allowed."""
        return self.order - 1

    @property
    def is_underfull(self) -> bool:
        """Node has fewer than minimum keys."""
        return len(self.keys) < self.min_keys

    @property
    def is_full(self) -> bool:
        """Node is at maximum capacity."""
        return len(self.keys) >= self.max_keys

    @property
    def can_donate(self) -> bool:
        """Node has more than minimum keys (can donate to sibling)."""
        return len(self.keys) > self.min_keys

    def validate(self) -> bool:
        """Check that all node invariants hold."""
        # Keys must be sorted
        if self.keys != sorted(self.keys):
            return False
        # Key count within bounds (ignoring root exception)
        if len(self.keys) > self.max_keys:
            return False
        # Internal nodes: children = keys + 1
        if not self.is_leaf:
            if len(self.children) != len(self.keys) + 1:
                return False
        # All keys unique (for primary indexes)
        if len(self.keys) != len(set(self.keys)):
            return False
        return True


def validate_btree(root: BTreeNode) -> tuple[bool, str]:
    """
    Validate entire B-tree structure.
    Returns (is_valid, error_message).
    """
    def check_depth(node: BTreeNode, current_depth: int) -> int:
        """Return depth of first leaf found, or -1 if inconsistent."""
        if node.is_leaf:
            return current_depth
        depths = [check_depth(child, current_depth + 1)
                  for child in node.children]
        if len(set(depths)) > 1 or -1 in depths:
            return -1   # Inconsistent depths
        return depths[0]

    leaf_depth = check_depth(root, 0)
    if leaf_depth == -1:
        return (False, "Leaves at different depths - not balanced!")
    return (True, f"Valid B-tree with height {leaf_depth}")
```

B-tree 'order' (often denoted m) has varying definitions in literature. Some define it as maximum children, others as maximum keys. The practical meaning is: order determines how many keys/children fit in a node. For disk-based trees, order is chosen to make node size equal block size.
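A minimal usage sketch, assuming the `BTreeNode` class and `validate_btree` function above are in scope: it hand-builds a tiny order-4 tree (unrealistically small for disk, but readable) and runs the validators.

```python
# Hand-build a tiny two-level tree and validate it.
root = BTreeNode(order=4)
leaf_a = BTreeNode(order=4, is_leaf=True)
leaf_b = BTreeNode(order=4, is_leaf=True)

leaf_a.keys, leaf_a.data_pointers = [10, 20], ["rec10", "rec20"]
leaf_b.keys, leaf_b.data_pointers = [30, 40], ["rec30", "rec40"]

root.keys = [30]                  # separator: left subtree < 30, right >= 30
root.children = [leaf_a, leaf_b]  # 1 key -> 2 child pointers

assert root.validate() and leaf_a.validate() and leaf_b.validate()
print(validate_btree(root))       # (True, 'Valid B-tree with height 1')
```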
The height of the index tree directly determines search cost. Since each level requires one block read, minimizing height is critical. Let's derive the height bounds mathematically.
Minimum height (best case - all nodes maximally full):
With nodes at maximum capacity (f children per internal node and f entries per leaf), the tree is as dense as possible. At depth d there are f^d nodes, so a tree of height h has f^(h-1) leaf nodes holding f entries each:
f^(h-1) leaves × f entries per leaf = f^h addressable records (the total node count is the geometric sum 1 + f + f² + ... + f^(h-1) = (f^h − 1) / (f − 1))
So a tree of height h can address approximately f^h records. Inverting:
h_min = ⌈log_f(n)⌉
Maximum height (worst case - all nodes minimally full):
With nodes at minimum capacity (⌈f/2⌉ children per internal node), the tree is sparsest. Let m = ⌈f/2⌉. Then:
h_max = ⌈log_m(n)⌉ = ⌈log_f(n) / log_f(m)⌉, which is only slightly larger than h_min: since m ≈ f/2, we have log_f(m) = 1 − log_f(2), so for f ≈ 300 this gives h_max ≈ 1.14 × h_min, at most about one extra level in practice.
In practice, nodes average 69-75% full, so actual height is close to minimum.
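Both bounds are straightforward to evaluate. The sketch below assumes a fanout of 300, matching the earlier examples, and computes h_min and h_max for a few table sizes; the results line up with the corresponding rows of the following table.

```python
import math

def height_bounds(n: int, fanout: int) -> tuple[int, int]:
    """Best/worst-case height from the formulas above.

    h_min assumes every node is full (f children);
    h_max assumes every node is minimally full (ceil(f/2) children).
    """
    m = math.ceil(fanout / 2)
    h_min = math.ceil(math.log(n, fanout))
    h_max = math.ceil(math.log(n, m))
    return h_min, h_max

# Assumed fanout of 300, as in the examples on this page
for n in (100_000, 10_000_000, 1_000_000_000):
    h_min, h_max = height_bounds(n, 300)
    print(f"{n:,} records: height between {h_min} and {h_max}")
# 100,000 records:       height between 3 and 3
# 10,000,000 records:    height between 3 and 4
# 1,000,000,000 records: height between 4 and 5
```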
| Records (n) | Min Height (h_min) | Max Height (h_max) | Typical Height | Block Reads to Search |
|---|---|---|---|---|
| 1,000 | 2 | 2 | 2 | 2 + 1 (data) |
| 100,000 | 3 | 3 | 3 | 3 + 1 (data) |
| 10,000,000 | 3 | 4 | 3-4 | 3-4 + 1 (data) |
| 1,000,000,000 | 4 | 5 | 4 | 4 + 1 (data) |
| 100,000,000,000 | 5 | 6 | 5 | 5 + 1 (data) |
| 10,000,000,000,000 | 5 | 6 | 5-6 | 5-6 + 1 (data) |
```sql
-- Calculating B+-tree height for various data sizes
-- Practical formula with realistic assumptions

WITH params AS (
    SELECT
        4096 AS block_size_bytes,
        8    AS key_size_bytes,
        6    AS pointer_size_bytes,   -- 6-8 bytes for block pointer
        0.70 AS typical_fill_factor   -- Nodes typically 70% full
),
calculated AS (
    SELECT
        FLOOR((block_size_bytes - 100) / (key_size_bytes + pointer_size_bytes))
            AS max_fanout,
        FLOOR((block_size_bytes - 100) / (key_size_bytes + pointer_size_bytes)
              * typical_fill_factor)
            AS typical_fanout
    FROM params
),
data_sizes AS (
    SELECT 1000 AS records, '1 thousand' AS label
    UNION ALL SELECT 1000000, '1 million'
    UNION ALL SELECT 1000000000, '1 billion'
    UNION ALL SELECT 1000000000000, '1 trillion'
)
SELECT
    d.label,
    d.records,
    c.typical_fanout AS fanout,
    CEILING(LOG(d.records) / LOG(c.typical_fanout)) AS tree_height,
    CEILING(LOG(d.records) / LOG(c.typical_fanout)) + 1 AS total_ios_per_search
FROM data_sizes d
CROSS JOIN calculated c;

-- Results with typical fanout ~200:
-- Records        Fanout   Height   Total I/Os (with data)
-- 1 thousand     199      2        3
-- 1 million      199      3        4
-- 1 billion      199      4        5
-- 1 trillion     199      6        7

/*
KEY INSIGHT:
  Even for a TRILLION records, a point lookup needs only 6 index block
  reads (7 including the data block)!

  This is the magic of high-fanout balanced trees:
  - Tree height grows as log_f(n)
  - With f ≈ 200, each level adds a factor of 200 to capacity
  - Physical limits (disk space) are reached before height becomes problematic
*/
```

For any database that fits on available storage (petabytes), the index tree height never exceeds 6-7 levels. This means every point query requires at most 7 I/O operations—a constant, predictable cost regardless of whether you have millions or trillions of records. This predictability is why B-trees are universal in databases.
Tree structures formalize the multi-level index concept into a robust, maintainable data structure. The key insight is designing trees for disk block access rather than pointer-based traversal—maximizing fanout to minimize height and I/O operations.
What's next:
With the tree structure foundation in place, the next page examines search efficiency in detail. We'll analyze how searches traverse the tree, quantify the exact I/O costs, compare with alternative approaches, and understand how caching and buffer management interact with tree-based search to achieve optimal performance.
You now understand how multi-level indexes manifest as tree structures: the node organization, balance properties, why high fanout is essential, and how height bounds guarantee efficient search. This prepares you for deeper exploration of search algorithms and tree maintenance operations in subsequent pages.