Every query optimizer in every database makes decisions based on cost estimation—and for B+-tree operations, the dominant cost is almost always disk I/O. Understanding how to model, predict, and minimize I/O costs is fundamental to database performance engineering.
This page develops a rigorous framework for analyzing I/O costs across all B+-tree operations: from simple point lookups to complex range scans, from individual inserts to bulk loading strategies.
By the end of this page, you will be able to: understand the fundamentals of the I/O cost model, analyze costs for every B+-tree operation type, factor in buffer pool effects, predict costs for real-world query patterns, and identify optimization opportunities through cost analysis.
Before analyzing specific operations, we need a rigorous model for I/O costs. The total cost of any disk access consists of multiple components.
The Fundamental I/O Cost Equation:
$$T_{I/O} = T_{seek} + T_{rotation} + T_{transfer}$$
Where:
| Component | HDD (7200 RPM) | SSD (SATA) | SSD (NVMe) | Notes |
|---|---|---|---|---|
| Seek/Access | 5-10 ms | 0.05-0.1 ms | 0.01-0.05 ms | Fixed overhead per I/O |
| Rotational Latency | ~4 ms | N/A | N/A | HDD only, 1/2 revolution |
| Transfer Rate | 100-200 MB/s | 400-600 MB/s | 2-7 GB/s | Sequential bandwidth |
| Random 4KB Read | ~10 ms | ~0.1 ms | ~0.05 ms | Typical point lookup |
| Random 4KB IOPS | ~100 | ~50,000 | ~500,000 | Operations per second |
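Plugging representative HDD figures from the table into the equation (8 ms seek, ~4 ms rotational latency, an 8 KB page at 150 MB/s) shows why random I/O on spinning disks is so expensive:
$$T_{I/O} \approx 8\,\text{ms} + 4\,\text{ms} + \frac{8\,\text{KB}}{150\,\text{MB/s}} \approx 8\,\text{ms} + 4\,\text{ms} + 0.05\,\text{ms} \approx 12\,\text{ms}$$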
Cost Model Parameters:
For analytical purposes, we define key parameters:
```python
class IOCostModel:
    """
    I/O cost model for database operations.
    All times in milliseconds, sizes in bytes.
    """
    def __init__(self, storage_type='ssd_nvme'):
        if storage_type == 'hdd':
            self.seek_time = 8.0         # ms
            self.rotation_latency = 4.0  # ms (7200 RPM)
            self.transfer_rate = 150     # MB/s
        elif storage_type == 'ssd_sata':
            self.seek_time = 0.1         # ms
            self.rotation_latency = 0    # N/A
            self.transfer_rate = 500     # MB/s
        else:  # ssd_nvme
            self.seek_time = 0.03        # ms
            self.rotation_latency = 0    # N/A
            self.transfer_rate = 3000    # MB/s
        self.page_size = 8192            # 8 KB default

    def random_read_time(self, pages=1):
        """Time for random page reads."""
        per_page = (
            self.seek_time
            + self.rotation_latency
            + (self.page_size / 1024 / 1024) / self.transfer_rate * 1000
        )
        return pages * per_page

    def sequential_read_time(self, pages):
        """Time for sequential page reads (amortized seek)."""
        # First page: full seek + transfer
        first_page = self.random_read_time(1)
        if pages <= 1:
            return first_page
        # Subsequent pages: minimal seek, just transfer
        transfer_time = (pages - 1) * (self.page_size / 1024 / 1024) / self.transfer_rate * 1000
        return first_page + transfer_time

    def iops_capacity(self):
        """Maximum random I/O operations per second."""
        single_io = self.random_read_time(1)
        return 1000 / single_io  # ms per I/O -> operations per second


# Compare storage types
for storage in ['hdd', 'ssd_sata', 'ssd_nvme']:
    model = IOCostModel(storage)
    print(f"\n{storage.upper()}:")
    print(f"  Random 1 page:       {model.random_read_time(1):.3f} ms")
    print(f"  Random 10 pages:     {model.random_read_time(10):.3f} ms")
    print(f"  Sequential 10 pages: {model.sequential_read_time(10):.3f} ms")
    print(f"  Max IOPS:            {model.iops_capacity():.0f}")
```
For HDDs, seek time dominates (8 ms >> 0.05 ms transfer). For SSDs, transfer becomes proportionally more significant. This fundamentally changes optimization strategies: HDDs reward larger pages and sequential access; SSDs reward minimizing the total number of pages read.
A point lookup retrieves a single record by its key—the fundamental B+-tree operation. Understanding its cost is essential.
The Point Lookup Algorithm: start at the root page, binary-search its keys to select the child pointer, repeat at each internal level, and finally binary-search the leaf page for the target key. Each level visited costs one page read.
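A minimal traversal sketch makes the cost structure explicit (the node layout and `read_page` helper are hypothetical, not from any specific engine): each loop iteration is exactly one page read, which is where the $h \times C_{read}$ term comes from.

```python
from bisect import bisect_right

def point_lookup(read_page, root_page_id, search_key):
    """Descend from root to leaf; one page read per level visited."""
    node = read_page(root_page_id)            # one I/O (or buffer-pool hit)
    while not node.is_leaf:
        idx = bisect_right(node.keys, search_key)
        node = read_page(node.children[idx])  # one I/O per level descended
    # Binary search within the leaf; pure CPU, no additional I/O
    idx = bisect_right(node.keys, search_key) - 1
    if idx >= 0 and node.keys[idx] == search_key:
        return node.values[idx]
    return None
```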
Cost Formula:
$$Cost_{lookup} = h \times C_{read}$$
Where h = tree height and C_read = cost of reading one page.
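The height itself grows only logarithmically with the record count N; with fanout f and up to f − 1 entries per leaf, the approximation used in the code below is:
$$h \approx \left\lceil \log_f \frac{N}{f-1} \right\rceil + 1$$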
With Buffer Pool Caching:
In practice, upper levels are typically cached:
$$Cost_{lookup} = (h - L_{cached}) \times C_{read}$$
Where L_cached = number of levels fully cached in memory.
```python
import math

def analyze_point_lookup_cost(
        record_count: int,
        fanout: int,
        page_size_kb: int = 8,
        storage_type: str = 'ssd_nvme',
        cached_levels: int = 2):
    """
    Analyze the cost of a point lookup in a B+-tree.
    """
    # Calculate tree height
    if record_count <= fanout - 1:
        height = 1
    else:
        height = math.ceil(math.log(record_count / (fanout - 1), fanout)) + 1

    # I/O times by storage type (ms)
    io_times = {
        'hdd': 10.0,
        'ssd_sata': 0.15,
        'ssd_nvme': 0.05
    }
    single_read_ms = io_times.get(storage_type, 0.05)

    # Actual I/O (accounting for caching)
    uncached_levels = max(0, height - cached_levels)
    total_io_ms = uncached_levels * single_read_ms

    # CPU cost (binary search within pages)
    # log2(fanout) comparisons per level, ~0.0001 ms each
    comparisons_per_level = math.log2(fanout)
    cpu_ms = height * comparisons_per_level * 0.0001

    return {
        'height': height,
        'cached_levels': min(cached_levels, height),
        'disk_ios': uncached_levels,
        'io_time_ms': total_io_ms,
        'cpu_time_ms': cpu_ms,
        'total_time_ms': total_io_ms + cpu_ms
    }


# Analyze various scenarios
scenarios = [
    (1_000_000, 500, 'ssd_nvme', 2),      # 1M records, fast SSD, 2 levels cached
    (1_000_000_000, 500, 'ssd_nvme', 2),  # 1B records, fast SSD, 2 levels cached
    (1_000_000_000, 500, 'hdd', 2),       # 1B records, HDD, 2 levels cached
    (1_000_000_000, 500, 'ssd_nvme', 3),  # 1B records, NVMe, 3 levels cached
]

print("Point Lookup Cost Analysis")
print("=" * 75)
for records, fanout, storage, cached in scenarios:
    result = analyze_point_lookup_cost(records, fanout, 8, storage, cached)
    print(f"\n{records:,} records, {storage}, {cached} cached levels:")
    print(f"  Height: {result['height']}, Disk I/Os: {result['disk_ios']}")
    print(f"  I/O Time: {result['io_time_ms']:.3f} ms, CPU: {result['cpu_time_ms']:.4f} ms")
    print(f"  Total: {result['total_time_ms']:.3f} ms")
```
| Records | Height | Cached | HDD Time | SSD Time | NVMe Time |
|---|---|---|---|---|---|
| 1 Million | 3 | 2 levels | 10 ms | 0.15 ms | 0.05 ms |
| 100 Million | 4 | 2 levels | 20 ms | 0.30 ms | 0.10 ms |
| 1 Billion | 4 | 2 levels | 20 ms | 0.30 ms | 0.10 ms |
| 1 Billion | 4 | 3 levels | 10 ms | 0.15 ms | 0.05 ms |
| 100 Billion | 5 | 3 levels | 20 ms | 0.30 ms | 0.10 ms |
With typical configurations (fanout ~500, top 2-3 levels cached), B+-tree point lookups require only 1-2 disk I/Os regardless of dataset size. This is why properly tuned B+-trees scale so well—a billion-record lookup takes the same time as a million-record lookup!
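A quick capacity check shows why the height plateaus. With fanout f = 500, a height-4 tree addresses roughly
$$(f-1) \times f^{\,h-1} = 499 \times 500^{3} \approx 6.2 \times 10^{10} \text{ entries,}$$
so anything from a few million to tens of billions of records fits within four levels.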
Range scans retrieve all records within a key range—leveraging B+-tree's leaf-level linked list. The cost model differs significantly from point lookups.
The Range Scan Algorithm: perform a point lookup to reach the leaf containing the range's lower bound, then follow the leaf-level sibling pointers, emitting matching entries until a key exceeds the upper bound.
Cost Formula:
$$Cost_{range} = Cost_{lookup} + \lceil \frac{M}{E} \rceil \times C_{read}$$
Where M = number of matching records and E = entries per leaf page, so ⌈M/E⌉ is the number of leaf pages scanned.
```python
import math

def analyze_range_scan_cost(
        total_records: int,
        matching_records: int,
        fanout: int,
        storage_type: str = 'ssd_nvme',
        cached_levels: int = 2,
        is_clustered: bool = True):
    """
    Analyze cost of a range scan in a B+-tree.

    Parameters:
    - is_clustered: True if index is clustered (data in leaf pages)
                    False if secondary index (requires heap lookup)
    """
    # Records per leaf page
    records_per_leaf = fanout - 1

    # Leaf pages to scan
    leaf_pages = math.ceil(matching_records / records_per_leaf)

    # Tree traversal cost (point lookup to first leaf)
    height = math.ceil(math.log(total_records / records_per_leaf, fanout)) + 1
    traversal_ios = max(0, height - cached_levels)

    # I/O characteristics
    io_times = {'hdd': 10.0, 'ssd_sata': 0.15, 'ssd_nvme': 0.05}
    random_io = io_times[storage_type]

    # Sequential read benefit (leaves are linked)
    # First leaf: random access; subsequent leaves cost a fraction of a
    # random read (~10% on HDD, ~80% on SSD where seek savings are smaller)
    sequential_multiplier = 0.1 if storage_type == 'hdd' else 0.8

    if leaf_pages <= 1:
        leaf_scan_ios = leaf_pages
    else:
        # First leaf random, rest sequential
        leaf_scan_ios = 1 + (leaf_pages - 1) * sequential_multiplier

    # For non-clustered: add heap lookups (random I/O per record)
    if not is_clustered:
        # Worst case: each record on a different page
        # Best case: all records fit in a few pages
        # Estimate: unique pages = matching_records / records_per_data_page
        records_per_data_page = 50  # Estimate for heap pages
        heap_pages = math.ceil(matching_records / records_per_data_page)
        # Some pages will be cached; estimate 50% hit rate
        heap_ios = heap_pages * 0.5
    else:
        heap_ios = 0

    total_ios = traversal_ios + leaf_scan_ios + heap_ios
    total_time_ms = random_io * total_ios

    return {
        'height': height,
        'traversal_ios': traversal_ios,
        'leaf_pages': leaf_pages,
        'leaf_scan_ios': leaf_scan_ios,
        'heap_ios': heap_ios,
        'total_ios': total_ios,
        'time_ms': total_time_ms,
        'selectivity': matching_records / total_records * 100
    }


# Analyze range scan scenarios
print("Range Scan Cost Analysis (1 Billion records, NVMe SSD)")
print("=" * 70)

for match_pct in [0.001, 0.01, 0.1, 1.0, 10.0]:
    matching = int(1_000_000_000 * match_pct / 100)
    result = analyze_range_scan_cost(
        1_000_000_000, matching, 500, 'ssd_nvme', 2, True
    )
    print(f"\nSelectivity {match_pct}% ({matching:,} records):")
    print(f"  Leaf pages: {result['leaf_pages']:,}")
    print(f"  Total I/Os: {result['total_ios']:.1f}")
    print(f"  Time: {result['time_ms']:.2f} ms")
```
| Selectivity | Matching Records | Leaf Pages | I/Os | Time |
|---|---|---|---|---|
| 0.001% | 10,000 | 21 | ~20 | 1 ms |
| 0.01% | 100,000 | 201 | ~180 | 9 ms |
| 0.1% | 1,000,000 | 2,005 | ~1,800 | 90 ms |
| 1% | 10,000,000 | 20,041 | ~18,000 | 900 ms |
| 10% | 100,000,000 | 200,401 | ~180,000 | 9 sec |
At some selectivity threshold, a full table scan becomes cheaper than an index range scan. This typically occurs around 5-15% selectivity depending on clustering, page sizes, and data distribution. Query optimizers must estimate selectivity accurately to choose correctly.
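A rough way to see the crossover is to compare the model's index-scan cost against a purely sequential full scan. The sketch below uses assumed parameters (1 billion rows, ~80 rows per 8 KB heap page, NVMe at ~0.05 ms random and ~0.003 ms sequential per page, a non-clustered index with ~50% heap-page cache hits); with these numbers the break-even lands near 10%, but real thresholds shift with clustering and hardware.

```python
import math

# Hypothetical parameters, not measurements
TOTAL_ROWS = 1_000_000_000
ROWS_PER_HEAP_PAGE = 80
SEQ_MS, RAND_MS = 0.003, 0.05   # per-page cost: sequential vs random (NVMe-ish)

def full_scan_ms() -> float:
    """Sequential scan: every heap page read at near-bandwidth cost."""
    return math.ceil(TOTAL_ROWS / ROWS_PER_HEAP_PAGE) * SEQ_MS

def index_scan_ms(matching: int, records_per_leaf: int = 499,
                  heap_hit_rate: float = 0.5) -> float:
    """Non-clustered range scan: linked leaves plus random heap fetches."""
    leaf_pages = math.ceil(matching / records_per_leaf)
    leaf_ms = (1 + max(0, leaf_pages - 1) * 0.8) * RAND_MS   # mostly sequential leaves
    heap_ios = math.ceil(matching / ROWS_PER_HEAP_PAGE) * (1 - heap_hit_rate)
    return leaf_ms + heap_ios * RAND_MS

for pct in (1, 5, 10, 15, 20):
    m = TOTAL_ROWS * pct // 100
    print(f"{pct:>3}% selectivity: index scan {index_scan_ms(m)/1000:6.1f} s"
          f" vs full scan {full_scan_ms()/1000:5.1f} s")
```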
Clustered vs Non-Clustered Index Impact:
For non-clustered (secondary) indexes, each matching record may require an additional random I/O to fetch the actual data from the heap, and these heap fetches can dominate total cost at even modest selectivities. The snippet below quantifies the difference using the model above.
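As a quick illustration, calling `analyze_range_scan_cost` from the earlier listing for the same 0.1% scan with and without clustering shows the heap-fetch penalty directly (numbers come from the simplified model, not measurements):

```python
# Requires analyze_range_scan_cost from the earlier listing
for clustered in (True, False):
    r = analyze_range_scan_cost(1_000_000_000, 1_000_000, 500,
                                'ssd_nvme', 2, is_clustered=clustered)
    label = "clustered" if clustered else "non-clustered"
    print(f"{label:>13}: {r['total_ios']:>9,.0f} I/Os  ~{r['time_ms']:>7,.0f} ms")
```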
Insertions into B+-trees involve finding the target leaf and adding the entry—potentially triggering page splits if the leaf is full.
The Insert Algorithm: traverse to the target leaf as in a point lookup, insert the entry in key order, and if the leaf overflows, split it and push a separator key into the parent; splits can cascade up to the root.
Base Cost (No Split):
$$Cost_{insert} = Cost_{lookup} + C_{write}$$
Cost With Split:
$$Cost_{insert,\,split} = Cost_{lookup} + (2 + s) \times C_{write}$$
Where s = number of levels that require splitting (usually 0-1).
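The expected write count in the sketch below is modeled as a weighted sum over split outcomes, with the leaf-split probability approximated from the fill factor:
$$E[W] = 1 + 2\,p_{leaf} + 2\,p_{internal}, \qquad p_{leaf} \approx 1 - \text{fill factor}, \quad p_{internal} \approx p_{leaf}^{2}$$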
```python
import math
import random

def analyze_insert_cost(
        current_records: int,
        fanout: int,
        fill_factor: float = 0.7,
        storage_type: str = 'ssd_nvme',
        cached_levels: int = 2):
    """
    Analyze the cost of inserting a record into a B+-tree.
    """
    # Calculate tree structure
    records_per_leaf = fanout - 1
    avg_records_per_leaf = records_per_leaf * fill_factor
    height = max(1, math.ceil(math.log(current_records / avg_records_per_leaf, fanout)) + 1)

    # I/O times
    io_times = {'hdd': 10.0, 'ssd_sata': 0.15, 'ssd_nvme': 0.05}
    io_ms = io_times[storage_type]

    # Lookup cost (reads)
    read_ios = max(0, height - cached_levels)

    # Write cost depends on split probability
    # Probability of leaf split = 1 - fill_factor (simplified)
    # Actually depends on how full the target leaf is
    leaf_split_prob = 1 - fill_factor
    # Internal node split much rarer (propagates up)
    internal_split_prob = leaf_split_prob ** 2  # Approximation

    # Expected write I/Os:
    # - Always write 1 leaf
    # - If split: write 2 leaves + 1 internal update
    # - If internal split: additional propagation
    expected_write_ios = 1 + leaf_split_prob * 2 + internal_split_prob * 2

    # Total cost
    total_ios = read_ios + expected_write_ios

    return {
        'height': height,
        'read_ios': read_ios,
        'expected_write_ios': expected_write_ios,
        'leaf_split_prob': leaf_split_prob,
        'total_ios': total_ios,
        'time_ms': total_ios * io_ms
    }


def simulate_bulk_insert_cost(
        initial_records: int,
        insert_count: int,
        fanout: int,
        storage_type: str = 'ssd_nvme'):
    """
    Simulate the total cost of bulk random insertions.
    """
    total_time_ms = 0
    splits = 0

    for i in range(insert_count):
        current = initial_records + i
        result = analyze_insert_cost(current, fanout, 0.7, storage_type, 2)
        total_time_ms += result['time_ms']
        if random.random() < result['leaf_split_prob']:
            splits += 1

    return {
        'total_time_ms': total_time_ms,
        'avg_time_ms': total_time_ms / insert_count,
        'split_count': splits,
        'split_rate': splits / insert_count * 100
    }


# Analyze single insert
print("Single Insert Cost Analysis")
print("=" * 50)
for records in [1_000_000, 100_000_000, 1_000_000_000]:
    result = analyze_insert_cost(records, 500, 0.7, 'ssd_nvme', 2)
    print(f"\n{records:,} existing records:")
    print(f"  Height: {result['height']}")
    print(f"  Read I/Os: {result['read_ios']}, Write I/Os: {result['expected_write_ios']:.2f}")
    print(f"  Split probability: {result['leaf_split_prob']*100:.1f}%")
    print(f"  Expected time: {result['time_ms']:.3f} ms")
```
| Fill Factor | Split Probability | Avg Write I/Os | Avg Time | Inserts/sec |
|---|---|---|---|---|
| 50% | 50% | ~3.0 | 0.15 ms | ~6,700 |
| 70% | 30% | ~1.9 | 0.10 ms | ~10,000 |
| 80% | 20% | ~1.5 | 0.08 ms | ~12,500 |
| 90% | 10% | ~1.2 | 0.06 ms | ~16,700 |
| 100% | Variable | ~1.1 | 0.06 ms | ~16,700 |
Insert Optimization Strategies:
1. Batch writes: Accumulate inserts, sort by key, apply together (sketched below)
2. Higher fill factor: Reduces space usage but increases future splits
3. Buffered inserts: Stage in memory, flush periodically
4. Append optimization: Sequential (monotonic) keys append to the rightmost leaf, avoiding random page splits
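A minimal sketch of the batching idea (the `tree.insert` API is hypothetical): buffer incoming entries, then apply each batch in key order so consecutive inserts hit the same, already-cached leaf pages.

```python
from typing import Iterable, Tuple

def batched_insert(tree, entries: Iterable[Tuple[bytes, bytes]],
                   batch_size: int = 10_000) -> None:
    """Buffer entries and apply them sorted by key."""
    batch = []
    for key, value in entries:
        batch.append((key, value))
        if len(batch) >= batch_size:
            _flush(tree, batch)
            batch.clear()
    if batch:
        _flush(tree, batch)

def _flush(tree, batch) -> None:
    # Sorting turns random leaf accesses into a near-sequential sweep,
    # so each dirty leaf is read and written once per batch rather than
    # once per record.
    for key, value in sorted(batch):
        tree.insert(key, value)   # hypothetical B+-tree insert API
```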
Deletions can trigger node merges when pages fall below minimum fill—the inverse of splits during insertions.
The Delete Algorithm: locate the leaf holding the key, remove the entry, and if the page falls below the minimum fill (typically half full), either borrow entries from a sibling or merge with it, updating the parent accordingly.
Cost Analysis:
```python
import math

def analyze_delete_cost(
        current_records: int,
        fanout: int,
        deletion_pattern: str = 'random',  # 'random', 'sequential', 'targeted'
        storage_type: str = 'ssd_nvme',
        cached_levels: int = 2):
    """
    Analyze the cost of deleting a record from a B+-tree.
    """
    records_per_leaf = fanout - 1
    min_fill = (fanout - 1) // 2
    height = max(1, math.ceil(math.log(current_records / records_per_leaf, fanout)) + 1)

    io_times = {'hdd': 10.0, 'ssd_sata': 0.15, 'ssd_nvme': 0.05}
    io_ms = io_times[storage_type]

    # Lookup cost
    read_ios = max(0, height - cached_levels)

    # Merge probability depends on deletion pattern
    if deletion_pattern == 'random':
        # Random deletions spread across pages; merge rare
        underflow_prob = 0.05  # Low probability of any page hitting minimum
    elif deletion_pattern == 'sequential':
        # Sequential deletions concentrate on few pages; merge likely
        underflow_prob = 0.30
    else:  # targeted (deleting known sparse regions)
        underflow_prob = 0.50

    # Expected write I/Os
    # - Always write 1 leaf (remove entry)
    # - If underflow: read sibling, possibly merge (write 1-2)
    # - If merge: update parent (write 1)
    read_sibling_prob = underflow_prob
    merge_prob = underflow_prob * 0.5  # Half of underflows result in merge

    expected_extra_reads = read_sibling_prob * 1  # Read sibling
    expected_write_ios = 1 + merge_prob * 2       # Leaf + possible merge

    total_read_ios = read_ios + expected_extra_reads
    total_ios = total_read_ios + expected_write_ios

    return {
        'height': height,
        'read_ios': total_read_ios,
        'write_ios': expected_write_ios,
        'underflow_prob': underflow_prob,
        'merge_prob': merge_prob,
        'total_ios': total_ios,
        'time_ms': total_ios * io_ms
    }


# Compare deletion patterns
print("Delete Cost Analysis (1B records, f=500, NVMe)")
print("=" * 60)

for pattern in ['random', 'sequential', 'targeted']:
    result = analyze_delete_cost(1_000_000_000, 500, pattern, 'ssd_nvme', 2)
    print(f"\n{pattern.capitalize()} deletions:")
    print(f"  Underflow probability: {result['underflow_prob']*100:.0f}%")
    print(f"  Merge probability: {result['merge_prob']*100:.1f}%")
    print(f"  Read I/Os: {result['read_ios']:.2f}, Write I/Os: {result['write_ios']:.2f}")
    print(f"  Total time: {result['time_ms']:.3f} ms")
```
Many databases use soft deletes or tombstones: marking records as deleted without immediate removal. This converts deletes to updates initially, deferring merge costs to background cleanup processes. PostgreSQL's VACUUM and InnoDB's purge are examples.
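A minimal sketch of the tombstone idea (the record layout here is hypothetical): the foreground delete only flips a flag and rewrites one leaf; space reclamation and any merges happen later in a background purge.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LeafRecord:
    key: int
    value: bytes
    deleted: bool = False          # tombstone flag

@dataclass
class LeafPage:
    records: List[LeafRecord] = field(default_factory=list)

    def soft_delete(self, key: int) -> bool:
        """Foreground path: mark only; one leaf write, no sibling reads or merges."""
        for rec in self.records:
            if rec.key == key and not rec.deleted:
                rec.deleted = True
                return True
        return False

    def purge(self) -> None:
        """Background cleanup (VACUUM/purge-like): physically drop tombstones."""
        self.records = [r for r in self.records if not r.deleted]
```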
| Pattern | Merge Rate | Avg I/Os | Time | Notes |
|---|---|---|---|---|
| Random | ~2% | ~2.1 | 0.11 ms | Deletions spread; rare underflow |
| Sequential | ~15% | ~2.5 | 0.13 ms | Pages empty in order |
| Targeted (old data) | ~25% | ~2.8 | 0.14 ms | Old partitions often sparse |
| Bulk delete (range) | ~90% | Sequential | Varies | Consider DROP + recreate |
Bulk loading (building an index from scratch on existing data) can be orders of magnitude faster than individual insertions—but only with the right approach.
Naive Approach (Individual Inserts):
$$Cost_{naive} = N \times Cost_{insert} \approx N \times (h \times C_{read} + 1.5 \times C_{write})$$
For 1 billion records with h=4: ~4 billion I/Os!
Optimized Bulk Load (Sort-Merge):
$$Cost_{bulk} = Cost_{sort} + Cost_{sequential\ write}$$
```python
import math

def compare_bulk_load_strategies(
        record_count: int,
        fanout: int,
        storage_type: str = 'ssd_nvme',
        available_memory_mb: int = 1024):
    """
    Compare individual inserts vs optimized bulk load.
    """
    records_per_leaf = fanout - 1
    leaf_pages = math.ceil(record_count / records_per_leaf)
    height = math.ceil(math.log(leaf_pages, fanout)) + 1

    # Calculate total nodes
    internal_pages = 0
    level_pages = leaf_pages
    while level_pages > 1:
        parent_pages = math.ceil(level_pages / fanout)
        internal_pages += parent_pages
        level_pages = parent_pages
    total_pages = leaf_pages + internal_pages

    io_times = {'hdd': 10.0, 'ssd_sata': 0.15, 'ssd_nvme': 0.05}
    io_ms = io_times[storage_type]

    # Strategy 1: Individual inserts
    # Each insert: ~2 reads (height with caching) + ~1.5 writes
    individual_ios = record_count * (2 + 1.5)
    individual_time_s = individual_ios * io_ms / 1000

    # Strategy 2: Optimized bulk load
    # Sort phase: external merge sort
    # Assuming 100 bytes per record
    record_size = 100
    data_size_mb = record_count * record_size / (1024 * 1024)
    sort_passes = math.ceil(math.log(data_size_mb / available_memory_mb,
                                     available_memory_mb / record_size))
    sort_passes = max(1, sort_passes)

    # Sort I/Os: read all data per pass, write all data per pass
    pages_for_data = math.ceil(data_size_mb / 8 * 1024)  # 8KB pages
    sort_ios = pages_for_data * 2 * sort_passes

    # Build phase: sequential write of all index pages
    build_ios = total_pages

    # Total bulk load I/Os
    bulk_ios = sort_ios + build_ios
    bulk_time_s = bulk_ios * io_ms / 1000

    speedup = individual_time_s / bulk_time_s

    return {
        'record_count': record_count,
        'total_pages': total_pages,
        'individual_ios': individual_ios,
        'individual_time_s': individual_time_s,
        'individual_time_hours': individual_time_s / 3600,
        'bulk_ios': bulk_ios,
        'bulk_time_s': bulk_time_s,
        'bulk_time_hours': bulk_time_s / 3600,
        'speedup': speedup
    }


# Compare for various dataset sizes
print("Bulk Load Strategy Comparison")
print("=" * 70)

for records in [1_000_000, 100_000_000, 1_000_000_000]:
    result = compare_bulk_load_strategies(records, 500, 'ssd_nvme')
    print(f"\n{records:,} records:")
    print(f"  Total index pages: {result['total_pages']:,}")
    print(f"  Individual inserts: {result['individual_ios']:,.0f} I/Os = "
          f"{result['individual_time_hours']:.1f} hours")
    print(f"  Bulk load: {result['bulk_ios']:,.0f} I/Os = "
          f"{result['bulk_time_s']:.1f} seconds")
    print(f"  Speedup: {result['speedup']:.0f}x")
```
| Records | Individual Inserts | Bulk Load | Speedup |
|---|---|---|---|
| 1 Million | 1.8 hours | ~2 seconds | ~3,000× |
| 100 Million | ~7 days | ~3 minutes | ~3,000× |
| 1 Billion | ~70 days | ~30 minutes | ~3,000× |
| 10 Billion | ~2 years | ~5 hours | ~3,000× |
Always use optimized bulk loading for initial index creation or large data imports. The ~3000× speedup means the difference between reasonable and impossible. Database utilities like pg_dump/pg_restore, mysqldump, and SQL*Loader leverage these optimizations.
Production systems introduce factors not captured in basic cost models. Understanding these is essential for accurate performance prediction.
1. Concurrency and Lock Contention: under concurrent workloads, latching on hot internal pages and lock waits add queuing delays on top of raw I/O, so observed latencies often exceed the single-threaded model's predictions.
2. Write-Ahead Logging (WAL):
Every write generates additional I/O for durability:
```python
def calculate_wal_overhead(
        operations_per_second: int,
        avg_operation_size_bytes: int = 200,
        page_size: int = 8192,
        full_page_writes: bool = True,
        compression_ratio: float = 0.5):
    """
    Calculate WAL overhead for B+-tree operations.
    """
    # Base log record per operation
    base_log_bytes = avg_operation_size_bytes

    # Full page image on first modification after checkpoint
    # (for crash recovery)
    # Assume 1/100 operations require full page write
    fpw_probability = 0.01 if full_page_writes else 0
    fpw_size = page_size * fpw_probability

    # Total log bytes per operation
    log_bytes_per_op = (base_log_bytes + fpw_size) * compression_ratio

    # Log bandwidth required
    log_bandwidth_mbps = (log_bytes_per_op * operations_per_second) / (1024 * 1024)

    # Sync overhead (fsync calls)
    # Assuming group commit: 1 sync per 10ms at high load
    syncs_per_second = min(100, operations_per_second / 10)

    return {
        'log_bytes_per_op': log_bytes_per_op,
        'log_bandwidth_mbps': log_bandwidth_mbps,
        'syncs_per_second': syncs_per_second,
        'additional_latency_ms': 1 / syncs_per_second * 1000 if syncs_per_second > 0 else 0
    }


# Analyze for high-throughput OLTP
result = calculate_wal_overhead(10000, 200)
print(f"WAL overhead at 10K TPS:")
print(f"  Log bytes per op: {result['log_bytes_per_op']:.0f}")
print(f"  Log bandwidth: {result['log_bandwidth_mbps']:.1f} MB/s")
print(f"  Syncs/second: {result['syncs_per_second']:.0f}")
```
3. Buffer Pool Hit Rates:
The buffer pool dramatically affects observed I/O. Real hit rates depend on the working set size relative to buffer pool memory, access skew, and scan activity that evicts hot pages. The table below shows how the hit rate translates into effective lookup cost (assuming ~0.05 ms per miss on NVMe):
| Hit Rate | Effective I/O Time | Lookup Speed | Notes |
|---|---|---|---|
| 99% | 0.01 × 0.05ms = 0.0005ms | ~2M/sec | Working set in memory |
| 95% | 0.05 × 0.05ms = 0.0025ms | ~400K/sec | Most accesses cached |
| 80% | 0.20 × 0.05ms = 0.01ms | ~100K/sec | Significant disk I/O |
| 50% | 0.50 × 0.05ms = 0.025ms | ~40K/sec | Memory pressure |
| 10% | 0.90 × 0.05ms = 0.045ms | ~22K/sec | Severe thrashing |
Performance degrades gracefully as hit rate drops from 99% to 90%. But below ~80%, you may hit a 'cliff' where the system suddenly becomes I/O bound and throughput collapses. Monitor hit rates closely and add memory before reaching this threshold.
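The arithmetic behind the table is simple enough to use as a planning aid; here is a small sketch assuming one uncached level and ~0.05 ms per buffer-pool miss, as in the table:

```python
MISS_COST_MS = 0.05   # assumed cost of one buffer-pool miss (NVMe)

def effective_lookup_ms(hit_rate: float, misses_per_lookup: int = 1) -> float:
    """Expected I/O time per lookup given a buffer-pool hit rate."""
    return (1 - hit_rate) * misses_per_lookup * MISS_COST_MS

for hit_rate in (0.99, 0.95, 0.80, 0.50, 0.10):
    ms = effective_lookup_ms(hit_rate)
    print(f"hit rate {hit_rate:4.0%}: {ms:.4f} ms/lookup ~ {1000/ms:,.0f} lookups/sec")
```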
We've developed a comprehensive framework for analyzing B+-tree I/O costs across all operation types.
What's Next:
We'll complete our B+-tree performance analysis with practical considerations—real-world tuning strategies, monitoring approaches, and common pitfalls to avoid in production systems.
You now have the analytical tools to predict and optimize B+-tree I/O costs for any operation pattern. This foundation enables informed index design, capacity planning, and query optimization decisions. Next, we'll explore practical tuning and monitoring strategies.