Selecting the optimal node size (page size) for a B+-tree is one of the most consequential decisions in database system design. This choice ripples through every aspect of system performance—from I/O efficiency to buffer pool utilization, from concurrency to storage overhead.
The challenge lies in balancing competing objectives: larger pages increase fanout and reduce tree height, but they also increase I/O transfer time, lock contention, and wasted space from partially-filled nodes. Finding the sweet spot requires deep understanding of hardware characteristics, workload patterns, and system constraints.
By the end of this page, you will understand the factors influencing optimal page size, analyze trade-offs between different sizing strategies, evaluate how hardware evolution affects sizing decisions, and apply sizing principles to specific workload types.
The page size (or block size) defines the fundamental unit of storage and I/O in a database system. Every B+-tree node occupies exactly one page—this fixed mapping creates the foundation for all sizing analysis.
The Physical-Logical Interface:
Database pages serve as the bridge between logical data structures and physical storage:
| Layer | Unit | Typical Size | Controlled By |
|---|---|---|---|
| Application | Row/Record | 50-500 bytes | Schema design |
| Database | Page/Block | 4KB-64KB | DB configuration |
| Filesystem | Block | 4KB | OS/Filesystem |
| SSD | Page/Block | 4KB/128KB-4MB | Hardware |
| HDD | Sector | 512B-4KB | Hardware |
Why Page Size Matters for B+-Trees:
The relationship between page size and B+-tree behavior is direct and profound:
- **Fanout (f):** The maximum number of children per internal node scales nearly linearly with page size
- **I/O Granularity:** Each node access transfers exactly one page
- **Buffer Pool Efficiency:** The buffer pool holds pages, not individual records
- **Concurrency Granularity:** Locks are typically held at page level
Common defaults in practice:

- 4 KB: SQLite default
- 8 KB: PostgreSQL, Oracle, and SQL Server defaults
- 16 KB: MySQL InnoDB default; common for OLAP workloads
- 32-64 KB: Specialized data warehouse configurations
The central tension in page size selection is between maximizing fanout (height reduction) and minimizing transfer cost (I/O time). Understanding this trade-off is essential for informed decision-making.
Fanout Analysis:
For an internal node with page size P, key size K, and pointer size Ptr:
$$\text{Fanout} = f = \lfloor (P - \text{Overhead}) / (K + \text{Ptr}) \rfloor$$

(This is a close approximation: an internal node with f children actually holds f − 1 keys, but the difference is negligible at these sizes.)
```python
def calculate_fanout(page_size_bytes, key_size, pointer_size, overhead=64):
    """
    Calculate B+-tree fanout for given page and key sizes.

    Parameters:
    - page_size_bytes: Total page size
    - key_size: Size of each key in bytes
    - pointer_size: Size of each pointer in bytes
    - overhead: Page header, checksums, etc.
    """
    usable_space = page_size_bytes - overhead
    entries = usable_space // (key_size + pointer_size)
    return entries  # fanout ≈ entries for internal nodes

# Compare fanout across page sizes for 8-byte keys
print("Fanout by Page Size (8-byte key, 8-byte pointer)")
print("=" * 50)
for page_kb in [4, 8, 16, 32, 64]:
    fanout = calculate_fanout(page_kb * 1024, 8, 8)
    print(f"{page_kb:2d} KB page → Fanout: {fanout:4d}")

# Output:
#  4 KB page → Fanout:  252
#  8 KB page → Fanout:  508
# 16 KB page → Fanout: 1020
# 32 KB page → Fanout: 2044
# 64 KB page → Fanout: 4092

# Calculate height for 1 billion records
import math

def calculate_height(fanout, record_count):
    return math.ceil(math.log(record_count, fanout)) + 1

print("Tree Height for 1 Billion Records")
print("=" * 50)
for page_kb in [4, 8, 16, 32, 64]:
    fanout = calculate_fanout(page_kb * 1024, 8, 8)
    height = calculate_height(fanout, 1_000_000_000)
    print(f"{page_kb:2d} KB page → Height: {height}")
```

Transfer Time Analysis:
The time to read a page from disk depends on the storage medium and page size:
$$\text{I/O Time} = \text{Seek Time} + \text{Rotational Latency} + \text{Transfer Time}$$
For HDDs: seek time plus rotational latency contributes a fixed ~5-10 ms per access, while transfer adds only about 0.01 ms per KB.
For SSDs: there is no mechanical seek; a fixed access latency of roughly 0.1 ms dominates, with transfer adding about 0.002 ms per KB.
The Critical Insight:
On HDDs, seek time dominates—doubling page size barely increases total I/O time but doubles data retrieved. On SSDs, transfer time is proportionally more significant, but still relatively small.
| Component | HDD (4KB) | HDD (16KB) | SSD (4KB) | SSD (16KB) |
|---|---|---|---|---|
| Seek/Latency | 8 ms | 8 ms | 0.1 ms | 0.1 ms |
| Transfer | 0.04 ms | 0.16 ms | 0.008 ms | 0.032 ms |
| Total | 8.04 ms | 8.16 ms | 0.108 ms | 0.132 ms |
| Data per I/O | 4 KB | 16 KB | 4 KB | 16 KB |
| Effective Throughput | 0.5 MB/s | 1.9 MB/s | 37 MB/s | 121 MB/s |
Larger pages amortize fixed I/O costs (seek, latency) over more data. For HDDs with ~8 ms of seek overhead, a 16 KB page delivers nearly 4x the effective throughput of a 4 KB page per byte retrieved. This is why HDD-based systems historically favored larger pages.
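The amortization effect can be checked directly. The sketch below uses the same illustrative latency and transfer figures as the table above (assumed model values, not measurements):

```python
def effective_throughput_mb_s(page_kb, fixed_latency_ms, transfer_ms_per_kb):
    """Data retrieved per unit of total I/O time (KB/ms, roughly MB/s)."""
    total_ms = fixed_latency_ms + page_kb * transfer_ms_per_kb
    return page_kb / total_ms

# HDD: ~8 ms seek + rotation, ~0.01 ms per KB transferred
hdd_4 = effective_throughput_mb_s(4, 8.0, 0.01)
hdd_16 = effective_throughput_mb_s(16, 8.0, 0.01)
print(f"HDD  4 KB: {hdd_4:.1f} MB/s, 16 KB: {hdd_16:.1f} MB/s "
      f"({hdd_16 / hdd_4:.1f}x)")

# SSD: ~0.1 ms access, ~0.002 ms per KB transferred
ssd_4 = effective_throughput_mb_s(4, 0.1, 0.002)
ssd_16 = effective_throughput_mb_s(16, 0.1, 0.002)
print(f"SSD  4 KB: {ssd_4:.0f} MB/s, 16 KB: {ssd_16:.0f} MB/s "
      f"({ssd_16 / ssd_4:.1f}x)")
```

Because the fixed cost is so much larger on HDDs, quadrupling the page size nearly quadruples effective throughput there, while on SSDs the gain is smaller.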
The optimal page size depends heavily on the underlying storage technology. The shift from HDDs to SSDs has fundamentally changed sizing calculus.
Hard Disk Drives (HDDs):
HDDs have high seek latency (5-10 ms) dominated by mechanical movement, so page size optimization focuses on minimizing the number of seeks.
Solid State Drives (SSDs):
SSDs have fundamentally different characteristics: no mechanical seek, but significant internal parallelism and write amplification effects to manage.
HDD-optimized configuration:

- Page Size: 16-32 KB
- Rationale: Amortize the ~8 ms seek
- Height Impact: Minimal (high fanout)
- Wasted Space: ~15-20% acceptable
- Best For: Sequential scans, OLAP
SSD-optimized configuration:

- Page Size: 4-8 KB
- Rationale: Match the SSD page size
- Height Impact: Acceptable (fast seeks)
- Wasted Space: Minimize for wear
- Best For: Random lookups, OLTP
```python
import math

def compare_lookup_performance(record_count, key_size=8):
    """
    Compare B+-tree lookup performance: HDD vs SSD, various page sizes.
    """
    results = []
    for page_kb in [4, 8, 16, 32]:
        for storage_type in ['HDD', 'SSD']:
            # Calculate tree structure
            pointer_size = 8
            overhead = 64
            usable = page_kb * 1024 - overhead
            fanout = usable // (key_size + pointer_size)
            height = math.ceil(math.log(record_count, fanout)) + 1

            # I/O times (milliseconds)
            if storage_type == 'HDD':
                seek_latency = 8.0     # ms
                transfer_rate = 0.01   # ms per KB
            else:  # SSD
                seek_latency = 0.1     # ms
                transfer_rate = 0.002  # ms per KB

            io_time = seek_latency + page_kb * transfer_rate
            total_lookup_time = height * io_time

            results.append({
                'page_kb': page_kb,
                'storage': storage_type,
                'fanout': fanout,
                'height': height,
                'io_time_ms': io_time,
                'lookup_ms': total_lookup_time
            })
    return results

# Analyze for 1 billion records
results = compare_lookup_performance(1_000_000_000)

print("B+-Tree Lookup Performance (1B Records)")
print("=" * 65)
print(f"{'Page':<8} {'Storage':<8} {'Fanout':<8} {'Height':<8} "
      f"{'I/O (ms)':<12} {'Lookup (ms)':<12}")
print("-" * 65)
for r in results:
    print(f"{r['page_kb']:<8} {r['storage']:<8} {r['fanout']:<8} {r['height']:<8} "
          f"{r['io_time_ms']:<12.3f} {r['lookup_ms']:<12.2f}")
```

Different workload patterns favor different page sizes. Understanding your workload is crucial for optimal sizing.
OLTP (Online Transaction Processing):
Characteristics: Many concurrent small transactions, point lookups, single-row updates.
| Factor | Smaller Page (4-8KB) | Larger Page (16-32KB) |
|---|---|---|
| Point Lookup I/O | Minimal data transfer | Transfer unused data |
| Buffer Pool Efficiency | More granular caching | Coarse-grained; wastes memory |
| Lock Contention | Fewer rows per lock | More rows locked together |
| Undo/Redo Logging | Smaller log records | Larger log records |
| Write Amplification | Lower | Higher |
| Recommendation | Preferred for OLTP | Avoid unless scan-heavy |
OLAP (Online Analytical Processing):
Characteristics: Large scans, aggregations, few concurrent queries, batch updates.
| Factor | Smaller Page (4-8KB) | Larger Page (16-32KB) |
|---|---|---|
| Full Table Scans | More I/O operations | Fewer I/O operations |
| Sequential Read | More seek overhead | Better amortization |
| Index Height | Higher tree | Shorter tree |
| Compression Ratio | Lower efficiency | Better compression opportunity |
| Pre-fetching | More overhead per fetch | More data per fetch |
| Recommendation | Suboptimal for OLAP | Preferred for OLAP |
Most production databases handle mixed workloads. The default 8KB page size (Oracle, SQL Server) represents a pragmatic middle ground—adequate for OLTP point lookups while not severely penalizing analytical scans.
Time-Series and Append-Heavy Workloads:
These workloads have a distinctive pattern: mostly inserts, rare updates, and queries that usually scan recent data. Because writes arrive roughly in key order and reads are largely sequential, larger pages (16-64 KB) tend to work well here.
Page size directly affects buffer pool behavior—how many pages fit in memory, eviction frequency, and cache hit rates.
The Granularity Problem:
Consider a fixed 8 GB buffer pool:
| Page Size | Pages in 8 GB | Records per Page (@ 100 B/record) | Eviction Granularity |
|---|---|---|---|
| 4 KB | 2,097,152 | ~40 | ~40 records |
| 8 KB | 1,048,576 | ~81 | ~81 records |
| 16 KB | 524,288 | ~163 | ~163 records |
| 32 KB | 262,144 | ~327 | ~327 records |
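A minimal sketch of the granularity arithmetic, assuming the ~100-byte records used elsewhere on this page:

```python
def eviction_granularity(pool_gb, page_kb, record_bytes=100):
    """Return (pages that fit in the pool, records discarded per eviction)."""
    pages = (pool_gb * 1024 * 1024) // page_kb          # pool size in KB / page KB
    records_per_page = (page_kb * 1024) // record_bytes  # assumes ~100 B records
    return pages, records_per_page

for page_kb in [4, 8, 16, 32]:
    pages, rpp = eviction_granularity(8, page_kb)
    print(f"{page_kb:2d} KB: {pages:,} pages; evicting one page drops {rpp} records")
```

The total bytes cached are the same in every row; what changes is how finely the buffer pool can choose what to keep.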
Implications: smaller pages give the buffer pool more, finer-grained cache entries, so each eviction discards fewer records; with larger pages, every miss evicts a bigger slab of possibly-useful data, and every cached hot record drags more cold neighbors into memory.
The "Hot Record" Analysis:
```python
def analyze_cache_efficiency(total_records, hot_record_pct,
                             buffer_pool_mb, records_per_page):
    """
    Analyze cache efficiency for different access patterns.
    Assumes hot records are uniformly distributed across pages
    and each record occupies ~100 bytes.
    """
    hot_records = int(total_records * hot_record_pct / 100)

    # Worst case: hot records spread across all pages.
    # Each page holds (hot_record_pct)% hot records, so caching all
    # hot records means caching far more pages than the hot data needs.
    pages_with_hot_data = total_records // records_per_page
    hot_per_page = records_per_page * hot_record_pct / 100

    # Efficiency: what fraction of cached data is actually hot?
    cache_efficiency = hot_record_pct / 100  # theoretical maximum

    # With a limited buffer pool, how much hot data can we cache?
    # (assuming 100 bytes per record)
    buffer_pages = buffer_pool_mb * 1024 * 1024 // (records_per_page * 100)
    pages_cached = min(buffer_pages, pages_with_hot_data)
    hot_records_cached = pages_cached * hot_per_page
    hot_cache_ratio = hot_records_cached / hot_records if hot_records > 0 else 0

    return {
        'hot_records': hot_records,
        'cache_efficiency': cache_efficiency,
        'hot_cache_ratio': hot_cache_ratio,
        'pages_cached': pages_cached
    }

# Example: 1 billion records, 10% hot, 8 GB buffer pool
for rpp in [50, 100, 200, 400]:  # records per page
    page_kb = rpp * 100 / 1024   # 100 bytes per record
    result = analyze_cache_efficiency(
        total_records=1_000_000_000,
        hot_record_pct=10,
        buffer_pool_mb=8192,
        records_per_page=rpp
    )
    print(f"~{page_kb:.0f}KB page ({rpp} rec/page): "
          f"Can cache {result['hot_cache_ratio']*100:.1f}% of hot data")
```

When hot records are scattered across many pages (common in random-access patterns), larger pages waste buffer pool memory on cold records that happen to share pages with hot ones. This "cache pollution" can dramatically reduce effective cache size.
The fill factor (or packing density) determines how full pages are when data is loaded or reorganized. This interacts with page size to affect both performance and storage efficiency.
Fill Factor Basics:
| Fill Factor | Effective Fanout | Space Used (1B records) | Tree Height |
|---|---|---|---|
| 100% | 500 | 8 GB | 4 |
| 90% | 450 | 8.9 GB | 4 |
| 70% | 350 | 11.4 GB | 4 |
| 50% | 250 | 16 GB | 4-5 |
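The fill-factor table can be reproduced with a short calculation, assuming a base fanout of 500 at 100% fill and 8 GB of fully-packed data (the figures used in the table):

```python
import math

def height_with_fill(base_fanout, fill_factor, record_count):
    """Effective fanout and index height when nodes are only partially full.
    base_fanout is the fanout at 100% fill."""
    eff_fanout = round(base_fanout * fill_factor)
    height = math.ceil(math.log(record_count, eff_fanout))
    return eff_fanout, height

for ff in [1.0, 0.9, 0.7, 0.5]:
    fanout, height = height_with_fill(500, ff, 1_000_000_000)
    size_gb = 8 / ff  # 8 GB of data at 100% fill grows as fill drops
    print(f"fill {ff:.0%}: fanout {fanout}, ~{size_gb:.1f} GB, height {height}")
```

Note how height barely moves even as space grows: the logarithm absorbs the fanout loss, which is why lower fill factors cost mostly space, not depth.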
When to Use Lower Fill Factors:
Counter-intuitively, a lower initial fill factor can improve performance for certain workloads:
```sql
-- PostgreSQL: Create index with specific fill factor
CREATE INDEX idx_orders_customer ON orders (customer_id)
WITH (fillfactor = 90);  -- Leave 10% free space

-- SQL Server: Specify fill factor
CREATE INDEX idx_orders_customer
ON orders (customer_id)
WITH (FILLFACTOR = 90);

-- MySQL InnoDB: Control through MERGE_THRESHOLD
-- Pages merge when they fall below MERGE_THRESHOLD% full (default 50%)
ALTER TABLE orders COMMENT = 'MERGE_THRESHOLD=45';  -- Merge less eagerly

-- Oracle: PCTFREE controls space left for updates
CREATE INDEX idx_orders_customer ON orders (customer_id)
PCTFREE 10;  -- Reserve 10% for row expansion

-- Monitor fill factor / space usage
-- PostgreSQL
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size,
       (pg_stat_get_dead_tuples(oid) * 100.0 /
        NULLIF(pg_stat_get_live_tuples(oid) + pg_stat_get_dead_tuples(oid), 0)
       ) AS dead_tuple_pct
FROM pg_class
WHERE relkind = 'i' AND relname LIKE 'idx_%';
```

Use high fill factors (90-100%) for read-heavy or append-only workloads. Use lower fill factors (70-80%) for tables with frequent updates to in-place columns. Monitor page split rates to tune: excessive splits indicate the fill factor is too high.
Given the complex trade-offs, here are practical guidelines for page size selection in different scenarios.
| Scenario | Recommended Size | Rationale |
|---|---|---|
| General OLTP, SSD storage | 4-8 KB | Match SSD page size, minimize contention |
| General OLTP, HDD storage | 8-16 KB | Amortize seek costs |
| OLAP / Data Warehouse | 16-32 KB | Optimize for scans, compression |
| Time-series / IoT | 16-64 KB | Batch inserts, sequential access |
| Small keys (INT) | 4-8 KB | Already high fanout |
| Large keys (VARCHAR) | 16-32 KB | Maintain reasonable fanout |
| Memory-constrained | 4 KB | Maximize buffer pool entries |
| In-memory database | Varies | CPU cache line considerations |
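As a quick mnemonic, the scenario table above can be encoded as a small lookup. This is a hypothetical helper, not any database's API; the keys and buckets simply mirror the table:

```python
def recommend_page_size(storage, workload):
    """Hypothetical guideline lookup; storage: 'ssd'|'hdd',
    workload: 'oltp'|'olap'|'timeseries'."""
    table = {
        ('ssd', 'oltp'): '4-8 KB',     # match SSD page size, low contention
        ('hdd', 'oltp'): '8-16 KB',    # amortize seek costs
        ('ssd', 'olap'): '16-32 KB',   # optimize scans, compression
        ('hdd', 'olap'): '16-32 KB',
        ('ssd', 'timeseries'): '16-64 KB',  # batch inserts, sequential reads
        ('hdd', 'timeseries'): '16-64 KB',
    }
    return table[(storage, workload)]

print(recommend_page_size('ssd', 'oltp'))  # 4-8 KB
```

In practice the table is a starting point; key size and memory constraints (the remaining rows above) can shift the answer within each range.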
Database-Specific Defaults and Options:
| Database | Default | Options | When to Change |
|---|---|---|---|
| PostgreSQL | 8 KB | Compile-time: 4-32KB | Rarely (requires rebuild) |
| MySQL InnoDB | 16 KB | 4KB, 8KB, 16KB, 32KB, 64KB | At initialization |
| Oracle | 8 KB | 2-32KB per tablespace | Tablespace creation |
| SQL Server | 8 KB | Fixed | Cannot change |
| SQLite | 4 KB | 512B-64KB | At database creation |
In most databases, changing page size requires recreating the database or tablespace. This is not a tuning parameter you adjust casually—it's an architectural decision made during system design. Get it right initially by understanding your workload patterns.
We've comprehensively examined the factors influencing optimal B+-tree node (page) size selection.
What's Next:
With page size and height understood, we'll examine fanout optimization—the techniques for maximizing entries per node through key compression, prefix truncation, and other advanced methods.
You now understand how to select appropriate page sizes for B+-trees based on storage technology, workload characteristics, and memory constraints. This knowledge enables informed database configuration decisions. Next, we'll explore techniques to maximize fanout within any given page size.