Selecting the optimal node size (page size) for a B+-tree is one of the most consequential decisions in database system design. This choice ripples through every aspect of system performance—from I/O efficiency to buffer pool utilization, from concurrency to storage overhead.
The challenge lies in balancing competing objectives: larger pages increase fanout and reduce tree height, but they also increase I/O transfer time, lock contention, and wasted space from partially-filled nodes. Finding the sweet spot requires deep understanding of hardware characteristics, workload patterns, and system constraints.
By the end of this page, you will understand the factors influencing optimal page size, analyze trade-offs between different sizing strategies, evaluate how hardware evolution affects sizing decisions, and apply sizing principles to specific workload types.
The page size (or block size) defines the fundamental unit of storage and I/O in a database system. Every B+-tree node occupies exactly one page—this fixed mapping creates the foundation for all sizing analysis.
The Physical-Logical Interface:
Database pages serve as the bridge between logical data structures and physical storage:
| Layer | Unit | Typical Size | Controlled By |
|---|---|---|---|
| Application | Row/Record | 50-500 bytes | Schema design |
| Database | Page/Block | 4KB-64KB | DB configuration |
| Filesystem | Block | 4KB | OS/Filesystem |
| SSD | Page/Block | 4KB/128KB-4MB | Hardware |
| HDD | Sector | 512B-4KB | Hardware |
Why Page Size Matters for B+-Trees:
The relationship between page size and B+-tree behavior is direct and profound:
- **Fanout (f):** The maximum number of children per internal node scales nearly linearly with page size
- **I/O Granularity:** Each node access transfers exactly one page
- **Buffer Pool Efficiency:** The buffer pool holds pages, not individual records
- **Concurrency Granularity:** Locks are typically held at page level
Common defaults in practice:

- 4 KB: SQLite default
- 8 KB: PostgreSQL, Oracle, and SQL Server defaults
- 16 KB: MySQL InnoDB default; common for OLAP workloads
- 32-64 KB: Specialized data warehouse configurations
The central tension in page size selection is between maximizing fanout (height reduction) and minimizing transfer cost (I/O time). Understanding this trade-off is essential for informed decision-making.
Fanout Analysis:
For an internal node with page size P, key size K, and pointer size Ptr:
$$\text{Fanout} = f = \lfloor (P - \text{Overhead}) / (K + \text{Ptr}) \rfloor$$

(This is a close approximation: an internal node with f children actually holds f − 1 keys, but the difference is negligible at these sizes.)
```python
def calculate_fanout(page_size_bytes, key_size, pointer_size, overhead=64):
    """
    Calculate B+-tree fanout for given page and key sizes.

    Parameters:
    - page_size_bytes: Total page size
    - key_size: Size of each key in bytes
    - pointer_size: Size of each pointer in bytes
    - overhead: Page header, checksums, etc.
    """
    usable_space = page_size_bytes - overhead
    entries = usable_space // (key_size + pointer_size)
    return entries  # fanout ≈ entries for internal nodes

# Compare fanout across page sizes for 8-byte keys
print("Fanout by Page Size (8-byte key, 8-byte pointer)")
print("=" * 50)
for page_kb in [4, 8, 16, 32, 64]:
    fanout = calculate_fanout(page_kb * 1024, 8, 8)
    print(f"{page_kb:2d} KB page → Fanout: {fanout:4d}")

# Output:
#  4 KB page → Fanout:  252
#  8 KB page → Fanout:  508
# 16 KB page → Fanout: 1020
# 32 KB page → Fanout: 2044
# 64 KB page → Fanout: 4092

# Calculate height for 1 billion records
import math

def calculate_height(fanout, record_count):
    return math.ceil(math.log(record_count, fanout)) + 1

print("Tree Height for 1 Billion Records")
print("=" * 50)
for page_kb in [4, 8, 16, 32, 64]:
    fanout = calculate_fanout(page_kb * 1024, 8, 8)
    height = calculate_height(fanout, 1_000_000_000)
    print(f"{page_kb:2d} KB page → Height: {height}")
```

Transfer Time Analysis:
The time to read a page from disk depends on the storage medium and page size:
$$\text{I/O Time} = \text{Seek Time} + \text{Rotational Latency} + \text{Transfer Time}$$
For HDDs: seek time plus rotational latency contributes a fixed ~5-10 ms per access, while transfer adds only about 0.01 ms per KB.
For SSDs: there is no mechanical seek; a fixed access latency of roughly 0.1 ms dominates, with transfer adding about 0.002 ms per KB.
The Critical Insight:
On HDDs, seek time dominates—doubling page size barely increases total I/O time but doubles data retrieved. On SSDs, transfer time is proportionally more significant, but still relatively small.
| Component | HDD (4KB) | HDD (16KB) | SSD (4KB) | SSD (16KB) |
|---|---|---|---|---|
| Seek/Latency | 8 ms | 8 ms | 0.1 ms | 0.1 ms |
| Transfer | 0.04 ms | 0.16 ms | 0.008 ms | 0.032 ms |
| Total | 8.04 ms | 8.16 ms | 0.108 ms | 0.132 ms |
| Data per I/O | 4 KB | 16 KB | 4 KB | 16 KB |
| Effective Throughput | 0.5 MB/s | 1.9 MB/s | 37 MB/s | 121 MB/s |
Larger pages amortize fixed I/O costs (seek, latency) over more data. For HDDs with ~8 ms of seek overhead, a 16 KB page delivers nearly 4x the effective throughput of a 4 KB page per byte retrieved. This is why HDD-based systems historically favored larger pages.
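The amortization effect can be checked directly. The sketch below uses the same illustrative latency and transfer figures as the table above (assumed model values, not measurements):

```python
def effective_throughput_mb_s(page_kb, fixed_latency_ms, transfer_ms_per_kb):
    """Data retrieved per unit of total I/O time (KB/ms, roughly MB/s)."""
    total_ms = fixed_latency_ms + page_kb * transfer_ms_per_kb
    return page_kb / total_ms

# HDD: ~8 ms seek + rotation, ~0.01 ms per KB transferred
hdd_4 = effective_throughput_mb_s(4, 8.0, 0.01)
hdd_16 = effective_throughput_mb_s(16, 8.0, 0.01)
print(f"HDD  4 KB: {hdd_4:.1f} MB/s, 16 KB: {hdd_16:.1f} MB/s "
      f"({hdd_16 / hdd_4:.1f}x)")

# SSD: ~0.1 ms access, ~0.002 ms per KB transferred
ssd_4 = effective_throughput_mb_s(4, 0.1, 0.002)
ssd_16 = effective_throughput_mb_s(16, 0.1, 0.002)
print(f"SSD  4 KB: {ssd_4:.0f} MB/s, 16 KB: {ssd_16:.0f} MB/s "
      f"({ssd_16 / ssd_4:.1f}x)")
```

Because the fixed cost is so much larger on HDDs, quadrupling the page size nearly quadruples effective throughput there, while on SSDs the gain is smaller.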
The optimal page size depends heavily on the underlying storage technology. The shift from HDDs to SSDs has fundamentally changed sizing calculus.
Hard Disk Drives (HDDs):
HDDs have high seek latency (5-10 ms) dominated by mechanical movement, so page size optimization focuses on minimizing the number of seeks.
Solid State Drives (SSDs):
SSDs have fundamentally different characteristics: no mechanical seek, but significant internal parallelism and write amplification effects to manage.
HDD-optimized configuration:

- Page Size: 16-32 KB
- Rationale: Amortize the ~8 ms seek
- Height Impact: Minimal (high fanout)
- Wasted Space: ~15-20% acceptable
- Best For: Sequential scans, OLAP
SSD-optimized configuration:

- Page Size: 4-8 KB
- Rationale: Match the SSD page size
- Height Impact: Acceptable (fast seeks)
- Wasted Space: Minimize for wear
- Best For: Random lookups, OLTP
```python
import math

def compare_lookup_performance(record_count, key_size=8):
    """
    Compare B+-tree lookup performance: HDD vs SSD, various page sizes.
    """
    results = []
    for page_kb in [4, 8, 16, 32]:
        for storage_type in ['HDD', 'SSD']:
            # Calculate tree structure
            pointer_size = 8
            overhead = 64
            usable = page_kb * 1024 - overhead
            fanout = usable // (key_size + pointer_size)
            height = math.ceil(math.log(record_count, fanout)) + 1

            # I/O times (milliseconds)
            if storage_type == 'HDD':
                seek_latency = 8.0     # ms
                transfer_rate = 0.01   # ms per KB
            else:  # SSD
                seek_latency = 0.1     # ms
                transfer_rate = 0.002  # ms per KB

            io_time = seek_latency + page_kb * transfer_rate
            total_lookup_time = height * io_time

            results.append({
                'page_kb': page_kb,
                'storage': storage_type,
                'fanout': fanout,
                'height': height,
                'io_time_ms': io_time,
                'lookup_ms': total_lookup_time
            })
    return results

# Analyze for 1 billion records
results = compare_lookup_performance(1_000_000_000)

print("B+-Tree Lookup Performance (1B Records)")
print("=" * 65)
print(f"{'Page':<8} {'Storage':<8} {'Fanout':<8} {'Height':<8} "
      f"{'I/O (ms)':<12} {'Lookup (ms)':<12}")
print("-" * 65)
for r in results:
    print(f"{r['page_kb']:<8} {r['storage']:<8} {r['fanout']:<8} {r['height']:<8} "
          f"{r['io_time_ms']:<12.3f} {r['lookup_ms']:<12.2f}")
```

Different workload patterns favor different page sizes. Understanding your workload is crucial for optimal sizing.
OLTP (Online Transaction Processing):
Characteristics: Many concurrent small transactions, point lookups, single-row updates.
| Factor | Smaller Page (4-8KB) | Larger Page (16-32KB) |
|---|---|---|
| Point Lookup I/O | Minimal data transfer | Transfer unused data |
| Buffer Pool Efficiency | More granular caching | Coarse-grained; wastes memory |
| Lock Contention | Fewer rows per lock | More rows locked together |
| Undo/Redo Logging | Smaller log records | Larger log records |
| Write Amplification | Lower | Higher |
| Recommendation | Preferred for OLTP | Avoid unless scan-heavy |
OLAP (Online Analytical Processing):
Characteristics: Large scans, aggregations, few concurrent queries, batch updates.
| Factor | Smaller Page (4-8KB) | Larger Page (16-32KB) |
|---|---|---|
| Full Table Scans | More I/O operations | Fewer I/O operations |
| Sequential Read | More seek overhead | Better amortization |
| Index Height | Higher tree | Shorter tree |
| Compression Ratio | Lower efficiency | Better compression opportunity |
| Pre-fetching | More overhead per fetch | More data per fetch |
| Recommendation | Suboptimal for OLAP | Preferred for OLAP |
Most production databases handle mixed workloads. The default 8KB page size (Oracle, SQL Server) represents a pragmatic middle ground—adequate for OLTP point lookups while not severely penalizing analytical scans.
Time-Series and Append-Heavy Workloads:
These workloads have a distinctive pattern: mostly inserts, rare updates, and queries that usually scan recent data. Because writes arrive roughly in key order and reads are largely sequential, larger pages (16-64 KB) tend to work well here.
Page size directly affects buffer pool behavior—how many pages fit in memory, eviction frequency, and cache hit rates.
The Granularity Problem:
Consider a fixed 8 GB buffer pool:
| Page Size | Pages in 8 GB | Records per Page (@ 100 B/record) | Eviction Granularity |
|---|---|---|---|
| 4 KB | 2,097,152 | ~40 | ~40 records |
| 8 KB | 1,048,576 | ~81 | ~81 records |
| 16 KB | 524,288 | ~163 | ~163 records |
| 32 KB | 262,144 | ~327 | ~327 records |
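A minimal sketch of the granularity arithmetic, assuming the ~100-byte records used elsewhere on this page:

```python
def eviction_granularity(pool_gb, page_kb, record_bytes=100):
    """Return (pages that fit in the pool, records discarded per eviction)."""
    pages = (pool_gb * 1024 * 1024) // page_kb          # pool size in KB / page KB
    records_per_page = (page_kb * 1024) // record_bytes  # assumes ~100 B records
    return pages, records_per_page

for page_kb in [4, 8, 16, 32]:
    pages, rpp = eviction_granularity(8, page_kb)
    print(f"{page_kb:2d} KB: {pages:,} pages; evicting one page drops {rpp} records")
```

The total bytes cached are the same in every row; what changes is how finely the buffer pool can choose what to keep.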
Implications: smaller pages give the buffer pool more, finer-grained cache entries, so each eviction discards fewer records; with larger pages, every miss evicts a bigger slab of possibly-useful data, and every cached hot record drags more cold neighbors into memory.
The "Hot Record" Analysis:
```python
def analyze_cache_efficiency(total_records, hot_record_pct,
                             buffer_pool_mb, records_per_page):
    """
    Analyze cache efficiency for different access patterns.
    Assumes hot records are uniformly distributed across pages
    and each record occupies ~100 bytes.
    """
    hot_records = int(total_records * hot_record_pct / 100)

    # Worst case: hot records spread across all pages.
    # Each page holds (hot_record_pct)% hot records, so caching all
    # hot records means caching far more pages than the hot data needs.
    pages_with_hot_data = total_records // records_per_page
    hot_per_page = records_per_page * hot_record_pct / 100

    # Efficiency: what fraction of cached data is actually hot?
    cache_efficiency = hot_record_pct / 100  # theoretical maximum

    # With a limited buffer pool, how much hot data can we cache?
    # (assuming 100 bytes per record)
    buffer_pages = buffer_pool_mb * 1024 * 1024 // (records_per_page * 100)
    pages_cached = min(buffer_pages, pages_with_hot_data)
    hot_records_cached = pages_cached * hot_per_page
    hot_cache_ratio = hot_records_cached / hot_records if hot_records > 0 else 0

    return {
        'hot_records': hot_records,
        'cache_efficiency': cache_efficiency,
        'hot_cache_ratio': hot_cache_ratio,
        'pages_cached': pages_cached
    }

# Example: 1 billion records, 10% hot, 8 GB buffer pool
for rpp in [50, 100, 200, 400]:  # records per page
    page_kb = rpp * 100 / 1024   # 100 bytes per record
    result = analyze_cache_efficiency(
        total_records=1_000_000_000,
        hot_record_pct=10,
        buffer_pool_mb=8192,
        records_per_page=rpp
    )
    print(f"~{page_kb:.0f}KB page ({rpp} rec/page): "
          f"Can cache {result['hot_cache_ratio']*100:.1f}% of hot data")
```

When hot records are scattered across many pages (common in random-access patterns), larger pages waste buffer pool memory on cold records that happen to share pages with hot ones. This "cache pollution" can dramatically reduce effective cache size.
The fill factor (or packing density) determines how full pages are when data is loaded or reorganized. This interacts with page size to affect both performance and storage efficiency.
Fill Factor Basics:
| Fill Factor | Effective Fanout | Space Used (1B records) | Tree Height |
|---|---|---|---|
| 100% | 500 | 8 GB | 4 |
| 90% | 450 | 8.9 GB | 4 |
| 70% | 350 | 11.4 GB | 4 |
| 50% | 250 | 16 GB | 4-5 |
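The fill-factor table can be reproduced with a short calculation, assuming a base fanout of 500 at 100% fill and 8 GB of fully-packed data (the figures used in the table):

```python
import math

def height_with_fill(base_fanout, fill_factor, record_count):
    """Effective fanout and index height when nodes are only partially full.
    base_fanout is the fanout at 100% fill."""
    eff_fanout = round(base_fanout * fill_factor)
    height = math.ceil(math.log(record_count, eff_fanout))
    return eff_fanout, height

for ff in [1.0, 0.9, 0.7, 0.5]:
    fanout, height = height_with_fill(500, ff, 1_000_000_000)
    size_gb = 8 / ff  # 8 GB of data at 100% fill grows as fill drops
    print(f"fill {ff:.0%}: fanout {fanout}, ~{size_gb:.1f} GB, height {height}")
```

Note how height barely moves even as space grows: the logarithm absorbs the fanout loss, which is why lower fill factors cost mostly space, not depth.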
When to Use Lower Fill Factors:
Counter-intuitively, a lower initial fill factor can improve performance for certain workloads:
```sql
-- PostgreSQL: Create index with specific fill factor
CREATE INDEX idx_orders_customer ON orders (customer_id)
WITH (fillfactor = 90);  -- Leave 10% free space

-- SQL Server: Specify fill factor
CREATE INDEX idx_orders_customer
ON orders (customer_id)
WITH (FILLFACTOR = 90);

-- MySQL InnoDB: Control through MERGE_THRESHOLD
-- Pages merge when they fall below MERGE_THRESHOLD% full (default 50%)
ALTER TABLE orders COMMENT = 'MERGE_THRESHOLD=45';  -- Merge less eagerly

-- Oracle: PCTFREE controls space left for updates
CREATE INDEX idx_orders_customer ON orders (customer_id)
PCTFREE 10;  -- Reserve 10% for row expansion

-- Monitor fill factor / space usage
-- PostgreSQL
SELECT relname,
       pg_size_pretty(pg_relation_size(oid)) AS size,
       (pg_stat_get_dead_tuples(oid) * 100.0 /
        NULLIF(pg_stat_get_live_tuples(oid) + pg_stat_get_dead_tuples(oid), 0)
       ) AS dead_tuple_pct
FROM pg_class
WHERE relkind = 'i' AND relname LIKE 'idx_%';
```

Use high fill factors (90-100%) for read-heavy or append-only workloads. Use lower fill factors (70-80%) for tables with frequent updates to in-place columns. Monitor page split rates to tune: excessive splits indicate the fill factor is too high.
Given the complex trade-offs, here are practical guidelines for page size selection in different scenarios.
| Scenario | Recommended Size | Rationale |
|---|---|---|
| General OLTP, SSD storage | 4-8 KB | Match SSD page size, minimize contention |
| General OLTP, HDD storage | 8-16 KB | Amortize seek costs |
| OLAP / Data Warehouse | 16-32 KB | Optimize for scans, compression |
| Time-series / IoT | 16-64 KB | Batch inserts, sequential access |
| Small keys (INT) | 4-8 KB | Already high fanout |
| Large keys (VARCHAR) | 16-32 KB | Maintain reasonable fanout |
| Memory-constrained | 4 KB | Maximize buffer pool entries |
| In-memory database | Varies | CPU cache line considerations |
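As a quick mnemonic, the scenario table above can be encoded as a small lookup. This is a hypothetical helper, not any database's API; the keys and buckets simply mirror the table:

```python
def recommend_page_size(storage, workload):
    """Hypothetical guideline lookup; storage: 'ssd'|'hdd',
    workload: 'oltp'|'olap'|'timeseries'."""
    table = {
        ('ssd', 'oltp'): '4-8 KB',     # match SSD page size, low contention
        ('hdd', 'oltp'): '8-16 KB',    # amortize seek costs
        ('ssd', 'olap'): '16-32 KB',   # optimize scans, compression
        ('hdd', 'olap'): '16-32 KB',
        ('ssd', 'timeseries'): '16-64 KB',  # batch inserts, sequential reads
        ('hdd', 'timeseries'): '16-64 KB',
    }
    return table[(storage, workload)]

print(recommend_page_size('ssd', 'oltp'))  # 4-8 KB
```

In practice the table is a starting point; key size and memory constraints (the remaining rows above) can shift the answer within each range.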
Database-Specific Defaults and Options:
| Database | Default | Options | When to Change |
|---|---|---|---|
| PostgreSQL | 8 KB | Compile-time: 4-32KB | Rarely (requires rebuild) |
| MySQL InnoDB | 16 KB | 4KB, 8KB, 16KB, 32KB, 64KB | At initialization |
| Oracle | 8 KB | 2-32KB per tablespace | Tablespace creation |
| SQL Server | 8 KB | Fixed | Cannot change |
| SQLite | 4 KB | 512B-64KB | At database creation |
In most databases, changing page size requires recreating the database or tablespace. This is not a tuning parameter you adjust casually—it's an architectural decision made during system design. Get it right initially by understanding your workload patterns.
We've comprehensively examined the factors influencing optimal B+-tree node (page) size selection.
What's Next:
With page size and height understood, we'll examine fanout optimization—the techniques for maximizing entries per node through key compression, prefix truncation, and other advanced methods.
You now understand how to select appropriate page sizes for B+-trees based on storage technology, workload characteristics, and memory constraints. This knowledge enables informed database configuration decisions. Next, we'll explore techniques to maximize fanout within any given page size.