For decades, a single constraint dominated database design: disk I/O is slow. While CPUs doubled in speed every 18 months, disk access times improved by mere percentage points per year. This asymmetry created a profound truth that echoes through every layer of database architecture: the most expensive thing a database can do is wait for data from disk.
This reality shaped everything from B+-tree indexes (minimize random reads) to buffer pools (cache hot data in RAM) to write-ahead logging (sequential writes are fast). Understanding I/O cost isn't just about one term in a cost equation—it's about understanding why databases are architected the way they are.
Even in the era of SSDs and massive RAM, I/O cost remains the dominant factor in query optimization for most workloads. The mechanics have changed (SSDs vs. spinning disks), but the principles persist: accessing data costs time, and minimizing unnecessary access is paramount.
By the end of this page, you will understand:

- The fundamental units of database I/O (pages, blocks, extents)
- The difference between sequential and random I/O, and why it matters enormously
- How databases model disk access costs for various operations
- The role of buffer pools and how they affect I/O cost calculations
- Concrete cost formulas for table scans, index scans, and join operations
Databases don't read individual rows from disk—they read pages (also called blocks). A page is a fixed-size chunk of data, typically 4KB, 8KB, or 16KB, that serves as the atomic unit of I/O.
Why Pages?
Hardware Alignment: Disks and SSDs are organized into sectors (typically 512 bytes or 4KB). Reading aligned, larger chunks is more efficient than arbitrary byte ranges.
Amortization: The cost of a disk seek (positioning the read head) dominates read time. Once positioned, reading additional nearby bytes adds minimal overhead.
Caching Efficiency: Fixed-size pages simplify buffer pool management. Every page is the same size, making allocation and eviction straightforward.
Logging and Recovery: Transaction logs and recovery mechanisms operate at page granularity, simplifying consistency guarantees.
| Database | Default Page Size | Configurable Range | Notes |
|---|---|---|---|
| PostgreSQL | 8 KB | 1KB – 32KB (compile-time) | 8KB optimal for general workloads |
| MySQL (InnoDB) | 16 KB | 4KB – 64KB | 16KB balances row storage and I/O |
| Oracle | 8 KB | 2KB – 32KB (per tablespace) | Multiple block sizes per database |
| SQL Server | 8 KB | Fixed | Extents of 8 pages (64KB) for allocation |
| SQLite | 4 KB | 512B – 64KB | Matches OS page size by default |
Pages and Rows:
A page contains multiple rows (tuples). The exact number depends on row size:
Rows per page ≈ (Page Size - Page Header) / Average Row Size
For a PostgreSQL 8KB page with a 60-byte header and a 100-byte average row size:

Rows per page ≈ (8192 − 60) / 100 ≈ 81

This calculation matters enormously for cost estimation. Reading 10,000 rows from this table requires ⌈10,000 / 81⌉ = 124 page reads. Not 10,000 I/O operations, but 124. I/O cost is counted in pages, not rows.
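The page arithmetic can be sketched in a few lines (page size, header size, and row size are the example values above, not universal constants):

```python
import math

PAGE_SIZE = 8192    # 8 KB PostgreSQL-style page
PAGE_HEADER = 60    # header overhead from the example
AVG_ROW_SIZE = 100  # average row size from the example

# Rows that fit on one page, then pages needed for 10,000 rows
rows_per_page = (PAGE_SIZE - PAGE_HEADER) // AVG_ROW_SIZE
pages_to_read = math.ceil(10_000 / rows_per_page)

print(rows_per_page, pages_to_read)  # 81 124
```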
Rows can't span pages (in most systems), so a 5000-byte row on an 8KB page leaves 3KB unused. Large rows dramatically reduce storage efficiency. Additionally, pages maintain fill factor (e.g., 90%) to leave room for updates. These factors affect actual I/O volumes and should inform schema design.
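The impact of large rows and fill factor can be sketched the same way (the 90% fill factor and 5000-byte row are the illustrative values from the text):

```python
import math

PAGE_SIZE, PAGE_HEADER = 8192, 60  # PostgreSQL-style 8 KB page (example values)

def rows_per_page(row_size, fill_factor=1.0):
    # usable bytes after header, reduced by the fill-factor headroom
    usable = (PAGE_SIZE - PAGE_HEADER) * fill_factor
    return math.floor(usable / row_size)

# A 5000-byte row: only one fits, leaving ~3 KB of the page unused
print(rows_per_page(5000))      # 1
# 100-byte rows at 90% fill factor: room left for in-place updates
print(rows_per_page(100, 0.9))  # 73
```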
The most fundamental distinction in I/O cost modeling is between sequential and random access. This distinction has profound implications for query optimization.
The Mathematics of Disk Access:
For a traditional spinning hard drive:
Time to read one page = Seek Time + Rotational Latency + Transfer Time
Typical values:
Seek Time: ~8ms (average)
Rotational Latency: ~4ms (at 7200 RPM, half rotation average)
Transfer Time: ~0.05ms (8KB at 150 MB/s)
Total random read: ~12ms per page
Random IOPS: ~83 per second
For sequential reading, seek and rotational latency amortize across many pages:
Sequential read of N pages:
= 1 seek + 1 rotation + (N × transfer time)
= 12ms + (N × 0.05ms)
For N=1000 pages: 12ms + 50ms = 62ms
Effective: 1000 pages / 62ms ≈ 16,000 pages per second
The ratio of random to sequential cost: ~200× for HDDs
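The arithmetic above can be checked with a short sketch (the constants are the typical HDD figures quoted, not measurements):

```python
# Typical 7200 RPM HDD figures from the text (illustrative, not measured)
SEEK_MS = 8.0       # average seek time
ROTATE_MS = 4.0     # average rotational latency (half rotation)
TRANSFER_MS = 0.05  # transfer of one 8 KB page at ~150 MB/s

def random_read_ms(pages):
    # every page pays the full seek + rotation penalty
    return pages * (SEEK_MS + ROTATE_MS + TRANSFER_MS)

def sequential_read_ms(pages):
    # one seek + one rotation, then contiguous transfer
    return SEEK_MS + ROTATE_MS + pages * TRANSFER_MS

print(sequential_read_ms(1000))                         # ≈62 ms
print(random_read_ms(1000) / sequential_read_ms(1000))  # ≈194× slower random
```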
This massive difference explains why optimizers prefer sequential table scans for large result sets, why write-ahead logs append sequentially, and why clustering related rows on adjacent pages pays off.
SSDs eliminate mechanical seek time, reducing the random/sequential gap dramatically—from 200× to perhaps 4×. However, a gap persists because (1) internal SSD parallelism favors sequential patterns, (2) operating system and filesystem prefetching optimizes sequential reads, and (3) write amplification penalizes scattered writes. Cost models must be recalibrated for SSD environments.
| Storage Type | Random I/O Cost | Sequential I/O Cost | Ratio |
|---|---|---|---|
| 7200 RPM HDD | 12 ms per page | 0.05 ms per page | 240× |
| 15000 RPM HDD | 6 ms per page | 0.03 ms per page | 200× |
| SATA SSD | 0.1 ms per page | 0.02 ms per page | 5× |
| NVMe SSD | 0.02 ms per page | 0.005 ms per page | 4× |
| RAM (baseline) | 0.0001 ms per page | 0.00005 ms per page | 2× |
Database cost models use formulas parameterized by page counts and access patterns. Let's examine the core formulas for fundamental operations.
```text
=============================================================
I/O COST FORMULAS FOR CORE DATABASE OPERATIONS
=============================================================

Notation:
  B_r    = Number of pages in relation R (blocks)
  T_r    = Number of tuples in relation R
  f_r    = Blocking factor (tuples per page)
  c_seq  = Cost per sequential page read (typically 1.0)
  c_rand = Cost per random page read (typically 4.0)
  sel    = Selectivity of predicate (fraction of tuples matched)

-------------------------------------------------------------
1. SEQUENTIAL TABLE SCAN
-------------------------------------------------------------
Cost = B_r × c_seq

Example: Table with 10,000 pages, c_seq = 1.0
Cost = 10,000 × 1.0 = 10,000 cost units

This reads every page once, sequentially. Simple and predictable.

-------------------------------------------------------------
2. INDEX SCAN (Secondary Index, Unclustered)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × T_r) × c_rand      -- Random heap lookups

Each matching tuple requires a random I/O to fetch its heap page
(worst case: each tuple on a different page).

Example: 1M tuples, 10% selectivity, index height 3
Cost = 3 × 4 + (0.1 × 1,000,000) × 4
     = 12 + 400,000 = 400,012 cost units

-------------------------------------------------------------
3. INDEX SCAN (Clustered / Index-Organized)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × B_r) × c_seq       -- Sequential data scan

Data pages are ordered by index, so we read contiguous ranges.

Example: 10,000 pages, 10% selectivity, index height 3
Cost = 3 × 4 + (0.1 × 10,000) × 1
     = 12 + 1,000 = 1,012 cost units

-------------------------------------------------------------
4. INDEX-ONLY SCAN (Covering Index)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × B_index) × c_seq   -- Scan index leaves only

No heap access! Much cheaper when index contains all needed columns.

Example: Index with 2,000 pages, 10% selectivity
Cost = 3 × 4 + (0.1 × 2,000) × 1
     = 12 + 200 = 212 cost units

-------------------------------------------------------------
5. NESTED LOOP JOIN (Simple)
-------------------------------------------------------------
Cost = B_outer                            -- Scan outer once
     + T_outer × (Height_index × c_rand)  -- Index lookup per outer tuple
     + (T_outer × sel_join) × c_rand      -- Heap fetch per match

For each outer tuple, traverse inner's index and fetch matching rows.

-------------------------------------------------------------
6. HASH JOIN
-------------------------------------------------------------
Cost = 3 × (B_build + B_probe)   -- Read both, write partitions, read again

Partitioning may spill to disk, tripling I/O in worst case.
When hash table fits in memory: Cost = B_build + B_probe

-------------------------------------------------------------
7. SORT-MERGE JOIN
-------------------------------------------------------------
Cost = Sort_cost(R) + Sort_cost(S) + (B_r + B_s)

External sort cost depends on memory and relation size.
For large relations: Sort_cost ≈ 2 × B × (1 + ⌈log_{M-1}(B/M)⌉)
where M = memory pages available.
```

Notice how unclustered index scans become extremely expensive for large selectivities. At around 5-15% selectivity (depending on clustering), a sequential table scan becomes cheaper than an index scan. This is why optimizers often choose 'counterintuitive' full table scans—they're genuinely faster for large result sets.
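As a sanity check, the scan formulas can be turned into small helper functions (cost-unit constants as in the notation block; an illustrative sketch, not a real optimizer):

```python
# Illustrative cost helpers mirroring scan formulas 1-3 above
C_SEQ = 1.0   # cost of one sequential page read
C_RAND = 4.0  # cost of one random page read

def seq_scan_cost(b_r):
    # formula 1: read every page once, sequentially
    return b_r * C_SEQ

def unclustered_index_scan_cost(height, t_r, sel):
    # formula 2: index traversal + one random heap fetch per matching tuple
    return height * C_RAND + (sel * t_r) * C_RAND

def clustered_index_scan_cost(height, b_r, sel):
    # formula 3: index traversal + sequential read of the matching page range
    return height * C_RAND + (sel * b_r) * C_SEQ

# Reproduce the worked examples: height-3 index, 10% selectivity
print(round(seq_scan_cost(10_000)))                           # 10000
print(round(unclustered_index_scan_cost(3, 1_000_000, 0.1)))  # 400012
print(round(clustered_index_scan_cost(3, 10_000, 0.1)))       # 1012
```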
The buffer pool (or buffer cache) is an in-memory area where the database caches recently accessed pages. Its effect on I/O cost is profound but complex to model accurately.
The Cache Hit Rate:
The most critical factor is the cache hit rate—the probability that a requested page is already in memory:
Effective I/O Cost = (1 - hit_rate) × Physical I/O Cost
If 90% of page accesses are cache hits, actual I/O cost is only 10% of the formula estimate.
Modeling Cache Behavior:
Accurate cache modeling is difficult because:
PostgreSQL's Approach:
PostgreSQL uses effective_cache_size to estimate total cache size (buffer pool + OS cache) and adjusts index scan costs based on the probability of finding index and heap pages cached:
Effective index cost = Base cost × (Random I/O cost factor adjusted for caching)
The optimizer assumes frequently accessed pages (index root/internal nodes) are likely cached, while leaf nodes and heap pages of cold data are not.
| Hit Rate | 10,000 Random I/Os | Effective I/Os | Speedup Factor |
|---|---|---|---|
| 0% | 10,000 | 10,000 | 1.0× |
| 50% | 10,000 | 5,000 | 2.0× |
| 80% | 10,000 | 2,000 | 5.0× |
| 95% | 10,000 | 500 | 20.0× |
| 99% | 10,000 | 100 | 100.0× |
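The table above can be reproduced directly from the effective-cost formula:

```python
# Effective I/O = requested I/Os that miss the cache (formula above)
def effective_ios(requested_ios, hit_rate):
    return round(requested_ios * (1 - hit_rate))

for hit_rate in (0.0, 0.5, 0.8, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: {effective_ios(10_000, hit_rate)} physical I/Os")
```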
For OLTP workloads, sizing the buffer pool to hold the 'working set' (frequently accessed tables and indexes) dramatically improves performance. For PostgreSQL, 25% of RAM is a common starting point. For dedicated database servers, 60-80% is achievable. Monitor cache hit rates—below 95% for OLTP suggests undersized caching.
PostgreSQL provides transparent, tunable I/O cost parameters. Understanding these illuminates how a production database models I/O costs.
```sql
-- Core I/O cost parameters with defaults and typical adjustments
-- =============================================================

-- Sequential page read cost (baseline = 1.0)
-- This is the baseline against which other costs are measured
seq_page_cost = 1.0

-- Random page read cost
-- Default assumes HDD; reduce for SSD (1.1 - 2.0)
random_page_cost = 4.0           -- Default (HDD-oriented)
-- random_page_cost = 1.5        -- Typical for SSD
-- random_page_cost = 1.1        -- Fast NVMe

-- Estimated effective cache size (buffer pool + OS cache)
-- Tells optimizer how much data is likely cached
effective_cache_size = '4GB'     -- Default, very conservative
-- effective_cache_size = '48GB' -- Typical for 64GB RAM server

-- Example: Impact on index vs table scan decision
-- Consider: SELECT * FROM orders WHERE status = 'shipped'
-- Assume: 1M rows, 100K pages, 5% selectivity, index height 3

-- With random_page_cost = 4.0 (HDD):
--   Index scan: 3×4 + 50,000×4 = 200,012 (50K random reads)
--   Seq scan:   100,000×1 = 100,000
--   Decision: Sequential scan wins (surprise!)

-- With random_page_cost = 1.5 (SSD):
--   Index scan: 3×1.5 + 50,000×1.5 = 75,004
--   Seq scan:   100,000×1 = 100,000
--   Decision: Index scan wins

-- Viewing current settings:
SHOW seq_page_cost;
SHOW random_page_cost;
SHOW effective_cache_size;

-- Temporarily changing for comparison:
SET random_page_cost = 1.5;
EXPLAIN SELECT * FROM large_table WHERE indexed_col = 'value';
SET random_page_cost = 4.0;
EXPLAIN SELECT * FROM large_table WHERE indexed_col = 'value';
```

Parameter Tuning Strategy:
1. random_page_cost: Lower it toward 1.1–1.5 on SSD/NVMe storage; leave it near the 4.0 default only on HDD-backed systems.

2. effective_cache_size: Set it to the memory realistically available for caching (shared_buffers plus OS page cache), often 50–75% of RAM on a dedicated server.

3. Validation Approach:
- Use EXPLAIN (ANALYZE, BUFFERS) to compare estimated vs. actual I/O
- Check pg_stat_user_tables for actual seq_scan vs. idx_scan counts

Running with the default random_page_cost = 4.0 on a fully SSD system causes the optimizer to undervalue index scans: queries that could complete in 100ms via index scan instead use table scans taking 10 seconds. Conversely, setting random_page_cost too low on HDD systems causes expensive random I/O patterns. Always calibrate to your hardware.
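The index-versus-sequential decision sketched in the SQL comments can be modeled in a few lines (same simplified unclustered-scan cost as earlier; numbers are the hypothetical ones from the example):

```python
# Hypothetical scenario from the comments above: 100K-page table,
# 50K matching rows (5% of 1M), height-3 index; unclustered worst case.
def scan_costs(b_pages, matching_rows, height, random_page_cost, seq_page_cost=1.0):
    # one random read per index level, plus one per matching row
    index_scan = height * random_page_cost + matching_rows * random_page_cost
    seq_scan = b_pages * seq_page_cost
    return index_scan, seq_scan

print(scan_costs(100_000, 50_000, 3, 4.0))  # (200012.0, 100000.0) -> seq scan wins
print(scan_costs(100_000, 50_000, 3, 1.5))  # (75004.5, 100000.0) -> index scan wins
```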
Query optimization primarily focuses on read I/O, but write operations also contribute to cost, particularly for UPDATE, DELETE, and queries requiring temporary storage.
Write Cost Multipliers:
Writes are generally more expensive than reads:
HDD sequential write: 1.5× read cost (write verification overhead)
HDD random write: 2.0× read cost (seek + write + verify)
SSD write: 1.0-1.2× read cost (plus wear leveling background)
SSD with fsync: 2-10× (depending on controller and queue depth)
Sort Spill I/O Estimation:
Data to sort: D pages
Memory available: M pages
Merge order: M − 1 (sorted runs merged per pass)
Number of passes = ⌈log_{M−1}(⌈D/M⌉)⌉
Total I/O = 2 × D × (1 + Number of passes)
[Initial write of runs + read/write per merge pass]
For 10,000 pages with 100 pages of work_mem: initial runs = ⌈10,000 / 100⌉ = 100, merge passes = ⌈log₉₉ 100⌉ = 2, so Total I/O = 2 × 10,000 × (1 + 2) = 60,000 page I/Os, six I/Os per data page. This is why increasing work_mem for sort-heavy queries dramatically improves performance.
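The spill estimate follows directly from the formula above; a minimal sketch:

```python
import math

def external_sort_io(d_pages, m_pages):
    """Merge passes and total page I/Os for an external merge sort,
    per the formula above (2 I/Os per page per pass)."""
    initial_runs = math.ceil(d_pages / m_pages)
    # each merge pass combines up to (m_pages - 1) sorted runs
    passes = math.ceil(math.log(initial_runs, m_pages - 1)) if initial_runs > 1 else 0
    total_io = 2 * d_pages * (1 + passes)
    return passes, total_io

passes, total_io = external_sort_io(10_000, 100)
print(passes, total_io)  # 2 60000
```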
Spilling to disk doesn't just add I/O—it changes the access pattern from in-memory (nanoseconds) to disk-based (milliseconds). A sort that fits in memory might take 50ms; the same sort spilling to disk might take 5 seconds. Monitor temp file usage (pg_stat_statements, auto_explain) to identify spill-prone queries.
Let's apply I/O cost understanding to analyze a real query scenario. Consider an e-commerce database with the following characteristics:
```sql
-- Schema and statistics for our analysis
-- ============================================
-- Table: orders      (1,000,000 rows, ~125,000 pages)
-- Table: order_items (5,000,000 rows, ~400,000 pages)
-- Table: products    (50,000 rows, ~5,000 pages)

-- Query: Find all orders with high-value items from electronics category
SELECT o.order_id, o.order_date,
       SUM(oi.quantity * oi.unit_price) AS total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
  AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date
HAVING SUM(oi.quantity * oi.unit_price) > 1000
ORDER BY total DESC;

-- EXPLAIN ANALYZE output:
/*
Sort  (cost=234567.89..234589.12 rows=8493 width=28) (actual time=1823.456..1825.789 rows=7234 loops=1)
  Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
  Sort Method: quicksort  Memory: 892kB
  Buffers: shared hit=15234 read=108234
  I/O Timings: read=892.123
  ->  HashAggregate  (cost=234123.45..234456.78 rows=8493 width=28) (actual time=1789.012..1812.345 rows=7234 loops=1)
        Group Key: o.order_id, o.order_date
        Filter: (sum((oi.quantity * oi.unit_price)) > '1000'::numeric)
        Rows Removed by Filter: 45678
        Batches: 1  Memory Usage: 4321kB
        Buffers: shared hit=15234 read=108234
        ->  Hash Join  (cost=8234.56..189012.34 rows=453678 width=24) (actual time=234.567..1456.789 rows=423456 loops=1)
              Hash Cond: (oi.product_id = p.product_id)
              Buffers: shared hit=15234 read=108234
              ->  Hash Join  (cost=4567.89..178901.23 rows=2156789 width=28) (actual time=189.012..1123.456 rows=2034567 loops=1)
                    Hash Cond: (oi.order_id = o.order_id)
                    Buffers: shared hit=12345 read=105678
                    ->  Seq Scan on order_items oi  (cost=0.00..134567.00 rows=5000000 width=20) (actual time=0.045..567.890 rows=5000000 loops=1)
                          Buffers: shared hit=1234 read=98766
                    ->  Hash  (cost=3456.78..3456.78 rows=88889 width=12) (actual time=188.901..188.901 rows=85432 loops=1)
                          Buckets: 131072  Batches: 1  Memory Usage: 4567kB
                          Buffers: shared hit=11111 read=6912
                          ->  Index Scan using idx_orders_date on orders o  (cost=0.42..3456.78 rows=88889 width=12) (actual time=0.034..156.789 rows=85432 loops=1)
                                Index Cond: (order_date >= '2024-01-01'::date)
                                Buffers: shared hit=11111 read=6912
              ->  Hash  (cost=3456.67..3456.67 rows=8000 width=4) (actual time=45.234..45.234 rows=7892 loops=1)
                    Buckets: 16384  Batches: 1  Memory Usage: 406kB
                    Buffers: shared hit=2889 read=2556
                    ->  Seq Scan on products p  (cost=0.00..3456.67 rows=8000 width=4) (actual time=0.023..34.567 rows=7892 loops=1)
                          Filter: (category = 'Electronics'::text)
                          Rows Removed by Filter: 42108
                          Buffers: shared hit=2889 read=2556
*/
```

I/O Analysis:
| Component | Shared Hit | Shared Read | Total Pages | Hit Rate |
|---|---|---|---|---|
| order_items scan | 1,234 | 98,766 | 100,000 | 1.2% |
| orders index scan | 11,111 | 6,912 | 18,023 | 61.6% |
| products scan | 2,889 | 2,556 | 5,445 | 53.0% |
| Total | 15,234 | 108,234 | 123,468 | 12.3% |
Observations:
order_items dominates I/O: The 400K page table required ~100K page reads (25% of table), accounting for 91% of physical I/O.
orders benefits from index: Despite 1M rows, index scan read only 18K pages total (data pages for matching rows are clustered).
products fits in cache: High hit rate (53%) on a 5K page table—repeated access during hash join probe.
Actual I/O time (892ms) across 108,234 physical reads averages ~8.2μs per read—consistent with SSD storage.
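The hit rates and per-read latency in the table can be recomputed from the Buffers lines (page counts copied from the plan above):

```python
# (shared hit, shared read) page counts from the Buffers lines above
buffers = {
    "order_items seq scan": (1_234, 98_766),
    "orders index scan":    (11_111, 6_912),
    "products seq scan":    (2_889, 2_556),
}

total_hit = sum(h for h, _ in buffers.values())
total_read = sum(r for _, r in buffers.values())

for node, (hit, read) in buffers.items():
    print(f"{node}: {hit / (hit + read):.1%} hit rate")

print(f"overall: {total_hit / (total_hit + total_read):.1%} hit rate")
# 892 ms of measured read time spread over the physical reads:
print(f"{892 / total_read * 1000:.1f} us per read")  # 8.2 us per read
```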
Optimization Opportunities:

- order_items dominates physical I/O; a covering index over the joined and aggregated columns (order_id, product_id, quantity, unit_price) could enable an index-only scan instead of the ~100K-page heap read.
- The 1.2% hit rate on order_items suggests its working set far exceeds the cache; a larger buffer pool would convert many of those reads into hits.
- orders and products are already handled efficiently (a selective index scan and a small, partially cached table); the biggest win is in order_items.
I/O cost modeling is the foundation of query optimization. Let's consolidate the critical concepts:

- I/O is counted in pages, not rows: the page is the atomic unit of disk access.
- Sequential access is far cheaper than random access: ~200× on HDDs, still ~4-5× on SSDs.
- Cost formulas combine page counts, selectivity, and per-page costs (c_seq, c_rand) for scans and joins.
- The buffer pool scales effective I/O cost by (1 − hit rate); a 99% hit rate cuts physical I/O 100-fold.
- Cost parameters such as random_page_cost and effective_cache_size must be calibrated to your hardware.
What's Next:
I/O cost is only half the picture. Modern systems with large memories and fast storage increasingly see CPU becoming the bottleneck. The next page explores CPU Cost—how databases model computation, predicate evaluation, and processing overhead to make well-rounded optimization decisions.
You now possess a deep understanding of I/O cost modeling—the historically dominant factor in query optimization. This knowledge enables you to interpret EXPLAIN output, diagnose I/O bottlenecks, and configure database parameters for optimal performance on your specific hardware.