For decades, a single constraint dominated database design: disk I/O is slow. While CPUs doubled in speed every 18 months, disk access times improved by mere percentage points per year. This asymmetry created a profound truth that echoes through every layer of database architecture: the most expensive thing a database can do is wait for data from disk.
This reality shaped everything from B+-tree indexes (minimize random reads) to buffer pools (cache hot data in RAM) to write-ahead logging (sequential writes are fast). Understanding I/O cost isn't just about one term in a cost equation—it's about understanding why databases are architected the way they are.
Even in the era of SSDs and massive RAM, I/O cost remains the dominant factor in query optimization for most workloads. The mechanics have changed (SSDs vs. spinning disks), but the principles persist: accessing data costs time, and minimizing unnecessary access is paramount.
By the end of this page, you will understand:

- The fundamental units of database I/O (pages, blocks, extents)
- The difference between sequential and random I/O, and why it matters enormously
- How databases model disk access costs for various operations
- The role of buffer pools and how they affect I/O cost calculations
- Concrete cost formulas for table scans, index scans, and join operations
Databases don't read individual rows from disk—they read pages (also called blocks). A page is a fixed-size chunk of data, typically 4KB, 8KB, or 16KB, that serves as the atomic unit of I/O.
Why Pages?
Hardware Alignment: Disks and SSDs are organized into sectors (typically 512 bytes or 4KB). Reading aligned, larger chunks is more efficient than arbitrary byte ranges.
Amortization: The cost of a disk seek (positioning the read head) dominates read time. Once positioned, reading additional nearby bytes adds minimal overhead.
Caching Efficiency: Fixed-size pages simplify buffer pool management. Every page is the same size, making allocation and eviction straightforward.
Logging and Recovery: Transaction logs and recovery mechanisms operate at page granularity, simplifying consistency guarantees.
| Database | Default Page Size | Configurable Range | Notes |
|---|---|---|---|
| PostgreSQL | 8 KB | 1KB – 32KB (compile-time) | 8KB optimal for general workloads |
| MySQL (InnoDB) | 16 KB | 4KB – 64KB | 16KB balances row storage and I/O |
| Oracle | 8 KB | 2KB – 32KB (per tablespace) | Multiple block sizes per database |
| SQL Server | 8 KB | Fixed | Extents of 8 pages (64KB) for allocation |
| SQLite | 4 KB | 512B – 64KB | Matches OS page size by default |
Pages and Rows:
A page contains multiple rows (tuples). The exact number depends on row size:
Rows per page ≈ (Page Size - Page Header) / Average Row Size
For a PostgreSQL 8KB page with a 60-byte header and a 100-byte average row size:

Rows per page ≈ (8192 − 60) / 100 ≈ 81

This calculation matters enormously for cost estimation. Reading 10,000 rows from this table requires ⌈10,000 / 81⌉ = 124 page reads. Not 10,000 I/O operations, but 124. I/O cost is counted in pages, not rows.
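The page arithmetic can be sketched in a few lines (page size, header size, and row size are the example values above, not universal constants):

```python
import math

PAGE_SIZE = 8192    # 8 KB PostgreSQL-style page
PAGE_HEADER = 60    # header overhead from the example
AVG_ROW_SIZE = 100  # average row size from the example

# Rows that fit on one page, then pages needed for 10,000 rows
rows_per_page = (PAGE_SIZE - PAGE_HEADER) // AVG_ROW_SIZE
pages_to_read = math.ceil(10_000 / rows_per_page)

print(rows_per_page, pages_to_read)  # 81 124
```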
Rows can't span pages (in most systems), so a 5000-byte row on an 8KB page leaves 3KB unused. Large rows dramatically reduce storage efficiency. Additionally, pages maintain fill factor (e.g., 90%) to leave room for updates. These factors affect actual I/O volumes and should inform schema design.
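The impact of large rows and fill factor can be sketched the same way (the 90% fill factor and 5000-byte row are the illustrative values from the text):

```python
import math

PAGE_SIZE, PAGE_HEADER = 8192, 60  # PostgreSQL-style 8 KB page (example values)

def rows_per_page(row_size, fill_factor=1.0):
    # usable bytes after header, reduced by the fill-factor headroom
    usable = (PAGE_SIZE - PAGE_HEADER) * fill_factor
    return math.floor(usable / row_size)

# A 5000-byte row: only one fits, leaving ~3 KB of the page unused
print(rows_per_page(5000))      # 1
# 100-byte rows at 90% fill factor: room left for in-place updates
print(rows_per_page(100, 0.9))  # 73
```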
The most fundamental distinction in I/O cost modeling is between sequential and random access. This distinction has profound implications for query optimization.
The Mathematics of Disk Access:
For a traditional spinning hard drive:
Time to read one page = Seek Time + Rotational Latency + Transfer Time
Typical values:
Seek Time: ~8ms (average)
Rotational Latency: ~4ms (at 7200 RPM, half rotation average)
Transfer Time: ~0.05ms (8KB at 150 MB/s)
Total random read: ~12ms per page
Random IOPS: ~83 per second
For sequential reading, seek and rotational latency amortize across many pages:
Sequential read of N pages:
= 1 seek + 1 rotation + (N × transfer time)
= 12ms + (N × 0.05ms)
For N=1000 pages: 12ms + 50ms = 62ms
Effective: 1000 pages / 62ms ≈ 16,000 pages per second
The ratio of random to sequential cost: ~200× for HDDs
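The arithmetic above can be checked with a short sketch (the constants are the typical HDD figures quoted, not measurements):

```python
# Typical 7200 RPM HDD figures from the text (illustrative, not measured)
SEEK_MS = 8.0       # average seek time
ROTATE_MS = 4.0     # average rotational latency (half rotation)
TRANSFER_MS = 0.05  # transfer of one 8 KB page at ~150 MB/s

def random_read_ms(pages):
    # every page pays the full seek + rotation penalty
    return pages * (SEEK_MS + ROTATE_MS + TRANSFER_MS)

def sequential_read_ms(pages):
    # one seek + one rotation, then contiguous transfer
    return SEEK_MS + ROTATE_MS + pages * TRANSFER_MS

print(sequential_read_ms(1000))                         # ≈62 ms
print(random_read_ms(1000) / sequential_read_ms(1000))  # ≈194× slower random
```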
This massive difference explains why optimizers prefer sequential table scans for large result sets, why write-ahead logs append sequentially, and why clustering related rows on adjacent pages pays off.
SSDs eliminate mechanical seek time, reducing the random/sequential gap dramatically—from 200× to perhaps 4×. However, a gap persists because (1) internal SSD parallelism favors sequential patterns, (2) operating system and filesystem prefetching optimizes sequential reads, and (3) write amplification penalizes scattered writes. Cost models must be recalibrated for SSD environments.
| Storage Type | Random I/O Cost | Sequential I/O Cost | Ratio |
|---|---|---|---|
| 7200 RPM HDD | 12 ms per page | 0.05 ms per page | 240× |
| 15000 RPM HDD | 6 ms per page | 0.03 ms per page | 200× |
| SATA SSD | 0.1 ms per page | 0.02 ms per page | 5× |
| NVMe SSD | 0.02 ms per page | 0.005 ms per page | 4× |
| RAM (baseline) | 0.0001 ms per page | 0.00005 ms per page | 2× |
Database cost models use formulas parameterized by page counts and access patterns. Let's examine the core formulas for fundamental operations.
```text
=============================================================
I/O COST FORMULAS FOR CORE DATABASE OPERATIONS
=============================================================

Notation:
  B_r    = Number of pages in relation R (blocks)
  T_r    = Number of tuples in relation R
  f_r    = Blocking factor (tuples per page)
  c_seq  = Cost per sequential page read (typically 1.0)
  c_rand = Cost per random page read (typically 4.0)
  sel    = Selectivity of predicate (fraction of tuples matched)

-------------------------------------------------------------
1. SEQUENTIAL TABLE SCAN
-------------------------------------------------------------
Cost = B_r × c_seq

Example: Table with 10,000 pages, c_seq = 1.0
Cost = 10,000 × 1.0 = 10,000 cost units

This reads every page once, sequentially. Simple and predictable.

-------------------------------------------------------------
2. INDEX SCAN (Secondary Index, Unclustered)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × T_r) × c_rand      -- Random heap lookups

Each matching tuple requires a random I/O to fetch its heap page
(worst case: each tuple on a different page).

Example: 1M tuples, 10% selectivity, index height 3
Cost = 3 × 4 + (0.1 × 1,000,000) × 4
     = 12 + 400,000 = 400,012 cost units

-------------------------------------------------------------
3. INDEX SCAN (Clustered / Index-Organized)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × B_r) × c_seq       -- Sequential data scan

Data pages are ordered by index, so we read contiguous ranges.

Example: 10,000 pages, 10% selectivity, index height 3
Cost = 3 × 4 + (0.1 × 10,000) × 1
     = 12 + 1,000 = 1,012 cost units

-------------------------------------------------------------
4. INDEX-ONLY SCAN (Covering Index)
-------------------------------------------------------------
Cost = Height_index × c_rand     -- Traverse index
     + (sel × B_index) × c_seq   -- Scan index leaves only

No heap access! Much cheaper when index contains all needed columns.

Example: Index with 2,000 pages, 10% selectivity
Cost = 3 × 4 + (0.1 × 2,000) × 1
     = 12 + 200 = 212 cost units

-------------------------------------------------------------
5. NESTED LOOP JOIN (Simple)
-------------------------------------------------------------
Cost = B_outer                            -- Scan outer once
     + T_outer × (Height_index × c_rand)  -- Index lookup per outer tuple
     + (T_outer × sel_join) × c_rand      -- Heap fetch per match

For each outer tuple, traverse inner's index and fetch matching rows.

-------------------------------------------------------------
6. HASH JOIN
-------------------------------------------------------------
Cost = 3 × (B_build + B_probe)   -- Read both, write partitions, read again

Partitioning may spill to disk, tripling I/O in worst case.
When hash table fits in memory: Cost = B_build + B_probe

-------------------------------------------------------------
7. SORT-MERGE JOIN
-------------------------------------------------------------
Cost = Sort_cost(R) + Sort_cost(S) + (B_r + B_s)

External sort cost depends on memory and relation size.
For large relations: Sort_cost ≈ 2 × B × (1 + ⌈log_{M-1}(B/M)⌉)
where M = memory pages available.
```

Notice how unclustered index scans become extremely expensive for large selectivities. At around 5-15% selectivity (depending on clustering), a sequential table scan becomes cheaper than an index scan. This is why optimizers often choose 'counterintuitive' full table scans—they're genuinely faster for large result sets.
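As a sanity check, the scan formulas can be turned into small helper functions (cost-unit constants as in the notation block; an illustrative sketch, not a real optimizer):

```python
# Illustrative cost helpers mirroring scan formulas 1-3 above
C_SEQ = 1.0   # cost of one sequential page read
C_RAND = 4.0  # cost of one random page read

def seq_scan_cost(b_r):
    # formula 1: read every page once, sequentially
    return b_r * C_SEQ

def unclustered_index_scan_cost(height, t_r, sel):
    # formula 2: index traversal + one random heap fetch per matching tuple
    return height * C_RAND + (sel * t_r) * C_RAND

def clustered_index_scan_cost(height, b_r, sel):
    # formula 3: index traversal + sequential read of the matching page range
    return height * C_RAND + (sel * b_r) * C_SEQ

# Reproduce the worked examples: height-3 index, 10% selectivity
print(round(seq_scan_cost(10_000)))                           # 10000
print(round(unclustered_index_scan_cost(3, 1_000_000, 0.1)))  # 400012
print(round(clustered_index_scan_cost(3, 10_000, 0.1)))       # 1012
```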
The buffer pool (or buffer cache) is an in-memory area where the database caches recently accessed pages. Its effect on I/O cost is profound but complex to model accurately.
The Cache Hit Rate:
The most critical factor is the cache hit rate—the probability that a requested page is already in memory:
Effective I/O Cost = (1 - hit_rate) × Physical I/O Cost
If 90% of page accesses are cache hits, actual I/O cost is only 10% of the formula estimate.
Modeling Cache Behavior:
Accurate cache modeling is difficult because:
PostgreSQL's Approach:
PostgreSQL uses effective_cache_size to estimate total cache size (buffer pool + OS cache) and adjusts index scan costs based on the probability of finding index and heap pages cached:
Effective index cost = Base cost × (Random I/O cost factor adjusted for caching)
The optimizer assumes frequently accessed pages (index root/internal nodes) are likely cached, while leaf nodes and heap pages of cold data are not.
| Hit Rate | 10,000 Random I/Os | Effective I/Os | Speedup Factor |
|---|---|---|---|
| 0% | 10,000 | 10,000 | 1.0× |
| 50% | 10,000 | 5,000 | 2.0× |
| 80% | 10,000 | 2,000 | 5.0× |
| 95% | 10,000 | 500 | 20.0× |
| 99% | 10,000 | 100 | 100.0× |
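The table above can be reproduced directly from the effective-cost formula:

```python
# Effective I/O = requested I/Os that miss the cache (formula above)
def effective_ios(requested_ios, hit_rate):
    return round(requested_ios * (1 - hit_rate))

for hit_rate in (0.0, 0.5, 0.8, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: {effective_ios(10_000, hit_rate)} physical I/Os")
```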
For OLTP workloads, sizing the buffer pool to hold the 'working set' (frequently accessed tables and indexes) dramatically improves performance. For PostgreSQL, 25% of RAM is a common starting point. For dedicated database servers, 60-80% is achievable. Monitor cache hit rates—below 95% for OLTP suggests undersized caching.
PostgreSQL provides transparent, tunable I/O cost parameters. Understanding these illuminates how a production database models I/O costs.
```sql
-- Core I/O cost parameters with defaults and typical adjustments
-- =============================================================

-- Sequential page read cost (baseline = 1.0)
-- This is the baseline against which other costs are measured
seq_page_cost = 1.0

-- Random page read cost
-- Default assumes HDD; reduce for SSD (1.1 - 2.0)
random_page_cost = 4.0           -- Default (HDD-oriented)
-- random_page_cost = 1.5        -- Typical for SSD
-- random_page_cost = 1.1        -- Fast NVMe

-- Estimated effective cache size (buffer pool + OS cache)
-- Tells optimizer how much data is likely cached
effective_cache_size = '4GB'     -- Default, very conservative
-- effective_cache_size = '48GB' -- Typical for 64GB RAM server

-- Example: Impact on index vs table scan decision
-- Consider: SELECT * FROM orders WHERE status = 'shipped'
-- Assume: 1M rows, 100K pages, 5% selectivity, index height 3

-- With random_page_cost = 4.0 (HDD):
--   Index scan: 3×4 + 50,000×4 = 200,012 (50K random reads)
--   Seq scan:   100,000×1 = 100,000
--   Decision: Sequential scan wins (surprise!)

-- With random_page_cost = 1.5 (SSD):
--   Index scan: 3×1.5 + 50,000×1.5 = 75,004
--   Seq scan:   100,000×1 = 100,000
--   Decision: Index scan wins

-- Viewing current settings:
SHOW seq_page_cost;
SHOW random_page_cost;
SHOW effective_cache_size;

-- Temporarily changing for comparison:
SET random_page_cost = 1.5;
EXPLAIN SELECT * FROM large_table WHERE indexed_col = 'value';
SET random_page_cost = 4.0;
EXPLAIN SELECT * FROM large_table WHERE indexed_col = 'value';
```

Parameter Tuning Strategy:
1. random_page_cost: Lower it toward 1.1–1.5 on SSD/NVMe storage; leave it near the 4.0 default only on HDD-backed systems.

2. effective_cache_size: Set it to the memory realistically available for caching (shared_buffers plus OS page cache), often 50–75% of RAM on a dedicated server.

3. Validation Approach:
- Use EXPLAIN (ANALYZE, BUFFERS) to compare estimated vs. actual I/O
- Check pg_stat_user_tables for actual seq_scan vs. idx_scan counts

Running with the default random_page_cost = 4.0 on a fully SSD system causes the optimizer to undervalue index scans: queries that could complete in 100ms via index scan instead use table scans taking 10 seconds. Conversely, setting random_page_cost too low on HDD systems causes expensive random I/O patterns. Always calibrate to your hardware.
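The index-versus-sequential decision sketched in the SQL comments can be modeled in a few lines (same simplified unclustered-scan cost as earlier; numbers are the hypothetical ones from the example):

```python
# Hypothetical scenario from the comments above: 100K-page table,
# 50K matching rows (5% of 1M), height-3 index; unclustered worst case.
def scan_costs(b_pages, matching_rows, height, random_page_cost, seq_page_cost=1.0):
    # one random read per index level, plus one per matching row
    index_scan = height * random_page_cost + matching_rows * random_page_cost
    seq_scan = b_pages * seq_page_cost
    return index_scan, seq_scan

print(scan_costs(100_000, 50_000, 3, 4.0))  # (200012.0, 100000.0) -> seq scan wins
print(scan_costs(100_000, 50_000, 3, 1.5))  # (75004.5, 100000.0) -> index scan wins
```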
Query optimization primarily focuses on read I/O, but write operations also contribute to cost, particularly for UPDATE, DELETE, and queries requiring temporary storage.
Write Cost Multipliers:
Writes are generally more expensive than reads:
HDD sequential write: 1.5× read cost (write verification overhead)
HDD random write: 2.0× read cost (seek + write + verify)
SSD write: 1.0-1.2× read cost (plus wear leveling background)
SSD with fsync: 2-10× (depending on controller and queue depth)
Sort Spill I/O Estimation:
Data to sort: D pages
Memory available: M pages
Merge order: M − 1 (sorted runs merged per pass)
Number of passes = ⌈log_{M−1}(⌈D/M⌉)⌉
Total I/O = 2 × D × (1 + Number of passes)
[Initial write of runs + read/write per merge pass]
For 10,000 pages with 100 pages of work_mem: initial runs = ⌈10,000 / 100⌉ = 100, merge passes = ⌈log₉₉ 100⌉ = 2, so Total I/O = 2 × 10,000 × (1 + 2) = 60,000 page I/Os, six I/Os per data page. This is why increasing work_mem for sort-heavy queries dramatically improves performance.
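The spill estimate follows directly from the formula above; a minimal sketch:

```python
import math

def external_sort_io(d_pages, m_pages):
    """Merge passes and total page I/Os for an external merge sort,
    per the formula above (2 I/Os per page per pass)."""
    initial_runs = math.ceil(d_pages / m_pages)
    # each merge pass combines up to (m_pages - 1) sorted runs
    passes = math.ceil(math.log(initial_runs, m_pages - 1)) if initial_runs > 1 else 0
    total_io = 2 * d_pages * (1 + passes)
    return passes, total_io

passes, total_io = external_sort_io(10_000, 100)
print(passes, total_io)  # 2 60000
```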
Spilling to disk doesn't just add I/O—it changes the access pattern from in-memory (nanoseconds) to disk-based (milliseconds). A sort that fits in memory might take 50ms; the same sort spilling to disk might take 5 seconds. Monitor temp file usage (pg_stat_statements, auto_explain) to identify spill-prone queries.
Let's apply I/O cost understanding to analyze a real query scenario. Consider an e-commerce database with the following characteristics:
```sql
-- Schema and statistics for our analysis
-- ============================================
-- Table: orders      (1,000,000 rows, ~125,000 pages)
-- Table: order_items (5,000,000 rows, ~400,000 pages)
-- Table: products    (50,000 rows, ~5,000 pages)

-- Query: Find all orders with high-value items from electronics category
SELECT o.order_id, o.order_date,
       SUM(oi.quantity * oi.unit_price) AS total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2024-01-01'
  AND p.category = 'Electronics'
GROUP BY o.order_id, o.order_date
HAVING SUM(oi.quantity * oi.unit_price) > 1000
ORDER BY total DESC;

-- EXPLAIN ANALYZE output:
/*
Sort  (cost=234567.89..234589.12 rows=8493 width=28) (actual time=1823.456..1825.789 rows=7234 loops=1)
  Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
  Sort Method: quicksort  Memory: 892kB
  Buffers: shared hit=15234 read=108234
  I/O Timings: read=892.123
  ->  HashAggregate  (cost=234123.45..234456.78 rows=8493 width=28) (actual time=1789.012..1812.345 rows=7234 loops=1)
        Group Key: o.order_id, o.order_date
        Filter: (sum((oi.quantity * oi.unit_price)) > '1000'::numeric)
        Rows Removed by Filter: 45678
        Batches: 1  Memory Usage: 4321kB
        Buffers: shared hit=15234 read=108234
        ->  Hash Join  (cost=8234.56..189012.34 rows=453678 width=24) (actual time=234.567..1456.789 rows=423456 loops=1)
              Hash Cond: (oi.product_id = p.product_id)
              Buffers: shared hit=15234 read=108234
              ->  Hash Join  (cost=4567.89..178901.23 rows=2156789 width=28) (actual time=189.012..1123.456 rows=2034567 loops=1)
                    Hash Cond: (oi.order_id = o.order_id)
                    Buffers: shared hit=12345 read=105678
                    ->  Seq Scan on order_items oi  (cost=0.00..134567.00 rows=5000000 width=20) (actual time=0.045..567.890 rows=5000000 loops=1)
                          Buffers: shared hit=1234 read=98766
                    ->  Hash  (cost=3456.78..3456.78 rows=88889 width=12) (actual time=188.901..188.901 rows=85432 loops=1)
                          Buckets: 131072  Batches: 1  Memory Usage: 4567kB
                          Buffers: shared hit=11111 read=6912
                          ->  Index Scan using idx_orders_date on orders o  (cost=0.42..3456.78 rows=88889 width=12) (actual time=0.034..156.789 rows=85432 loops=1)
                                Index Cond: (order_date >= '2024-01-01'::date)
                                Buffers: shared hit=11111 read=6912
              ->  Hash  (cost=3456.67..3456.67 rows=8000 width=4) (actual time=45.234..45.234 rows=7892 loops=1)
                    Buckets: 16384  Batches: 1  Memory Usage: 406kB
                    Buffers: shared hit=2889 read=2556
                    ->  Seq Scan on products p  (cost=0.00..3456.67 rows=8000 width=4) (actual time=0.023..34.567 rows=7892 loops=1)
                          Filter: (category = 'Electronics'::text)
                          Rows Removed by Filter: 42108
                          Buffers: shared hit=2889 read=2556
*/
```

I/O Analysis:
| Component | Shared Hit | Shared Read | Total Pages | Hit Rate |
|---|---|---|---|---|
| order_items scan | 1,234 | 98,766 | 100,000 | 1.2% |
| orders index scan | 11,111 | 6,912 | 18,023 | 61.6% |
| products scan | 2,889 | 2,556 | 5,445 | 53.0% |
| Total | 15,234 | 108,234 | 123,468 | 12.3% |
Observations:
order_items dominates I/O: The 400K page table required ~100K page reads (25% of table), accounting for 91% of physical I/O.
orders benefits from index: Despite 1M rows, index scan read only 18K pages total (data pages for matching rows are clustered).
products fits in cache: High hit rate (53%) on a 5K page table—repeated access during hash join probe.
Actual I/O time (892ms) across 108,234 physical reads averages ~8.2μs per read—consistent with SSD storage.
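The hit rates and per-read latency in the table can be recomputed from the Buffers lines (page counts copied from the plan above):

```python
# (shared hit, shared read) page counts from the Buffers lines above
buffers = {
    "order_items seq scan": (1_234, 98_766),
    "orders index scan":    (11_111, 6_912),
    "products seq scan":    (2_889, 2_556),
}

total_hit = sum(h for h, _ in buffers.values())
total_read = sum(r for _, r in buffers.values())

for node, (hit, read) in buffers.items():
    print(f"{node}: {hit / (hit + read):.1%} hit rate")

print(f"overall: {total_hit / (total_hit + total_read):.1%} hit rate")
# 892 ms of measured read time spread over the physical reads:
print(f"{892 / total_read * 1000:.1f} us per read")  # 8.2 us per read
```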
Optimization Opportunities:

- order_items dominates physical I/O; a covering index over the joined and aggregated columns (order_id, product_id, quantity, unit_price) could enable an index-only scan instead of the ~100K-page heap read.
- The 1.2% hit rate on order_items suggests its working set far exceeds the cache; a larger buffer pool would convert many of those reads into hits.
- orders and products are already handled efficiently (a selective index scan and a small, partially cached table); the biggest win is in order_items.
I/O cost modeling is the foundation of query optimization. Let's consolidate the critical concepts:

- I/O is counted in pages, not rows: the page is the atomic unit of disk access.
- Sequential access is far cheaper than random access: ~200× on HDDs, still ~4-5× on SSDs.
- Cost formulas combine page counts, selectivity, and per-page costs (c_seq, c_rand) for scans and joins.
- The buffer pool scales effective I/O cost by (1 − hit rate); a 99% hit rate cuts physical I/O 100-fold.
- Cost parameters such as random_page_cost and effective_cache_size must be calibrated to your hardware.
What's Next:
I/O cost is only half the picture. Modern systems with large memories and fast storage increasingly see CPU becoming the bottleneck. The next page explores CPU Cost—how databases model computation, predicate evaluation, and processing overhead to make well-rounded optimization decisions.
You now possess a deep understanding of I/O cost modeling—the historically dominant factor in query optimization. This knowledge enables you to interpret EXPLAIN output, diagnose I/O bottlenecks, and configure database parameters for optimal performance on your specific hardware.