Disk I/O is often the ultimate bottleneck in data-intensive systems. CPUs operate in nanoseconds and memory in roughly a hundred nanoseconds; network round trips and mechanical disk operations both take milliseconds, millions of times slower than the CPU.
Even with modern SSDs, the gap between memory and storage remains vast. A memory access takes ~100 nanoseconds; an SSD read takes ~25 microseconds—still 250x slower. Understanding disk I/O limitations is essential for designing systems that handle data at scale.
This page covers storage performance characteristics, access patterns that destroy performance, and strategies for optimizing disk-bound workloads.
By the end of this page, you will understand the performance characteristics of different storage technologies, how access patterns impact I/O performance, techniques for diagnosing disk bottlenecks, and strategies for optimizing disk-bound workloads.
Not all storage is created equal. The performance characteristics of different storage technologies vary by orders of magnitude:
Key Storage Metrics:
| Technology | Random IOPS | Sequential Throughput | Latency | Best For |
|---|---|---|---|---|
| HDD (7200 RPM) | 75-150 IOPS | 100-200 MB/s | ~10-15 ms | Archival, sequential access |
| SATA SSD | 30K-100K IOPS | 400-550 MB/s | ~100 μs | General purpose |
| NVMe SSD (Consumer) | 100K-500K IOPS | 2-5 GB/s | ~20-50 μs | Workstations, mid-tier servers |
| NVMe SSD (Enterprise) | 500K-1M+ IOPS | 5-7 GB/s | ~10-25 μs | High-performance databases |
| Optane / Intel PMem | 500K-2M IOPS | 2-3 GB/s | ~10 μs | Ultra-low latency use cases |
| AWS EBS gp3 | 3K-16K IOPS | 125-1000 MB/s | ~0.5-2 ms | General cloud workloads |
| AWS EBS io2 | Up to 256K IOPS | Up to 4 GB/s | ~0.2-0.5 ms | High-performance cloud DB |
| Network storage (NFS) | Varies widely | Network-limited | 1-10+ ms | Shared access, less performance-critical |
The HDD vs SSD Revolution:
The transition from HDD to SSD fundamentally changed I/O performance. HDDs use spinning platters and moving read/write heads—physical mechanics that impose latency floors:
For random access (reading different parts of the disk), HDDs are limited to ~100-200 IOPS because each operation requires physical movement. SSDs have no moving parts—random access is nearly as fast as sequential access.
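To put numbers on that difference, here is a quick back-of-envelope sketch in Python. The IOPS figures are representative values taken from the table above, not measurements of any particular device:

```python
# Back-of-envelope estimate: time to serve N random 4KB reads at a given IOPS rate.
def random_read_time(num_reads: int, iops: float) -> float:
    """Seconds needed if the device sustains the given random-read IOPS."""
    return num_reads / iops

N = 100_000  # e.g. 100k index lookups that all miss the cache

for device, iops in [("HDD (150 IOPS)", 150),
                     ("SATA SSD (50K IOPS)", 50_000),
                     ("NVMe SSD (500K IOPS)", 500_000)]:
    print(f"{device:22s}: {random_read_time(N, iops):8.1f} s")

# HDD (150 IOPS)        :    666.7 s   (~11 minutes)
# SATA SSD (50K IOPS)   :      2.0 s
# NVMe SSD (500K IOPS)  :      0.2 s
```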
In cloud environments (AWS EBS, GCP Persistent Disks, Azure Managed Disks), storage is network-attached, not local. Performance depends on volume type, provisioned IOPS, and instance bandwidth limits. Always check the instance's EBS bandwidth allocation—it's often the hidden bottleneck.
```bash
#!/bin/bash
# =====================================================
# STORAGE BENCHMARKING WITH fio
# =====================================================

# fio: Flexible I/O Tester - the standard for storage benchmarking

# Random Read IOPS (4KB blocks, 32 queue depth)
fio --name=randread --ioengine=libaio --iodepth=32 \
    --rw=randread --bs=4k --direct=1 --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Random Write IOPS
fio --name=randwrite --ioengine=libaio --iodepth=32 \
    --rw=randwrite --bs=4k --direct=1 --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Sequential Read Throughput
fio --name=seqread --ioengine=libaio --iodepth=32 \
    --rw=read --bs=1M --direct=1 --size=4G \
    --numjobs=1 --runtime=60 --group_reporting

# Sequential Write Throughput
fio --name=seqwrite --ioengine=libaio --iodepth=32 \
    --rw=write --bs=1M --direct=1 --size=4G \
    --numjobs=1 --runtime=60 --group_reporting

# =====================================================
# INTERPRETING fio OUTPUT
# =====================================================

# Sample output:
# read: IOPS=245k, BW=958MiB/s (1005MB/s)(56.2GiB/60001msec)
# lat (usec): min=4, max=9876, avg=126.43, stdev=312.45
#
# Key metrics:
# - IOPS: 245,000 random reads/second
# - BW: 958 MB/s bandwidth
# - lat avg: 126 microseconds average latency
# - lat stdev: 312 microseconds (variability)

# =====================================================
# QUICK DISK PERFORMANCE CHECK
# =====================================================

# Write test (careful: writes to current directory)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
# Shows sequential write speed

# Read back
dd if=testfile of=/dev/null bs=1G count=1 iflag=direct
# Shows sequential read speed

# Clean up
rm testfile
```

How you access disk matters as much as what disk you have. The same storage can perform brilliantly or terribly depending on access patterns.
Sequential vs Random Access:
| Access Pattern | HDD Performance | SSD Performance | Typical Workloads |
|---|---|---|---|
| Sequential read | 100-200 MB/s | 2-7 GB/s | Log processing, video streaming, backups |
| Sequential write | 100-200 MB/s | 1-5 GB/s | Logging, bulk data loads |
| Random read | 75-150 IOPS | 100K-1M IOPS | Database lookups, user sessions |
| Random write | 75-150 IOPS | 50K-500K IOPS | Transaction logs, updates |
For HDDs, random access is catastrophic. A 150 IOPS HDD can handle 150 database queries per second if each requires a disk read. That's 9,000 queries per minute—pathetically low for any real workload.
For SSDs, the gap is smaller but still significant. Random reads at 4KB might achieve 100K IOPS (400 MB/s equivalent), while sequential reads at 1MB blocks might achieve 3 GB/s—still 7x faster.
Performance-Killing Patterns:
```python
# =====================================================
# I/O ACCESS PATTERN COMPARISON
# =====================================================

import os
import time

# =====================================================
# ANTI-PATTERN: Many small synchronous writes
# =====================================================

def write_records_bad(records: list, filepath: str):
    """Write each record separately - terrible I/O performance."""
    for record in records:
        with open(filepath, 'ab') as f:  # Open/close per record!
            f.write(record.encode() + b'\n')
            f.flush()              # Force to disk
            os.fsync(f.fileno())   # Wait for disk

# 10,000 records at 3 IOPS (HDD) = 55 minutes!
# 10,000 records at 100K IOPS (SSD) = 0.1 seconds
# But we're also opening/closing the file 10,000 times!

# PATTERN: Batch writes with proper buffering
def write_records_good(records: list, filepath: str):
    """Batch records and write efficiently."""
    with open(filepath, 'ab') as f:
        for record in records:
            f.write(record.encode() + b'\n')
        f.flush()              # One flush for all records
        os.fsync(f.fileno())   # One sync

# 10,000 records = 1 I/O operation (plus OS buffering)

# =====================================================
# ANTI-PATTERN: Random-position reads in arbitrary order
# =====================================================

def process_ids_bad(ids: list, data_file: str):
    """Read each ID's data in the order given - random access."""
    results = []
    for id in ids:
        # IDs are not sorted by file position
        offset = get_offset_for_id(id)
        with open(data_file, 'rb') as f:
            f.seek(offset)  # Random seek
            data = f.read(1024)
            results.append(process(data))
    return results

# 10,000 IDs with random seeks on HDD = ~100 seconds
# Each seek is ~10ms

# PATTERN: Sort reads by position, then process
def process_ids_good(ids: list, data_file: str):
    """Sort IDs by file position for sequential access."""
    # Get offsets and sort
    id_offsets = [(id, get_offset_for_id(id)) for id in ids]
    id_offsets.sort(key=lambda x: x[1])  # Sort by offset

    results = []
    with open(data_file, 'rb') as f:
        for id, offset in id_offsets:
            f.seek(offset)
            data = f.read(1024)
            results.append((id, process(data)))

    # Reorder results back to original ID order
    return [r for id, r in sorted(results, key=lambda x: ids.index(x[0]))]

# Sequential access: 10,000 reads in ~1 second on HDD

# =====================================================
# PATTERN: Async I/O for parallel operations
# =====================================================

import asyncio
import aiofiles

async def read_files_parallel(filepaths: list):
    """Read multiple files in parallel - maximizes I/O utilization."""
    async def read_one(path):
        async with aiofiles.open(path, 'r') as f:
            return await f.read()

    return await asyncio.gather(*[read_one(p) for p in filepaths])

# 100 files with 10ms latency each:
# Sequential: 100 * 10ms = 1 second
# Parallel: ~10ms (all in flight simultaneously)
```

Linux uses free RAM as a disk cache (the 'buff/cache' column in free output). If your working set fits in available RAM, re-reads come from fast memory, not slow disk. Design your workload to maximize cache hits: locality of reference, re-reading the same hot data, is what wins.
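The page cache effect is easy to see for yourself. The sketch below is Linux-only and uses a hypothetical scratch path; it drops the file's cached pages with posix_fadvise, then times a cold read against a warm one:

```python
import os
import time

PATH = "/var/tmp/cache_demo.bin"   # hypothetical scratch path

# Create a ~256 MB test file if it doesn't exist yet
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        for _ in range(256):
            f.write(os.urandom(1 << 20))   # 1 MB of random data
        f.flush()
        os.fsync(f.fileno())               # make pages clean so they can be dropped

def timed_read(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):             # stream in 1 MB chunks
            pass
    return time.perf_counter() - start

# Ask the kernel to drop this file's cached pages (best effort, Linux only)
fd = os.open(PATH, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print(f"cold read (mostly from disk): {timed_read(PATH):.3f} s")
print(f"warm read (from page cache):  {timed_read(PATH):.3f} s")
```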
Disk I/O bottlenecks can be tricky to identify because they often manifest as high CPU iowait or slow application response time. Here's a systematic diagnostic approach:
```bash
#!/bin/bash
# =====================================================
# STEP 1: CHECK FOR I/O WAIT
# =====================================================

# Is CPU waiting on I/O?
top
# Look at '%wa' (I/O wait) in the CPU line
# High %wa (>10%) indicates disk bottleneck

# Or use vmstat
vmstat 1
# 'wa' column shows I/O wait percentage

# sar provides history
sar -u 1 10
# %iowait column

# =====================================================
# STEP 2: IDENTIFY BUSY DISKS
# =====================================================

iostat -xz 1
# Output columns explained:
# rrqm/s, wrqm/s - read/write merges (kernel coalescing)
# r/s, w/s       - reads/writes per second
# rMB/s, wMB/s   - throughput
# await          - average I/O wait time (ms) - KEY METRIC
# %util          - percentage of time disk was busy - KEY METRIC

# Example output:
# Device    r/s     w/s    rMB/s  wMB/s  await  %util
# sda       245.00  15.00  1.23   0.12   24.35  78.50

# Interpretation:
# - 245 reads/second, 15 writes/second
# - await 24ms is HIGH for SSD (should be <1ms)
# - %util 78% means disk is busy 78% of the time
# - %util >70% sustained = bottleneck

# =====================================================
# STEP 3: IDENTIFY I/O-HEAVY PROCESSES
# =====================================================

# Real-time I/O per process
iotop -o   # Show only processes with I/O

# Output:
# TID   PRIO  USER   DISK READ  DISK WRITE  SWAPIN  IO>      COMMAND
# 1234  be/4  mysql  12.45 M/s  3.21 M/s    0.00 %  87.23 %  mysqld

# IO> column shows percentage of time this process is waiting on I/O

# Process-level I/O stats
pidstat -d 1
# Shows kB_rd/s (read), kB_wr/s (write) per process

# =====================================================
# STEP 4: ANALYZE I/O PATTERNS
# =====================================================

# What operations are happening?
# Using blktrace for block-level I/O tracing

blktrace -d /dev/sda -o trace
# Run workload...
blkparse -i trace.blktrace.* -o trace.txt

# Or use bcc/eBPF tools (modern Linux)
# biolatency - I/O latency histogram
/usr/share/bcc/tools/biolatency
# Shows distribution of I/O latency

# biosnoop - trace each I/O with latency
/usr/share/bcc/tools/biosnoop
# Real-time trace of every I/O operation

# =====================================================
# STEP 5: AWS EBS-SPECIFIC DIAGNOSIS
# =====================================================

# EBS has IOPS and throughput limits
# Check CloudWatch metrics:
# - VolumeReadOps, VolumeWriteOps (total ops)
# - VolumeQueueLength (should be <1 ideally)
# - BurstBalance (for gp2, shows burst credits remaining)

# If BurstBalance is 0: you've exhausted IOPS credits
# If VolumeQueueLength >10: I/O is queuing up

# Check instance EBS bandwidth:
# Some instances have shared EBS bandwidth
# c5.xlarge: up to 4,750 Mbps EBS bandwidth
# If EBS usage exceeds this, you're throttled
```

| Symptom | Likely Cause | Key Metric | Solution |
|---|---|---|---|
| High %wa, low CPU | I/O-bound workload | iostat await, %util | Faster storage, caching, query optimization |
| High await on SSD | Saturated SSD or queuing | iostat %util, queue depth | Upgrade to higher-IOPS storage |
| Spiky I/O wait | Periodic writes (checkpoints) | Correlate with logs | Spread checkpoints, async writes |
| BurstBalance = 0 (AWS) | Exceeded burst IOPS limit | CloudWatch BurstBalance | Provision more IOPS, gp3 instead of gp2 |
| High VolumeQueueLength | I/O demand exceeds capacity | CloudWatch metrics | Scale storage or reduce I/O |
| iotop shows one process | Single-threaded I/O bottleneck | iotop IO% | Parallelize I/O in application |
iostat's 'await' (average wait time per I/O) is the most important single metric for disk performance. For SSDs, typical await should be <1ms. For HDDs, 10-15ms is expected. If await is much higher, the disk is queuing operations and you have a bottleneck.
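If you want to see where await comes from, the sketch below derives it from the kernel's cumulative per-disk counters via psutil: the change in time spent on I/O divided by the change in completed operations over a sampling window. The device name is an assumption for your system; iostat already reports this metric for you.

```python
# Rough derivation of 'await': delta(time spent on I/O) / delta(completed I/Os),
# sampled over an interval. psutil's read_time/write_time are in milliseconds.
import time
import psutil

DISK = "sda"  # assumed device name; adjust for your system (e.g. nvme0n1)

def sample(disk: str):
    c = psutil.disk_io_counters(perdisk=True)[disk]
    return (c.read_count + c.write_count, c.read_time + c.write_time)

ops1, busy_ms1 = sample(DISK)
time.sleep(5)
ops2, busy_ms2 = sample(DISK)

delta_ops = ops2 - ops1
if delta_ops:
    print(f"approx await: {(busy_ms2 - busy_ms1) / delta_ops:.2f} ms per I/O "
          f"({delta_ops / 5:.0f} IOPS)")
else:
    print("no I/O during the sample window")
```

Because the counters include queueing time, this is an average of service plus wait time, which is exactly why a saturated device shows a much higher await than its raw media latency.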
Databases are the most common source of disk I/O bottlenecks. Understanding how databases use disk helps optimize for both performance and durability.
Database I/O Components:
| Component | Access Pattern | I/O Characteristics | Optimization |
|---|---|---|---|
| Data files | Random read/write | Depends on workload | Indexes, memory for cache |
| Write-Ahead Log (WAL) | Sequential write | Critical for durability | Fast sequential storage |
| Redo logs | Sequential write | Sync writes for durability | Separate disk, battery-backed cache |
| Temp/sort files | Sequential read/write | Large sequential I/O | Fast temp storage, more memory |
| Archive logs | Sequential write, rare read | Background writes | Cheaper storage acceptable |
```sql
-- =====================================================
-- POSTGRESQL I/O OPTIMIZATION
-- =====================================================

-- Check cache hit ratio (should be >99% for OLTP)
SELECT
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read))::float as cache_hit_ratio
FROM pg_statio_user_tables;

-- Low cache_hit_ratio (<95%) = data doesn't fit in memory
-- Solution: increase shared_buffers or reduce working set

-- Check index usage
SELECT
    relname,
    idx_scan as index_scans,
    seq_scan as sequential_scans,
    n_tup_upd as updates,
    n_tup_del as deletes
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC;

-- High sequential scans on large tables = missing indexes
-- Each seq_scan reads entire table from disk

-- I/O-heavy queries (pg_stat_statements extension)
SELECT
    query,
    calls,
    total_time,
    blk_read_time,   -- Time spent reading from disk
    blk_write_time,  -- Time spent writing to disk
    (blk_read_time + blk_write_time) / total_time * 100 as io_percent
FROM pg_stat_statements
ORDER BY blk_read_time DESC
LIMIT 10;

-- High io_percent queries are I/O-bound
-- Candidates for optimization: indexes, caching

-- =====================================================
-- KEY POSTGRESQL I/O SETTINGS
-- =====================================================

-- shared_buffers: PostgreSQL's buffer cache
-- Typically 25% of RAM for dedicated DB server
-- Example: 4GB on a 16GB server
ALTER SYSTEM SET shared_buffers = '4GB';

-- effective_cache_size: Tell planner about OS cache
-- Typically 50-75% of RAM
ALTER SYSTEM SET effective_cache_size = '12GB';

-- work_mem: Memory for sorting/hashing per operation
-- Per-operation! 10 concurrent sorts × 64MB = 640MB
ALTER SYSTEM SET work_mem = '64MB';

-- maintenance_work_mem: Memory for VACUUM, CREATE INDEX
ALTER SYSTEM SET maintenance_work_mem = '1GB';

-- wal_buffers: WAL buffer size (16-64MB typically)
ALTER SYSTEM SET wal_buffers = '64MB';

-- checkpoint_timeout, max_wal_size: Control checkpoint frequency
-- More frequent = more consistent I/O, more total I/O
-- Less frequent = less total I/O, spikier performance
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '4GB';
```

Separating I/O Workloads:
For I/O-heavy databases, separating different I/O types onto different storage can dramatically improve performance. A common example is placing the WAL or redo logs on their own fast volume, so the sequential, latency-sensitive log writes are never stuck behind random data-file I/O.
The most effective I/O optimization is avoiding disk entirely. Size your database cache (shared_buffers, innodb_buffer_pool_size) so that the hot data set fits in memory. A cache hit is ~100 nanoseconds; a disk read is ~25+ microseconds, so the cache hit is roughly 250x faster.
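A quick weighted-average calculation shows why pushing the hit ratio from 95% toward 99.9% matters so much. This sketch just plugs the approximate latencies quoted above into the expected-access-time formula:

```python
# Expected access time as a weighted average of a cache hit and a disk read.
# Latencies are the illustrative figures from the text (~100 ns hit, ~25 µs miss).
HIT_NS, MISS_NS = 100, 25_000

for hit_ratio in (0.90, 0.95, 0.99, 0.999):
    avg_ns = hit_ratio * HIT_NS + (1 - hit_ratio) * MISS_NS
    print(f"hit ratio {hit_ratio:.1%}: avg access ~{avg_ns:,.0f} ns")

# hit ratio 90.0%: avg access ~2,590 ns
# hit ratio 95.0%: avg access ~1,345 ns
# hit ratio 99.0%: avg access ~349 ns
# hit ratio 99.9%: avg access ~125 ns
```

Each additional "nine" of hit ratio removes most of the remaining disk latency from the average, which is why cache sizing pays off disproportionately.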
Beyond database optimization, applications can implement patterns that minimize disk I/O impact:
```python
# =====================================================
# BUFFERED WRITING PATTERN
# =====================================================

import threading
import time
from queue import Queue
from typing import Any

class BufferedWriter:
    """Batch writes to reduce I/O operations."""

    def __init__(self, filepath: str, buffer_size: int = 1000,
                 flush_interval: float = 1.0):
        self.filepath = filepath
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval
        self.buffer = []
        self.lock = threading.Lock()

        # Background flush thread
        self._flush_thread = threading.Thread(target=self._periodic_flush)
        self._flush_thread.daemon = True
        self._flush_thread.start()

    def write(self, data: str):
        """Add data to buffer; non-blocking."""
        with self.lock:
            self.buffer.append(data)
            if len(self.buffer) >= self.buffer_size:
                self._flush()

    def _flush(self):
        """Write buffer to disk."""
        if not self.buffer:
            return

        # Take ownership of buffer
        to_write = self.buffer
        self.buffer = []

        # Single I/O operation for entire batch
        with open(self.filepath, 'a') as f:
            f.writelines(to_write)

    def _periodic_flush(self):
        """Ensure data is written even if buffer isn't full."""
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                self._flush()

# Usage:
writer = BufferedWriter('/var/log/app.log', buffer_size=1000)
for event in events:
    writer.write(event)  # Fast, buffers in memory

# Result: 1000 individual writes become 1 disk write

# =====================================================
# ASYNC FILE OPERATIONS
# =====================================================

import asyncio
import aiofiles

async def process_files_async(filepaths: list):
    """Process multiple files with non-blocking I/O."""
    async def process_one(filepath):
        async with aiofiles.open(filepath, 'r') as f:
            content = await f.read()
        # Process content...
        return len(content)

    # All files read in parallel - overlapping I/O
    results = await asyncio.gather(*[process_one(p) for p in filepaths])
    return results

# 100 files × 10ms each:
# Sequential: 1000ms
# Parallel: ~10-20ms (limited by concurrent I/O capacity)

# =====================================================
# MEMORY-MAPPED FILES FOR READ-HEAVY WORKLOADS
# =====================================================

import mmap

class MmapReader:
    """Memory-mapped file access - let OS manage caching."""

    def __init__(self, filepath: str):
        self.file = open(filepath, 'rb')
        self.mmap = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)

    def read_at(self, offset: int, length: int) -> bytes:
        """Read from any position - OS caches in page cache."""
        return self.mmap[offset:offset + length]

    def close(self):
        self.mmap.close()
        self.file.close()

# Advantages:
# - OS handles caching automatically
# - Zero-copy access to cached data
# - Efficient for random access patterns
# - Memory usage = what's currently accessed, not whole file

# =====================================================
# WRITE-AHEAD LOG PATTERN
# =====================================================

class WALWriter:
    """
    Write-ahead log: convert random writes to sequential.
    Used by databases, message queues, and other durable systems.
    """

    def __init__(self, log_path: str):
        self.log_file = open(log_path, 'ab')

    def append(self, operation: dict) -> int:
        """Append operation to log; return log sequence number."""
        import json

        # Serialize
        data = json.dumps(operation).encode() + b'\n'

        # Sequential append (very fast)
        position = self.log_file.tell()
        self.log_file.write(data)
        self.log_file.flush()

        return position

    def sync(self):
        """Ensure log is durable on disk."""
        import os
        os.fsync(self.log_file.fileno())

# Pattern:
# 1. Write operation to WAL (sequential, fast)
# 2. Apply operation to data structures (in memory)
# 3. Periodically checkpoint data to disk (batch random writes)
# 4. Truncate old WAL entries after checkpoint
```

Every write can be: (1) buffered in memory only (fast, not durable), (2) flushed to OS buffer (medium, survives process crash), or (3) fsynced to disk (slow, survives power failure). Choose the right durability level for each data type. Not everything needs full durability.
Cloud storage (AWS EBS, GCP Persistent Disk, Azure Managed Disks) has unique characteristics that differ from local storage:
Key Cloud Storage Factors:
| AWS EBS Type | IOPS | Throughput | Latency | Cost Model |
|---|---|---|---|---|
| gp2 | 100-16,000 (burst) | 128-250 MB/s | 1-2 ms | Per GB, includes IOPS |
| gp3 | 3,000-16,000 (provisioned) | 125-1000 MB/s | 1-2 ms | Separate IOPS pricing |
| io2 | Up to 256,000 | Up to 4 GB/s | 0.2-0.5 ms | High per-IOPS cost |
| st1 (HDD) | 500 IOPS base | Up to 500 MB/s | High | Cheap per GB |
| sc1 (HDD) | 250 IOPS base | Up to 250 MB/s | High | Cheapest per GB |
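gp2's burst credits are a frequent surprise in this table: performance looks fine until BurstBalance reaches zero. As a hedged sketch, the snippet below pulls BurstBalance and VolumeQueueLength for one volume from CloudWatch with boto3; the volume ID is a placeholder, and configured AWS credentials and region are assumed.

```python
# Sketch: read the EBS health metrics discussed in this section from CloudWatch.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID

def last_hour_average(metric_name: str):
    """Return the most recent 5-minute average for an AWS/EBS metric, if any."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric_name,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None

print("BurstBalance (%): ", last_hour_average("BurstBalance"))
print("VolumeQueueLength:", last_hour_average("VolumeQueueLength"))
```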
Cloud Storage Gotchas:
```bash
#!/bin/bash
# =====================================================
# CLOUD STORAGE PERFORMANCE CHECKLIST (AWS)
# =====================================================

# 1. Check instance EBS bandwidth
#    Look up instance type in AWS docs
#    Example: m5.xlarge = up to 4,750 Mbps EBS bandwidth

# 2. Check volume configuration
aws ec2 describe-volumes --volume-ids vol-xxxxx
# Verify: VolumeType, Size, Iops, Throughput

# 3. Monitor CloudWatch metrics
#    VolumeReadOps, VolumeWriteOps - total ops
#    VolumeReadBytes, VolumeWriteBytes - throughput
#    VolumeQueueLength - pending I/O (should be <1)
#    BurstBalance - burst credits remaining (gp2)

# 4. Check for EBS throughput throttling
#    CloudWatch: EBSIOBalance%, EBSByteBalance%
#    If these drop to 0, you're being throttled

# =====================================================
# RECOMMENDATIONS BY WORKLOAD
# =====================================================

# Workload: Database (high IOPS random)
# - Use io2 for production databases
# - Provision IOPS based on peak, not average
# - Monitor VolumeQueueLength for congestion

# Workload: Analytics (high throughput sequential)
# - gp3 with high throughput provisioned
# - Or st1 for cold data with sequential access
# - Use multiple volumes in RAID 0 for higher throughput

# Workload: Logs (sequential write, rare read)
# - gp3 baseline (3000 IOPS, 125 MB/s)
# - Or st1 for very high volume, low performance need

# Workload: General application
# - Start with gp3 (better $/IOPS than gp2)
# - Provision IOPS/throughput as needed
# - Monitor and adjust

# =====================================================
# GCP PERSISTENT DISK COMPARISON
# =====================================================

# pd-standard: HDD, cheapest, low performance
# pd-balanced: SSD, moderate cost/performance
# pd-ssd: SSD, high performance
# pd-extreme: Up to 120K IOPS

# GCP disks scale IOPS with size (unlike gp3/io2, where IOPS are provisioned separately)
# 30 IOPS/GB for pd-ssd, up to 100K IOPS max
```

Disk I/O is often the ultimate bottleneck in data-intensive systems. Understanding storage characteristics, access patterns, and optimization strategies is essential for building performant applications.
Module Complete:
You've now completed the comprehensive study of Identifying Bottlenecks and understand the five major bottleneck categories covered in this module.
With this framework, you can systematically diagnose performance issues in any system, identify the true bottleneck, and apply targeted optimizations.
This diagnostic mindset, classifying the bottleneck before optimizing, is the hallmark of experienced performance engineers.