Disk I/O is often the ultimate bottleneck in data-intensive systems. CPUs operate in nanoseconds and memory in roughly a hundred nanoseconds; network round trips and mechanical disk operations both take milliseconds, millions of times slower than the CPU.
Even with modern SSDs, the gap between memory and storage remains vast. A memory access takes ~100 nanoseconds; an SSD read takes ~25 microseconds—still 250x slower. Understanding disk I/O limitations is essential for designing systems that handle data at scale.
This page covers storage performance characteristics, access patterns that destroy performance, and strategies for optimizing disk-bound workloads.
By the end of this page, you will understand the performance characteristics of different storage technologies, how access patterns impact I/O performance, techniques for diagnosing disk bottlenecks, and strategies for optimizing disk-bound workloads.
Not all storage is created equal. The performance characteristics of different storage technologies vary by orders of magnitude:
Key Storage Metrics:
| Technology | Random IOPS | Sequential Throughput | Latency | Best For |
|---|---|---|---|---|
| HDD (7200 RPM) | 75-150 IOPS | 100-200 MB/s | ~10-15 ms | Archival, sequential access |
| SATA SSD | 30K-100K IOPS | 400-550 MB/s | ~100 μs | General purpose |
| NVMe SSD (Consumer) | 100K-500K IOPS | 2-5 GB/s | ~20-50 μs | Workstations, mid-tier servers |
| NVMe SSD (Enterprise) | 500K-1M+ IOPS | 5-7 GB/s | ~10-25 μs | High-performance databases |
| Optane / Intel PMem | 500K-2M IOPS | 2-3 GB/s | ~10 μs | Ultra-low latency use cases |
| AWS EBS gp3 | 3K-16K IOPS | 125-1000 MB/s | ~0.5-2 ms | General cloud workloads |
| AWS EBS io2 | Up to 256K IOPS | Up to 4 GB/s | ~0.2-0.5 ms | High-performance cloud DB |
| Network storage (NFS) | Varies widely | Network-limited | 1-10+ ms | Shared access, less performance-critical |
The HDD vs SSD Revolution:
The transition from HDD to SSD fundamentally changed I/O performance. HDDs use spinning platters and moving read/write heads—physical mechanics that impose latency floors:
For random access (reading different parts of the disk), HDDs are limited to ~100-200 IOPS because each operation requires physical movement. SSDs have no moving parts—random access is nearly as fast as sequential access.
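To put numbers on that difference, here is a quick back-of-envelope sketch in Python. The IOPS figures are representative values taken from the table above, not measurements of any particular device:

```python
# Back-of-envelope estimate: time to serve N random 4KB reads at a given IOPS rate.
def random_read_time(num_reads: int, iops: float) -> float:
    """Seconds needed if the device sustains the given random-read IOPS."""
    return num_reads / iops

N = 100_000  # e.g. 100k index lookups that all miss the cache

for device, iops in [("HDD (150 IOPS)", 150),
                     ("SATA SSD (50K IOPS)", 50_000),
                     ("NVMe SSD (500K IOPS)", 500_000)]:
    print(f"{device:22s}: {random_read_time(N, iops):8.1f} s")

# HDD (150 IOPS)        :    666.7 s   (~11 minutes)
# SATA SSD (50K IOPS)   :      2.0 s
# NVMe SSD (500K IOPS)  :      0.2 s
```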
In cloud environments (AWS EBS, GCP Persistent Disks, Azure Managed Disks), storage is network-attached, not local. Performance depends on volume type, provisioned IOPS, and instance bandwidth limits. Always check the instance's EBS bandwidth allocation—it's often the hidden bottleneck.
```bash
#!/bin/bash
# =====================================================
# STORAGE BENCHMARKING WITH fio
# =====================================================

# fio: Flexible I/O Tester - the standard for storage benchmarking

# Random Read IOPS (4KB blocks, 32 queue depth)
fio --name=randread --ioengine=libaio --iodepth=32 \
    --rw=randread --bs=4k --direct=1 --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Random Write IOPS
fio --name=randwrite --ioengine=libaio --iodepth=32 \
    --rw=randwrite --bs=4k --direct=1 --size=4G \
    --numjobs=4 --runtime=60 --group_reporting

# Sequential Read Throughput
fio --name=seqread --ioengine=libaio --iodepth=32 \
    --rw=read --bs=1M --direct=1 --size=4G \
    --numjobs=1 --runtime=60 --group_reporting

# Sequential Write Throughput
fio --name=seqwrite --ioengine=libaio --iodepth=32 \
    --rw=write --bs=1M --direct=1 --size=4G \
    --numjobs=1 --runtime=60 --group_reporting

# =====================================================
# INTERPRETING fio OUTPUT
# =====================================================

# Sample output:
# read: IOPS=245k, BW=958MiB/s (1005MB/s)(56.2GiB/60001msec)
# lat (usec): min=4, max=9876, avg=126.43, stdev=312.45
#
# Key metrics:
# - IOPS: 245,000 random reads/second
# - BW: 958 MB/s bandwidth
# - lat avg: 126 microseconds average latency
# - lat stdev: 312 microseconds (variability)

# =====================================================
# QUICK DISK PERFORMANCE CHECK
# =====================================================

# Write test (careful: writes to current directory)
dd if=/dev/zero of=testfile bs=1G count=1 oflag=direct
# Shows sequential write speed

# Read back
dd if=testfile of=/dev/null bs=1G count=1 iflag=direct
# Shows sequential read speed

# Clean up
rm testfile
```

How you access disk matters as much as what disk you have. The same storage can perform brilliantly or terribly depending on access patterns.
Sequential vs Random Access:
| Access Pattern | HDD Performance | SSD Performance | Typical Workloads |
|---|---|---|---|
| Sequential read | 100-200 MB/s | 2-7 GB/s | Log processing, video streaming, backups |
| Sequential write | 100-200 MB/s | 1-5 GB/s | Logging, bulk data loads |
| Random read | 75-150 IOPS | 100K-1M IOPS | Database lookups, user sessions |
| Random write | 75-150 IOPS | 50K-500K IOPS | Transaction logs, updates |
For HDDs, random access is catastrophic. A 150 IOPS HDD can handle 150 database queries per second if each requires a disk read. That's 9,000 queries per minute—pathetically low for any real workload.
For SSDs, the gap is smaller but still significant. Random reads at 4KB might achieve 100K IOPS (400 MB/s equivalent), while sequential reads at 1MB blocks might achieve 3 GB/s—still 7x faster.
Performance-Killing Patterns:
```python
# =====================================================
# I/O ACCESS PATTERN COMPARISON
# =====================================================

import os
import time

# =====================================================
# ANTI-PATTERN: Many small synchronous writes
# =====================================================

def write_records_bad(records: list, filepath: str):
    """Write each record separately - terrible I/O performance."""
    for record in records:
        with open(filepath, 'ab') as f:  # Open/close per record!
            f.write(record.encode() + b'\n')
            f.flush()              # Force to disk
            os.fsync(f.fileno())   # Wait for disk

# 10,000 records at 3 IOPS (HDD) = 55 minutes!
# 10,000 records at 100K IOPS (SSD) = 0.1 seconds
# But we're also opening/closing the file 10,000 times!

# PATTERN: Batch writes with proper buffering
def write_records_good(records: list, filepath: str):
    """Batch records and write efficiently."""
    with open(filepath, 'ab') as f:
        for record in records:
            f.write(record.encode() + b'\n')
        f.flush()              # One flush for all records
        os.fsync(f.fileno())   # One sync

# 10,000 records = 1 I/O operation (plus OS buffering)

# =====================================================
# ANTI-PATTERN: Random-position reads in arbitrary order
# =====================================================

def process_ids_bad(ids: list, data_file: str):
    """Read each ID's data in the order given - random access."""
    results = []
    for id in ids:
        # IDs are not sorted by file position
        offset = get_offset_for_id(id)
        with open(data_file, 'rb') as f:
            f.seek(offset)  # Random seek
            data = f.read(1024)
            results.append(process(data))
    return results

# 10,000 IDs with random seeks on HDD = ~100 seconds
# Each seek is ~10ms

# PATTERN: Sort reads by position, then process
def process_ids_good(ids: list, data_file: str):
    """Sort IDs by file position for sequential access."""
    # Get offsets and sort
    id_offsets = [(id, get_offset_for_id(id)) for id in ids]
    id_offsets.sort(key=lambda x: x[1])  # Sort by offset

    results = []
    with open(data_file, 'rb') as f:
        for id, offset in id_offsets:
            f.seek(offset)
            data = f.read(1024)
            results.append((id, process(data)))

    # Reorder results back to original ID order
    return [r for id, r in sorted(results, key=lambda x: ids.index(x[0]))]

# Sequential access: 10,000 reads in ~1 second on HDD

# =====================================================
# PATTERN: Async I/O for parallel operations
# =====================================================

import asyncio
import aiofiles

async def read_files_parallel(filepaths: list):
    """Read multiple files in parallel - maximizes I/O utilization."""
    async def read_one(path):
        async with aiofiles.open(path, 'r') as f:
            return await f.read()

    return await asyncio.gather(*[read_one(p) for p in filepaths])

# 100 files with 10ms latency each:
# Sequential: 100 * 10ms = 1 second
# Parallel: ~10ms (all in flight simultaneously)
```

Linux uses free RAM as a disk cache (the 'buff/cache' column in free output). If your working set fits in available RAM, re-reads come from fast memory, not slow disk. Design your workload to maximize cache hits: locality of reference, re-reading the same hot data, is what wins.
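The page cache effect is easy to see for yourself. The sketch below is Linux-only and uses a hypothetical scratch path; it drops the file's cached pages with posix_fadvise, then times a cold read against a warm one:

```python
import os
import time

PATH = "/var/tmp/cache_demo.bin"   # hypothetical scratch path

# Create a ~256 MB test file if it doesn't exist yet
if not os.path.exists(PATH):
    with open(PATH, "wb") as f:
        for _ in range(256):
            f.write(os.urandom(1 << 20))   # 1 MB of random data
        f.flush()
        os.fsync(f.fileno())               # make pages clean so they can be dropped

def timed_read(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1 << 20):             # stream in 1 MB chunks
            pass
    return time.perf_counter() - start

# Ask the kernel to drop this file's cached pages (best effort, Linux only)
fd = os.open(PATH, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print(f"cold read (mostly from disk): {timed_read(PATH):.3f} s")
print(f"warm read (from page cache):  {timed_read(PATH):.3f} s")
```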
Disk I/O bottlenecks can be tricky to identify because they often manifest as high CPU iowait or slow application response time. Here's a systematic diagnostic approach:
```bash
#!/bin/bash
# =====================================================
# STEP 1: CHECK FOR I/O WAIT
# =====================================================

# Is CPU waiting on I/O?
top
# Look at '%wa' (I/O wait) in the CPU line
# High %wa (>10%) indicates disk bottleneck

# Or use vmstat
vmstat 1
# 'wa' column shows I/O wait percentage

# sar provides history
sar -u 1 10
# %iowait column

# =====================================================
# STEP 2: IDENTIFY BUSY DISKS
# =====================================================

iostat -xz 1
# Output columns explained:
# rrqm/s, wrqm/s - read/write merges (kernel coalescing)
# r/s, w/s       - reads/writes per second
# rMB/s, wMB/s   - throughput
# await          - average I/O wait time (ms) - KEY METRIC
# %util          - percentage of time disk was busy - KEY METRIC

# Example output:
# Device    r/s     w/s    rMB/s  wMB/s  await  %util
# sda       245.00  15.00  1.23   0.12   24.35  78.50

# Interpretation:
# - 245 reads/second, 15 writes/second
# - await 24ms is HIGH for SSD (should be <1ms)
# - %util 78% means disk is busy 78% of the time
# - %util >70% sustained = bottleneck

# =====================================================
# STEP 3: IDENTIFY I/O-HEAVY PROCESSES
# =====================================================

# Real-time I/O per process
iotop -o   # Show only processes with I/O

# Output:
# TID   PRIO  USER   DISK READ  DISK WRITE  SWAPIN  IO>      COMMAND
# 1234  be/4  mysql  12.45 M/s  3.21 M/s    0.00 %  87.23 %  mysqld

# IO> column shows percentage of time this process is waiting on I/O

# Process-level I/O stats
pidstat -d 1
# Shows kB_rd/s (read), kB_wr/s (write) per process

# =====================================================
# STEP 4: ANALYZE I/O PATTERNS
# =====================================================

# What operations are happening?
# Using blktrace for block-level I/O tracing

blktrace -d /dev/sda -o trace
# Run workload...
blkparse -i trace.blktrace.* -o trace.txt

# Or use bcc/eBPF tools (modern Linux)
# biolatency - I/O latency histogram
/usr/share/bcc/tools/biolatency
# Shows distribution of I/O latency

# biosnoop - trace each I/O with latency
/usr/share/bcc/tools/biosnoop
# Real-time trace of every I/O operation

# =====================================================
# STEP 5: AWS EBS-SPECIFIC DIAGNOSIS
# =====================================================

# EBS has IOPS and throughput limits
# Check CloudWatch metrics:
# - VolumeReadOps, VolumeWriteOps (total ops)
# - VolumeQueueLength (should be <1 ideally)
# - BurstBalance (for gp2, shows burst credits remaining)

# If BurstBalance is 0: you've exhausted IOPS credits
# If VolumeQueueLength >10: I/O is queuing up

# Check instance EBS bandwidth:
# Some instances have shared EBS bandwidth
# c5.xlarge: up to 4,750 Mbps EBS bandwidth
# If EBS usage exceeds this, you're throttled
```

| Symptom | Likely Cause | Key Metric | Solution |
|---|---|---|---|
| High %wa, low CPU | I/O-bound workload | iostat await, %util | Faster storage, caching, query optimization |
| High await on SSD | Saturated SSD or queuing | iostat %util, queue depth | Upgrade to higher-IOPS storage |
| Spiky I/O wait | Periodic writes (checkpoints) | Correlate with logs | Spread checkpoints, async writes |
| BurstBalance = 0 (AWS) | Exceeded burst IOPS limit | CloudWatch BurstBalance | Provision more IOPS, gp3 instead of gp2 |
| High VolumeQueueLength | I/O demand exceeds capacity | CloudWatch metrics | Scale storage or reduce I/O |
| iotop shows one process | Single-threaded I/O bottleneck | iotop IO% | Parallelize I/O in application |
iostat's 'await' (average wait time per I/O) is the most important single metric for disk performance. For SSDs, typical await should be <1ms. For HDDs, 10-15ms is expected. If await is much higher, the disk is queuing operations and you have a bottleneck.
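If you want to see where await comes from, the sketch below derives it from the kernel's cumulative per-disk counters via psutil: the change in time spent on I/O divided by the change in completed operations over a sampling window. The device name is an assumption for your system; iostat already reports this metric for you.

```python
# Rough derivation of 'await': delta(time spent on I/O) / delta(completed I/Os),
# sampled over an interval. psutil's read_time/write_time are in milliseconds.
import time
import psutil

DISK = "sda"  # assumed device name; adjust for your system (e.g. nvme0n1)

def sample(disk: str):
    c = psutil.disk_io_counters(perdisk=True)[disk]
    return (c.read_count + c.write_count, c.read_time + c.write_time)

ops1, busy_ms1 = sample(DISK)
time.sleep(5)
ops2, busy_ms2 = sample(DISK)

delta_ops = ops2 - ops1
if delta_ops:
    print(f"approx await: {(busy_ms2 - busy_ms1) / delta_ops:.2f} ms per I/O "
          f"({delta_ops / 5:.0f} IOPS)")
else:
    print("no I/O during the sample window")
```

Because the counters include queueing time, this is an average of service plus wait time, which is exactly why a saturated device shows a much higher await than its raw media latency.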
Databases are the most common source of disk I/O bottlenecks. Understanding how databases use disk helps optimize for both performance and durability.
Database I/O Components:
| Component | Access Pattern | I/O Characteristics | Optimization |
|---|---|---|---|
| Data files | Random read/write | Depends on workload | Indexes, memory for cache |
| Write-Ahead Log (WAL) | Sequential write | Critical for durability | Fast sequential storage |
| Redo logs | Sequential write | Sync writes for durability | Separate disk, battery-backed cache |
| Temp/sort files | Sequential read/write | Large sequential I/O | Fast temp storage, more memory |
| Archive logs | Sequential write, rare read | Background writes | Cheaper storage acceptable |
```sql
-- =====================================================
-- POSTGRESQL I/O OPTIMIZATION
-- =====================================================

-- Check cache hit ratio (should be >99% for OLTP)
SELECT
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read))::float as cache_hit_ratio
FROM pg_statio_user_tables;

-- Low cache_hit_ratio (<95%) = data doesn't fit in memory
-- Solution: increase shared_buffers or reduce working set

-- Check index usage
SELECT
    relname,
    idx_scan as index_scans,
    seq_scan as sequential_scans,
    n_tup_upd as updates,
    n_tup_del as deletes
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC;

-- High sequential scans on large tables = missing indexes
-- Each seq_scan reads entire table from disk

-- I/O-heavy queries (pg_stat_statements extension)
SELECT
    query,
    calls,
    total_time,
    blk_read_time,   -- Time spent reading from disk
    blk_write_time,  -- Time spent writing to disk
    (blk_read_time + blk_write_time) / total_time * 100 as io_percent
FROM pg_stat_statements
ORDER BY blk_read_time DESC
LIMIT 10;

-- High io_percent queries are I/O-bound
-- Candidates for optimization: indexes, caching

-- =====================================================
-- KEY POSTGRESQL I/O SETTINGS
-- =====================================================

-- shared_buffers: PostgreSQL's buffer cache
-- Typically 25% of RAM for dedicated DB server
-- Example: 4GB on a 16GB server
ALTER SYSTEM SET shared_buffers = '4GB';

-- effective_cache_size: Tell planner about OS cache
-- Typically 50-75% of RAM
ALTER SYSTEM SET effective_cache_size = '12GB';

-- work_mem: Memory for sorting/hashing per operation
-- Per-operation! 10 concurrent sorts × 64MB = 640MB
ALTER SYSTEM SET work_mem = '64MB';

-- maintenance_work_mem: Memory for VACUUM, CREATE INDEX
ALTER SYSTEM SET maintenance_work_mem = '1GB';

-- wal_buffers: WAL buffer size (16-64MB typically)
ALTER SYSTEM SET wal_buffers = '64MB';

-- checkpoint_timeout, max_wal_size: Control checkpoint frequency
-- More frequent = more consistent I/O, more total I/O
-- Less frequent = less total I/O, spikier performance
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '4GB';
```

Separating I/O Workloads:
For I/O-heavy databases, separating different I/O types onto different storage can dramatically improve performance. A common example is placing the WAL or redo logs on their own fast volume, so the sequential, latency-sensitive log writes are never stuck behind random data-file I/O.
The most effective I/O optimization is avoiding disk entirely. Size your database cache (shared_buffers, innodb_buffer_pool_size) so that the hot data set fits in memory. A cache hit is ~100 nanoseconds; a disk read is ~25+ microseconds, so the cache hit is roughly 250x faster.
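A quick weighted-average calculation shows why pushing the hit ratio from 95% toward 99.9% matters so much. This sketch just plugs the approximate latencies quoted above into the expected-access-time formula:

```python
# Expected access time as a weighted average of a cache hit and a disk read.
# Latencies are the illustrative figures from the text (~100 ns hit, ~25 µs miss).
HIT_NS, MISS_NS = 100, 25_000

for hit_ratio in (0.90, 0.95, 0.99, 0.999):
    avg_ns = hit_ratio * HIT_NS + (1 - hit_ratio) * MISS_NS
    print(f"hit ratio {hit_ratio:.1%}: avg access ~{avg_ns:,.0f} ns")

# hit ratio 90.0%: avg access ~2,590 ns
# hit ratio 95.0%: avg access ~1,345 ns
# hit ratio 99.0%: avg access ~349 ns
# hit ratio 99.9%: avg access ~125 ns
```

Each additional "nine" of hit ratio removes most of the remaining disk latency from the average, which is why cache sizing pays off disproportionately.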
Beyond database optimization, applications can implement patterns that minimize disk I/O impact:
```python
# =====================================================
# BUFFERED WRITING PATTERN
# =====================================================

import threading
import time
from queue import Queue
from typing import Any

class BufferedWriter:
    """Batch writes to reduce I/O operations."""

    def __init__(self, filepath: str, buffer_size: int = 1000,
                 flush_interval: float = 1.0):
        self.filepath = filepath
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval
        self.buffer = []
        self.lock = threading.Lock()

        # Background flush thread
        self._flush_thread = threading.Thread(target=self._periodic_flush)
        self._flush_thread.daemon = True
        self._flush_thread.start()

    def write(self, data: str):
        """Add data to buffer; non-blocking."""
        with self.lock:
            self.buffer.append(data)
            if len(self.buffer) >= self.buffer_size:
                self._flush()

    def _flush(self):
        """Write buffer to disk."""
        if not self.buffer:
            return

        # Take ownership of buffer
        to_write = self.buffer
        self.buffer = []

        # Single I/O operation for entire batch
        with open(self.filepath, 'a') as f:
            f.writelines(to_write)

    def _periodic_flush(self):
        """Ensure data is written even if buffer isn't full."""
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                self._flush()

# Usage:
writer = BufferedWriter('/var/log/app.log', buffer_size=1000)
for event in events:
    writer.write(event)  # Fast, buffers in memory

# Result: 1000 individual writes become 1 disk write

# =====================================================
# ASYNC FILE OPERATIONS
# =====================================================

import asyncio
import aiofiles

async def process_files_async(filepaths: list):
    """Process multiple files with non-blocking I/O."""
    async def process_one(filepath):
        async with aiofiles.open(filepath, 'r') as f:
            content = await f.read()
        # Process content...
        return len(content)

    # All files read in parallel - overlapping I/O
    results = await asyncio.gather(*[process_one(p) for p in filepaths])
    return results

# 100 files × 10ms each:
# Sequential: 1000ms
# Parallel: ~10-20ms (limited by concurrent I/O capacity)

# =====================================================
# MEMORY-MAPPED FILES FOR READ-HEAVY WORKLOADS
# =====================================================

import mmap

class MmapReader:
    """Memory-mapped file access - let OS manage caching."""

    def __init__(self, filepath: str):
        self.file = open(filepath, 'rb')
        self.mmap = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)

    def read_at(self, offset: int, length: int) -> bytes:
        """Read from any position - OS caches in page cache."""
        return self.mmap[offset:offset + length]

    def close(self):
        self.mmap.close()
        self.file.close()

# Advantages:
# - OS handles caching automatically
# - Zero-copy access to cached data
# - Efficient for random access patterns
# - Memory usage = what's currently accessed, not whole file

# =====================================================
# WRITE-AHEAD LOG PATTERN
# =====================================================

class WALWriter:
    """
    Write-ahead log: convert random writes to sequential.
    Used by databases, message queues, and other durable systems.
    """

    def __init__(self, log_path: str):
        self.log_file = open(log_path, 'ab')

    def append(self, operation: dict) -> int:
        """Append operation to log; return log sequence number."""
        import json

        # Serialize
        data = json.dumps(operation).encode() + b'\n'

        # Sequential append (very fast)
        position = self.log_file.tell()
        self.log_file.write(data)
        self.log_file.flush()

        return position

    def sync(self):
        """Ensure log is durable on disk."""
        import os
        os.fsync(self.log_file.fileno())

# Pattern:
# 1. Write operation to WAL (sequential, fast)
# 2. Apply operation to data structures (in memory)
# 3. Periodically checkpoint data to disk (batch random writes)
# 4. Truncate old WAL entries after checkpoint
```

Every write can be: (1) buffered in memory only (fast, not durable), (2) flushed to OS buffer (medium, survives process crash), or (3) fsynced to disk (slow, survives power failure). Choose the right durability level for each data type. Not everything needs full durability.
Cloud storage (AWS EBS, GCP Persistent Disk, Azure Managed Disks) has unique characteristics that differ from local storage:
Key Cloud Storage Factors:
| AWS EBS Type | IOPS | Throughput | Latency | Cost Model |
|---|---|---|---|---|
| gp2 | 100-16,000 (burst) | 128-250 MB/s | 1-2 ms | Per GB, includes IOPS |
| gp3 | 3,000-16,000 (provisioned) | 125-1000 MB/s | 1-2 ms | Separate IOPS pricing |
| io2 | Up to 256,000 | Up to 4 GB/s | 0.2-0.5 ms | High per-IOPS cost |
| st1 (HDD) | 500 IOPS base | Up to 500 MB/s | High | Cheap per GB |
| sc1 (HDD) | 250 IOPS base | Up to 250 MB/s | High | Cheapest per GB |
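gp2's burst credits are a frequent surprise in this table: performance looks fine until BurstBalance reaches zero. As a hedged sketch, the snippet below pulls BurstBalance and VolumeQueueLength for one volume from CloudWatch with boto3; the volume ID is a placeholder, and configured AWS credentials and region are assumed.

```python
# Sketch: read the EBS health metrics discussed in this section from CloudWatch.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
VOLUME_ID = "vol-0123456789abcdef0"  # placeholder volume ID

def last_hour_average(metric_name: str):
    """Return the most recent 5-minute average for an AWS/EBS metric, if any."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric_name,
        Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else None

print("BurstBalance (%): ", last_hour_average("BurstBalance"))
print("VolumeQueueLength:", last_hour_average("VolumeQueueLength"))
```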
Cloud Storage Gotchas:
```bash
#!/bin/bash
# =====================================================
# CLOUD STORAGE PERFORMANCE CHECKLIST (AWS)
# =====================================================

# 1. Check instance EBS bandwidth
#    Look up instance type in AWS docs
#    Example: m5.xlarge = up to 4,750 Mbps EBS bandwidth

# 2. Check volume configuration
aws ec2 describe-volumes --volume-ids vol-xxxxx
# Verify: VolumeType, Size, Iops, Throughput

# 3. Monitor CloudWatch metrics
#    VolumeReadOps, VolumeWriteOps - total ops
#    VolumeReadBytes, VolumeWriteBytes - throughput
#    VolumeQueueLength - pending I/O (should be <1)
#    BurstBalance - burst credits remaining (gp2)

# 4. Check for EBS throughput throttling
#    CloudWatch: EBSIOBalance%, EBSByteBalance%
#    If these drop to 0, you're being throttled

# =====================================================
# RECOMMENDATIONS BY WORKLOAD
# =====================================================

# Workload: Database (high IOPS random)
# - Use io2 for production databases
# - Provision IOPS based on peak, not average
# - Monitor VolumeQueueLength for congestion

# Workload: Analytics (high throughput sequential)
# - gp3 with high throughput provisioned
# - Or st1 for cold data with sequential access
# - Use multiple volumes in RAID 0 for higher throughput

# Workload: Logs (sequential write, rare read)
# - gp3 baseline (3000 IOPS, 125 MB/s)
# - Or st1 for very high volume, low performance need

# Workload: General application
# - Start with gp3 (better $/IOPS than gp2)
# - Provision IOPS/throughput as needed
# - Monitor and adjust

# =====================================================
# GCP PERSISTENT DISK COMPARISON
# =====================================================

# pd-standard: HDD, cheapest, low performance
# pd-balanced: SSD, moderate cost/performance
# pd-ssd: SSD, high performance
# pd-extreme: Up to 120K IOPS

# GCP disks scale IOPS with size (unlike gp3/io2, where IOPS are provisioned separately)
# 30 IOPS/GB for pd-ssd, up to 100K IOPS max
```

Disk I/O is often the ultimate bottleneck in data-intensive systems. Understanding storage characteristics, access patterns, and optimization strategies is essential for building performant applications.
Module Complete:
You've now completed the comprehensive study of Identifying Bottlenecks and understand the five major bottleneck categories covered in this module.
With this framework, you can systematically diagnose performance issues in any system, identify the true bottleneck, and apply targeted optimizations.
This diagnostic mindset, classifying the bottleneck before optimizing, is the hallmark of experienced performance engineers.