Throughout this module, we've explored throughput, latency, bandwidth utilization, and hardware bottlenecks. Each concept provides a lens for understanding I/O performance. Now we bring these perspectives together into a unified optimization methodology.
I/O performance optimization is both art and science. The science provides frameworks: measurement methodology, queuing theory, bottleneck analysis. The art lies in knowing which framework applies to a given situation, recognizing patterns from experience, and balancing competing constraints with pragmatic judgment.
Master engineers don't memorize tuning parameters—they understand systems deeply enough to derive optimal configurations from first principles. They know that the same hardware, under different workloads, requires different optimizations. They recognize when diminishing returns make further optimization wasteful, and when fundamental architectural changes are needed rather than incremental tuning.
By the end of this page, you will understand systematic approaches to I/O performance optimization, hardware selection criteria, system tuning methodologies, workload-specific optimization strategies, and the decision frameworks that guide expert performance engineers.
Effective performance optimization follows a systematic methodology rather than ad-hoc tuning. Random changes without measurement lead to wasted effort and can even degrade performance.
The Optimization Cycle
Goal Definition
Vague goals like "make it faster" lead to unfocused efforts. Specific, measurable goals drive effective optimization:
| Poor Goal | Better Goal |
|---|---|
| Improve database performance | Reduce p99 query latency from 50ms to 20ms |
| Make storage faster | Achieve 1M random read IOPS at 100µs p99 |
| Optimize file transfers | Sustain 5 GB/s sequential write throughput |
| Speed up backup | Complete daily backup window in < 4 hours |
Goals should specify the metric being optimized, its current value, the target value, and the workload conditions under which the target must be met.
Baseline Measurement
Never optimize without a baseline. Measure the current state comprehensively:
Required Baseline Metrics: throughput, IOPS, latency percentiles (p50, p99, p99.9), device and CPU utilization, and queue depths.
Baseline Conditions: record the workload mix, dataset size, concurrency level, and the exact hardware and software configuration so later measurements are directly comparable.
The most common optimization mistake is changing multiple parameters simultaneously. When performance changes (for better or worse), you cannot determine which change caused the effect. Make one change, measure, then decide whether to keep or revert before proceeding to the next hypothesis.
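A minimal sketch of that discipline in shell form follows; the device path, fio parameters, and the particular sysctl being tested are illustrative assumptions, not recommendations.

```bash
#!/bin/bash
# Hypothetical one-change-at-a-time cycle: baseline, apply ONE change, re-measure.
# Device, job parameters, and the candidate sysctl are placeholders.
DEV=/dev/nvme0n1

run_test() {
    # Short random-read probe; report IOPS from fio's JSON output
    fio --name=probe --filename=$DEV --direct=1 --rw=randread --bs=4k \
        --iodepth=64 --runtime=30 --time_based --output-format=json |
        jq '.jobs[0].read.iops'
}

echo "Baseline IOPS: $(run_test)"

# Apply exactly one candidate change...
sysctl -w vm.dirty_background_ratio=5

# ...then re-measure before deciding to keep or revert it.
echo "After change:  $(run_test)"
```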
Hardware selection is the highest-impact optimization decision. Choosing the right hardware for a workload enables performance levels that no amount of tuning on wrong hardware can achieve.
Storage Hardware Selection
| Workload Type | Recommended Storage | Rationale |
|---|---|---|
| Random read-intensive (database OLTP) | NVMe SSD, low-latency optimized | Minimizes random read latency |
| Sequential write-intensive (logging) | NVMe SSD with high sustained write | Sustained throughput over burst |
| Mixed random R/W (virtualization) | Enterprise NVMe, balanced profile | Consistent mixed-load performance |
| Archival / infrequent access | HDD or QLC SSD | Cost per GB optimized |
| Extreme latency sensitivity (HFT) | Intel Optane / CXL memory | Sub-10µs latency requirement |
| High capacity streaming (media) | High-density NVMe or HDD array | TB-scale capacity with streaming focus |
Key Storage Selection Criteria
Interface: NVMe over PCIe is mandatory for high performance. SATA III caps throughput at roughly 550 MB/s regardless of drive capability. Ensure sufficient PCIe lanes of the correct generation (4.0/5.0).
NAND Type: SLC offers best endurance and latency but highest cost. TLC provides good balance. QLC offers density at reduced write endurance and performance.
Controller Quality: Enterprise controllers handle sustained workloads, power-loss protection, and consistent performance. Consumer controllers may throttle under load.
Endurance Rating: Measured in DWPD (Drive Writes Per Day) or TBW (Total Bytes Written). Match to expected write volume; a worked conversion between the two follows this list.
Form Factor: U.2/U.3 for enterprise hot-swap. M.2 for compact installations. E1.S emerging for density.
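The two endurance ratings convert into each other. As a worked example, with an illustrative drive capacity and warranty period:

$$\text{TBW} = \text{DWPD} \times \text{Capacity (TB)} \times 365 \times \text{Warranty (years)}$$

$$1 \times 3.84\ \text{TB} \times 365 \times 5 \approx 7{,}000\ \text{TBW}$$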
Network Hardware Selection
| Workload | Recommended Network | Key Features |
|---|---|---|
| General purpose | 25 GbE | Balanced cost/performance |
| High-performance computing | 100+ GbE or InfiniBand | Low latency, high bandwidth |
| Storage networking | 25+ GbE with RDMA support | NVMe-oF, low latency required |
| Low latency (trading) | Kernel bypass, FPGA NICs | Sub-microsecond latency |
| Edge/cost-sensitive | 10 GbE | Mature, commodity pricing |
Network Selection Criteria: link bandwidth, port-to-port latency, RDMA and offload support, driver maturity, and compatibility with the existing switch infrastructure.
CPU Selection for I/O Workloads
CPU selection impacts I/O through:
Core Count: More cores handle more concurrent I/O operations. High-IOPS workloads benefit from many cores.
Clock Speed: Single-threaded latency-sensitive paths benefit from higher clocks.
PCIe Lanes: CPUs provide finite PCIe lanes. High-speed storage + networking can exhaust lanes on consumer CPUs.
Memory Channels: More channels provide more memory bandwidth for DMA operations.
NUMA Topology: Multi-socket systems require attention to device/processor affinity.
| CPU Focus | Workload Fit |
|---|---|
| High core count | Many concurrent I/O streams |
| High clock speed | Latency-sensitive, single-threaded |
| Many PCIe lanes | Many high-speed devices |
| High memory bandwidth | Large data movements, streaming |
Design systems with balanced components. An enterprise NVMe array connected via slow network, or extreme network capacity with slow storage, creates bottlenecks that waste investment. Profile expected workloads and size all components proportionally.
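A back-of-envelope check makes the point concrete; the drive count and per-drive throughput below are illustrative assumptions:

$$8 \times 7\ \text{GB/s (NVMe drives)} = 56\ \text{GB/s storage bandwidth} \qquad 100\ \text{GbE} \approx 12.5\ \text{GB/s network bandwidth}$$

Served over that network, the array delivers less than a quarter of its local read throughput: either the network must be sized up, or the extra drives are wasted investment.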
With appropriate hardware selected, system-level tuning optimizes how the operating system manages I/O resources.
Storage Subsystem Tuning
```bash
#!/bin/bash
# Comprehensive Storage Tuning Script

# ============================================
# I/O SCHEDULER SELECTION
# ============================================

# For NVMe devices: 'none' eliminates scheduler overhead
# The device handles queuing internally
for dev in /sys/block/nvme*; do
    [ -d "$dev" ] && echo "none" > $dev/queue/scheduler
done

# For SATA SSDs: 'mq-deadline' provides good balance
for dev in /sys/block/sd*; do
    if [ -d "$dev" ] && cat $dev/queue/rotational 2>/dev/null | grep -q "0"; then
        echo "mq-deadline" > $dev/queue/scheduler
    fi
done

# For HDDs: 'bfq' for desktop fairness, 'mq-deadline' for servers
for dev in /sys/block/sd*; do
    if [ -d "$dev" ] && cat $dev/queue/rotational 2>/dev/null | grep -q "1"; then
        echo "mq-deadline" > $dev/queue/scheduler
    fi
done

# ============================================
# QUEUE DEPTH OPTIMIZATION
# ============================================

# Increase queue depth for high-IOPS NVMe devices
# Default 256 is often suboptimal for enterprise drives
for dev in /sys/block/nvme*; do
    [ -d "$dev" ] && echo 1024 > $dev/queue/nr_requests
done

# ============================================
# READ-AHEAD TUNING
# ============================================

# For sequential workloads: increase read-ahead significantly
# Value in KB; default is often 128KB
for dev in /sys/block/nvme* /sys/block/sd*; do
    [ -d "$dev" ] && echo 4096 > $dev/queue/read_ahead_kb
done

# For random workloads: reduce read-ahead to avoid wasted I/O
# Uncomment for database workloads:
# for dev in /sys/block/nvme*; do
#     [ -d "$dev" ] && echo 8 > $dev/queue/read_ahead_kb
# done

# ============================================
# WRITE CACHE TUNING
# ============================================

# Increase dirty page limits for write-intensive workloads
# Allows more write buffering before flush

# Percentage of memory for dirty pages (default 20)
sysctl -w vm.dirty_ratio=40

# Start background writeback earlier (default 10)
sysctl -w vm.dirty_background_ratio=5

# Maximum age of dirty data in centiseconds (default 3000 = 30s)
sysctl -w vm.dirty_expire_centisecs=6000

# How often the writeback threads wake up (default 500 = 5s)
sysctl -w vm.dirty_writeback_centisecs=500

# ============================================
# FILESYSTEM MOUNT OPTIONS
# ============================================

# Example optimized mount options for data volume:
# mount -o noatime,nodiratime,discard /dev/nvme0n1p1 /data

# noatime: Skip access time updates (major write reduction)
# nodiratime: Skip directory access time
# discard: Enable TRIM (or use fstrim.timer for batched TRIM)

# For XFS with high I/O concurrency:
# mount -o noatime,allocsize=64m,inode64 /dev/nvme0n1p1 /data

# For ext4 with journaling optimization:
# mount -o noatime,barrier=0,data=writeback /dev/nvme0n1p1 /data
# WARNING: barrier=0 risks data loss on power failure

echo "Storage tuning applied."
```

Network Subsystem Tuning
```bash
#!/bin/bash
# Network Performance Tuning Script

IFACE=${1:-eth0}

# ============================================
# BUFFER SIZES
# ============================================

# Increase socket buffer sizes for high-throughput connections
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216

# TCP-specific buffer sizes
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# ============================================
# CONGESTION CONTROL
# ============================================

# Use BBR for better throughput on modern networks
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Enable TCP fast open for reduced connection latency
sysctl -w net.ipv4.tcp_fastopen=3

# ============================================
# LATENCY OPTIMIZATION
# ============================================

# Favor latency over throughput in the TCP receive path
# (Nagle's algorithm itself is disabled per-socket via TCP_NODELAY)
sysctl -w net.ipv4.tcp_low_latency=1

# Reduce SYN retransmission attempts (fail fast on unreachable peers)
sysctl -w net.ipv4.tcp_syn_retries=2
sysctl -w net.ipv4.tcp_synack_retries=2

# ============================================
# NIC TUNING
# ============================================

# Maximize ring buffer sizes
ethtool -G $IFACE rx 4096 tx 4096 2>/dev/null || true

# Enable interrupt coalescing for throughput
# (trades latency for reduced CPU interrupt load)
ethtool -C $IFACE rx-usecs 50 tx-usecs 50 2>/dev/null || true

# Balance interrupts across CPUs
# (let irqbalance handle, or use manual affinity)
echo 2 > /proc/irq/$(cat /proc/interrupts | grep $IFACE | awk '{print $1}' | tr -d ':')/smp_affinity 2>/dev/null || true

# Enable hardware offloads
ethtool -K $IFACE tso on gso on gro on lro on 2>/dev/null || true

# ============================================
# BUSY POLLING (trades CPU for latency)
# ============================================

# Enable busy polling for low-latency applications
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

echo "Network tuning applied to $IFACE"
```

Memory and NUMA Tuning
```bash
#!/bin/bash
# Memory and NUMA Optimization Script

# ============================================
# TRANSPARENT HUGE PAGES
# ============================================

# Disable THP for latency-sensitive workloads (databases)
# THP can cause unexpected latency spikes during compaction
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# For throughput-oriented workloads, keep enabled:
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag

# ============================================
# SWAP CONFIGURATION
# ============================================

# Reduce swappiness for I/O-heavy workloads
# Keeps more file cache in memory
sysctl -w vm.swappiness=10

# For systems with ample RAM, minimize swapping:
sysctl -w vm.swappiness=1

# ============================================
# NUMA BALANCING
# ============================================

# Disable automatic NUMA balancing for I/O-bound workloads
# Manual pinning provides better control
sysctl -w kernel.numa_balancing=0

# ============================================
# ALLOCATING ON CORRECT NUMA NODE
# ============================================

# Find NUMA node for NVMe device
for nvme in /sys/class/nvme/nvme*; do
    name=$(basename $nvme)
    node=$(cat $nvme/device/numa_node 2>/dev/null)
    echo "$name is on NUMA node $node"
done

# Example: Run application pinned to NUMA node 0
# numactl --cpunodebind=0 --membind=0 ./my_io_app

# ============================================
# PAGE CACHE TUNING
# ============================================

# Increase minimum free memory to reduce reclaim pressure
# Value in KB; set to ~1% of RAM
sysctl -w vm.min_free_kbytes=2097152  # 2GB on a 200GB system

# Controls reclaim of dentry/inode caches relative to page cache
# Lower values retain filesystem metadata caches longer
sysctl -w vm.vfs_cache_pressure=50  # Default 100

echo "Memory and NUMA tuning applied."
```

Generic tuning scripts provide starting points, but optimal settings depend on specific workloads. Sequential streaming benefits from large read-ahead; random database I/O may be harmed by it. Always validate tuning changes with representative workload testing.
Application architecture and I/O patterns often have more impact than system tuning. Poorly designed applications underutilize even the best hardware.
I/O API Selection
The choice of I/O API fundamentally determines achievable performance:
| API | Characteristics | Best For |
|---|---|---|
| Synchronous read/write | Simple, blocking, queue depth=1 | Simple scripting, low-volume I/O |
| pread/pwrite | Positional I/O, still blocking | Multi-threaded with thread-per-file |
| POSIX AIO | Async but thread-pool based | Legacy applications needing async |
| Linux AIO (libaio) | Kernel async, O_DIRECT required | Database engines, O_DIRECT workloads |
| io_uring | Modern kernel async, flexible | High-performance applications, any I/O type |
| SPDK/DPDK | User-space drivers, kernel bypass | Maximum performance, specialized workloads |
io_uring: The Modern Answer
io_uring (Linux 5.1+) provides the best combination of performance and programmability:
```c
/**
 * High-Performance I/O with io_uring
 *
 * Demonstrates best practices for achieving maximum I/O performance:
 * - Ring buffer sizes matched to expected concurrency
 * - Registered buffers to avoid per-I/O buffer mapping
 * - Batched submission and completion
 * - Optional kernel-side polling
 */

#define _GNU_SOURCE   // Needed for O_DIRECT

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

#define QUEUE_DEPTH 256
#define BLOCK_SIZE (256 * 1024)  // 256KB for high throughput
#define BATCH_SIZE 32

struct io_context {
    struct io_uring ring;
    void **buffers;
    int buffer_count;
    int fd;
};

/**
 * Initialize io_uring with performance-optimized settings
 */
int init_io_context(struct io_context *ctx, const char *path, int use_sqpoll) {
    struct io_uring_params params = {0};

    // Enable kernel-side polling for minimum latency
    // Requires root or CAP_SYS_ADMIN
    if (use_sqpoll) {
        params.flags = IORING_SETUP_SQPOLL;
        params.sq_thread_idle = 2000;  // 2s before thread sleeps
    }

    // Use SQ_AFF to pin polling thread to specific CPU
    // params.flags |= IORING_SETUP_SQ_AFF;
    // params.sq_thread_cpu = 0;  // Pin to CPU 0

    int ret = io_uring_queue_init_params(QUEUE_DEPTH, &ctx->ring, &params);
    if (ret < 0) {
        fprintf(stderr, "io_uring init failed: %d\n", ret);
        return ret;
    }

    // Open file with O_DIRECT for direct device access
    ctx->fd = open(path, O_RDONLY | O_DIRECT);
    if (ctx->fd < 0) {
        perror("open");
        io_uring_queue_exit(&ctx->ring);
        return -1;
    }

    // Pre-allocate and register buffers
    // Registered buffers avoid per-I/O buffer mapping overhead
    ctx->buffer_count = QUEUE_DEPTH;
    ctx->buffers = malloc(ctx->buffer_count * sizeof(void*));
    struct iovec *iovecs = malloc(ctx->buffer_count * sizeof(struct iovec));

    for (int i = 0; i < ctx->buffer_count; i++) {
        posix_memalign(&ctx->buffers[i], BLOCK_SIZE, BLOCK_SIZE);
        iovecs[i].iov_base = ctx->buffers[i];
        iovecs[i].iov_len = BLOCK_SIZE;
    }

    // Register buffers with kernel
    ret = io_uring_register_buffers(&ctx->ring, iovecs, ctx->buffer_count);
    if (ret < 0) {
        fprintf(stderr, "Buffer registration failed: %d\n", ret);
        // Continue without registration - performance penalty only
    }

    free(iovecs);
    return 0;
}

/**
 * Submit a batch of I/O requests
 */
int submit_batch(struct io_context *ctx, off_t *offsets, int count) {
    for (int i = 0; i < count && i < BATCH_SIZE; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ctx->ring);
        if (!sqe) {
            // Queue full - submit what we have
            io_uring_submit(&ctx->ring);
            sqe = io_uring_get_sqe(&ctx->ring);
            if (!sqe) return i;  // Still full
        }

        // Use registered buffer for best performance
        int buf_idx = i % ctx->buffer_count;
        io_uring_prep_read_fixed(sqe, ctx->fd, ctx->buffers[buf_idx],
                                 BLOCK_SIZE, offsets[i], buf_idx);

        // Store offset for completion tracking
        io_uring_sqe_set_data(sqe, (void*)(intptr_t)offsets[i]);
    }

    return io_uring_submit(&ctx->ring);
}

/**
 * Process completions efficiently with batching
 */
int process_completions(struct io_context *ctx, int min_complete) {
    struct io_uring_cqe *cqe;
    unsigned head;
    int completed = 0;

    // Wait for at least min_complete operations
    if (min_complete > 0) {
        io_uring_wait_cqe_nr(&ctx->ring, &cqe, min_complete);
    }

    // Process all available completions
    io_uring_for_each_cqe(&ctx->ring, head, cqe) {
        if (cqe->res < 0) {
            fprintf(stderr, "I/O error at offset %lld: %d\n",
                    (long long)(intptr_t)io_uring_cqe_get_data(cqe), cqe->res);
        }
        completed++;
    }

    io_uring_cq_advance(&ctx->ring, completed);
    return completed;
}

/**
 * Cleanup resources
 */
void cleanup_io_context(struct io_context *ctx) {
    io_uring_unregister_buffers(&ctx->ring);
    for (int i = 0; i < ctx->buffer_count; i++) {
        free(ctx->buffers[i]);
    }
    free(ctx->buffers);
    close(ctx->fd);
    io_uring_queue_exit(&ctx->ring);
}
```

I/O Pattern Optimization
Beyond API choice, I/O patterns significantly affect performance: batch small operations into larger requests, align I/O to the device block size, favor sequential access where the data layout allows it, and keep enough requests in flight to fill device queues without overwhelming them.
Use profiling tools (strace, blktrace, perf) to understand actual I/O patterns before optimizing. Assumptions about workload behavior are frequently wrong. Measure, don't guess.
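A few starting points with those tools are sketched below; the PID and device names are placeholders to substitute for your own system.

```bash
# Syscall-level view: which I/O calls dominate, and how often? (1234 is a placeholder PID)
strace -f -c -e trace=read,write,pread64,pwrite64,fsync -p 1234

# Block-layer view: actual request sizes, offsets, and queueing at the device
blktrace -d /dev/nvme0n1 -o - | blkparse -i -

# System-wide block request tracing for 10 seconds, then inspect the samples
perf record -e block:block_rq_issue -a -- sleep 10
perf script | head
```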
Different workloads require different optimization strategies. One-size-fits-all tuning is rarely optimal.
Database Workloads (OLTP)
Online transaction processing is characterized by small (typically 4-16 KB) random reads and writes, strict latency requirements, high concurrency, and frequent synchronous commits for durability.
Optimization Strategy: minimal read-ahead, O_DIRECT to bypass the page cache where the database manages its own buffer pool, and storage sized for random IOPS at low queue-depth latency.
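One way to sanity-check such a configuration is an fio run shaped like the workload. The block size, read/write mix, queue depth, and target file path below are representative assumptions, not a universal OLTP profile.

```bash
# Approximate an OLTP pattern: small random I/O, 70/30 read/write,
# moderate queue depth, latency percentiles reported.
fio --name=oltp_probe --filename=/data/fio_testfile --size=20G \
    --direct=1 --rw=randrw --rwmixread=70 --bs=8k \
    --ioengine=io_uring --iodepth=32 --numjobs=4 \
    --runtime=120 --time_based --group_reporting \
    --percentile_list=50:95:99:99.9
```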
Analytics Workloads (OLAP)
Analytical processing differs significantly: large sequential scans, throughput-bound queries, batch-oriented execution, and tolerance for higher per-request latency.
Optimization Strategy: large read-ahead, parallel scans across devices, and allowing the page cache to hold hot data between queries.
Streaming/Media Workloads
Video, audio, and other streaming workloads share common characteristics: large sequential reads or writes, sustained bandwidth requirements, predictable access patterns, and many concurrent streams.
Optimization Strategy: large buffers, aggressive prefetching, and bandwidth QoS so concurrent streams do not starve one another.
Virtualization Workloads
VM and container hosting presents mixed challenges: many tenants with differing I/O patterns on shared storage, unpredictable aggregate load, and the need for isolation so one noisy neighbor cannot starve the rest.
Optimization Strategy: balanced storage with headroom for mixed patterns, per-VM I/O limits, and fair scheduling across tenants.
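Per-tenant I/O limits can be expressed with the cgroup v2 io controller; the cgroup name, device major:minor numbers, and limit values below are illustrative.

```bash
# Cap one VM's cgroup at ~100 MB/s reads and 10k write IOPS on device 259:0
# (find the device's major:minor numbers with `lsblk`)
mkdir -p /sys/fs/cgroup/vm1
echo "259:0 rbps=104857600 wiops=10000" > /sys/fs/cgroup/vm1/io.max
```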
| Workload | Primary Metric | Key Tuning |
|---|---|---|
| OLTP Database | Random IOPS, low latency | Minimal read-ahead, O_DIRECT, high IOPS storage |
| OLAP Analytics | Sequential throughput | Large read-ahead, parallel scans, allow caching |
| Streaming Media | Sustained bandwidth | Large buffers, bandwidth QoS, prefetching |
| Backup/Archive | Sequential write | Large write buffer, delayed writeback, compression |
| Virtualization | Mixed, isolation | Balanced storage, I/O limits, fair scheduling |
| Logging | Write append | Sequential writes, sync options, rotate strategies |
Systems running mixed workloads (e.g., database + analytics on same storage) face conflicting optimization requirements. Consider physical separation, time-based scheduling, or dynamic tuning that adjusts based on detected workload patterns.
Performance optimization without validation is guesswork. Rigorous testing confirms improvements and prevents regressions.
Testing Methodology
1. Representative Workload. Use workloads that accurately represent production: matching block sizes, read/write mix, concurrency, dataset size, and cache state.
2. Isolation. Eliminate confounding variables: dedicated hardware, no competing background jobs, and exactly one configuration change between runs.
3. Statistical Rigor. Collect enough samples for confidence: multiple runs, discarded warm-up periods, and variance reported alongside the mean.
```bash
#!/bin/bash
# Comprehensive I/O Benchmark Suite using fio
# Tests storage performance across multiple workload patterns
#
# WARNING: the write tests are destructive to any data on $DEVICE

DEVICE=${1:-/dev/nvme0n1}
SIZE=100G
RUNTIME=60
OUTPUT_DIR=./fio_results_$(date +%Y%m%d_%H%M%S)

mkdir -p $OUTPUT_DIR

echo "=== Starting I/O Benchmark Suite ==="
echo "Device: $DEVICE"
echo "Results: $OUTPUT_DIR"
echo

# ============================================
# 1. SEQUENTIAL READ THROUGHPUT
# ============================================
echo "Running: Sequential Read Throughput..."
fio --name=seq_read \
    --filename=$DEVICE \
    --direct=1 \
    --rw=read \
    --bs=1m \
    --ioengine=io_uring \
    --iodepth=64 \
    --numjobs=4 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --group_reporting \
    --output=$OUTPUT_DIR/seq_read.json \
    --output-format=json

# ============================================
# 2. SEQUENTIAL WRITE THROUGHPUT
# ============================================
echo "Running: Sequential Write Throughput..."
fio --name=seq_write \
    --filename=$DEVICE \
    --direct=1 \
    --rw=write \
    --bs=1m \
    --ioengine=io_uring \
    --iodepth=64 \
    --numjobs=4 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --group_reporting \
    --output=$OUTPUT_DIR/seq_write.json \
    --output-format=json

# ============================================
# 3. RANDOM READ IOPS (4K)
# ============================================
echo "Running: Random Read IOPS..."
fio --name=rand_read \
    --filename=$DEVICE \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --ioengine=io_uring \
    --iodepth=256 \
    --numjobs=4 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --group_reporting \
    --output=$OUTPUT_DIR/rand_read.json \
    --output-format=json

# ============================================
# 4. RANDOM WRITE IOPS (4K)
# ============================================
echo "Running: Random Write IOPS..."
fio --name=rand_write \
    --filename=$DEVICE \
    --direct=1 \
    --rw=randwrite \
    --bs=4k \
    --ioengine=io_uring \
    --iodepth=256 \
    --numjobs=4 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --group_reporting \
    --output=$OUTPUT_DIR/rand_write.json \
    --output-format=json

# ============================================
# 5. MIXED WORKLOAD (70/30 READ/WRITE)
# ============================================
echo "Running: Mixed Workload..."
fio --name=mixed \
    --filename=$DEVICE \
    --direct=1 \
    --rw=randrw \
    --rwmixread=70 \
    --bs=8k \
    --ioengine=io_uring \
    --iodepth=128 \
    --numjobs=4 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --group_reporting \
    --output=$OUTPUT_DIR/mixed.json \
    --output-format=json

# ============================================
# 6. LATENCY TEST (QD=1)
# ============================================
echo "Running: Latency Test..."
fio --name=latency \
    --filename=$DEVICE \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --ioengine=io_uring \
    --iodepth=1 \
    --numjobs=1 \
    --size=$SIZE \
    --runtime=$RUNTIME \
    --time_based \
    --percentile_list=50:90:99:99.9:99.99 \
    --output=$OUTPUT_DIR/latency.json \
    --output-format=json

echo
echo "=== Benchmark Complete ==="
echo "Results saved to: $OUTPUT_DIR"
echo
echo "Summary:"
for f in $OUTPUT_DIR/*.json; do
    test_name=$(basename $f .json)
    echo "--- $test_name ---"
    jq '.jobs[0] | {read_bw_mb: (.read.bw / 1024),
                    write_bw_mb: (.write.bw / 1024),
                    read_iops: .read.iops,
                    write_iops: .write.iops,
                    lat_us_p99: .read.clat_ns.percentile."99.000000" / 1000}' $f 2>/dev/null
done
```

Validation Against Goals
After testing, validate against defined goals:
| Goal | Measured Result | Status |
|---|---|---|
| 1M random read IOPS | 950,000 IOPS | ⚠️ 95% of target |
| p99 latency < 100µs | 85µs | ✅ Achieved |
| 5 GB/s sequential write | 5.2 GB/s | ✅ Exceeded |
For goals not achieved: identify the limiting component, quantify the remaining gap, and decide whether further tuning, different hardware, or a revised goal is the appropriate response.
Build automated benchmark suites that run with every configuration change. Store results in a time-series database. Trend analysis catches regressions early and quantifies the impact of changes over time.
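One lightweight way to record each run, assuming a Prometheus Pushgateway is available (its address, the file paths, and the metric names here are assumptions):

```bash
# Push headline numbers from fio JSON results to a Pushgateway for trending
IOPS=$(jq '.jobs[0].read.iops' fio_results/rand_read.json)
P99=$(jq '.jobs[0].read.clat_ns.percentile."99.000000" / 1000' fio_results/latency.json)

cat <<EOF | curl --data-binary @- http://pushgateway.internal:9091/metrics/job/io_benchmark
fio_rand_read_iops $IOPS
fio_read_lat_p99_us $P99
EOF
```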
Performance optimization is not a one-time event. Systems change, workloads evolve, and hardware ages. Continuous monitoring and periodic re-optimization maintain performance over time.
Monitoring Strategy
Real-Time Dashboards. Display current performance metrics for immediate visibility: throughput, IOPS, latency percentiles, device utilization, and error counts.
Alerting. Notify on performance anomalies: latency exceeding SLO thresholds, sustained device saturation, or sudden drops in throughput.
Capacity Planning
Project future performance needs:
$$\text{Months until capacity} = \frac{\text{Available capacity} - \text{Current usage}}{\text{Growth rate}}$$
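With illustrative numbers:

$$\frac{100\ \text{TB} - 60\ \text{TB}}{4\ \text{TB/month}} = 10\ \text{months until capacity}$$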
Track trends in throughput, IOPS, latency percentiles, device utilization, and capacity consumed.
Plan hardware upgrades before reaching capacity limits. It's better to have 30% headroom than to hit saturation unexpectedly.
Maintenance Tasks
Regular maintenance preserves performance:
| Task | Frequency | Purpose |
|---|---|---|
| TRIM/discard on SSDs | Daily (fstrim.timer) | Maintain SSD performance |
| File system defrag (HDD) | Weekly | Reduce fragmentation |
| Monitor SMART data | Daily | Early failure detection |
| Review performance trends | Weekly | Detect gradual degradation |
| Re-baseline benchmarks | Monthly | Validate sustained performance |
| Review tuning parameters | Quarterly | Adjust for workload changes |
| Capacity planning review | Quarterly | Plan future resources |
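Two of those tasks map to one-liners on most Linux distributions; the device names below are placeholders.

```bash
# Batched TRIM on a schedule instead of the 'discard' mount option
systemctl enable --now fstrim.timer

# Health/SMART checks for early failure detection
smartctl -a /dev/sda          # SATA/SAS drives
nvme smart-log /dev/nvme0     # NVMe drives
```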
Degradation Detection
Performance degrades over time due to:
Storage aging: SSDs slow as cells wear; HDDs develop bad sectors.
Fragmentation: File system fragmentation increases seek times.
Bloat: Databases and logs accumulate, increasing I/O volume.
Configuration drift: Manual changes accumulate inconsistencies.
Compare periodic benchmarks to original baseline. A 20% degradation from baseline warrants investigation.
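A minimal comparison sketch, assuming baseline and current fio results are stored as JSON files with the hypothetical paths shown:

```bash
# Flag a >20% random-read IOPS drop relative to the stored baseline
BASE=$(jq '.jobs[0].read.iops' baseline/rand_read.json)
CUR=$(jq '.jobs[0].read.iops' latest/rand_read.json)

awk -v b="$BASE" -v c="$CUR" 'BEGIN {
    drop = (b - c) / b * 100
    printf "IOPS change: %.1f%%\n", -drop
    if (drop > 20) print "WARNING: degradation beyond 20% threshold - investigate"
}'
```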
Capture performance tuning in version-controlled configuration (Ansible, Puppet, Terraform). This ensures consistency across systems, enables rollback, and documents the rationale for each setting.
I/O performance optimization integrates hardware selection, system tuning, application design, and operational practices into a cohesive discipline. Success requires methodology, measurement, and iteration.
Module Complete
You have now completed Module 6: I/O Hardware Performance. You understand throughput, latency, bandwidth utilization, hardware bottlenecks, and performance optimization comprehensively. This knowledge equips you to set measurable performance goals, select hardware matched to your workloads, tune storage, network, and memory subsystems, choose appropriate I/O APIs, and validate every change with rigorous benchmarking.
Congratulations! You've mastered I/O Hardware Performance. You now possess the deep understanding required to architect, tune, and optimize I/O subsystems at the level of an experienced systems engineer. Apply these principles systematically, and you will consistently achieve excellent I/O performance in any system you work with.