Every computation, every file access, every network packet, every database query ultimately depends on one critical operation: moving data. While processors execute billions of instructions per second and memory responds in nanoseconds, the speed at which data flows through I/O subsystems often determines the actual performance users experience. A report whose computation takes seconds can spend minutes waiting on disk I/O. A real-time video stream stutters not because of codec complexity, but because data cannot arrive fast enough.
At the heart of understanding I/O performance lies a deceptively simple concept: throughput—the rate at which data moves through a system. Yet beneath this simple definition lies a rich landscape of engineering trade-offs, physical constraints, protocol overhead, and optimization opportunities that separate ordinary systems from high-performance ones.
By the end of this page, you will understand how to measure, analyze, and reason about I/O throughput. You'll learn the difference between theoretical and achievable throughput, understand the factors that limit real-world performance, and develop the mental models needed to diagnose and optimize I/O bottlenecks in production systems.
I/O throughput is formally defined as the amount of data successfully transferred per unit time between a source and destination. It is the fundamental metric for quantifying data movement capacity and is typically expressed in bytes per second (B/s) or its multiples (KB/s, MB/s, GB/s).
Alternatively, throughput may be expressed in bits per second (bps) and its multiples (Kbps, Mbps, Gbps), particularly for network interfaces.
The conversion factor of 8 bits per byte means that a 1 Gbps network link theoretically transfers 125 MB/s of payload data—though as we'll see, reality is considerably more nuanced.
Storage manufacturers often use decimal prefixes (1 GB = 10⁹ bytes), while operating systems traditionally use binary prefixes (1 GiB = 2³⁰ bytes = 1,073,741,824 bytes). This ~7.4% difference causes confusion when comparing advertised capacities versus reported values. Always verify which convention is in use when analyzing throughput measurements.
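As a concrete illustration of the gap, a drive advertised as 1 TB (decimal) holds

$$\frac{10^{12}\ \text{bytes}}{2^{30}\ \text{bytes/GiB}} \approx 931\ \text{GiB}$$

so a tool reporting binary units shows roughly 931 GiB for a "1 TB" device. The same care applies to throughput figures: 1 MB/s ($10^{6}$ B/s) and 1 MiB/s ($2^{20}$ B/s) differ by about 4.9%.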
The Throughput Equation
At its most fundamental level, throughput can be expressed as:
$$\text{Throughput} = \frac{\text{Data Transferred}}{\text{Time Elapsed}}$$
However, this simple equation obscures critical details. A more accurate model accounts for the lifecycle of an I/O operation:
$$T_{effective} = \frac{D}{T_{setup} + T_{transfer} + T_{completion}}$$
Where:
- $D$ is the payload size of the transfer
- $T_{setup}$ is the time to initiate the operation (command construction, queuing, seek or connection setup)
- $T_{transfer}$ is the time the data spends actually moving across the interface
- $T_{completion}$ is the time to finalize the operation (status handling, interrupt processing)
This decomposition reveals why small I/O operations often achieve much lower throughput than large ones: the fixed overhead ($T_{setup} + T_{completion}$) dominates when $D$ is small.
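A quick worked example makes the effect concrete; the numbers are illustrative, assuming 100 µs of fixed per-operation overhead and a 1 GB/s raw transfer rate:

$$T_{4\,\text{KB}} = \frac{4\ \text{KB}}{100\ \mu\text{s} + 4\ \mu\text{s}} \approx 39\ \text{MB/s} \qquad T_{1\,\text{MB}} = \frac{1\ \text{MB}}{100\ \mu\text{s} + 1000\ \mu\text{s}} \approx 910\ \text{MB/s}$$

The same device delivers more than twenty times the throughput simply because larger requests amortize the fixed overhead.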
| Category | Definition | Use Case |
|---|---|---|
| Raw Throughput | Maximum theoretical data rate of the physical medium | Interface specifications, hardware design limits |
| Effective Throughput | Actual data rate achieved after protocol overhead | Application-level performance measurement |
| Sustained Throughput | Throughput maintained over extended periods | Long-running batch operations, streaming workloads |
| Peak Throughput | Maximum momentary throughput during burst transfers | Cache hits, burst I/O patterns |
| Aggregate Throughput | Combined throughput across multiple channels or devices | RAID arrays, parallel I/O subsystems |
Accurate throughput measurement is both critical and surprisingly complex. Different measurement methodologies yield different results, and understanding these differences is essential for proper system analysis.
Sequential vs Random I/O Throughput
For storage devices, throughput varies dramatically based on access patterns:
Sequential throughput measures performance when accessing contiguous data blocks in order. This pattern allows devices to optimize for streaming transfers—HDDs can minimize seek overhead, SSDs can leverage internal parallelism, and caches achieve high hit rates.
Random throughput measures performance when accessing non-contiguous blocks in unpredictable order. This worst-case pattern exposes all overhead costs: seek latency, rotational delay, flash translation layer lookups, and cache misses.
Block Size Impact on Measured Throughput
The size of individual I/O requests profoundly affects measured throughput. Consider the relationship:
$$\text{Throughput} = \text{IOPS} \times \text{Block Size}$$
Where IOPS (I/O Operations Per Second) represents the rate of completed I/O requests. For a device capable of 10,000 IOPS:
| Block Size | Calculated Throughput |
|---|---|
| 4 KB | 40 MB/s |
| 64 KB | 640 MB/s |
| 256 KB | 2,560 MB/s |
| 1 MB | 10,000 MB/s |
This relationship explains why database workloads (typically 4-16 KB blocks) achieve vastly different throughput than backup operations (often 256 KB+ blocks) on identical hardware.
```c
/**
 * Throughput Measurement Framework
 *
 * Demonstrates proper methodology for measuring I/O throughput
 * with consideration for warmup, multiple iterations, and
 * statistical analysis of results.
 */
#define _GNU_SOURCE           /* required for O_DIRECT on Linux */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <string.h>
#include <errno.h>
#include <math.h>             /* sqrt() for standard deviation */

#define KB (1024ULL)
#define MB (1024ULL * KB)
#define GB (1024ULL * MB)

#define WARMUP_ITERATIONS      3
#define MEASUREMENT_ITERATIONS 10
#define DEFAULT_TRANSFER_SIZE  (1 * GB)
#define MAX_BLOCK_SIZE         (1 * MB)

typedef struct {
    double throughput_mbps;     // Measured MB/s
    double elapsed_seconds;     // Total time
    size_t bytes_transferred;   // Actual bytes moved
    int    error_count;         // Any I/O errors
} BenchmarkResult;

typedef struct {
    double mean;      // Average throughput
    double std_dev;   // Standard deviation
    double min;       // Minimum observed
    double max;       // Maximum observed
    double p95;       // 95th percentile
} ThroughputStatistics;

/**
 * High-resolution timer for nanosecond precision
 */
static inline double get_time_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/**
 * Measure sequential read throughput
 *
 * Key considerations:
 * - Use O_DIRECT to bypass OS buffer cache (measures device speed)
 * - Align buffer to page boundary (required for O_DIRECT)
 * - Use large block sizes to minimize syscall overhead
 */
BenchmarkResult measure_seq_read_throughput(
    const char* device_path,
    size_t total_bytes,
    size_t block_size)
{
    BenchmarkResult result = {0};

    // O_DIRECT bypasses OS caching for accurate device measurement
    int fd = open(device_path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("Failed to open device");
        result.error_count = 1;
        return result;
    }

    // Allocate aligned buffer (O_DIRECT requirement)
    void* buffer;
    if (posix_memalign(&buffer, 4096, block_size) != 0) {
        result.error_count = 1;
        close(fd);
        return result;
    }

    // Perform measurement
    double start_time = get_time_seconds();
    size_t total_read = 0;

    while (total_read < total_bytes) {
        ssize_t bytes = read(fd, buffer, block_size);
        if (bytes <= 0) {
            if (bytes < 0 && errno != EINTR) {
                result.error_count++;
            }
            break;
        }
        total_read += bytes;
    }

    double end_time = get_time_seconds();

    // Calculate results (floating-point division to avoid truncation)
    result.elapsed_seconds   = end_time - start_time;
    result.bytes_transferred = total_read;
    result.throughput_mbps   = ((double)total_read / MB) / result.elapsed_seconds;

    free(buffer);
    close(fd);
    return result;
}

/**
 * Calculate statistical summary of throughput measurements
 */
ThroughputStatistics calculate_statistics(
    double* measurements,
    int count)
{
    ThroughputStatistics stats = {0};

    // Calculate mean, min, max
    for (int i = 0; i < count; i++) {
        stats.mean += measurements[i];
        if (measurements[i] < stats.min || stats.min == 0)
            stats.min = measurements[i];
        if (measurements[i] > stats.max)
            stats.max = measurements[i];
    }
    stats.mean /= count;

    // Calculate standard deviation
    for (int i = 0; i < count; i++) {
        double diff = measurements[i] - stats.mean;
        stats.std_dev += diff * diff;
    }
    stats.std_dev = sqrt(stats.std_dev / count);

    // Sort for percentile calculation
    // (simplified insertion sort for demonstration)
    for (int i = 1; i < count; i++) {
        double key = measurements[i];
        int j = i - 1;
        while (j >= 0 && measurements[j] > key) {
            measurements[j + 1] = measurements[j];
            j--;
        }
        measurements[j + 1] = key;
    }
    stats.p95 = measurements[(int)(count * 0.95)];

    return stats;
}

/**
 * Run comprehensive throughput benchmark
 */
void run_throughput_benchmark(const char* device_path) {
    size_t block_sizes[] = {4*KB, 16*KB, 64*KB, 256*KB, 1*MB};
    int num_block_sizes = sizeof(block_sizes) / sizeof(block_sizes[0]);

    printf("\n%-12s %12s %12s %12s %12s\n",
           "Block Size", "Mean (MB/s)", "StdDev", "Min", "Max");
    printf("%-12s %12s %12s %12s %12s\n",
           "----------", "-----------", "------", "---", "---");

    for (int bs = 0; bs < num_block_sizes; bs++) {
        double measurements[MEASUREMENT_ITERATIONS];

        // Warmup iterations (discard results)
        for (int i = 0; i < WARMUP_ITERATIONS; i++) {
            measure_seq_read_throughput(
                device_path,
                DEFAULT_TRANSFER_SIZE / 10,
                block_sizes[bs]
            );
        }

        // Actual measurement iterations
        for (int i = 0; i < MEASUREMENT_ITERATIONS; i++) {
            BenchmarkResult result = measure_seq_read_throughput(
                device_path,
                DEFAULT_TRANSFER_SIZE,
                block_sizes[bs]
            );
            measurements[i] = result.throughput_mbps;
        }

        ThroughputStatistics stats =
            calculate_statistics(measurements, MEASUREMENT_ITERATIONS);

        printf("%-12zu %12.2f %12.2f %12.2f %12.2f\n",
               block_sizes[bs] / KB,
               stats.mean, stats.std_dev, stats.min, stats.max);
    }
}

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <device-or-file-path>\n", argv[0]);
        return 1;
    }
    run_throughput_benchmark(argv[1]);
    return 0;
}
```

Always include warmup iterations before measurement to stabilize caches and trigger any just-in-time optimizations. Collect multiple samples and report statistical measures (mean, standard deviation, percentiles) rather than single values. Use O_DIRECT when measuring actual device throughput to bypass OS caching, but remember that applications typically benefit from OS caching.
A fundamental reality of I/O systems is that practical throughput never achieves theoretical maximum. Understanding this gap—and the factors that cause it—is essential for realistic capacity planning and performance optimization.
Theoretical Throughput
Theoretical throughput represents the maximum data rate that the physical interface can sustain under ideal conditions. For example:
| Interface | Theoretical Throughput | Calculation |
|---|---|---|
| SATA III | 6 Gbps = 600 MB/s | 6 Gbps line rate ÷ 10 bits per byte (8b/10b) |
| PCIe 4.0 x4 | 64 Gbps = 8 GB/s | 16 GT/s × 4 lanes ÷ 8 bits per byte (before encoding) |
| NVMe over PCIe 4.0 x4 | ~7.88 GB/s | After 128b/130b encoding |
| USB 3.2 Gen 2 | 10 Gbps = 1.25 GB/s | SuperSpeed+ specification |
| 10 GbE | 10 Gbps = 1.25 GB/s | Wire speed maximum |
These numbers represent wire speed—the maximum signaling capacity of the physical layer.
The Overhead Cascade
Multiple layers of overhead reduce practical throughput:
1. Encoding Overhead Physical interfaces use line coding to maintain signal integrity. SATA uses 8b/10b encoding (20% overhead), while PCIe 4.0+ uses 128b/130b (~1.5% overhead). This is a fixed tax on all transfers.
2. Protocol Overhead Every I/O operation includes framing, addressing, commands, and status information beyond actual data. A 4 KB NVMe read might require:
- a 64-byte submission queue entry plus a doorbell write to start the command
- one or more PCIe transaction-layer packets, each carrying header and framing bytes alongside the payload
- a 16-byte completion queue entry and an interrupt (or a polled completion) to finish the operation
3. Software Stack Overhead Data traverses multiple software layers, each adding latency and consuming CPU cycles: the application's buffers, the system-call interface, the filesystem and page cache, the block layer and I/O scheduler, and finally the device driver.
4. Interleaving Overhead Physical media cannot always stream continuously. HDDs have seek time and rotational delay. SSDs have internal garbage collection. Networks have packet gaps and retransmissions.
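To put protocol overhead (point 2 above) in perspective, consider TCP over Ethernet with a standard 1500-byte MTU. Each frame occupies 1538 bytes on the wire (preamble, Ethernet header, FCS, and inter-frame gap) but carries at most 1460 bytes of TCP payload:

$$\eta_{protocol} = \frac{1460}{1538} \approx 0.949$$

which caps 10 GbE payload throughput near 1.19 GB/s before any TCP or host-side inefficiencies, consistent with the sustained figures in the table below.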
| Interface | Theoretical Max | Typical Sustained | Efficiency | Primary Loss Factors |
|---|---|---|---|---|
| SATA III SSD | 600 MB/s | 550 MB/s | ~92% | 8b/10b encoding, command overhead |
| 7,200 RPM HDD (seq) | 200 MB/s | 150-180 MB/s | ~80% | Track switching, zone density variation |
| PCIe 4.0 NVMe SSD | 7,880 MB/s | 5,000-7,000 MB/s | 65-90% | Controller limits, thermal throttling |
| 10 GbE (TCP) | 1,250 MB/s | 1,100-1,180 MB/s | ~92% | Ethernet framing, IP/TCP headers |
| USB 3.2 Gen 2 | 1,250 MB/s | 900-1,050 MB/s | ~80% | Protocol overhead, cable quality |
The Reality Check Formula
A useful heuristic for estimating practical throughput:
$$T_{practical} \approx T_{theoretical} \times \eta_{encoding} \times \eta_{protocol} \times \eta_{media}$$
Where:
- $\eta_{encoding}$ is the efficiency of the physical line coding (e.g., 0.8 for 8b/10b, ~0.985 for 128b/130b)
- $\eta_{protocol}$ is the fraction of transferred bits that are payload rather than headers, commands, and acknowledgments
- $\eta_{media}$ captures device-specific losses such as seeks, garbage collection, controller limits, and thermal throttling
Example: Estimating NVMe SSD Throughput
For a PCIe 4.0 x4 NVMe SSD: $T_{theoretical} \approx 8$ GB/s, $\eta_{encoding} \approx 0.985$ (128b/130b), $\eta_{protocol} \approx 0.95$, and $\eta_{media} \approx 0.85$ to $0.95$, giving $T_{practical} \approx 8 \times 0.985 \times 0.95 \times 0.9 \approx 6.7$ GB/s.
This matches observed real-world throughput of 6-7 GB/s for high-end NVMe drives.
While theoretical throughput provides upper bounds, real-world performance depends heavily on workload characteristics. A database performing random 4 KB reads achieves vastly different throughput than a video editor streaming sequential 1 MB blocks—even on identical hardware. Always measure throughput under representative workload conditions.
I/O throughput is influenced by a complex interplay of hardware capabilities, software design, workload characteristics, and environmental factors. Understanding these dependencies enables targeted optimization and realistic performance predictions.
Hardware Factors
On the hardware side, the interface generation and lane count, the storage medium and its controller, DMA and interrupt handling capability, and the host's CPU and memory bandwidth all set upper bounds on achievable throughput.
Software Factors
The software stack profoundly impacts achieved throughput, often more than hardware selection: the choice of filesystem and its allocation strategy, the block-layer I/O scheduler, page-cache and read-ahead policy, system-call and memory-copy overhead, and whether I/O is issued synchronously or asynchronously all shape the request stream the device actually sees.
Workload Characteristics
The pattern of I/O requests fundamentally shapes achievable throughput:
| Pattern | Description | Throughput Impact |
|---|---|---|
| Sequential Large | Contiguous blocks, large requests (1 MB+) | Highest throughput; approaches interface limits |
| Sequential Small | Contiguous blocks, small requests (4-16 KB) | Good throughput; limited by IOPS × block size |
| Random Large | Non-contiguous blocks, large requests | Moderate throughput; media seek/access overhead |
| Random Small | Non-contiguous blocks, small requests | Lowest throughput; dominated by latency |
| Mixed | Combination of patterns | Varies; often worse than either pure pattern |
The Queue Depth Effect
Queue depth—the number of simultaneous outstanding I/O requests—dramatically influences throughput, particularly for devices with internal parallelism:
| Queue Depth | Typical Impact |
|---|---|
| 1 | ~30-40% of peak throughput; device idles between requests |
| 4 | ~50-70% of peak; some pipelining possible |
| 16 | ~80-90% of peak; good internal parallelism exploitation |
| 32+ | ~95%+ of peak; diminishing returns beyond this |
NVMe SSDs are designed for high queue depths (up to 64K queues × 64K depth), which is why they excel in enterprise workloads that generate many concurrent requests, but may show similar performance to SATA SSDs in single-threaded desktop workloads.
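Little's law explains why: the number of outstanding requests needed equals throughput (in operations per second) times average completion latency. Assuming, for illustration, 64 KB requests and 100 µs average completion latency on a 7,000 MB/s drive:

$$\text{QD} \approx \text{IOPS} \times \text{latency} \approx \frac{7000\ \text{MB/s}}{64\ \text{KB}} \times 100\ \mu\text{s} \approx 107{,}000 \times 0.0001\ \text{s} \approx 11$$

A single thread issuing one synchronous request at a time can therefore leave roughly 90% of that capacity idle.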
Modern file systems and applications should use asynchronous I/O (io_uring on Linux, IOCP on Windows) to maintain appropriate queue depths. Synchronous I/O with single-threaded applications fundamentally limits queue depth to 1, leaving most device capacity unused.
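The sketch below shows one way to keep a fixed number of reads in flight with Linux's io_uring (via the liburing helpers). It is a minimal illustration, not a complete benchmark: the file name, queue depth, and block size are arbitrary choices, and error handling is abbreviated. Build with `-luring`.

```c
// Minimal io_uring sketch: submit QUEUE_DEPTH reads before reaping any
// completions, so the device always sees a deep queue. Illustrative only.
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 32
#define BLOCK_SIZE  (64 * 1024)

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) { perror("queue_init"); return 1; }

    char *buffers[QUEUE_DEPTH];

    // Queue QUEUE_DEPTH reads at increasing offsets, then submit them all at once.
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        buffers[i] = malloc(BLOCK_SIZE);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buffers[i], BLOCK_SIZE,
                           (unsigned long long)i * BLOCK_SIZE);
    }
    io_uring_submit(&ring);

    // Reap completions; a real benchmark would resubmit here to keep the queue full.
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0)
            fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    for (int i = 0; i < QUEUE_DEPTH; i++) free(buffers[i]);
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```

The key point is that all 32 reads are in flight before any completion is handled, which is what a synchronous read loop can never achieve.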
Throughput manifests differently across storage, network, and peripheral I/O subsystems, each with unique characteristics and optimization strategies.
Storage I/O Throughput
Storage throughput depends heavily on the storage medium and access pattern. Modern storage hierarchies exhibit vast throughput ranges:
| Storage Type | Sequential Read | Sequential Write | Random Read (4K) | Random Write (4K) |
|---|---|---|---|---|
| 7,200 RPM HDD | 150-200 MB/s | 150-180 MB/s | 0.5-2 MB/s | 0.5-2 MB/s |
| 15,000 RPM SAS HDD | 200-250 MB/s | 200-230 MB/s | 1-2 MB/s | 1-2 MB/s |
| SATA SSD | 550 MB/s | 520 MB/s | 150-350 MB/s | 100-300 MB/s |
| NVMe SSD (PCIe 3.0) | 3,500 MB/s | 3,000 MB/s | 400-600 MB/s | 250-400 MB/s |
| NVMe SSD (PCIe 4.0) | 7,000 MB/s | 5,500 MB/s | 600-1000 MB/s | 400-600 MB/s |
| Intel Optane (3D XPoint) | 2,700 MB/s | 2,200 MB/s | 2,400 MB/s | 1,900 MB/s |
The contrast between sequential and random throughput is striking. An HDD with 180 MB/s sequential throughput achieves only 1 MB/s for random 4K reads—a 180× difference. This gap, caused by mechanical seek latency (~10 ms per seek), fundamentally shapes storage system design.
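The arithmetic behind that gap: if each random 4 KB read pays roughly 10 ms of seek and rotational delay, then

$$T_{random} \approx \frac{4\ \text{KB}}{10\ \text{ms}} \approx 0.4\ \text{MB/s}$$

so even modest request reordering and locality are needed to reach the 0.5-2 MB/s range shown in the table above.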
Network I/O Throughput
Network throughput is bounded by link capacity, protocol overhead, and congestion:
| Network Type | Wire Speed | Practical TCP | Overhead Sources |
|---|---|---|---|
| 1 GbE | 125 MB/s | 110-120 MB/s | Ethernet framing, IP/TCP headers, ACKs |
| 10 GbE | 1.25 GB/s | 1.1-1.18 GB/s | Same + switch latency, buffer limits |
| 25 GbE | 3.125 GB/s | 2.8-3.0 GB/s | Same + NIC processing limits |
| 100 GbE | 12.5 GB/s | 10-11 GB/s | Same + multiple streams needed |
| InfiniBand HDR | 25 GB/s | 23-24 GB/s | RDMA bypasses OS stack |
Network applications often require multiple parallel streams or RDMA (Remote Direct Memory Access) to saturate high-speed links. A single TCP stream frequently tops out at a few gigabits per second, particularly over higher-latency paths, because the congestion window and the bandwidth-delay product cap the amount of data in flight.
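The bandwidth-delay product makes the single-stream limit concrete. To keep a 10 GbE link full at a 10 ms round-trip time, the sender must have

$$1.25\ \text{GB/s} \times 0.010\ \text{s} = 12.5\ \text{MB}$$

of unacknowledged data in flight; with only a 1 MB effective window the same stream tops out around $1\ \text{MB} / 10\ \text{ms} = 100\ \text{MB/s}$ (about 0.8 Gbps), regardless of link capacity.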
Peripheral I/O Throughput
Peripheral throughput varies enormously based on device class:
| Peripheral Type | Typical Throughput | Limiting Factor |
|---|---|---|
| USB Mouse/Keyboard | 1-10 KB/s | Low-frequency polling |
| USB Audio (48 kHz stereo) | 384 KB/s | Audio sample rate |
| USB 4K Webcam | 45-100 MB/s | Video resolution/framerate |
| Thunderbolt 4 Storage | 3,000 MB/s | PCIe 3.0 x4 tunneling |
| PCIe GPU (PCIe 4.0 x16) | 32 GB/s | PCIe link bandwidth |
Peripheral throughput optimization often focuses on reducing CPU overhead (via DMA), batching transfers, and matching buffer sizes to typical payload sizes.
In real systems, multiple I/O subsystems compete for shared resources (PCIe lanes, memory bandwidth, CPU attention). Total system throughput may be less than the sum of individual device throughputs due to contention and scheduling overhead. Careful system design balances load across independent I/O paths.
Optimizing I/O throughput requires a systematic approach across hardware selection, software architecture, and operational tuning.
Hardware-Level Optimization
At the hardware level, throughput gains come from faster interfaces (NVMe over PCIe 4.0/5.0 rather than SATA), striping across multiple devices, placing devices on independent PCIe lanes and NUMA nodes, and provisioning enough RAM for effective caching.
Software-Level Optimization
Software optimizations often provide the largest throughput gains:
```bash
#!/bin/bash
# Linux I/O Throughput Optimization Checklist

# ============================================
# 1. SCHEDULER SELECTION
# ============================================
# NVMe devices benefit from 'none' or 'mq-deadline'
# HDDs benefit from 'bfq' for fairness or 'mq-deadline'

# Check current scheduler for nvme0n1
cat /sys/block/nvme0n1/queue/scheduler

# Set optimal scheduler (requires root)
echo "none" > /sys/block/nvme0n1/queue/scheduler        # For NVMe
echo "mq-deadline" > /sys/block/sda/queue/scheduler     # For SSD/HDD

# ============================================
# 2. QUEUE DEPTH TUNING
# ============================================
# Increase queue depth for high-throughput workloads
# Default is often 128-256; can increase for NVMe

# Check current queue depth
cat /sys/block/nvme0n1/queue/nr_requests

# Increase queue depth (power of 2 recommended)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# ============================================
# 3. READ-AHEAD TUNING
# ============================================
# Increase read-ahead for sequential workloads
# Value is in kilobytes (read_ahead_kb)

# Check current read-ahead (in KB)
cat /sys/block/nvme0n1/queue/read_ahead_kb

# Set 16 MB read-ahead for streaming workloads
echo 16384 > /sys/block/nvme0n1/queue/read_ahead_kb

# ============================================
# 4. FILESYSTEM MOUNT OPTIONS
# ============================================
# XFS/ext4 options for throughput:
# - noatime: Skip access time updates
# - nodiratime: Skip directory access time
# - discard: Enable TRIM (or use fstrim.timer)
# - nobarrier: Disable write barriers (DANGER: data loss risk)

# Example fstab entry for NVMe data volume:
# /dev/nvme0n1p1  /data  xfs  defaults,noatime,discard  0 2

# ============================================
# 5. MEMORY AND CACHING
# ============================================
# Increase dirty page limits for write-heavy workloads

# Current settings
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Aggressive write caching (up to 40% of RAM as dirty pages)
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=10

# ============================================
# 6. NUMA AWARENESS
# ============================================
# Bind I/O-intensive processes to NUMA node with storage
# Check NUMA topology
lscpu | grep NUMA
cat /sys/block/nvme0n1/device/numa_node

# Run application on correct NUMA node
numactl --cpunodebind=0 --membind=0 ./io_intensive_app

# ============================================
# 7. INTERRUPT AFFINITY
# ============================================
# Balance IRQs across CPUs for high-throughput NICs/storage
# Install irqbalance or manually tune

# Check NVMe interrupt distribution
cat /proc/interrupts | grep nvme

# Distribute NVMe interrupts across CPUs (example)
# Set affinity mask for each interrupt queue
```

Application-Level Optimization
Beyond system tuning, application design choices profoundly affect throughput: issuing large sequential requests where possible, batching small writes, keeping enough asynchronous requests in flight, advising the kernel of access patterns, and avoiding unnecessary data copies and synchronous flushes.
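As a minimal sketch of these ideas (assuming Linux/glibc; the file name and request size are illustrative), the following reads a file sequentially with large requests and tells the kernel the access pattern up front so it can read ahead aggressively:

```c
// Sketch: large sequential reads plus an access-pattern hint via posix_fadvise().
#define _POSIX_C_SOURCE 200112L

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define READ_CHUNK (1024 * 1024)   /* 1 MB requests amortize per-call overhead */

int main(void) {
    int fd = open("large_input.dat", O_RDONLY);   /* hypothetical input file */
    if (fd < 0) { perror("open"); return 1; }

    /* Hint that access will be sequential so the kernel reads ahead aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char *buf = malloc(READ_CHUNK);
    if (!buf) { close(fd); return 1; }

    ssize_t n;
    unsigned long long total = 0;
    while ((n = read(fd, buf, READ_CHUNK)) > 0)
        total += (unsigned long long)n;   /* process the data here */

    printf("read %llu bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}
```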
Throughput optimization often trades off against other qualities. Aggressive caching risks data loss on power failure. Large I/O requests improve throughput but increase latency for individual operations. Deep queues benefit throughput but may starve low-priority requests. Always consider the broader system requirements when tuning for throughput.
Production systems require continuous throughput monitoring to detect degradation, plan capacity, and diagnose issues. Effective monitoring combines real-time metrics, historical trending, and diagnostic deep-dives.
Essential Metrics
Beyond raw throughput (MB/s), comprehensive monitoring tracks:
| Metric | Description | Interpretation |
|---|---|---|
| Read throughput | Bytes read per second | Baseline for read-heavy workloads |
| Write throughput | Bytes written per second | Watch for write amplification |
| IOPS | I/O operations per second | Complements throughput for small I/O |
| Queue depth | Outstanding I/O requests | Low depth suggests application limits |
| I/O wait | CPU time waiting for I/O | High iowait indicates I/O bottleneck |
| Device utilization | Percentage of time device busy | 100% indicates saturation |
```bash
# Real-time I/O throughput monitoring commands

# ============================================
# iostat - Comprehensive I/O statistics
# ============================================
# -x: Extended statistics
# -m: Display in MB/s (not sectors)
# 1: Update every 1 second

iostat -xm 1

# Sample output:
# Device    r/s     w/s     rMB/s   wMB/s   await   %util
# nvme0n1   8524    4218    432.1   215.6   0.12    78.5
#
# Key columns:
# - r/s, w/s: Read/write IOPS
# - rMB/s, wMB/s: Throughput in MB/s
# - await: Average queue wait + service time (ms)
# - %util: Percentage time device was busy

# ============================================
# iotop - Per-process I/O monitoring
# ============================================
# -o: Only show processes doing I/O
# -b: Batch mode (for scripting)

sudo iotop -o

# Identify which processes consume I/O bandwidth

# ============================================
# blktrace - Detailed block layer tracing
# ============================================
# Captures low-level I/O events for deep analysis

# Start trace on device
sudo blktrace -d /dev/nvme0n1 -o trace_output

# Analyze trace
blkparse trace_output | head -100

# Generate statistical summary
# (blkparse -d produces the binary stream that btt expects)
blkparse -d trace_output.bin trace_output
btt -i trace_output.bin

# ============================================
# dstat - Combined system statistics
# ============================================
# Shows I/O, CPU, network together

dstat -cdnm 1
# c: CPU stats
# d: Disk I/O
# n: Network I/O
# m: Memory stats

# ============================================
# nfsstat/nfsiostat - NFS throughput
# ============================================
# For NFS-mounted filesystems

nfsiostat 1

# ============================================
# Network throughput monitoring
# ============================================
# iftop for real-time network throughput
sudo iftop -i eth0

# nethogs for per-process network bandwidth
sudo nethogs eth0

# ============================================
# Continuous logging for trending
# ============================================
# Collect iostat data for historical analysis

iostat -xm 60 >> /var/log/iostat.log &

# Or use structured collection with sar
sar -d 60 >> /var/log/sar_disk.log &
```

Interpreting Monitoring Data
Raw metrics require context for meaningful interpretation:
1. Baseline Establishment Before identifying problems, establish normal throughput patterns. What does healthy throughput look like during peak hours? Off-peak? During batch jobs?
2. Saturation Detection Device utilization at or near 100% indicates saturation—the device cannot handle additional load without queuing delays. However, NVMe devices may show low utilization while achieving high throughput due to command parallelism.
3. Queue Depth Analysis If throughput is below expectations but queue depth is low, the bottleneck is likely in the application (not generating enough requests) rather than the device. Conversely, high queue depth with high utilization indicates true device saturation.
4. Trending Analysis Sudden throughput drops often indicate a failed or degraded device, a RAID rebuild in progress, a newly deployed competing workload, thermal throttling, or a configuration change.
Gradual throughput decline may indicate data sets outgrowing the cache, filesystem fragmentation, an SSD filling up and spending more time on garbage collection, or slowly increasing request load.
Set alerts for throughput anomalies before users notice performance degradation. Alert when throughput drops below 80% of baseline, when utilization consistently exceeds 85%, or when queue depth grows unexpectedly. Early warning enables proactive remediation rather than reactive firefighting.
I/O throughput—the rate of data movement—is a foundational metric for understanding and optimizing system performance. While the concept is simple, achieving high throughput in practice requires deep understanding of hardware, software, and workload interactions.
What's Next
Throughput tells us how much data moves, but not how quickly individual operations complete. The next page examines latency considerations—the other half of the I/O performance equation—and explores how throughput and latency interact in complex ways that shape real-world system behavior.
You now have a comprehensive understanding of I/O throughput: its measurement, the factors that affect it, the gap between theory and practice, and strategies for optimization. This foundation prepares you to analyze and optimize data movement in any computing system.