A storage array boasts 100 GB/s aggregate bandwidth. A network fabric promises 400 Gbps between racks. Yet applications report transfer speeds of 10 GB/s and 40 Gbps respectively. Where did the other 90% go?
The gap between available bandwidth and utilized bandwidth is one of the most pervasive challenges in I/O systems engineering. Raw hardware specifications tell only part of the story. Protocol overhead consumes capacity. Contention between workloads causes interference. Inefficient access patterns leave channels idle. Configuration mismatches cause bottlenecks.
Bandwidth utilization measures how effectively systems convert raw capacity into useful work. A system achieving 95% utilization extracts maximum value from infrastructure investments. One achieving 30% wastes resources—or worse, delivers poor performance while appearing underloaded. Understanding and optimizing bandwidth utilization is essential for both cost efficiency and performance excellence.
By the end of this page, you will understand how to measure and analyze bandwidth utilization, identify efficiency losses across the I/O stack, understand contention effects in shared resources, and apply strategies for maximizing the productive use of I/O bandwidth.
Bandwidth utilization is the ratio of actual data transfer rate to maximum available bandwidth:
$$U = \frac{B_{actual}}{B_{max}} \times 100\%$$
Where $B_{actual}$ is the measured data transfer rate and $B_{max}$ is the maximum available bandwidth of the interface being evaluated.
However, this simple formula obscures important nuances. What exactly constitutes "actual" bandwidth? And what is the appropriate baseline for "maximum"?
| Bandwidth Type | Definition | Example |
|---|---|---|
| Raw/Wire Bandwidth | Physical signaling capacity of the interface | PCIe 4.0 x4: 64 Gbps raw |
| Encoded Bandwidth | Available after line coding overhead | PCIe 4.0 x4: 62.7 Gbps (128b/130b) |
| Protocol Bandwidth | Available for payload after protocol headers | NVMe over PCIe: ~61 Gbps effective |
| Useful Bandwidth | Data valuable to the application | After deduplication: varies by workload |
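To see why the baseline matters, here is a minimal sketch (illustrative numbers taken from the PCIe 4.0 x4 rows above, plus a hypothetical measured rate) showing how the same observed throughput yields different utilization figures depending on which baseline it is divided by.

```python
# Illustrative only: one measured rate divided by different baselines.
# Baseline figures follow the PCIe 4.0 x4 example in the table (Gbps).
baselines_gbps = {
    "raw/wire": 64.0,    # physical signaling capacity
    "encoded": 62.7,     # after 128b/130b line coding
    "protocol": 61.0,    # after NVMe/PCIe protocol overhead (approximate)
}

measured_gbps = 40.0     # hypothetical application-observed transfer rate

for name, max_gbps in baselines_gbps.items():
    utilization = measured_gbps / max_gbps * 100
    print(f"{name:>8}: {utilization:5.1f}% utilization")
```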
Utilization Efficiency Chain
Effective utilization is the product of efficiencies at each layer:
$$U_{effective} = \eta_{encoding} \times \eta_{protocol} \times \eta_{overhead} \times \eta_{access} \times \eta_{contention}$$
Where $\eta_{encoding}$ is line-coding efficiency, $\eta_{protocol}$ is protocol header efficiency, $\eta_{overhead}$ is per-command and software-stack efficiency, $\eta_{access}$ is access-pattern efficiency, and $\eta_{contention}$ is the efficiency retained when the resource is shared with other workloads.
Example Analysis: Consider a PCIe 4.0 x4 NVMe SSD under random 4KB reads. The drive delivers roughly 0.81 GB/s of payload (about 200,000 IOPS × 4 KB) against roughly 8.0 GB/s of raw link bandwidth.
Result: 0.81 / 8.0 = 10.1% utilization — but this is optimal for this workload!
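A back-of-envelope sketch of where that figure comes from (the IOPS value is back-calculated from the stated result and is illustrative, not a measured specification):

```python
# Rough arithmetic for the random-4KB example above.
# ~198K IOPS is an illustrative assumption, back-calculated from 0.81 GB/s.
iops = 198_000                # assumed random 4KB read rate
io_size = 4 * 1024            # bytes per operation
raw_bandwidth = 8.0e9         # ~PCIe 4.0 x4 raw capacity in bytes/s (approximate)

achieved = iops * io_size     # payload bytes transferred per second
utilization = achieved / raw_bandwidth * 100
print(f"Achieved: {achieved / 1e9:.2f} GB/s -> {utilization:.1f}% utilization")
# ~0.81 GB/s and ~10% utilization: the workload is IOPS-limited, not bandwidth-limited.
```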
Low bandwidth utilization isn't inherently problematic. Random small I/O workloads are IOPS-bound, not bandwidth-bound. A database server achieving 50 MB/s on a 7 GB/s NVMe drive may be perfectly optimized—it's simply doing 500,000 IOPS of 100-byte reads rather than streaming large files. Context matters when evaluating utilization metrics.
Accurate bandwidth utilization measurement requires understanding what to measure and how to interpret results in context.
Key Utilization Metrics
| Metric | Description | Interpretation |
|---|---|---|
| Instantaneous Utilization | Current bandwidth use at measurement point | Useful for real-time dashboards; noisy |
| Average Utilization | Mean utilization over time window | Good for capacity planning; hides bursts |
| Peak Utilization | Maximum utilization during period | Identifies saturation events |
| Sustained Utilization | Utilization during active transfer periods | Measures efficiency when system is working |
| Busy Time Utilization | % time with any activity × utilization during activity | Separates idle time from inefficiency |
"""Bandwidth Utilization Analysis Framework Provides comprehensive utilities for measuring, analyzing, and reporting bandwidth utilization across I/O subsystems.""" import timefrom dataclasses import dataclassfrom typing import List, Optionalimport subprocessimport re @dataclassclass BandwidthSample: """Single bandwidth measurement sample.""" timestamp: float bytes_read: int bytes_written: int device_busy_pct: float # 0-100 @dataclass class UtilizationReport: """Comprehensive utilization analysis.""" device: str max_bandwidth_mbps: float # Throughput metrics avg_read_mbps: float avg_write_mbps: float peak_read_mbps: float peak_write_mbps: float # Utilization metrics avg_utilization_pct: float peak_utilization_pct: float sustained_utilization_pct: float # During active periods only # Efficiency analysis read_write_ratio: float bandwidth_efficiency: float # Actual vs theoretical class BandwidthUtilizationAnalyzer: """ Analyzes bandwidth utilization for block devices. Uses /proc/diskstats on Linux for accurate measurement. """ def __init__(self, device: str, max_bandwidth_mbps: float): """ Initialize analyzer. Args: device: Device name (e.g., 'nvme0n1', 'sda') max_bandwidth_mbps: Theoretical max bandwidth in MB/s """ self.device = device self.max_bandwidth_mbps = max_bandwidth_mbps self.samples: List[BandwidthSample] = [] def collect_sample(self) -> BandwidthSample: """Collect current bandwidth utilization sample.""" # Read from /proc/diskstats with open('/proc/diskstats', 'r') as f: for line in f: fields = line.split() if len(fields) >= 14 and fields[2] == self.device: # Fields: major minor name rd_ios rd_mrg rd_sect rd_ticks # wr_ios wr_mrg wr_sect wr_ticks ios_inflight # io_ticks weighted_io_ticks bytes_read = int(fields[5]) * 512 # Sectors to bytes bytes_written = int(fields[9]) * 512 io_ticks = int(fields[12]) # ms active sample = BandwidthSample( timestamp=time.time(), bytes_read=bytes_read, bytes_written=bytes_written, device_busy_pct=0.0 # Calculated from successive samples ) self.samples.append(sample) return sample raise ValueError(f"Device {self.device} not found in /proc/diskstats") def collect_samples(self, duration_seconds: float, interval_seconds: float = 1.0): """Collect samples over a duration.""" end_time = time.time() + duration_seconds while time.time() < end_time: self.collect_sample() time.sleep(interval_seconds) def analyze(self) -> UtilizationReport: """Analyze collected samples to produce utilization report.""" if len(self.samples) < 2: raise ValueError("Need at least 2 samples for analysis") read_rates = [] write_rates = [] utilizations = [] for i in range(1, len(self.samples)): prev, curr = self.samples[i-1], self.samples[i] dt = curr.timestamp - prev.timestamp if dt <= 0: continue # Calculate rates in MB/s read_rate = (curr.bytes_read - prev.bytes_read) / dt / (1024 * 1024) write_rate = (curr.bytes_written - prev.bytes_written) / dt / (1024 * 1024) total_rate = read_rate + write_rate utilization = (total_rate / self.max_bandwidth_mbps) * 100 read_rates.append(read_rate) write_rates.append(write_rate) utilizations.append(min(utilization, 100)) # Cap at 100% # Calculate sustained utilization (only during active periods) active_utilizations = [u for u in utilizations if u > 1.0] # >1% = active sustained_util = sum(active_utilizations) / len(active_utilizations) if active_utilizations else 0 # Calculate read/write ratio total_read = sum(read_rates) total_write = sum(write_rates) rw_ratio = total_read / total_write if total_write > 0 else float('inf') return 
UtilizationReport( device=self.device, max_bandwidth_mbps=self.max_bandwidth_mbps, avg_read_mbps=sum(read_rates) / len(read_rates), avg_write_mbps=sum(write_rates) / len(write_rates), peak_read_mbps=max(read_rates), peak_write_mbps=max(write_rates), avg_utilization_pct=sum(utilizations) / len(utilizations), peak_utilization_pct=max(utilizations), sustained_utilization_pct=sustained_util, read_write_ratio=rw_ratio, bandwidth_efficiency=(sum(utilizations) / len(utilizations)) / 100 ) def print_report(self, report: UtilizationReport): """Print formatted utilization report.""" print(f"\n{'='*60}") print(f"Bandwidth Utilization Report: {report.device}") print(f"{'='*60}") print(f"Max Bandwidth: {report.max_bandwidth_mbps:.1f} MB/s") print() print("Throughput:") print(f" Average Read: {report.avg_read_mbps:8.2f} MB/s") print(f" Average Write: {report.avg_write_mbps:8.2f} MB/s") print(f" Peak Read: {report.peak_read_mbps:8.2f} MB/s") print(f" Peak Write: {report.peak_write_mbps:8.2f} MB/s") print() print("Utilization:") print(f" Average: {report.avg_utilization_pct:6.2f}%") print(f" Peak: {report.peak_utilization_pct:6.2f}%") print(f" Sustained: {report.sustained_utilization_pct:6.2f}%") print() print("Efficiency:") print(f" Read/Write Ratio: {report.read_write_ratio:.2f}") print(f" Bandwidth Efficiency: {report.bandwidth_efficiency:.1%}") # Provide recommendations print() print("Analysis:") if report.avg_utilization_pct < 20: print(" ⚠ Low average utilization - check for IOPS bottleneck or idle time") elif report.avg_utilization_pct > 80: print(" ⚠ High utilization - approaching saturation") else: print(" ✓ Healthy utilization range") if report.peak_utilization_pct > 95 and report.avg_utilization_pct < 50: print(" ⚠ Bursty workload - consider spreading load or adding caching") # Example usageif __name__ == "__main__": analyzer = BandwidthUtilizationAnalyzer( device="nvme0n1", max_bandwidth_mbps=7000 # PCIe 4.0 x4 NVMe theoretical ) print("Collecting samples for 60 seconds...") analyzer.collect_samples(duration_seconds=60, interval_seconds=1.0) report = analyzer.analyze() analyzer.print_report(report)Monitoring Tools for Bandwidth Utilization
Linux provides several tools for bandwidth utilization monitoring:
```bash
#!/bin/bash
# Bandwidth Utilization Monitoring Commands

# ============================================
# STORAGE BANDWIDTH UTILIZATION
# ============================================

# iostat with utilization (%util column)
# Shows both throughput and device utilization
iostat -xm 1

# Example output interpretation:
# Device    r/s    w/s    rMB/s   wMB/s   %util
# nvme0n1   1200   800    600     400     45%
#
# This indicates:
# - Total throughput: 600 + 400 = 1000 MB/s
# - Device is busy 45% of the time
# - If max bandwidth is 7000 MB/s: 1000/7000 = 14.3% bandwidth utilization
# - But device is only busy 45% of time
# - During busy time: 14.3% / 45% = ~32% sustained utilization

# ============================================
# NETWORK BANDWIDTH UTILIZATION
# ============================================

# sar for network interface utilization
sar -n DEV 1

# More detailed with iftop
sudo iftop -i eth0 -t -s 10

# Calculate utilization for 10 GbE link (1250 MB/s max)
# If observing 800 MB/s: 800/1250 = 64% utilization

# ============================================
# PCIE BANDWIDTH UTILIZATION
# ============================================

# Use perf with PCIe events (requires appropriate PMU support)
sudo perf stat -e 'pci/r/w bytes' -a sleep 10

# Or use Intel PCM (Performance Counter Monitor)
# Shows per-socket PCIe bandwidth
sudo pcm-pcie

# ============================================
# MEMORY BANDWIDTH UTILIZATION
# ============================================

# Intel PCM for memory bandwidth
sudo pcm-memory 1

# Or use perf with memory controller events
sudo perf stat -e 'uncore_imc/cas_count_read/,uncore_imc/cas_count_write/' -a sleep 10

# ============================================
# AUTOMATED UTILIZATION TRACKING
# ============================================

# Log to file for historical analysis
(while true; do
    echo "=== $(date) ===" >> /var/log/bandwidth_util.log
    iostat -xm 1 1 | tail -n +7 >> /var/log/bandwidth_util.log
    sleep 60
done) &

# Prometheus node_exporter provides:
# - node_disk_read_bytes_total
# - node_disk_written_bytes_total
# - node_network_receive_bytes_total
# - node_network_transmit_bytes_total
# Calculate utilization as rate(metric) / max_bandwidth
```

High utilization isn't the same as saturation. A device at 80% utilization handling workload efficiently is very different from a device at 80% utilization with a growing queue of pending requests. Monitor queue depth alongside utilization to distinguish healthy high utilization from saturation.
Multiple factors reduce bandwidth utilization efficiency. Understanding these sources enables targeted optimization.
Protocol and Encoding Overhead
Every I/O protocol consumes bandwidth for non-data purposes:
| Protocol/Layer | Overhead Type | Bandwidth Impact |
|---|---|---|
| PCIe 4.0 (128b/130b) | Line encoding | ~1.5% loss |
| NVMe Command | 64-byte submission queue entry | ~1.5% per 4KB I/O |
| NVMe Completion | 16-byte completion entry | ~0.4% per 4KB I/O |
| Ethernet (1500B MTU) | 14B header + 4B FCS + 12B IFG | ~2% loss |
| Ethernet (Jumbo MTU 9000B) | Same fixed overhead | ~0.3% loss |
| TCP/IP Headers | 40+ bytes per packet | ~2.7% for 1500B packets |
| SATA (8b/10b) | Line encoding | ~20% loss |
| USB 3.x | Packet framing + overhead | ~10-15% loss |
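As a rough illustration of how fixed per-unit overhead becomes lost bandwidth, the sketch below computes payload efficiency as payload ÷ (payload + overhead) using the header and entry sizes from the table; treat the results as approximations.

```python
# Payload efficiency = payload / (payload + fixed per-unit overhead).
# Overhead sizes follow the table above; results are approximations.
def efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    return payload_bytes / (payload_bytes + overhead_bytes)

cases = {
    "Ethernet 1500B MTU (14B hdr + 4B FCS + 12B IFG)": (1500, 30),
    "Ethernet 9000B jumbo (same fixed overhead)": (9000, 30),
    "TCP/IP headers on a 1460B segment (40B headers)": (1460, 40),
    "NVMe 4KB read (64B submission queue entry)": (4096, 64),
}

for name, (payload, overhead) in cases.items():
    pct = efficiency(payload, overhead) * 100
    print(f"{name}: {pct:.1f}% of link bandwidth carries payload")
```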
Access Pattern Inefficiency
How data is accessed dramatically affects utilization:
Small I/O Operations: The fixed overhead per operation consumes bandwidth proportionally more for small requests. A 64-byte NVMe command consumes 1.5% of a 4KB transfer but would be 50% of a 128-byte transfer.
Random Access: On HDDs, seek time creates dead time where no data flows. On SSDs, random access limits internal parallelism and increases flash read latency.
Read-Write Mixing: Many devices optimize for either reads or writes. Mixed patterns cause mode-switching overhead, context switches in controllers, and cache thrashing.
| Access Pattern | Typical Utilization Efficiency |
|---|---|
| Sequential large reads | 85-95% |
| Sequential large writes | 80-92% |
| Sequential small (4KB) reads | 50-70% |
| Random large reads | 40-60% |
| Random small (4KB) reads | 10-30% |
| Mixed random read/write | 8-25% |
Software Stack Overhead
Each software layer adds processing that limits sustainable throughput:
System Call Overhead: Transitioning between user and kernel mode costs ~100 ns-1 µs per call. At 1 million IOPS, this adds up.
Context Switching: Blocked I/O causes thread context switches (~1-10 µs each), wasting CPU cycles and evicting warm cache and TLB state.
Data Copying: Data is often copied multiple times: user buffer → kernel buffer → device buffer. Each copy consumes memory bandwidth and CPU cycles.
Interrupt Handling: Each I/O completion triggers an interrupt (~2-5 µs processing). At high IOPS, interrupt overhead becomes significant.
Allocation and Locking: Memory allocation and synchronization primitives in the I/O path add variable delays.
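A back-of-envelope sketch, using assumed per-operation costs within the ranges quoted above, shows how quickly this overhead adds up at high IOPS:

```python
# Back-of-envelope CPU cost of per-I/O software overhead.
# Per-operation costs are assumptions within the ranges quoted above.
iops = 1_000_000
cost_us = {
    "syscall (submit + complete)": 0.6,   # ~100 ns - 1 µs per call
    "interrupt handling": 2.0,            # ~2-5 µs per completion
    "copying one 4KB buffer": 0.4,        # depends on memory bandwidth
}

total_us_per_second = sum(cost_us.values()) * iops
cores_consumed = total_us_per_second / 1_000_000
print(f"~{cores_consumed:.1f} CPU cores consumed by I/O overhead at {iops:,} IOPS")
# This is why polling, interrupt coalescing, and zero-copy paths matter at high IOPS.
```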
Device-Internal Inefficiencies
Even at the device level, bandwidth is lost:
Garbage Collection (SSDs): When free blocks are low, SSDs must compact data, consuming internal bandwidth. This can reduce available bandwidth by 30-50% under sustained writes.
Wear Leveling: Relocating data to even out wear adds overhead, particularly when static data must be moved out of otherwise idle blocks.
Error Correction: LDPC decoding in SSDs and read retries in HDDs consume controller time, reducing throughput.
Thermal Throttling: Sustained high throughput causes temperature increases, triggering reduced performance modes.
Power State Transitions: Low-power states require wake-up time (1-50 ms for HDDs, 10-100 µs for SSDs), causing delays after idle periods.
These inefficiencies multiply rather than add. A 90% efficient protocol running over a 90% efficient stack with a 50% efficient access pattern and 80% efficient device yields: 0.9 × 0.9 × 0.5 × 0.8 = 32.4% overall efficiency. This explains why practical throughput is often a fraction of theoretical maximums.
In real systems, I/O resources are rarely dedicated to single workloads. Contention between competing demands reduces effective bandwidth available to each.
Types of Resource Contention
Contention appears at several levels: multiple workloads sharing a single device (SSD, HDD, or array), shared interconnects such as PCIe links, memory channels, and CPU interconnects, shared network links and switch ports, and software-level contention on locks, queues, and caches along the I/O path.
Modeling Contention Effects
When N equal workloads contend for a shared resource with capacity C, naively each receives C/N. Reality is worse due to contention overhead:
$$B_{\text{per-workload}} = \frac{C \times \eta_{contention}}{N}$$
Where η_contention accounts for losses such as the loss of sequential locality when competing streams interleave, queue arbitration and scheduling overhead, and cache or prefetcher interference between workloads.
Typical values for η_contention:
| Scenario | η_contention |
|---|---|
| Dedicated device | 1.0 |
| 2 similar workloads | 0.95-0.98 |
| 2-4 diverse workloads | 0.85-0.95 |
| 5-10 workloads | 0.75-0.90 |
| Many small workloads | 0.60-0.80 |
| VM/container multi-tenancy | 0.50-0.80 |
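A minimal sketch applying the contention model with illustrative η_contention values drawn from the table above:

```python
# Applying B_per-workload = (C × η_contention) / N with illustrative values.
def per_workload_bandwidth(capacity_mbps: float, n_workloads: int, eta: float) -> float:
    return capacity_mbps * eta / n_workloads

capacity = 7000.0  # MB/s, e.g. a PCIe 4.0 x4 NVMe device
for n, eta in [(1, 1.0), (2, 0.96), (4, 0.90), (8, 0.80)]:
    share = per_workload_bandwidth(capacity, n, eta)
    lost = capacity - share * n
    print(f"{n} workloads (eta={eta:.2f}): ~{share:,.0f} MB/s each, ~{lost:,.0f} MB/s lost to contention")
```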
The "Noisy Neighbor" Problem
In multi-tenant environments, one aggressive workload can consume disproportionate resources:
Symptom: Application A's throughput drops 50% when Application B starts a backup job.
Cause: Application B issues large sequential I/Os that monopolize device bandwidth and cause queue head-of-line blocking.
Solutions: isolate workloads with per-cgroup bandwidth limits and I/O weights (as shown below), schedule heavy jobs such as backups for off-peak windows, or move them onto separate devices or paths.
```bash
#!/bin/bash
# Bandwidth Management with cgroups v2
# Isolate workloads and prevent noisy neighbor effects

# Enable cgroup v2 controllers (+cpuset is needed for the NUMA section below)
echo "+io +cpu +memory +cpuset" > /sys/fs/cgroup/cgroup.subtree_control

# ============================================
# Create isolated cgroups for different workloads
# ============================================

# High priority workload (production database)
mkdir -p /sys/fs/cgroup/prod-db
echo "256:0 rbps=3000000000 wbps=2000000000" > /sys/fs/cgroup/prod-db/io.max
# Device 256:0 (check with lsblk), read 3GB/s, write 2GB/s

# Low priority workload (backup job)
mkdir -p /sys/fs/cgroup/backup
echo "256:0 rbps=500000000 wbps=500000000" > /sys/fs/cgroup/backup/io.max
# Limited to 500MB/s read and write

# Best-effort workload (dev/test)
mkdir -p /sys/fs/cgroup/dev
echo "256:0 rbps=max wbps=max" > /sys/fs/cgroup/dev/io.max
echo "256:0 100" > /sys/fs/cgroup/dev/io.weight  # Lower weight for fair sharing

# ============================================
# Set I/O priority weights (relative scheduling)
# ============================================

# Higher weight = higher priority in contention
# Range: 1-10000, default 100
echo "256:0 500" > /sys/fs/cgroup/prod-db/io.weight  # 5x priority
echo "256:0 50"  > /sys/fs/cgroup/backup/io.weight   # 0.5x priority
echo "256:0 100" > /sys/fs/cgroup/dev/io.weight      # Normal

# ============================================
# Launch processes in cgroups
# ============================================

# Run database in prod cgroup
echo $DATABASE_PID > /sys/fs/cgroup/prod-db/cgroup.procs

# Run backup in limited cgroup
cgexec -g io:backup /usr/bin/backup-script.sh

# ============================================
# Monitor per-cgroup I/O
# ============================================

# View current I/O statistics per cgroup
cat /sys/fs/cgroup/*/io.stat

# Example output:
# 256:0 rbytes=1234567890 wbytes=987654321 rios=12345 wios=9876 dbytes=0 dios=0

# ============================================
# NUMA-aware bandwidth isolation
# ============================================

# Bind to specific NUMA node for consistent performance
echo "0"   > /sys/fs/cgroup/prod-db/cpuset.mems
echo "0-7" > /sys/fs/cgroup/prod-db/cpuset.cpus

# This ensures database traffic uses NUMA-local memory
# and PCIe paths, reducing cross-socket bandwidth contention
```

Don't wait for noisy neighbor complaints. Establish bandwidth budgets upfront, implement per-workload limits, and monitor for violations. Quota enforcement is easier to implement and explain than reactive throttling during incidents.
Improving bandwidth utilization requires systematic optimization across hardware configuration, system tuning, and application design.
Hardware Configuration Strategies
Start by confirming the hardware can actually deliver its rated bandwidth: verify that PCIe devices negotiate the expected link generation and width (a drive training at PCIe 3.0 x2 instead of 4.0 x4 loses most of its bandwidth), populate all memory channels, spread high-bandwidth devices across PCIe root complexes and NUMA nodes rather than stacking them behind a single switch, and enable jumbo frames on network links dedicated to bulk transfer.
System Tuning Strategies
OS configuration significantly impacts achievable utilization:
| Tuning Area | Parameter | Optimization |
|---|---|---|
| I/O Scheduler | none/mq-deadline for NVMe | Reduces scheduling overhead for devices that handle queuing internally |
| Queue Depth | nr_requests 1024+ | Allows more concurrent I/O to keep device busy |
| Read Ahead | read_ahead_kb 16384 | Prefetches sequential data; reduces I/O wait cycles |
| Dirty Pages | dirty_ratio 40% | Buffers more writes; improves sustained write throughput |
| Kernel Bypass | io_uring sq_poll | Kernel polls for completions; reduces syscall overhead |
| Interrupt Coalescing | Device-specific | Batches interrupts; trades latency for throughput |
| NUMA Balancing | Disable for I/O nodes | Prevents migration that disrupts DMA mappings |
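As a sketch of how a few of these tunables are applied (standard Linux sysfs/procfs paths, run as root; the values mirror the table and are starting points rather than universal recommendations):

```python
# Sketch: apply a few of the tunings above via sysfs/procfs (requires root).
# Values mirror the table; tune and validate for your own hardware and workload.
from pathlib import Path

def write_tunable(path: str, value: str) -> None:
    """Write a kernel tunable, showing the old and new values."""
    p = Path(path)
    print(f"{path}: {p.read_text().strip()} -> {value}")
    p.write_text(value)

device = "nvme0n1"  # adjust for your system
write_tunable(f"/sys/block/{device}/queue/scheduler", "none")       # device handles its own queuing
write_tunable(f"/sys/block/{device}/queue/nr_requests", "1024")     # deeper software queue
write_tunable(f"/sys/block/{device}/queue/read_ahead_kb", "16384")  # aggressive sequential readahead
write_tunable("/proc/sys/vm/dirty_ratio", "40")                     # buffer more writes before flushing
```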
Application Design Strategies
Application architecture often has the largest impact on utilization:
```c
/**
 * High-Bandwidth I/O Pattern Using io_uring
 *
 * Demonstrates techniques for maximizing bandwidth utilization:
 *  - Large aligned I/O requests (1MB)
 *  - Deep queue depth (64 outstanding)
 *  - Kernel-side polling to reduce syscall overhead
 *  - Pre-allocated, aligned buffers reused across submissions
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

#define QUEUE_DEPTH 64
#define BLOCK_SIZE (1024 * 1024)  // 1MB blocks for high bandwidth
#define NUM_BLOCKS 1024

struct io_data {
    int fd;
    off_t offset;
    struct iovec iov;
};

/**
 * Initialize io_uring with optimal settings for bandwidth
 */
int setup_io_uring(struct io_uring *ring) {
    struct io_uring_params params = {0};

    // Enable kernel-side polling (reduces syscalls)
    params.flags = IORING_SETUP_SQPOLL;
    params.sq_thread_idle = 1000;  // ms before kernel thread sleeps

    int ret = io_uring_queue_init_params(QUEUE_DEPTH, ring, &params);
    if (ret < 0) {
        fprintf(stderr, "io_uring init failed: %d\n", ret);
        return -1;
    }
    return 0;
}

/**
 * Submit read requests to fill queue
 */
int submit_reads(struct io_uring *ring, int fd, void **buffers,
                 off_t *offsets, int count) {
    for (int i = 0; i < count; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe) {
            // Queue full, submit and wait for space
            io_uring_submit(ring);
            sqe = io_uring_get_sqe(ring);
        }

        // Prepare read with pre-allocated aligned buffer
        io_uring_prep_read(sqe, fd, buffers[i], BLOCK_SIZE, offsets[i]);

        // Store index for completion tracking
        io_uring_sqe_set_data(sqe, (void*)(long)i);
    }
    return io_uring_submit(ring);
}

/**
 * Process completions and resubmit new reads
 */
int process_completions(struct io_uring *ring) {
    struct io_uring_cqe *cqe;
    unsigned head;
    int completed = 0;

    io_uring_for_each_cqe(ring, head, cqe) {
        if (cqe->res < 0) {
            fprintf(stderr, "I/O error: %d\n", cqe->res);
        } else if (cqe->res != BLOCK_SIZE) {
            fprintf(stderr, "Short read: %d\n", cqe->res);
        }
        completed++;
    }

    if (completed > 0) {
        io_uring_cq_advance(ring, completed);
    }
    return completed;
}

/**
 * Main high-bandwidth read loop
 */
int high_bandwidth_read(const char *path, size_t total_bytes) {
    struct io_uring ring;
    int fd;

    // Open with O_DIRECT for direct device access
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    if (setup_io_uring(&ring) < 0) {
        close(fd);
        return -1;
    }

    // Allocate aligned buffers
    void *buffers[QUEUE_DEPTH];
    off_t offsets[QUEUE_DEPTH];
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        posix_memalign(&buffers[i], BLOCK_SIZE, BLOCK_SIZE);
        offsets[i] = (off_t)i * BLOCK_SIZE;
    }

    // Initial submission to fill queue
    submit_reads(&ring, fd, buffers, offsets, QUEUE_DEPTH);

    size_t bytes_read = 0;
    off_t next_offset = QUEUE_DEPTH * BLOCK_SIZE;

    while (bytes_read < total_bytes) {
        // Wait for at least one completion
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);

        // Process all available completions
        int completed = process_completions(&ring);
        bytes_read += completed * BLOCK_SIZE;

        // Resubmit to maintain queue depth
        for (int i = 0; i < completed && next_offset < total_bytes; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buffers[i], BLOCK_SIZE, next_offset);
            next_offset += BLOCK_SIZE;
        }
        io_uring_submit(&ring);
    }

    // Cleanup
    io_uring_queue_exit(&ring);
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        free(buffers[i]);
    }
    close(fd);
    return 0;
}
```

For sustained workloads, target 75-85% bandwidth utilization. This leaves headroom for bursts and variability while extracting most of the available capacity. Operating constantly at 95%+ risks saturation during demand spikes and causes latency degradation.
Understanding common utilization patterns helps diagnose issues and optimize systems.
Pattern 1: Sustained High Utilization
Signature: Consistently >85% utilization over extended periods
Causes: genuinely bandwidth-heavy workloads (streaming reads, backups, large analytics scans), undersized devices or links, or too many workloads consolidated onto one resource.
Implications: little headroom remains for bursts; as the resource approaches saturation, queues build and latency rises sharply.
Actions: confirm latency and queue depth are still acceptable, then add capacity (more devices, wider or faster links), spread workloads across resources, or move heavy jobs to off-peak windows.
Pattern 2: Bursty Utilization
Signature: Low average (20-40%) with high peaks (>90%)
Causes: periodic batch jobs, checkpoints and cache flushes, log rotation and backups, or user-driven demand spikes.
Implications: capacity planning based on averages is misleading; bursts can saturate the device and cause latency spikes even though average utilization looks healthy.
Actions: smooth the load by staggering or rate-limiting batch work, add caching or buffering to absorb peaks, and provision for peak demand where latency matters.
Pattern 3: Persistently Low Utilization
Signature: Consistently <30% despite perceived performance issues
Causes: IOPS-bound or latency-bound workloads (small random I/O), insufficient queue depth or application concurrency, serialization in the I/O path, or links negotiated below their rated speed.
Implications: the bottleneck is not bandwidth; adding raw capacity will not improve performance.
Actions: examine IOPS, latency, and queue depth; increase concurrency and queue depth, batch small operations into larger ones, and verify negotiated link width and speed.
Chart utilization over time at multiple granularities (seconds, minutes, hours). Daily and weekly patterns reveal batch job impacts. Correlate utilization with latency, queue depth, and application metrics to understand whether observed utilization is healthy or problematic.
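To see why multiple granularities matter, the sketch below summarizes the same synthetic per-second utilization series over progressively coarser windows; the bursts that cause latency spikes disappear from the coarser views.

```python
# The same per-second utilization series summarized at several window sizes.
# Coarse windows hide the bursts that cause latency spikes.
def window_averages(samples, window):
    """Average each non-overlapping window of `window` samples."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

# Synthetic example: mostly idle with short 95% bursts (6 minutes of data).
per_second = ([5.0] * 50 + [95.0] * 10) * 6

for window in (1, 10, 60):
    averaged = window_averages(per_second, window)
    peak = max(averaged)
    mean = sum(averaged) / len(averaged)
    print(f"{window:>3}s windows: peak {peak:5.1f}%  mean {mean:5.1f}%")
```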
Bandwidth utilization measures how effectively I/O systems convert raw capacity into useful work. Achieving high utilization requires understanding the efficiency chain from physical interface to application layer.
What's Next
With throughput, latency, and utilization understood, the next page examines hardware bottlenecks—the physical constraints that limit I/O performance regardless of software optimization. We'll explore how to identify, diagnose, and address hardware-level limitations.
You now understand bandwidth utilization comprehensively: how to measure it, sources of inefficiency, contention effects, and optimization strategies. This knowledge enables you to maximize the value extracted from I/O infrastructure investments.