In the world of I/O performance, throughput often captures the headlines—advertisements boast of blazing-fast SSDs achieving 7,000 MB/s, networks promising 100 Gbps, storage arrays delivering terabytes per second. Yet for the vast majority of real-world applications, latency—not throughput—determines perceived performance.
Consider a web server handling thousands of user requests. Each request needs to fetch a small amount of data from storage. A drive capable of 7,000 MB/s throughput but with 100 microsecond latency will feel dramatically slower than one achieving 500 MB/s with 10 microsecond latency—because the web server issues thousands of small requests, each paying the latency tax, rather than a few massive transfers that would highlight throughput.
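To make the difference concrete, the sketch below models the scenario above under simplifying assumptions (serial requests, per-request time modeled as latency plus transfer time; queuing, caching, and parallelism ignored). The drive parameters mirror the hypothetical drives in the previous paragraph.

```python
# Back-of-the-envelope sketch: 10,000 serial 4 KB requests on two hypothetical drives.
BLOCK = 4096   # bytes per request
N = 10_000     # number of small requests

def total_seconds(latency_s: float, throughput_bps: float) -> float:
    """Time to serve N serial requests: each pays latency plus transfer time."""
    per_request = latency_s + BLOCK / throughput_bps
    return N * per_request

drive_a = total_seconds(100e-6, 7_000e6)  # 7,000 MB/s, 100 us latency
drive_b = total_seconds(10e-6, 500e6)     # 500 MB/s, 10 us latency

print(f"Drive A (high throughput): {drive_a:.2f} s")  # ~1.01 s
print(f"Drive B (low latency):     {drive_b:.2f} s")  # ~0.18 s
```

Despite a 14× throughput advantage, the high-throughput drive finishes the small-request workload roughly 5× slower, because each request pays the latency tax.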
Latency is insidious precisely because it's invisible until you measure it. Systems appear to work correctly; they're just slow. Users don't see an error message—they see a delay, a hesitation, a fraction of a second that accumulates into frustration and abandonment.
By the end of this page, you will understand I/O latency in depth: its components, how to measure it accurately, the complex relationship between latency and throughput, the critical importance of tail latencies in large-scale systems, and proven strategies for minimizing latency in performance-sensitive applications.
I/O latency is the time elapsed between initiating an I/O operation and receiving the result. Unlike throughput, which measures aggregate data flow, latency measures the delay experienced by a single operation. It is typically expressed in time units: nanoseconds (ns) for caches and DRAM, microseconds (µs) for SSDs and data-center networks, and milliseconds (ms) for hard disks and wide-area networks.
The term latency encompasses several related but distinct concepts:
| Latency Type | Definition | Measured From → To |
|---|---|---|
| Access Latency | Time to access a single unit of data | Request issued → Data available |
| Command Latency | Time for device to process a command | Command arrival at device → Response sent |
| Queue Latency | Time spent waiting in queues | Request submitted → Processing begins |
| Service Time | Actual processing/transfer time | Processing begins → Operation completes |
| End-to-End Latency | Total user-perceived latency | Application request → Application receives response |
The Latency Anatomy
End-to-end I/O latency is composed of multiple stages, each contributing delay:
$$L_{total} = L_{software} + L_{queue} + L_{transfer} + L_{device} + L_{media}$$
1. Software Latency ($L_{software}$) Time spent in software layers: system call overhead, file system processing, driver execution, interrupt handling. Ranges from nanoseconds (optimized kernel paths) to milliseconds (complex file system operations).
2. Queue Latency ($L_{queue}$) Time spent waiting in various queues: OS I/O scheduler queue, device queue, network buffers. Can dominate under contention; minimal when system is lightly loaded.
3. Transfer Latency ($L_{transfer}$) Time to move data across interfaces: PCIe bus, SATA cable, network link. Proportional to data size; negligible for small operations.
4. Device Latency ($L_{device}$) Controller processing time: command parsing, internal scheduling, error checking. Modern NVMe controllers add ~10-20 µs.
5. Media Latency ($L_{media}$) Time for physical media access: NAND flash read/program operations, HDD seek and rotational delay, network propagation. Often the dominant component for storage devices. A worked breakdown of the full sum follows.
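Plugging representative numbers into the $L_{total}$ decomposition gives a feel for where the time goes on one lightly loaded NVMe read. The component values below are illustrative assumptions, not measurements; real numbers vary by kernel, controller, and NAND generation.

```python
# Illustrative component estimates (microseconds) for one 4 KB NVMe random read
# on an idle system, following L_total = L_software + L_queue + L_transfer + L_device + L_media.
components_us = {
    "software (syscall, FS, driver)": 5,
    "queue (idle system)":            0,
    "transfer (PCIe, 4 KB)":          1,
    "device (controller)":           15,
    "media (TLC NAND read)":         60,
}

total = sum(components_us.values())
for name, us in components_us.items():
    print(f"{name:32s} {us:5.1f} us  ({us / total:5.1%})")
print(f"{'total':32s} {total:5.1f} us")   # ~81 us, consistent with the NVMe row below
```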
While often used interchangeably, "latency" typically refers to network or device-level delays, while "response time" encompasses the complete application-level experience. A database query's response time includes latency from multiple I/O operations plus CPU processing, query optimization, and result formatting.
Understanding latency sources enables targeted optimization. Different I/O types have dramatically different latency profiles, determined by fundamental physical and architectural constraints.
Storage Media Latency
Storage and memory latencies span roughly ten orders of magnitude, from nanosecond CPU caches to tape measured in seconds:
| Storage Medium | Typical Read Latency | Typical Write Latency | Primary Delay Source |
|---|---|---|---|
| CPU L1 Cache | ~1 ns | ~1 ns | Register propagation |
| CPU L3 Cache | ~10-20 ns | ~10-20 ns | Cache coherency |
| DRAM | ~60-100 ns | ~60-100 ns | Row/column addressing |
| Intel Optane (3D XPoint) | ~10-20 µs | ~10-20 µs | Media physics |
| NVMe SSD (TLC NAND) | ~50-100 µs | ~20-50 µs | NAND cell read/program |
| SATA SSD | ~100-200 µs | ~50-100 µs | Protocol + NAND |
| 15K RPM Enterprise HDD | ~2-8 ms | ~2-8 ms | Seek + rotational |
| 7200 RPM Desktop HDD | ~4-15 ms | ~4-15 ms | Seek + rotational |
| 5400 RPM Laptop HDD | ~8-20 ms | ~8-20 ms | Seek + rotational |
| Tape (LTO) | ~10-60 s | ~10-60 s | Mechanical seek |
HDD Latency Deep Dive
Hard disk drive latency is dominated by mechanical delays:
$$L_{HDD} = L_{seek} + L_{rotational} + L_{transfer}$$
Seek Time ($L_{seek}$): Time to move read/write heads to the target track. Depends on distance traveled; full-stroke seeks take 15-20 ms, adjacent track seeks ~1-2 ms, average ~8-10 ms.
Rotational Latency ($L_{rotational}$): Time waiting for the target sector to rotate under the head. On average, half a revolution:

$$L_{rotational} = \frac{1}{2} \times \frac{60}{\text{RPM}} \text{ seconds}$$

For a 7200 RPM drive this is about 4.2 ms; for a 15K RPM drive, about 2 ms.
Transfer Time ($L_{transfer}$): Negligible for small reads (well under 0.1 ms for a 4 KB block at modern media transfer rates).
This mechanical latency of 5-15 ms per random access fundamentally limits HDD IOPS to ~100-200, explaining why SSDs revolutionized workloads with random access patterns.
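The arithmetic behind that IOPS ceiling is a one-liner; the drive parameters below are illustrative assumptions in the ranges quoted above.

```python
# Random-read IOPS ceiling for an HDD: one access = seek + average rotational delay.
# Transfer time for a 4 KB block is ignored (well under 0.1 ms).
def hdd_random_iops(avg_seek_ms: float, rpm: int) -> float:
    rotational_ms = 0.5 * 60_000 / rpm   # half a revolution, in milliseconds
    access_ms = avg_seek_ms + rotational_ms
    return 1000.0 / access_ms            # accesses per second

print(f"7200 RPM desktop:   {hdd_random_iops(9.0, 7200):.0f} IOPS")   # ~76
print(f"15K RPM enterprise: {hdd_random_iops(4.0, 15000):.0f} IOPS")  # ~167
```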
SSD Latency Deep Dive
Solid-state drives eliminate mechanical delays but introduce their own latency sources:
NAND Flash Latency: a page read takes roughly 25 µs (SLC) to 50-100 µs (TLC/QLC); a page program takes roughly 200-700 µs; a block erase takes several milliseconds. Drive-level read latency is dominated by the NAND read time plus controller overhead.
Write Amplification: When writing to a previously-written block, the SSD must read the block's still-valid pages, erase the block (a millisecond-scale operation), and program the merged old and new data back.
This "read-modify-write" cycle can increase write latency 10-100× under adverse conditions.
Garbage Collection: Background reorganization to reclaim space affects latency unpredictably. Enterprise SSDs use spare capacity, DRAM, and power-loss protection to defer or hide GC; consumer drives may stall during GC.
Controller Overhead: Command processing, wear leveling decisions, error correction (LDPC decoding), and encryption add 10-50 µs per operation.
Network Latency
Network latency components include:
Propagation Delay: Speed of light in fiber: ~5 µs per kilometer. New York to London (~5,500 km) ≈ 28 ms one-way minimum.
Transmission Delay: Time to send bits onto the wire. A 1 KB packet on 1 Gbps link = 8 µs. Dominates on slow links.
Processing Delay: Router/switch processing: ~1-10 µs per hop for hardware switching; ~100+ µs for software routers.
Queuing Delay: Time waiting in router/switch buffers. Highly variable; can add milliseconds under congestion.
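A rough one-way latency estimate simply adds these four components. The sketch below uses illustrative parameters (packet size, hop counts, queuing allowance are assumptions); queuing is the only term that swings strongly with load.

```python
# One-way network latency estimate: propagation + transmission + processing + queuing.
def one_way_latency_us(distance_km: float, packet_bytes: int, link_bps: float,
                       hops: int, per_hop_us: float, queuing_us: float) -> float:
    propagation = distance_km * 5.0                   # ~5 us per km in fiber
    transmission = packet_bytes * 8 / link_bps * 1e6  # serialization onto the wire
    processing = hops * per_hop_us                    # switch/router forwarding
    return propagation + transmission + processing + queuing_us

# Same data center: short distance, few hops, 10 Gbps links
print(f"{one_way_latency_us(0.5, 1500, 10e9, 4, 2, 20):.0f} us")    # ~32 us
# Cross-ocean (NY-London scale): distance dominates everything else
print(f"{one_way_latency_us(5500, 1500, 10e9, 15, 5, 200):.0f} us") # ~27,800 us (~28 ms)
```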
| Network Path | Typical Round-Trip Latency | Primary Factor |
|---|---|---|
| Localhost (loopback) | 5-50 µs | OS kernel processing |
| Same rack (data center) | 50-100 µs | Switch latency |
| Same data center | 100-500 µs | Multiple hops |
| Same region | 1-5 ms | Distance + routing |
| Cross-continent | 50-150 ms | Speed of light |
| Satellite | 500-700 ms | Geostationary orbit |
Network latency has an irreducible minimum: the speed of light. Light in fiber travels at ~200,000 km/s, so circumnavigating Earth (~40,000 km) requires at least 200 ms. No protocol optimization can beat physics. This is why content delivery networks (CDNs) and edge computing exist—to bring computation closer to users.
Accurate latency measurement is both essential and challenging. Subtle measurement errors can lead to dramatically wrong conclusions about system performance.
Measurement Challenges
1. Timer Resolution System clocks have granularity limits. Windows' GetTickCount64() has ~15 ms resolution—useless for microsecond SSD latencies. Use high-resolution timers:

- clock_gettime(CLOCK_MONOTONIC_RAW) — nanosecond resolution
- QueryPerformanceCounter() — typically sub-microsecond

2. Measurement Overhead The act of measuring affects results. System calls, memory allocation, and lock acquisition in measurement code add noise. Minimize instrumentation in hot paths.
3. Caching Effects First accesses are often slower than subsequent ones (cold vs. warm cache). Decide whether to measure cold or warm latency based on your workload's characteristics.
4. Statistical Validity Single measurements are meaningless for stochastic systems. Collect thousands of samples; report percentiles, not just averages.
```c
/**
 * Precise I/O Latency Measurement Framework
 *
 * Demonstrates best practices for accurate latency measurement:
 * - High-resolution timers
 * - O_DIRECT to bypass OS caching
 * - Statistical analysis with percentiles
 * - Warmup iterations
 */
#define _GNU_SOURCE   /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

#define NUM_SAMPLES    10000
#define WARMUP_SAMPLES 1000
#define BLOCK_SIZE     4096

typedef struct {
    double min;
    double max;
    double mean;
    double p50;   // Median
    double p90;
    double p99;
    double p999;  // Tail latency
} LatencyStats;

/**
 * High-resolution timer using CLOCK_MONOTONIC_RAW
 * Avoids NTP adjustments that can cause discontinuities
 */
static inline uint64_t get_time_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/**
 * Comparison function for qsort
 */
static int compare_double(const void* a, const void* b) {
    double da = *(const double*)a;
    double db = *(const double*)b;
    return (da > db) - (da < db);
}

/**
 * Calculate comprehensive latency statistics
 */
LatencyStats calculate_latency_stats(double* latencies, int count) {
    LatencyStats stats = {0};

    // Sort for percentile calculation
    qsort(latencies, count, sizeof(double), compare_double);

    stats.min  = latencies[0];
    stats.max  = latencies[count - 1];
    stats.p50  = latencies[count / 2];
    stats.p90  = latencies[(int)(count * 0.90)];
    stats.p99  = latencies[(int)(count * 0.99)];
    stats.p999 = latencies[(int)(count * 0.999)];

    // Calculate mean
    double sum = 0;
    for (int i = 0; i < count; i++) {
        sum += latencies[i];
    }
    stats.mean = sum / count;

    return stats;
}

/**
 * Measure random read latency with O_DIRECT
 */
void measure_read_latency(const char* device, size_t device_size) {
    // Open with O_DIRECT to bypass OS buffer cache
    int fd = open(device, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("Failed to open device");
        return;
    }

    // Aligned buffer for O_DIRECT
    void* buffer;
    if (posix_memalign(&buffer, BLOCK_SIZE, BLOCK_SIZE) != 0) {
        close(fd);
        return;
    }

    double* latencies = malloc(NUM_SAMPLES * sizeof(double));

    // Random offsets for reads (block-aligned)
    size_t num_blocks = device_size / BLOCK_SIZE;

    // Warmup phase - discard results
    printf("Performing %d warmup reads...\n", WARMUP_SAMPLES);
    for (int i = 0; i < WARMUP_SAMPLES; i++) {
        off_t offset = (rand() % num_blocks) * BLOCK_SIZE;
        lseek(fd, offset, SEEK_SET);
        read(fd, buffer, BLOCK_SIZE);
    }

    // Measurement phase
    printf("Measuring %d random reads...\n", NUM_SAMPLES);
    for (int i = 0; i < NUM_SAMPLES; i++) {
        off_t offset = (rand() % num_blocks) * BLOCK_SIZE;

        // Position
        lseek(fd, offset, SEEK_SET);

        // Time the read operation
        uint64_t start = get_time_ns();
        ssize_t bytes = read(fd, buffer, BLOCK_SIZE);
        uint64_t end = get_time_ns();

        if (bytes != BLOCK_SIZE) {
            fprintf(stderr, "Short read at sample %d\n", i);
            continue;
        }

        // Store latency in microseconds
        latencies[i] = (end - start) / 1000.0;
    }

    // Calculate and report statistics
    LatencyStats stats = calculate_latency_stats(latencies, NUM_SAMPLES);

    printf("\n=== Latency Statistics (microseconds) ===\n");
    printf("Samples: %d\n", NUM_SAMPLES);
    printf("Min:     %.2f µs\n", stats.min);
    printf("Mean:    %.2f µs\n", stats.mean);
    printf("p50:     %.2f µs (median)\n", stats.p50);
    printf("p90:     %.2f µs\n", stats.p90);
    printf("p99:     %.2f µs\n", stats.p99);
    printf("p99.9:   %.2f µs (tail)\n", stats.p999);
    printf("Max:     %.2f µs\n", stats.max);

    // Report derived IOPS capability
    printf("\n=== Derived Performance ===\n");
    printf("Mean IOPS: %.0f\n", 1000000.0 / stats.mean);
    printf("p99 IOPS:  %.0f\n", 1000000.0 / stats.p99);

    free(latencies);
    free(buffer);
    close(fd);
}
```

Essential Latency Metrics
When reporting latency, always include: the sample count and measurement duration; the workload parameters (block size, queue depth, read/write mix); whether caches were cold or warm; and the full distribution (minimum, mean, p50, p90, p99, p99.9, maximum) rather than a single average.
Percentiles, Not Just Averages
Averages hide critically important information. Consider two systems:
| System | Mean | p99 | p99.9 |
|---|---|---|---|
| A | 1 ms | 5 ms | 50 ms |
| B | 2 ms | 3 ms | 4 ms |
System B appears worse by average, but System A has catastrophic tail latency. At scale, those p99.9 outliers become frequent: with 10,000 requests, System A will see ~10 operations taking 50+ ms, devastating aggregate response time.
Standard Percentiles: p50 (median, the typical request), p90 and p95 (what the slowest 10% and 5% of requests experience), p99 (a common SLA target), and p99.9/p99.99 (the tail, which dominates at high fan-out).
Many benchmarks suffer from "coordinated omission"—when a slow operation occurs, the benchmark waits, delaying subsequent measurements and hiding the full impact of the stall. Proper measurement tracks intended submission times, not just time between completions. Tools like Gil Tene's HdrHistogram address this by recording wait time for operations that should have been issued but weren't.
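A minimal sketch of the idea (not HdrHistogram itself, just the scheduling trick): issue operations on a fixed schedule and measure each latency from its intended start time, so a stall that delays later operations is charged to them as well.

```python
import time

def measure_fixed_rate(op, rate_per_s: float, duration_s: float) -> list[float]:
    """Latencies (seconds) measured from each operation's *intended* start time.

    `op` is any zero-argument callable. If one call stalls, later calls start
    late and that backlog is counted, avoiding coordinated omission.
    """
    interval = 1.0 / rate_per_s
    start = time.perf_counter()
    latencies = []
    for i in range(int(rate_per_s * duration_s)):
        intended = start + i * interval
        now = time.perf_counter()
        if now < intended:
            time.sleep(intended - now)          # wait for the scheduled issue time
        op()
        latencies.append(time.perf_counter() - intended)  # includes any backlog delay
    return latencies
```

Measuring from `intended` rather than from the actual call time is the whole trick: a 100 ms stall shows up in every operation that was queued behind it.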
Latency and throughput are often discussed separately, but they are deeply interconnected. Understanding their relationship is crucial for capacity planning and performance optimization.
Little's Law
The fundamental relationship between latency, throughput, and concurrency is captured by Little's Law:
$$L = \lambda \times W$$
Where: $L$ = average number of operations in the system (concurrency or queue depth), $\lambda$ = throughput (operations per second), and $W$ = average latency (time each operation spends in the system).
Rearranging: $\lambda = L / W$
Implications for I/O systems: with per-operation latency fixed, throughput can only rise by increasing concurrency ($\lambda = L / W$); and once a device saturates, adding concurrency no longer raises throughput, it only lengthens queues and inflates latency. A worked example follows.
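For example, under the assumptions of Little's Law, a device whose operations complete in 100 µs needs roughly 100 requests in flight to sustain one million IOPS:

```python
# Little's Law rearranged: required concurrency L = target throughput * latency.
def required_concurrency(target_iops: float, latency_s: float) -> float:
    return target_iops * latency_s

print(required_concurrency(1_000_000, 100e-6))  # 100 requests in flight
print(required_concurrency(10_000, 100e-6))     # 1 -- queue depth 1 suffices for 10K IOPS
```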
The Latency-Throughput Curve
As load increases, latency follows a predictable pattern:
Light load: Latency is near minimum (no queuing). Throughput proportional to load.
Moderate load: Latency begins increasing as queues form. Throughput continues rising.
Heavy load: "Hockey stick" effect—latency increases sharply. Throughput approaches maximum.
Saturation: Latency grows without bound. Throughput plateaus or even degrades due to overhead.
This behavior is modeled by queuing theory. For an M/M/1 queue (single server with Poisson arrivals):
$$W = \frac{1}{\mu - \lambda}$$
Where: $W$ = average time in the system (waiting plus service), $\mu$ = service rate (operations per second the server can complete), and $\lambda$ = arrival rate.
As λ approaches μ, latency approaches infinity. This is why operating at high utilization (>80%) dramatically increases latency variability.
| Utilization | Mean Latency (relative to idle) | Practical Implication |
|---|---|---|
| 10% | 1.1× | Baseline - minimal queuing |
| 50% | 2× | Moderate queuing; acceptable for most workloads |
| 70% | 3.3× | Noticeable delays; approaching threshold |
| 80% | 5× | Significant queuing; typical SLA limit |
| 90% | 10× | Severe delays; tail latency explodes |
| 95% | 20× | Near saturation; unacceptable for interactive |
| 99% | 100× | System effectively unusable for interactive work |
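The table's multipliers come straight from the M/M/1 formula: relative to the unloaded service time $1/\mu$, the mean time in system is $1/(1-\rho)$ where $\rho = \lambda/\mu$ is utilization. A few lines reproduce them:

```python
# M/M/1 mean time in system W = 1 / (mu - lambda); relative to 1/mu this is 1 / (1 - rho).
for rho in (0.10, 0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:4.0%}: {1 / (1 - rho):6.1f}x")
```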
Bandwidth-Delay Product
For networks and high-speed links, the bandwidth-delay product (BDP) determines optimal buffer sizes and concurrency:
$$BDP = Bandwidth \times RTT$$
Example: A 10 Gbps link with 50 ms RTT: $$BDP = 10 \times 10^9 \text{ b/s} \times 0.050 \text{ s} = 500 \text{ Mb} = 62.5 \text{ MB}$$
To fully utilize this link, 62.5 MB must be "in flight" at any time. With 1 MB requests, you need 63 concurrent requests. With 64 KB requests: nearly 1,000 concurrent requests.
This explains why high-latency links require either large transfers or deep parallelism to achieve high throughput.
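The same arithmetic generalizes: the required in-flight data is bandwidth × RTT, and the required request concurrency is that divided by the request size. The sketch below reuses the 10 Gbps / 50 ms example (decimal MB and KB, matching the text above).

```python
import math

def in_flight_requests(bandwidth_bps: float, rtt_s: float, request_bytes: int) -> int:
    bdp_bytes = bandwidth_bps * rtt_s / 8        # bandwidth-delay product in bytes
    return math.ceil(bdp_bytes / request_bytes)

BW, RTT = 10e9, 0.050                            # 10 Gbps link, 50 ms round trip
print(in_flight_requests(BW, RTT, 1_000_000))    # 1 MB requests  -> 63 in flight
print(in_flight_requests(BW, RTT, 64_000))       # 64 KB requests -> ~977 in flight
```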
For most workloads, operate at 50-70% utilization—the 'knee' of the latency curve where throughput is high but latency remains acceptable. Reserve headroom for demand spikes. Running consistently at 90%+ utilization optimizes throughput at the cost of unpredictable user experience.
In large-scale distributed systems, tail latency—the latency of the slowest requests (p99, p99.9, p99.99)—often matters more than median or mean latency. This phenomenon, explored extensively by Google engineers, fundamentally shapes how high-performance systems are designed.
The Tail-at-Scale Problem
Consider a service that must query 100 backend servers and wait for all of them before assembling a response. If each backend is slow (at or beyond its own p99) just 1% of the time, the probability that at least one backend in the fan-out is slow is $1 - 0.99^{100} \approx 63\%$, so most user requests experience some backend's tail latency.
As fan-out increases, tail latency dominates:
| Fan-out | Probability of hitting at least one slow server |
|---|---|
| 1 | 1% |
| 10 | 9.6% |
| 50 | 39.5% |
| 100 | 63.4% |
| 1000 | 99.996% |
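The table values follow directly from independent-probability arithmetic, assuming each backend lands in its slowest 1% independently:

```python
# Probability that at least one of n independent backends is in its slowest 1%.
p_slow = 0.01
for n in (1, 10, 50, 100, 1000):
    print(f"fan-out {n:4d}: {1 - (1 - p_slow) ** n:8.3%}")
```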
Sources of Tail Latency
Tail latency arises from many sources, often in combination:
1. Resource Contention: shared CPUs, memory bandwidth, locks, and caches that occasionally force one request to wait on another

2. Background Activities: garbage collection, compaction, flushing, and kernel housekeeping that periodically steal the CPU or the device

3. Hardware Variability: SSD garbage collection, thermal throttling, and power-state transitions

4. Queueing Effects: transient bursts that push one layer of the stack toward saturation
"""Hedged Request Pattern for Tail Latency Reduction This pattern issues duplicate requests after a short delay,using whichever completes first. Dramatically reduces p99latency at the cost of increased server load.""" import asyncioimport aiohttpfrom typing import List, Any, Optionalimport time class HedgedRequestClient: """ Client that implements hedged requests to reduce tail latency. Strategy: 1. Send primary request immediately 2. After hedge_delay_ms, send backup request (if primary hasn't completed) 3. Return first successful response, cancel the other """ def __init__( self, servers: List[str], hedge_delay_ms: float = 10.0, max_hedged_requests: int = 2 ): self.servers = servers self.hedge_delay_ms = hedge_delay_ms self.max_hedged_requests = max_hedged_requests self.session: Optional[aiohttp.ClientSession] = None async def __aenter__(self): self.session = aiohttp.ClientSession() return self async def __aexit__(self, *args): await self.session.close() async def fetch_with_hedging(self, path: str) -> dict: """ Fetch with hedged requests to reduce tail latency. Returns the first successful response. Cancels outstanding requests after first completes. """ start_time = time.perf_counter() # Create primary request primary_server = self.servers[0] tasks = [ asyncio.create_task( self._fetch_from_server(primary_server, path), name=f"primary-{primary_server}" ) ] # Schedule hedged requests with delays async def delayed_hedged_request(server: str, delay: float): await asyncio.sleep(delay / 1000) # Convert ms to seconds return await self._fetch_from_server(server, path) for i, server in enumerate(self.servers[1:self.max_hedged_requests]): delay = self.hedge_delay_ms * (i + 1) tasks.append( asyncio.create_task( delayed_hedged_request(server, delay), name=f"hedge-{server}" ) ) # Wait for first successful completion done, pending = await asyncio.wait( tasks, return_when=asyncio.FIRST_COMPLETED ) # Cancel pending requests for task in pending: task.cancel() try: await task except asyncio.CancelledError: pass # Get result from completed task result_task = done.pop() elapsed_ms = (time.perf_counter() - start_time) * 1000 result = await result_task result['hedging_info'] = { 'winning_server': result_task.get_name(), 'elapsed_ms': elapsed_ms, 'hedged_requests_issued': len(done) + len(pending) } return result async def _fetch_from_server(self, server: str, path: str) -> dict: """Make actual HTTP request to a server.""" url = f"http://{server}{path}" async with self.session.get(url) as response: data = await response.json() return { 'server': server, 'status': response.status, 'data': data } # Usage exampleasync def example_usage(): servers = [ "server1.example.com:8080", "server2.example.com:8080", "server3.example.com:8080" ] async with HedgedRequestClient( servers=servers, hedge_delay_ms=10.0, # Send hedge after 10ms max_hedged_requests=2 # At most 2 total requests ) as client: result = await client.fetch_with_hedging("/api/data") print(f"Response from: {result['hedging_info']['winning_server']}") print(f"Latency: {result['hedging_info']['elapsed_ms']:.2f} ms")Hedged requests increase server load by up to 2× (or more with higher hedge counts). Only use hedging for latency-critical paths where the additional load is acceptable. Set hedge delays based on observed latency percentiles—typically just above median latency. Too short: excessive load. Too long: minimal benefit.
Reducing I/O latency requires a multi-layered approach addressing hardware, operating system, and application concerns.
Hardware-Level Optimizations

At the hardware level, the largest gains come from lower-latency media and shorter paths: NVMe rather than SATA, storage-class memory (such as Optane) for the hottest data, NICs that support kernel bypass or RDMA, keeping devices on the same NUMA node as the CPUs that drive them, and physically co-locating latency-critical services with their data.
Operating System Optimizations
The OS software stack can add significant latency; system tuning reduces this overhead:
```bash
#!/bin/bash
# Linux Latency Optimization Settings

# ============================================
# 1. CPU SCHEDULING FOR LOW LATENCY
# ============================================

# Set scheduler to deadline for latency-sensitive workloads
# (Alternative: SCHED_FIFO for real-time requirements)

# For NVMe devices - use 'none' scheduler (device handles scheduling)
echo "none" > /sys/block/nvme0n1/queue/scheduler

# ============================================
# 2. INTERRUPT HANDLING
# ============================================

# Disable irqbalance for dedicated latency-sensitive systems
systemctl stop irqbalance

# Pin interrupts to specific CPUs
# Find NVMe interrupts
cat /proc/interrupts | grep nvme

# Set CPU affinity for each interrupt (example for IRQ 43)
echo 2 > /proc/irq/43/smp_affinity   # Bitmask: pin to CPU 1

# ============================================
# 3. NUMA OPTIMIZATION
# ============================================

# Check NUMA node for NVMe device
cat /sys/block/nvme0n1/device/numa_node

# Run application on same NUMA node as storage
numactl --cpunodebind=0 --membind=0 ./latency_sensitive_app

# ============================================
# 4. KERNEL BYPASS OPTIONS
# ============================================

# Enable polling mode for NVMe (reduces interrupt latency)
# Requires kernel support and can increase CPU usage
echo 1 > /sys/block/nvme0n1/queue/io_poll

# For io_uring: enable sqpoll for kernel-side polling
# (configured in application code, not sysctl)

# ============================================
# 5. NETWORK LATENCY TUNING
# ============================================

# Enable TCP low latency mode
sysctl -w net.ipv4.tcp_low_latency=1

# Disable Nagle's algorithm per socket with TCP_NODELAY
# (a setsockopt() option in application code; there is no global sysctl for it)

# Reduce SYN/ACK retransmit delays
sysctl -w net.ipv4.tcp_synack_retries=2

# Enable busy polling for sockets (trades CPU for latency)
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

# ============================================
# 6. MEMORY MANAGEMENT
# ============================================

# Disable transparent huge pages (can cause latency spikes)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Lock critical application pages in memory
# (done via mlockall() in application)

# Reduce swappiness for latency-sensitive systems
sysctl -w vm.swappiness=1

# ============================================
# 7. CGROUP ISOLATION
# ============================================

# Create isolated cgroup for latency-sensitive workload
mkdir -p /sys/fs/cgroup/latency_critical

# Dedicate CPUs
# (the cpuset controller must be enabled in the parent's cgroup.subtree_control)
echo "0-3" > /sys/fs/cgroup/latency_critical/cpuset.cpus
echo 0 > /sys/fs/cgroup/latency_critical/cpuset.mems

# Limit competing I/O via cgroup v2 io.max; format is "MAJ:MIN key=value".
# Substitute your device's major:minor numbers from lsblk, e.g.:
# echo "259:0 riops=max wiops=max" > /sys/fs/cgroup/latency_critical/io.max
```

Application-Level Optimizations
Application design choices have the largest impact on achieved latency: keep hot paths free of synchronous, blocking I/O; cache aggressively so most requests never touch the slow tier; batch or coalesce small operations where the workload allows; issue independent I/Os in parallel (or use hedged requests, as above) rather than serially; reuse connections and pre-allocated buffers to avoid setup costs; and set explicit timeouts so one slow dependency cannot consume the whole latency budget.
Each layer boundary (user→kernel, kernel→device, local→network) typically adds ~10× latency. A user-space computation takes nanoseconds; a syscall takes microseconds; a storage I/O takes tens of microseconds to milliseconds. Minimize layer crossings in latency-critical paths.
Understanding latency in practice requires examining real-world scenarios where latency choices directly impact system behavior.
Case Study 1: Database Query Latency
A database query's latency compounds across multiple I/O operations (index lookups, data page reads, log writes), each paying the storage latency described above.
Total: 220-1950 µs (0.2-2 ms) for an NVMe-backed database
For HDD: Replace 50-100 µs reads with 5-15 ms → 5-150 ms total
This 50-100× latency difference explains why databases benefit enormously from SSDs.
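A rough sketch of that comparison, under the assumption that a query performs a handful of dependent random page reads plus a fixed CPU/software overhead (per-read latencies are taken from the tables earlier on this page; the overhead value is an illustrative assumption):

```python
# Rough model: a query issues `reads` dependent random page reads plus fixed overhead.
def query_latency_ms(reads: int, per_read_us: float, overhead_us: float = 100) -> float:
    return (overhead_us + reads * per_read_us) / 1000

for reads in (2, 10):
    print(f"{reads:2d} reads  NVMe: {query_latency_ms(reads, 80):5.2f} ms   "
          f"HDD: {query_latency_ms(reads, 8_000):6.1f} ms")
# 2 reads:  ~0.26 ms vs ~16 ms;  10 reads: ~0.9 ms vs ~80 ms
```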
Case Study 2: Web Application Response Time
A typical web request involves multiple latency components:
User browser → CDN (50ms if cache miss) → Load balancer (0.5ms)
→ Web server (1ms) → Application server (5-20ms)
→ Cache check (0.1ms) → Database (1-10ms)
→ External API (50-200ms) ← Often the tail!
Observations: the external API call dominates both typical and tail latency; everything inside the data center contributes single-digit milliseconds; and the 50 ms CDN penalty applies only on cache misses. The practical response is to cache or parallelize the external call, put a timeout on it, and treat its p99 as the page's p99.
Case Study 3: High-Frequency Trading
HFT systems push latency limits:
| Component | Target Latency | Technique |
|---|---|---|
| Market data ingestion | < 1 µs | Kernel bypass (DPDK), NIC timestamping |
| Decision logic | < 10 µs | Lock-free, cache-optimized, no allocation |
| Order generation | < 1 µs | Pre-computed templates |
| Network transmission | < 10 µs | FPGA acceleration, co-location |
| Total tick-to-trade | < 25 µs | Every microsecond matters |
At these scales, even cache misses (100 ns L3 → DRAM) impact competitiveness. Code paths are measured in clock cycles, not milliseconds.
Case Study 4: Distributed Storage System
A distributed storage read typically pays one intra-data-center network round trip (~100-500 µs, per the table above), the storage server's software stack (tens of µs), and an NVMe media read (~50-100 µs).

Total: ~200-700 µs for a single-hop read

With replication (quorum read from 2 of 3 nodes): ~300-1,000 µs. Across availability zones: add 1-5 ms per zone crossing.
Successful latency-sensitive systems start with a latency budget (e.g., 'p99 must be < 50 ms'), then allocate that budget across components. If a single external call takes 40 ms, only 10 ms remains for all other work. This drives architectural decisions: caching, async processing, parallel execution, and geographic distribution.
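A latency budget can be as simple as a table that must sum to less than the target. The sketch below uses hypothetical component allocations against a 50 ms p99 target; the names and numbers are assumptions for illustration.

```python
# Hypothetical p99 latency budget (milliseconds) for a 50 ms end-to-end target.
P99_TARGET_MS = 50

budget_ms = {
    "edge / load balancer": 2,
    "application logic":    8,
    "cache + database":     10,
    "external API call":    25,
}

spent = sum(budget_ms.values())
print(f"allocated {spent} ms of {P99_TARGET_MS} ms; headroom {P99_TARGET_MS - spent} ms")
assert spent <= P99_TARGET_MS, "budget exceeded: cache, parallelize, or drop a hop"
```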
Latency—the time for individual operations to complete—often determines user-perceived performance more than throughput. While throughput measures aggregate capacity, latency captures the experience of waiting.
What's Next
With throughput and latency understood, the next page examines bandwidth utilization—how effectively systems use available capacity. We'll explore efficiency metrics, contention effects, and strategies for maximizing useful work from I/O infrastructure.
You now understand I/O latency in depth: its sources, measurement methodology, relationship to throughput, the critical importance of tail latency, and optimization strategies across the stack. This knowledge enables you to diagnose latency issues and design latency-sensitive systems effectively.