In the world of distributed systems, latency is the pulse that reveals everything. It's the time a user waits between clicking a button and seeing a response. It's the delay between a service making a request and receiving an answer. It's the invisible tax on every interaction in your system.
Latency matters because users experience it directly. A page that loads in 100ms feels instantaneous. The same page at 3 seconds feels broken. Research from Google, Amazon, and Microsoft consistently shows that increased latency directly correlates with decreased user engagement, lower conversion rates, and reduced revenue. Amazon famously discovered that every 100ms of added latency cost them 1% in sales.
By the end of this page, you will understand latency at a fundamental level: what it measures, where it originates, how to decompose it into components, and how to reason about latency in distributed system design. You'll develop the vocabulary and mental models that distinguish engineers who build responsive systems from those who build sluggish ones.
Latency is often casually defined as "how long something takes," but this imprecision causes significant confusion in system design conversations. Let's establish rigorous definitions that eliminate ambiguity.
Response Time vs. Latency
These terms are often used interchangeably, but they have subtle differences:
Latency refers specifically to the delay introduced by the system—the time the request spends waiting or being processed, excluding time required by fundamental physics (like the speed of light across network wires).
Response time is the total time from when a client sends a request to when it receives the complete response. It includes network transit time, queuing time, processing time, and all other delays.
In practice, when engineers say "latency," they usually mean response time. For clarity in this module, we'll use latency to mean the end-to-end response time as experienced by the caller.
Latency measurement depends on where you start the clock. Client-perceived latency includes network round-trips. Server-measured latency typically starts when the request arrives at the server. Database latency starts when the query reaches the database engine. Always be explicit about which latency you're measuring—confusion here derails performance discussions.
The Anatomy of a Request's Lifetime
To truly understand latency, we must decompose the journey of a single request through a distributed system. Consider a user loading their profile page:
| Phase | What Happens | Typical Duration | Where Measured |
|---|---|---|---|
| DNS Resolution | Browser resolves domain to IP address | 0-150ms (cached: 0ms) | Client |
| TCP Connection | Three-way handshake establishes connection | 10-100ms (depends on distance) | Client |
| TLS Handshake | Cryptographic negotiation for HTTPS | 20-200ms (1-2 round trips, depending on TLS version) | Client |
| Request Transmission | HTTP request travels over network | 1-50ms (depends on payload) | Network |
| Load Balancer Routing | Request routed to appropriate server | 0.1-5ms | Infrastructure |
| Request Queuing | Request waits in server queue | 0-1000ms+ (depends on load) | Server |
| Application Processing | Business logic execution | 5-500ms (depends on complexity) | Server |
| Database Queries | Data retrieval from storage | 1-100ms per query | Database |
| Response Transmission | HTTP response travels back | 1-100ms (depends on payload) | Network |
| Client Rendering | Browser processes and displays response | 10-500ms | Client |
The Critical Insight:
A single "simple" request can touch dozens of components, each contributing latency. A profile page request might involve:
Each hop adds latency. Understanding this decomposition is the first step to optimization—you can't fix what you can't measure and attribute.
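To make that attribution concrete, here is a minimal sketch in Python. The phase names and timings are hypothetical, not taken from a real trace; the point is simply to sum the phases of one request and rank them by contribution:

```python
# Sketch: attribute one request's end-to-end latency to its phases.
# Phase names and durations are hypothetical examples, not real measurements.
phases_ms = {
    "dns": 0.0,               # cached
    "tcp_connect": 28.0,
    "tls_handshake": 55.0,
    "request_transmit": 4.0,
    "queueing": 12.0,
    "app_processing": 140.0,
    "db_queries": 65.0,
    "response_transmit": 9.0,
}

total = sum(phases_ms.values())
print(f"end-to-end: {total:.0f}ms")
for name, ms in sorted(phases_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>18}: {ms:6.1f}ms ({ms / total:5.1%})")
```

Ranking phases by share of the total immediately shows where optimization effort will pay off (here, application processing and database queries dominate).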
Latency doesn't come from nowhere—it has identifiable sources: network propagation and transmission, queuing under load, application processing, serialization and deserialization, disk and database I/O, and pauses such as garbage collection. Expert engineers develop an intuition for these sources so they can quickly diagnose where delays originate.
Latency in distributed systems compounds. If Service A calls Service B, which calls Service C, the total latency is at least LA + LB + LC (and often more due to coordination overhead). This is why microservice architectures can suffer from "latency multiplication"—each additional hop in the call chain adds to the critical path.
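A quick simulation makes the compounding visible. This sketch assumes illustrative lognormal latency distributions for three chained services (the numbers are invented for demonstration): per-hop latencies add along a synchronous call path, and the chain's tail stretches well past the sum of the per-hop medians.

```python
import math
import random

def hop_latency_ms(median_ms: float) -> float:
    """One service's latency, drawn from an illustrative lognormal distribution."""
    return random.lognormvariate(math.log(median_ms), 0.5)

def chain_latency_ms() -> float:
    # Synchronous chain A -> B -> C: the caller waits for every hop, so latencies add.
    return hop_latency_ms(20) + hop_latency_ms(35) + hop_latency_ms(50)

samples = sorted(chain_latency_ms() for _ in range(100_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(0.99 * len(samples))]
print(f"sum of medians: 105ms   chain p50: {p50:.0f}ms   chain p99: {p99:.0f}ms")
```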
Measurement is fundamental to performance engineering. You cannot optimize what you don't measure, and you cannot trust measurements you don't understand. Latency measurement comes with significant pitfalls that trip up even experienced engineers.
Where to Measure
The location of measurement determines what you learn:
| Measurement Point | What It Captures | What It Misses | Use Case |
|---|---|---|---|
| Client Application | Full user-perceived latency including rendering | N/A (most complete) | User experience monitoring (RUM) |
| Client Network Edge | Network + server latency | Client-side rendering | Mobile app performance |
| Load Balancer | Server processing + downstream latency | Client-to-LB network latency | Infrastructure monitoring |
| Application Server | Application processing time | Network and LB latency | Application profiling |
| Database | Query execution time | Application logic and network | Database optimization |
The Coordinated Omission Problem
One of the most insidious measurement errors is coordinated omission. Most benchmarking tools measure latency like this: send a request, wait for the full response, record the elapsed time, and only then send the next request.
The problem: if request 1 takes 5 seconds (due to a GC pause, for example), request 2 never gets measured during that slow period. The benchmark appears to show all requests completing quickly, hiding the fact that the system was unresponsive for 5 seconds.
The fix: Measure at the intended send time, not the actual send time. If you intended to send 100 requests/second, measure the latency from when each request should have been sent, even if the previous request blocked you.
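Here is a minimal sketch of a corrected load generator, assuming a constant intended request rate; send_request() is a hypothetical stand-in for your client call. Latency is measured from each request's scheduled send time, so a stall inflates the recorded latency of every request queued behind it instead of silently disappearing:

```python
import time

def send_request():
    """Hypothetical stand-in for issuing one request and waiting for the reply."""
    time.sleep(0.01)  # pretend the server takes ~10ms

def run_load(rate_per_sec: float, total_requests: int) -> list[float]:
    interval = 1.0 / rate_per_sec
    start = time.monotonic()
    latencies = []
    for i in range(total_requests):
        intended_send = start + i * interval      # when this request *should* go out
        now = time.monotonic()
        if now < intended_send:
            time.sleep(intended_send - now)       # keep the intended schedule
        send_request()
        # Key fix: measure from the intended send time, not the actual send time,
        # so delays caused by earlier slow requests are not silently omitted.
        latencies.append(time.monotonic() - intended_send)
    return latencies

lat = run_load(rate_per_sec=100, total_requests=200)
print(f"max observed latency: {max(lat) * 1000:.1f}ms")
```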
Use purpose-built tools like HDR Histogram (available in most languages) for latency measurement. They capture the full distribution with high precision across a wide range of values while using minimal memory, and they provide facilities for correcting coordinated omission in recorded data.
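As a sketch of recording into such a histogram, assuming the Python hdrhistogram package (the import path and method names below follow its documentation and may differ in other language bindings):

```python
from hdrh.histogram import HdrHistogram

# Track values from 1 microsecond to 1 hour (in microseconds), 3 significant digits.
histogram = HdrHistogram(1, 60 * 60 * 1_000_000, 3)

for latency_us in (800, 950, 1_200, 15_000, 2_400_000):  # illustrative samples
    histogram.record_value(latency_us)

for p in (50, 90, 99, 99.9):
    print(f"p{p}: {histogram.get_value_at_percentile(p)}us")
```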
Timing Resolution and Accuracy
The clock you use matters:
Wall-clock time (System.currentTimeMillis(), Date.now()): Millisecond resolution, can jump forward or backward due to NTP adjustments. Fine for coarse measurements.
Monotonic clocks (System.nanoTime(), performance.now()): Won't go backward, microsecond or better resolution. Required for accurate short-duration measurements.
CPU cycle counters (RDTSC): Nanosecond precision but complex to use correctly. Reserved for extreme precision needs.
Rule of thumb: For latencies under 100ms, use monotonic clocks. The error from wall-clock adjustments can exceed your measurement.
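In Python terms (the equivalents of the Java and JavaScript calls above), the rule of thumb looks like this:

```python
import time

def timed_ms(fn) -> float:
    """Measure a short operation with a monotonic, high-resolution clock."""
    start = time.perf_counter()   # monotonic: never jumps backward, sub-microsecond resolution
    fn()
    return (time.perf_counter() - start) * 1000

# time.time() is the wall clock: NTP adjustments can move it forward or backward,
# which can make a short measurement negative or wildly wrong.
print(f"work took {timed_ms(lambda: sum(range(1_000_000))):.2f}ms")
```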
A critical insight that separates novice from expert thinking: latency is not a single number—it's a probability distribution.
When someone asks "What's the latency of your API?", there's no single correct answer. Every request has a different latency. Some complete in 50ms, others in 200ms, and occasionally one takes 5 seconds. The question should be "What does your latency distribution look like?"
Why Averages Lie
The arithmetic mean (average) is the most commonly reported statistic and the most misleading for latency:
Averages are dominated by the bulk of requests and hide the long tail. Real user experience includes the tail.
In high-scale systems, the tail dominates experience. If you handle 1 million requests per day and 0.1% of them are outliers, that's 1,000 bad experiences every day. And if each user session involves 100 API calls, even a 1% chance of any single call being slow means 63% of sessions include at least one slow request.
Percentiles Tell the Real Story
Percentiles (quantiles) describe the distribution without hiding the tail:
A complete latency profile might be: p50: 50ms, p90: 75ms, p95: 100ms, p99: 250ms, p99.9: 2s
This tells us most requests are fast (50ms), but the tail is long (2s at p99.9). An average of 60ms would hide this entirely.
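A small sketch of how such a profile emerges from raw samples, using a simulated workload (most requests around 50ms, a fraction of one percent hitting a 2-second stall) and nearest-rank percentiles:

```python
import random

def percentile(sorted_ms: list[float], q: float) -> float:
    """Nearest-rank percentile of an already-sorted list of latencies."""
    idx = min(len(sorted_ms) - 1, int(q / 100 * len(sorted_ms)))
    return sorted_ms[idx]

# Illustrative workload: most requests ~50ms, 0.2% hit a 2-second stall.
samples = sorted(
    random.gauss(50, 8) if random.random() > 0.002 else 2000.0
    for _ in range(100_000)
)
for q in (50, 90, 95, 99, 99.9):
    print(f"p{q}: {percentile(samples, q):.0f}ms")
print(f"mean: {sum(samples) / len(samples):.0f}ms  <- looks healthy, hides the 2s tail")
```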
| Percentile | Share of Requests Slower | At 1M Requests/Day | Typical Focus |
|---|---|---|---|
| p50 (Median) | 50% | 500,000 requests slower than this | General performance baseline |
| p90 | 10% | 100,000 requests slower than this | Common SLA target |
| p95 | 5% | 50,000 requests slower than this | Aggressive SLA target |
| p99 | 1% | 10,000 requests slower than this | Premium tier SLA |
| p99.9 | 0.1% | 1,000 requests slower than this | High-value customer SLA |
| p99.99 | 0.01% | 100 requests slower than this | Trading/real-time systems |
Latency metrics only matter because they affect real users. Understanding the human perception of latency helps you set appropriate targets and prioritize optimization work.
Human Perception Thresholds
Decades of human-computer interaction research have established clear perception boundaries:
| Latency Range | User Perception | Appropriate Use Cases |
|---|---|---|
| 0-100ms | Instantaneous, direct manipulation feeling | Keystrokes, UI feedback, drag operations |
| 100-300ms | Slight delay but feels responsive | Button clicks, simple API calls, navigation |
| 300-1000ms | Noticeable delay, attention may wander | Page loads, form submissions, complex queries |
| 1-5 seconds | Significant delay, user may lose focus | Complex operations with progress indicator |
| 5-10 seconds | Frustrating, user considers abandoning | Only for background tasks with clear feedback |
| 10+ seconds | Intolerable for interactive use | Must be async/background only |
The 100ms Rule
For interactive elements (buttons, links, controls), the 100ms threshold is critical. Below 100ms, users perceive instant response. Above 100ms, they perceive a delay. This threshold is grounded in decades of human-computer interaction research: feedback that arrives within roughly 100ms feels like a direct consequence of the user's action, while anything slower registers as the system responding to it.
Implications for system design: immediate feedback (button states, keystroke echo, optimistic UI updates) must come from the client itself, because a single cross-region network round trip can consume the entire 100ms budget before the server has done any work.
Users don't experience latency directly—they experience perceived performance. Techniques like skeleton screens, progressive loading, and optimistic UI updates can make a 1-second operation feel faster than a 500ms operation that shows a blank screen. Design for perception, not just measurement.
The Business Impact
Study after study confirms latency's business consequences; Amazon's finding that every 100ms of added latency cost 1% in sales, cited earlier, is only the most widely quoted example.
These aren't edge cases—they're consistent patterns across industries. Latency is not just an engineering concern; it's a business-critical metric.
Professional systems engineering uses latency budgets to allocate the acceptable delay across system components. This transforms vague goals ("make it fast") into precise, measurable contracts.
What is a Latency Budget?
A latency budget defines how much time each component is "allowed" to contribute to total latency. For a 300ms total budget on a user request:
| Component | Budget | Reasoning |
|---|---|---|
| Network (client to LB) | 50ms | Depends on user location, largely outside our control |
| Load Balancer | 5ms | Should be negligible |
| API Gateway | 10ms | Auth, rate limiting, routing |
| Application Server | 100ms | Business logic processing |
| Database (all queries) | 80ms | Multiple queries, indexed lookups |
| External Services | 30ms | Third-party APIs with caching |
| Network (LB to client) | 25ms | Response transmission |
| Total | 300ms | User-facing SLO target |
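In practice, a budget like this can be encoded and checked against traced requests. A minimal sketch, with component names and numbers mirroring the table above and check_budget as a hypothetical helper:

```python
# Per-component latency budget in milliseconds (mirrors the table above).
BUDGET_MS = {
    "network_in": 50, "load_balancer": 5, "api_gateway": 10,
    "app_server": 100, "database": 80, "external_services": 30, "network_out": 25,
}
TOTAL_BUDGET_MS = 300

def check_budget(measured_ms: dict[str, float]) -> list[str]:
    """Return the components whose measured latency exceeds their budget."""
    over = [name for name, ms in measured_ms.items() if ms > BUDGET_MS.get(name, 0)]
    if sum(measured_ms.values()) > TOTAL_BUDGET_MS:
        over.append("TOTAL")
    return over

# Hypothetical measurements from one traced request.
print(check_budget({"network_in": 42, "load_balancer": 3, "api_gateway": 9,
                    "app_server": 160, "database": 70, "external_services": 20,
                    "network_out": 18}))  # -> ['app_server', 'TOTAL']
```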
Service Level Objectives (SLOs)
Latency SLOs define the target latency at specific percentiles:
SLO: 99% of requests complete in < 300ms
SLO: 99.9% of requests complete in < 1s
SLO: 99.99% of requests complete in < 5s
Note the structure: percentage of requests + latency threshold. The percentage is crucial—a 300ms p50 target is very different from a 300ms p99 target.
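Evaluating compliance against such an SLO is straightforward once you have the latency samples for a time window. A sketch with illustrative numbers, using the first SLO above (99% of requests under 300ms):

```python
def slo_compliance(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    return sum(1 for ms in latencies_ms if ms < threshold_ms) / len(latencies_ms)

window = [120, 180, 95, 310, 250, 40, 700, 130, 220, 90]  # illustrative samples
compliance = slo_compliance(window, threshold_ms=300)
print(f"{compliance:.1%} of requests under 300ms "
      f"({'meets' if compliance >= 0.99 else 'misses'} the 99% SLO)")
```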
Why Different Percentile Targets?
Higher percentiles are exponentially harder to achieve. The difference between p99 and p99.9 often requires a different architectural approach: the p99 may yield to better indexes and leaner code paths, while the p99.9 typically demands techniques like hedged requests, redundancy, and strict resource isolation.
SLOs are internal engineering targets. SLAs (Service Level Agreements) are external contractual commitments with consequences (credits, penalties). Set SLOs tighter than SLAs—your internal target should be 99.9% so you have margin before breaching a 99% SLA commitment.
Armed with an understanding of where latency comes from, we can systematically reduce it. Different strategies address different sources: caching and edge placement attack network and repeated-computation latency, connection reuse removes handshake overhead, indexing and query tuning attack database latency, doing independent work in parallel shortens the critical path, and moving non-essential work off the request path removes it from the user's wait entirely.
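For example, when downstream calls don't depend on one another, issuing them concurrently shrinks the critical path from the sum of their latencies to the slowest single call. A sketch using asyncio, where fake_call is a hypothetical stand-in for real I/O:

```python
import asyncio

async def fake_call(name: str, ms: int) -> str:
    """Hypothetical downstream call; the sleep stands in for real I/O latency."""
    await asyncio.sleep(ms / 1000)
    return name

async def page_sequential() -> list[str]:
    # Critical path = 80 + 120 + 100 ≈ 300ms: each call waits for the previous one.
    return [await fake_call("profile", 80),
            await fake_call("orders", 120),
            await fake_call("recommendations", 100)]

async def page_parallel() -> list[str]:
    # Critical path = the slowest call ≈ 120ms: independent work overlaps.
    return await asyncio.gather(fake_call("profile", 80),
                                fake_call("orders", 120),
                                fake_call("recommendations", 100))

print(asyncio.run(page_parallel()))
```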
The hardest latency to control is the tail—the p99, p99.9, and beyond. These extreme percentiles resist simple optimization and require specialized techniques.
Why Tail Latency is Hard
Tail latency comes from rare, hard-to-reproduce events: garbage-collection pauses, queue buildup during traffic bursts, background compaction or flushes in the datastore, packet loss and retransmission, cache misses, and contention from noisy neighbors on shared infrastructure.
These events are sporadic and hard to profile because they don't happen consistently.
In fan-out architectures (one request triggers N parallel sub-requests), tail latency amplifies exponentially. If each sub-request has 1% chance of being slow, with 100 parallel requests, 63% of aggregate requests will be slow (1 - 0.99^100). The tail becomes the common case.
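The arithmetic behind that 63% figure, generalized to any per-call slow probability and fan-out width:

```python
# Probability that at least one of n parallel sub-requests is slow,
# given each has an independent probability p of being slow.
def fraction_slow(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(f"{fraction_slow(0.01, 100):.0%}")   # ~63%: each leg's p99 dominates the fan-out
print(f"{fraction_slow(0.001, 100):.0%}")  # ~10%: even p99.9 legs leak into the aggregate
```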
Techniques for Tail Latency Control
The standard toolkit includes hedged requests (send a backup copy of a slow request to another replica and use whichever answer arrives first), tight per-request deadlines with bounded retries, load shedding and admission control to keep queues short, isolating background work and heavy tenants away from the latency-critical path, and engineering away long pauses through GC tuning and avoiding large synchronous flushes.
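A minimal sketch of the hedged-request pattern using asyncio, where backend_call is a hypothetical stand-in for a call to one replica: send the request to one replica, and if it hasn't answered within roughly the p95 latency, send a backup to a second replica and take whichever response arrives first.

```python
import asyncio

async def backend_call(replica: str) -> str:
    """Hypothetical call to one replica; replace with a real client call."""
    await asyncio.sleep(0.05 if replica == "fast" else 2.0)  # illustrative latencies
    return f"response from {replica}"

async def hedged_call(replicas: list[str], hedge_after_ms: float = 95) -> str:
    """Hedge a request: fire a backup if the first reply hasn't arrived by ~p95."""
    first = asyncio.create_task(backend_call(replicas[0]))
    done, _ = await asyncio.wait({first}, timeout=hedge_after_ms / 1000)
    tasks = {first} if done else {first, asyncio.create_task(backend_call(replicas[1]))}
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # drop the loser to avoid wasted work downstream
    return done.pop().result()

print(asyncio.run(hedged_call(["slow", "fast"])))  # answers in ~150ms instead of ~2s
```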
Latency is the fundamental pulse of system performance. To consolidate: latency is a distribution, not a single number, so reason in percentiles rather than averages; decompose each request into its phases so you can attribute delay and attack the largest contributors; measure at an explicitly stated point, with a monotonic clock, and without coordinated omission; turn targets into latency budgets and percentile-based SLOs; and treat the tail as a first-class engineering problem rather than noise.
What's Next:
Now that we understand latency—the time dimension of performance—we'll explore throughput: the volume dimension. How many requests per second can your system handle? How do latency and throughput interact? The next page answers these questions.
You now have a deep understanding of latency as a performance metric: you can decompose it into components, measure it correctly, interpret distributions, set appropriate SLOs, and apply targeted optimization techniques.