If availability is the prerequisite for user experience—the service must exist to be used—then latency is the quality of that experience. A service can be perfectly available yet feel broken if every interaction takes 10 seconds. Conversely, a fast service creates the sensation of power and responsiveness that defines great software.
Latency is deceptively complex to measure correctly. Unlike availability's binary success/failure, latency is a continuous distribution. Every request has a latency, and these latencies vary—sometimes dramatically. A service might respond in 50ms usually, but occasionally spike to 5 seconds. What is "the latency" of this service? 50ms? 5 seconds? Something in between? The answer profoundly affects your SLI.
Worse, user perception of latency isn't linear. The difference between 50ms and 100ms is barely noticeable. The difference between 1 second and 2 seconds feels significant. And the difference between 10 seconds and 20 seconds is essentially irrelevant—both feel like "forever." A good latency SLI must account for these perceptual realities.
By the end of this page, you will understand the fundamentals of latency measurement, why percentiles matter more than averages, how to analyze latency distributions, techniques for setting meaningful latency targets, and how to build latency SLIs that genuinely reflect user experience. You'll be equipped to reason about latency with the precision needed for production systems.
Latency, in its simplest form, is the time between a user initiating an action and receiving a response. But this simple definition hides substantial complexity.
The Request Timeline
Consider a typical web request. The user clicks a button, and a cascade of operations begins: the browser prepares and sends the request, it traverses the network to your infrastructure, an edge or gateway layer routes it, an application server executes business logic and queries a database, the response travels back across the network, and the browser renders the result.
The "latency" could be measured at many points along this chain, and different measurements capture different realities.
| Measurement Point | What It Captures | What It Misses |
|---|---|---|
| Database query time | Data layer performance | All network, other processing, client-side |
| Service method duration | Core business logic | Network, serialization, client rendering |
| Server-side request duration | Full server processing | Network round-trip, client processing |
| Time to First Byte (TTFB) | Network round-trip + server processing to first response byte | Remaining response download, client rendering |
| First Contentful Paint (FCP) | First visual content visible | Full page load, interactivity |
| Time to Interactive (TTI) | Page usable for interaction | Complete content load, ongoing loading |
| Largest Contentful Paint (LCP) | Main content visible | Interactivity timing |
| Full page load | Everything loaded | Perception of usability may come sooner |
User-Centric Latency Measurement
For user-centric SLIs, we want to measure what users actually perceive, not just what our servers observe.
The ideal is to measure from the user's device—Real User Monitoring (RUM). When that's not possible, measure as close to the network edge as feasible, and acknowledge the gap between your measurement and true user experience.
Client vs. Server Latency
Server-side latency is what your application directly controls. Client-side latency includes everything else as well: DNS resolution, connection setup, network transit, content download, and rendering on the device.
A user on a 3G mobile connection in a rural area experiences different latency than a user on gigabit fiber in a data center city—even though your server response time is identical.
Implication: If you only measure server-side latency, you may miss that your mobile users are suffering. Client-side measurement, while noisier, captures the true user experience distribution.
Best practice is to capture both server-side and client-side latency. Server-side provides clean data for engineering optimization. Client-side provides ground truth for user experience. When they diverge significantly, investigate—you may have network issues, CDN problems, or client-side performance bottlenecks.
Average latency is the most commonly reported metric and the most misleading. Understanding why requires examining how latency actually distributes.
Latency Distributions Are Not Normal
Human heights follow a roughly normal (Gaussian) distribution: about as many people fall noticeably below the average as above it. Latency does not behave this way.
Latency distributions are typically right-skewed (or "long-tailed"): most requests cluster around a fast typical value, while a minority stretch out into a tail of much slower responses.
This skew occurs because latency has a floor (you can't respond faster than physics allows) but no ceiling (failures can cause nearly infinite delays).
The Problem with Averages
Consider two services with identical average latency of 200ms:
Service A: 100 requests, all at 200ms
Service B: 99 requests at 100ms, 1 request at 10,100ms
Both services have the same average, but Service B has a severe tail latency problem that the average completely obscures. If you have 1 million daily users, that "1%" is 10,000 frustrated users per day.
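A quick numeric sketch (using numpy, which the analysis code later on this page also assumes) of the two hypothetical services:

```python
import numpy as np

# The two hypothetical services from the example above
service_a = [200.0] * 100                # every request takes 200ms
service_b = [100.0] * 99 + [10_100.0]    # one request takes 10.1 seconds

for name, latencies in (("A", service_a), ("B", service_b)):
    arr = np.array(latencies)
    print(f"Service {name}: mean={arr.mean():.0f}ms  "
          f"median={np.median(arr):.0f}ms  max={arr.max():.0f}ms")

# Both print mean=200ms, but Service B's median is 100ms and its worst
# request takes 10,100ms; the average hides both facts.
```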
Tail latencies matter because your most active (and often most valuable) users make the most requests and are therefore the most likely to hit the tail, and because a single user interaction often triggers many requests, so the chance of encountering at least one slow one compounds. Percentiles are the tool for describing the tail precisely:
| Percentile | Meaning | Typical Use |
|---|---|---|
| p50 (Median) | Half of requests are faster, half are slower | Baseline user experience |
| p75 | 75% of requests are faster | Typical user experience |
| p90 | 90% of requests are faster, 10% slower | Common SLI threshold |
| p95 | 95% of requests are faster | Stricter SLI threshold |
| p99 | 99% of requests are faster, 1% slower | Tail latency monitoring |
| p99.9 | 99.9% of requests are faster | Extreme tail issues |
Which Percentile to Use?
The choice of percentile depends on your context:
p50 (Median): describes the typical experience. Useful as a baseline and for spotting broad regressions, but it says nothing about the tail.
p90 or p95: covers the experience of nearly all users without being hypersensitive to individual outliers. A good choice for the primary SLO target.
p99: bounds the tail. Important for high-traffic services, heavy users, and fan-out architectures where tail latency compounds.
p99.9 or higher: only statistically meaningful at very high request volumes; used by large-scale systems to catch rare but severe slowdowns.
Practical guidance: Most services should track p50, p90, p95, and p99. Set SLO targets at p90 or p95 for the primary SLI, with a complementary (looser) p99 target to ensure tails are bounded.
Percentiles cannot be meaningfully averaged. If you have p95 = 100ms for hour 1 and p95 = 200ms for hour 2, the p95 for the combined period is NOT 150ms. To get accurate percentiles over longer periods, you must store the full distribution (or use techniques like HDR histograms) and compute percentiles from raw data.
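A small sketch of why, using synthetic lognormal latencies (the exact numbers depend on the random data; the point is that the two answers differ):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hours of synthetic latencies (ms): hour 2 is slower, heavier-tailed,
# and carries five times as much traffic as hour 1.
hour1 = rng.lognormal(mean=4.0, sigma=0.3, size=10_000)
hour2 = rng.lognormal(mean=4.6, sigma=0.6, size=50_000)

p95_h1 = np.percentile(hour1, 95)
p95_h2 = np.percentile(hour2, 95)
averaged = (p95_h1 + p95_h2) / 2                            # tempting, but wrong
recomputed = np.percentile(np.concatenate([hour1, hour2]), 95)

print(f"hour 1 p95: {p95_h1:.0f}ms, hour 2 p95: {p95_h2:.0f}ms")
print(f"averaged p95s: {averaged:.0f}ms   p95 of combined data: {recomputed:.0f}ms")
# The averaged value ignores that hour 2 contributes most of the requests;
# only the recomputed percentile describes the combined distribution.
```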
Beyond tracking specific percentiles, understanding your full latency distribution provides insights that point metrics miss.
Distribution Shapes and Their Meaning
Unimodal right-skewed: The classic latency distribution. A single peak (mode) near the minimum, with a tail extending toward higher latencies. This is "normal" for most services.
Bimodal: Two distinct peaks. Often indicates two different code paths (cached vs. uncached) or two distinct user populations (local vs. international). Not inherently problematic, but should be understood.
Uniform/flat: No clear typical value. Often indicates a queueing problem—requests aren't processed on arrival but wait in line.
Heavy-tailed: Extreme outliers far beyond the typical range. May indicate occasional catastrophic slowdowns (GC pauses, lock contention, cold caches).
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DistributionAnalysis:
    """Comprehensive analysis of a latency distribution."""
    count: int
    min_ms: float
    max_ms: float
    mean_ms: float
    median_ms: float
    std_dev_ms: float
    percentiles: dict  # p50, p90, p95, p99, p99_9
    skewness: float
    kurtosis: float
    is_heavy_tailed: bool
    coefficient_of_variation: float  # std_dev / mean


def analyze_latency_distribution(latencies_ms: List[float]) -> DistributionAnalysis:
    """
    Perform comprehensive analysis of a latency distribution.

    Args:
        latencies_ms: List of latency measurements in milliseconds

    Returns:
        DistributionAnalysis with statistical characterization
    """
    if len(latencies_ms) < 10:
        raise ValueError("Insufficient data for distribution analysis")

    arr = np.array(latencies_ms)

    # Basic statistics
    count = len(arr)
    min_ms = float(np.min(arr))
    max_ms = float(np.max(arr))
    mean_ms = float(np.mean(arr))
    median_ms = float(np.median(arr))
    std_dev_ms = float(np.std(arr))

    # Percentiles
    percentiles = {
        'p50': float(np.percentile(arr, 50)),
        'p75': float(np.percentile(arr, 75)),
        'p90': float(np.percentile(arr, 90)),
        'p95': float(np.percentile(arr, 95)),
        'p99': float(np.percentile(arr, 99)),
        'p99_9': float(np.percentile(arr, 99.9)) if count >= 1000 else None,
    }

    # Shape statistics
    skewness = float(stats.skew(arr))
    kurtosis = float(stats.kurtosis(arr))

    # Heavy-tailed detection: kurtosis > 3 suggests heavier tails than normal
    # Also check if p99/p50 ratio is extreme
    p99_p50_ratio = percentiles['p99'] / percentiles['p50'] if percentiles['p50'] > 0 else 0
    is_heavy_tailed = kurtosis > 6 or p99_p50_ratio > 10

    # Coefficient of variation: high values indicate high variability
    coefficient_of_variation = std_dev_ms / mean_ms if mean_ms > 0 else 0

    return DistributionAnalysis(
        count=count,
        min_ms=min_ms,
        max_ms=max_ms,
        mean_ms=mean_ms,
        median_ms=median_ms,
        std_dev_ms=std_dev_ms,
        percentiles=percentiles,
        skewness=skewness,
        kurtosis=kurtosis,
        is_heavy_tailed=is_heavy_tailed,
        coefficient_of_variation=coefficient_of_variation,
    )


def detect_modality(latencies_ms: List[float]) -> Tuple[str, int]:
    """
    Detect whether distribution is unimodal, bimodal, or multimodal.

    Returns:
        Tuple of (modality_type, estimated_number_of_modes)
    """
    from scipy.signal import find_peaks

    arr = np.array(latencies_ms)

    # Create histogram
    hist, bin_edges = np.histogram(arr, bins='auto', density=True)

    # Find peaks in histogram
    peaks, _ = find_peaks(hist, height=0.1 * max(hist), distance=3)

    n_modes = len(peaks)
    if n_modes <= 1:
        return ("unimodal", 1)
    elif n_modes == 2:
        return ("bimodal", 2)
    else:
        return ("multimodal", n_modes)


# Example usage
latencies = [45, 52, 48, 51, 47, 320, 42, 55, 49, 46, ...]  # sample data
analysis = analyze_latency_distribution(latencies)

print(f"Median latency: {analysis.percentiles['p50']:.1f}ms")
print(f"p95 latency: {analysis.percentiles['p95']:.1f}ms")
print(f"p99 latency: {analysis.percentiles['p99']:.1f}ms")
print(f"Heavy-tailed: {analysis.is_heavy_tailed}")
print(f"Coefficient of variation: {analysis.coefficient_of_variation:.2f}")
```

Histograms for Human Understanding
While percentiles are precise, latency histograms provide an intuitive understanding of distribution shape. When analyzing latency, always visualize the full distribution rather than relying on point metrics alone.
Key patterns to watch for are the shapes described above: secondary peaks, flat regions that suggest queueing, and outliers far beyond the main body. A histogram with log-spaced bins, as in the sketch below, makes these patterns visible.
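A minimal plotting sketch (assuming matplotlib is available); log-spaced bins keep the tail visible instead of compressing it into the last bin:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_latency_histogram(latencies_ms, title="Latency distribution"):
    arr = np.asarray(latencies_ms)
    # Log-spaced bins give the 10-100ms and 100-1000ms ranges equal visual weight
    bins = np.logspace(np.log10(max(arr.min(), 0.1)), np.log10(arr.max()), num=50)
    plt.hist(arr, bins=bins)
    plt.xscale("log")
    plt.xlabel("latency (ms)")
    plt.ylabel("request count")
    plt.title(title)
    # Mark key percentiles so the tail's distance from the median is obvious
    for p in (50, 95, 99):
        plt.axvline(np.percentile(arr, p), linestyle="--", label=f"p{p}")
    plt.legend()
    plt.show()
```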
Choosing the right latency target requires balancing user expectations, technical constraints, and business requirements.
Framework for Target Selection
Step 1: Understand user expectations
User tolerance for latency depends on the interaction type. Research and industry practice provide rough benchmarks: around 100ms feels instantaneous, around 1 second is noticeable but keeps the user's flow of thought, and delays approaching 10 seconds exceed most users' willingness to wait. The table below translates these into per-use-case targets.
Step 2: Analyze current performance
Before setting targets, understand where you are: measure your current p50, p90, p95, and p99 over a representative period and compare them against user expectations.
Setting a target you can't achieve is demoralizing. Setting a target you always meet is not ambitious enough.
Step 3: Consider the full stack
Your latency target for user-facing interactions must budget for the entire stack, from client processing and network transit through gateway, application, and data layers (the latency budget table later on this page shows one such allocation).
Work backwards from user-facing target to component budgets.
| Use Case | p50 Target | p95 Target | p99 Target | Rationale |
|---|---|---|---|---|
| Typeahead/autocomplete | < 50ms | < 100ms | < 200ms | Must feel instant to not disrupt typing |
| Page navigation | < 200ms | < 500ms | < 1s | Users expect smooth transitions |
| Search results | < 300ms | < 800ms | < 2s | Fast enough to feel like filtering, not waiting |
| Form submission | < 500ms | < 1s | < 3s | Users wait for confirmation; needs to be responsive |
| Report generation | < 2s | < 5s | < 15s | Users expect computation; progress feedback critical |
| Batch API operations | < 10s | < 30s | < 60s | Asynchronous acceptable; timeout must be clear |
The Multi-Percentile SLI Pattern
A single latency target is often insufficient. Consider a multi-tier approach:
Primary target (p95): This is your main SLI target. "95% of requests complete within X ms." This sets the expectation for almost all users.
Tail target (p99): A secondary, looser target that bounds the worst case. "99% of requests complete within Y ms (where Y > X)." This prevents severe tail latency.
Example: p95 < 200ms (primary) and p99 < 1s (tail).
This says: "Almost everyone (95%) gets a response within 200ms, and all but the slowest 1% still get a response in under 1 second."
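A minimal sketch of evaluating one window of measurements against both tiers (the thresholds mirror the example above; names are illustrative):

```python
import numpy as np

def check_latency_slo(latencies_ms, primary=(95, 200.0), tail=(99, 1000.0)):
    """Check a window of latencies against a primary and a tail target.

    Each target is a (percentile, threshold_ms) pair; the defaults
    correspond to "p95 < 200ms" and "p99 < 1s" from the example above.
    """
    arr = np.asarray(latencies_ms)
    report = {}
    for name, (pct, threshold_ms) in (("primary", primary), ("tail", tail)):
        observed = float(np.percentile(arr, pct))
        report[name] = {
            "percentile": f"p{pct}",
            "observed_ms": round(observed, 1),
            "threshold_ms": threshold_ms,
            "met": observed < threshold_ms,
        }
    return report
```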
Why not just p99 for everything?
The stricter the percentile, the harder it is to maintain. p99 targets are highly sensitive to outliers: a handful of pathological requests can push p99 over the threshold in a measurement window. For most services, a p90 or p95 primary target with a looser p99 bound strikes the right balance.
Apdex (Application Performance Index) is an alternative to percentile-based SLIs. It classifies requests as Satisfied (latency ≤ T), Tolerating (between T and 4T), or Frustrated (> 4T), where T is your target threshold. Apdex = (Satisfied + Tolerating/2) / Total. An Apdex of 0.95 means excellent experience. This can be more intuitive for non-technical stakeholders while still driving good latency outcomes.
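A minimal Apdex calculation (T in milliseconds) to make the formula concrete:

```python
def apdex(latencies_ms, t_ms):
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (contributes nothing to the score)
    """
    satisfied = sum(1 for l in latencies_ms if l <= t_ms)
    tolerating = sum(1 for l in latencies_ms if t_ms < l <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# With T = 200ms: 120, 150 and 180ms are satisfied, 500ms is tolerating,
# 900ms is frustrated, giving (3 + 1/2) / 5 = 0.7
print(apdex([150, 500, 900, 180, 120], t_ms=200))
```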
Accurate latency measurement is harder than it appears. Several challenges can corrupt your data.
Challenge 1: Clock Skew and Precision
Distributed systems involve multiple clocks, and clocks drift. If you compute latency by subtracting timestamps from different machines, clock skew introduces errors.
Solutions: compute each duration on a single machine using a monotonic clock rather than subtracting wall-clock timestamps taken on different hosts, and where cross-machine timing is unavoidable, keep clocks tightly synchronized (e.g., NTP) and treat residual skew as measurement error.
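A sketch of the single-clock approach in Python (assuming the request can be timed where it is handled): measure the duration with a monotonic clock on one machine instead of subtracting wall-clock timestamps from two hosts.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Measure a call's duration with a monotonic clock on a single machine.

    time.monotonic() is unaffected by NTP adjustments or wall-clock jumps,
    so the computed duration stays meaningful even if system time changes
    while the request is in flight.
    """
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return result, elapsed_ms
```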
Challenge 2: Coordinated Omission
Coordinated omission is a measurement error where slow requests are systematically under-counted. It happens when your load generator waits for a response before sending the next request.
If your service is supposed to handle 100 requests/second and one response stalls for 5 seconds, a naive load generator simply stops sending while it waits. The roughly 500 requests that real users would have issued during that stall (and that would likely have been slow) never appear in your data, so the recorded distribution looks far better than reality.
```typescript
// Demonstrating coordinated omission problem

// WRONG: This code exhibits coordinated omission
async function naiveLoadGenerator(targetRps: number, durationSec: number) {
  const intervalMs = 1000 / targetRps;
  const results: number[] = [];
  const endTime = Date.now() + durationSec * 1000;

  while (Date.now() < endTime) {
    const start = Date.now();
    await makeRequest(); // PROBLEM: We wait for response before continuing
    const latency = Date.now() - start;
    results.push(latency);

    // If request took 5 seconds, we've missed 5 seconds' worth of requests
    // Those "would have been slow" requests are not captured
    const sleepTime = Math.max(0, intervalMs - latency);
    await sleep(sleepTime);
  }

  return results; // Under-reports tail latency!
}

// RIGHT: Corrected load generator - measures intended send time
async function correctedLoadGenerator(targetRps: number, durationSec: number) {
  const intervalMs = 1000 / targetRps;
  const results: {
    intendedSendTime: number;
    actualLatency: number;
    correctedLatency: number;
  }[] = [];
  const startTime = Date.now();
  const endTime = startTime + durationSec * 1000;

  let requestNumber = 0;

  while (Date.now() < endTime) {
    const intendedSendTime = startTime + requestNumber * intervalMs;
    const actualSendTime = Date.now();

    // The request is already "late" by the time we send it
    const alreadyLateBy = actualSendTime - intendedSendTime;

    const requestStart = Date.now();
    await makeRequest();
    const requestDuration = Date.now() - requestStart;

    // Corrected latency: how long from INTENDED send time to response
    // This captures the full user-perceived delay
    const correctedLatency = alreadyLateBy + requestDuration;

    results.push({
      intendedSendTime,
      actualLatency: requestDuration,
      correctedLatency: correctedLatency,
    });

    requestNumber++;
    // Don't sleep - immediately check if we should send next request
    // In production, you'd use a proper scheduler
  }

  return results;
}

// The correctedLatency values will show true tail behavior
// actualLatency may look good while correctedLatency reveals the problem
```

Challenge 3: Survivorship Bias
Timed-out requests often don't appear in latency data because they never complete. If a request times out after 30 seconds, it may never be recorded with a latency at all, so the very slowest requests silently vanish from the dataset and the tail looks better than it really is.
Solutions: record timed-out requests at the timeout ceiling (a 30-second timeout counts as 30,000ms) and include them in the latency SLI rather than dropping them, so the tail reflects the requests users actually gave up on.
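A sketch of that approach, assuming a 30-second client timeout: timed-out requests are recorded at the timeout ceiling so they still count against the SLI.

```python
import numpy as np

TIMEOUT_MS = 30_000  # assumed client timeout

def effective_latencies(completed_ms, timed_out_count):
    """Combine completed-request latencies with timed-out requests.

    Each timeout is recorded at the timeout ceiling, so the resulting
    percentiles are a lower bound on the truth (the real latency of a
    timed-out request is unknown but at least TIMEOUT_MS).
    """
    return np.array(list(completed_ms) + [TIMEOUT_MS] * timed_out_count)

latencies = effective_latencies([120, 95, 140, 200, 110], timed_out_count=1)
print(f"p99 including timeouts: {np.percentile(latencies, 99):.0f}ms")
```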
Challenge 4: Aggregation Loss
As discussed earlier, percentiles can't be averaged. If you're using pre-aggregated metrics, you may only have per-minute percentiles with no way to recover accurate percentiles over longer windows; prefer exporting histograms (or raw distributions) so percentiles can be recomputed over any period.
Challenge 5: Sampling Bias
For high-traffic services, you may sample requests for detailed tracing. Sampling must not bias toward fast requests: latency-aware or tail-based sampling that preferentially keeps slow traces is useful for debugging, but computing SLIs from such a sample will distort the percentiles.
For latency SLIs, prefer non-sampled metrics or ensure sampling is unbiased with respect to latency.
If clients time out before the server responds, the server's latency measurement doesn't reflect user experience. The user gave up after 10 seconds, but the server might log a 30-second "successful" response to a connection that's already closed. Monitor client-side timeouts separately from server-side latency.
In systems with multiple components and dependencies, understanding how latency accumulates is critical for meaningful SLIs.
The Latency Budget Concept
A latency budget allocates your end-to-end latency target across components. For a 300ms user-facing SLI:
| Component | Budget | Notes |
|---|---|---|
| Client processing | 20ms | JavaScript execution, DOM update |
| Network (client → edge) | 30ms | Variable, depends on user location |
| CDN/edge processing | 10ms | Static content, edge logic |
| Network (edge → origin) | 20ms | Internal network, usually fast |
| API gateway | 10ms | Auth, routing, rate limiting |
| Application server | 150ms | Core business logic |
| Database queries | 50ms | Data access |
| Serialization/response | 10ms | JSON encoding, compression |
| Total | 300ms | Component budgets must sum to no more than the end-to-end target |
This budget makes hidden assumptions explicit and allocates responsibility.
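A trivial sketch of keeping such a budget honest in code; the component names and values mirror the table above:

```python
END_TO_END_TARGET_MS = 300

LATENCY_BUDGET_MS = {
    "client_processing": 20,
    "network_client_to_edge": 30,
    "cdn_edge_processing": 10,
    "network_edge_to_origin": 20,
    "api_gateway": 10,
    "application_server": 150,
    "database_queries": 50,
    "serialization_response": 10,
}

allocated = sum(LATENCY_BUDGET_MS.values())
assert allocated <= END_TO_END_TARGET_MS, (
    f"budget over-allocated: {allocated}ms > {END_TO_END_TARGET_MS}ms"
)
print(f"allocated {allocated}ms of the {END_TO_END_TARGET_MS}ms end-to-end target")
```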
Serial vs. Parallel Latency
How components are arranged affects total latency:
Serial execution: Latencies add up. If A takes 100ms, then B takes 100ms, total is 200ms.
Parallel execution: Latency is the max of parallel paths. If A and B execute in parallel (both 100ms), total is 100ms.
Microservices architectures offer opportunities for parallelization, but dependencies often force serialization. Request fans out to 5 services, but one service depends on another's result—that path is serial.
Latency in fan-out patterns:
When a request fans out to N backends and waits for all of them, the overall latency is the maximum of the N individual latencies, so the slowest backend on every request determines the outcome.
If each backend has p99 = 100ms and you fan out to 10 backends, the probability that all 10 respond within 100ms is only 0.99^10 ≈ 90%. Roughly one in ten user requests therefore experiences at least one backend's tail, turning a per-backend p99 into something closer to an overall p90.
Fan-out amplifies tail latency dramatically. This is why services with many dependencies struggle with tail latency despite each dependency being individually fast.
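A small sketch of the arithmetic (assuming backends behave independently): the fraction of fan-out requests that see at least one backend slower than its own p99 grows quickly with N.

```python
def fraction_seeing_backend_tail(n_backends, per_backend_quantile=0.99):
    """Probability that at least one of N parallel calls is slower than the
    per-backend latency at the given quantile (independence assumed)."""
    return 1 - per_backend_quantile ** n_backends

for n in (1, 5, 10, 50, 100):
    pct = fraction_seeing_backend_tail(n)
    print(f"{n:>3} backends: {pct:.1%} of requests hit at least one backend's p99 tail")

# 10 backends -> ~9.6%, 100 backends -> ~63.4%
```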
Hedged requests (sending a duplicate of a slow request to another replica and taking whichever response arrives first) improve tail latency but increase load on backends. If your backends are near capacity, hedging can make things worse by overloading them. Use hedging judiciously, preferring it during low-load periods or for critical-path requests only.
Let's examine concrete latency SLI specifications for real-world systems.
Example 1: Search API Latency SLI
```yaml
# Latency SLI Specification: Search API
sli:
  name: "Search API Response Time"
  description: "Time from search query receipt to response delivery"

  measurement:
    what: "Server-side request duration"
    start_event: "HTTP request headers received"
    end_event: "HTTP response body sent"
    unit: "milliseconds"

  scope:
    endpoints:
      - "/api/v1/search"
      - "/api/v2/search"
    filters:
      - "Exclude synthetic monitoring (header X-Synthetic: true)"
      - "Exclude extremely large result sets (> 1000 results)"
      - "Only successful responses (HTTP 2xx)"

  targets:
    primary:
      percentile: p95
      threshold_ms: 200
      rationale: "95% of searches feel instant to users"
    secondary:
      percentile: p99
      threshold_ms: 800
      rationale: "Even slow searches complete before user frustration"
    baseline:
      percentile: p50
      threshold_ms: 50
      rationale: "Typical experience should be excellent"

  slo_formula: >
    proportion_of_requests(latency < threshold) >= target_percentage
    Concrete: proportion_of_requests(latency < 200ms) >= 95%

  aggregation:
    window: "5 minutes for alerting, 30 days rolling for SLO compliance"

  failure_modes:
    - "Requests timing out (> 30s) are counted as failures at 30000ms"
    - "Connection resets counted as failures at threshold + 1ms"
```

Example 2: End-to-End Page Load Latency SLI
```yaml
# Latency SLI Specification: Page Load Performance
sli:
  name: "Product Page Load Time"
  description: "End-to-end time for product page to become usable"

  measurement:
    what: "Client-side Largest Contentful Paint (LCP)"
    source: "Real User Monitoring (RUM)"
    start_event: "Navigation start (user click or typed URL)"
    end_event: "LCP event fired (main content rendered)"
    unit: "milliseconds"
    notes: >
      LCP is chosen as it represents when users perceive the page as
      "loaded" - the main product image and details are visible. This is
      more user-relevant than DOMContentLoaded or load event.

  scope:
    pages:
      - "/products/:id"
      - "/items/:sku"
    user_segments:
      - "All users (no sampling)"
    devices:
      - "All device types"
    networks:
      - "All network types"
      - "Note: We may set different targets by network class later"

  targets:
    primary:
      percentile: p75
      threshold_ms: 2500
      rationale: >
        Core Web Vitals recommends LCP < 2.5s at p75 for "good" UX.
        This is our minimum acceptable bar.
    stretch:
      percentile: p75
      threshold_ms: 1500
      rationale: "Competitive differentiation through speed"
    tail:
      percentile: p95
      threshold_ms: 4000
      rationale: "No user should wait excessively"

  segmentation:
    by_network:
      4g_plus:
        p75_target: 1500ms
      3g:
        p75_target: 3500ms
      slow_2g:
        p75_target: 6000ms  # Different expectations for constrained networks
    by_region:
      - Track but don't alert on regional variations initially
      - Use to identify CDN or geographic performance issues

  correlation:
    business_metric: "Product page bounce rate"
    expected_relationship: "Negative - faster pages, lower bounce rate"
```

Example 3: Database Query Latency SLI (Internal Component)
```yaml
# Latency SLI Specification: Database Query Performance
sli:
  name: "Primary Database Read Latency"
  description: "Time for database queries to execute"
  purpose: "Internal component SLI - supports user-facing SLIs"

  measurement:
    what: "Query execution time as reported by client library"
    start_event: "Query sent to connection pool"
    end_event: "Last row of result set received"
    includes:
      - Connection pool wait time
      - Network round-trip to database
      - Query execution on database
      - Result set transmission
    excludes:
      - Application-side result processing
      - ORM hydration time

  scope:
    query_types:
      - "SELECT (read) queries only"
    databases:
      - "primary-postgres-cluster"

  targets:
    by_query_class:
      simple_lookups:    # Primary key lookups
        p50: 2ms
        p99: 10ms
      indexed_queries:   # Queries using indexes
        p50: 10ms
        p99: 50ms
      complex_queries:   # JOINs, aggregations
        p50: 50ms
        p99: 200ms
      reports:           # Analytical queries (expected slow)
        p50: 500ms
        p99: 5000ms
    aggregate:
      p95: 30ms
      p99: 100ms

  budget_context: >
    With 5 queries average per user request and 150ms server budget,
    database queries can consume max 50ms (leaving 100ms for other work).
    p95 at 30ms provides headroom.

  alerts:
    - condition: "p95 > 50ms for 5 minutes"
      severity: "warning"
    - condition: "p99 > 200ms for 5 minutes"
      severity: "critical"
```

Latency SLIs capture the quality of user experience—not just whether the service works, but how it feels to use. The key learnings: latency is a distribution, not a single number; percentiles, not averages, describe what users experience; targets work best as a primary percentile bound (typically p90 or p95) paired with a looser tail bound (p99); and measurement pitfalls such as coordinated omission, survivorship bias, and aggregation loss must be actively designed around.
What's Next
We've covered availability ("did it work?") and latency ("how fast was it?"). The next SLI dimension is error rate—measuring the proportion of operations that fail and understanding the nuances of error classification, severity, and impact.
You now have a comprehensive understanding of latency SLIs—from the philosophy of measurement through practical implementation. You can analyze latency distributions, set appropriate percentile-based targets, avoid common measurement pitfalls, and design latency SLIs that genuinely reflect user experience.