If availability is the prerequisite for user experience—the service must exist to be used—then latency is the quality of that experience. A service can be perfectly available yet feel broken if every interaction takes 10 seconds. Conversely, a fast service creates the sensation of power and responsiveness that defines great software.
Latency is deceptively complex to measure correctly. Unlike availability's binary success/failure, latency is a continuous distribution. Every request has a latency, and these latencies vary—sometimes dramatically. A service might respond in 50ms usually, but occasionally spike to 5 seconds. What is "the latency" of this service? 50ms? 5 seconds? Something in between? The answer profoundly affects your SLI.
Worse, user perception of latency isn't linear. The difference between 50ms and 100ms is barely noticeable. The difference between 1 second and 2 seconds feels significant. And the difference between 10 seconds and 20 seconds is essentially irrelevant—both feel like "forever." A good latency SLI must account for these perceptual realities.
By the end of this page, you will understand the fundamentals of latency measurement, why percentiles matter more than averages, how to analyze latency distributions, techniques for setting meaningful latency targets, and how to build latency SLIs that genuinely reflect user experience. You'll be equipped to reason about latency with the precision needed for production systems.
Latency, in its simplest form, is the time between a user initiating an action and receiving a response. But this simple definition hides substantial complexity.
The Request Timeline
Consider a typical web request. The user clicks a button, and a cascade of operations begins: the browser prepares and sends the request, it traverses the network to your infrastructure, an edge or gateway layer routes it, an application server executes business logic and queries a database, the response travels back across the network, and the browser renders the result.
The "latency" could be measured at many points along this chain, and different measurements capture different realities.
| Measurement Point | What It Captures | What It Misses |
|---|---|---|
| Database query time | Data layer performance | All network, other processing, client-side |
| Service method duration | Core business logic | Network, serialization, client rendering |
| Server-side request duration | Full server processing | Network round-trip, client processing |
| Time to First Byte (TTFB) | Network round-trip + server processing to first response byte | Remaining response download, client rendering |
| First Contentful Paint (FCP) | First visual content visible | Full page load, interactivity |
| Time to Interactive (TTI) | Page usable for interaction | Complete content load, ongoing loading |
| Largest Contentful Paint (LCP) | Main content visible | Interactivity timing |
| Full page load | Everything loaded | Perception of usability may come sooner |
User-Centric Latency Measurement
For user-centric SLIs, we want to measure what users actually perceive, not just what our servers observe.
The ideal is to measure from the user's device—Real User Monitoring (RUM). When that's not possible, measure as close to the network edge as feasible, and acknowledge the gap between your measurement and true user experience.
Client vs. Server Latency
Server-side latency is what your application directly controls. Client-side latency includes everything else as well: DNS resolution, connection setup, network transit, content download, and rendering on the device.
A user on a 3G mobile connection in a rural area experiences different latency than a user on gigabit fiber in a data center city—even though your server response time is identical.
Implication: If you only measure server-side latency, you may miss that your mobile users are suffering. Client-side measurement, while noisier, captures the true user experience distribution.
Best practice is to capture both server-side and client-side latency. Server-side provides clean data for engineering optimization. Client-side provides ground truth for user experience. When they diverge significantly, investigate—you may have network issues, CDN problems, or client-side performance bottlenecks.
Average latency is the most commonly reported metric and the most misleading. Understanding why requires examining how latency actually distributes.
Latency Distributions Are Not Normal
Human heights follow a roughly normal (Gaussian) distribution: about as many people fall noticeably below the average as above it. Latency does not behave this way.
Latency distributions are typically right-skewed (or "long-tailed"): most requests cluster around a fast typical value, while a minority stretch out into a tail of much slower responses.
This skew occurs because latency has a floor (you can't respond faster than physics allows) but no ceiling (failures can cause nearly infinite delays).
The Problem with Averages
Consider two services with identical average latency of 200ms:
Service A: 100 requests, all at 200ms
Service B: 99 requests at 100ms, 1 request at 10,100ms
Both services have the same average, but Service B has a severe tail latency problem that the average completely obscures. If you have 1 million daily users, that "1%" is 10,000 frustrated users per day.
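A quick numeric sketch (using numpy, which the analysis code later on this page also assumes) of the two hypothetical services:

```python
import numpy as np

# The two hypothetical services from the example above
service_a = [200.0] * 100                # every request takes 200ms
service_b = [100.0] * 99 + [10_100.0]    # one request takes 10.1 seconds

for name, latencies in (("A", service_a), ("B", service_b)):
    arr = np.array(latencies)
    print(f"Service {name}: mean={arr.mean():.0f}ms  "
          f"median={np.median(arr):.0f}ms  max={arr.max():.0f}ms")

# Both print mean=200ms, but Service B's median is 100ms and its worst
# request takes 10,100ms; the average hides both facts.
```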
Tail latencies matter because your most active (and often most valuable) users make the most requests and are therefore the most likely to hit the tail, and because a single user interaction often triggers many requests, so the chance of encountering at least one slow one compounds. Percentiles are the tool for describing the tail precisely:
| Percentile | Meaning | Typical Use |
|---|---|---|
| p50 (Median) | Half of requests are faster, half are slower | Baseline user experience |
| p75 | 75% of requests are faster | Typical user experience |
| p90 | 90% of requests are faster, 10% slower | Common SLI threshold |
| p95 | 95% of requests are faster | Stricter SLI threshold |
| p99 | 99% of requests are faster, 1% slower | Tail latency monitoring |
| p99.9 | 99.9% of requests are faster | Extreme tail issues |
Which Percentile to Use?
The choice of percentile depends on your context:
p50 (Median): describes the typical experience. Useful as a baseline and for spotting broad regressions, but it says nothing about the tail.
p90 or p95: covers the experience of nearly all users without being hypersensitive to individual outliers. A good choice for the primary SLO target.
p99: bounds the tail. Important for high-traffic services, heavy users, and fan-out architectures where tail latency compounds.
p99.9 or higher: only statistically meaningful at very high request volumes; used by large-scale systems to catch rare but severe slowdowns.
Practical guidance: Most services should track p50, p90, p95, and p99. Set SLO targets at p90 or p95 for the primary SLI, with a complementary (looser) p99 target to ensure tails are bounded.
Percentiles cannot be meaningfully averaged. If you have p95 = 100ms for hour 1 and p95 = 200ms for hour 2, the p95 for the combined period is NOT 150ms. To get accurate percentiles over longer periods, you must store the full distribution (or use techniques like HDR histograms) and compute percentiles from raw data.
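A small sketch of why, using synthetic lognormal latencies (the exact numbers depend on the random data; the point is that the two answers differ):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hours of synthetic latencies (ms): hour 2 is slower, heavier-tailed,
# and carries five times as much traffic as hour 1.
hour1 = rng.lognormal(mean=4.0, sigma=0.3, size=10_000)
hour2 = rng.lognormal(mean=4.6, sigma=0.6, size=50_000)

p95_h1 = np.percentile(hour1, 95)
p95_h2 = np.percentile(hour2, 95)
averaged = (p95_h1 + p95_h2) / 2                            # tempting, but wrong
recomputed = np.percentile(np.concatenate([hour1, hour2]), 95)

print(f"hour 1 p95: {p95_h1:.0f}ms, hour 2 p95: {p95_h2:.0f}ms")
print(f"averaged p95s: {averaged:.0f}ms   p95 of combined data: {recomputed:.0f}ms")
# The averaged value ignores that hour 2 contributes most of the requests;
# only the recomputed percentile describes the combined distribution.
```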
Beyond tracking specific percentiles, understanding your full latency distribution provides insights that point metrics miss.
Distribution Shapes and Their Meaning
Unimodal right-skewed: The classic latency distribution. A single peak (mode) near the minimum, with a tail extending toward higher latencies. This is "normal" for most services.
Bimodal: Two distinct peaks. Often indicates two different code paths (cached vs. uncached) or two distinct user populations (local vs. international). Not inherently problematic, but should be understood.
Uniform/flat: No clear typical value. Often indicates a queueing problem—requests aren't processed on arrival but wait in line.
Heavy-tailed: Extreme outliers far beyond the typical range. May indicate occasional catastrophic slowdowns (GC pauses, lock contention, cold caches).
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DistributionAnalysis:
    """Comprehensive analysis of a latency distribution."""
    count: int
    min_ms: float
    max_ms: float
    mean_ms: float
    median_ms: float
    std_dev_ms: float
    percentiles: dict  # p50, p90, p95, p99, p99_9
    skewness: float
    kurtosis: float
    is_heavy_tailed: bool
    coefficient_of_variation: float  # std_dev / mean


def analyze_latency_distribution(latencies_ms: List[float]) -> DistributionAnalysis:
    """
    Perform comprehensive analysis of a latency distribution.

    Args:
        latencies_ms: List of latency measurements in milliseconds

    Returns:
        DistributionAnalysis with statistical characterization
    """
    if len(latencies_ms) < 10:
        raise ValueError("Insufficient data for distribution analysis")

    arr = np.array(latencies_ms)

    # Basic statistics
    count = len(arr)
    min_ms = float(np.min(arr))
    max_ms = float(np.max(arr))
    mean_ms = float(np.mean(arr))
    median_ms = float(np.median(arr))
    std_dev_ms = float(np.std(arr))

    # Percentiles
    percentiles = {
        'p50': float(np.percentile(arr, 50)),
        'p75': float(np.percentile(arr, 75)),
        'p90': float(np.percentile(arr, 90)),
        'p95': float(np.percentile(arr, 95)),
        'p99': float(np.percentile(arr, 99)),
        'p99_9': float(np.percentile(arr, 99.9)) if count >= 1000 else None,
    }

    # Shape statistics
    skewness = float(stats.skew(arr))
    kurtosis = float(stats.kurtosis(arr))

    # Heavy-tailed detection: kurtosis > 3 suggests heavier tails than normal
    # Also check if p99/p50 ratio is extreme
    p99_p50_ratio = percentiles['p99'] / percentiles['p50'] if percentiles['p50'] > 0 else 0
    is_heavy_tailed = kurtosis > 6 or p99_p50_ratio > 10

    # Coefficient of variation: high values indicate high variability
    coefficient_of_variation = std_dev_ms / mean_ms if mean_ms > 0 else 0

    return DistributionAnalysis(
        count=count,
        min_ms=min_ms,
        max_ms=max_ms,
        mean_ms=mean_ms,
        median_ms=median_ms,
        std_dev_ms=std_dev_ms,
        percentiles=percentiles,
        skewness=skewness,
        kurtosis=kurtosis,
        is_heavy_tailed=is_heavy_tailed,
        coefficient_of_variation=coefficient_of_variation,
    )


def detect_modality(latencies_ms: List[float]) -> Tuple[str, int]:
    """
    Detect whether distribution is unimodal, bimodal, or multimodal.

    Returns:
        Tuple of (modality_type, estimated_number_of_modes)
    """
    from scipy.signal import find_peaks

    arr = np.array(latencies_ms)

    # Create histogram
    hist, bin_edges = np.histogram(arr, bins='auto', density=True)

    # Find peaks in histogram
    peaks, _ = find_peaks(hist, height=0.1 * max(hist), distance=3)

    n_modes = len(peaks)
    if n_modes <= 1:
        return ("unimodal", 1)
    elif n_modes == 2:
        return ("bimodal", 2)
    else:
        return ("multimodal", n_modes)


# Example usage
latencies = [45, 52, 48, 51, 47, 320, 42, 55, 49, 46, ...]  # sample data
analysis = analyze_latency_distribution(latencies)

print(f"Median latency: {analysis.percentiles['p50']:.1f}ms")
print(f"p95 latency: {analysis.percentiles['p95']:.1f}ms")
print(f"p99 latency: {analysis.percentiles['p99']:.1f}ms")
print(f"Heavy-tailed: {analysis.is_heavy_tailed}")
print(f"Coefficient of variation: {analysis.coefficient_of_variation:.2f}")
```

Histograms for Human Understanding
While percentiles are precise, latency histograms provide an intuitive understanding of distribution shape. When analyzing latency, always visualize the full distribution rather than relying on point metrics alone.
Key patterns to watch for are the shapes described above: secondary peaks, flat regions that suggest queueing, and outliers far beyond the main body. A histogram with log-spaced bins, as in the sketch below, makes these patterns visible.
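A minimal plotting sketch (assuming matplotlib is available); log-spaced bins keep the tail visible instead of compressing it into the last bin:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_latency_histogram(latencies_ms, title="Latency distribution"):
    arr = np.asarray(latencies_ms)
    # Log-spaced bins give the 10-100ms and 100-1000ms ranges equal visual weight
    bins = np.logspace(np.log10(max(arr.min(), 0.1)), np.log10(arr.max()), num=50)
    plt.hist(arr, bins=bins)
    plt.xscale("log")
    plt.xlabel("latency (ms)")
    plt.ylabel("request count")
    plt.title(title)
    # Mark key percentiles so the tail's distance from the median is obvious
    for p in (50, 95, 99):
        plt.axvline(np.percentile(arr, p), linestyle="--", label=f"p{p}")
    plt.legend()
    plt.show()
```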
Choosing the right latency target requires balancing user expectations, technical constraints, and business requirements.
Framework for Target Selection
Step 1: Understand user expectations
User tolerance for latency depends on the interaction type. Research and industry practice provide rough benchmarks: around 100ms feels instantaneous, around 1 second is noticeable but keeps the user's flow of thought, and delays approaching 10 seconds exceed most users' willingness to wait. The table below translates these into per-use-case targets.
Step 2: Analyze current performance
Before setting targets, understand where you are: measure your current p50, p90, p95, and p99 over a representative period and compare them against user expectations.
Setting a target you can't achieve is demoralizing. Setting a target you always meet is not ambitious enough.
Step 3: Consider the full stack
Your latency target for user-facing interactions must budget for the entire stack, from client processing and network transit through gateway, application, and data layers (the latency budget table later on this page shows one such allocation).
Work backwards from user-facing target to component budgets.
| Use Case | p50 Target | p95 Target | p99 Target | Rationale |
|---|---|---|---|---|
| Typeahead/autocomplete | < 50ms | < 100ms | < 200ms | Must feel instant to not disrupt typing |
| Page navigation | < 200ms | < 500ms | < 1s | Users expect smooth transitions |
| Search results | < 300ms | < 800ms | < 2s | Fast enough to feel like filtering, not waiting |
| Form submission | < 500ms | < 1s | < 3s | Users wait for confirmation; needs to be responsive |
| Report generation | < 2s | < 5s | < 15s | Users expect computation; progress feedback critical |
| Batch API operations | < 10s | < 30s | < 60s | Asynchronous acceptable; timeout must be clear |
The Multi-Percentile SLI Pattern
A single latency target is often insufficient. Consider a multi-tier approach:
Primary target (p95): This is your main SLI target. "95% of requests complete within X ms." This sets the expectation for almost all users.
Tail target (p99): A secondary, looser target that bounds the worst case. "99% of requests complete within Y ms (where Y > X)." This prevents severe tail latency.
Example: p95 < 200ms (primary) and p99 < 1s (tail).
This says: "Almost everyone (95%) gets a response within 200ms, and all but the slowest 1% still get a response in under 1 second."
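A minimal sketch of evaluating one window of measurements against both tiers (the thresholds mirror the example above; names are illustrative):

```python
import numpy as np

def check_latency_slo(latencies_ms, primary=(95, 200.0), tail=(99, 1000.0)):
    """Check a window of latencies against a primary and a tail target.

    Each target is a (percentile, threshold_ms) pair; the defaults
    correspond to "p95 < 200ms" and "p99 < 1s" from the example above.
    """
    arr = np.asarray(latencies_ms)
    report = {}
    for name, (pct, threshold_ms) in (("primary", primary), ("tail", tail)):
        observed = float(np.percentile(arr, pct))
        report[name] = {
            "percentile": f"p{pct}",
            "observed_ms": round(observed, 1),
            "threshold_ms": threshold_ms,
            "met": observed < threshold_ms,
        }
    return report
```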
Why not just p99 for everything?
The stricter the percentile, the harder it is to maintain. p99 targets are highly sensitive to outliers: a handful of pathological requests can push p99 over the threshold in a measurement window. For most services, a p90 or p95 primary target with a looser p99 bound strikes the right balance.
Apdex (Application Performance Index) is an alternative to percentile-based SLIs. It classifies requests as Satisfied (latency ≤ T), Tolerating (between T and 4T), or Frustrated (> 4T), where T is your target threshold. Apdex = (Satisfied + Tolerating/2) / Total. An Apdex of 0.95 means excellent experience. This can be more intuitive for non-technical stakeholders while still driving good latency outcomes.
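A minimal Apdex calculation (T in milliseconds) to make the formula concrete:

```python
def apdex(latencies_ms, t_ms):
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    frustrated: latency > 4T (contributes nothing to the score)
    """
    satisfied = sum(1 for l in latencies_ms if l <= t_ms)
    tolerating = sum(1 for l in latencies_ms if t_ms < l <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# With T = 200ms: 120, 150 and 180ms are satisfied, 500ms is tolerating,
# 900ms is frustrated, giving (3 + 1/2) / 5 = 0.7
print(apdex([150, 500, 900, 180, 120], t_ms=200))
```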
Accurate latency measurement is harder than it appears. Several challenges can corrupt your data.
Challenge 1: Clock Skew and Precision
Distributed systems involve multiple clocks, and clocks drift. If you compute latency by subtracting timestamps from different machines, clock skew introduces errors.
Solutions: compute each duration on a single machine using a monotonic clock rather than subtracting wall-clock timestamps taken on different hosts, and where cross-machine timing is unavoidable, keep clocks tightly synchronized (e.g., NTP) and treat residual skew as measurement error.
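A sketch of the single-clock approach in Python (assuming the request can be timed where it is handled): measure the duration with a monotonic clock on one machine instead of subtracting wall-clock timestamps from two hosts.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Measure a call's duration with a monotonic clock on a single machine.

    time.monotonic() is unaffected by NTP adjustments or wall-clock jumps,
    so the computed duration stays meaningful even if system time changes
    while the request is in flight.
    """
    start = time.monotonic()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return result, elapsed_ms
```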
Challenge 2: Coordinated Omission
Coordinated omission is a measurement error where slow requests are systematically under-counted. It happens when your load generator waits for a response before sending the next request.
If your service is supposed to handle 100 requests/second and one response stalls for 5 seconds, a naive load generator simply stops sending while it waits. The roughly 500 requests that real users would have issued during that stall (and that would likely have been slow) never appear in your data, so the recorded distribution looks far better than reality.
```typescript
// Demonstrating coordinated omission problem

// WRONG: This code exhibits coordinated omission
async function naiveLoadGenerator(targetRps: number, durationSec: number) {
  const intervalMs = 1000 / targetRps;
  const results: number[] = [];
  const endTime = Date.now() + durationSec * 1000;

  while (Date.now() < endTime) {
    const start = Date.now();
    await makeRequest(); // PROBLEM: We wait for response before continuing
    const latency = Date.now() - start;
    results.push(latency);

    // If request took 5 seconds, we've missed 5 seconds' worth of requests
    // Those "would have been slow" requests are not captured
    const sleepTime = Math.max(0, intervalMs - latency);
    await sleep(sleepTime);
  }

  return results; // Under-reports tail latency!
}

// RIGHT: Corrected load generator - measures intended send time
async function correctedLoadGenerator(targetRps: number, durationSec: number) {
  const intervalMs = 1000 / targetRps;
  const results: {
    intendedSendTime: number;
    actualLatency: number;
    correctedLatency: number;
  }[] = [];
  const startTime = Date.now();
  const endTime = startTime + durationSec * 1000;

  let requestNumber = 0;

  while (Date.now() < endTime) {
    const intendedSendTime = startTime + requestNumber * intervalMs;
    const actualSendTime = Date.now();

    // The request is already "late" by the time we send it
    const alreadyLateBy = actualSendTime - intendedSendTime;

    const requestStart = Date.now();
    await makeRequest();
    const requestDuration = Date.now() - requestStart;

    // Corrected latency: how long from INTENDED send time to response
    // This captures the full user-perceived delay
    const correctedLatency = alreadyLateBy + requestDuration;

    results.push({
      intendedSendTime,
      actualLatency: requestDuration,
      correctedLatency: correctedLatency,
    });

    requestNumber++;
    // Don't sleep - immediately check if we should send next request
    // In production, you'd use a proper scheduler
  }

  return results;
}

// The correctedLatency values will show true tail behavior
// actualLatency may look good while correctedLatency reveals the problem
```

Challenge 3: Survivorship Bias
Timed-out requests often don't appear in latency data because they never complete. If a request times out after 30 seconds, it may never be recorded with a latency at all, so the very slowest requests silently vanish from the dataset and the tail looks better than it really is.
Solutions: record timed-out requests at the timeout ceiling (a 30-second timeout counts as 30,000ms) and include them in the latency SLI rather than dropping them, so the tail reflects the requests users actually gave up on.
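A sketch of that approach, assuming a 30-second client timeout: timed-out requests are recorded at the timeout ceiling so they still count against the SLI.

```python
import numpy as np

TIMEOUT_MS = 30_000  # assumed client timeout

def effective_latencies(completed_ms, timed_out_count):
    """Combine completed-request latencies with timed-out requests.

    Each timeout is recorded at the timeout ceiling, so the resulting
    percentiles are a lower bound on the truth (the real latency of a
    timed-out request is unknown but at least TIMEOUT_MS).
    """
    return np.array(list(completed_ms) + [TIMEOUT_MS] * timed_out_count)

latencies = effective_latencies([120, 95, 140, 200, 110], timed_out_count=1)
print(f"p99 including timeouts: {np.percentile(latencies, 99):.0f}ms")
```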
Challenge 4: Aggregation Loss
As discussed earlier, percentiles can't be averaged. If you're using pre-aggregated metrics, you may only have per-minute percentiles with no way to recover accurate percentiles over longer windows; prefer exporting histograms (or raw distributions) so percentiles can be recomputed over any period.
Challenge 5: Sampling Bias
For high-traffic services, you may sample requests for detailed tracing. Sampling must not bias toward fast requests: latency-aware or tail-based sampling that preferentially keeps slow traces is useful for debugging, but computing SLIs from such a sample will distort the percentiles.
For latency SLIs, prefer non-sampled metrics or ensure sampling is unbiased with respect to latency.
If clients time out before the server responds, the server's latency measurement doesn't reflect user experience. The user gave up after 10 seconds, but the server might log a 30-second "successful" response to a connection that's already closed. Monitor client-side timeouts separately from server-side latency.
In systems with multiple components and dependencies, understanding how latency accumulates is critical for meaningful SLIs.
The Latency Budget Concept
A latency budget allocates your end-to-end latency target across components. For a 300ms user-facing SLI:
| Component | Budget | Notes |
|---|---|---|
| Client processing | 20ms | JavaScript execution, DOM update |
| Network (client → edge) | 30ms | Variable, depends on user location |
| CDN/edge processing | 10ms | Static content, edge logic |
| Network (edge → origin) | 20ms | Internal network, usually fast |
| API gateway | 10ms | Auth, routing, rate limiting |
| Application server | 150ms | Core business logic |
| Database queries | 50ms | Data access |
| Serialization/response | 10ms | JSON encoding, compression |
| Total | 300ms | Component budgets must sum to no more than the end-to-end target |
This budget makes hidden assumptions explicit and allocates responsibility.
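A trivial sketch of keeping such a budget honest in code; the component names and values mirror the table above:

```python
END_TO_END_TARGET_MS = 300

LATENCY_BUDGET_MS = {
    "client_processing": 20,
    "network_client_to_edge": 30,
    "cdn_edge_processing": 10,
    "network_edge_to_origin": 20,
    "api_gateway": 10,
    "application_server": 150,
    "database_queries": 50,
    "serialization_response": 10,
}

allocated = sum(LATENCY_BUDGET_MS.values())
assert allocated <= END_TO_END_TARGET_MS, (
    f"budget over-allocated: {allocated}ms > {END_TO_END_TARGET_MS}ms"
)
print(f"allocated {allocated}ms of the {END_TO_END_TARGET_MS}ms end-to-end target")
```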
Serial vs. Parallel Latency
How components are arranged affects total latency:
Serial execution: Latencies add up. If A takes 100ms, then B takes 100ms, total is 200ms.
Parallel execution: Latency is the max of parallel paths. If A and B execute in parallel (both 100ms), total is 100ms.
Microservices architectures offer opportunities for parallelization, but dependencies often force serialization. Request fans out to 5 services, but one service depends on another's result—that path is serial.
Latency in fan-out patterns:
When a request fans out to N backends and waits for all of them, the overall latency is the maximum of the N individual latencies, so the slowest backend on every request determines the outcome.
If each backend has p99 = 100ms and you fan out to 10 backends, the probability that all 10 respond within 100ms is only 0.99^10 ≈ 90%. Roughly one in ten user requests therefore experiences at least one backend's tail, turning a per-backend p99 into something closer to an overall p90.
Fan-out amplifies tail latency dramatically. This is why services with many dependencies struggle with tail latency despite each dependency being individually fast.
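A small sketch of the arithmetic (assuming backends behave independently): the fraction of fan-out requests that see at least one backend slower than its own p99 grows quickly with N.

```python
def fraction_seeing_backend_tail(n_backends, per_backend_quantile=0.99):
    """Probability that at least one of N parallel calls is slower than the
    per-backend latency at the given quantile (independence assumed)."""
    return 1 - per_backend_quantile ** n_backends

for n in (1, 5, 10, 50, 100):
    pct = fraction_seeing_backend_tail(n)
    print(f"{n:>3} backends: {pct:.1%} of requests hit at least one backend's p99 tail")

# 10 backends -> ~9.6%, 100 backends -> ~63.4%
```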
Hedged requests (sending a duplicate of a slow request to another replica and taking whichever response arrives first) improve tail latency but increase load on backends. If your backends are near capacity, hedging can make things worse by overloading them. Use hedging judiciously, preferring it during low-load periods or for critical-path requests only.
Let's examine concrete latency SLI specifications for real-world systems.
Example 1: Search API Latency SLI
```yaml
# Latency SLI Specification: Search API
sli:
  name: "Search API Response Time"
  description: "Time from search query receipt to response delivery"

  measurement:
    what: "Server-side request duration"
    start_event: "HTTP request headers received"
    end_event: "HTTP response body sent"
    unit: "milliseconds"

  scope:
    endpoints:
      - "/api/v1/search"
      - "/api/v2/search"
    filters:
      - "Exclude synthetic monitoring (header X-Synthetic: true)"
      - "Exclude extremely large result sets (> 1000 results)"
      - "Only successful responses (HTTP 2xx)"

  targets:
    primary:
      percentile: p95
      threshold_ms: 200
      rationale: "95% of searches feel instant to users"
    secondary:
      percentile: p99
      threshold_ms: 800
      rationale: "Even slow searches complete before user frustration"
    baseline:
      percentile: p50
      threshold_ms: 50
      rationale: "Typical experience should be excellent"

  slo_formula: >
    proportion_of_requests(latency < threshold) >= target_percentage
    Concrete: proportion_of_requests(latency < 200ms) >= 95%

  aggregation:
    window: "5 minutes for alerting, 30 days rolling for SLO compliance"

  failure_modes:
    - "Requests timing out (> 30s) are counted as failures at 30000ms"
    - "Connection resets counted as failures at threshold + 1ms"
```

Example 2: End-to-End Page Load Latency SLI
```yaml
# Latency SLI Specification: Page Load Performance
sli:
  name: "Product Page Load Time"
  description: "End-to-end time for product page to become usable"

  measurement:
    what: "Client-side Largest Contentful Paint (LCP)"
    source: "Real User Monitoring (RUM)"
    start_event: "Navigation start (user click or typed URL)"
    end_event: "LCP event fired (main content rendered)"
    unit: "milliseconds"
    notes: >
      LCP is chosen as it represents when users perceive the page as
      "loaded" - the main product image and details are visible. This is
      more user-relevant than DOMContentLoaded or load event.

  scope:
    pages:
      - "/products/:id"
      - "/items/:sku"
    user_segments:
      - "All users (no sampling)"
    devices:
      - "All device types"
    networks:
      - "All network types"
      - "Note: We may set different targets by network class later"

  targets:
    primary:
      percentile: p75
      threshold_ms: 2500
      rationale: >
        Core Web Vitals recommends LCP < 2.5s at p75 for "good" UX.
        This is our minimum acceptable bar.
    stretch:
      percentile: p75
      threshold_ms: 1500
      rationale: "Competitive differentiation through speed"
    tail:
      percentile: p95
      threshold_ms: 4000
      rationale: "No user should wait excessively"

  segmentation:
    by_network:
      4g_plus:
        p75_target: 1500ms
      3g:
        p75_target: 3500ms
      slow_2g:
        p75_target: 6000ms  # Different expectations for constrained networks
    by_region:
      - Track but don't alert on regional variations initially
      - Use to identify CDN or geographic performance issues

  correlation:
    business_metric: "Product page bounce rate"
    expected_relationship: "Negative - faster pages, lower bounce rate"
```

Example 3: Database Query Latency SLI (Internal Component)
```yaml
# Latency SLI Specification: Database Query Performance
sli:
  name: "Primary Database Read Latency"
  description: "Time for database queries to execute"
  purpose: "Internal component SLI - supports user-facing SLIs"

  measurement:
    what: "Query execution time as reported by client library"
    start_event: "Query sent to connection pool"
    end_event: "Last row of result set received"
    includes:
      - Connection pool wait time
      - Network round-trip to database
      - Query execution on database
      - Result set transmission
    excludes:
      - Application-side result processing
      - ORM hydration time

  scope:
    query_types:
      - "SELECT (read) queries only"
    databases:
      - "primary-postgres-cluster"

  targets:
    by_query_class:
      simple_lookups:    # Primary key lookups
        p50: 2ms
        p99: 10ms
      indexed_queries:   # Queries using indexes
        p50: 10ms
        p99: 50ms
      complex_queries:   # JOINs, aggregations
        p50: 50ms
        p99: 200ms
      reports:           # Analytical queries (expected slow)
        p50: 500ms
        p99: 5000ms
    aggregate:
      p95: 30ms
      p99: 100ms

  budget_context: >
    With 5 queries average per user request and 150ms server budget,
    database queries can consume max 50ms (leaving 100ms for other work).
    p95 at 30ms provides headroom.

  alerts:
    - condition: "p95 > 50ms for 5 minutes"
      severity: "warning"
    - condition: "p99 > 200ms for 5 minutes"
      severity: "critical"
```

Latency SLIs capture the quality of user experience—not just whether the service works, but how it feels to use. The key learnings: latency is a distribution, not a single number; percentiles, not averages, describe what users experience; targets work best as a primary percentile bound (typically p90 or p95) paired with a looser tail bound (p99); and measurement pitfalls such as coordinated omission, survivorship bias, and aggregation loss must be actively designed around.
What's Next
We've covered availability ("did it work?") and latency ("how fast was it?"). The next SLI dimension is error rate—measuring the proportion of operations that fail and understanding the nuances of error classification, severity, and impact.
You now have a comprehensive understanding of latency SLIs—from the philosophy of measurement through practical implementation. You can analyze latency distributions, set appropriate percentile-based targets, avoid common measurement pitfalls, and design latency SLIs that genuinely reflect user experience.