An SLI is only as valuable as the data backing it. You can design the perfect user-centric SLI on paper, but if your measurement infrastructure introduces gaps, biases, or errors, your SLI becomes fiction—a number that moves but doesn't reflect reality.
The gap between measured performance and experienced performance is where reliability programs go wrong. Teams celebrate hitting SLO targets while users complain. Dashboards show green while revenue drops. Alerts stay quiet while outages rage. These disconnects stem from measurement failures: the right events aren't captured, the wrong events are included, timing is imprecise, or data is lost in transit.
Measuring SLIs accurately is a discipline unto itself. It requires understanding where data originates, how it flows, what can corrupt it, and how to validate that your measurements reflect truth. This page equips you with strategies to ensure your SLIs are trustworthy.
By the end of this page, you will understand comprehensive SLI instrumentation strategies, data collection architecture options, sampling and aggregation techniques, methods for validating measurement accuracy, and common measurement failures with their mitigations. You'll be equipped to build SLI measurement systems you can trust.
SLI measurement involves a pipeline from raw events to computed indicators. Understanding each stage reveals potential failure points.
Stage 1: Event Generation
The journey begins when something measurable happens:
At this stage, instrumentation code captures the event and its metadata. Potential issues:
Stage 2: Data Collection
Captured events must be collected and transmitted:
Potential issues:
| Stage | Description | Key Risks | Validation Approach |
|---|---|---|---|
| Event Generation | Instrumentation captures raw events | Missing events, wrong timestamps | Instrumentation audits, test events |
| Data Collection | Events transmitted to collection systems | Loss in transit, sampling bias | Collection rate monitoring, trace audits |
| Storage | Events persisted for analysis | Retention gaps, storage failures | Storage health checks, reconciliation |
| Aggregation | Raw events rolled up into metrics | Aggregation errors, timing windows | Cross-source verification, raw vs. aggregate comparison |
| Computation | SLI calculated from aggregated data | Formula errors, edge cases | Unit tests, known-answer tests |
| Presentation | SLI displayed to stakeholders | Visualization errors, stale data | Manual verification, cache invalidation |
Stage 3: Storage
Collected data is stored for analysis:
Potential issues:
Stage 4: Aggregation
Raw events are aggregated into metrics:
Potential issues:
Stage 5: Computation
The final SLI value is calculated:
Potential issues:
Stage 6: Presentation
SLI is displayed to humans or triggers automated actions:
Potential issues:
Every stage in the measurement pipeline is a potential point of failure. The more stages, the more opportunities for errors to accumulate. Regularly validate that end-to-end SLI values reflect ground truth, not just that each stage appears healthy individually.
Instrumentation is the foundation of SLI measurement. The code that captures events determines what data is available for all downstream processing.
Instrumentation Placement
Server-side instrumentation:
Client-side instrumentation (Real User Monitoring):
Synthetic instrumentation:
Infrastructure instrumentation:
```typescript
// Properly instrumented service layer for SLI measurement
import { Histogram, Counter } from './metrics-client';
// Assumes an Express-style HTTP framework for the middleware types
import { Request, Response, NextFunction } from 'express';

// Define metrics with appropriate buckets and labels
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  labelNames: ['method', 'route', 'status_code', 'outcome'],
});

const requestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code', 'outcome'],
});

// Middleware for automatic request instrumentation
export function instrumentRequest(req: Request, res: Response, next: NextFunction) {
  const startTime = process.hrtime.bigint();
  const route = extractRoutePattern(req); // Use pattern, not path with IDs

  // Capture response details after completion
  res.on('finish', () => {
    const endTime = process.hrtime.bigint();
    const durationSeconds = Number(endTime - startTime) / 1e9;

    // Determine outcome based on status code and response
    const outcome = determineOutcome(res.statusCode, res.get('X-Error-Code'));

    // Record histogram observation
    requestLatency.observe(
      {
        method: req.method,
        route: route,
        status_code: String(res.statusCode),
        outcome: outcome,
      },
      durationSeconds
    );

    // Increment counter
    requestTotal.inc({
      method: req.method,
      route: route,
      status_code: String(res.statusCode),
      outcome: outcome,
    });
  });

  // Handle cases where response never finishes (connection dropped)
  req.on('close', () => {
    if (!res.finished) {
      // Client disconnected before response completed
      const durationSeconds = Number(process.hrtime.bigint() - startTime) / 1e9;
      requestLatency.observe(
        {
          method: req.method,
          route: route,
          status_code: 'client_closed',
          outcome: 'error',
        },
        durationSeconds
      );
      requestTotal.inc({
        method: req.method,
        route: route,
        status_code: 'client_closed',
        outcome: 'error',
      });
    }
  });

  next();
}

// Route pattern extraction - avoid high cardinality
function extractRoutePattern(req: Request): string {
  // Convert /users/12345 to /users/{id}
  // This keeps metric cardinality bounded
  const match = req.route?.path;
  if (match) return match;

  // Fallback: try to infer pattern
  return req.path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/{uuid}')
    .replace(/\/\d+/g, '/{id}');
}

// Outcome determination for SLI classification
function determineOutcome(statusCode: number, errorCode?: string): string {
  if (statusCode >= 500) return 'error';
  if (statusCode === 429) return 'rate_limited';
  if (statusCode >= 400) return 'client_error';
  if (errorCode) return 'semantic_error';
  return 'success';
}
```

Instrumentation Best Practices
1. Use semantic labels, not raw values:
Labels like route="/users/{id}" keep cardinality bounded. Labels like route="/users/12345" explode cardinality.
2. Capture all request outcomes: including requests that were aborted, timed out, or hit other edge cases. If a request occurs but doesn't complete normally, it should still be instrumented.
3. Use monotonic clocks for durations:
System clocks can jump (NTP adjustments), so wall-clock deltas can be wrong or even negative. Use monotonic timers like process.hrtime() or time.monotonic() for duration measurement (see the sketch after this list).
4. Attach correlation context: Link metrics to traces and logs via correlation IDs. When SLI breaches occur, you need to drill into specific requests.
5. Version your instrumentation: When changing instrumentation, you may break historical comparisons. Version metrics or document changes.
6. Test instrumentation explicitly: Unit test that expected metrics are emitted for various scenarios. Integration test that metrics appear in your monitoring system.
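To illustrate practice 3, here is a minimal Python sketch (framework-agnostic; the `operation` callable stands in for your real request handling) contrasting monotonic and wall-clock timing:

```python
import time


def measure_duration_monotonic(operation) -> float:
    """Measure how long `operation` takes using a monotonic clock.

    time.monotonic() is unaffected by NTP adjustments or manual clock
    changes, so the measured duration can never come out negative.
    """
    start = time.monotonic()
    operation()
    return time.monotonic() - start


def measure_duration_wall_clock(operation) -> float:
    """Anti-pattern: wall-clock timing can jump mid-measurement."""
    start = time.time()  # Subject to NTP steps and manual clock changes
    operation()
    return time.time() - start  # May be negative if the clock stepped backward
```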
High-cardinality labels (user IDs, request IDs, full URLs) rapidly exhaust time-series database capacity and budget. Design labels for SLI-relevant segmentation (route pattern, outcome type, region), not debugging granularity. Use tracing for per-request debugging.
At scale, capturing 100% of events may be impractical or cost-prohibitive. Sampling trades completeness for efficiency—but it must be done carefully to avoid biasing your SLIs.
When Sampling Is Appropriate
When Sampling Is Dangerous
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Head-based random | Sample decision at request start | Simple, consistent traces | Rare slow or error requests may be missed at low sample rates |
| Tail-based | Sample decision after request completes | Can select errors/slow requests | Complex, requires buffering all requests |
| Rate limiting | Cap N samples per time window | Controlled volume | Busy periods under-sampled relative to quiet |
| Priority-based | Higher rate for important contexts | Captures critical cases | Complex to configure priorities |
| Error-biased | Always sample errors, sample successes | Errors never missed | May bias error rate measurement upward |
Accurate SLI Calculation With Sampling
For SLI calculation, you generally need accurate counts, not sampled estimates. The standard approach:
Metrics (counters, histograms): Capture 100%. These are cheap—incrementing a counter costs almost nothing.
Traces and detailed data: Sample. These are expensive—full context per request adds up.
By separating concerns, you get accurate SLIs from counters and rich debugging data from sampled traces.
If you must sample for SLI calculation, you need statistical techniques:
# Unbiased error rate estimation from sampled data
sampled_errors = 500
sampled_total = 10000
sampling_rate = 0.01 # 1% sample
# Estimated actual counts
estimated_total_errors = sampled_errors / sampling_rate # 50,000
estimated_total_requests = sampled_total / sampling_rate # 1,000,000
# Error rate (same whether sampled or not, if sampling is unbiased)
error_rate = sampled_errors / sampled_total # 5%
# Confidence interval (wider with lower sample size)
# Use Wilson score interval or similar
Critical requirement: Sampling must be uniform and unbiased with respect to the SLI outcome. If slow requests are more likely to be sampled (or less likely), your latency SLI is biased.
A common pattern is to always sample errors and randomly sample successes. This is great for debugging but terrible for SLI calculation. If you sample 100% of errors and 1% of successes, naive error rate calculation is completely wrong. Either calculate SLIs from non-sampled counters, or apply proper statistical correction.
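If you do have to estimate an error rate from differentially sampled data, weight each class by the inverse of its sampling rate. A minimal sketch with illustrative counts:

```python
# Differential sampling: 100% of errors kept, 1% of successes kept.
ERROR_SAMPLE_RATE = 1.0
SUCCESS_SAMPLE_RATE = 0.01

sampled_errors = 500        # every error was kept
sampled_successes = 9_500   # only 1% of successes were kept

# Naive calculation treats the sample as if it were the full population.
naive_error_rate = sampled_errors / (sampled_errors + sampled_successes)
# ≈ 5% — wildly overstated.

# Correct calculation scales each class by 1 / sampling_rate.
estimated_errors = sampled_errors / ERROR_SAMPLE_RATE          # 500
estimated_successes = sampled_successes / SUCCESS_SAMPLE_RATE  # 950,000
corrected_error_rate = estimated_errors / (estimated_errors + estimated_successes)
# ≈ 0.05% — the rate users actually experienced.

print(f"Naive: {naive_error_rate:.2%}, corrected: {corrected_error_rate:.2%}")
```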
Raw events must be aggregated into SLI values. How you aggregate determines what the SLI reflects.
Time Window Aggregation
SLIs are computed over time windows. Window selection matters:
Rolling windows: "Last 5 minutes," "Last 30 days"
Fixed/calendar windows: "Monday 00:00-23:59," "January 2024"
Tumbling windows: fixed-size, non-overlapping windows that advance in discrete steps (e.g., consecutive 5-minute buckets)
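To make the window types concrete, here is a minimal sketch (the per-minute bucket format is illustrative, not a specific monitoring API) of a rolling-window versus calendar-window availability SLI:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each entry: (minute_timestamp, good_events, total_events)
MinuteBucket = Tuple[datetime, int, int]


def rolling_window_sli(buckets: List[MinuteBucket], now: datetime,
                       window: timedelta = timedelta(days=30)) -> float:
    """Availability over the trailing window ending at `now` (e.g. 'last 30 days')."""
    start = now - window
    good = sum(g for ts, g, t in buckets if start <= ts <= now)
    total = sum(t for ts, g, t in buckets if start <= ts <= now)
    return good / total if total else 1.0  # No traffic: treat as compliant


def calendar_window_sli(buckets: List[MinuteBucket], year: int, month: int) -> float:
    """Availability over a fixed calendar month (e.g. 'January 2024')."""
    good = sum(g for ts, g, t in buckets if ts.year == year and ts.month == month)
    total = sum(t for ts, g, t in buckets if ts.year == year and ts.month == month)
    return good / total if total else 1.0
```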
Cross-Window Percentile Aggregation
The most common aggregation mistake: averaging percentiles.
Wrong approach:
Hour 1 p95: 100ms
Hour 2 p95: 200ms
Daily p95: (100 + 200) / 2 = 150ms # WRONG!
Why it's wrong: Percentiles are not additive. The true daily p95 depends on the full distribution of both hours, not their individual p95 values.
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HistogramBucket:
    """A bucket in a histogram with upper bound and count."""
    upper_bound: float
    count: int


class HistogramAggregator:
    """
    Aggregates latency histograms correctly for SLI computation.

    Histograms can be merged (summing bucket counts) to compute
    accurate percentiles over longer periods.
    """

    def __init__(self, bucket_bounds: List[float]):
        """
        Initialize with bucket boundaries.
        Example: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, float('inf')]
        """
        self.bucket_bounds = sorted(bucket_bounds)
        if self.bucket_bounds[-1] != float('inf'):
            self.bucket_bounds.append(float('inf'))
        self.bucket_counts = [0] * len(self.bucket_bounds)
        self.total_count = 0
        self.total_sum = 0.0

    def observe(self, value: float) -> None:
        """Record a single observation."""
        for i, bound in enumerate(self.bucket_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
                break
        self.total_count += 1
        self.total_sum += value

    def merge(self, other: 'HistogramAggregator') -> 'HistogramAggregator':
        """
        Merge two histograms. This is the correct way to aggregate
        histograms across time windows or servers.
        """
        if self.bucket_bounds != other.bucket_bounds:
            raise ValueError("Cannot merge histograms with different bucket bounds")

        result = HistogramAggregator(self.bucket_bounds[:-1])  # Exclude inf
        result.bucket_counts = [
            a + b for a, b in zip(self.bucket_counts, other.bucket_counts)
        ]
        result.total_count = self.total_count + other.total_count
        result.total_sum = self.total_sum + other.total_sum
        return result

    def percentile(self, p: float) -> float:
        """
        Estimate the p-th percentile from the histogram.
        This is an approximation since we only have bucketed data.
        """
        if p < 0 or p > 100:
            raise ValueError("Percentile must be between 0 and 100")
        if self.total_count == 0:
            return 0.0

        target_count = (p / 100.0) * self.total_count
        cumulative = 0

        for i, count in enumerate(self.bucket_counts):
            if cumulative + count >= target_count:
                # Linear interpolation within bucket
                lower_bound = self.bucket_bounds[i - 1] if i > 0 else 0
                upper_bound = self.bucket_bounds[i]
                if upper_bound == float('inf'):
                    return lower_bound  # Can't interpolate into infinity

                # Fraction into this bucket
                fraction = (target_count - cumulative) / count if count > 0 else 0
                return lower_bound + fraction * (upper_bound - lower_bound)
            cumulative += count

        return self.bucket_bounds[-2]  # Last finite bucket


# Usage example: Aggregate hourly histograms into daily
hourly_histograms: List[HistogramAggregator] = load_hourly_histograms()

# Correct: Merge histograms, then compute percentile
daily_histogram = hourly_histograms[0]
for h in hourly_histograms[1:]:
    daily_histogram = daily_histogram.merge(h)

accurate_daily_p95 = daily_histogram.percentile(95)
print(f"Accurate daily p95: {accurate_daily_p95:.2f}ms")

# Wrong: Average of hourly p95s
wrong_daily_p95 = np.mean([h.percentile(95) for h in hourly_histograms])
print(f"Wrong daily p95 (averaged): {wrong_daily_p95:.2f}ms")
```

Spatial Aggregation
Beyond time, you may aggregate across dimensions:
Aggregate across servers: Sum counts and merge histograms from all instances
Aggregate across regions: Combine data from multiple data centers
Aggregate across versions: Combine data from canary and stable deployments
The same rules apply: sum counts, merge histograms, then compute percentiles—never average percentiles.
Weighted Aggregation
Sometimes different segments should contribute differently:
Traffic-weighted: High-traffic endpoints contribute more to the aggregate SLI
Business-weighted: Critical endpoints weighted higher regardless of traffic
Weighted SLI = Σ(segment_SLI × segment_weight) / Σ(segment_weight)
This works for averages and success rates. For percentiles, weight by sample count (merge histograms proportionally).
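A minimal sketch of that formula in code (the segment names and weights are illustrative):

```python
from typing import Dict


def weighted_sli(segment_slis: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted SLI = Σ(segment_SLI × segment_weight) / Σ(segment_weight).

    Works for ratio-style SLIs (availability, success rate). For latency
    percentiles, merge the underlying histograms instead.
    """
    total_weight = sum(weights[s] for s in segment_slis)
    if total_weight == 0:
        raise ValueError("Weights must not sum to zero")
    return sum(segment_slis[s] * weights[s] for s in segment_slis) / total_weight


# Business-weighted example: checkout matters more than search, despite less traffic.
slis = {"checkout": 0.995, "search": 0.999, "browse": 0.9992}
weights = {"checkout": 5.0, "search": 2.0, "browse": 1.0}
print(f"Weighted availability SLI: {weighted_sli(slis, weights):.4f}")
```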
Modern monitoring systems (Prometheus, OpenTelemetry) have native histogram support that handles aggregation correctly. Use these built-in primitives rather than implementing your own. They handle edge cases and cross-instance aggregation that are easy to get wrong manually.
How do you know your SLI measurements are correct? Measurement systems can silently fail or drift. Validation strategies create confidence.
Validation Strategy 1: Known-Answer Tests
Inject test events with known characteristics and verify they appear correctly in SLI calculations:
This validates the complete pipeline from instrumentation to presentation.
Validation Strategy 2: Cross-Source Verification
Compute the same SLI from multiple independent data sources:
Discrepancies indicate problems in one or more measurement pipelines.
```python
import logging
import time
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List


def generate_unique_id() -> str:
    """Unique identifier for injected test requests."""
    return uuid.uuid4().hex


@dataclass
class SLIValue:
    value: float
    source: str
    window_start: str
    window_end: str
    sample_count: int


class SLIValidator:
    """
    Validates SLI measurements across multiple sources and over time.
    """

    def __init__(
        self,
        tolerance_percent: float = 5.0,
        minimum_sample_count: int = 100
    ):
        self.tolerance_percent = tolerance_percent
        self.minimum_sample_count = minimum_sample_count
        self.logger = logging.getLogger(__name__)

    def cross_source_validation(
        self,
        sli_values: List[SLIValue]
    ) -> Dict[str, Any]:
        """
        Compare SLI values from multiple sources for the same period.
        Flags discrepancies that exceed tolerance.
        """
        if len(sli_values) < 2:
            return {"status": "insufficient_sources", "sources": len(sli_values)}

        # Check sample counts
        sufficient_data = [
            v for v in sli_values
            if v.sample_count >= self.minimum_sample_count
        ]

        if len(sufficient_data) < 2:
            return {
                "status": "insufficient_samples",
                "sources_with_data": len(sufficient_data)
            }

        # Calculate variance
        values = [v.value for v in sufficient_data]
        mean_value = sum(values) / len(values)
        max_deviation = max(abs(v - mean_value) for v in values)
        deviation_percent = (max_deviation / mean_value * 100) if mean_value > 0 else 0

        # Find discrepant pairs
        discrepancies = []
        for i, v1 in enumerate(sufficient_data):
            for v2 in sufficient_data[i + 1:]:
                diff_percent = abs(v1.value - v2.value) / max(v1.value, v2.value) * 100
                if diff_percent > self.tolerance_percent:
                    discrepancies.append({
                        "source_a": v1.source,
                        "value_a": v1.value,
                        "source_b": v2.source,
                        "value_b": v2.value,
                        "difference_percent": round(diff_percent, 2)
                    })

        status = "pass" if not discrepancies else "discrepancy_detected"

        return {
            "status": status,
            "mean_value": round(mean_value, 4),
            "max_deviation_percent": round(deviation_percent, 2),
            "sources_checked": len(sufficient_data),
            "discrepancies": discrepancies
        }

    def temporal_consistency_check(
        self,
        historical_slis: List[SLIValue],
        current_sli: SLIValue,
        anomaly_threshold_std: float = 3.0
    ) -> Dict[str, Any]:
        """
        Check if current SLI value is consistent with historical pattern.
        Flags sudden changes that might indicate measurement issues.
        """
        if len(historical_slis) < 10:
            return {"status": "insufficient_history"}

        historical_values = [v.value for v in historical_slis[-30:]]  # Last 30 periods
        mean = sum(historical_values) / len(historical_values)
        variance = sum((v - mean) ** 2 for v in historical_values) / len(historical_values)
        std_dev = variance ** 0.5

        if std_dev == 0:
            # All historical values identical - any change is notable
            is_anomaly = current_sli.value != mean
        else:
            z_score = (current_sli.value - mean) / std_dev
            is_anomaly = abs(z_score) > anomaly_threshold_std

        return {
            "status": "anomaly_detected" if is_anomaly else "consistent",
            "current_value": current_sli.value,
            "historical_mean": round(mean, 4),
            "historical_std_dev": round(std_dev, 4),
            "z_score": round((current_sli.value - mean) / std_dev, 2) if std_dev > 0 else None,
            "note": "Verify measurement pipeline if anomaly detected"
        }

    def injection_test(
        self,
        inject_request_func,
        verify_metric_func,
        test_latency_ms: float = 100.0
    ) -> Dict[str, Any]:
        """
        Inject a known test request and verify it appears in metrics.
        """
        test_id = generate_unique_id()

        # Inject test request with known characteristics
        inject_start = time.time()
        inject_request_func(
            test_id=test_id,
            artificial_latency_ms=test_latency_ms,
            mark_as_test=True
        )
        inject_end = time.time()

        # Wait for metrics pipeline to process
        time.sleep(5)  # Allow time for propagation

        # Verify the request appears in metrics
        verification = verify_metric_func(
            test_id=test_id,
            expected_latency_ms=test_latency_ms,
            time_range=(inject_start, inject_end)
        )

        return {
            "status": "pass" if verification.found else "fail",
            "test_id": test_id,
            "expected_latency_ms": test_latency_ms,
            "found_in_metrics": verification.found,
            "measured_latency_ms": verification.measured_latency,
            "latency_error_ms": abs(verification.measured_latency - test_latency_ms) if verification.found else None
        }
```

Validation Strategy 3: Sanity Bounds
Set bounds on SLI values that, if exceeded, indicate measurement problems rather than real issues:
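For example, a sanity check might flag values that are impossible or implausible rather than merely bad (the thresholds here are illustrative and should be tuned per service):

```python
from typing import List


def sanity_check_sli(availability: float, total_requests: int,
                     expected_min_requests: int = 1_000) -> List[str]:
    """Return reasons the measurement itself looks suspect (not the service)."""
    problems = []
    if not 0.0 <= availability <= 1.0:
        problems.append("availability outside [0, 1]: formula or units are wrong")
    if total_requests == 0:
        problems.append("zero requests in window: likely a collection outage")
    elif total_requests < expected_min_requests:
        problems.append("request count far below normal: possible data loss")
    if availability == 1.0 and total_requests > expected_min_requests:
        problems.append("perfect availability at scale: errors may not be instrumented")
    return problems
```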
Validation Strategy 4: Consistency with Business Metrics
SLI values should correlate with business metrics:
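One lightweight sketch of such a check, assuming you can export both series at the same time granularity (the thresholds and the orders-per-minute metric are illustrative):

```python
from typing import List


def business_consistency_flags(sli_values: List[float],
                               orders_per_min: List[float],
                               sli_bad_threshold: float = 0.99,
                               orders_drop_ratio: float = 0.5) -> List[int]:
    """Flag time buckets where the SLI and a business metric disagree.

    - SLI bad but orders normal: the SLI may be over-counting failures.
    - SLI good but orders collapsed: the SLI may be blind to a real outage.
    Returns indexes of suspicious buckets for manual review.
    """
    baseline = sorted(orders_per_min)[len(orders_per_min) // 2]  # median as baseline
    suspicious = []
    for i, (sli, orders) in enumerate(zip(sli_values, orders_per_min)):
        sli_bad = sli < sli_bad_threshold
        orders_bad = orders < baseline * orders_drop_ratio
        if sli_bad != orders_bad:
            suspicious.append(i)
    return suspicious
```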
Validation Strategy 5: Manual Spot Checks
Periodically validate manually:
Your SLI calculation pipeline is itself a service that can have availability issues, latency problems, and errors. Consider meta-SLIs: What's the availability of your metrics pipeline? What's the latency of SLI calculation? How often are SLI values missing or delayed? These meta-metrics protect against silent measurement failures.
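One such meta-SLI, sketched minimally (the timestamp inputs are assumed to come from your metrics store), measures whether SLI datapoints arrive on schedule:

```python
from typing import List


def sli_pipeline_availability(expected_timestamps: List[float],
                              received_timestamps: List[float],
                              tolerance_seconds: float = 60.0) -> float:
    """Meta-SLI: fraction of expected SLI datapoints that actually arrived on time.

    expected_timestamps: when an SLI value should have been computed (e.g. every minute).
    received_timestamps: when values actually landed in the metrics store.
    """
    received = sorted(received_timestamps)
    on_time = 0
    for expected in expected_timestamps:
        if any(abs(r - expected) <= tolerance_seconds for r in received):
            on_time += 1
    return on_time / len(expected_timestamps) if expected_timestamps else 1.0
```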
Knowing common failure modes helps you design robust measurement systems and quickly diagnose problems when SLI values seem wrong.
Failure 1: Survivorship Bias
What happens: Only successful requests are instrumented. Failed requests (that crashed before logging, timed out before completion) are invisible.
Symptom: SLI looks better than user experience. Users complain but metrics are green.
Mitigation: Instrument at request entry and exit. Use middleware that guarantees logging regardless of request outcome. Count timeouts explicitly from client perspective.
Failure 2: Clock Synchronization Issues
What happens: Distributed systems have clock skew. A request that started on Server A at 12:00:00 might be logged on Server B at 11:59:58.
Symptom: Negative latencies, events appearing out of order, aggregation window misalignment.
Mitigation: Use NTP synchronization with monitoring. Measure durations on single hosts. Use monotonic clocks for timing. Include clock offset estimates in data.
Failure 3: Aggregation Computation Errors
What happens: Incorrect aggregation logic (averaging percentiles, wrong window alignment, off-by-one errors in bucket boundaries).
Symptom: SLI values don't make sense (p50 > p95), or change unexpectedly when aggregation code changes.
Mitigation: Unit test aggregation code thoroughly. Use well-tested libraries. Cross-validate with alternative calculation methods.
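As an example, a known-answer test against the HistogramAggregator sketch above (pytest-style, assuming that class is importable) checks that merging produces percentiles consistent with the raw data:

```python
import math


def test_merged_percentile_matches_raw_data():
    """Known-answer test: merged-histogram p95 stays close to the raw-data p95."""
    bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
    hour1 = HistogramAggregator(bounds)
    hour2 = HistogramAggregator(bounds)

    raw = []
    for value in [0.04, 0.06, 0.09, 0.2, 0.3]:
        hour1.observe(value)
        raw.append(value)
    for value in [0.4, 0.6, 0.8, 1.5, 2.0]:
        hour2.observe(value)
        raw.append(value)

    merged = hour1.merge(hour2)

    # Nearest-rank p95 of the raw values (the "known answer")
    raw_p95 = sorted(raw)[math.ceil(0.95 * len(raw)) - 1]

    # Bucketed percentiles are approximate: allow one bucket width (1.0 -> 2.5)
    assert abs(merged.percentile(95) - raw_p95) <= 1.5
    # Percentiles must be monotone: p50 can never exceed p95
    assert merged.percentile(50) <= merged.percentile(95)
```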
Failure 4: Sampling Bias
What happens: Non-uniform sampling preferentially captures or misses certain request types.
Symptom: SLI differs significantly from RUM data or user perception. Rare events (errors, slow requests) under-represented.
Mitigation: Use unbiased sampling. For tracing, consider tail-based sampling that captures interesting requests. Compute SLIs from 100% metric data, not sampled traces.
Failure 5: Definition Drift
What happens: SLI definition on paper doesn't match implementation. Code changes over time but SLI spec isn't updated.
Symptom: SLI values seem reasonable but don't reflect stakeholder understanding of what's measured.
Mitigation: Treat SLI definition as code. Version control it. Review changes. Periodically audit implementation against specification.
Failure 6: Scope Creep
What happens: SLI scope expands to include traffic types it wasn't designed for (internal tools, synthetic monitoring, bot traffic).
Symptom: SLI becomes polluted, no longer reflects target user population.
Mitigation: Explicit traffic filtering. Separate SLIs for different traffic types. Regular audits of what's in the denominator.
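A minimal sketch of explicit traffic filtering at instrumentation time (the header names, network prefixes, and bot heuristic are all illustrative):

```python
from typing import Mapping

INTERNAL_NETWORK_PREFIXES = ("10.", "192.168.")  # illustrative internal ranges
BOT_USER_AGENT_MARKERS = ("bot", "crawler", "monitoring-probe")


def counts_toward_sli(headers: Mapping[str, str], client_ip: str) -> bool:
    """Decide whether a request belongs in the SLI denominator.

    Synthetic probes, internal tooling, and bot traffic are excluded so the
    SLI keeps reflecting the real user population it was designed for.
    """
    if headers.get("X-Synthetic-Test") == "true":
        return False
    if client_ip.startswith(INTERNAL_NETWORK_PREFIXES):
        return False
    user_agent = headers.get("User-Agent", "").lower()
    if any(marker in user_agent for marker in BOT_USER_AGENT_MARKERS):
        return False
    return True
```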
Based on the principles and pitfalls discussed, here's a framework for designing reliable SLI measurement infrastructure.
Layer 1: Instrumentation
Layer 2: Collection
Layer 3: Storage
```yaml
# SLI Measurement Architecture Reference
measurement_architecture:
  instrumentation_layer:
    standards:
      - "OpenTelemetry for tracing"
      - "Prometheus client for metrics"
      - "Structured JSON logging"
    requirements:
      - "All services MUST use approved instrumentation libraries"
      - "Custom instrumentation requires SRE review"
      - "Instrumentation coverage verified in CI pipeline"
    anti_patterns:
      - "Direct stdout/stderr for SLI-relevant data"
      - "Custom serialization formats"
      - "Unbounded cardinality labels"

  collection_layer:
    primary: "Prometheus federated pull model"
    secondary: "Load balancer log shipping (backup data source)"
    rum: "Client SDK → Edge ingestion → Stream processing"
    reliability:
      - "Metrics scrape success rate SLI > 99.9%"
      - "Collection latency p99 < 2 minutes"
      - "Automatic retry on transient failures"
    monitoring:
      - "Scrape target health dashboard"
      - "Alert on target down > 5 minutes"
      - "Alert on sudden metric count drop"

  storage_layer:
    time_series_database: "Prometheus (short-term) + Thanos (long-term)"
    retention:
      high_resolution: "15 days (15s intervals)"
      medium_resolution: "90 days (1m intervals)"
      low_resolution: "13 months (5m intervals)"
    high_availability:
      - "Multi-replica ingestion"
      - "Cross-region replication for disaster recovery"
    monitoring:
      - "Storage capacity forecasting"
      - "Query latency tracking"
      - "Ingestion success rate"

  computation_layer:
    sli_calculator:
      language: "PromQL / SQL depending on source"
      execution: "Every 1 minute for real-time, daily for reports"
      output: "SLI values written to dedicated metrics"
    validation:
      - "Known-answer tests run hourly"
      - "Cross-source comparison daily"
      - "Sanity bounds checked on every computation"
    error_handling:
      - "Missing data points → flag as incomplete, don't extrapolate"
      - "Computation failure → alert, no stale data displayed"

  presentation_layer:
    dashboards:
      - "Real-time SLI status (1-minute refresh)"
      - "SLO compliance tracking (1-hour refresh)"
      - "Historical trends (daily refresh)"
    alerts:
      - "Routed to PagerDuty for critical"
      - "Routed to Slack for warning"
    reports:
      - "Weekly SLO summary email"
      - "Monthly executive report"
```

Layer 4: Computation
Layer 5: Presentation
Operational Considerations
Consider setting SLOs for your measurement system itself: metrics collection availability, SLI computation latency, dashboard uptime. If you can't trust that your measurements are current and correct, you can't trust the SLIs they produce.
Accurate SLI measurement is the foundation upon which reliable SLO practice is built. Without trustworthy data, SLOs are theater. Let's consolidate the key learnings:
Module Complete
Congratulations! You've completed the Choosing SLIs module. You now understand how to:
With these foundations, you're ready to move on to Setting SLOs—translating SLI measurements into targets that balance reliability investment with feature velocity.
You've mastered the art and science of choosing and measuring SLIs. You can design SLIs that reflect user experience, avoid common measurement pitfalls, build robust measurement infrastructure, and validate that your data reflects reality. These skills form the foundation for effective SLO practice and reliable service operation.