An SLI is only as valuable as the data backing it. You can design the perfect user-centric SLI on paper, but if your measurement infrastructure introduces gaps, biases, or errors, your SLI becomes fiction—a number that moves but doesn't reflect reality.
The gap between measured performance and experienced performance is where reliability programs go wrong. Teams celebrate hitting SLO targets while users complain. Dashboards show green while revenue drops. Alerts stay quiet while outages rage. These disconnects stem from measurement failures: the right events aren't captured, the wrong events are included, timing is imprecise, or data is lost in transit.
Measuring SLIs accurately is a discipline unto itself. It requires understanding where data originates, how it flows, what can corrupt it, and how to validate that your measurements reflect truth. This page equips you with strategies to ensure your SLIs are trustworthy.
By the end of this page, you will understand comprehensive SLI instrumentation strategies, data collection architecture options, sampling and aggregation techniques, methods for validating measurement accuracy, and common measurement failures with their mitigations. You'll be equipped to build SLI measurement systems you can trust.
SLI measurement involves a pipeline from raw events to computed indicators. Understanding each stage reveals potential failure points.
Stage 1: Event Generation
The journey begins when something measurable happens:
At this stage, instrumentation code captures the event and its metadata. Potential issues:
Stage 2: Data Collection
Captured events must be collected and transmitted:
Potential issues:
| Stage | Description | Key Risks | Validation Approach |
|---|---|---|---|
| Event Generation | Instrumentation captures raw events | Missing events, wrong timestamps | Instrumentation audits, test events |
| Data Collection | Events transmitted to collection systems | Loss in transit, sampling bias | Collection rate monitoring, trace audits |
| Storage | Events persisted for analysis | Retention gaps, storage failures | Storage health checks, reconciliation |
| Aggregation | Raw events rolled up into metrics | Aggregation errors, timing windows | Cross-source verification, raw vs. aggregate comparison |
| Computation | SLI calculated from aggregated data | Formula errors, edge cases | Unit tests, known-answer tests |
| Presentation | SLI displayed to stakeholders | Visualization errors, stale data | Manual verification, cache invalidation |
Stage 3: Storage
Collected data is stored for analysis:
Potential issues:
Stage 4: Aggregation
Raw events are aggregated into metrics:
Potential issues:
Stage 5: Computation
The final SLI value is calculated:
Potential issues:
Stage 6: Presentation
SLI is displayed to humans or triggers automated actions:
Potential issues:
Every stage in the measurement pipeline is a potential point of failure. The more stages, the more opportunities for errors to accumulate. Regularly validate that end-to-end SLI values reflect ground truth, not just that each stage appears healthy individually.
Instrumentation is the foundation of SLI measurement. The code that captures events determines what data is available for all downstream processing.
Instrumentation Placement
Server-side instrumentation:
Client-side instrumentation (Real User Monitoring):
Synthetic instrumentation:
Infrastructure instrumentation:
```typescript
// Properly instrumented service layer for SLI measurement
import { Histogram, Counter } from './metrics-client';
// Assumes an Express-style HTTP framework for the middleware types
import { Request, Response, NextFunction } from 'express';

// Define metrics with appropriate buckets and labels
const requestLatency = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  labelNames: ['method', 'route', 'status_code', 'outcome'],
});

const requestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code', 'outcome'],
});

// Middleware for automatic request instrumentation
export function instrumentRequest(req: Request, res: Response, next: NextFunction) {
  const startTime = process.hrtime.bigint();
  const route = extractRoutePattern(req); // Use pattern, not path with IDs

  // Capture response details after completion
  res.on('finish', () => {
    const endTime = process.hrtime.bigint();
    const durationSeconds = Number(endTime - startTime) / 1e9;

    // Determine outcome based on status code and response
    const outcome = determineOutcome(res.statusCode, res.get('X-Error-Code'));

    // Record histogram observation
    requestLatency.observe(
      {
        method: req.method,
        route: route,
        status_code: String(res.statusCode),
        outcome: outcome,
      },
      durationSeconds
    );

    // Increment counter
    requestTotal.inc({
      method: req.method,
      route: route,
      status_code: String(res.statusCode),
      outcome: outcome,
    });
  });

  // Handle cases where response never finishes (connection dropped)
  req.on('close', () => {
    if (!res.finished) {
      // Client disconnected before response completed
      const durationSeconds = Number(process.hrtime.bigint() - startTime) / 1e9;
      requestLatency.observe(
        {
          method: req.method,
          route: route,
          status_code: 'client_closed',
          outcome: 'error',
        },
        durationSeconds
      );
      requestTotal.inc({
        method: req.method,
        route: route,
        status_code: 'client_closed',
        outcome: 'error',
      });
    }
  });

  next();
}

// Route pattern extraction - avoid high cardinality
function extractRoutePattern(req: Request): string {
  // Convert /users/12345 to /users/{id}
  // This keeps metric cardinality bounded
  const match = req.route?.path;
  if (match) return match;

  // Fallback: try to infer pattern
  return req.path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi, '/{uuid}')
    .replace(/\/\d+/g, '/{id}');
}

// Outcome determination for SLI classification
function determineOutcome(statusCode: number, errorCode?: string): string {
  if (statusCode >= 500) return 'error';
  if (statusCode === 429) return 'rate_limited';
  if (statusCode >= 400) return 'client_error';
  if (errorCode) return 'semantic_error';
  return 'success';
}
```

Instrumentation Best Practices
1. Use semantic labels, not raw values:
Labels like route="/users/{id}" keep cardinality bounded. Labels like route="/users/12345" explode cardinality.
2. Capture all request outcomes: including requests that were aborted, timed out, or hit other edge cases. If a request occurs but doesn't complete normally, it should still be instrumented.
3. Use monotonic clocks for durations:
System clocks can jump (NTP adjustments), so wall-clock deltas can be wrong or even negative. Use monotonic timers like process.hrtime() or time.monotonic() for duration measurement (see the sketch after this list).
4. Attach correlation context: Link metrics to traces and logs via correlation IDs. When SLI breaches occur, you need to drill into specific requests.
5. Version your instrumentation: When changing instrumentation, you may break historical comparisons. Version metrics or document changes.
6. Test instrumentation explicitly: Unit test that expected metrics are emitted for various scenarios. Integration test that metrics appear in your monitoring system.
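To illustrate practice 3, here is a minimal Python sketch (framework-agnostic; the `operation` callable stands in for your real request handling) contrasting monotonic and wall-clock timing:

```python
import time


def measure_duration_monotonic(operation) -> float:
    """Measure how long `operation` takes using a monotonic clock.

    time.monotonic() is unaffected by NTP adjustments or manual clock
    changes, so the measured duration can never come out negative.
    """
    start = time.monotonic()
    operation()
    return time.monotonic() - start


def measure_duration_wall_clock(operation) -> float:
    """Anti-pattern: wall-clock timing can jump mid-measurement."""
    start = time.time()  # Subject to NTP steps and manual clock changes
    operation()
    return time.time() - start  # May be negative if the clock stepped backward
```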
High-cardinality labels (user IDs, request IDs, full URLs) rapidly exhaust time-series database capacity and budget. Design labels for SLI-relevant segmentation (route pattern, outcome type, region), not debugging granularity. Use tracing for per-request debugging.
At scale, capturing 100% of events may be impractical or cost-prohibitive. Sampling trades completeness for efficiency—but it must be done carefully to avoid biasing your SLIs.
When Sampling Is Appropriate
When Sampling Is Dangerous
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Head-based random | Sample decision at request start | Simple, consistent traces | Rare slow or error requests may be missed at low sample rates |
| Tail-based | Sample decision after request completes | Can select errors/slow requests | Complex, requires buffering all requests |
| Rate limiting | Cap N samples per time window | Controlled volume | Busy periods under-sampled relative to quiet |
| Priority-based | Higher rate for important contexts | Captures critical cases | Complex to configure priorities |
| Error-biased | Always sample errors, sample successes | Errors never missed | May bias error rate measurement upward |
Accurate SLI Calculation With Sampling
For SLI calculation, you generally need accurate counts, not sampled estimates. The standard approach:
Metrics (counters, histograms): Capture 100%. These are cheap—incrementing a counter costs almost nothing.
Traces and detailed data: Sample. These are expensive—full context per request adds up.
By separating concerns, you get accurate SLIs from counters and rich debugging data from sampled traces.
If you must sample for SLI calculation, you need statistical techniques:
# Unbiased error rate estimation from sampled data
sampled_errors = 500
sampled_total = 10000
sampling_rate = 0.01 # 1% sample
# Estimated actual counts
estimated_total_errors = sampled_errors / sampling_rate # 50,000
estimated_total_requests = sampled_total / sampling_rate # 1,000,000
# Error rate (same whether sampled or not, if sampling is unbiased)
error_rate = sampled_errors / sampled_total # 5%
# Confidence interval (wider with lower sample size)
# Use Wilson score interval or similar
Critical requirement: Sampling must be uniform and unbiased with respect to the SLI outcome. If slow requests are more likely to be sampled (or less likely), your latency SLI is biased.
A common pattern is to always sample errors and randomly sample successes. This is great for debugging but terrible for SLI calculation. If you sample 100% of errors and 1% of successes, naive error rate calculation is completely wrong. Either calculate SLIs from non-sampled counters, or apply proper statistical correction.
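If you do have to estimate an error rate from differentially sampled data, weight each class by the inverse of its sampling rate. A minimal sketch with illustrative counts:

```python
# Differential sampling: 100% of errors kept, 1% of successes kept.
ERROR_SAMPLE_RATE = 1.0
SUCCESS_SAMPLE_RATE = 0.01

sampled_errors = 500        # every error was kept
sampled_successes = 9_500   # only 1% of successes were kept

# Naive calculation treats the sample as if it were the full population.
naive_error_rate = sampled_errors / (sampled_errors + sampled_successes)
# ≈ 5% — wildly overstated.

# Correct calculation scales each class by 1 / sampling_rate.
estimated_errors = sampled_errors / ERROR_SAMPLE_RATE          # 500
estimated_successes = sampled_successes / SUCCESS_SAMPLE_RATE  # 950,000
corrected_error_rate = estimated_errors / (estimated_errors + estimated_successes)
# ≈ 0.05% — the rate users actually experienced.

print(f"Naive: {naive_error_rate:.2%}, corrected: {corrected_error_rate:.2%}")
```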
Raw events must be aggregated into SLI values. How you aggregate determines what the SLI reflects.
Time Window Aggregation
SLIs are computed over time windows. Window selection matters:
Rolling windows: "Last 5 minutes," "Last 30 days"
Fixed/calendar windows: "Monday 00:00-23:59," "January 2024"
Tumbling windows: fixed-size, non-overlapping windows that advance in discrete steps (e.g., consecutive 5-minute buckets)
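To make the window types concrete, here is a minimal sketch (the per-minute bucket format is illustrative, not a specific monitoring API) of a rolling-window versus calendar-window availability SLI:

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each entry: (minute_timestamp, good_events, total_events)
MinuteBucket = Tuple[datetime, int, int]


def rolling_window_sli(buckets: List[MinuteBucket], now: datetime,
                       window: timedelta = timedelta(days=30)) -> float:
    """Availability over the trailing window ending at `now` (e.g. 'last 30 days')."""
    start = now - window
    good = sum(g for ts, g, t in buckets if start <= ts <= now)
    total = sum(t for ts, g, t in buckets if start <= ts <= now)
    return good / total if total else 1.0  # No traffic: treat as compliant


def calendar_window_sli(buckets: List[MinuteBucket], year: int, month: int) -> float:
    """Availability over a fixed calendar month (e.g. 'January 2024')."""
    good = sum(g for ts, g, t in buckets if ts.year == year and ts.month == month)
    total = sum(t for ts, g, t in buckets if ts.year == year and ts.month == month)
    return good / total if total else 1.0
```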
Cross-Window Percentile Aggregation
The most common aggregation mistake: averaging percentiles.
Wrong approach:
Hour 1 p95: 100ms
Hour 2 p95: 200ms
Daily p95: (100 + 200) / 2 = 150ms # WRONG!
Why it's wrong: Percentiles are not additive. The true daily p95 depends on the full distribution of both hours, not their individual p95 values.
```python
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HistogramBucket:
    """A bucket in a histogram with upper bound and count."""
    upper_bound: float
    count: int


class HistogramAggregator:
    """
    Aggregates latency histograms correctly for SLI computation.

    Histograms can be merged (summing bucket counts) to compute
    accurate percentiles over longer periods.
    """

    def __init__(self, bucket_bounds: List[float]):
        """
        Initialize with bucket boundaries.
        Example: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, float('inf')]
        """
        self.bucket_bounds = sorted(bucket_bounds)
        if self.bucket_bounds[-1] != float('inf'):
            self.bucket_bounds.append(float('inf'))
        self.bucket_counts = [0] * len(self.bucket_bounds)
        self.total_count = 0
        self.total_sum = 0.0

    def observe(self, value: float) -> None:
        """Record a single observation."""
        for i, bound in enumerate(self.bucket_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
                break
        self.total_count += 1
        self.total_sum += value

    def merge(self, other: 'HistogramAggregator') -> 'HistogramAggregator':
        """
        Merge two histograms. This is the correct way to aggregate
        histograms across time windows or servers.
        """
        if self.bucket_bounds != other.bucket_bounds:
            raise ValueError("Cannot merge histograms with different bucket bounds")

        result = HistogramAggregator(self.bucket_bounds[:-1])  # Exclude inf
        result.bucket_counts = [
            a + b for a, b in zip(self.bucket_counts, other.bucket_counts)
        ]
        result.total_count = self.total_count + other.total_count
        result.total_sum = self.total_sum + other.total_sum
        return result

    def percentile(self, p: float) -> float:
        """
        Estimate the p-th percentile from the histogram.
        This is an approximation since we only have bucketed data.
        """
        if p < 0 or p > 100:
            raise ValueError("Percentile must be between 0 and 100")
        if self.total_count == 0:
            return 0.0

        target_count = (p / 100.0) * self.total_count
        cumulative = 0

        for i, count in enumerate(self.bucket_counts):
            if cumulative + count >= target_count:
                # Linear interpolation within bucket
                lower_bound = self.bucket_bounds[i - 1] if i > 0 else 0
                upper_bound = self.bucket_bounds[i]
                if upper_bound == float('inf'):
                    return lower_bound  # Can't interpolate into infinity

                # Fraction into this bucket
                fraction = (target_count - cumulative) / count if count > 0 else 0
                return lower_bound + fraction * (upper_bound - lower_bound)
            cumulative += count

        return self.bucket_bounds[-2]  # Last finite bucket


# Usage example: Aggregate hourly histograms into daily
hourly_histograms: List[HistogramAggregator] = load_hourly_histograms()

# Correct: Merge histograms, then compute percentile
daily_histogram = hourly_histograms[0]
for h in hourly_histograms[1:]:
    daily_histogram = daily_histogram.merge(h)

accurate_daily_p95 = daily_histogram.percentile(95)
print(f"Accurate daily p95: {accurate_daily_p95:.2f}ms")

# Wrong: Average of hourly p95s
wrong_daily_p95 = np.mean([h.percentile(95) for h in hourly_histograms])
print(f"Wrong daily p95 (averaged): {wrong_daily_p95:.2f}ms")
```

Spatial Aggregation
Beyond time, you may aggregate across dimensions:
Aggregate across servers: Sum counts and merge histograms from all instances
Aggregate across regions: Combine data from multiple data centers
Aggregate across versions: Combine data from canary and stable deployments
The same rules apply: sum counts, merge histograms, then compute percentiles—never average percentiles.
Weighted Aggregation
Sometimes different segments should contribute differently:
Traffic-weighted: High-traffic endpoints contribute more to the aggregate SLI
Business-weighted: Critical endpoints weighted higher regardless of traffic
Weighted SLI = Σ(segment_SLI × segment_weight) / Σ(segment_weight)
This works for averages and success rates. For percentiles, weight by sample count (merge histograms proportionally).
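A minimal sketch of that formula in code (the segment names and weights are illustrative):

```python
from typing import Dict


def weighted_sli(segment_slis: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted SLI = Σ(segment_SLI × segment_weight) / Σ(segment_weight).

    Works for ratio-style SLIs (availability, success rate). For latency
    percentiles, merge the underlying histograms instead.
    """
    total_weight = sum(weights[s] for s in segment_slis)
    if total_weight == 0:
        raise ValueError("Weights must not sum to zero")
    return sum(segment_slis[s] * weights[s] for s in segment_slis) / total_weight


# Business-weighted example: checkout matters more than search, despite less traffic.
slis = {"checkout": 0.995, "search": 0.999, "browse": 0.9992}
weights = {"checkout": 5.0, "search": 2.0, "browse": 1.0}
print(f"Weighted availability SLI: {weighted_sli(slis, weights):.4f}")
```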
Modern monitoring systems (Prometheus, OpenTelemetry) have native histogram support that handles aggregation correctly. Use these built-in primitives rather than implementing your own. They handle edge cases and cross-instance aggregation that are easy to get wrong manually.
How do you know your SLI measurements are correct? Measurement systems can silently fail or drift. Validation strategies create confidence.
Validation Strategy 1: Known-Answer Tests
Inject test events with known characteristics and verify they appear correctly in SLI calculations:
This validates the complete pipeline from instrumentation to presentation.
Validation Strategy 2: Cross-Source Verification
Compute the same SLI from multiple independent data sources:
Discrepancies indicate problems in one or more measurement pipelines.
```python
import logging
import time
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List


def generate_unique_id() -> str:
    """Unique identifier for injected test requests."""
    return uuid.uuid4().hex


@dataclass
class SLIValue:
    value: float
    source: str
    window_start: str
    window_end: str
    sample_count: int


class SLIValidator:
    """
    Validates SLI measurements across multiple sources and over time.
    """

    def __init__(
        self,
        tolerance_percent: float = 5.0,
        minimum_sample_count: int = 100
    ):
        self.tolerance_percent = tolerance_percent
        self.minimum_sample_count = minimum_sample_count
        self.logger = logging.getLogger(__name__)

    def cross_source_validation(
        self,
        sli_values: List[SLIValue]
    ) -> Dict[str, Any]:
        """
        Compare SLI values from multiple sources for the same period.
        Flags discrepancies that exceed tolerance.
        """
        if len(sli_values) < 2:
            return {"status": "insufficient_sources", "sources": len(sli_values)}

        # Check sample counts
        sufficient_data = [
            v for v in sli_values
            if v.sample_count >= self.minimum_sample_count
        ]

        if len(sufficient_data) < 2:
            return {
                "status": "insufficient_samples",
                "sources_with_data": len(sufficient_data)
            }

        # Calculate variance
        values = [v.value for v in sufficient_data]
        mean_value = sum(values) / len(values)
        max_deviation = max(abs(v - mean_value) for v in values)
        deviation_percent = (max_deviation / mean_value * 100) if mean_value > 0 else 0

        # Find discrepant pairs
        discrepancies = []
        for i, v1 in enumerate(sufficient_data):
            for v2 in sufficient_data[i + 1:]:
                diff_percent = abs(v1.value - v2.value) / max(v1.value, v2.value) * 100
                if diff_percent > self.tolerance_percent:
                    discrepancies.append({
                        "source_a": v1.source,
                        "value_a": v1.value,
                        "source_b": v2.source,
                        "value_b": v2.value,
                        "difference_percent": round(diff_percent, 2)
                    })

        status = "pass" if not discrepancies else "discrepancy_detected"

        return {
            "status": status,
            "mean_value": round(mean_value, 4),
            "max_deviation_percent": round(deviation_percent, 2),
            "sources_checked": len(sufficient_data),
            "discrepancies": discrepancies
        }

    def temporal_consistency_check(
        self,
        historical_slis: List[SLIValue],
        current_sli: SLIValue,
        anomaly_threshold_std: float = 3.0
    ) -> Dict[str, Any]:
        """
        Check if current SLI value is consistent with historical pattern.
        Flags sudden changes that might indicate measurement issues.
        """
        if len(historical_slis) < 10:
            return {"status": "insufficient_history"}

        historical_values = [v.value for v in historical_slis[-30:]]  # Last 30 periods
        mean = sum(historical_values) / len(historical_values)
        variance = sum((v - mean) ** 2 for v in historical_values) / len(historical_values)
        std_dev = variance ** 0.5

        if std_dev == 0:
            # All historical values identical - any change is notable
            is_anomaly = current_sli.value != mean
        else:
            z_score = (current_sli.value - mean) / std_dev
            is_anomaly = abs(z_score) > anomaly_threshold_std

        return {
            "status": "anomaly_detected" if is_anomaly else "consistent",
            "current_value": current_sli.value,
            "historical_mean": round(mean, 4),
            "historical_std_dev": round(std_dev, 4),
            "z_score": round((current_sli.value - mean) / std_dev, 2) if std_dev > 0 else None,
            "note": "Verify measurement pipeline if anomaly detected"
        }

    def injection_test(
        self,
        inject_request_func,
        verify_metric_func,
        test_latency_ms: float = 100.0
    ) -> Dict[str, Any]:
        """
        Inject a known test request and verify it appears in metrics.
        """
        test_id = generate_unique_id()

        # Inject test request with known characteristics
        inject_start = time.time()
        inject_request_func(
            test_id=test_id,
            artificial_latency_ms=test_latency_ms,
            mark_as_test=True
        )
        inject_end = time.time()

        # Wait for metrics pipeline to process
        time.sleep(5)  # Allow time for propagation

        # Verify the request appears in metrics
        verification = verify_metric_func(
            test_id=test_id,
            expected_latency_ms=test_latency_ms,
            time_range=(inject_start, inject_end)
        )

        return {
            "status": "pass" if verification.found else "fail",
            "test_id": test_id,
            "expected_latency_ms": test_latency_ms,
            "found_in_metrics": verification.found,
            "measured_latency_ms": verification.measured_latency,
            "latency_error_ms": abs(verification.measured_latency - test_latency_ms) if verification.found else None
        }
```

Validation Strategy 3: Sanity Bounds
Set bounds on SLI values that, if exceeded, indicate measurement problems rather than real issues:
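For example, a sanity check might flag values that are impossible or implausible rather than merely bad (the thresholds here are illustrative and should be tuned per service):

```python
from typing import List


def sanity_check_sli(availability: float, total_requests: int,
                     expected_min_requests: int = 1_000) -> List[str]:
    """Return reasons the measurement itself looks suspect (not the service)."""
    problems = []
    if not 0.0 <= availability <= 1.0:
        problems.append("availability outside [0, 1]: formula or units are wrong")
    if total_requests == 0:
        problems.append("zero requests in window: likely a collection outage")
    elif total_requests < expected_min_requests:
        problems.append("request count far below normal: possible data loss")
    if availability == 1.0 and total_requests > expected_min_requests:
        problems.append("perfect availability at scale: errors may not be instrumented")
    return problems
```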
Validation Strategy 4: Consistency with Business Metrics
SLI values should correlate with business metrics:
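One lightweight sketch of such a check, assuming you can export both series at the same time granularity (the thresholds and the orders-per-minute metric are illustrative):

```python
from typing import List


def business_consistency_flags(sli_values: List[float],
                               orders_per_min: List[float],
                               sli_bad_threshold: float = 0.99,
                               orders_drop_ratio: float = 0.5) -> List[int]:
    """Flag time buckets where the SLI and a business metric disagree.

    - SLI bad but orders normal: the SLI may be over-counting failures.
    - SLI good but orders collapsed: the SLI may be blind to a real outage.
    Returns indexes of suspicious buckets for manual review.
    """
    baseline = sorted(orders_per_min)[len(orders_per_min) // 2]  # median as baseline
    suspicious = []
    for i, (sli, orders) in enumerate(zip(sli_values, orders_per_min)):
        sli_bad = sli < sli_bad_threshold
        orders_bad = orders < baseline * orders_drop_ratio
        if sli_bad != orders_bad:
            suspicious.append(i)
    return suspicious
```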
Validation Strategy 5: Manual Spot Checks
Periodically validate manually:
Your SLI calculation pipeline is itself a service that can have availability issues, latency problems, and errors. Consider meta-SLIs: What's the availability of your metrics pipeline? What's the latency of SLI calculation? How often are SLI values missing or delayed? These meta-metrics protect against silent measurement failures.
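One such meta-SLI, sketched minimally (the timestamp inputs are assumed to come from your metrics store), measures whether SLI datapoints arrive on schedule:

```python
from typing import List


def sli_pipeline_availability(expected_timestamps: List[float],
                              received_timestamps: List[float],
                              tolerance_seconds: float = 60.0) -> float:
    """Meta-SLI: fraction of expected SLI datapoints that actually arrived on time.

    expected_timestamps: when an SLI value should have been computed (e.g. every minute).
    received_timestamps: when values actually landed in the metrics store.
    """
    received = sorted(received_timestamps)
    on_time = 0
    for expected in expected_timestamps:
        if any(abs(r - expected) <= tolerance_seconds for r in received):
            on_time += 1
    return on_time / len(expected_timestamps) if expected_timestamps else 1.0
```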
Knowing common failure modes helps you design robust measurement systems and quickly diagnose problems when SLI values seem wrong.
Failure 1: Survivorship Bias
What happens: Only successful requests are instrumented. Failed requests (that crashed before logging, timed out before completion) are invisible.
Symptom: SLI looks better than user experience. Users complain but metrics are green.
Mitigation: Instrument at request entry and exit. Use middleware that guarantees logging regardless of request outcome. Count timeouts explicitly from client perspective.
Failure 2: Clock Synchronization Issues
What happens: Distributed systems have clock skew. A request that started on Server A at 12:00:00 might be logged on Server B at 11:59:58.
Symptom: Negative latencies, events appearing out of order, aggregation window misalignment.
Mitigation: Use NTP synchronization with monitoring. Measure durations on single hosts. Use monotonic clocks for timing. Include clock offset estimates in data.
Failure 3: Aggregation Computation Errors
What happens: Incorrect aggregation logic (averaging percentiles, wrong window alignment, off-by-one errors in bucket boundaries).
Symptom: SLI values don't make sense (p50 > p95), or change unexpectedly when aggregation code changes.
Mitigation: Unit test aggregation code thoroughly. Use well-tested libraries. Cross-validate with alternative calculation methods.
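As an example, a known-answer test against the HistogramAggregator sketch above (pytest-style, assuming that class is importable) checks that merging produces percentiles consistent with the raw data:

```python
import math


def test_merged_percentile_matches_raw_data():
    """Known-answer test: merged-histogram p95 stays close to the raw-data p95."""
    bounds = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
    hour1 = HistogramAggregator(bounds)
    hour2 = HistogramAggregator(bounds)

    raw = []
    for value in [0.04, 0.06, 0.09, 0.2, 0.3]:
        hour1.observe(value)
        raw.append(value)
    for value in [0.4, 0.6, 0.8, 1.5, 2.0]:
        hour2.observe(value)
        raw.append(value)

    merged = hour1.merge(hour2)

    # Nearest-rank p95 of the raw values (the "known answer")
    raw_p95 = sorted(raw)[math.ceil(0.95 * len(raw)) - 1]

    # Bucketed percentiles are approximate: allow one bucket width (1.0 -> 2.5)
    assert abs(merged.percentile(95) - raw_p95) <= 1.5
    # Percentiles must be monotone: p50 can never exceed p95
    assert merged.percentile(50) <= merged.percentile(95)
```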
Failure 4: Sampling Bias
What happens: Non-uniform sampling preferentially captures or misses certain request types.
Symptom: SLI differs significantly from RUM data or user perception. Rare events (errors, slow requests) under-represented.
Mitigation: Use unbiased sampling. For tracing, consider tail-based sampling that captures interesting requests. Compute SLIs from 100% metric data, not sampled traces.
Failure 5: Definition Drift
What happens: SLI definition on paper doesn't match implementation. Code changes over time but SLI spec isn't updated.
Symptom: SLI values seem reasonable but don't reflect stakeholder understanding of what's measured.
Mitigation: Treat SLI definition as code. Version control it. Review changes. Periodically audit implementation against specification.
Failure 6: Scope Creep
What happens: SLI scope expands to include traffic types it wasn't designed for (internal tools, synthetic monitoring, bot traffic).
Symptom: SLI becomes polluted, no longer reflects target user population.
Mitigation: Explicit traffic filtering. Separate SLIs for different traffic types. Regular audits of what's in the denominator.
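A minimal sketch of explicit traffic filtering at instrumentation time (the header names, network prefixes, and bot heuristic are all illustrative):

```python
from typing import Mapping

INTERNAL_NETWORK_PREFIXES = ("10.", "192.168.")  # illustrative internal ranges
BOT_USER_AGENT_MARKERS = ("bot", "crawler", "monitoring-probe")


def counts_toward_sli(headers: Mapping[str, str], client_ip: str) -> bool:
    """Decide whether a request belongs in the SLI denominator.

    Synthetic probes, internal tooling, and bot traffic are excluded so the
    SLI keeps reflecting the real user population it was designed for.
    """
    if headers.get("X-Synthetic-Test") == "true":
        return False
    if client_ip.startswith(INTERNAL_NETWORK_PREFIXES):
        return False
    user_agent = headers.get("User-Agent", "").lower()
    if any(marker in user_agent for marker in BOT_USER_AGENT_MARKERS):
        return False
    return True
```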
Based on the principles and pitfalls discussed, here's a framework for designing reliable SLI measurement infrastructure.
Layer 1: Instrumentation
Layer 2: Collection
Layer 3: Storage
```yaml
# SLI Measurement Architecture Reference
measurement_architecture:
  instrumentation_layer:
    standards:
      - "OpenTelemetry for tracing"
      - "Prometheus client for metrics"
      - "Structured JSON logging"
    requirements:
      - "All services MUST use approved instrumentation libraries"
      - "Custom instrumentation requires SRE review"
      - "Instrumentation coverage verified in CI pipeline"
    anti_patterns:
      - "Direct stdout/stderr for SLI-relevant data"
      - "Custom serialization formats"
      - "Unbounded cardinality labels"

  collection_layer:
    primary: "Prometheus federated pull model"
    secondary: "Load balancer log shipping (backup data source)"
    rum: "Client SDK → Edge ingestion → Stream processing"
    reliability:
      - "Metrics scrape success rate SLI > 99.9%"
      - "Collection latency p99 < 2 minutes"
      - "Automatic retry on transient failures"
    monitoring:
      - "Scrape target health dashboard"
      - "Alert on target down > 5 minutes"
      - "Alert on sudden metric count drop"

  storage_layer:
    time_series_database: "Prometheus (short-term) + Thanos (long-term)"
    retention:
      high_resolution: "15 days (15s intervals)"
      medium_resolution: "90 days (1m intervals)"
      low_resolution: "13 months (5m intervals)"
    high_availability:
      - "Multi-replica ingestion"
      - "Cross-region replication for disaster recovery"
    monitoring:
      - "Storage capacity forecasting"
      - "Query latency tracking"
      - "Ingestion success rate"

  computation_layer:
    sli_calculator:
      language: "PromQL / SQL depending on source"
      execution: "Every 1 minute for real-time, daily for reports"
      output: "SLI values written to dedicated metrics"
    validation:
      - "Known-answer tests run hourly"
      - "Cross-source comparison daily"
      - "Sanity bounds checked on every computation"
    error_handling:
      - "Missing data points → flag as incomplete, don't extrapolate"
      - "Computation failure → alert, no stale data displayed"

  presentation_layer:
    dashboards:
      - "Real-time SLI status (1-minute refresh)"
      - "SLO compliance tracking (1-hour refresh)"
      - "Historical trends (daily refresh)"
    alerts:
      - "Routed to PagerDuty for critical"
      - "Routed to Slack for warning"
    reports:
      - "Weekly SLO summary email"
      - "Monthly executive report"
```

Layer 4: Computation
Layer 5: Presentation
Operational Considerations
Consider setting SLOs for your measurement system itself: metrics collection availability, SLI computation latency, dashboard uptime. If you can't trust that your measurements are current and correct, you can't trust the SLIs they produce.
Accurate SLI measurement is the foundation upon which reliable SLO practice is built. Without trustworthy data, SLOs are theater. Let's consolidate the key learnings:
Module Complete
Congratulations! You've completed the Choosing SLIs module. You now understand how to:
With these foundations, you're ready to move on to Setting SLOs—translating SLI measurements into targets that balance reliability investment with feature velocity.
You've mastered the art and science of choosing and measuring SLIs. You can design SLIs that reflect user experience, avoid common measurement pitfalls, build robust measurement infrastructure, and validate that your data reflects reality. These skills form the foundation for effective SLO practice and reliable service operation.