In the world of distributed systems, opinions about reliability are worthless without data. Engineers often argue about whether a service is 'fast enough' or 'reliable enough,' but these debates remain subjective until we establish objective, quantitative measurements. This is precisely where Service Level Indicators (SLIs) transform reliability from an art into a science.
Service Level Indicators are the quantitative measures that objectively capture the health and behavior of your service from the user's perspective. They are not arbitrary metrics chosen for convenience—they are carefully selected measurements that directly correlate with user happiness and business outcomes.
By the end of this page, you will understand what SLIs are, why they matter profoundly for reliability engineering, and how to select and implement SLIs that truly represent your users' experience. You'll learn to distinguish good SLIs from vanity metrics, and master the art of measurement that drives all subsequent reliability decisions.
A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service being provided. In simpler terms, it's a metric that tells you how well your service is performing from the perspective of its users or consumers.
The key word here is quantitative: an SLI must be objectively measurable and consistently collectable, not a matter of opinion.
An SLI is typically expressed as a ratio:
SLI = (Good events / Total events) × 100%
For example, an availability SLI might be:
(Successful requests / Total requests) × 100%
This ratio format is crucial—it normalizes the metric regardless of traffic volume and makes it directly comparable across time periods.
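As a minimal illustration, here is the ratio form in Python (the zero-traffic convention below is an assumption, not a standard):

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage of good events over total events."""
    if total_events == 0:
        return 100.0  # assumed convention: no traffic means no bad events
    return 100.0 * good_events / total_events

# Availability example: 99,950 successful requests out of 100,000 total
print(sli(99_950, 100_000))  # 99.95
```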
Not every metric qualifies as an SLI. Consider the difference:
Metrics that only describe internal machinery (CPU, disk I/O) make poor SLIs. Metrics that capture what users actually experience (success rate, latency, freshness) make good ones. The table below contrasts the two:
| Metric Type | Example | User Impact | Good SLI? |
|---|---|---|---|
| Request success rate | 99.5% of requests return 2xx | Direct - users see errors | ✅ Excellent |
| P99 latency | 99% of requests complete in <200ms | Direct - users experience delays | ✅ Excellent |
| CPU utilization | Average 65% across fleet | Indirect - may not affect users | ❌ Poor |
| Disk I/O | 500 IOPS sustained | Indirect - internal concern | ❌ Poor |
| Data freshness | 99% of reads are <1 min stale | Direct - users see outdated data | ✅ Good |
| Error rate by type | 0.1% timeouts, 0.05% 500s | Direct - different failure modes | ✅ Good |
SLIs aren't just nice-to-have metrics—they are the foundation upon which all reliability engineering is built. Without well-defined SLIs, you cannot set meaningful objectives (SLOs), back agreements with customers (SLAs), prioritize reliability work against feature work, or even agree on whether the service is getting better or worse.
Teams that choose poor SLIs often find themselves in frustrating situations:
• Alerts fire but users are happy (false positives waste engineering time)
• Users complain but dashboards show green (false negatives destroy trust)
• Engineering debates become political rather than data-driven
• Reliability investments are seen as cost centers rather than strategic investments
The root cause is almost always measuring the wrong things—internal metrics instead of user-facing indicators.
The core philosophy behind SLIs is what Google's SRE team calls the user happiness hypothesis:
If you measure something that genuinely reflects user experience, and that measure is good, then your users are probably happy. If it's bad, they're probably unhappy.
This seems obvious, but it's surprisingly difficult to achieve in practice. Many organizations measure what's easy to collect (CPU, memory, disk) rather than what actually matters (success rate, latency, correctness).
The hierarchy of measurement quality runs from client-side measurement (what the user actually saw) at the top, through edge and load balancer metrics, then server-side application metrics, down to infrastructure metrics (CPU, memory, disk) at the bottom.
The closer you measure to the actual user experience, the more meaningful your SLI becomes.
While every service is unique, most services can be characterized by a common set of SLI categories. Google's SRE book popularized the Four Golden Signals (latency, traffic, errors, and saturation), which map closely onto the SLI categories covered below:

1. Availability (Success Rate)
2. Latency (Response Time)
3. Throughput (Request Rate)
4. Error Rate (Failure Classification)
For most request-driven services, you'll want SLIs covering at least the first two categories. Let's examine each in depth.
Not all services are request-driven. For pipeline/batch systems, you might measure:
• Freshness — How recently was the data processed?
• Throughput — How many records processed per hour?
• Correctness — What percentage of outputs are verified correct?
For storage systems:
• Durability — Probability that stored data can be retrieved
• Availability — Percentage of time the storage responds
• Latency — Time to read/write data
Definition: The proportion of requests that are served successfully.
Availability SLI = (Successful Requests / Total Requests) × 100%
What counts as 'successful'?
This is where precision matters. Consider these scenarios: a 404 returned because the resource genuinely does not exist, a 404 returned because a deployment broke the routing table, a 400 caused by a malformed client request, and a 500 from an unhandled exception. Only some of these represent a failure of your service.
Best Practice: Define success as 'the server processed the request correctly, regardless of the business outcome.' A 404 for a genuinely missing resource is successful processing. A 404 caused by a routing bug is not.
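To make the distinction concrete, here is a sketch of a good-event classifier; the server_fault flag is hypothetical and stands in for whatever signal lets you tell a legitimate 404 from one caused by a bug:

```python
def is_good_event(status_code: int, server_fault: bool = False) -> bool:
    """Classify a response for the availability SLI.

    The event is 'good' if the server processed the request correctly,
    regardless of business outcome. server_fault is a hypothetical flag
    set when a 4xx was actually caused by a server-side bug
    (e.g., a broken route table).
    """
    if 500 <= status_code < 600:
        return False   # server errors are always bad events
    if server_fault:
        return False   # e.g., a 404 produced by a routing bug
    return True        # 2xx, 3xx, and legitimate 4xx count as correct processing

# Toy sample: OK, legitimate 404, 404 from a routing bug, 503
responses = [(200, False), (404, False), (404, True), (503, False)]
good = sum(is_good_event(status, fault) for status, fault in responses)
availability_sli = 100.0 * good / len(responses)  # 50.0 for this 4-request sample
```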
Definition: The proportion of requests served faster than a threshold.
Latency SLI = (Requests faster than threshold / Total Requests) × 100%
Why percentiles, not averages?
Averages hide problems. Consider two scenarios:

• Scenario A: every request completes in roughly 110ms (average: 110ms)
• Scenario B: 90% of requests complete in 100ms, but 10% take 1,100ms (average: 200ms)

The averages (110ms vs 200ms) don't look alarming on their own, but Scenario B has 10% of users waiting 11x longer! Percentiles reveal this.
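A quick sketch using only the standard library reproduces these two scenarios and shows how the mean stays deceptively calm while the 99th percentile exposes the tail:

```python
import statistics

# Scenario A: every request takes about 110 ms
scenario_a = [110] * 100

# Scenario B: 90% of requests take 100 ms, 10% take 1,100 ms
scenario_b = [100] * 90 + [1100] * 10

for name, latencies in [("A", scenario_a), ("B", scenario_b)]:
    mean = statistics.mean(latencies)
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
    print(f"Scenario {name}: mean={mean:.0f} ms, p99={p99:.0f} ms")

# Scenario A: mean=110 ms, p99=110 ms
# Scenario B: mean=200 ms, p99=1100 ms  <- the tail the average hides
```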
Common latency percentiles:
| Percentile | What It Tells You | When to Use |
|---|---|---|
| P50 | Typical user experience | General performance baseline |
| P90 | Experience for 'slower' requests | Capacity planning |
| P99 | Worst case for most users | SLO targets for latency-sensitive services |
| P99.9 | Extreme tail latency | Enterprise customers, SLA enforcement |
| Max | Absolute worst case | Debugging, rarely for SLIs (too noisy) |
Definition: The rate at which the system processes requests.
Throughput SLI = Requests processed per second (or minute/hour)
When throughput matters as an SLI: batch and streaming pipelines with processing deadlines, messaging and ingestion systems where users care about sustained volume, and any service that commits to handling a given request rate.
Throughput vs. Capacity:
Throughput SLIs measure actual work done. Capacity is the theoretical maximum. Your SLI should reflect that you can sustain expected throughput, not just peak.
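One way to express this as a good-events ratio is to count the intervals in which the service sustained its expected rate; the sketch below uses hypothetical numbers:

```python
# Measured requests/second in consecutive 1-minute intervals (hypothetical)
observed_rps = [520, 540, 410, 530, 560, 380, 545]
expected_rps = 500  # the sustained rate the service commits to

good_intervals = sum(1 for rps in observed_rps if rps >= expected_rps)
throughput_sli = 100.0 * good_intervals / len(observed_rps)
print(f"{throughput_sli:.1f}% of intervals met the expected throughput")  # 71.4%
```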
Definition: The proportion of requests that fail, often broken down by error type.
Error Rate SLI = (Failed Requests / Total Requests) × 100%
Why classify errors?
Not all errors are created equal: a timeout behaves differently from a 500, a failure in a critical dependency is worse than one in an optional feature, and a client-caused 4xx is not a server fault at all.
Separate SLIs for error classes enable targeted responses.
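Here is a minimal sketch of per-class error tracking, using a hypothetical request log that mirrors the timeout and 500 rates from the table earlier:

```python
from collections import Counter

# Hypothetical request log: (status_code, timed_out)
requests = [(200, False)] * 9_985 + [(500, False)] * 5 + [(504, True)] * 10

error_classes = Counter()
for status, timed_out in requests:
    if timed_out:
        error_classes["timeout"] += 1
    elif 500 <= status < 600:
        error_classes["server_error"] += 1

total = len(requests)
for error_class, count in error_classes.items():
    print(f"{error_class}: {100.0 * count / total:.2f}% of requests")
# server_error: 0.05% of requests
# timeout: 0.10% of requests
```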
Selecting SLIs is both an art and a science. You need enough SLIs to capture user experience comprehensively, but not so many that they become unmanageable. A common mistake is defining too many SLIs initially. Start with 2-3 core SLIs:
• Availability: Are requests succeeding?
• Latency: Are they fast enough?
• Primary user journey: Is the key use case working?
Add more specific SLIs only when you have evidence that these three don't capture important failure modes.
Anti-Pattern 1: The Vanity SLI
Measuring something that always looks good but doesn't reflect reality.
Example: Measuring only 'server is up' (synthetic health check) while ignoring that 10% of real requests fail with timeouts.
Anti-Pattern 2: The Lagging SLI
Measuring outcomes that users discovered hours ago.
Example: Daily aggregated error counts. By the time you notice, thousands of users have already been impacted.
Anti-Pattern 3: The Too-Granular SLI
Measuring at such fine granularity that normal variance triggers constant investigation.
Example: Per-second error rate on low-traffic endpoints. A single failed request in a second that saw only one request shows a 100% error rate for that second.
Anti-Pattern 4: The Hidden Context SLI
Measuring without distinguishing fundamentally different request types.
Example: Combined latency for 'quick lookup' and 'complex report generation' requests, making the SLI uninterpretable.
Once you've selected your SLIs, implementation quality determines whether they're useful or misleading. Let's examine common implementation patterns and their tradeoffs.
How it works: Your application code emits metrics for every request—latency, status code, endpoint, etc.
Pros:
• Rich context: metrics can be labeled by endpoint, method, customer tier, or any business dimension
• Full control over what counts as a 'good' event
• Works for internal traffic that never passes through an edge proxy

Cons:
• Requires code changes in every service
• Blind to failures that never reach the application (connection errors, a crashed process, load balancer misrouting)
• Easy for teams to instrument inconsistently
Best for: Internal services, API backends, microservices
```python
from prometheus_client import Counter, Histogram
import time

# SLI: Request success rate
request_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_class']
)

# SLI: Request latency
request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

def handle_request(request):
    start_time = time.time()
    try:
        response = process_request(request)
        # Bucket by status class so the SLI query can treat 2xx as "good"
        status_class = f"{response.status // 100}xx"
        return response
    except Exception:
        status_class = '5xx'
        raise
    finally:
        # Always record metrics, even when the request failed
        duration = time.time() - start_time
        request_latency.labels(
            method=request.method,
            endpoint=request.path_template
        ).observe(duration)
        request_total.labels(
            method=request.method,
            endpoint=request.path_template,
            status_class=status_class
        ).inc()
```

How it works: Your load balancer or API gateway records metrics for every request passing through.
Pros:
• No application code changes required
• Consistent measurement across every service behind the same edge
• Still records failures when the application itself is down or unreachable

Cons:
• Limited business context: it sees paths and status codes, not user intent
• Cannot detect requests that return 200 but carry wrong or incomplete data
• Misses problems between the user and the edge (DNS, client networks, CDN)
Best for: Public-facing APIs, multi-service architectures
How it works: JavaScript or mobile SDK measures actual user experience in-browser/in-app.
Pros:
• Measures what users actually experienced, including network, CDN, and client rendering time
• Captures failures that never reach your servers at all

Cons:
• Noisy: client devices, browsers, and networks vary enormously
• Subject to sampling, ad blockers, and privacy constraints
• Cannot report on users whose page or app failed to load the measurement code in the first place
Best for: User-facing web applications, mobile apps
Production systems typically implement SLIs at multiple layers:
• Client-side RUM — Understanding true user experience
• Edge/Load Balancer — Primary SLI data source for availability and latency
• Application metrics — Detailed diagnostics and per-endpoint tracking
• Synthetic monitoring — Continuous validation of critical paths
Each layer has its purposes. The combination provides comprehensive visibility.
Raw SLI measurements occur at the individual request level, but SLIs are typically reported as aggregated values over time windows. The choice of aggregation method and window size profoundly affects the utility of your SLI.
For Availability/Success Rate SLIs:
SLI = (Sum of successful requests in window) / (Sum of total requests in window) × 100%
This is straightforward—sum numerator and denominator separately, then divide.
For Latency SLIs:
Two common approaches:
• Percentile-based: compute a percentile (e.g., P99) over the window and compare it to a threshold
• Threshold-based: count the proportion of requests in the window that completed faster than the threshold
The second approach (threshold-based) works better with SLO frameworks because it fits the 'good events / total events' model:
Latency SLI = (Requests < 200ms / Total requests) × 100%
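As a minimal sketch of the threshold-based approach (the 200ms threshold and the sample window below are assumptions for illustration):

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Threshold-based latency SLI: share of requests faster than threshold_ms."""
    if not latencies_ms:
        return 100.0  # assumption: an empty window counts as fully compliant
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * good / len(latencies_ms)

window = [120, 95, 180, 230, 410, 150, 90, 175, 205, 130]
print(latency_sli(window))  # 70.0 -- 7 of 10 requests beat the 200 ms threshold
```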
| Window Size | Pros | Cons | Use Cases |
|---|---|---|---|
| 1 minute | Fast detection of issues | Noisy, especially for low-traffic services | Alerting, real-time dashboards |
| 5 minutes | Good balance of speed and stability | Still somewhat noisy | Operational dashboards |
| 1 hour | Stable, less noise | Slow to detect issues | Reporting, trend analysis |
| 28/30 days | Matches SLO evaluation periods | Very slow-moving | SLO burn rate, error budgets |
Rolling Windows: recomputed continuously over the trailing period (the last 5 minutes, the last 30 days). Every point in time is judged against the same amount of history, so there are no boundary effects.

Fixed Windows: aligned to the calendar (this hour, this month, this quarter). Easy to communicate and to tie to contracts, but the metric resets at each boundary and a single incident can straddle two windows.
Recommendation: Use rolling windows for operational monitoring and SLO tracking. Use fixed windows for executive reporting and SLA compliance.
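A rough sketch of a rolling-window availability SLI is shown below; in practice the metrics backend (Prometheus, a time-series database) usually performs this aggregation, so the class and bucket size here are purely illustrative:

```python
from collections import deque

class RollingSLI:
    """Rolling-window availability SLI over the last `window_size` buckets.

    Each bucket holds (good, total) counts for one reporting interval,
    e.g., one minute.
    """

    def __init__(self, window_size: int = 5):
        self.buckets = deque(maxlen=window_size)  # old buckets drop off automatically

    def record_bucket(self, good: int, total: int) -> None:
        self.buckets.append((good, total))

    def current(self) -> float:
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return 100.0 * good / total if total else 100.0

sli = RollingSLI(window_size=5)
for good, total in [(995, 1000), (1000, 1000), (950, 1000), (998, 1000), (1000, 1000)]:
    sli.record_bucket(good, total)
print(f"{sli.current():.2f}%")  # 98.86% over the trailing 5 buckets
```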
Be careful when aggregating SLIs across multiple services or endpoints. If Service A handles 1M requests at 99.9% success and Service B handles 1K requests at 90% success:
• Naive average: (99.9 + 90) / 2 = 94.95% ← Misleading!
• Weighted average: (999,000 + 900) / (1,000,000 + 1,000) = 99.89% ← Correct
Always weight by traffic volume, and consider whether aggregation is even meaningful for your use case.
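A few lines of Python reproduce the numbers above and show why the weighting matters:

```python
services = [
    # (good_requests, total_requests)
    (999_000, 1_000_000),  # Service A: 99.9% success
    (900, 1_000),          # Service B: 90.0% success
]

naive = sum(100.0 * good / total for good, total in services) / len(services)
weighted = 100.0 * sum(g for g, _ in services) / sum(t for _, t in services)

print(f"naive average:    {naive:.2f}%")     # 94.95% -- misleading
print(f"weighted average: {weighted:.2f}%")  # 99.89% -- weighted by traffic
```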
Let's examine how different types of systems define their SLIs, drawing from real-world practices at scale.
Critical User Journeys for an e-commerce platform: product search, add to cart, and checkout.

SLI Definitions: availability and latency for each of those journeys, for example the proportion of search requests that return results successfully and the proportion of checkout attempts that complete without error.
Why these matter: A user who can't search, add to cart, or checkout generates zero revenue. Each SLI directly ties to revenue impact.
Critical User Journeys for a video streaming service: browsing the catalog, starting playback, and watching without interruption.

SLI Definitions: playback start success rate, time to start playback, and rebuffer ratio (the proportion of viewing time spent stalled).
Why these matter: User engagement is extremely sensitive to playback quality. A rebuffer drives users to competitors.
Critical User Journeys for a financial services platform: signing in, viewing balances, and initiating payments or transfers.

SLI Definitions: transaction success rate, payment authorization latency, and data correctness, for example the proportion of ledger entries that pass reconciliation.
Why these matter: Financial systems have regulatory requirements and zero tolerance for data inconsistency.
Each industry has unique SLI requirements:
• Healthcare: HIPAA compliance, audit logging success rate
• Gaming: Matchmaking latency, session persistence
• IoT: Device connectivity rate, command delivery success
• Advertising: Bid request latency (strict <100ms), impression delivery rate
Understand your industry's critical metrics before defining SLIs.
We've covered the foundation of reliability measurement. Let's consolidate the key takeaways:

• An SLI is a quantitative measure of service behavior, best expressed as good events divided by total events
• Measure from the user's perspective; internal metrics like CPU and disk I/O make poor SLIs
• Start with 2-3 core SLIs (availability, latency, the primary user journey) and expand only with evidence
• Prefer percentiles over averages for latency, and weight cross-service aggregations by traffic volume
• Choose measurement points (application, edge, client) deliberately; each layer sees different failures
What's next:
With SLIs defining what we measure, we need to establish targets for acceptable performance. The next page explores Service Level Objectives (SLOs)—how to set meaningful reliability targets that balance user expectations with engineering reality.
You now understand Service Level Indicators—the quantitative foundation of reliability engineering. SLIs transform 'is the service working?' from a subjective question into an objective, measurable, actionable metric. Next, we'll explore how to set targets for these indicators with Service Level Objectives.