In the world of distributed systems, opinions about reliability are worthless without data. Engineers often argue about whether a service is 'fast enough' or 'reliable enough,' but these debates remain subjective until we establish objective, quantitative measurements. This is precisely where Service Level Indicators (SLIs) transform reliability from an art into a science.
Service Level Indicators are the quantitative measures that objectively capture the health and behavior of your service from the user's perspective. They are not arbitrary metrics chosen for convenience—they are carefully selected measurements that directly correlate with user happiness and business outcomes.
By the end of this page, you will understand what SLIs are, why they matter profoundly for reliability engineering, and how to select and implement SLIs that truly represent your users' experience. You'll learn to distinguish good SLIs from vanity metrics, and master the art of measurement that drives all subsequent reliability decisions.
A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service being provided. In simpler terms, it's a metric that tells you how well your service is performing from the perspective of its users or consumers.
The key word here is quantitative: an SLI must be objectively measurable and consistently collectable, not a matter of opinion.
An SLI is typically expressed as a ratio:
SLI = (Good events / Total events) × 100%
For example, an availability SLI might be:
(Successful requests / Total requests) × 100%
This ratio format is crucial—it normalizes the metric regardless of traffic volume and makes it directly comparable across time periods.
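As a minimal illustration, here is the ratio form in Python (the zero-traffic convention below is an assumption, not a standard):

```python
def sli(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage of good events over total events."""
    if total_events == 0:
        return 100.0  # assumed convention: no traffic means no bad events
    return 100.0 * good_events / total_events

# Availability example: 99,950 successful requests out of 100,000 total
print(sli(99_950, 100_000))  # 99.95
```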
Not every metric qualifies as an SLI. Consider the difference:
Metrics that only describe internal machinery (CPU, disk I/O) make poor SLIs. Metrics that capture what users actually experience (success rate, latency, freshness) make good ones. The table below contrasts the two:
| Metric Type | Example | User Impact | Good SLI? |
|---|---|---|---|
| Request success rate | 99.5% of requests return 2xx | Direct - users see errors | ✅ Excellent |
| P99 latency | 99% of requests complete in <200ms | Direct - users experience delays | ✅ Excellent |
| CPU utilization | Average 65% across fleet | Indirect - may not affect users | ❌ Poor |
| Disk I/O | 500 IOPS sustained | Indirect - internal concern | ❌ Poor |
| Data freshness | 99% of reads are <1 min stale | Direct - users see outdated data | ✅ Good |
| Error rate by type | 0.1% timeouts, 0.05% 500s | Direct - different failure modes | ✅ Good |
SLIs aren't just nice-to-have metrics—they are the foundation upon which all reliability engineering is built. Without well-defined SLIs, you cannot set meaningful objectives (SLOs), back agreements with customers (SLAs), prioritize reliability work against feature work, or even agree on whether the service is getting better or worse.
Teams that choose poor SLIs often find themselves in frustrating situations:
• Alerts fire but users are happy (false positives waste engineering time)
• Users complain but dashboards show green (false negatives destroy trust)
• Engineering debates become political rather than data-driven
• Reliability investments are seen as cost centers rather than strategic investments
The root cause is almost always measuring the wrong things—internal metrics instead of user-facing indicators.
The core philosophy behind SLIs is what Google's SRE team calls the user happiness hypothesis:
If you measure something that genuinely reflects user experience, and that measure is good, then your users are probably happy. If it's bad, they're probably unhappy.
This seems obvious, but it's surprisingly difficult to achieve in practice. Many organizations measure what's easy to collect (CPU, memory, disk) rather than what actually matters (success rate, latency, correctness).
The hierarchy of measurement quality runs from client-side measurement (what the user actually saw) at the top, through edge and load balancer metrics, then server-side application metrics, down to infrastructure metrics (CPU, memory, disk) at the bottom.
The closer you measure to the actual user experience, the more meaningful your SLI becomes.
While every service is unique, most services can be characterized by a common set of SLI categories. Google's SRE book popularized the Four Golden Signals (latency, traffic, errors, and saturation), which map closely onto the SLI categories covered below:

1. Availability (Success Rate)
2. Latency (Response Time)
3. Throughput (Request Rate)
4. Error Rate (Failure Classification)
For most request-driven services, you'll want SLIs covering at least the first two categories. Let's examine each in depth.
Not all services are request-driven. For pipeline/batch systems, you might measure:
• Freshness — How recently was the data processed?
• Throughput — How many records processed per hour?
• Correctness — What percentage of outputs are verified correct?
For storage systems:
• Durability — Probability that stored data can be retrieved
• Availability — Percentage of time the storage responds
• Latency — Time to read/write data
Definition: The proportion of requests that are served successfully.
Availability SLI = (Successful Requests / Total Requests) × 100%
What counts as 'successful'?
This is where precision matters. Consider these scenarios: a 404 returned because the resource genuinely does not exist, a 404 returned because a deployment broke the routing table, a 400 caused by a malformed client request, and a 500 from an unhandled exception. Only some of these represent a failure of your service.
Best Practice: Define success as 'the server processed the request correctly, regardless of the business outcome.' A 404 for a genuinely missing resource is successful processing. A 404 caused by a routing bug is not.
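To make the distinction concrete, here is a sketch of a good-event classifier; the server_fault flag is hypothetical and stands in for whatever signal lets you tell a legitimate 404 from one caused by a bug:

```python
def is_good_event(status_code: int, server_fault: bool = False) -> bool:
    """Classify a response for the availability SLI.

    The event is 'good' if the server processed the request correctly,
    regardless of business outcome. server_fault is a hypothetical flag
    set when a 4xx was actually caused by a server-side bug
    (e.g., a broken route table).
    """
    if 500 <= status_code < 600:
        return False   # server errors are always bad events
    if server_fault:
        return False   # e.g., a 404 produced by a routing bug
    return True        # 2xx, 3xx, and legitimate 4xx count as correct processing

# Toy sample: OK, legitimate 404, 404 from a routing bug, 503
responses = [(200, False), (404, False), (404, True), (503, False)]
good = sum(is_good_event(status, fault) for status, fault in responses)
availability_sli = 100.0 * good / len(responses)  # 50.0 for this 4-request sample
```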
Definition: The proportion of requests served faster than a threshold.
Latency SLI = (Requests faster than threshold / Total Requests) × 100%
Why percentiles, not averages?
Averages hide problems. Consider two scenarios:

• Scenario A: every request completes in roughly 110ms (average: 110ms)
• Scenario B: 90% of requests complete in 100ms, but 10% take 1,100ms (average: 200ms)

The averages (110ms vs 200ms) don't look alarming on their own, but Scenario B has 10% of users waiting 11x longer! Percentiles reveal this.
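A quick sketch using only the standard library reproduces these two scenarios and shows how the mean stays deceptively calm while the 99th percentile exposes the tail:

```python
import statistics

# Scenario A: every request takes about 110 ms
scenario_a = [110] * 100

# Scenario B: 90% of requests take 100 ms, 10% take 1,100 ms
scenario_b = [100] * 90 + [1100] * 10

for name, latencies in [("A", scenario_a), ("B", scenario_b)]:
    mean = statistics.mean(latencies)
    p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
    print(f"Scenario {name}: mean={mean:.0f} ms, p99={p99:.0f} ms")

# Scenario A: mean=110 ms, p99=110 ms
# Scenario B: mean=200 ms, p99=1100 ms  <- the tail the average hides
```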
Common latency percentiles:
| Percentile | What It Tells You | When to Use |
|---|---|---|
| P50 | Typical user experience | General performance baseline |
| P90 | Experience for 'slower' requests | Capacity planning |
| P99 | Worst case for most users | SLO targets for latency-sensitive services |
| P99.9 | Extreme tail latency | Enterprise customers, SLA enforcement |
| Max | Absolute worst case | Debugging, rarely for SLIs (too noisy) |
Definition: The rate at which the system processes requests.
Throughput SLI = Requests processed per second (or minute/hour)
When throughput matters as an SLI: batch and streaming pipelines with processing deadlines, messaging and ingestion systems where users care about sustained volume, and any service that commits to handling a given request rate.
Throughput vs. Capacity:
Throughput SLIs measure actual work done. Capacity is the theoretical maximum. Your SLI should reflect that you can sustain expected throughput, not just peak.
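One way to express this as a good-events ratio is to count the intervals in which the service sustained its expected rate; the sketch below uses hypothetical numbers:

```python
# Measured requests/second in consecutive 1-minute intervals (hypothetical)
observed_rps = [520, 540, 410, 530, 560, 380, 545]
expected_rps = 500  # the sustained rate the service commits to

good_intervals = sum(1 for rps in observed_rps if rps >= expected_rps)
throughput_sli = 100.0 * good_intervals / len(observed_rps)
print(f"{throughput_sli:.1f}% of intervals met the expected throughput")  # 71.4%
```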
Definition: The proportion of requests that fail, often broken down by error type.
Error Rate SLI = (Failed Requests / Total Requests) × 100%
Why classify errors?
Not all errors are created equal: a timeout behaves differently from a 500, a failure in a critical dependency is worse than one in an optional feature, and a client-caused 4xx is not a server fault at all.
Separate SLIs for error classes enable targeted responses.
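Here is a minimal sketch of per-class error tracking, using a hypothetical request log that mirrors the timeout and 500 rates from the table earlier:

```python
from collections import Counter

# Hypothetical request log: (status_code, timed_out)
requests = [(200, False)] * 9_985 + [(500, False)] * 5 + [(504, True)] * 10

error_classes = Counter()
for status, timed_out in requests:
    if timed_out:
        error_classes["timeout"] += 1
    elif 500 <= status < 600:
        error_classes["server_error"] += 1

total = len(requests)
for error_class, count in error_classes.items():
    print(f"{error_class}: {100.0 * count / total:.2f}% of requests")
# server_error: 0.05% of requests
# timeout: 0.10% of requests
```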
Selecting SLIs is both an art and a science. You need enough SLIs to capture user experience comprehensively, but not so many that they become unmanageable. A common mistake is defining too many SLIs initially. Start with 2-3 core SLIs:
• Availability: Are requests succeeding?
• Latency: Are they fast enough?
• Primary user journey: Is the key use case working?
Add more specific SLIs only when you have evidence that these three don't capture important failure modes.
Anti-Pattern 1: The Vanity SLI
Measuring something that always looks good but doesn't reflect reality.
Example: Measuring only 'server is up' (synthetic health check) while ignoring that 10% of real requests fail with timeouts.
Anti-Pattern 2: The Lagging SLI
Measuring outcomes that users discovered hours ago.
Example: Daily aggregated error counts. By the time you notice, thousands of users have already been impacted.
Anti-Pattern 3: The Too-Granular SLI
Measuring at such fine granularity that normal variance triggers constant investigation.
Example: Per-second error rate on low-traffic endpoints. A single failed request in a second that saw only one request shows a 100% error rate for that second.
Anti-Pattern 4: The Hidden Context SLI
Measuring without distinguishing fundamentally different request types.
Example: Combined latency for 'quick lookup' and 'complex report generation' requests, making the SLI uninterpretable.
Once you've selected your SLIs, implementation quality determines whether they're useful or misleading. Let's examine common implementation patterns and their tradeoffs.
How it works: Your application code emits metrics for every request—latency, status code, endpoint, etc.
Pros:
• Rich context: metrics can be labeled by endpoint, method, customer tier, or any business dimension
• Full control over what counts as a 'good' event
• Works for internal traffic that never passes through an edge proxy

Cons:
• Requires code changes in every service
• Blind to failures that never reach the application (connection errors, a crashed process, load balancer misrouting)
• Easy for teams to instrument inconsistently
Best for: Internal services, API backends, microservices
```python
from prometheus_client import Counter, Histogram
import time

# SLI: Request success rate
request_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_class']
)

# SLI: Request latency
request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

def handle_request(request):
    start_time = time.time()
    try:
        response = process_request(request)
        # Bucket by status class so the SLI query can treat 2xx as "good"
        status_class = f"{response.status // 100}xx"
        return response
    except Exception:
        status_class = '5xx'
        raise
    finally:
        # Always record metrics, even when the request failed
        duration = time.time() - start_time
        request_latency.labels(
            method=request.method,
            endpoint=request.path_template
        ).observe(duration)
        request_total.labels(
            method=request.method,
            endpoint=request.path_template,
            status_class=status_class
        ).inc()
```

How it works: Your load balancer or API gateway records metrics for every request passing through.
Pros:
• No application code changes required
• Consistent measurement across every service behind the same edge
• Still records failures when the application itself is down or unreachable

Cons:
• Limited business context: it sees paths and status codes, not user intent
• Cannot detect requests that return 200 but carry wrong or incomplete data
• Misses problems between the user and the edge (DNS, client networks, CDN)
Best for: Public-facing APIs, multi-service architectures
How it works: JavaScript or mobile SDK measures actual user experience in-browser/in-app.
Pros:
• Measures what users actually experienced, including network, CDN, and client rendering time
• Captures failures that never reach your servers at all

Cons:
• Noisy: client devices, browsers, and networks vary enormously
• Subject to sampling, ad blockers, and privacy constraints
• Cannot report on users whose page or app failed to load the measurement code in the first place
Best for: User-facing web applications, mobile apps
Production systems typically implement SLIs at multiple layers:
• Client-side RUM — Understanding true user experience
• Edge/Load Balancer — Primary SLI data source for availability and latency
• Application metrics — Detailed diagnostics and per-endpoint tracking
• Synthetic monitoring — Continuous validation of critical paths
Each layer has its purposes. The combination provides comprehensive visibility.
Raw SLI measurements occur at the individual request level, but SLIs are typically reported as aggregated values over time windows. The choice of aggregation method and window size profoundly affects the utility of your SLI.
For Availability/Success Rate SLIs:
SLI = (Sum of successful requests in window) / (Sum of total requests in window) × 100%
This is straightforward—sum numerator and denominator separately, then divide.
For Latency SLIs:
Two common approaches:
• Percentile-based: compute a percentile (e.g., P99) over the window and compare it to a threshold
• Threshold-based: count the proportion of requests in the window that completed faster than the threshold
The second approach (threshold-based) works better with SLO frameworks because it fits the 'good events / total events' model:
Latency SLI = (Requests < 200ms / Total requests) × 100%
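As a minimal sketch of the threshold-based approach (the 200ms threshold and the sample window below are assumptions for illustration):

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float = 200.0) -> float:
    """Threshold-based latency SLI: share of requests faster than threshold_ms."""
    if not latencies_ms:
        return 100.0  # assumption: an empty window counts as fully compliant
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * good / len(latencies_ms)

window = [120, 95, 180, 230, 410, 150, 90, 175, 205, 130]
print(latency_sli(window))  # 70.0 -- 7 of 10 requests beat the 200 ms threshold
```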
| Window Size | Pros | Cons | Use Cases |
|---|---|---|---|
| 1 minute | Fast detection of issues | Noisy, especially for low-traffic services | Alerting, real-time dashboards |
| 5 minutes | Good balance of speed and stability | Still somewhat noisy | Operational dashboards |
| 1 hour | Stable, less noise | Slow to detect issues | Reporting, trend analysis |
| 28/30 days | Matches SLO evaluation periods | Very slow-moving | SLO burn rate, error budgets |
Rolling Windows: recomputed continuously over the trailing period (the last 5 minutes, the last 30 days). Every point in time is judged against the same amount of history, so there are no boundary effects.

Fixed Windows: aligned to the calendar (this hour, this month, this quarter). Easy to communicate and to tie to contracts, but the metric resets at each boundary and a single incident can straddle two windows.
Recommendation: Use rolling windows for operational monitoring and SLO tracking. Use fixed windows for executive reporting and SLA compliance.
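A rough sketch of a rolling-window availability SLI is shown below; in practice the metrics backend (Prometheus, a time-series database) usually performs this aggregation, so the class and bucket size here are purely illustrative:

```python
from collections import deque

class RollingSLI:
    """Rolling-window availability SLI over the last `window_size` buckets.

    Each bucket holds (good, total) counts for one reporting interval,
    e.g., one minute.
    """

    def __init__(self, window_size: int = 5):
        self.buckets = deque(maxlen=window_size)  # old buckets drop off automatically

    def record_bucket(self, good: int, total: int) -> None:
        self.buckets.append((good, total))

    def current(self) -> float:
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return 100.0 * good / total if total else 100.0

sli = RollingSLI(window_size=5)
for good, total in [(995, 1000), (1000, 1000), (950, 1000), (998, 1000), (1000, 1000)]:
    sli.record_bucket(good, total)
print(f"{sli.current():.2f}%")  # 98.86% over the trailing 5 buckets
```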
Be careful when aggregating SLIs across multiple services or endpoints. If Service A handles 1M requests at 99.9% success and Service B handles 1K requests at 90% success:
• Naive average: (99.9 + 90) / 2 = 94.95% ← Misleading!
• Weighted average: (999,000 + 900) / (1,000,000 + 1,000) = 99.89% ← Correct
Always weight by traffic volume, and consider whether aggregation is even meaningful for your use case.
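A few lines of Python reproduce the numbers above and show why the weighting matters:

```python
services = [
    # (good_requests, total_requests)
    (999_000, 1_000_000),  # Service A: 99.9% success
    (900, 1_000),          # Service B: 90.0% success
]

naive = sum(100.0 * good / total for good, total in services) / len(services)
weighted = 100.0 * sum(g for g, _ in services) / sum(t for _, t in services)

print(f"naive average:    {naive:.2f}%")     # 94.95% -- misleading
print(f"weighted average: {weighted:.2f}%")  # 99.89% -- weighted by traffic
```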
Let's examine how different types of systems define their SLIs, drawing from real-world practices at scale.
Critical User Journeys for an e-commerce platform: product search, add to cart, and checkout.

SLI Definitions: availability and latency for each of those journeys, for example the proportion of search requests that return results successfully and the proportion of checkout attempts that complete without error.
Why these matter: A user who can't search, add to cart, or checkout generates zero revenue. Each SLI directly ties to revenue impact.
Critical User Journeys for a video streaming service: browsing the catalog, starting playback, and watching without interruption.

SLI Definitions: playback start success rate, time to start playback, and rebuffer ratio (the proportion of viewing time spent stalled).
Why these matter: User engagement is extremely sensitive to playback quality. A rebuffer drives users to competitors.
Critical User Journeys for a financial services platform: signing in, viewing balances, and initiating payments or transfers.

SLI Definitions: transaction success rate, payment authorization latency, and data correctness, for example the proportion of ledger entries that pass reconciliation.
Why these matter: Financial systems have regulatory requirements and zero tolerance for data inconsistency.
Each industry has unique SLI requirements:
• Healthcare: HIPAA compliance, audit logging success rate
• Gaming: Matchmaking latency, session persistence
• IoT: Device connectivity rate, command delivery success
• Advertising: Bid request latency (strict <100ms), impression delivery rate
Understand your industry's critical metrics before defining SLIs.
We've covered the foundation of reliability measurement. Let's consolidate the key takeaways:

• An SLI is a quantitative measure of service behavior, best expressed as good events divided by total events
• Measure from the user's perspective; internal metrics like CPU and disk I/O make poor SLIs
• Start with 2-3 core SLIs (availability, latency, the primary user journey) and expand only with evidence
• Prefer percentiles over averages for latency, and weight cross-service aggregations by traffic volume
• Choose measurement points (application, edge, client) deliberately; each layer sees different failures
What's next:
With SLIs defining what we measure, we need to establish targets for acceptable performance. The next page explores Service Level Objectives (SLOs)—how to set meaningful reliability targets that balance user expectations with engineering reality.
You now understand Service Level Indicators—the quantitative foundation of reliability engineering. SLIs transform 'is the service working?' from a subjective question into an objective, measurable, actionable metric. Next, we'll explore how to set targets for these indicators with Service Level Objectives.