Having established SLIs as our measurement foundation, we face a critical question: How reliable is 'reliable enough'? This is not a technical question with a mathematical answer—it's a strategic decision that balances user expectations, engineering capabilities, business constraints, and the fundamental economics of reliability.
A Service Level Objective (SLO) is a target value or range for an SLI that defines the acceptable level of service. It transforms the question 'How are we performing?' into 'Are we performing well enough?' SLOs are the bridge between raw measurements and engineering action.
By the end of this page, you will understand how to set meaningful SLOs, the art and science behind choosing the right targets, and why 100% is almost never the right answer. You'll learn the framework for balancing reliability with velocity and how SLOs become the foundation for engineering decisions.
A Service Level Objective (SLO) is a target value for an SLI, expressed as a percentage or threshold, that represents the level of reliability your service aims to maintain. Where SLIs tell you what to measure, SLOs tell you what value constitutes success.
The SLO Formula:
SLO = SLI ≥ Target over Time Window
Concrete Examples:
• 99.95% of checkout requests succeed, measured over a rolling 28-day window
• 95% of checkout confirmations complete within 3 seconds, measured over a rolling 28-day window
Notice that every SLO has three components:
• An SLI (the measurement being evaluated)
• A target (the threshold the SLI must meet)
• A time window (the period over which the SLI is evaluated)
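To make the three components concrete, here is a minimal Python sketch (illustrative class and counts, not a real library) that evaluates whether an SLO is met from good/total event counts over its window:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str          # which SLI this objective applies to
    target: float      # e.g. 0.999 means 99.9%
    window_days: int   # evaluation window, e.g. a 28-day rolling window

def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """Compare the measured SLI over the window against the target."""
    sli = good_events / total_events
    return sli >= slo.target

# Example: 99.9% availability over a 28-day rolling window
checkout_availability = SLO("checkout-availability", target=0.999, window_days=28)
print(is_met(checkout_availability, good_events=9_985_000, total_events=10_000_000))  # False: SLI = 99.85%
```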
Think of an SLO as a promise you make to your users—but more importantly, to your engineering organization. It answers:
• 'When should we stop shipping features to fix reliability?'
• 'How do we prioritize a reliability fix vs. a product feature?'
• 'When is it okay to take risks with deployments?'
Without SLOs, these questions devolve into political debates. With SLOs, they become data-driven decisions.
| Aspect | SLI (Indicator) | SLO (Objective) |
|---|---|---|
| What it is | A measurement | A target for that measurement |
| Example | 99.4% of requests succeeded | ≥99.9% of requests should succeed |
| Nature | Descriptive (what is) | Prescriptive (what should be) |
| Source | Observability data | Business and engineering judgment |
| Triggers action when | Data is collected | Target is not met |
| Changes over time? | Constantly (reflects reality) | Rarely (reflects strategy) |
Newcomers to reliability engineering often ask: 'Why not aim for 100% availability?' This seemingly reasonable question reveals a fundamental misunderstanding of distributed systems, economics, and user behavior. Let's dismantle the 100% myth systematically.
In distributed systems, 100% reliability is not just expensive—it's mathematically impossible. Networks have non-zero failure rates. Hardware fails. Software has bugs. Users themselves make mistakes. Targeting 100% means your reliability goal is definitionally unachievable, which makes it useless for decision-making.
Reliability improvement follows an exponential cost curve. Each additional 'nine' of availability costs roughly 10x more than the previous one.
Consider the real costs:
| Availability | Downtime/Year | Relative Cost* | Infrastructure Complexity |
|---|---|---|---|
| 99% ('two nines') | 3.65 days | 1x | Single server, basic monitoring |
| 99.9% ('three nines') | 8.76 hours | 10x | Redundancy, health checks, basic automation |
| 99.99% ('four nines') | 52.6 minutes | 100x | Multi-AZ, automated failover, sophisticated monitoring |
| 99.999% ('five nines') | 5.26 minutes | 1000x | Multi-region, chaos engineering, SRE team |
| 99.9999% ('six nines') | 31.5 seconds | 10000x+ | Specialized hardware, formal verification |
*Relative cost is illustrative—actual costs vary by system.
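The downtime column follows directly from each availability target; a quick sketch of the arithmetic (assuming a 365-day year):

```python
# Illustrative: how the downtime-per-year figures follow from each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                            ("four nines", 0.9999), ("five nines", 0.99999),
                            ("six nines", 0.999999)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.6f} ({label}): "
          f"{downtime_min:8.1f} min/year ≈ {downtime_min / 60:6.2f} h ≈ {downtime_min / 1440:5.2f} days")
```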
Here's a crucial insight: Users can't tell the difference between 99.99% and 99.999%. In both cases, they experience roughly a minute or less of downtime per week. The incremental cost of that improvement is massive, but the user benefit is imperceptible.
The 'last mile' problem:
Even if your service achieves 99.999% availability, your users experience:
• ISP outages and routing problems
• Flaky home or office Wi-Fi and spotty mobile coverage
• Device, browser, and DNS issues
• Congestion and failures in the networks between them and you
The user's actual experience is dominated by factors outside your control. Investing in 99.999% when the user sees 99% due to their own infrastructure is wasted effort.
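A small worked example, using the 99% figure from above as a hypothetical stand-in for the user's own ISP, Wi-Fi, and device availability:

```python
# Illustrative: the availability a user actually experiences is roughly the
# product of your availability and everything between them and you.
service = 0.99999          # five nines on your side
user_last_mile = 0.99      # hypothetical ISP/Wi-Fi/device availability

experienced = service * user_last_mile
print(f"user-experienced availability ≈ {experienced:.3%}")  # ≈ 98.999%
```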
Every engineering hour spent on reliability is an hour not spent on features, performance, or innovation. This isn't laziness—it's economics.
The frozen product anti-pattern:
Teams targeting 100% reliability often become paralyzed:
• Every deployment is treated as an unacceptable risk
• Changes pile up behind lengthy review and approval gates
• Releases get batched into rare, large (and therefore riskier) rollouts
• Feature work stalls behind endless hardening
Eventually, competitors with faster release cycles overtake the 'reliable' product.
Setting SLOs is not guesswork—it's a structured process that incorporates user research, historical data, business requirements, and engineering capability. Here's a comprehensive framework:
Methods to gauge user expectations:
• User interviews and surveys about acceptable wait times and outage tolerance
• Support ticket and complaint analysis (what do users actually report?)
• Funnel and abandonment analytics correlated with latency and errors
• Competitive benchmarking against alternatives your users could switch to
Example finding: Analysis of e-commerce checkout abandonment shows users tolerate up to 3 seconds for checkout confirmation, but abandon rapidly beyond 5 seconds. This suggests a latency SLO target of 95% < 3 seconds, not 95% < 100ms.
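A threshold-based latency SLI like this can be computed directly from timing samples; a minimal sketch with hypothetical measurements:

```python
# Illustrative: compute a latency SLI ("fraction of checkouts confirmed within 3s")
# from raw timing samples and compare it against a 95% target.
latencies_s = [0.8, 1.2, 2.9, 3.4, 1.1, 0.9, 2.2, 5.1, 1.7, 2.5]  # hypothetical samples

threshold_s = 3.0
target = 0.95

good = sum(1 for t in latencies_s if t <= threshold_s)
sli = good / len(latencies_s)
print(f"SLI = {sli:.1%}, target = {target:.0%}, met = {sli >= target}")  # SLI = 80.0%, met = False
```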
Establish your baseline:
Last 90 days availability: 99.4% (min 98.7%, max 99.8%)
P99 latency: 340ms (range 280-520ms)
Error rate: 0.3% average
Key questions:
• How much does performance vary day to day and season to season?
• Were past dips caused by one-off incidents or chronic problems?
• Did users notice or complain at the current level of reliability?
• Is the trend improving, flat, or degrading?
Use this data to set achievable starting points. Don't set SLOs you've never achieved—you'll immediately be in violation.
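As a sketch, here is how a baseline might be derived from hypothetical daily availability figures roughly in line with the 99.4% average above:

```python
# Illustrative: derive a baseline from (hypothetical) daily availability data,
# then choose an initial SLO target below what you already achieve.
daily_availability = [0.994, 0.998, 0.987, 0.996, 0.995, 0.993, 0.997]  # last N days

baseline_mean = sum(daily_availability) / len(daily_availability)
baseline_min = min(daily_availability)

print(f"mean = {baseline_mean:.3%}, worst day = {baseline_min:.3%}")
# Start below the mean so you are not in violation on day one,
# e.g. a 99% target when the mean is ~99.4%.
```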
The SLO ceiling rule:
Your service's SLO cannot exceed the combined SLOs of your critical dependencies. If you depend on:
• A database with a 99.99% SLO
• An authentication service with a 99.95% SLO
• A third-party API with a 99.9% SLO
Your theoretical maximum availability is roughly:
0.9999 × 0.9995 × 0.999 = 0.9984 (99.84%)
Setting an SLO of 99.99% would be dishonest—your dependencies make it unachievable.
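The ceiling calculation is just a product of dependency availabilities; a short sketch using the hypothetical dependency SLOs listed above:

```python
# Illustrative: your availability ceiling is roughly the product of the
# availabilities of the critical dependencies in your serving path.
dependencies = {
    "database": 0.9999,        # hypothetical dependency SLOs
    "auth-service": 0.9995,
    "third-party-api": 0.999,
}

ceiling = 1.0
for name, availability in dependencies.items():
    ceiling *= availability

print(f"theoretical max availability ≈ {ceiling:.4%}")  # ≈ 99.84%
```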
Include hidden dependencies:
• DNS and TLS certificate providers
• CDNs and cloud load balancers
• Your cloud provider's control plane and managed services
• Internal platform services: configuration, secrets, service discovery
• Anything in the critical path that you don't monitor directly
Match ambition to capability:
| Current Performance | Recommended Initial SLO |
|---|---|
| 99.7% average | 99.5% (achievable with buffer) |
| 99.5% average | 99% (conservative start) |
| 98% average | 95% (honest baseline) |
Why start conservative?
An SLO you consistently miss teaches your organization to ignore SLOs. An SLO you occasionally miss teaches your organization that SLOs matter. Start with achievable targets and tighten them as you improve.
A useful heuristic: Your internal SLO target should be approximately 10x more lenient than your most demanding user's expectations. This provides buffer for:
• Measurement variance
• Undetected issues
• Planned maintenance
• Experimentation and testing
If users expect 99.99%, target 99.9% internally. The buffer is your safety margin.
The time window over which you evaluate SLOs profoundly affects their utility. A 99.9% SLO over 1 hour means something very different from 99.9% over 30 days.
| Window | 99.9% SLO Allows | Characteristics | Best For |
|---|---|---|---|
| 1 hour | 3.6 seconds downtime/hour | Very sensitive, noisy | Critical real-time systems |
| 1 day | 86 seconds downtime/day | Moderate sensitivity | Operational monitoring |
| 7 days | ~10 minutes downtime/week | Balanced | Sprint-aligned review |
| 28 days | ~40 minutes downtime/month | Stable, strategic | Error budget management |
| 30 days | ~43 minutes downtime/month | Calendar-aligned | Monthly reporting, SLAs |
| Quarter | ~2.2 hours downtime/quarter | Long-term trends | Executive reporting |
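The allowances in the table can be reproduced from the SLO and the window length; a quick sketch (assuming a ~91-day quarter):

```python
# Illustrative: allowed downtime for a 99.9% SLO across the windows above.
slo = 0.999
windows_hours = {
    "1 hour": 1,
    "1 day": 24,
    "7 days": 7 * 24,
    "28 days": 28 * 24,
    "30 days": 30 * 24,
    "quarter": 91 * 24,
}

for name, hours in windows_hours.items():
    allowed_s = (1 - slo) * hours * 3600
    print(f"{name:>7}: {allowed_s:8.1f} s ≈ {allowed_s / 60:6.1f} min")
```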
Rolling windows (recommended for SLOs):
• Evaluate the most recent N days continuously, so every moment is judged by the same rules
• No artificial reset at month boundaries; a bad week keeps affecting the SLO until it ages out
• Better suited to driving day-to-day engineering decisions
Calendar windows (common for SLAs):
• Align with billing periods, contracts, and monthly reporting
• Reset at the start of each month or quarter, which customers find easy to understand
• An incident late in the period effectively vanishes when the window resets
The 28-day rolling window:
Google's SRE book popularized the 28-day rolling window for good reasons:
• It always contains exactly four of each weekday, so the weekday/weekend traffic mix is consistent
• It avoids the varying lengths of calendar months
• It is long enough to smooth noise but short enough to respond to real degradation
• It aligns reasonably well with sprint and release cadences
This is the de facto standard for SLO evaluation in modern reliability engineering.
Shorter windows:
✓ Faster detection of issues
✗ More noise, more false positives
✗ Single incidents have outsized impact
Longer windows:
✓ More stable, less noise
✗ Slower response to degradation
✗ Major incidents get 'averaged out'
Most teams use 28-day windows for SLO tracking but also monitor shorter windows (1-hour, 1-day) for alerting purposes.
Real services rarely have just one SLO. A comprehensive SLO strategy typically includes multiple objectives covering different aspects of user experience and different user segments.
Tier 1: Critical path SLOs
These protect your most important user journeys. Violations demand immediate response.
Tier 2: Important but not critical SLOs
These matter but brief violations won't cause user churn.
Tier 3: Best-effort SLOs
These are aspirational—nice to hit but not worth sacrificing other priorities.
Not all users are equal from a business perspective:
By customer tier:
Enterprise or paying customers often warrant tighter targets (e.g., 99.95%) than free-tier users (e.g., 99.5%), because churn there costs more.
By geography:
Regions where you run full infrastructure can support tighter targets than regions served over long network paths or through partners.
By use case:
Interactive, user-facing requests need tighter availability and latency targets than batch or background workloads that can simply retry.
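As an illustration only (the segment names and targets here are hypothetical, not recommendations), segmented targets might be expressed as a simple lookup:

```python
# Illustrative: different SLO targets for different user segments.
availability_targets = {
    ("enterprise", "interactive"): 0.9995,   # tight: paying users, critical path
    ("free-tier", "interactive"): 0.995,     # looser: brief blips are tolerable
    ("enterprise", "batch"): 0.99,           # background jobs can retry
}

def target_for(customer_tier: str, use_case: str) -> float:
    # Fall back to the loosest defined target if a segment isn't listed.
    return availability_targets.get((customer_tier, use_case),
                                    min(availability_targets.values()))

print(target_for("enterprise", "interactive"))  # 0.9995
print(target_for("free-tier", "batch"))         # falls back to 0.99
```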
A common failure mode is creating too many SLOs. When you have 50 SLOs, you effectively have zero—nobody can track them all, violations become meaningless, and the system provides no actionable signal.
Rule of thumb: If a single person can't explain all your service's SLOs from memory, you have too many.
An SLO that exists only in someone's head or an undiscoverable wiki page might as well not exist. Effective SLOs require rigorous documentation that makes them discoverable, understandable, and actionable.
Every SLO should have a formal document containing:
1. SLO Identification
Service: Checkout API
SLO Name: Checkout Availability
Version: 1.3
Owner: Payments Team
Last Updated: 2024-Q3
Review Schedule: Quarterly
2. SLI Specification
Indicator: Request success rate
Measurement: (HTTP 2xx responses / Total requests) × 100%
Data Source: Datadog APM for checkout-api service
Exclusions: Health check endpoints, internal tooling
3. Objective Definition
Target: 99.95%
Time Window: 28-day rolling
Error Budget: 0.05% of requests (≈20.2 minutes of full downtime per 28 days)
4. Rationale
User research indicates checkout failure rates above 0.1%
cause a measurable spike in cart abandonment (see User Study #42).
Historical performance averages 99.97%.
Dependency analysis caps theoretical max at 99.98%.
Target of 99.95% provides 0.02% buffer for deployments.
5. Alert Configuration
Fast burn alert: >2% budget consumed in 1 hour → P1
Slow burn alert: >10% budget consumed in 1 day → P2
Budget exhaustion warning: >50% consumed → P3
6. Response Procedures
Fast burn: Page on-call, freeze deployments, initiate incident
Slow burn: Notify team channel, investigate during business hours
Budget warning: Add to sprint planning, prioritize reliability work
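The error budget in section 3 and the burn-rate thresholds in section 5 can be derived mechanically from the target and the window; a sketch of that arithmetic (note that the widely cited 14.4x fast-burn figure assumes a 30-day window):

```python
# Illustrative: error budget and burn-rate alert math for a 99.95% SLO
# evaluated over a 28-day rolling window.
target = 0.9995
window_minutes = 28 * 24 * 60

budget_fraction = 1 - target                       # 0.05% of requests
budget_minutes = budget_fraction * window_minutes  # ≈ 20.2 minutes of full downtime
print(f"error budget ≈ {budget_minutes:.1f} min per 28 days")

# Burn rate = how many times faster than "exactly on budget" you are burning.
# Consuming 2% of the 28-day budget in 1 hour corresponds to:
fast_burn_rate = 0.02 * (window_minutes / 60)      # ≈ 13.4x the sustainable rate
print(f"fast-burn threshold ≈ {fast_burn_rate:.1f}x the sustainable burn rate")
```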
Modern teams define SLOs in machine-readable formats (YAML, JSON) that feed directly into monitoring systems. This enables:
• Automated SLO dashboards
• Automated alerting based on burn rates
• SLO compliance reporting
• Drift detection when definitions change
Consider tools like Sloth, OpenSLO, or building your own SLO-as-code framework.
```yaml
# OpenSLO-compatible SLO definition
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout API Availability
spec:
  description: >
    Ensures checkout API maintains high availability
    for payment processing
  service: checkout-api
  sli:
    metric:
      source: datadog
      goodQuery: >
        sum:checkout.requests{status:2xx}.as_count()
      totalQuery: >
        sum:checkout.requests{*}.as_count()
      type: ratio
  objectives:
    - target: 0.9995
      displayName: 99.95% availability
  timeWindow:
    rolling:
      unit: day
      count: 28
  alerting:
    fastBurn:
      burnRate: 14.4
      lookbackWindow: 1h
      severity: critical
    slowBurn:
      burnRate: 6
      lookbackWindow: 6h
      severity: warning
```

SLOs are not permanent—they should evolve as your service, users, and business evolve. However, changing SLOs requires discipline to prevent gaming and maintain trust.
Evidence-based tightening:
Consistently exceeding target: If you're always at 99.99% with a 99.9% SLO, the SLO isn't driving behavior. Tighten it.
User expectations have increased: Competition or market changes mean users expect more.
Business has grown: Revenue impact of downtime has increased; investment in reliability is justified.
You've improved infrastructure: New capabilities (multi-region, better failover) make higher reliability achievable.
Legitimate loosening scenarios:
SLO was set unrealistically: Initial target was aspirational, not achievable.
Dependency degradation: An upstream service reduced their SLO, lowering your ceiling.
Strategic pivot: Business decided to invest in features over reliability.
Cost reduction: Economic pressure requires accepting more risk.
The political danger of loosening:
Loosening SLOs often looks like 'giving up' or 'lowering standards.' Document the rationale clearly to prevent future misinterpretation.
Never change SLOs retroactively. If you're about to miss an SLO and you quickly loosen the target, you've destroyed the system's credibility.
SLO changes should:
• Be announced in advance
• Take effect at the start of a new measurement window
• Be documented with rationale
• Be approved by stakeholders (not just engineering)
If you're frequently wanting to change SLOs, you're setting them wrong initially.
SLOs are the bridge between measurement (SLIs) and action. They define what 'good enough' means for your service, enabling data-driven reliability decisions.
What's next:
With SLIs defining measurement and SLOs defining targets, we need to understand Service Level Agreements (SLAs)—the contractual commitments we make to customers about reliability, complete with consequences for violations.
You now understand Service Level Objectives—the targets that transform reliability from a vague aspiration into a measurable, manageable commitment. SLOs answer 'How reliable is reliable enough?' and enable every subsequent reliability engineering decision. Next, we'll explore how SLOs become contractual SLAs.