Having established SLIs as our measurement foundation, we face a critical question: How reliable is 'reliable enough'? This is not a technical question with a mathematical answer—it's a strategic decision that balances user expectations, engineering capabilities, business constraints, and the fundamental economics of reliability.
A Service Level Objective (SLO) is a target value or range for an SLI that defines the acceptable level of service. It transforms the question 'How are we performing?' into 'Are we performing well enough?' SLOs are the bridge between raw measurements and engineering action.
By the end of this page, you will understand how to set meaningful SLOs, the art and science behind choosing the right targets, and why 100% is almost never the right answer. You'll learn the framework for balancing reliability with velocity and how SLOs become the foundation for engineering decisions.
A Service Level Objective (SLO) is a target value for an SLI, expressed as a percentage or threshold, that represents the level of reliability your service aims to maintain. Where SLIs tell you what to measure, SLOs tell you what value constitutes success.
The SLO Formula:
SLO = SLI ≥ Target over Time Window
Concrete Examples:
• 99.95% of checkout requests succeed, measured over a rolling 28-day window
• 95% of checkout confirmations complete within 3 seconds, measured over a rolling 28-day window
Notice that every SLO has three components:
• An SLI (the measurement being evaluated)
• A target (the threshold the SLI must meet)
• A time window (the period over which the SLI is evaluated)
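To make the three components concrete, here is a minimal Python sketch (illustrative class and counts, not a real library) that evaluates whether an SLO is met from good/total event counts over its window:

```python
from dataclasses import dataclass

@dataclass
class SLO:
    name: str          # which SLI this objective applies to
    target: float      # e.g. 0.999 means 99.9%
    window_days: int   # evaluation window, e.g. a 28-day rolling window

def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """Compare the measured SLI over the window against the target."""
    sli = good_events / total_events
    return sli >= slo.target

# Example: 99.9% availability over a 28-day rolling window
checkout_availability = SLO("checkout-availability", target=0.999, window_days=28)
print(is_met(checkout_availability, good_events=9_985_000, total_events=10_000_000))  # False: SLI = 99.85%
```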
Think of an SLO as a promise you make to your users—but more importantly, to your engineering organization. It answers:
• 'When should we stop shipping features to fix reliability?'
• 'How do we prioritize a reliability fix vs. a product feature?'
• 'When is it okay to take risks with deployments?'
Without SLOs, these questions devolve into political debates. With SLOs, they become data-driven decisions.
| Aspect | SLI (Indicator) | SLO (Objective) |
|---|---|---|
| What it is | A measurement | A target for that measurement |
| Example | 99.4% of requests succeeded | ≥99.9% of requests should succeed |
| Nature | Descriptive (what is) | Prescriptive (what should be) |
| Source | Observability data | Business and engineering judgment |
| Triggers action when | Data is collected | Target is not met |
| Changes over time? | Constantly (reflects reality) | Rarely (reflects strategy) |
Newcomers to reliability engineering often ask: 'Why not aim for 100% availability?' This seemingly reasonable question reveals a fundamental misunderstanding of distributed systems, economics, and user behavior. Let's dismantle the 100% myth systematically.
In distributed systems, 100% reliability is not just expensive—it's mathematically impossible. Networks have non-zero failure rates. Hardware fails. Software has bugs. Users themselves make mistakes. Targeting 100% means your reliability goal is definitionally unachievable, which makes it useless for decision-making.
Reliability improvement follows an exponential cost curve. Each additional 'nine' of availability costs roughly 10x more than the previous one.
Consider the real costs:
| Availability | Downtime/Year | Relative Cost* | Infrastructure Complexity |
|---|---|---|---|
| 99% ('two nines') | 3.65 days | 1x | Single server, basic monitoring |
| 99.9% ('three nines') | 8.76 hours | 10x | Redundancy, health checks, basic automation |
| 99.99% ('four nines') | 52.6 minutes | 100x | Multi-AZ, automated failover, sophisticated monitoring |
| 99.999% ('five nines') | 5.26 minutes | 1000x | Multi-region, chaos engineering, SRE team |
| 99.9999% ('six nines') | 31.5 seconds | 10000x+ | Specialized hardware, formal verification |
*Relative cost is illustrative—actual costs vary by system.
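The downtime column follows directly from each availability target; a quick sketch of the arithmetic (assuming a 365-day year):

```python
# Illustrative: how the downtime-per-year figures follow from each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("two nines", 0.99), ("three nines", 0.999),
                            ("four nines", 0.9999), ("five nines", 0.99999),
                            ("six nines", 0.999999)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.6f} ({label}): "
          f"{downtime_min:8.1f} min/year ≈ {downtime_min / 60:6.2f} h ≈ {downtime_min / 1440:5.2f} days")
```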
Here's a crucial insight: Users can't tell the difference between 99.99% and 99.999%. In both cases, they experience roughly a minute or less of downtime per week. The incremental cost of that improvement is massive, but the user benefit is imperceptible.
The 'last mile' problem:
Even if your service achieves 99.999% availability, your users experience:
• ISP outages and routing problems
• Flaky home or office Wi-Fi and spotty mobile coverage
• Device, browser, and DNS issues
• Congestion and failures in the networks between them and you
The user's actual experience is dominated by factors outside your control. Investing in 99.999% when the user sees 99% due to their own infrastructure is wasted effort.
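A small worked example, using the 99% figure from above as a hypothetical stand-in for the user's own ISP, Wi-Fi, and device availability:

```python
# Illustrative: the availability a user actually experiences is roughly the
# product of your availability and everything between them and you.
service = 0.99999          # five nines on your side
user_last_mile = 0.99      # hypothetical ISP/Wi-Fi/device availability

experienced = service * user_last_mile
print(f"user-experienced availability ≈ {experienced:.3%}")  # ≈ 98.999%
```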
Every engineering hour spent on reliability is an hour not spent on features, performance, or innovation. This isn't laziness—it's economics.
The frozen product anti-pattern:
Teams targeting 100% reliability often become paralyzed:
• Every deployment is treated as an unacceptable risk
• Changes pile up behind lengthy review and approval gates
• Releases get batched into rare, large (and therefore riskier) rollouts
• Feature work stalls behind endless hardening
Eventually, competitors with faster release cycles overtake the 'reliable' product.
Setting SLOs is not guesswork—it's a structured process that incorporates user research, historical data, business requirements, and engineering capability. Here's a comprehensive framework:
Methods to gauge user expectations:
• User interviews and surveys about acceptable wait times and outage tolerance
• Support ticket and complaint analysis (what do users actually report?)
• Funnel and abandonment analytics correlated with latency and errors
• Competitive benchmarking against alternatives your users could switch to
Example finding: Analysis of e-commerce checkout abandonment shows users tolerate up to 3 seconds for checkout confirmation, but abandon rapidly beyond 5 seconds. This suggests a latency SLO target of 95% < 3 seconds, not 95% < 100ms.
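A threshold-based latency SLI like this can be computed directly from timing samples; a minimal sketch with hypothetical measurements:

```python
# Illustrative: compute a latency SLI ("fraction of checkouts confirmed within 3s")
# from raw timing samples and compare it against a 95% target.
latencies_s = [0.8, 1.2, 2.9, 3.4, 1.1, 0.9, 2.2, 5.1, 1.7, 2.5]  # hypothetical samples

threshold_s = 3.0
target = 0.95

good = sum(1 for t in latencies_s if t <= threshold_s)
sli = good / len(latencies_s)
print(f"SLI = {sli:.1%}, target = {target:.0%}, met = {sli >= target}")  # SLI = 80.0%, met = False
```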
Establish your baseline:
Last 90 days availability: 99.4% (min 98.7%, max 99.8%)
P99 latency: 340ms (range 280-520ms)
Error rate: 0.3% average
Key questions:
• How much does performance vary day to day and season to season?
• Were past dips caused by one-off incidents or chronic problems?
• Did users notice or complain at the current level of reliability?
• Is the trend improving, flat, or degrading?
Use this data to set achievable starting points. Don't set SLOs you've never achieved—you'll immediately be in violation.
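As a sketch, here is how a baseline might be derived from hypothetical daily availability figures roughly in line with the 99.4% average above:

```python
# Illustrative: derive a baseline from (hypothetical) daily availability data,
# then choose an initial SLO target below what you already achieve.
daily_availability = [0.994, 0.998, 0.987, 0.996, 0.995, 0.993, 0.997]  # last N days

baseline_mean = sum(daily_availability) / len(daily_availability)
baseline_min = min(daily_availability)

print(f"mean = {baseline_mean:.3%}, worst day = {baseline_min:.3%}")
# Start below the mean so you are not in violation on day one,
# e.g. a 99% target when the mean is ~99.4%.
```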
The SLO ceiling rule:
Your service's SLO cannot exceed the combined SLOs of your critical dependencies. If you depend on:
• A database with a 99.99% SLO
• An authentication service with a 99.95% SLO
• A third-party API with a 99.9% SLO
Your theoretical maximum availability is roughly:
0.9999 × 0.9995 × 0.999 = 0.9984 (99.84%)
Setting an SLO of 99.99% would be dishonest—your dependencies make it unachievable.
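The ceiling calculation is just a product of dependency availabilities; a short sketch using the hypothetical dependency SLOs listed above:

```python
# Illustrative: your availability ceiling is roughly the product of the
# availabilities of the critical dependencies in your serving path.
dependencies = {
    "database": 0.9999,        # hypothetical dependency SLOs
    "auth-service": 0.9995,
    "third-party-api": 0.999,
}

ceiling = 1.0
for name, availability in dependencies.items():
    ceiling *= availability

print(f"theoretical max availability ≈ {ceiling:.4%}")  # ≈ 99.84%
```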
Include hidden dependencies:
• DNS and TLS certificate providers
• CDNs and cloud load balancers
• Your cloud provider's control plane and managed services
• Internal platform services: configuration, secrets, service discovery
• Anything in the critical path that you don't monitor directly
Match ambition to capability:
| Current Performance | Recommended Initial SLO |
|---|---|
| 99.7% average | 99.5% (achievable with buffer) |
| 99.5% average | 99% (conservative start) |
| 98% average | 95% (honest baseline) |
Why start conservative?
An SLO you consistently miss teaches your organization to ignore SLOs. An SLO you occasionally miss teaches your organization that SLOs matter. Start with achievable targets and tighten them as you improve.
A useful heuristic: Your internal SLO target should be approximately 10x more lenient than your most demanding user's expectations. This provides buffer for:
• Measurement variance
• Undetected issues
• Planned maintenance
• Experimentation and testing
If users expect 99.99%, target 99.9% internally. The buffer is your safety margin.
The time window over which you evaluate SLOs profoundly affects their utility. A 99.9% SLO over 1 hour means something very different from 99.9% over 30 days.
| Window | 99.9% SLO Allows | Characteristics | Best For |
|---|---|---|---|
| 1 hour | 3.6 seconds downtime/hour | Very sensitive, noisy | Critical real-time systems |
| 1 day | 86 seconds downtime/day | Moderate sensitivity | Operational monitoring |
| 7 days | ~10 minutes downtime/week | Balanced | Sprint-aligned review |
| 28 days | ~40 minutes downtime/month | Stable, strategic | Error budget management |
| 30 days | ~43 minutes downtime/month | Calendar-aligned | Monthly reporting, SLAs |
| Quarter | ~2.2 hours downtime/quarter | Long-term trends | Executive reporting |
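The allowances in the table can be reproduced from the SLO and the window length; a quick sketch (assuming a ~91-day quarter):

```python
# Illustrative: allowed downtime for a 99.9% SLO across the windows above.
slo = 0.999
windows_hours = {
    "1 hour": 1,
    "1 day": 24,
    "7 days": 7 * 24,
    "28 days": 28 * 24,
    "30 days": 30 * 24,
    "quarter": 91 * 24,
}

for name, hours in windows_hours.items():
    allowed_s = (1 - slo) * hours * 3600
    print(f"{name:>7}: {allowed_s:8.1f} s ≈ {allowed_s / 60:6.1f} min")
```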
Rolling windows (recommended for SLOs):
• Evaluate the most recent N days continuously, so every moment is judged by the same rules
• No artificial reset at month boundaries; a bad week keeps affecting the SLO until it ages out
• Better suited to driving day-to-day engineering decisions
Calendar windows (common for SLAs):
• Align with billing periods, contracts, and monthly reporting
• Reset at the start of each month or quarter, which customers find easy to understand
• An incident late in the period effectively vanishes when the window resets
The 28-day rolling window:
Google's SRE book popularized the 28-day rolling window for good reasons:
• It always contains exactly four of each weekday, so the weekday/weekend traffic mix is consistent
• It avoids the varying lengths of calendar months
• It is long enough to smooth noise but short enough to respond to real degradation
• It aligns reasonably well with sprint and release cadences
This is the de facto standard for SLO evaluation in modern reliability engineering.
Shorter windows:
✓ Faster detection of issues
✗ More noise, more false positives
✗ Single incidents have outsized impact
Longer windows:
✓ More stable, less noise
✗ Slower response to degradation
✗ Major incidents get 'averaged out'
Most teams use 28-day windows for SLO tracking but also monitor shorter windows (1-hour, 1-day) for alerting purposes.
Real services rarely have just one SLO. A comprehensive SLO strategy typically includes multiple objectives covering different aspects of user experience and different user segments.
Tier 1: Critical path SLOs
These protect your most important user journeys. Violations demand immediate response.
Tier 2: Important but not critical SLOs
These matter but brief violations won't cause user churn.
Tier 3: Best-effort SLOs
These are aspirational—nice to hit but not worth sacrificing other priorities.
Not all users are equal from a business perspective:
By customer tier:
Enterprise or paying customers often warrant tighter targets (e.g., 99.95%) than free-tier users (e.g., 99.5%), because churn there costs more.
By geography:
Regions where you run full infrastructure can support tighter targets than regions served over long network paths or through partners.
By use case:
Interactive, user-facing requests need tighter availability and latency targets than batch or background workloads that can simply retry.
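As an illustration only (the segment names and targets here are hypothetical, not recommendations), segmented targets might be expressed as a simple lookup:

```python
# Illustrative: different SLO targets for different user segments.
availability_targets = {
    ("enterprise", "interactive"): 0.9995,   # tight: paying users, critical path
    ("free-tier", "interactive"): 0.995,     # looser: brief blips are tolerable
    ("enterprise", "batch"): 0.99,           # background jobs can retry
}

def target_for(customer_tier: str, use_case: str) -> float:
    # Fall back to the loosest defined target if a segment isn't listed.
    return availability_targets.get((customer_tier, use_case),
                                    min(availability_targets.values()))

print(target_for("enterprise", "interactive"))  # 0.9995
print(target_for("free-tier", "batch"))         # falls back to 0.99
```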
A common failure mode is creating too many SLOs. When you have 50 SLOs, you effectively have zero—nobody can track them all, violations become meaningless, and the system provides no actionable signal.
Rule of thumb: If a single person can't explain all your service's SLOs from memory, you have too many.
An SLO that exists only in someone's head or an undiscoverable wiki page might as well not exist. Effective SLOs require rigorous documentation that makes them discoverable, understandable, and actionable.
Every SLO should have a formal document containing:
1. SLO Identification
Service: Checkout API
SLO Name: Checkout Availability
Version: 1.3
Owner: Payments Team
Last Updated: 2024-Q3
Review Schedule: Quarterly
2. SLI Specification
Indicator: Request success rate
Measurement: (HTTP 2xx responses / Total requests) × 100%
Data Source: Datadog APM for checkout-api service
Exclusions: Health check endpoints, internal tooling
3. Objective Definition
Target: 99.95%
Time Window: 28-day rolling
Error Budget: 0.05% of requests (≈20.2 minutes of full downtime per 28 days)
4. Rationale
User research indicates checkout failure rates above 0.1%
cause a measurable spike in cart abandonment (see User Study #42).
Historical performance averages 99.97%.
Dependency analysis caps theoretical max at 99.98%.
Target of 99.95% provides 0.02% buffer for deployments.
5. Alert Configuration
Fast burn alert: >2% budget consumed in 1 hour → P1
Slow burn alert: >10% budget consumed in 1 day → P2
Budget exhaustion warning: >50% consumed → P3
6. Response Procedures
Fast burn: Page on-call, freeze deployments, initiate incident
Slow burn: Notify team channel, investigate during business hours
Budget warning: Add to sprint planning, prioritize reliability work
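The error budget in section 3 and the burn-rate thresholds in section 5 can be derived mechanically from the target and the window; a sketch of that arithmetic (note that the widely cited 14.4x fast-burn figure assumes a 30-day window):

```python
# Illustrative: error budget and burn-rate alert math for a 99.95% SLO
# evaluated over a 28-day rolling window.
target = 0.9995
window_minutes = 28 * 24 * 60

budget_fraction = 1 - target                       # 0.05% of requests
budget_minutes = budget_fraction * window_minutes  # ≈ 20.2 minutes of full downtime
print(f"error budget ≈ {budget_minutes:.1f} min per 28 days")

# Burn rate = how many times faster than "exactly on budget" you are burning.
# Consuming 2% of the 28-day budget in 1 hour corresponds to:
fast_burn_rate = 0.02 * (window_minutes / 60)      # ≈ 13.4x the sustainable rate
print(f"fast-burn threshold ≈ {fast_burn_rate:.1f}x the sustainable burn rate")
```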
Modern teams define SLOs in machine-readable formats (YAML, JSON) that feed directly into monitoring systems. This enables:
• Automated SLO dashboards
• Automated alerting based on burn rates
• SLO compliance reporting
• Drift detection when definitions change
Consider tools like Sloth, OpenSLO, or building your own SLO-as-code framework.
```yaml
# OpenSLO-compatible SLO definition
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-availability
  displayName: Checkout API Availability
spec:
  description: >
    Ensures checkout API maintains high availability
    for payment processing
  service: checkout-api
  sli:
    metric:
      source: datadog
      goodQuery: >
        sum:checkout.requests{status:2xx}.as_count()
      totalQuery: >
        sum:checkout.requests{*}.as_count()
      type: ratio
  objectives:
    - target: 0.9995
      displayName: 99.95% availability
  timeWindow:
    rolling:
      unit: day
      count: 28
  alerting:
    fastBurn:
      burnRate: 14.4
      lookbackWindow: 1h
      severity: critical
    slowBurn:
      burnRate: 6
      lookbackWindow: 6h
      severity: warning
```

SLOs are not permanent—they should evolve as your service, users, and business evolve. However, changing SLOs requires discipline to prevent gaming and maintain trust.
Evidence-based tightening:
Consistently exceeding target: If you're always at 99.99% with a 99.9% SLO, the SLO isn't driving behavior. Tighten it.
User expectations have increased: Competition or market changes mean users expect more.
Business has grown: Revenue impact of downtime has increased; investment in reliability is justified.
You've improved infrastructure: New capabilities (multi-region, better failover) make higher reliability achievable.
Legitimate loosening scenarios:
SLO was set unrealistically: Initial target was aspirational, not achievable.
Dependency degradation: An upstream service reduced their SLO, lowering your ceiling.
Strategic pivot: Business decided to invest in features over reliability.
Cost reduction: Economic pressure requires accepting more risk.
The political danger of loosening:
Loosening SLOs often looks like 'giving up' or 'lowering standards.' Document the rationale clearly to prevent future misinterpretation.
Never change SLOs retroactively. If you're about to miss an SLO and you quickly loosen the target, you've destroyed the system's credibility.
SLO changes should:
• Be announced in advance
• Take effect at the start of a new measurement window
• Be documented with rationale
• Be approved by stakeholders (not just engineering)
If you're frequently wanting to change SLOs, you're setting them wrong initially.
SLOs are the bridge between measurement (SLIs) and action. They define what 'good enough' means for your service, enabling data-driven reliability decisions.
What's next:
With SLIs defining measurement and SLOs defining targets, we need to understand Service Level Agreements (SLAs)—the contractual commitments we make to customers about reliability, complete with consequences for violations.
You now understand Service Level Objectives—the targets that transform reliability from a vague aspiration into a measurable, manageable commitment. SLOs answer 'How reliable is reliable enough?' and enable every subsequent reliability engineering decision. Next, we'll explore how SLOs become contractual SLAs.