Setting Service Level Objective (SLO) targets is one of the most consequential decisions a reliability engineering team makes. An SLO target is not merely a number—it is a contract that defines what "good enough" means for your service. It encodes expectations between engineering teams, product managers, business stakeholders, and ultimately, your users.
Get this wrong, and you face one of two failure modes: set the target too high and you burn engineering effort (and team morale) chasing reliability your users never asked for; set it too low and user experience quietly degrades while your dashboards stay green.
The challenge is that there is no universally "correct" SLO target. A target of 99.9% availability might be excessive for an internal tool used by 50 employees, yet catastrophically insufficient for a payment processing system handling billions in transactions. Context is everything.
By the end of this page, you'll understand the complete framework for selecting SLO targets: analyzing user expectations, evaluating technical constraints, aligning with business objectives, and establishing targets that drive the right organizational behaviors. You'll learn methodologies used by Google, Netflix, and other reliability leaders to set targets that are both ambitious and achievable.
Before diving into target selection methodologies, let's establish clarity on what SLO targets actually represent and how they function within the reliability ecosystem.
The Structure of an SLO Target:
An SLO target consists of four components: the SLI being measured, a threshold that bounds acceptable behavior, a target percentage, and an evaluation window.
For example: "99.9% of login requests will complete in under 500ms over a rolling 30-day window."
This structure matters because each component can be tuned independently. You might keep the same latency threshold but relax the target percentage during a major migration. Understanding these levers gives you precision in reliability engineering.
| Component | What It Defines | Example Values | Tuning Implications |
|---|---|---|---|
| SLI | What property to measure | Availability, Latency, Error rate, Throughput | Changing SLI redefines what 'reliability' means for the service |
| Threshold | Boundary of acceptable behavior | < 200ms, < 1% errors, > 99.5% success | Stricter thresholds increase error budget consumption |
| Target % | How often threshold must be met | 99.0%, 99.9%, 99.99% | Each additional '9' is exponentially harder to achieve |
| Window | Time period for evaluation | Rolling 28 days, Calendar month, Quarter | Longer windows smooth variance but delay feedback |
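To make these components concrete, here is a minimal sketch (the `SLOTarget` class and `meets_target` helper are illustrative names, not any standard library API) showing one way to represent a target and evaluate a batch of latency measurements against it:

```python
from dataclasses import dataclass

@dataclass
class SLOTarget:
    """One SLO target: SLI, threshold, target percentage, and window."""
    sli: str              # what is measured, e.g. "login request latency"
    threshold_ms: float   # boundary of acceptable behavior
    target_pct: float     # how often the threshold must be met, e.g. 99.9
    window_days: int      # evaluation window, e.g. rolling 30 days

def meets_target(slo: SLOTarget, latencies_ms: list[float]) -> bool:
    """True if the share of 'good' events in the window meets the target percentage."""
    if not latencies_ms:
        return True  # no traffic in the window, so nothing violated the threshold
    good = sum(1 for v in latencies_ms if v < slo.threshold_ms)
    return 100.0 * good / len(latencies_ms) >= slo.target_pct

# The login example above: 99.9% of requests under 500ms over a rolling 30 days.
login_slo = SLOTarget("login request latency", 500, 99.9, 30)
print(meets_target(login_slo, [120, 310, 480, 90, 650]))  # False: 1 of 5 breached
```

Each field maps to one row of the table, which is what makes the levers independently tunable: relaxing `target_pct` during a migration changes nothing about `threshold_ms` or the window.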
Each additional 'nine' in your target represents a 10x reduction in allowed downtime. 99% allows 3.65 days of downtime per year; 99.9% allows 8.76 hours; 99.99% allows 52.6 minutes; 99.999% allows just 5.26 minutes. Before targeting high nines, honestly assess whether your infrastructure, dependencies, and processes can realistically achieve them.
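Those downtime figures follow directly from the target percentage. A quick back-of-the-envelope calculation (assuming a 365-day year) reproduces them:

```python
def allowed_downtime_hours(target_pct: float, window_hours: float = 365 * 24) -> float:
    """Hours of downtime permitted by an availability target over a window."""
    return (1 - target_pct / 100) * window_hours

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% allows {allowed_downtime_hours(target):.2f} hours of downtime per year")
# 99.0% allows 87.60 hours of downtime per year (~3.65 days)
# 99.9% allows 8.76 hours of downtime per year
# 99.99% allows 0.88 hours of downtime per year (~52.6 minutes)
# 99.999% allows 0.09 hours of downtime per year (~5.3 minutes)
```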
The difference between internal and external targets:
Sophisticated organizations often maintain two levels of targets: a stricter internal SLO that the engineering team manages against day to day, and a looser external SLA that is contractually committed to customers.
This buffer serves critical purposes: it gives early warning before a contractual commitment is at risk, absorbs the occasional bad month or surprise incident, and leaves room for planned maintenance and deliberate risk-taking without triggering SLA penalties.
The gap between internal SLO and external SLA represents your safety margin—the reliability buffer you've deliberately built into your commitments.
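That margin is easy to express in time. A small sketch, using hypothetical values of 99.9% for the internal SLO and 99.5% for the external SLA over a 30-day window:

```python
def downtime_hours(target_pct: float, window_hours: float) -> float:
    """Downtime allowance implied by an availability target over a window."""
    return (1 - target_pct / 100) * window_hours

WINDOW_HOURS = 30 * 24        # rolling 30-day window
internal_slo = 99.9           # hypothetical stricter target the team manages against
external_sla = 99.5           # hypothetical looser contractual commitment

internal_budget = downtime_hours(internal_slo, WINDOW_HOURS)   # 0.72 h
external_budget = downtime_hours(external_sla, WINDOW_HOURS)   # 3.60 h
print(f"Safety margin: {external_budget - internal_budget:.2f} hours "
      f"between breaching the internal SLO and breaching the SLA")  # 2.88 hours
```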
The foundational principle of SLO target selection is this: SLOs should reflect user happiness, not internal metrics. An SLO that your team consistently meets while users complain is a failed SLO—it measures the wrong thing or sets the wrong bar.
This user-centric approach requires understanding how users actually experience your service and what thresholds trigger dissatisfaction.
Converting user expectations to SLO targets:
User expectations are typically expressed qualitatively: "fast," "reliable," "available." Translating these into quantitative SLO targets requires a structured approach:
Step 1: Identify the user journey critical points
Not all parts of your service matter equally to users. A 500ms delay in a background analytics call is invisible; a 500ms delay on the checkout button feels sluggish. Map user journeys and identify moments of truth—the interactions where performance directly impacts user satisfaction or conversion.
Step 2: Establish acceptable thresholds through experimentation
Run controlled experiments (A/B tests) with artificially degraded performance to find the point where user behavior changes. This might reveal that users tolerate up to 800ms page loads before bounce rates increase, giving you a data-driven threshold.
Step 3: Determine frequency tolerance
Users can forgive occasional failures. The question is: how often? If 1 in 100 requests failing goes unnoticed but 1 in 20 generates complaints, your target should be between 95% and 99%. Survey data and behavioral analysis inform this range.
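The arithmetic behind that range is simple enough to sketch, using the two tolerance points from the example above:

```python
def success_target_pct(tolerated_failures: int, per_requests: int) -> float:
    """Success-rate target implied by tolerating N failures per M requests."""
    return 100.0 * (1 - tolerated_failures / per_requests)

unnoticed  = success_target_pct(1, 100)  # 1 in 100 failing goes unnoticed -> 99.0%
complaints = success_target_pct(1, 20)   # 1 in 20 failing draws complaints -> 95.0%
print(f"Target lies somewhere between {complaints:.1f}% and {unnoticed:.1f}%")
```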
Pursuing 99.99% reliability for a service where users would be equally happy with 99.5% is not just wasteful—it's actively harmful. Engineering effort spent on unnecessary reliability is effort not spent on new features, security improvements, or technical debt reduction. Over-engineering reliability has opportunity costs that can cripple product velocity.
| User Segment | Critical Interaction | User-Expressed Expectation | Derived SLO Target |
|---|---|---|---|
| Mobile shoppers | Product page load | "Should load before I lose interest" | p95 latency < 2s for 99.5% of requests |
| Enterprise API consumers | API response | "Needs to be reliable for our automation" | 99.95% availability, < 0.1% error rate |
| Video streamers | Playback start | "Should start within a few seconds" | Playback initiation < 4s for 99% of starts |
| Financial traders | Order execution | "Milliseconds matter, no failures" | p99 < 50ms, 99.99% success rate |
| Social media users | Feed loading | "Should just work most of the time" | p90 < 1s for 99% of loads |
User expectations define the ceiling for your SLO targets—what you aspire to. Technical constraints define the floor—what you can realistically achieve given your architecture, dependencies, and investment level.
A critical mistake is setting SLO targets that are architecturally impossible. If your service depends on a third-party API with 99.5% availability, your service mathematically cannot exceed 99.5% availability for operations requiring that dependency. Setting a 99.99% target would be organizational self-deception.
The dependency chain calculation:
In distributed systems, your theoretical maximum availability is the product of your dependencies' availabilities:
Service Availability ≤ Dependency₁ × Dependency₂ × ... × Dependencyₙ
For a service with three dependencies each at 99.9%:
Max Availability ≤ 0.999 × 0.999 × 0.999 = 0.997 (99.7%)
This calculation assumes serial dependencies (all required for operation) and perfect internal reliability. Real-world services typically perform worse due to their own bugs, capacity issues, and operational errors.
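A small sketch of this serial-dependency ceiling calculation, worth rerunning whenever a dependency is added or its observed reliability changes:

```python
import math

def availability_ceiling(dependency_availabilities: list[float]) -> float:
    """Theoretical maximum availability when every dependency is required (serial)."""
    return math.prod(dependency_availabilities)

deps = [0.999, 0.999, 0.999]                         # three serial dependencies at 99.9%
print(f"Ceiling: {availability_ceiling(deps):.4%}")  # Ceiling: 99.7003%
```

Remember this is an upper bound: it ignores your own bugs, capacity issues, and operational errors, and it assumes no redundancy or graceful degradation around the dependencies.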
Building target achievability assessment:
Before committing to an SLO target, conduct an achievability analysis; a short calculation sketch pulling these steps together follows the list:
Historical performance analysis: What has your service actually achieved over the past 3-12 months? A target significantly better than historical performance requires specific improvements to be credible.
Dependency audit: Catalog every external system your service requires and their documented or observed reliability. Calculate your theoretical ceiling.
Failure mode enumeration: List known failure modes and their frequency. Calculate expected error budget consumption from each category.
Gap analysis: If your desired target exceeds achievable reliability, identify specific investments needed to close the gap—and whether those investments are justified by business value.
Stretch factor: Even well-understood systems surprise us. Apply a 10-20% "reality adjustment" to account for unknown failure modes and imperfect mitigation.
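One way to fold these inputs into a single candidate number is sketched below; the combination rule (take the lower of the user-driven floor and the dependency ceiling, then widen the allowed failure budget by the stretch factor) is an assumption of this sketch, not a standard formula:

```python
import math

def synthesize_target(user_floor: float,
                      dependency_availabilities: list[float],
                      historical: float,
                      stretch: float = 0.15) -> dict:
    """Combine achievability inputs into a candidate target (fractions, e.g. 0.995)."""
    ceiling = math.prod(dependency_availabilities)   # dependency-limited maximum
    candidate = min(user_floor, ceiling)             # meet users, but respect the ceiling
    # "Reality adjustment": allow 10-20% more failure than the candidate implies,
    # to cover unknown failure modes and imperfect mitigation.
    reality_adjusted = 1 - (1 - candidate) * (1 + stretch)
    return {
        "ceiling": ceiling,
        "candidate": candidate,
        "reality_adjusted": reality_adjusted,
        "investment_gap": user_floor > ceiling,       # floor unreachable without investment
        "needs_improvement": historical < candidate,  # history says specific work is required
    }

result = synthesize_target(user_floor=0.995,
                           dependency_availabilities=[0.9998, 0.9993, 0.9985, 0.9997],
                           historical=0.991)
print(result)  # ceiling ~0.9973, candidate 0.995, reality_adjusted ~0.994, needs_improvement True
```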
Reliability investment follows an exponential cost curve. Moving from 99% to 99.9% might require 2x infrastructure spending. Moving from 99.9% to 99.99% might require 5-10x. Moving from 99.99% to 99.999% can require 20x or more, including redundant datacenters, specialized engineering teams, and sophisticated automation. Ensure business value justifies these investments.
SLO targets are not purely technical decisions—they encode business tradeoffs. Every improvement in reliability comes at a cost: engineering time, infrastructure spending, feature velocity, or operational burden. SLO target selection is fundamentally a negotiation between reliability aspirations and business realities.
The stakeholder landscape:
Different stakeholders have different (often conflicting) perspectives on SLO targets: sales and account teams want aggressive commitments that win deals, product managers worry about the feature velocity that reliability work displaces, engineering teams want targets they can sustainably achieve, and finance scrutinizes the infrastructure cost of each additional nine.
Effective SLO target selection requires acknowledging and balancing these perspectives—not optimizing for one at others' expense.
Documenting the target decision:
SLO target decisions should be documented with their rationale, not just the final number. This documentation lets future teams understand why the number was chosen, makes it straightforward to revisit the decision when dependencies, architecture, or business context change, and gives you something concrete to point to when stakeholders challenge the target.
A well-documented SLO target decision includes the chosen SLI, threshold, target percentage, and window; the dependency and historical analysis behind them; the user-expectation evidence considered; the alternatives that were rejected and why; and a scheduled review date.
Framing SLO targets in terms of error budgets often makes business negotiations more productive. Instead of 'We need 99.9% availability,' try 'We're proposing 8.7 hours of allowable downtime per year. This gives us room for 4-5 minor incidents and 1 moderate incident while still meeting our commitment. Is this risk profile acceptable given our customer base and competitive position?'
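The same framing is easy to back with arithmetic. A rough sketch, with assumed incident durations (one hour for a minor incident, four hours for a moderate one):

```python
def yearly_budget_hours(target_pct: float) -> float:
    """Downtime budget in hours per year for an availability target."""
    return (1 - target_pct / 100) * 365 * 24

budget = yearly_budget_hours(99.9)   # about 8.76 hours per year
minor_h, moderate_h = 1.0, 4.0       # assumed durations, not from any standard
remaining = budget - moderate_h      # spend one moderate incident
print(f"{budget:.2f} h budget; after one moderate incident, "
      f"room for about {remaining / minor_h:.0f} minor incidents")
# 8.76 h budget; after one moderate incident, room for about 5 minor incidents
```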
Synthesizing user expectations, technical constraints, and business alignment into a concrete SLO target requires a structured methodology. Here's a practical framework used by mature SRE organizations:
```markdown
# SLO Target Calculation Worksheet

## 1. Dependency Analysis

| Dependency | Stated SLA | Observed Reliability | Notes |
|---|---|---|---|
| Cloud Provider (Compute) | 99.99% | 99.98% | 2 incidents last year |
| Database (Managed) | 99.95% | 99.93% | Failover twice per month |
| Payment Gateway | 99.9% | 99.85% | Degradation events |
| CDN | 99.99% | 99.97% | Edge caching helps |

## 2. Theoretical Maximum (Serial Dependencies)

Max = 0.9998 × 0.9993 × 0.9985 × 0.9997 = 0.9973 (99.73%)

## 3. Historical Performance

- Past 12 months actual: 99.1%
- Best quarter: 99.6%
- Worst quarter: 98.4%
- Root causes of gaps: 60% deployment issues, 25% dependency failures, 15% capacity

## 4. User Expectation Input

- Survey result: Users expect "occasional" issues (~1/week acceptable)
- Behavioral data: Abandonment spikes at >3s latency, >2% error rate
- Support correlation: Complaints spike when availability drops below 99.5% weekly

## 5. Target Synthesis

- Floor (minimum acceptable): 99.5% (user expectation driven)
- Ceiling (maximum achievable): 99.73% (dependency limited)
- Investment-adjusted ceiling: 99.5% (current investment level)
- Reality-adjusted target: 99.3% (accounts for unknown failure modes)

## 6. Recommended SLO

- Internal SLO: 99.5% availability over rolling 30 days
- External SLA: 99.0% (provides buffer for contractual obligations)

## 7. Investment Required for Higher Target

- To reach 99.7%: Add redundant payment gateway ($50K/year + integration)
- To reach 99.9%: Above + dedicated SRE hire + automated failover ($300K/year)
- To reach 99.95%: Above + multi-region deployment ($800K/year)
```

Even experienced organizations make predictable mistakes in SLO target selection. Awareness of these pitfalls helps avoid them:
Beware the organizational pressure to only ever tighten targets. If you achieve 99.95% when targeting 99.9%, leadership may demand 99.95% as the new baseline—ignoring that the overachievement might have been lucky or unsustainable. Protect your team from this ratchet by focusing on consistent achievement of current targets rather than occasional over-performance.
Detecting that your targets are wrong:
Even carefully selected targets can prove to be miscalibrated. Watch for these signals:
Signals that targets are too aggressive: the error budget is chronically exhausted despite genuine engineering effort, feature work is repeatedly frozen, on-call load and alert fatigue keep climbing, and the team starts treating the SLO as background noise rather than a commitment.
Signals that targets are too lenient: you consistently meet the SLO with large amounts of error budget to spare, yet users still complain about reliability, support tickets spike during periods the SLO calls healthy, and the SLO never actually influences a prioritization decision.
Signals that targets are measuring the wrong thing: the SLO stays green during incidents users clearly notice (or goes red when users are unaffected), and there is little correlation between SLO compliance and user-satisfaction or support-volume trends.
For services without historical data or organizations new to SLO practice, selecting targets can feel like guesswork. The solution is to embrace uncertainty through provisional targets—initial commitments designed to be refined based on empirical feedback.
The provisional target approach (a small sketch of what a provisional target record might look like follows these steps):
Start from reasonable defaults: Industry experience provides useful starting points. For typical web services, 99.5% availability and p95 latency under 500ms are reasonable initial assumptions.
Explicitly label as provisional: Document that the target is expected to change within 2-3 months as data accumulates.
Instrument thoroughly: Ensure comprehensive SLI measurement from day one. You can't refine targets without data.
Schedule early review: Plan target reassessment after 30-60 days of production data.
Gather qualitative feedback: During the provisional period, actively solicit user and stakeholder feedback on perceived reliability.
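As a concrete illustration of "explicitly label as provisional" and "schedule early review", here is a minimal sketch of a provisional target record (the field names and the 60-day review interval are assumptions, not a standard):

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class ProvisionalSLO:
    """A deliberately temporary target that carries its own review date."""
    service: str
    availability_target_pct: float           # e.g. 99.5
    latency_p95_ms: int                      # e.g. 500
    window_days: int = 30
    provisional: bool = True                 # labeled so nobody treats it as final
    review_date: date = field(default_factory=lambda: date.today() + timedelta(days=60))

slo = ProvisionalSLO("checkout-api", availability_target_pct=99.5, latency_p95_ms=500)
print(f"{slo.service}: {slo.availability_target_pct}% availability, "
      f"p95 < {slo.latency_p95_ms} ms (provisional, review by {slo.review_date})")
```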
| Service Type | Initial Availability Target | Initial Latency Target | Rationale |
|---|---|---|---|
| Consumer web application | 99.5% | p95 < 500ms | Users have alternatives; moderate tolerance for issues |
| Mobile application backend | 99.5% | p95 < 300ms | Mobile users expect responsiveness; network adds latency |
| Enterprise SaaS | 99.9% | p95 < 1s | Business users have higher expectations; workflows depend on it |
| Internal tooling | 99.0% | p95 < 2s | Captive audience; reliability less critical than for external users |
| Payment/financial systems | 99.95% | p99 < 500ms | Money is involved; errors have serious consequences |
| Real-time communication | 99.9% | p99 < 200ms | Latency directly perceptible; users are sensitive to delays |
| Batch processing | 95.0% (jobs complete on time) | N/A | Delayed processing often acceptable; focus on completion |
It's psychologically and organizationally easier to tighten targets after consistently meeting them than to relax targets after consistently missing them. When uncertain, start with more lenient targets and increase stringency as capability demonstrates it's warranted. This builds confidence and avoids early demoralization.
SLO target selection is among the most strategically important decisions in reliability engineering. Get it right, and SLOs become powerful tools for alignment, prioritization, and continuous improvement. Get it wrong, and they become sources of organizational dysfunction—either ignored as unachievable or dismissed as unchallenging.
You now understand the comprehensive framework for selecting SLO targets that balance user expectations, technical reality, and business objectives. Next, we'll explore error budgets—the mechanism that transforms SLO targets into actionable decision frameworks that align development velocity with reliability investment.