Setting an SLO is not a one-time decision. Services evolve, user expectations shift, business priorities change, and what was appropriate yesterday may be wrong tomorrow. The practice of systematic SLO review and adjustment is what separates mature SRE organizations from those that treat SLOs as bureaucratic artifacts.
Why SLOs require ongoing review:
Without deliberate review practices, SLOs drift into irrelevance. Either they become too lenient (always met, no signal) or too aggressive (always missed, learned helplessness). Neither state produces value.
By the end of this page, you'll understand how to establish SLO review cadences, what signals indicate SLOs need adjustment, how to analyze SLO effectiveness, the process for making adjustments, how to manage stakeholder communication during changes, and how to document and version your SLOs over time.
Effective SLO governance requires predictable review rhythms. Different aspects of SLOs benefit from different review frequencies:
Three-tier review cadence:
| Review Type | Frequency | Focus | Participants | Outcomes |
|---|---|---|---|---|
| Operational Review | Weekly | Budget health, recent incidents, current trajectory | On-call, team lead | Immediate actions if budget threatened |
| Tactical Review | Monthly | SLO performance trends, alert quality, emerging patterns | Engineering team, product | Tuning decisions, investment priorities |
| Strategic Review | Quarterly | Target appropriateness, measurement validity, business alignment | Engineering leads, product, business stakeholders | SLO target adjustments, strategy changes |
Weekly operational review (15-30 minutes):
This is a quick health check, typically folded into an existing team meeting: confirm error budget health, scan recent incidents, and check the current trajectory against the target.
Output: No formal document. A simple go/no-go decision on release velocity. Escalate to the monthly review if patterns are concerning.
Monthly tactical review (1-2 hours):
A deeper analysis of SLO effectiveness: performance trends, alert quality, and emerging patterns, reviewed with both the engineering team and product.
Output: Written summary of SLO status. Identified action items. Recommendations for quarterly review.
Quarterly strategic review (2-4 hours):
A comprehensive evaluation of SLO appropriateness: whether targets are still right, whether measurements remain valid, and whether SLOs still align with business priorities.
Output: SLO revision proposals. Updated documentation. Stakeholder sign-off on any changes.
Align quarterly strategic reviews with business planning cycles (OKR setting, budget planning, roadmap review). This ensures SLO discussions inform and are informed by broader organizational priorities. SLOs shouldn't exist in isolation from business strategy.
Not every SLO needs adjustment at every review. Learning to recognize the signals that indicate adjustment is warranted helps focus review efforts productively.
Three categories of signals are worth watching for: signals that SLO targets are too aggressive, signals that the SLI definitions themselves need revision, and signals that the available review data is insufficient to support a decision.
One bad month doesn't mean the SLO is wrong. External factors (traffic spikes, dependency issues, one-time events) can cause temporary performance dips. Require sustained patterns (2-3 months minimum) before considering target adjustments. Hasty changes erode SLO credibility.
Beyond simple pass/fail, effective SLO analysis examines whether the SLO is fulfilling its purpose: guiding decisions, reflecting user experience, and driving appropriate organizational behavior.
The SLO effectiveness framework:
| Criterion | What It Measures | Ideal State | Warning Signs |
|---|---|---|---|
| Achievement rate | How often is the SLO met? | 75-95% of periods | Always met (too easy) or rarely met (too hard) |
| Budget utilization | How much budget is consumed? | 50-80% average | <20% (wasted capacity) or >100% (unsustainable) |
| User correlation | Does SLO track user satisfaction? | High correlation | SLO met but users unhappy, or missed but users fine |
| Decision influence | Does budget status affect decisions? | Regularly consulted | Ignored in planning, or causes panic |
| Alert quality | Do SLO alerts correlate with real issues? | 90% actionable | High false positives or missed incidents |
| Investment ROI | Does reliability work improve SLIs? | Measurable improvement | Investments don't move metrics |
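The first two criteria lend themselves to a mechanical check. The sketch below, which assumes the same hypothetical `slo_periods` table used by the analysis queries later on this page, flags SLOs whose achievement rate or average budget utilization falls outside the healthy ranges in the table above:

```sql
-- Minimal sketch: flag SLOs outside the healthy ranges from the effectiveness framework.
-- Assumes a hypothetical slo_periods table (one row per SLO per evaluation period),
-- the same table referenced by the analysis queries further down this page.
SELECT
  service,
  slo_name,
  AVG(CASE WHEN achieved THEN 1 ELSE 0 END) AS achievement_rate,
  AVG(budget_consumed_pct)                  AS avg_budget_consumed,
  CASE
    WHEN AVG(CASE WHEN achieved THEN 1 ELSE 0 END) > 0.95 THEN 'possibly too easy'
    WHEN AVG(CASE WHEN achieved THEN 1 ELSE 0 END) < 0.75 THEN 'possibly too hard'
    ELSE 'no warning flag'
  END AS achievement_assessment,
  CASE
    WHEN AVG(budget_consumed_pct) < 20  THEN 'budget barely used'
    WHEN AVG(budget_consumed_pct) > 100 THEN 'budget routinely exceeded'
    ELSE 'no warning flag'
  END AS budget_assessment
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '12 months'
GROUP BY 1, 2
ORDER BY 1, 2;
```

The remaining criteria (user correlation, decision influence, alert quality, investment ROI) require joining in other data sources or applying qualitative judgment, which is what the rest of this analysis section covers.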
Quantitative analysis methods:
Achievement distribution analysis:
Plot SLO achievement across multiple periods and examine the shape of the distribution. Healthy distributions show the target being met in most but not all periods (roughly the 75-95% range from the framework above). Unhealthy distributions show either constant perfect achievement, which means the target is too easy to produce a signal, or chronic misses, which means the target is unrealistic.
Budget consumption patterns:
Analyze how the error budget is consumed across each window: how much goes to planned activity versus unplanned incidents, and whether consumption is spread evenly or concentrated in a few large events.
Healthy pattern: Mostly planned consumption with occasional unplanned incidents. Unhealthy pattern: Entirely unplanned incident-driven, or budget never meaningfully touched.
Correlation with business metrics:
Compare SLO performance with user-facing signals such as NPS scores and support ticket volume (query 4 in the analysis queries below does exactly this).
Strong correlation = SLO reflects what users care about. Weak correlation = SLI might not measure the right thing.
```sql
-- SLO Effectiveness Analysis Queries
-- Run against your SLO metrics warehouse

-- 1. Achievement rate over time
SELECT
  DATE_TRUNC('month', period_end) as month,
  service,
  slo_name,
  AVG(CASE WHEN achieved THEN 1 ELSE 0 END) as achievement_rate,
  AVG(budget_consumed_pct) as avg_budget_consumed,
  COUNT(*) as periods
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '12 months'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, 2;

-- 2. Budget consumption distribution
SELECT
  service,
  slo_name,
  -- Distribution buckets
  COUNT(*) FILTER (WHERE budget_consumed_pct < 20) as under_20_pct,
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 20 AND 50) as "20_50_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 50 AND 80) as "50_80_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 80 AND 100) as "80_100_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct > 100) as exceeded,
  -- Summary stats
  AVG(budget_consumed_pct) as avg_consumed,
  STDDEV(budget_consumed_pct) as stddev_consumed
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '6 months'
GROUP BY 1, 2;

-- 3. Incident budget consumption breakdown
SELECT
  service,
  incident_category,
  COUNT(*) as incident_count,
  SUM(budget_consumed_minutes) as total_budget_consumed,
  AVG(budget_consumed_minutes) as avg_per_incident,
  SUM(budget_consumed_minutes)
    / SUM(SUM(budget_consumed_minutes)) OVER (PARTITION BY service) * 100
    as pct_of_service_budget
FROM incidents
WHERE incident_date > NOW() - INTERVAL '6 months'
  AND budget_consumed_minutes > 0
GROUP BY 1, 2
ORDER BY 1, total_budget_consumed DESC;

-- 4. SLO vs User Satisfaction Correlation
-- Weekly rollup of SLI performance alongside user-satisfaction metrics,
-- with a rolling correlation computed over the weekly aggregates.
SELECT
  date_trunc('week', s.period_date) as week,
  s.service,
  AVG(s.sli_value) as avg_sli,
  AVG(CASE WHEN s.achieved THEN 1 ELSE 0 END) as achievement_rate,
  AVG(u.nps_score) as avg_nps,
  AVG(u.support_tickets_per_1000_users) as support_rate,
  CORR(AVG(s.sli_value), AVG(u.nps_score)) OVER (
    PARTITION BY s.service
    ORDER BY date_trunc('week', s.period_date)
    ROWS BETWEEN 12 PRECEDING AND CURRENT ROW
  ) as rolling_sli_nps_correlation
FROM slo_daily_rollup s
JOIN user_satisfaction_daily u
  ON s.service = u.service AND s.period_date = u.date
WHERE s.period_date > NOW() - INTERVAL '6 months'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;

-- 5. Alert effectiveness
SELECT
  service,
  alert_name,
  COUNT(*) as total_alerts,
  COUNT(*) FILTER (WHERE resolution = 'true_positive_action_taken') as actionable,
  COUNT(*) FILTER (WHERE resolution = 'false_positive') as false_positive,
  COUNT(*) FILTER (WHERE resolution = 'auto_resolved') as noise,
  ROUND(
    COUNT(*) FILTER (WHERE resolution = 'true_positive_action_taken')::numeric
      / COUNT(*)::numeric * 100, 1
  ) as actionability_rate
FROM alerts
WHERE alert_time > NOW() - INTERVAL '3 months'
  AND is_paging = true
GROUP BY 1, 2
HAVING COUNT(*) > 5
ORDER BY actionability_rate ASC;
```

When analysis indicates an SLO adjustment is warranted, following a structured process ensures changes are well-considered, properly communicated, and appropriately documented.
The adjustment workflow:
Types of SLO adjustments:
Target adjustment (most common): Changing the target percentage (e.g., 99.9% → 99.5% or 99.9% → 99.95%). This adjusts how much error budget you have.
Threshold adjustment: Changing what qualifies as success (e.g., latency threshold from <300ms to <500ms). This changes the SLI definition.
Window adjustment: Changing the evaluation period (e.g., 30 days → 7 days or calendar month → rolling). This affects budget dynamics.
SLI replacement: Replacing the underlying metric entirely (e.g., from error rate to success rate, or from server-side latency to client-perceived latency).
Scope adjustment: Changing what's included in the SLO (e.g., adding new endpoints, excluding background operations, segmenting by user tier).
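To make the budget implications concrete, the error budget for an availability SLO is simply the window length multiplied by the allowed failure fraction. The illustrative calculation below uses generic numbers (not taken from any specific service on this page) to show how a target or window change moves the available minutes:

```sql
-- Illustration only: error budget in minutes = window_minutes * (1 - target).
-- Generic numbers, not tied to any specific service.
SELECT
  30 * 24 * 60 * (1 - 0.999)  AS budget_30d_at_99_90,  -- 43.2 minutes per 30 days
  30 * 24 * 60 * (1 - 0.9995) AS budget_30d_at_99_95,  -- 21.6 minutes per 30 days
  7  * 24 * 60 * (1 - 0.999)  AS budget_7d_at_99_90;   -- ~10.1 minutes per 7 days
```

Halving the allowed failure fraction (99.9% to 99.95%) halves the budget, while a shorter window leaves a smaller absolute budget that resets more quickly, which is why window changes alter planning dynamics as noted in the table that follows.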
Each type has different implications:
| Adjustment Type | Implementation Complexity | Stakeholder Impact | Historical Continuity |
|---|---|---|---|
| Target change | Low (config update) | Medium (budget implications) | Preserved (same metric, different goal) |
| Threshold change | Low-Medium | Medium-High (redefines success) | Partially preserved (same concept, different bar) |
| Window change | Medium | High (changes planning dynamics) | Breaks (not comparable to historical) |
| SLI replacement | High | High (completely different signal) | Breaks (cannot compare old vs new) |
| Scope change | High | Medium | Depends on degree of change |
Every SLO adjustment creates a break in organizational memory. 'We were at 99.7% last year' becomes meaningless if the SLI definition changed. Document adjustments thoroughly, maintain versioned SLO history, and clearly annotate historical charts to indicate definition changes.
SLO adjustments affect multiple stakeholders differently. Effective communication ensures changes are understood, accepted, and don't create downstream problems.
Stakeholder communication matrix:
| Stakeholder | Key Concerns | Communication Focus | Timing |
|---|---|---|---|
| Engineering team | How does this affect our work? Alert changes? | Technical implications, new thresholds, rationale | Before change, involve in analysis |
| Product management | Does this affect roadmap? Customer commitments? | User impact, budget implications for velocity | During analysis, approve proposals |
| Engineering leadership | Resource implications? Trend concerning? | Strategic rationale, investment needs | Approve significant changes |
| Sales/Customer Success | What can we tell customers? SLA impact? | External communication guidance, talking points | Before external communication |
| Legal/Contracts | SLA implications? Contract amendments? | Contractual analysis, risk assessment | If SLA affected, early involvement |
| Customers (if applicable) | Is service getting worse/better? | Transparent explanation, benefit framing | If external SLA changes, formal notice |
Framing adjustments appropriately:
Tightening targets (raising the bar):
Positive framing—this reflects investment in reliability or evolved user expectations:
"Based on user research and our improved infrastructure, we're raising our availability target from 99.9% to 99.95%. This reflects our commitment to best-in-class reliability and aligns with what our enterprise customers now expect."
Relaxing targets (lowering the bar):
This requires careful framing to avoid appearing as a step backward:
"After six months of data, we're adjusting our latency target from p99 < 200ms to p99 < 300ms. Our analysis shows users are equally satisfied at both levels, but the 200ms target was requiring engineering investment that would be better directed at features users are requesting. This allows us to maintain excellent user experience while accelerating product development."
Changing SLI definitions:
Focus on improved accuracy:
"We're updating our availability SLI from server-side uptime to synthetic transaction success rate. This better reflects actual user experience, as our server-side metric wasn't capturing client-facing issues. Historical data under the old definition will be preserved for reference, but going forward, this new metric drives our targets."
Announce SLO changes before they take effect, not after. Proactive communication ('Here's what we're changing and why') builds trust. Reactive communication ('We changed this last month') creates suspicion that changes were made to hide problems. Even for relaxed targets, proactive transparency is better than silent adjustment.
Handling pushback:
"You're lowering your standards" (when relaxing targets) Response: "We're aligning our internal expectations with user reality. Data shows our previous target exceeded what users notice or value. This doesn't mean we're providing worse service—just that we're being honest about what matters."
"You're setting yourself up for failure" (when tightening targets) Response: "We've analyzed our capability thoroughly. This target is achievable with our current infrastructure and intended investments. We wouldn't commit to something we can't deliver."
"This affects our SLA with customers" (for any change) Response: "Let's review the specific SLA language. Internal SLOs and external SLAs are distinct. If SLA needs adjustment, we'll follow the appropriate contractual process."
"Why wasn't I consulted earlier?" Response: "You're right—this is feedback for our process improvement. Let me walk you through our analysis now, and we'll ensure you're included earlier in future adjustments."
SLOs are organizational contracts—they need the same care in documentation and versioning that you'd give to any important agreement. A year from now, someone should be able to understand what SLOs were in effect, why they were set that way, and how they've evolved.
Essential SLO documentation elements:
```yaml
# SLO Document: Payment API Availability
# Version: 3
# Last Updated: 2024-01-15
# Status: Active

metadata:
  service: payment-api
  owner_team: payments-core
  stakeholders:
    - product-payments
    - enterprise-sales
    - customer-success
  external_sla_reference: MSA-v2-AppendixB
  review_schedule: "Quarterly (March, June, September, December)"
  document_location: https://wiki.example.com/slos/payment-api

sli:
  name: "Payment Transaction Availability"
  description: |
    Measures the success rate of payment transaction attempts,
    as perceived by the client API consumer.
  measurement:
    source: "Prometheus metrics from payment-api service"
    numerator: "Sum of 2xx responses to /v1/payments/* endpoints"
    denominator: "Sum of all responses to /v1/payments/* endpoints"
    excludes:
      - "400-level client errors (indicated by x-client-error header)"
      - "Synthetic monitoring traffic (user-agent contains 'synthetic')"
      - "Internal test transactions (x-test-mode header present)"
    formula: |
      sum(rate(http_requests_total{service="payment-api",
        path=~"/v1/payments/.*", status=~"2.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api",
        path=~"/v1/payments/.*", status!~"4..", client_error!="true"}[5m]))

slo:
  target: 99.95
  window: "30 days rolling"
  error_budget_minutes: 21.6  # 30 days * 24 hours * 60 minutes * 0.0005
  rationale: |
    Target of 99.95% reflects:
    - User research (Q3-2023): Enterprise customers expect "five nines minus a bit"
    - Technical analysis: Dependencies support up to 99.97% theoretical maximum
    - Business requirement: Enterprise SLA (MSA-v2) requires 99.9% contractual minimum
    - Safety margin: 0.05% buffer between internal target and external SLA

error_budget_policy:
  healthy:
    threshold: "< 50% consumed"
    actions: "Normal operations, full deployment velocity"
  caution:
    threshold: "50-75% consumed"
    actions: "Increased monitoring, defer risky changes"
  at_risk:
    threshold: "75-90% consumed"
    actions: "Reliability focus, limited deployments"
  critical:
    threshold: "> 90% consumed"
    actions: "Development freeze, all hands on reliability"

alerting:
  burn_rate_alerts:
    - name: "SLOBurnSevere"
      burn_rate: 14.4
      short_window: 5m
      long_window: 1h
      destination: "@payments-oncall (page)"
    - name: "SLOBurnHigh"
      burn_rate: 6
      short_window: 30m
      long_window: 6h
      destination: "@payments-oncall (page)"
    - name: "SLOBurnMedium"
      burn_rate: 3
      short_window: 2h
      long_window: 24h
      destination: "#payments-alerts (ticket)"
  budget_alerts:
    - name: "SLOBudget75"
      threshold: "75% consumed"
      destination: "#payments-eng (notification)"
    - name: "SLOBudget90"
      threshold: "90% consumed"
      destination: "@payments-eng-manager (notification)"

change_history:
  - version: 3
    date: "2024-01-15"
    change: "Tightened target from 99.9% to 99.95%"
    rationale: |
      - Consistent over-achievement (avg 99.97%) for 6 months
      - Enterprise customer feedback requesting tighter commitments
      - New infrastructure investment enables higher reliability
    approved_by: "VP Engineering"
  - version: 2
    date: "2023-07-01"
    change: "Updated SLI to exclude client errors"
    rationale: |
      - 400-level errors were inflating failure count
      - These represent client bugs, not service reliability
    approved_by: "Engineering Manager, Payments"
  - version: 1
    date: "2023-01-01"
    change: "Initial SLO establishment"
    rationale: "New service launch, baseline target"
    approved_by: "VP Engineering"

approval:
  current_version_approved_by: "VP Engineering, Product Director"
  approval_date: "2024-01-10"
  next_review_date: "2024-03-31"
```

Version control best practices: treat the SLO document like any other versioned artifact, recording each revision with its rationale and approver, as the change_history section above illustrates.
For organizations with many SLOs, consider a central SLO registry—a database or service that stores all SLOs, their current targets, and their status. This enables portfolio-level analysis, consistent reporting, and easier governance. Tools like Sloth, Pyrra, and Nobl9 provide registry capabilities.
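If you adopt a registry, even a single table makes portfolio-level questions easy to answer. The sketch below assumes a hypothetical `slo_registry` table with one row per SLO; the tools named above provide equivalent views without hand-written SQL:

```sql
-- Minimal sketch against a hypothetical slo_registry table
-- (columns assumed: service, slo_name, target, owner_team, status, next_review_date).

-- Portfolio overview: SLO coverage and target spread per owning team
SELECT
  owner_team,
  COUNT(*)                AS slo_count,
  COUNT(DISTINCT service) AS services_covered,
  MIN(target)             AS loosest_target,
  MAX(target)             AS tightest_target
FROM slo_registry
WHERE status = 'active'
GROUP BY 1
ORDER BY slo_count DESC;

-- Governance check: SLOs that have missed their scheduled review
SELECT service, slo_name, owner_team, next_review_date
FROM slo_registry
WHERE status = 'active'
  AND next_review_date < CURRENT_DATE
ORDER BY next_review_date;
```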
SLO review isn't just about adjusting targets—it's an opportunity for broader reliability improvement. Effective organizations use SLO reviews as a vehicle for continuous improvement.
Post-review improvement activities:
Building a reliability improvement backlog:
Every SLO review should produce actionable items. Maintain a dedicated reliability improvement backlog that captures each identified issue, its expected impact on the SLI or error budget, and the estimated effort to address it.
Prioritize this backlog based on expected SLI improvement per unit of effort. A small fix that eliminates 20% of error budget consumption is more valuable than a large project that might save 5%.
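One way to make that prioritization concrete is a simple impact-per-effort score. The sketch below assumes a hypothetical `reliability_backlog` table in which each item carries an estimate of the error-budget minutes it would save per month and the effort needed to deliver it:

```sql
-- Minimal sketch: rank backlog items by estimated budget saved per unit of effort.
-- Assumes a hypothetical reliability_backlog table with per-item estimates.
SELECT
  item,
  service,
  est_budget_minutes_saved_per_month,
  est_effort_days,
  ROUND(
    est_budget_minutes_saved_per_month::numeric
      / NULLIF(est_effort_days, 0)::numeric,
    1
  ) AS minutes_saved_per_effort_day
FROM reliability_backlog
WHERE status = 'open'
ORDER BY minutes_saved_per_effort_day DESC NULLS LAST;
```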
Measuring SLO program maturity:
As your SLO practice evolves, track program-level metrics in addition to individual SLO performance: how much of your service portfolio is covered by SLOs, whether scheduled reviews actually happen, and how often error-budget status genuinely influences planning decisions (the decision-influence criterion from the effectiveness framework above).
Effective SLO practice creates a virtuous cycle: SLOs reveal reliability gaps → Improvements are made → SLO performance improves → Confidence grows → More ambitious SLOs are set → New gaps are revealed → The cycle continues. This continuous improvement is the ultimate purpose of SLOs—not just measuring reliability, but systematically improving it.
SLOs are not "set and forget" artifacts—they are living commitments that require ongoing attention, evaluation, and adjustment. The practice of SLO review and adjustment is what ensures your reliability targets remain meaningful, achievable, and aligned with user needs over time.
You have now completed the comprehensive module on Setting SLOs. You've mastered SLO target selection, error budgets, burn rate alerting, SLO-based alerting strategies, and the ongoing practice of reviewing and adjusting SLOs. These skills form the operational foundation of Site Reliability Engineering—transforming reliability from an aspiration into a measurable, manageable, and continuously improving practice.