Setting an SLO is not a one-time decision. Services evolve, user expectations shift, business priorities change, and what was appropriate yesterday may be wrong tomorrow. The practice of systematic SLO review and adjustment is what separates mature SRE organizations from those that treat SLOs as bureaucratic artifacts.
Why SLOs require ongoing review:
Without deliberate review practices, SLOs drift into irrelevance. Either they become too lenient (always met, no signal) or too aggressive (always missed, learned helplessness). Neither state produces value.
By the end of this page, you'll understand how to establish SLO review cadences, what signals indicate SLOs need adjustment, how to analyze SLO effectiveness, the process for making adjustments, how to manage stakeholder communication during changes, and how to document and version your SLOs over time.
Effective SLO governance requires predictable review rhythms. Different aspects of SLOs benefit from different review frequencies:
Three-tier review cadence:
| Review Type | Frequency | Focus | Participants | Outcomes |
|---|---|---|---|---|
| Operational Review | Weekly | Budget health, recent incidents, current trajectory | On-call, team lead | Immediate actions if budget threatened |
| Tactical Review | Monthly | SLO performance trends, alert quality, emerging patterns | Engineering team, product | Tuning decisions, investment priorities |
| Strategic Review | Quarterly | Target appropriateness, measurement validity, business alignment | Engineering leads, product, business stakeholders | SLO target adjustments, strategy changes |
Weekly operational review (15-30 minutes):
This is a quick health check, typically folded into an existing team meeting: confirm error budget health, scan recent incidents, and check the current trajectory against the target.
Output: No formal document. A simple go/no-go decision on release velocity. Escalate to the monthly review if patterns are concerning.
Monthly tactical review (1-2 hours):
A deeper analysis of SLO effectiveness: performance trends, alert quality, and emerging patterns, reviewed with both the engineering team and product.
Output: Written summary of SLO status. Identified action items. Recommendations for quarterly review.
Quarterly strategic review (2-4 hours):
A comprehensive evaluation of SLO appropriateness: whether targets are still right, whether measurements remain valid, and whether SLOs still align with business priorities.
Output: SLO revision proposals. Updated documentation. Stakeholder sign-off on any changes.
Align quarterly strategic reviews with business planning cycles (OKR setting, budget planning, roadmap review). This ensures SLO discussions inform and are informed by broader organizational priorities. SLOs shouldn't exist in isolation from business strategy.
Not every SLO needs adjustment at every review. Learning to recognize the signals that indicate adjustment is warranted helps focus review efforts productively.
Three categories of signals are worth watching for: signals that SLO targets are too aggressive, signals that the SLI definitions themselves need revision, and signals that the available review data is insufficient to support a decision.
One bad month doesn't mean the SLO is wrong. External factors (traffic spikes, dependency issues, one-time events) can cause temporary performance dips. Require sustained patterns (2-3 months minimum) before considering target adjustments. Hasty changes erode SLO credibility.
Beyond simple pass/fail, effective SLO analysis examines whether the SLO is fulfilling its purpose: guiding decisions, reflecting user experience, and driving appropriate organizational behavior.
The SLO effectiveness framework:
| Criterion | What It Measures | Ideal State | Warning Signs |
|---|---|---|---|
| Achievement rate | How often is the SLO met? | 75-95% of periods | Always met (too easy) or rarely met (too hard) |
| Budget utilization | How much budget is consumed? | 50-80% average | <20% (wasted capacity) or >100% (unsustainable) |
| User correlation | Does SLO track user satisfaction? | High correlation | SLO met but users unhappy, or missed but users fine |
| Decision influence | Does budget status affect decisions? | Regularly consulted | Ignored in planning, or causes panic |
| Alert quality | Do SLO alerts correlate with real issues? | 90% actionable | High false positives or missed incidents |
| Investment ROI | Does reliability work improve SLIs? | Measurable improvement | Investments don't move metrics |
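The first two criteria lend themselves to a mechanical check. The sketch below, which assumes the same hypothetical `slo_periods` table used by the analysis queries later on this page, flags SLOs whose achievement rate or average budget utilization falls outside the healthy ranges in the table above:

```sql
-- Minimal sketch: flag SLOs outside the healthy ranges from the effectiveness framework.
-- Assumes a hypothetical slo_periods table (one row per SLO per evaluation period),
-- the same table referenced by the analysis queries further down this page.
SELECT
  service,
  slo_name,
  AVG(CASE WHEN achieved THEN 1 ELSE 0 END) AS achievement_rate,
  AVG(budget_consumed_pct)                  AS avg_budget_consumed,
  CASE
    WHEN AVG(CASE WHEN achieved THEN 1 ELSE 0 END) > 0.95 THEN 'possibly too easy'
    WHEN AVG(CASE WHEN achieved THEN 1 ELSE 0 END) < 0.75 THEN 'possibly too hard'
    ELSE 'no warning flag'
  END AS achievement_assessment,
  CASE
    WHEN AVG(budget_consumed_pct) < 20  THEN 'budget barely used'
    WHEN AVG(budget_consumed_pct) > 100 THEN 'budget routinely exceeded'
    ELSE 'no warning flag'
  END AS budget_assessment
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '12 months'
GROUP BY 1, 2
ORDER BY 1, 2;
```

The remaining criteria (user correlation, decision influence, alert quality, investment ROI) require joining in other data sources or applying qualitative judgment, which is what the rest of this analysis section covers.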
Quantitative analysis methods:
Achievement distribution analysis:
Plot SLO achievement across multiple periods and examine the shape of the distribution. Healthy distributions show the target being met in most but not all periods (roughly the 75-95% range from the framework above). Unhealthy distributions show either constant perfect achievement, which means the target is too easy to produce a signal, or chronic misses, which means the target is unrealistic.
Budget consumption patterns:
Analyze how the error budget is consumed across each window: how much goes to planned activity versus unplanned incidents, and whether consumption is spread evenly or concentrated in a few large events.
Healthy pattern: Mostly planned consumption with occasional unplanned incidents. Unhealthy pattern: Entirely unplanned incident-driven, or budget never meaningfully touched.
Correlation with business metrics:
Compare SLO performance with user-facing signals such as NPS scores and support ticket volume (query 4 in the analysis queries below does exactly this).
Strong correlation = SLO reflects what users care about. Weak correlation = SLI might not measure the right thing.
```sql
-- SLO Effectiveness Analysis Queries
-- Run against your SLO metrics warehouse

-- 1. Achievement rate over time
SELECT
  DATE_TRUNC('month', period_end) as month,
  service,
  slo_name,
  AVG(CASE WHEN achieved THEN 1 ELSE 0 END) as achievement_rate,
  AVG(budget_consumed_pct) as avg_budget_consumed,
  COUNT(*) as periods
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '12 months'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, 2;

-- 2. Budget consumption distribution
SELECT
  service,
  slo_name,
  -- Distribution buckets
  COUNT(*) FILTER (WHERE budget_consumed_pct < 20) as under_20_pct,
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 20 AND 50) as "20_50_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 50 AND 80) as "50_80_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct BETWEEN 80 AND 100) as "80_100_pct",
  COUNT(*) FILTER (WHERE budget_consumed_pct > 100) as exceeded,
  -- Summary stats
  AVG(budget_consumed_pct) as avg_consumed,
  STDDEV(budget_consumed_pct) as stddev_consumed
FROM slo_periods
WHERE period_end > NOW() - INTERVAL '6 months'
GROUP BY 1, 2;

-- 3. Incident budget consumption breakdown
SELECT
  service,
  incident_category,
  COUNT(*) as incident_count,
  SUM(budget_consumed_minutes) as total_budget_consumed,
  AVG(budget_consumed_minutes) as avg_per_incident,
  SUM(budget_consumed_minutes)
    / SUM(SUM(budget_consumed_minutes)) OVER (PARTITION BY service) * 100
    as pct_of_service_budget
FROM incidents
WHERE incident_date > NOW() - INTERVAL '6 months'
  AND budget_consumed_minutes > 0
GROUP BY 1, 2
ORDER BY 1, total_budget_consumed DESC;

-- 4. SLO vs User Satisfaction Correlation
-- Weekly rollup of SLI performance alongside user-satisfaction metrics,
-- with a rolling correlation computed over the weekly aggregates.
SELECT
  date_trunc('week', s.period_date) as week,
  s.service,
  AVG(s.sli_value) as avg_sli,
  AVG(CASE WHEN s.achieved THEN 1 ELSE 0 END) as achievement_rate,
  AVG(u.nps_score) as avg_nps,
  AVG(u.support_tickets_per_1000_users) as support_rate,
  CORR(AVG(s.sli_value), AVG(u.nps_score)) OVER (
    PARTITION BY s.service
    ORDER BY date_trunc('week', s.period_date)
    ROWS BETWEEN 12 PRECEDING AND CURRENT ROW
  ) as rolling_sli_nps_correlation
FROM slo_daily_rollup s
JOIN user_satisfaction_daily u
  ON s.service = u.service AND s.period_date = u.date
WHERE s.period_date > NOW() - INTERVAL '6 months'
GROUP BY 1, 2
ORDER BY 1 DESC, 2;

-- 5. Alert effectiveness
SELECT
  service,
  alert_name,
  COUNT(*) as total_alerts,
  COUNT(*) FILTER (WHERE resolution = 'true_positive_action_taken') as actionable,
  COUNT(*) FILTER (WHERE resolution = 'false_positive') as false_positive,
  COUNT(*) FILTER (WHERE resolution = 'auto_resolved') as noise,
  ROUND(
    COUNT(*) FILTER (WHERE resolution = 'true_positive_action_taken')::numeric
      / COUNT(*)::numeric * 100, 1
  ) as actionability_rate
FROM alerts
WHERE alert_time > NOW() - INTERVAL '3 months'
  AND is_paging = true
GROUP BY 1, 2
HAVING COUNT(*) > 5
ORDER BY actionability_rate ASC;
```

When analysis indicates an SLO adjustment is warranted, following a structured process ensures changes are well-considered, properly communicated, and appropriately documented.
The adjustment workflow:
Types of SLO adjustments:
Target adjustment (most common): Changing the target percentage (e.g., 99.9% → 99.5% or 99.9% → 99.95%). This adjusts how much error budget you have.
Threshold adjustment: Changing what qualifies as success (e.g., latency threshold from <300ms to <500ms). This changes the SLI definition.
Window adjustment: Changing the evaluation period (e.g., 30 days → 7 days or calendar month → rolling). This affects budget dynamics.
SLI replacement: Replacing the underlying metric entirely (e.g., from error rate to success rate, or from server-side latency to client-perceived latency).
Scope adjustment: Changing what's included in the SLO (e.g., adding new endpoints, excluding background operations, segmenting by user tier).
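To make the budget implications concrete, the error budget for an availability SLO is simply the window length multiplied by the allowed failure fraction. The illustrative calculation below uses generic numbers (not taken from any specific service on this page) to show how a target or window change moves the available minutes:

```sql
-- Illustration only: error budget in minutes = window_minutes * (1 - target).
-- Generic numbers, not tied to any specific service.
SELECT
  30 * 24 * 60 * (1 - 0.999)  AS budget_30d_at_99_90,  -- 43.2 minutes per 30 days
  30 * 24 * 60 * (1 - 0.9995) AS budget_30d_at_99_95,  -- 21.6 minutes per 30 days
  7  * 24 * 60 * (1 - 0.999)  AS budget_7d_at_99_90;   -- ~10.1 minutes per 7 days
```

Halving the allowed failure fraction (99.9% to 99.95%) halves the budget, while a shorter window leaves a smaller absolute budget that resets more quickly, which is why window changes alter planning dynamics as noted in the table that follows.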
Each type has different implications:
| Adjustment Type | Implementation Complexity | Stakeholder Impact | Historical Continuity |
|---|---|---|---|
| Target change | Low (config update) | Medium (budget implications) | Preserved (same metric, different goal) |
| Threshold change | Low-Medium | Medium-High (redefines success) | Partially preserved (same concept, different bar) |
| Window change | Medium | High (changes planning dynamics) | Breaks (not comparable to historical) |
| SLI replacement | High | High (completely different signal) | Breaks (cannot compare old vs new) |
| Scope change | High | Medium | Depends on degree of change |
Every SLO adjustment creates a break in organizational memory. 'We were at 99.7% last year' becomes meaningless if the SLI definition changed. Document adjustments thoroughly, maintain versioned SLO history, and clearly annotate historical charts to indicate definition changes.
SLO adjustments affect multiple stakeholders differently. Effective communication ensures changes are understood, accepted, and don't create downstream problems.
Stakeholder communication matrix:
| Stakeholder | Key Concerns | Communication Focus | Timing |
|---|---|---|---|
| Engineering team | How does this affect our work? Alert changes? | Technical implications, new thresholds, rationale | Before change, involve in analysis |
| Product management | Does this affect roadmap? Customer commitments? | User impact, budget implications for velocity | During analysis, approve proposals |
| Engineering leadership | Resource implications? Trend concerning? | Strategic rationale, investment needs | Approve significant changes |
| Sales/Customer Success | What can we tell customers? SLA impact? | External communication guidance, talking points | Before external communication |
| Legal/Contracts | SLA implications? Contract amendments? | Contractual analysis, risk assessment | If SLA affected, early involvement |
| Customers (if applicable) | Is service getting worse/better? | Transparent explanation, benefit framing | If external SLA changes, formal notice |
Framing adjustments appropriately:
Tightening targets (raising the bar):
Positive framing—this reflects investment in reliability or evolved user expectations:
"Based on user research and our improved infrastructure, we're raising our availability target from 99.9% to 99.95%. This reflects our commitment to best-in-class reliability and aligns with what our enterprise customers now expect."
Relaxing targets (lowering the bar):
This requires careful framing to avoid appearing as a step backward:
"After six months of data, we're adjusting our latency target from p99 < 200ms to p99 < 300ms. Our analysis shows users are equally satisfied at both levels, but the 200ms target was requiring engineering investment that would be better directed at features users are requesting. This allows us to maintain excellent user experience while accelerating product development."
Changing SLI definitions:
Focus on improved accuracy:
"We're updating our availability SLI from server-side uptime to synthetic transaction success rate. This better reflects actual user experience, as our server-side metric wasn't capturing client-facing issues. Historical data under the old definition will be preserved for reference, but going forward, this new metric drives our targets."
Announce SLO changes before they take effect, not after. Proactive communication ('Here's what we're changing and why') builds trust. Reactive communication ('We changed this last month') creates suspicion that changes were made to hide problems. Even for relaxed targets, proactive transparency is better than silent adjustment.
Handling pushback:
"You're lowering your standards" (when relaxing targets) Response: "We're aligning our internal expectations with user reality. Data shows our previous target exceeded what users notice or value. This doesn't mean we're providing worse service—just that we're being honest about what matters."
"You're setting yourself up for failure" (when tightening targets) Response: "We've analyzed our capability thoroughly. This target is achievable with our current infrastructure and intended investments. We wouldn't commit to something we can't deliver."
"This affects our SLA with customers" (for any change) Response: "Let's review the specific SLA language. Internal SLOs and external SLAs are distinct. If SLA needs adjustment, we'll follow the appropriate contractual process."
"Why wasn't I consulted earlier?" Response: "You're right—this is feedback for our process improvement. Let me walk you through our analysis now, and we'll ensure you're included earlier in future adjustments."
SLOs are organizational contracts—they need the same care in documentation and versioning that you'd give to any important agreement. A year from now, someone should be able to understand what SLOs were in effect, why they were set that way, and how they've evolved.
Essential SLO documentation elements:
```yaml
# SLO Document: Payment API Availability
# Version: 3
# Last Updated: 2024-01-15
# Status: Active

metadata:
  service: payment-api
  owner_team: payments-core
  stakeholders:
    - product-payments
    - enterprise-sales
    - customer-success
  external_sla_reference: MSA-v2-AppendixB
  review_schedule: "Quarterly (March, June, September, December)"
  document_location: https://wiki.example.com/slos/payment-api

sli:
  name: "Payment Transaction Availability"
  description: |
    Measures the success rate of payment transaction attempts,
    as perceived by the client API consumer.
  measurement:
    source: "Prometheus metrics from payment-api service"
    numerator: "Sum of 2xx responses to /v1/payments/* endpoints"
    denominator: "Sum of all responses to /v1/payments/* endpoints"
    excludes:
      - "400-level client errors (indicated by x-client-error header)"
      - "Synthetic monitoring traffic (user-agent contains 'synthetic')"
      - "Internal test transactions (x-test-mode header present)"
    formula: |
      sum(rate(http_requests_total{service="payment-api",
        path=~"/v1/payments/.*", status=~"2.."}[5m]))
      /
      sum(rate(http_requests_total{service="payment-api",
        path=~"/v1/payments/.*", status!~"4..", client_error!="true"}[5m]))

slo:
  target: 99.95
  window: "30 days rolling"
  error_budget_minutes: 21.6  # 30 days * 24 hours * 60 minutes * 0.0005
  rationale: |
    Target of 99.95% reflects:
    - User research (Q3-2023): Enterprise customers expect "five nines minus a bit"
    - Technical analysis: Dependencies support up to 99.97% theoretical maximum
    - Business requirement: Enterprise SLA (MSA-v2) requires 99.9% contractual minimum
    - Safety margin: 0.05% buffer between internal target and external SLA

error_budget_policy:
  healthy:
    threshold: "< 50% consumed"
    actions: "Normal operations, full deployment velocity"
  caution:
    threshold: "50-75% consumed"
    actions: "Increased monitoring, defer risky changes"
  at_risk:
    threshold: "75-90% consumed"
    actions: "Reliability focus, limited deployments"
  critical:
    threshold: "> 90% consumed"
    actions: "Development freeze, all hands on reliability"

alerting:
  burn_rate_alerts:
    - name: "SLOBurnSevere"
      burn_rate: 14.4
      short_window: 5m
      long_window: 1h
      destination: "@payments-oncall (page)"
    - name: "SLOBurnHigh"
      burn_rate: 6
      short_window: 30m
      long_window: 6h
      destination: "@payments-oncall (page)"
    - name: "SLOBurnMedium"
      burn_rate: 3
      short_window: 2h
      long_window: 24h
      destination: "#payments-alerts (ticket)"
  budget_alerts:
    - name: "SLOBudget75"
      threshold: "75% consumed"
      destination: "#payments-eng (notification)"
    - name: "SLOBudget90"
      threshold: "90% consumed"
      destination: "@payments-eng-manager (notification)"

change_history:
  - version: 3
    date: "2024-01-15"
    change: "Tightened target from 99.9% to 99.95%"
    rationale: |
      - Consistent over-achievement (avg 99.97%) for 6 months
      - Enterprise customer feedback requesting tighter commitments
      - New infrastructure investment enables higher reliability
    approved_by: "VP Engineering"
  - version: 2
    date: "2023-07-01"
    change: "Updated SLI to exclude client errors"
    rationale: |
      - 400-level errors were inflating failure count
      - These represent client bugs, not service reliability
    approved_by: "Engineering Manager, Payments"
  - version: 1
    date: "2023-01-01"
    change: "Initial SLO establishment"
    rationale: "New service launch, baseline target"
    approved_by: "VP Engineering"

approval:
  current_version_approved_by: "VP Engineering, Product Director"
  approval_date: "2024-01-10"
  next_review_date: "2024-03-31"
```

Version control best practices: treat the SLO document like any other versioned artifact, recording each revision with its rationale and approver, as the change_history section above illustrates.
For organizations with many SLOs, consider a central SLO registry—a database or service that stores all SLOs, their current targets, and their status. This enables portfolio-level analysis, consistent reporting, and easier governance. Tools like Sloth, Pyrra, and Nobl9 provide registry capabilities.
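If you adopt a registry, even a single table makes portfolio-level questions easy to answer. The sketch below assumes a hypothetical `slo_registry` table with one row per SLO; the tools named above provide equivalent views without hand-written SQL:

```sql
-- Minimal sketch against a hypothetical slo_registry table
-- (columns assumed: service, slo_name, target, owner_team, status, next_review_date).

-- Portfolio overview: SLO coverage and target spread per owning team
SELECT
  owner_team,
  COUNT(*)                AS slo_count,
  COUNT(DISTINCT service) AS services_covered,
  MIN(target)             AS loosest_target,
  MAX(target)             AS tightest_target
FROM slo_registry
WHERE status = 'active'
GROUP BY 1
ORDER BY slo_count DESC;

-- Governance check: SLOs that have missed their scheduled review
SELECT service, slo_name, owner_team, next_review_date
FROM slo_registry
WHERE status = 'active'
  AND next_review_date < CURRENT_DATE
ORDER BY next_review_date;
```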
SLO review isn't just about adjusting targets—it's an opportunity for broader reliability improvement. Effective organizations use SLO reviews as a vehicle for continuous improvement.
Post-review improvement activities:
Building a reliability improvement backlog:
Every SLO review should produce actionable items. Maintain a dedicated reliability improvement backlog that captures each identified issue, its expected impact on the SLI or error budget, and the estimated effort to address it.
Prioritize this backlog based on expected SLI improvement per unit of effort. A small fix that eliminates 20% of error budget consumption is more valuable than a large project that might save 5%.
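One way to make that prioritization concrete is a simple impact-per-effort score. The sketch below assumes a hypothetical `reliability_backlog` table in which each item carries an estimate of the error-budget minutes it would save per month and the effort needed to deliver it:

```sql
-- Minimal sketch: rank backlog items by estimated budget saved per unit of effort.
-- Assumes a hypothetical reliability_backlog table with per-item estimates.
SELECT
  item,
  service,
  est_budget_minutes_saved_per_month,
  est_effort_days,
  ROUND(
    est_budget_minutes_saved_per_month::numeric
      / NULLIF(est_effort_days, 0)::numeric,
    1
  ) AS minutes_saved_per_effort_day
FROM reliability_backlog
WHERE status = 'open'
ORDER BY minutes_saved_per_effort_day DESC NULLS LAST;
```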
Measuring SLO program maturity:
As your SLO practice evolves, track program-level metrics in addition to individual SLO performance: how much of your service portfolio is covered by SLOs, whether scheduled reviews actually happen, and how often error-budget status genuinely influences planning decisions (the decision-influence criterion from the effectiveness framework above).
Effective SLO practice creates a virtuous cycle: SLOs reveal reliability gaps → Improvements are made → SLO performance improves → Confidence grows → More ambitious SLOs are set → New gaps are revealed → The cycle continues. This continuous improvement is the ultimate purpose of SLOs—not just measuring reliability, but systematically improving it.
SLOs are not "set and forget" artifacts—they are living commitments that require ongoing attention, evaluation, and adjustment. The practice of SLO review and adjustment is what ensures your reliability targets remain meaningful, achievable, and aligned with user needs over time.
You have now completed the comprehensive module on Setting SLOs. You've mastered SLO target selection, error budgets, burn rate alerting, SLO-based alerting strategies, and the ongoing practice of reviewing and adjusting SLOs. These skills form the operational foundation of Site Reliability Engineering—transforming reliability from an aspiration into a measurable, manageable, and continuously improving practice.