Traditional alerting evolved organically: add an alert for each symptom, each component, each failure mode. The result? Alert sprawl, alert fatigue, and a fundamental disconnect between what operators respond to and what actually matters to users.
The SLO-based alerting philosophy:
SLO-based alerting inverts the traditional approach. Instead of asking "What can break?" and alerting on each answer, it asks: "Is the user experience degraded, and is the error budget being consumed faster than we can afford?"
This approach doesn't eliminate all cause-based alerts; you still need basic infrastructure monitoring. But it fundamentally changes what pages you and what merely creates tickets. The goal: every page represents a genuine threat to user experience; every ticket represents something worth fixing, but not urgently.
Burn rate alerting (covered in the previous page) is the cornerstone of SLO-based alerting, but the philosophy extends further into alert design, prioritization, and organizational practices.
By the end of this page, you'll understand how to structure an entire alerting strategy around SLOs: symptom vs. cause alerts, alert prioritization based on budget impact, reducing alert fatigue through SLO-based consolidation, integrating SLO context into incident response, and measuring alerting quality against SLO outcomes.
Understanding the distinction between symptom-based and cause-based alerting is foundational to SLO-based alerting design.
Cause-based alerts fire when something in your infrastructure breaks—a server goes down, a process crashes, a disk fills up. They're detecting causes of potential user impact.
Symptom-based alerts fire when user experience degrades—error rates increase, latency rises, availability drops. They're detecting symptoms that users experience.
The SLO connection:
SLOs are fundamentally symptom-based—they measure user experience. Therefore, SLO-based alerting prioritizes symptom detection:
SLO Alerts = "Is the user experience degraded?"
Infrastructure Alerts = "Is something broken internally?"
The key insight is that not all internal breakage affects users, and prioritizing by user impact prevents chasing phantom problems.
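To make the distinction concrete, here is a hypothetical pair of Prometheus alert rules, one symptom-based and one cause-based. The metric names, thresholds, and rule names are illustrative assumptions, not part of the examples later on this page:

```yaml
groups:
  - name: symptom-vs-cause-examples
    rules:
      # Symptom-based: fires only when users are actually seeing errors
      # beyond what a 0.1%-error-budget SLO allows.
      - alert: UserFacingErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error ratio exceeds the 0.1% the SLO allows"

      # Cause-based: fires on internal breakage that may or may not
      # be affecting users right now.
      - alert: DiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data"}
            / node_filesystem_size_bytes{mountpoint="/data"} < 0.10
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Less than 10% disk space remaining on /data"
```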
| Aspect | Symptom-Based Alerts | Cause-Based Alerts |
|---|---|---|
| What it detects | User-facing impact (latency, errors, availability) | Infrastructure issues (crashes, disk, CPU) |
| False positive rate | Lower (if users aren't impacted, alert doesn't fire) | Higher (internal issues may have no user impact) |
| False negative rate | Can miss if instrumentation has gaps | Can miss novel failure modes |
| Actionability | High (known user impact requires response) | Variable (internal fix may or may not be urgent) |
| Correlation with SLO | Direct (symptoms ARE SLI measurements) | Indirect (causes MAY lead to SLO impact) |
| Alert volume | Lower (consolidated by user impact) | Higher (each component can have many alerts) |
Some cause-based alerts warrant paging even without current symptoms—when the cause will inevitably lead to severe symptoms if not addressed. Example: a database failover just consumed your last redundant replica. No symptoms yet, but you're now running without protection. These 'imminent doom' alerts are valid pages even in symptom-first approaches.
Layered alerting architecture:
The recommended approach layers symptom and cause alerts:
Layer 1: SLO/Symptom Alerts (Page): user experience is measurably degraded; a human responds immediately.
Layer 2: Leading Indicator Alerts (Page or Ticket): no user impact yet, but severe symptoms are imminent if nothing is done (the "last redundant replica" case above).
Layer 3: Observational Alerts (Ticket): internal issues worth fixing that don't currently threaten the SLO.
Layer 4: Informational Alerts (Dashboard/Log): signals kept for debugging and trend analysis, never routed to a person.
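One way to wire these layers is through your alert router. The sketch below assumes Alertmanager and a `layer` label attached to each alert rule; the receiver names and keys are placeholders:

```yaml
# Alertmanager routing sketch: map alert layers to delivery channels.
route:
  receiver: dashboard-only                    # Layer 4 default: visible, never notified
  routes:
    - matchers: ['layer="slo-symptom"']       # Layer 1: page immediately
      receiver: pagerduty-oncall
    - matchers: ['layer="leading-indicator"'] # Layer 2: page (or route to tickets if preferred)
      receiver: pagerduty-oncall
    - matchers: ['layer="observational"']     # Layer 3: ticket queue
      receiver: jira-tickets
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: jira-tickets
    webhook_configs:
      - url: https://ticketing.example.com/alertmanager-webhook
  - name: dashboard-only   # no notification config: alerts stay visible in the UI only
```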
Traditional alert severity (Critical, Warning, Info) is often assigned subjectively: "This feels like a critical alert." SLO-based alerting provides an objective framework: severity should reflect error budget impact.
The severity mapping principle:
Severity = f(Budget Impact Rate, Current Budget Status)
An alert that would consume 10% of monthly budget in an hour is more severe than one consuming 1% per day, even if the underlying symptoms look similar. And that same 10%/hour burn is more critical when you have 15% budget remaining than when you have 80%.
| Severity Level | Burn Rate | Time to Exhaust | Remaining Budget Factor | Response Expectation |
|---|---|---|---|---|
| P1 / Critical | 14x | < 2 days | Any, or < 30% remaining | Immediate response, <5 min acknowledgment |
| P2 / High | 6x - 14x | 2-5 days | Any, or < 50% remaining | Response within 30 min |
| P3 / Medium | 3x - 6x | 5-10 days | Any | Response within 4 hours (business hours) |
| P4 / Low | 1x - 3x | 10-30 days | > 50% remaining | Response within 24 hours |
| P5 / Info | < 1x | Not exhausting | Any | Ticket for tracking, no SLA |
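The budget-status and time-to-exhaust figures in this table can be derived from a few recording rules. This is a sketch assuming a 30-day window and the `slo:error_rate:ratio_rate1h`, `slo:error_budget:ratio`, and `slo:burn_rate:ratio_rate1h` series used in the examples later on this page; the `slo:budget_remaining:ratio` and `slo:days_to_exhaustion` names are our own additions:

```yaml
groups:
  - name: slo-budget-status
    rules:
      # Fraction of the 30-day error budget already consumed
      # (time-weighted approximation; request-weighted needs the raw counters).
      - record: slo:budget_consumed:ratio
        expr: avg_over_time(slo:error_rate:ratio_rate1h[30d]) / slo:error_budget:ratio
      # Fraction of the budget still remaining.
      - record: slo:budget_remaining:ratio
        expr: 1 - slo:budget_consumed:ratio
      # Days until exhaustion if the current 1h burn rate is sustained:
      # remaining budget (expressed in window-days) divided by the burn rate.
      - record: slo:days_to_exhaustion
        expr: (slo:budget_remaining:ratio * 30) / slo:burn_rate:ratio_rate1h
```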
Adjusting severity by budget status:
The same burn rate can warrant different severities based on current budget health:
Scenario 1: Budget at 70% remaining. There is headroom, so a moderate burn can be handled at the normal severity for its burn rate tier.
Scenario 2: Budget at 20% remaining. The same burn rate now directly threatens the SLO; escalate it one severity tier and respond with urgency.
Scenario 3: Budget exhausted. The SLO is already violated; any SLO-impacting burn warrants a critical page and incident response.
This adaptive severity ensures that response intensity matches actual risk. An organization with healthy budgets can operate more calmly; one with depleted budgets operates with heightened vigilance.
Implementation approaches:
Static severity (simpler): Assign fixed severity to each burn rate tier. Rely on operators to mentally factor in budget status.
Dynamic severity (sophisticated): Use alert rules that incorporate both burn rate AND current budget:
```python
def severity(burn_rate: float, budget_remaining: float) -> str:
    """Map burn rate and remaining budget fraction (0-1) to a severity tier."""
    if burn_rate > 6 and budget_remaining < 0.30:
        return "P1"
    elif burn_rate > 6:
        return "P2"
    elif burn_rate > 3 and budget_remaining < 0.30:
        return "P2"
    # ... remaining tiers (P3-P5) follow the same pattern
    return "P5"
```
Dynamic severity is more operationally useful but requires more complex alert configuration.
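In Prometheus terms, dynamic severity usually means one alert rule per severity tier, each combining a burn-rate condition with a budget condition. A minimal sketch, assuming the `slo:burn_rate:ratio_rate1h` series used in the enrichment example below and the hypothetical `slo:budget_remaining:ratio` recording rule sketched above:

```yaml
groups:
  - name: slo-dynamic-severity
    rules:
      # P1: fast burn while the budget is already thin.
      - alert: SLOBurnCritical
        expr: |
          slo:burn_rate:ratio_rate1h > 6
          and slo:budget_remaining:ratio < 0.30
        for: 5m
        labels:
          severity: P1
      # P2: fast burn with a healthier budget, or moderate burn on a thin budget.
      - alert: SLOBurnHigh
        expr: |
          (slo:burn_rate:ratio_rate1h > 6 and slo:budget_remaining:ratio >= 0.30)
          or (slo:burn_rate:ratio_rate1h > 3 and slo:budget_remaining:ratio < 0.30)
        for: 15m
        labels:
          severity: P2
```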
Whatever severity scheme you choose, be consistent across services. Operators build intuition based on severity—P1 means drop everything. If P1 means different things for different services, operators lose trust in the system and either over-respond (burnout) or under-respond (missed incidents).
Alert fatigue is the operational equivalent of "crying wolf"—when teams receive too many alerts, they stop taking any of them seriously. SLO-based alerting is a powerful antidote to alert fatigue because it fundamentally reduces alert volume while increasing signal quality.
Why SLO-based alerting reduces fatigue: dozens of per-component alerts collapse into a handful of user-impact alerts, multi-window burn rate conditions suppress transient blips that would otherwise page, and every page that does fire maps to measurable user impact, so responders trust it.
The '5x5' rule of thumb:
A healthy SLO-based alerting configuration should result in roughly: no more than about 5 paging alert types per service, and no more than about 5 pages per week on average.
If a service generates more than 5 pages per week on average, something is wrong—either your reliability needs investment, or your alert thresholds are too sensitive. If you have more than 5 paging alert types, you probably have alert sprawl and should consolidate to SLO-based alerts.
Measuring alert quality:
Track these metrics to assess alert health: the share of pages that corresponded to genuine SLO impact (precision), the share of SLO-impacting incidents that triggered a page (recall), pages per on-call shift, and time from SLO impact to page.
Organizations naturally accumulate alerts over time. Each incident produces new alerts; few are ever removed. Schedule quarterly 'alert pruning' reviews: examine every paging alert's history, remove those that no longer provide signal, and demote cause-based alerts that haven't correlated with SLO impact.
An alert that says "Error rate is 2%" is far less useful than one that says "Error rate is 2%, consuming budget at 20x sustainable rate, 65% budget remaining, projected to exhaust in 1.5 days if sustained." SLO-aware alerts include context that enables faster, better-informed response.
Essential SLO context in alerts includes the current error rate and burn rate, budget consumed and remaining, projected time to exhaustion, the recent trend, any recent deployments, and direct links to the relevant dashboard and runbook, as in the configuration below:
```yaml
# Example: SLO-Enriched Alert Configuration
# Shows how to include rich context in alert annotations

groups:
  - name: slo-alerts-with-context
    rules:
      - alert: SLOBudgetBurn
        expr: |
          slo:burn_rate:ratio_rate1h > 6
          AND
          slo:burn_rate:ratio_rate6h > 6
        for: 5m
        labels:
          severity: warning
          slo_impacting: "true"
        annotations:
          # Summary with key context
          summary: |
            SLO Alert: {{ $labels.service }} burning budget at
            {{ $value | printf "%.1f" }}x sustainable rate

          # Rich description with full context
          description: |
            🚨 **Service**: {{ $labels.service }}

            📊 **Current State**:
            - Error Rate: {{ with query "slo:error_rate:ratio_rate1h{service='$labels.service'}" }}{{ . | first | value | printf "%.3f" }}%{{ end }}
            - Burn Rate: {{ $value | printf "%.1f" }}x sustainable
            - Budget Used (30d): {{ with query "slo:budget_consumed:ratio{service='$labels.service'}" }}{{ . | first | value | printf "%.1f" }}%{{ end }}

            ⏱️ **Projection**: At current rate, budget exhausts in {{ printf "%.1f" (div 30 $value) }} days

            📈 **Trend**: {{ with query "delta(slo:burn_rate:ratio_rate1h{service='$labels.service'}[30m])" }}{{ if gt (. | first | value) 0 }}Worsening{{ else }}Improving{{ end }}{{ end }}

            🔄 **Recent Changes**:
            {{ range query "deployment_timestamp{service='$labels.service', age<'6h'}" }}
            - {{ .Labels.version }} deployed {{ .Labels.age }} ago
            {{ end }}

          # Direct links
          dashboard_url: https://grafana.example.com/d/slo-dashboard?var-service={{ $labels.service }}
          runbook_url: https://runbooks.example.com/slo-burn-rate-high

          # Structured metadata for automation
          budget_remaining: '{{ with query "1 - slo:budget_consumed:ratio{service=\'$labels.service\'}" }}{{ . | first | value | printf "%.2f" }}{{ end }}'
          days_to_exhaustion: '{{ printf "%.1f" (div 30 $value) }}'
```

Alert enrichment strategies:
Static enrichment: Include fixed information in alert templates (runbook links, escalation paths, service ownership).
Dynamic enrichment: At alert fire time, query for current context (budget remaining, recent deployments, correlated symptoms).
Post-fire enrichment: Use incident management tools to add context after the alert fires (add responders, attach relevant logs, link to dashboards).
Automation hooks:
SLO context enables automation that wasn't previously possible: alert routing and escalation can key off the structured budget_remaining and days_to_exhaustion fields, incident tickets can be opened with budget status already attached, and follow-up actions can be triggered automatically when projected exhaustion crosses a threshold.
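As one concrete hook, an Alertmanager route can copy SLO-impacting alerts to an automation webhook alongside the normal paging path. A sketch under assumptions: the `slo_impacting` label comes from the enriched rule above, while the receiver name and endpoint URL are placeholders:

```yaml
# Fragment of an Alertmanager routing tree: place this as the first child
# route; continue: true lets the alert also match the existing paging routes
# that follow it.
route:
  routes:
    - matchers: ['slo_impacting="true"']
      receiver: slo-automation
      continue: true
    # ... existing paging routes follow here
receivers:
  - name: slo-automation
    webhook_configs:
      # The webhook payload carries the alert's annotations, so the receiving
      # service can read budget_remaining and days_to_exhaustion and attach
      # budget context to the incident ticket automatically.
      - url: https://automation.example.com/hooks/slo-alert
```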
Every paging alert should include a one-click link to a pre-filtered dashboard showing the relevant SLIs, recent changes, and comparison to baseline. Don't make responders navigate to find context—deliver it with the page. This can reduce mean-time-to-diagnosis by 50% or more.
Not everyone needs the same SLO alerts. Different stakeholders have different needs for SLO notifications:
- On-call engineers: need immediate, detailed technical alerts for response
- Engineering managers: need visibility into SLO trends and major incidents
- Product managers: need to understand user impact and budget status
- Executives: need high-level SLO health and business impact
- Customers (enterprise): may need proactive notification of issues affecting them
| Stakeholder | Alert Types | Delivery Channel | Timing | Content Focus |
|---|---|---|---|---|
| On-call Engineer | All severity levels | PagerDuty, Slack | Real-time | Technical details, runbooks, dashboards |
| Engineering Manager | P1, P2, SLO violations | Slack, Email | Real-time for P1, digest for others | Impact summary, budget status, team load |
| Product Manager | SLO at risk, violations | Slack, Email | Near-real-time | User impact, feature implications |
| Director/VP | SLO violations, weekly summary | Email, Dashboard | Digest (daily/weekly) | Portfolio health, trends, action items |
| Executive | Major incidents, quarterly SLO | Dashboard, Slides | Periodic reporting | Business impact, trends, investments |
| Enterprise Customer | Issues affecting their usage | Status page, Email | Proactive during impact | What's affected, ETA, workarounds |
Creating a notification hierarchy:
Tier 1: Immediate Technical Response
- Recipients: On-call engineers
- Trigger: Any paging-tier SLO alert
- Channel: PagerDuty/Opsgenie with full context
- Expectation: 5-minute acknowledgment, immediate investigation

Tier 2: Engineering Awareness
- Recipients: Engineering managers, team leads, SRE team
- Trigger: P1/P2 alerts, or cumulative P3+ reaching threshold
- Channel: Slack channel, summarized
- Expectation: Awareness during business hours, available for escalation

Tier 3: Business Stakeholder Visibility
- Recipients: Product managers, business leads
- Trigger: SLO at risk (>75% budget consumed), violations
- Channel: Email or dedicated Slack channel
- Expectation: Awareness of user impact, input on prioritization

Tier 4: Executive Visibility
- Recipients: VP/C-level
- Trigger: Major incidents (defined by business impact), SLO violations
- Channel: Incident management system, executive brief
- Expectation: Awareness for stakeholder management, resource decisions
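The real-time versus digest distinction in these tiers maps naturally onto per-route grouping and repeat intervals in the alert router. A sketch assuming Alertmanager, with hypothetical receiver names, channels, and addresses:

```yaml
route:
  receiver: oncall-page                  # default: page in real time (Tier 1)
  group_wait: 30s
  routes:
    # Tier 2: copy P1/P2 alerts to the engineering channel, then fall
    # through (continue: true) to the paging route below.
    - matchers: ['severity=~"P1|P2"']
      receiver: eng-slack
      continue: true
    - matchers: ['severity=~"P1|P2"']
      receiver: oncall-page
    # Tiers 3-4: batched, low-urgency digest rather than real-time noise.
    - matchers: ['severity=~"P3|P4|P5"']
      receiver: stakeholder-email
      group_interval: 4h
      repeat_interval: 24h
receivers:
  - name: oncall-page
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: eng-slack
    slack_configs:            # assumes a global slack_api_url is configured
      - channel: '#eng-slo-alerts'
  - name: stakeholder-email
    email_configs:            # assumes global SMTP settings are configured
      - to: slo-stakeholders@example.com
```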
Resist the temptation to 'add everyone to the page' for visibility. Executive stakeholders don't need real-time technical alerts—they need curated summaries. Overloading people with alerts they can't act on creates noise and desensitization. Tailor notifications to what each audience can actually use.
SLO alerts protect your user experience—but only if they actually fire when needed. Testing and validating alerts is essential to maintaining confidence in your alerting system.
Why alerts fail silently: the most common causes are metric or label renames that quietly break alert expressions, recording rules that are removed or renamed, thresholds that drift out of relevance as traffic patterns change, and notification routing changes that are never exercised.
Alert testing cadence:
- Weekly: Automated synthetic tests for critical SLO alerts
- Monthly: Review of alert freshness; investigate any that haven't fired
- Quarterly: Full chaos engineering exercise including alerting validation
- After changes: Any modification to alert rules requires test verification
Dead alert detection:
Alerts that never fire are either healthy (the failure condition simply hasn't occurred) or broken (they would not fire even if the condition occurred).
To distinguish these, exercise them deliberately: unit-test the alert rules and periodically inject synthetic failure conditions to confirm the full path from detection to notification.
Alert rule testing example:
```yaml
# Unit test for Prometheus alert rules using promtool
# Save as: tests/slo_alerts_test.yaml

rule_files:
  - ../rules/slo_alerts.yaml

tests:
  # Test 1: Verify severe burn rate alert fires correctly
  - interval: 1m
    input_series:
      # Simulate 2% error rate when SLO allows 0.1%
      # This is 20x burn rate
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+980x10'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+20x10'
      # SLO target
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts:
          - exp_labels:
              severity: critical
              service: api
            exp_annotations:
              summary: "api: High SLO burn rate (severe)"

  # Test 2: Verify alert does NOT fire when burn rate is sustainable
  - interval: 1m
    input_series:
      # Simulate 0.1% error rate (1x burn rate - sustainable)
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+999x10'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+1x10'
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts: []  # No alerts expected

  # Test 3: Verify multi-window requirement
  - interval: 1m
    input_series:
      # High error rate for only 2 minutes (should not trigger)
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+999x2 0+980x1 0+999x7'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+1x2 0+20x1 0+1x7'
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts: []  # Brief spike should not alert due to long window

# Run with: promtool test rules tests/slo_alerts_test.yaml
```

Every SLO alert should have an associated runbook that guides responders through investigation and mitigation. SLO-aware runbooks are distinctive in their focus on protecting user experience and managing error budget, not just fixing technical issues.
Runbook structure for SLO alerts:
````markdown
# SLO Burn Rate High - Payment API

## Alert: SLOBurnRateSevere - payment-api

### 📋 Alert Context
- **SLO**: Payment API availability 99.95% (30-day rolling)
- **Typical trigger**: Error rate spike, latency degradation, dependency failure
- **Service owner**: Payments Team (#payments-eng on Slack)
- **On-call rotation**: payments-oncall@pagerduty

### 🎯 Impact Assessment (2 minutes)
1. Open [SLO Dashboard](https://grafana.example.com/d/slo-payments)
2. Check "Transactions Affected" panel - **How many users impacted?**
3. Check "Error Distribution" - **What's failing?** (400s vs 500s, which endpoints)
4. Check "Geographic Impact" - **Regional or global?**

### 💰 Budget Status Check (1 minute)
- [Error Budget Dashboard](https://grafana.example.com/d/budget-payments)
- Current remaining: Check "Budget Remaining" gauge
- Burn rate: Check "Current Burn Rate" panel
- **If budget < 20%**: Escalate to Engineering Manager immediately

### 🚨 Quick Mitigation (5 minutes)

**Option 1: Rollback (if recent deployment)**
```bash
kubectl rollout undo deployment/payment-api -n payments
```
Verify: Error rate should drop within 2 minutes

**Option 2: Feature Flag (if specific feature)**
```bash
# Disable suspected feature
launchdarkly feature toggle payment-new-flow --off
```

**Option 3: Traffic Shift (if regional)**
```bash
# Route away from affected region
./scripts/traffic-shift.sh --region us-west-2 --pct 0
```

**Option 4: Dependency Fallback**
If payment gateway is the issue:
```bash
kubectl set env deployment/payment-api GATEWAY_FALLBACK=true
```

### 🔍 Root Cause Investigation

**Check in this order (covers 90% of cases):**

1. **Recent deployments** (most common)
   - Check [Deploy History](https://deploys.example.com/payment-api)
   - Anything in last 2 hours? → Consider rollback

2. **Dependency health**
   - [Payment Gateway Status](https://status.paymentgateway.com)
   - [Database Dashboard](https://grafana.example.com/d/db-payments)
   - [Cache Dashboard](https://grafana.example.com/d/cache-payments)

3. **Capacity issues**
   - [Resource Dashboard](https://grafana.example.com/d/resources)
   - CPU/Memory saturation?
   - Pod restarts?

4. **Traffic anomalies**
   - [Traffic Dashboard](https://grafana.example.com/d/traffic)
   - Unusual spike? Attack traffic?

### 📢 Escalation

**Escalate if:**
- Impact duration > 15 minutes
- Budget < 10% remaining
- Cause not identified within 30 minutes
- Multiple services affected

**Escalation contacts:**
- Primary: @payments-oncall (PagerDuty)
- Engineering Manager: Jane Smith (#payments-eng-leads)
- Platform: @platform-oncall (if infrastructure suspected)

### 📣 Communication Templates

**Slack update (every 15 min during incident):**
> [TIME] Payment API degraded performance. Impact: [X]% of transactions seeing errors. Investigating [suspected cause]. ETA: [estimate or "Investigating"]

**Status page (if > 10 min impact):**
> We are investigating reports of degraded performance with payment processing. Some users may experience failures or delays. We are actively working to resolve this issue.

### ✅ Recovery Verification

Before declaring resolved:
1. Burn rate returned to < 2x for 15 minutes
2. Error rate below SLO threshold for 15 minutes
3. No abnormal customer complaints
4. Affected transactions confirmed processing

### 📝 Post-Incident
- [ ] Create post-mortem if impact > 5 minutes or budget impact > 5%
- [ ] Update this runbook if investigation steps were missing
- [ ] Consider alert tuning if false positive
````

Runbooks decay rapidly as systems evolve. After every incident, ask: 'Did the runbook help? What was missing?' Update it immediately.
Schedule quarterly reviews to catch drift. Dead runbooks are worse than no runbooks—they waste time and erode trust.
SLO-based alerting transforms operational monitoring from a reactive chore into a strategic capability. By aligning alerts with user experience and error budget impact, teams achieve better outcomes with less noise and fatigue.
You now understand comprehensive SLO-based alerting strategies—from philosophical foundations through practical implementation and testing. Next, we'll explore how to review and adjust SLOs over time, ensuring your reliability targets remain appropriate as services, users, and business needs evolve.