Traditional alerting evolved organically: add an alert for each symptom, each component, each failure mode. The result? Alert sprawl, alert fatigue, and a fundamental disconnect between what operators respond to and what actually matters to users.
The SLO-based alerting philosophy:
SLO-based alerting inverts the traditional approach. Instead of asking "What can break?" and alerting on each answer, it asks: "Is the user experience degraded, and is the error budget being consumed faster than we can afford?"
This approach doesn't eliminate all cause-based alerts; you still need basic infrastructure monitoring. But it fundamentally changes what pages you and what merely creates tickets. The goal: every page represents a genuine threat to user experience; every ticket represents something worth fixing, but not urgently.
Burn rate alerting (covered in the previous page) is the cornerstone of SLO-based alerting, but the philosophy extends further into alert design, prioritization, and organizational practices.
By the end of this page, you'll understand how to structure an entire alerting strategy around SLOs: symptom vs. cause alerts, alert prioritization based on budget impact, reducing alert fatigue through SLO-based consolidation, integrating SLO context into incident response, and measuring alerting quality against SLO outcomes.
Understanding the distinction between symptom-based and cause-based alerting is foundational to SLO-based alerting design.
Cause-based alerts fire when something in your infrastructure breaks—a server goes down, a process crashes, a disk fills up. They're detecting causes of potential user impact.
Symptom-based alerts fire when user experience degrades—error rates increase, latency rises, availability drops. They're detecting symptoms that users experience.
The SLO connection:
SLOs are fundamentally symptom-based—they measure user experience. Therefore, SLO-based alerting prioritizes symptom detection:
SLO Alerts = "Is the user experience degraded?"
Infrastructure Alerts = "Is something broken internally?"
The key insight is that not all internal breakage affects users, and prioritizing by user impact prevents chasing phantom problems.
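To make the distinction concrete, here is a hypothetical pair of Prometheus alert rules, one symptom-based and one cause-based. The metric names, thresholds, and rule names are illustrative assumptions, not part of the examples later on this page:

```yaml
groups:
  - name: symptom-vs-cause-examples
    rules:
      # Symptom-based: fires only when users are actually seeing errors
      # beyond what a 0.1%-error-budget SLO allows.
      - alert: UserFacingErrorRateHigh
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.001
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error ratio exceeds the 0.1% the SLO allows"

      # Cause-based: fires on internal breakage that may or may not
      # be affecting users right now.
      - alert: DiskAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/data"}
            / node_filesystem_size_bytes{mountpoint="/data"} < 0.10
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Less than 10% disk space remaining on /data"
```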
| Aspect | Symptom-Based Alerts | Cause-Based Alerts |
|---|---|---|
| What it detects | User-facing impact (latency, errors, availability) | Infrastructure issues (crashes, disk, CPU) |
| False positive rate | Lower (if users aren't impacted, alert doesn't fire) | Higher (internal issues may have no user impact) |
| False negative rate | Can miss if instrumentation has gaps | Can miss novel failure modes |
| Actionability | High (known user impact requires response) | Variable (internal fix may or may not be urgent) |
| Correlation with SLO | Direct (symptoms ARE SLI measurements) | Indirect (causes MAY lead to SLO impact) |
| Alert volume | Lower (consolidated by user impact) | Higher (each component can have many alerts) |
Some cause-based alerts warrant paging even without current symptoms—when the cause will inevitably lead to severe symptoms if not addressed. Example: a database failover just consumed your last redundant replica. No symptoms yet, but you're now running without protection. These 'imminent doom' alerts are valid pages even in symptom-first approaches.
Layered alerting architecture:
The recommended approach layers symptom and cause alerts:
Layer 1: SLO/Symptom Alerts (Page): user experience is measurably degraded; a human responds immediately.
Layer 2: Leading Indicator Alerts (Page or Ticket): no user impact yet, but severe symptoms are imminent if nothing is done (the "last redundant replica" case above).
Layer 3: Observational Alerts (Ticket): internal issues worth fixing that don't currently threaten the SLO.
Layer 4: Informational Alerts (Dashboard/Log): signals kept for debugging and trend analysis, never routed to a person.
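One way to wire these layers is through your alert router. The sketch below assumes Alertmanager and a `layer` label attached to each alert rule; the receiver names and keys are placeholders:

```yaml
# Alertmanager routing sketch: map alert layers to delivery channels.
route:
  receiver: dashboard-only                    # Layer 4 default: visible, never notified
  routes:
    - matchers: ['layer="slo-symptom"']       # Layer 1: page immediately
      receiver: pagerduty-oncall
    - matchers: ['layer="leading-indicator"'] # Layer 2: page (or route to tickets if preferred)
      receiver: pagerduty-oncall
    - matchers: ['layer="observational"']     # Layer 3: ticket queue
      receiver: jira-tickets
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: jira-tickets
    webhook_configs:
      - url: https://ticketing.example.com/alertmanager-webhook
  - name: dashboard-only   # no notification config: alerts stay visible in the UI only
```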
Traditional alert severity (Critical, Warning, Info) is often assigned subjectively: "This feels like a critical alert." SLO-based alerting provides an objective framework: severity should reflect error budget impact.
The severity mapping principle:
Severity = f(Budget Impact Rate, Current Budget Status)
An alert that would consume 10% of monthly budget in an hour is more severe than one consuming 1% per day, even if the underlying symptoms look similar. And that same 10%/hour burn is more critical when you have 15% budget remaining than when you have 80%.
| Severity Level | Burn Rate | Time to Exhaust | Remaining Budget Factor | Response Expectation |
|---|---|---|---|---|
| P1 / Critical | 14x | < 2 days | Any, or < 30% remaining | Immediate response, <5 min acknowledgment |
| P2 / High | 6x - 14x | 2-5 days | Any, or < 50% remaining | Response within 30 min |
| P3 / Medium | 3x - 6x | 5-10 days | Any | Response within 4 hours (business hours) |
| P4 / Low | 1x - 3x | 10-30 days | > 50% remaining | Response within 24 hours |
| P5 / Info | < 1x | Not exhausting | Any | Ticket for tracking, no SLA |
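The budget-status and time-to-exhaust figures in this table can be derived from a few recording rules. This is a sketch assuming a 30-day window and the `slo:error_rate:ratio_rate1h`, `slo:error_budget:ratio`, and `slo:burn_rate:ratio_rate1h` series used in the examples later on this page; the `slo:budget_remaining:ratio` and `slo:days_to_exhaustion` names are our own additions:

```yaml
groups:
  - name: slo-budget-status
    rules:
      # Fraction of the 30-day error budget already consumed
      # (time-weighted approximation; request-weighted needs the raw counters).
      - record: slo:budget_consumed:ratio
        expr: avg_over_time(slo:error_rate:ratio_rate1h[30d]) / slo:error_budget:ratio
      # Fraction of the budget still remaining.
      - record: slo:budget_remaining:ratio
        expr: 1 - slo:budget_consumed:ratio
      # Days until exhaustion if the current 1h burn rate is sustained:
      # remaining budget (expressed in window-days) divided by the burn rate.
      - record: slo:days_to_exhaustion
        expr: (slo:budget_remaining:ratio * 30) / slo:burn_rate:ratio_rate1h
```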
Adjusting severity by budget status:
The same burn rate can warrant different severities based on current budget health:
Scenario 1: Budget at 70% remaining. There is headroom, so a moderate burn can be handled at the normal severity for its burn rate tier.
Scenario 2: Budget at 20% remaining. The same burn rate now directly threatens the SLO; escalate it one severity tier and respond with urgency.
Scenario 3: Budget exhausted. The SLO is already violated; any SLO-impacting burn warrants a critical page and incident response.
This adaptive severity ensures that response intensity matches actual risk. An organization with healthy budgets can operate more calmly; one with depleted budgets operates with heightened vigilance.
Implementation approaches:
Static severity (simpler): Assign fixed severity to each burn rate tier. Rely on operators to mentally factor in budget status.
Dynamic severity (sophisticated): Use alert rules that incorporate both burn rate AND current budget:
```python
def severity(burn_rate: float, budget_remaining: float) -> str:
    """Map burn rate and remaining budget fraction (0-1) to a severity tier."""
    if burn_rate > 6 and budget_remaining < 0.30:
        return "P1"
    elif burn_rate > 6:
        return "P2"
    elif burn_rate > 3 and budget_remaining < 0.30:
        return "P2"
    # ... remaining tiers (P3-P5) follow the same pattern
    return "P5"
```
Dynamic severity is more operationally useful but requires more complex alert configuration.
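In Prometheus terms, dynamic severity usually means one alert rule per severity tier, each combining a burn-rate condition with a budget condition. A minimal sketch, assuming the `slo:burn_rate:ratio_rate1h` series used in the enrichment example below and the hypothetical `slo:budget_remaining:ratio` recording rule sketched above:

```yaml
groups:
  - name: slo-dynamic-severity
    rules:
      # P1: fast burn while the budget is already thin.
      - alert: SLOBurnCritical
        expr: |
          slo:burn_rate:ratio_rate1h > 6
          and slo:budget_remaining:ratio < 0.30
        for: 5m
        labels:
          severity: P1
      # P2: fast burn with a healthier budget, or moderate burn on a thin budget.
      - alert: SLOBurnHigh
        expr: |
          (slo:burn_rate:ratio_rate1h > 6 and slo:budget_remaining:ratio >= 0.30)
          or (slo:burn_rate:ratio_rate1h > 3 and slo:budget_remaining:ratio < 0.30)
        for: 15m
        labels:
          severity: P2
```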
Whatever severity scheme you choose, be consistent across services. Operators build intuition based on severity—P1 means drop everything. If P1 means different things for different services, operators lose trust in the system and either over-respond (burnout) or under-respond (missed incidents).
Alert fatigue is the operational equivalent of "crying wolf"—when teams receive too many alerts, they stop taking any of them seriously. SLO-based alerting is a powerful antidote to alert fatigue because it fundamentally reduces alert volume while increasing signal quality.
Why SLO-based alerting reduces fatigue: dozens of per-component alerts collapse into a handful of user-impact alerts, multi-window burn rate conditions suppress transient blips that would otherwise page, and every page that does fire maps to measurable user impact, so responders trust it.
The '5x5' rule of thumb:
A healthy SLO-based alerting configuration should result in roughly: no more than about 5 paging alert types per service, and no more than about 5 pages per week on average.
If a service generates more than 5 pages per week on average, something is wrong—either your reliability needs investment, or your alert thresholds are too sensitive. If you have more than 5 paging alert types, you probably have alert sprawl and should consolidate to SLO-based alerts.
Measuring alert quality:
Track these metrics to assess alert health: the share of pages that corresponded to genuine SLO impact (precision), the share of SLO-impacting incidents that triggered a page (recall), pages per on-call shift, and time from SLO impact to page.
Organizations naturally accumulate alerts over time. Each incident produces new alerts; few are ever removed. Schedule quarterly 'alert pruning' reviews: examine every paging alert's history, remove those that no longer provide signal, and demote cause-based alerts that haven't correlated with SLO impact.
An alert that says "Error rate is 2%" is far less useful than one that says "Error rate is 2%, consuming budget at 20x sustainable rate, 65% budget remaining, projected to exhaust in 1.5 days if sustained." SLO-aware alerts include context that enables faster, better-informed response.
Essential SLO context in alerts includes the current error rate and burn rate, budget consumed and remaining, projected time to exhaustion, the recent trend, any recent deployments, and direct links to the relevant dashboard and runbook, as in the configuration below:
```yaml
# Example: SLO-Enriched Alert Configuration
# Shows how to include rich context in alert annotations

groups:
  - name: slo-alerts-with-context
    rules:
      - alert: SLOBudgetBurn
        expr: |
          slo:burn_rate:ratio_rate1h > 6
          AND
          slo:burn_rate:ratio_rate6h > 6
        for: 5m
        labels:
          severity: warning
          slo_impacting: "true"
        annotations:
          # Summary with key context
          summary: |
            SLO Alert: {{ $labels.service }} burning budget at
            {{ $value | printf "%.1f" }}x sustainable rate

          # Rich description with full context
          description: |
            🚨 **Service**: {{ $labels.service }}

            📊 **Current State**:
            - Error Rate: {{ with query "slo:error_rate:ratio_rate1h{service='$labels.service'}" }}{{ . | first | value | printf "%.3f" }}%{{ end }}
            - Burn Rate: {{ $value | printf "%.1f" }}x sustainable
            - Budget Used (30d): {{ with query "slo:budget_consumed:ratio{service='$labels.service'}" }}{{ . | first | value | printf "%.1f" }}%{{ end }}

            ⏱️ **Projection**: At current rate, budget exhausts in {{ printf "%.1f" (div 30 $value) }} days

            📈 **Trend**: {{ with query "delta(slo:burn_rate:ratio_rate1h{service='$labels.service'}[30m])" }}{{ if gt (. | first | value) 0 }}Worsening{{ else }}Improving{{ end }}{{ end }}

            🔄 **Recent Changes**:
            {{ range query "deployment_timestamp{service='$labels.service', age<'6h'}" }}
            - {{ .Labels.version }} deployed {{ .Labels.age }} ago
            {{ end }}

          # Direct links
          dashboard_url: https://grafana.example.com/d/slo-dashboard?var-service={{ $labels.service }}
          runbook_url: https://runbooks.example.com/slo-burn-rate-high

          # Structured metadata for automation
          budget_remaining: '{{ with query "1 - slo:budget_consumed:ratio{service=\'$labels.service\'}" }}{{ . | first | value | printf "%.2f" }}{{ end }}'
          days_to_exhaustion: '{{ printf "%.1f" (div 30 $value) }}'
```

Alert enrichment strategies:
Static enrichment: Include fixed information in alert templates (runbook links, escalation paths, service ownership).
Dynamic enrichment: At alert fire time, query for current context (budget remaining, recent deployments, correlated symptoms).
Post-fire enrichment: Use incident management tools to add context after the alert fires (add responders, attach relevant logs, link to dashboards).
Automation hooks:
SLO context enables automation that wasn't previously possible: alert routing and escalation can key off the structured budget_remaining and days_to_exhaustion fields, incident tickets can be opened with budget status already attached, and follow-up actions can be triggered automatically when projected exhaustion crosses a threshold.
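As one concrete hook, an Alertmanager route can copy SLO-impacting alerts to an automation webhook alongside the normal paging path. A sketch under assumptions: the `slo_impacting` label comes from the enriched rule above, while the receiver name and endpoint URL are placeholders:

```yaml
# Fragment of an Alertmanager routing tree: place this as the first child
# route; continue: true lets the alert also match the existing paging routes
# that follow it.
route:
  routes:
    - matchers: ['slo_impacting="true"']
      receiver: slo-automation
      continue: true
    # ... existing paging routes follow here
receivers:
  - name: slo-automation
    webhook_configs:
      # The webhook payload carries the alert's annotations, so the receiving
      # service can read budget_remaining and days_to_exhaustion and attach
      # budget context to the incident ticket automatically.
      - url: https://automation.example.com/hooks/slo-alert
```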
Every paging alert should include a one-click link to a pre-filtered dashboard showing the relevant SLIs, recent changes, and comparison to baseline. Don't make responders navigate to find context—deliver it with the page. This can reduce mean-time-to-diagnosis by 50% or more.
Not everyone needs the same SLO alerts. Different stakeholders have different needs for SLO notifications:
- On-call engineers: need immediate, detailed technical alerts for response
- Engineering managers: need visibility into SLO trends and major incidents
- Product managers: need to understand user impact and budget status
- Executives: need high-level SLO health and business impact
- Customers (enterprise): may need proactive notification of issues affecting them
| Stakeholder | Alert Types | Delivery Channel | Timing | Content Focus |
|---|---|---|---|---|
| On-call Engineer | All severity levels | PagerDuty, Slack | Real-time | Technical details, runbooks, dashboards |
| Engineering Manager | P1, P2, SLO violations | Slack, Email | Real-time for P1, digest for others | Impact summary, budget status, team load |
| Product Manager | SLO at risk, violations | Slack, Email | Near-real-time | User impact, feature implications |
| Director/VP | SLO violations, weekly summary | Email, Dashboard | Digest (daily/weekly) | Portfolio health, trends, action items |
| Executive | Major incidents, quarterly SLO | Dashboard, Slides | Periodic reporting | Business impact, trends, investments |
| Enterprise Customer | Issues affecting their usage | Status page, Email | Proactive during impact | What's affected, ETA, workarounds |
Creating a notification hierarchy:
Tier 1: Immediate Technical Response
- Recipients: On-call engineers
- Trigger: Any paging-tier SLO alert
- Channel: PagerDuty/Opsgenie with full context
- Expectation: 5-minute acknowledgment, immediate investigation

Tier 2: Engineering Awareness
- Recipients: Engineering managers, team leads, SRE team
- Trigger: P1/P2 alerts, or cumulative P3+ reaching threshold
- Channel: Slack channel, summarized
- Expectation: Awareness during business hours, available for escalation

Tier 3: Business Stakeholder Visibility
- Recipients: Product managers, business leads
- Trigger: SLO at risk (>75% budget consumed), violations
- Channel: Email or dedicated Slack channel
- Expectation: Awareness of user impact, input on prioritization

Tier 4: Executive Visibility
- Recipients: VP/C-level
- Trigger: Major incidents (defined by business impact), SLO violations
- Channel: Incident management system, executive brief
- Expectation: Awareness for stakeholder management, resource decisions
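The real-time versus digest distinction in these tiers maps naturally onto per-route grouping and repeat intervals in the alert router. A sketch assuming Alertmanager, with hypothetical receiver names, channels, and addresses:

```yaml
route:
  receiver: oncall-page                  # default: page in real time (Tier 1)
  group_wait: 30s
  routes:
    # Tier 2: copy P1/P2 alerts to the engineering channel, then fall
    # through (continue: true) to the paging route below.
    - matchers: ['severity=~"P1|P2"']
      receiver: eng-slack
      continue: true
    - matchers: ['severity=~"P1|P2"']
      receiver: oncall-page
    # Tiers 3-4: batched, low-urgency digest rather than real-time noise.
    - matchers: ['severity=~"P3|P4|P5"']
      receiver: stakeholder-email
      group_interval: 4h
      repeat_interval: 24h
receivers:
  - name: oncall-page
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: eng-slack
    slack_configs:            # assumes a global slack_api_url is configured
      - channel: '#eng-slo-alerts'
  - name: stakeholder-email
    email_configs:            # assumes global SMTP settings are configured
      - to: slo-stakeholders@example.com
```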
Resist the temptation to 'add everyone to the page' for visibility. Executive stakeholders don't need real-time technical alerts—they need curated summaries. Overloading people with alerts they can't act on creates noise and desensitization. Tailor notifications to what each audience can actually use.
SLO alerts protect your user experience—but only if they actually fire when needed. Testing and validating alerts is essential to maintaining confidence in your alerting system.
Why alerts fail silently: the most common causes are metric or label renames that quietly break alert expressions, recording rules that are removed or renamed, thresholds that drift out of relevance as traffic patterns change, and notification routing changes that are never exercised.
Alert testing cadence:
- Weekly: Automated synthetic tests for critical SLO alerts
- Monthly: Review of alert freshness; investigate any that haven't fired
- Quarterly: Full chaos engineering exercise including alerting validation
- After changes: Any modification to alert rules requires test verification
Dead alert detection:
Alerts that never fire are either healthy (the failure condition simply hasn't occurred) or broken (they would not fire even if the condition occurred).
To distinguish these, exercise them deliberately: unit-test the alert rules and periodically inject synthetic failure conditions to confirm the full path from detection to notification.
Alert rule testing example:
```yaml
# Unit test for Prometheus alert rules using promtool
# Save as: tests/slo_alerts_test.yaml

rule_files:
  - ../rules/slo_alerts.yaml

tests:
  # Test 1: Verify severe burn rate alert fires correctly
  - interval: 1m
    input_series:
      # Simulate 2% error rate when SLO allows 0.1%
      # This is 20x burn rate
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+980x10'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+20x10'
      # SLO target
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts:
          - exp_labels:
              severity: critical
              service: api
            exp_annotations:
              summary: "api: High SLO burn rate (severe)"

  # Test 2: Verify alert does NOT fire when burn rate is sustainable
  - interval: 1m
    input_series:
      # Simulate 0.1% error rate (1x burn rate - sustainable)
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+999x10'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+1x10'
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts: []  # No alerts expected

  # Test 3: Verify multi-window requirement
  - interval: 1m
    input_series:
      # High error rate for only 2 minutes (should not trigger)
      - series: 'http_requests_total{service="api",status="200"}'
        values: '0+999x2 0+980x1 0+999x7'
      - series: 'http_requests_total{service="api",status="500"}'
        values: '0+1x2 0+20x1 0+1x7'
      - series: 'slo:error_budget:ratio{service="api"}'
        values: '0.001x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: SLOBurnRateSevere
        exp_alerts: []  # Brief spike should not alert due to long window

# Run with: promtool test rules tests/slo_alerts_test.yaml
```

Every SLO alert should have an associated runbook that guides responders through investigation and mitigation. SLO-aware runbooks are distinctive in their focus on protecting user experience and managing error budget, not just fixing technical issues.
Runbook structure for SLO alerts:
````markdown
# SLO Burn Rate High - Payment API

## Alert: SLOBurnRateSevere - payment-api

### 📋 Alert Context
- **SLO**: Payment API availability 99.95% (30-day rolling)
- **Typical trigger**: Error rate spike, latency degradation, dependency failure
- **Service owner**: Payments Team (#payments-eng on Slack)
- **On-call rotation**: payments-oncall@pagerduty

### 🎯 Impact Assessment (2 minutes)
1. Open [SLO Dashboard](https://grafana.example.com/d/slo-payments)
2. Check "Transactions Affected" panel - **How many users impacted?**
3. Check "Error Distribution" - **What's failing?** (400s vs 500s, which endpoints)
4. Check "Geographic Impact" - **Regional or global?**

### 💰 Budget Status Check (1 minute)
- [Error Budget Dashboard](https://grafana.example.com/d/budget-payments)
- Current remaining: Check "Budget Remaining" gauge
- Burn rate: Check "Current Burn Rate" panel
- **If budget < 20%**: Escalate to Engineering Manager immediately

### 🚨 Quick Mitigation (5 minutes)

**Option 1: Rollback (if recent deployment)**
```bash
kubectl rollout undo deployment/payment-api -n payments
```
Verify: Error rate should drop within 2 minutes

**Option 2: Feature Flag (if specific feature)**
```bash
# Disable suspected feature
launchdarkly feature toggle payment-new-flow --off
```

**Option 3: Traffic Shift (if regional)**
```bash
# Route away from affected region
./scripts/traffic-shift.sh --region us-west-2 --pct 0
```

**Option 4: Dependency Fallback**
If payment gateway is the issue:
```bash
kubectl set env deployment/payment-api GATEWAY_FALLBACK=true
```

### 🔍 Root Cause Investigation

**Check in this order (covers 90% of cases):**

1. **Recent deployments** (most common)
   - Check [Deploy History](https://deploys.example.com/payment-api)
   - Anything in last 2 hours? → Consider rollback

2. **Dependency health**
   - [Payment Gateway Status](https://status.paymentgateway.com)
   - [Database Dashboard](https://grafana.example.com/d/db-payments)
   - [Cache Dashboard](https://grafana.example.com/d/cache-payments)

3. **Capacity issues**
   - [Resource Dashboard](https://grafana.example.com/d/resources)
   - CPU/Memory saturation?
   - Pod restarts?

4. **Traffic anomalies**
   - [Traffic Dashboard](https://grafana.example.com/d/traffic)
   - Unusual spike? Attack traffic?

### 📢 Escalation

**Escalate if:**
- Impact duration > 15 minutes
- Budget < 10% remaining
- Cause not identified within 30 minutes
- Multiple services affected

**Escalation contacts:**
- Primary: @payments-oncall (PagerDuty)
- Engineering Manager: Jane Smith (#payments-eng-leads)
- Platform: @platform-oncall (if infrastructure suspected)

### 📣 Communication Templates

**Slack update (every 15 min during incident):**
> [TIME] Payment API degraded performance. Impact: [X]% of transactions seeing errors. Investigating [suspected cause]. ETA: [estimate or "Investigating"]

**Status page (if > 10 min impact):**
> We are investigating reports of degraded performance with payment processing. Some users may experience failures or delays. We are actively working to resolve this issue.

### ✅ Recovery Verification

Before declaring resolved:
1. Burn rate returned to < 2x for 15 minutes
2. Error rate below SLO threshold for 15 minutes
3. No abnormal customer complaints
4. Affected transactions confirmed processing

### 📝 Post-Incident
- [ ] Create post-mortem if impact > 5 minutes or budget impact > 5%
- [ ] Update this runbook if investigation steps were missing
- [ ] Consider alert tuning if false positive
````

Runbooks decay rapidly as systems evolve. After every incident, ask: 'Did the runbook help? What was missing?' Update it immediately.
Schedule quarterly reviews to catch drift. Dead runbooks are worse than no runbooks—they waste time and erode trust.
SLO-based alerting transforms operational monitoring from a reactive chore into a strategic capability. By aligning alerts with user experience and error budget impact, teams achieve better outcomes with less noise and fatigue.
You now understand comprehensive SLO-based alerting strategies—from philosophical foundations through practical implementation and testing. Next, we'll explore how to review and adjust SLOs over time, ensuring your reliability targets remain appropriate as services, users, and business needs evolve.