Traditional threshold-based alerting answers a simple question: "Is the system broken right now?" If error rate exceeds 5%, fire an alert. If latency exceeds 500ms, fire an alert. This works for catching acute failures but fails catastrophically for SLO-based reliability management.
The problem:
Imagine a service with a 99.9% availability SLO over 30 days. The error budget is roughly 43 minutes of downtime. Traditional alerting faces a dilemma: set thresholds tight enough to catch every budget-relevant error and you page constantly on brief blips that barely dent those 43 minutes; set them loose enough to stay quiet and a modest but steady error rate can drain the entire budget without ever crossing a threshold.
Neither approach answers the question you actually need answered: "At the current rate of error budget consumption, will we violate our SLO?"
Enter burn rate alerting.
Burn rate alerting doesn't ask whether the system is broken now. It asks whether the rate of error budget consumption will lead to SLO violation if it continues. This subtle but powerful reframing enables proactive intervention based on trajectory rather than current state.
By the end of this page, you'll understand burn rate mathematics, how to calculate appropriate burn rate thresholds, multi-window alerting strategies, alert configuration best practices, and how to respond to burn rate alerts effectively. You'll learn the alerting strategies pioneered by Google SRE and adopted across the industry.
Burn rate is the speed at which you're consuming your error budget, expressed as a multiple of the sustainable consumption rate.
The fundamental formula:
Burn Rate = Actual Error Rate / Allowed Error Rate
Where the actual error rate is what you observe over the measurement window, and the allowed error rate is the error budget expressed as a rate (1 − SLO target, e.g. 0.1% for a 99.9% SLO).
Example calculation:
For a service with a 99.9% SLO (0.1% error budget), an observed error rate of 0.5% yields a burn rate of 0.5% / 0.1% = 5: the budget is being consumed five times faster than sustainable. The table below shows how different burn rates play out:
| Burn Rate | Current Error Rate | Budget Impact | Time to Exhaust if Sustained |
|---|---|---|---|
| 1.0 | 0.1% | Sustainable consumption | 30 days (within budget) |
| 2.0 | 0.2% | Consuming 2x faster | 15 days |
| 5.0 | 0.5% | Consuming 5x faster | 6 days |
| 10.0 | 1.0% | Consuming 10x faster | 3 days |
| 14.4 | 1.44% | Consuming 14.4x faster | ~2 days |
| 36.0 | 3.6% | Consuming 36x faster | ~20 hours |
| 720.0 | 72% | Near-total outage | 1 hour |
Intuition for burn rate:
Think of the error budget as a fuel tank and burn rate as your consumption speed: at 1x you finish the 30-day period with the tank exactly empty, at 2x you run dry halfway through, and at 720x a full month's fuel is gone in a single hour.
Why burn rate beats simple thresholds:
Traditional alerting: "Error rate is 0.5%—is that bad?"
The answer depends entirely on context: against a 99% SLO (1% budget) it is only half the sustainable rate, against a 99.9% SLO it is a 5x burn, and against a 99.99% SLO it is a 50x emergency.
Burn rate normalizes across different SLO targets. A burn rate of 5 means "5x sustainable" regardless of whether you're managing a 99% or 99.99% service.
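To make the normalization concrete, here's a small sketch that evaluates the same 0.5% error rate against three different SLO targets:

```python
def burn_rate(actual_error_rate: float, slo_target: float) -> float:
    """Burn rate = actual error rate / allowed error rate (1 - SLO target)."""
    return actual_error_rate / (1 - slo_target)

actual = 0.005  # 0.5% of requests failing
for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: burn rate {burn_rate(actual, slo):.1f}x")

# Output:
# SLO 99.00%: burn rate 0.5x
# SLO 99.90%: burn rate 5.0x
# SLO 99.99%: burn rate 50.0x
```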
Services often have multiple SLIs (availability, latency, error rate). Calculate burn rate separately for each, then alert on the highest burn rate across all SLIs. This ensures you catch whichever dimension is consuming budget fastest.
The naive approach to burn rate alerting uses a single measurement window. If burn rate over the last hour exceeds some threshold, alert. This is better than threshold alerting but has significant limitations.
The single-window problem:
Consider alerting when the 1-hour burn rate exceeds 10x: you detect severe incidents quickly, but a short spike that self-resolves in a few minutes still pages someone, and a steady 5x burn that would exhaust the budget in six days never fires at all.
Or, alerting when the 24-hour burn rate exceeds 2x: slow burns are eventually caught, but a moderately serious incident can take hours to drag the 24-hour average over the threshold, and once triggered the alert keeps firing long after the problem is fixed because the window takes many hours to drain.
Short windows are responsive but noisy. Long windows are stable but slow. You cannot optimize for both with a single measurement. The solution is multi-window alerting: using multiple time windows simultaneously to capture different incident profiles.
Multi-window alerting strategy:
The industry-standard approach (pioneered by Google and documented in the SRE Workbook) uses pairs of windows: a long window that confirms the problem is sustained, and a much shorter window that confirms budget is still being burned right now, so the alert stops firing promptly once the problem is fixed.
An alert fires only when BOTH conditions are true: the burn rate measured over the short window exceeds the threshold, and the burn rate measured over the long window exceeds the same threshold.
This reduces false positives from brief spikes while ensuring sustained problems are caught quickly.
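As a minimal sketch in plain Python (with illustrative burn rate values, not tied to any particular monitoring stack), the firing condition for a single tier looks like this:

```python
def should_fire(short_window_burn: float, long_window_burn: float, threshold: float) -> bool:
    """Fire only when BOTH the short- and long-window burn rates exceed the threshold.

    The long window confirms the problem is sustained; the short window lets the
    alert stop firing quickly once the budget is no longer being burned.
    """
    return short_window_burn > threshold and long_window_burn > threshold

# Severe tier from the table below: 14.4x over both 5 minutes and 1 hour
print(should_fire(short_window_burn=22.0, long_window_burn=3.1, threshold=14.4))   # False: brief spike
print(should_fire(short_window_burn=22.0, long_window_burn=18.5, threshold=14.4))  # True: sustained burn
```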
The standard multi-window configuration:
Google's SRE Workbook recommends these window pairs:
| Short Window | Long Window | Burn Rate Threshold | Time to Exhaust Budget | Use Case |
|---|---|---|---|---|
| 5 minutes | 1 hour | 14.4x | ~2 days | Page: Severe, urgent |
| 30 minutes | 6 hours | 6x | 5 days | Page: Significant, needs attention |
| 2 hours | 24 hours | 3x | 10 days | Ticket: Slow burn, investigate |
| 6 hours | 3 days | 1x | 30 days | Ticket: Tracking, might become issue |
The burn rate thresholds are calculated such that if the burn rate is sustained, budget would exhaust in the specified time.
Setting appropriate burn rate thresholds requires balancing detection speed against false positives. Here's the mathematical framework:
The core relationship:
For a 30-day SLO window, if you want to be alerted when budget consumption would exhaust in D days:
Burn Rate Threshold = 30 / D
Examples:
| Days to Budget Exhaustion | Burn Rate Threshold | Alert Urgency Level | Typical Response |
|---|---|---|---|
| 1 day | 30.0 | Critical/Page immediately | All hands, major incident |
| 2 days | 15.0 | High/Page | Immediate investigation |
| 5 days | 6.0 | Medium/Page | Same-day investigation |
| 10 days | 3.0 | Low/Ticket | Investigate this week |
| 15 days | 2.0 | Info/Ticket | Monitor, potential issue |
| 30 days | 1.0 | Warning | On track to exceed budget |
Window sizing for thresholds:
Once you've chosen your burn rate thresholds, you need to size your measurement windows appropriately.
The detection time formula:
Detection Time (complete outage) ≈ (Burn Rate Threshold / Maximum Burn Rate) × Long Window
where the maximum burn rate is 1 / (1 − SLO target), i.e. 1000x for a 99.9% SLO. For the severe tier (14.4x threshold, 1-hour long window), a total outage is detected in roughly (14.4 / 1000) × 60 ≈ 0.9 minutes.
But more practically, windows should be sized to catch the percentage of budget consumption that matters:
Rule of thumb for window sizing: keep the short window at roughly 1/12 of the long window (5 minutes with 1 hour, 30 minutes with 6 hours, 2 hours with 24 hours), and size the long window so that only a low single-digit percentage of the budget can be consumed before the alert fires.
The trade-off:
Most teams accept 2-5% budget consumption before the first alert fires, trading some budget loss for reduced false positives.
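To make the window-size trade-off concrete, here's a small sketch, using the standard tiers from the table above and assuming a 30-day, 99.9% SLO, of how long each alert takes to fire during a complete outage:

```python
# Detection time during a complete (100% error) outage, for a 99.9% SLO.
# The long window is the binding constraint; the alert's `for` duration is ignored.

MAX_BURN_RATE = 1 / (1 - 0.999)  # ~1000x: every single request failing

tiers = [
    ("Severe", 14.4, 60),        # (name, burn rate threshold, long window in minutes)
    ("Significant", 6.0, 360),
    ("Moderate", 3.0, 1440),
]

for name, threshold, long_window_minutes in tiers:
    # Time for the windowed burn rate to climb up to the threshold
    detection_minutes = (threshold / MAX_BURN_RATE) * long_window_minutes
    print(f"{name}: fires after ~{detection_minutes:.1f} minutes of total outage")

# Output:
# Severe: fires after ~0.9 minutes of total outage
# Significant: fires after ~2.2 minutes of total outage
# Moderate: fires after ~4.3 minutes of total outage
```

The calculator below goes a step further and reports how much of the monthly budget has been consumed by the time each alert can fire.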
```python
# Burn Rate Threshold Calculator
# Helps determine appropriate thresholds for your SLO configuration

def calculate_burn_rate_threshold(slo_window_days: int, exhaustion_days: float) -> float:
    """
    Calculate the burn rate threshold that would exhaust budget in given days.

    Args:
        slo_window_days: Your SLO evaluation window (typically 28 or 30)
        exhaustion_days: Days until budget exhaustion at this rate

    Returns:
        Burn rate threshold value
    """
    return slo_window_days / exhaustion_days

def calculate_detection_percentage(
    burn_rate: float,
    short_window_hours: float,
    slo_window_days: int,
) -> float:
    """
    Calculate what percentage of budget could be consumed before detection.

    Args:
        burn_rate: The burn rate threshold being used
        short_window_hours: Short window size in hours
        slo_window_days: Your SLO evaluation window in days

    Returns:
        Percentage of budget potentially consumed before alert
    """
    slo_window_hours = slo_window_days * 24
    return (burn_rate * short_window_hours / slo_window_hours) * 100

# Example: Standard multi-window configuration for 30-day SLO
slo_window = 30  # days

print("=== Burn Rate Alert Configuration ===\n")

configurations = [
    {"name": "Severe (Page)", "exhaustion_days": 2, "short_hours": 1, "long_hours": 6},
    {"name": "Significant (Page)", "exhaustion_days": 5, "short_hours": 6, "long_hours": 24},
    {"name": "Moderate (Ticket)", "exhaustion_days": 10, "short_hours": 24, "long_hours": 72},
]

for config in configurations:
    threshold = calculate_burn_rate_threshold(slo_window, config["exhaustion_days"])
    budget_at_detection = calculate_detection_percentage(
        threshold, config["short_hours"], slo_window
    )
    print(f"Alert Level: {config['name']}")
    print(f"  Burn Rate Threshold: {threshold:.1f}x")
    print(f"  Short Window: {config['short_hours']} hours")
    print(f"  Long Window: {config['long_hours']} hours")
    print(f"  Budget consumed at detection: {budget_at_detection:.1f}%")
    print(f"  Days until exhaustion if sustained: {config['exhaustion_days']} days")
    print()

# Output:
# === Burn Rate Alert Configuration ===
#
# Alert Level: Severe (Page)
#   Burn Rate Threshold: 15.0x
#   Short Window: 1 hours
#   Long Window: 6 hours
#   Budget consumed at detection: 2.1%
#   Days until exhaustion if sustained: 2 days
#
# Alert Level: Significant (Page)
#   Burn Rate Threshold: 6.0x
#   Short Window: 6 hours
#   Long Window: 24 hours
#   Budget consumed at detection: 5.0%
#   Days until exhaustion if sustained: 5 days
#
# Alert Level: Moderate (Ticket)
#   Burn Rate Threshold: 3.0x
#   Short Window: 24 hours
#   Long Window: 72 hours
#   Budget consumed at detection: 10.0%
#   Days until exhaustion if sustained: 10 days
```

Translating burn rate theory into practice requires appropriate queries, alert rules, and integration with your observability stack. Here's how to implement burn rate alerting with common tools:
Prometheus/PromQL Implementation:
For an SLI measured as a ratio of successful requests to total requests:
```yaml
# Prometheus Alerting Rules for Burn Rate
# Based on multi-window, multi-burn-rate strategy

groups:
  - name: slo-burn-rate-alerts
    rules:
      # ==========================================
      # Define recording rules for cleaner alerts
      # ==========================================

      # Calculate error rate over 5 minutes
      - record: slo:error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Calculate error rate over 1 hour
      - record: slo:error_rate:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)

      # Calculate error rate over 6 hours
      - record: slo:error_rate:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
          /
          sum(rate(http_requests_total[6h])) by (service)

      # SLO target (99.9% = 0.001 error budget)
      - record: slo:error_budget:ratio
        expr: vector(0.001)

      # ==========================================
      # Burn rate calculation
      # ==========================================

      # 5-minute burn rate
      - record: slo:burn_rate:ratio_rate5m
        expr: |
          slo:error_rate:ratio_rate5m
          / on() group_left
          slo:error_budget:ratio

      # 1-hour burn rate
      - record: slo:burn_rate:ratio_rate1h
        expr: |
          slo:error_rate:ratio_rate1h
          / on() group_left
          slo:error_budget:ratio

      # 6-hour burn rate
      - record: slo:burn_rate:ratio_rate6h
        expr: |
          slo:error_rate:ratio_rate6h
          / on() group_left
          slo:error_budget:ratio

      # NOTE: the 30m, 2h, and 24h error-rate and burn-rate recording rules
      # referenced by the alerts below follow the same pattern as above.

      # ==========================================
      # Multi-window burn rate alerts
      # ==========================================

      # SEVERE: 14.4x burn rate = budget exhausts in ~2 days
      # Short window: 5 min, Long window: 1 hour
      - alert: SLOBurnRateSevere
        expr: |
          slo:burn_rate:ratio_rate5m > 14.4
          and
          slo:burn_rate:ratio_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }}: High SLO burn rate (severe)"
          description: |
            Service {{ $labels.service }} is burning error budget at
            {{ $value | printf "%.1f" }}x the sustainable rate.
            At this rate, the entire monthly budget will be exhausted in ~2 days.
            Immediate investigation required.
          runbook_url: https://runbooks.example.com/slo-burn-rate-severe

      # SIGNIFICANT: 6x burn rate = budget exhausts in ~5 days
      # Short window: 30 min, Long window: 6 hours
      - alert: SLOBurnRateSignificant
        expr: |
          slo:burn_rate:ratio_rate30m > 6
          and
          slo:burn_rate:ratio_rate6h > 6
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }}: Elevated SLO burn rate (significant)"
          description: |
            Service {{ $labels.service }} is burning error budget at
            {{ $value | printf "%.1f" }}x the sustainable rate.
            At this rate, the entire monthly budget will be exhausted in ~5 days.
            Investigation needed today.

      # MODERATE: 3x burn rate = budget exhausts in ~10 days
      # Short window: 2 hours, Long window: 24 hours
      - alert: SLOBurnRateModerate
        expr: |
          slo:burn_rate:ratio_rate2h > 3
          and
          slo:burn_rate:ratio_rate24h > 3
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.service }}: Slow SLO burn (moderate)"
          description: |
            Service {{ $labels.service }} is slowly burning error budget at
            {{ $value | printf "%.1f" }}x sustainable.
            At this rate, budget will be exhausted in ~10 days.
            Create ticket for investigation this week.
```

Key implementation notes:
Recording rules: Pre-compute error rates and burn rates as recording rules. This improves alert evaluation performance and enables cleaner alert expressions.
The for clause: Require burn rate to be sustained for a minimum duration before alerting. This filters transient spikes that self-resolve.
Label inheritance: Use group_left or similar to ensure service labels propagate correctly for alert routing.
Multiple SLIs: Create parallel rules for each SLI (availability, latency, error rate) and alert on any exceeding thresholds.
Alert fatigue prevention: Start with higher thresholds (more conservative) and tighten as you validate signal quality.
Tools like Sloth (github.com/slok/sloth), Pyrra, and Google Cloud SLO Monitoring can generate burn rate alerting rules automatically from SLO definitions. This reduces the manual configuration burden and ensures mathematical consistency. Consider using these tools rather than handcrafting all rules.
Burn rate alerts require different response strategies than traditional "something is broken" alerts. The signal isn't that the system is down—it's that the system is consuming reliability budget at an unsustainable pace. Response must be calibrated to severity.
Response by severity level:
| Severity | Burn Rate | Time to Exhaust | Response |
|---|---|---|---|
| Critical | > 14x | < 2 days | Page immediately, incident response |
| Warning | 6-14x | 2-5 days | Page, investigate immediately |
| Info | 3-6x | 5-10 days | Ticket, investigate same day |
| Low | 1-3x | 10-30 days | Ticket, investigate this week |
The burn rate investigation framework:
When a burn rate alert fires, follow this systematic investigation:
1. Validate the signal (2-5 minutes)
2. Assess impact and trajectory (5-10 minutes)
3. Identify the cause (10-30 minutes)
4. Decide on intervention (immediate)
The most insidious issues are slow burns that don't feel urgent in the moment. A 3x burn rate gives you 10 days—it doesn't feel like an emergency. But those 10 days pass quickly, and slow burns often mask systemic issues. Treat low-severity burn rate alerts as genuine work items requiring attention, not dismissable noise.
Burn rate alerting is not "set and forget." Initial configurations require iteration based on real-world signal quality. Here's how to tune your alerts:
Signal quality assessment:
Track for each alert configuration: how often it fires, what fraction of firings led to real investigation or remediation (precision), whether any genuine budget-threatening incidents went undetected (recall), and how long alerts take to acknowledge and resolve.
Target metrics: every page should be actionable, false positives should be rare, and no incident that meaningfully consumed budget should have gone unalerted.
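As a sketch of what this tracking can look like (the alert log structure and values here are hypothetical; the rule names match the Prometheus examples earlier), per-rule precision falls out of a simple record of firings:

```python
from collections import defaultdict

# Hypothetical alert log: (rule name, did the firing lead to real action?)
alert_log = [
    ("SLOBurnRateSevere", True),
    ("SLOBurnRateSevere", True),
    ("SLOBurnRateModerate", False),
    ("SLOBurnRateModerate", True),
    ("SLOBurnRateModerate", False),
]

counts = defaultdict(lambda: {"fired": 0, "actionable": 0})
for rule, actionable in alert_log:
    counts[rule]["fired"] += 1
    counts[rule]["actionable"] += int(actionable)

for rule, c in counts.items():
    precision = c["actionable"] / c["fired"]
    print(f"{rule}: fired {c['fired']}x, precision {precision:.0%}")

# Output:
# SLOBurnRateSevere: fired 2x, precision 100%
# SLOBurnRateModerate: fired 3x, precision 33%
```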
Iterative tuning process:
Week 1-2: Observation
Week 3-4: First adjustment
Ongoing: Continuous refinement
Service-specific considerations:
Different services may need different sensitivity: user-facing, revenue-critical services justify tighter thresholds and paging, while internal tools and batch systems can usually tolerate slower, ticket-based response.
Treat alert tuning as ongoing maintenance, not a one-time project. Budget 1-2 hours per month per critical service for alert review and adjustment. This investment pays off in higher signal quality and reduced alert fatigue.
Beyond basic multi-window alerting, sophisticated organizations implement additional patterns for more nuanced budget management:
Pattern 1: Burn rate forecasting
Rather than alerting when burn rate exceeds a threshold, predict when budget will exhaust based on current trajectory:
Predicted Exhaustion Date = Current Date + (Remaining Budget Fraction × SLO Window) / Current Burn Rate
Alert when predicted exhaustion is within the current SLO period. This catches slow burns earlier than threshold-based detection.
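A minimal forecasting sketch, assuming you can already query the remaining budget fraction and the recent burn rate from your monitoring stack:

```python
from datetime import datetime, timedelta

def predicted_exhaustion(
    remaining_budget_fraction: float,  # e.g. 0.6 means 60% of the budget is left
    current_burn_rate: float,          # multiple of the sustainable rate
    slo_window_days: int = 30,
) -> datetime:
    """Project when the error budget runs out if the current burn rate holds."""
    days_to_exhaustion = remaining_budget_fraction * slo_window_days / current_burn_rate
    return datetime.now() + timedelta(days=days_to_exhaustion)

# 60% of budget left, burning at 3x sustainable: 0.6 * 30 / 3 = 6 days to exhaustion
when = predicted_exhaustion(remaining_budget_fraction=0.6, current_burn_rate=3.0)
print(f"Budget projected to exhaust around {when:%Y-%m-%d}")
```

In this example a 3x burn rate, which on its own is only ticket severity, still gets surfaced because exhaustion lands six days out, well inside the 30-day window.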
Pattern 2: Trend-adjusted burn rate
If burn rate is increasing, a current rate of 3x might soon become 6x. Calculate burn rate acceleration:
Burn Acceleration = (Current Burn Rate - Previous Period Burn Rate) / Time Delta
Alert on positive acceleration even if absolute burn rate is acceptable—the situation is getting worse.
Pattern 3: Remaining budget alerting
Complement burn rate alerts with absolute budget-remaining alerts: for example, notify when the remaining budget drops below 50%, and escalate again at 25% and 10%.
This catches cases where many small burns have cumulatively depleted budget without any single incident being alarming.
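A sketch of the budget-remaining check for a request-based SLI, using the example thresholds above (the traffic numbers are hypothetical):

```python
def remaining_budget_fraction(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent for a request-based SLI."""
    allowed_bad = (1 - slo_target) * total_requests
    return 1 - (bad_requests / allowed_bad)

# Partway through the window: 280,000 failed requests out of 400 million served
remaining = remaining_budget_fraction(280_000, 400_000_000, slo_target=0.999)
print(f"Error budget remaining: {remaining:.0%}")

for threshold in (0.50, 0.25, 0.10):
    if remaining < threshold:
        print(f"Remaining budget below {threshold:.0%}, notify the owning team")

# Output:
# Error budget remaining: 30%
# Remaining budget below 50%, notify the owning team
```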
The standard multi-window approach covers most cases well. Add advanced patterns only when you have specific gaps the basic approach doesn't address. Each additional pattern adds complexity to understand, maintain, and debug. Complexity is a cost; ensure it provides value.
Burn rate alerting transforms SLO monitoring from passive observation to active management. By alerting on the rate of budget consumption rather than instantaneous state, you gain the ability to intervene before SLO violations occur.
You now understand burn rate alerting comprehensively—the mathematics, implementation strategies, response frameworks, and tuning practices. Next, we'll explore SLO-based alerting more broadly, covering how to integrate SLO awareness into your overall alerting strategy beyond just burn rate.