Traditional threshold-based alerting answers a simple question: "Is the system broken right now?" If error rate exceeds 5%, fire an alert. If latency exceeds 500ms, fire an alert. This works for catching acute failures but fails catastrophically for SLO-based reliability management.
The problem:
Imagine a service with a 99.9% availability SLO over 30 days. The error budget is roughly 43 minutes of downtime. Traditional alerting faces a dilemma: set thresholds tight enough to catch every budget-relevant error and you page constantly on brief blips that barely dent those 43 minutes; set them loose enough to stay quiet and a modest but steady error rate can drain the entire budget without ever crossing a threshold.
Neither approach answers the question you actually need answered: "At the current rate of error budget consumption, will we violate our SLO?"
Enter burn rate alerting.
Burn rate alerting doesn't ask whether the system is broken now. It asks whether the rate of error budget consumption will lead to SLO violation if it continues. This subtle but powerful reframing enables proactive intervention based on trajectory rather than current state.
By the end of this page, you'll understand burn rate mathematics, how to calculate appropriate burn rate thresholds, multi-window alerting strategies, alert configuration best practices, and how to respond to burn rate alerts effectively. You'll learn the alerting strategies pioneered by Google SRE and adopted across the industry.
Burn rate is the speed at which you're consuming your error budget, expressed as a multiple of the sustainable consumption rate.
The fundamental formula:
Burn Rate = Actual Error Rate / Allowed Error Rate
Where the actual error rate is what you observe over the measurement window, and the allowed error rate is the error budget expressed as a rate (1 − SLO target, e.g. 0.1% for a 99.9% SLO).
Example calculation:
For a service with a 99.9% SLO (0.1% error budget), an observed error rate of 0.5% yields a burn rate of 0.5% / 0.1% = 5: the budget is being consumed five times faster than sustainable. The table below shows how different burn rates play out:
| Burn Rate | Current Error Rate | Budget Impact | Time to Exhaust if Sustained |
|---|---|---|---|
| 1.0 | 0.1% | Sustainable consumption | 30 days (within budget) |
| 2.0 | 0.2% | Consuming 2x faster | 15 days |
| 5.0 | 0.5% | Consuming 5x faster | 6 days |
| 10.0 | 1.0% | Consuming 10x faster | 3 days |
| 14.4 | 1.44% | Consuming 14.4x faster | ~2 days |
| 36.0 | 3.6% | Consuming 36x faster | ~20 hours |
| 720.0 | 72% | Near-total outage | 1 hour |
Intuition for burn rate:
Think of the error budget as a fuel tank and burn rate as your consumption speed: at 1x you finish the 30-day period with the tank exactly empty, at 2x you run dry halfway through, and at 720x a full month's fuel is gone in a single hour.
Why burn rate beats simple thresholds:
Traditional alerting: "Error rate is 0.5%—is that bad?"
The answer depends entirely on context: against a 99% SLO (1% budget) it is only half the sustainable rate, against a 99.9% SLO it is a 5x burn, and against a 99.99% SLO it is a 50x emergency.
Burn rate normalizes across different SLO targets. A burn rate of 5 means "5x sustainable" regardless of whether you're managing a 99% or 99.99% service.
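To make the normalization concrete, here's a small sketch that evaluates the same 0.5% error rate against three different SLO targets:

```python
def burn_rate(actual_error_rate: float, slo_target: float) -> float:
    """Burn rate = actual error rate / allowed error rate (1 - SLO target)."""
    return actual_error_rate / (1 - slo_target)

actual = 0.005  # 0.5% of requests failing
for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: burn rate {burn_rate(actual, slo):.1f}x")

# Output:
# SLO 99.00%: burn rate 0.5x
# SLO 99.90%: burn rate 5.0x
# SLO 99.99%: burn rate 50.0x
```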
Services often have multiple SLIs (availability, latency, error rate). Calculate burn rate separately for each, then alert on the highest burn rate across all SLIs. This ensures you catch whichever dimension is consuming budget fastest.
The naive approach to burn rate alerting uses a single measurement window. If burn rate over the last hour exceeds some threshold, alert. This is better than threshold alerting but has significant limitations.
The single-window problem:
Consider alerting when the 1-hour burn rate exceeds 10x: you detect severe incidents quickly, but a short spike that self-resolves in a few minutes still pages someone, and a steady 5x burn that would exhaust the budget in six days never fires at all.
Or, alerting when the 24-hour burn rate exceeds 2x: slow burns are eventually caught, but a moderately serious incident can take hours to drag the 24-hour average over the threshold, and once triggered the alert keeps firing long after the problem is fixed because the window takes many hours to drain.
Short windows are responsive but noisy. Long windows are stable but slow. You cannot optimize for both with a single measurement. The solution is multi-window alerting: using multiple time windows simultaneously to capture different incident profiles.
Multi-window alerting strategy:
The industry-standard approach (pioneered by Google and documented in the SRE Workbook) uses pairs of windows: a long window that confirms the problem is sustained, and a much shorter window that confirms budget is still being burned right now, so the alert stops firing promptly once the problem is fixed.
An alert fires only when BOTH conditions are true: the burn rate measured over the short window exceeds the threshold, and the burn rate measured over the long window exceeds the same threshold.
This reduces false positives from brief spikes while ensuring sustained problems are caught quickly.
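As a minimal sketch in plain Python (with illustrative burn rate values, not tied to any particular monitoring stack), the firing condition for a single tier looks like this:

```python
def should_fire(short_window_burn: float, long_window_burn: float, threshold: float) -> bool:
    """Fire only when BOTH the short- and long-window burn rates exceed the threshold.

    The long window confirms the problem is sustained; the short window lets the
    alert stop firing quickly once the budget is no longer being burned.
    """
    return short_window_burn > threshold and long_window_burn > threshold

# Severe tier from the table below: 14.4x over both 5 minutes and 1 hour
print(should_fire(short_window_burn=22.0, long_window_burn=3.1, threshold=14.4))   # False: brief spike
print(should_fire(short_window_burn=22.0, long_window_burn=18.5, threshold=14.4))  # True: sustained burn
```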
The standard multi-window configuration:
Google's SRE Workbook recommends these window pairs:
| Short Window | Long Window | Burn Rate Threshold | Time to Exhaust Budget | Use Case |
|---|---|---|---|---|
| 5 minutes | 1 hour | 14.4x | ~2 days | Page: Severe, urgent |
| 30 minutes | 6 hours | 6x | 5 days | Page: Significant, needs attention |
| 2 hours | 24 hours | 3x | 10 days | Ticket: Slow burn, investigate |
| 6 hours | 3 days | 1x | 30 days | Ticket: Tracking, might become issue |
The burn rate thresholds are calculated such that if the burn rate is sustained, budget would exhaust in the specified time.
Setting appropriate burn rate thresholds requires balancing detection speed against false positives. Here's the mathematical framework:
The core relationship:
For a 30-day SLO window, if you want to be alerted when budget consumption would exhaust in D days:
Burn Rate Threshold = 30 / D
Examples:
| Days to Budget Exhaustion | Burn Rate Threshold | Alert Urgency Level | Typical Response |
|---|---|---|---|
| 1 day | 30.0 | Critical/Page immediately | All hands, major incident |
| 2 days | 15.0 | High/Page | Immediate investigation |
| 5 days | 6.0 | Medium/Page | Same-day investigation |
| 10 days | 3.0 | Low/Ticket | Investigate this week |
| 15 days | 2.0 | Info/Ticket | Monitor, potential issue |
| 30 days | 1.0 | Warning | On track to exceed budget |
Window sizing for thresholds:
Once you've chosen your burn rate thresholds, you need to size your measurement windows appropriately.
The detection time formula:
Detection Time (complete outage) ≈ (Burn Rate Threshold / Maximum Burn Rate) × Long Window
where the maximum burn rate is 1 / (1 − SLO target), i.e. 1000x for a 99.9% SLO. For the severe tier (14.4x threshold, 1-hour long window), a total outage is detected in roughly (14.4 / 1000) × 60 ≈ 0.9 minutes.
But more practically, windows should be sized to catch the percentage of budget consumption that matters:
Rule of thumb for window sizing: keep the short window at roughly 1/12 of the long window (5 minutes with 1 hour, 30 minutes with 6 hours, 2 hours with 24 hours), and size the long window so that only a low single-digit percentage of the budget can be consumed before the alert fires.
The trade-off:
Most teams accept 2-5% budget consumption before the first alert fires, trading some budget loss for reduced false positives.
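To make the window-size trade-off concrete, here's a small sketch, using the standard tiers from the table above and assuming a 30-day, 99.9% SLO, of how long each alert takes to fire during a complete outage:

```python
# Detection time during a complete (100% error) outage, for a 99.9% SLO.
# The long window is the binding constraint; the alert's `for` duration is ignored.

MAX_BURN_RATE = 1 / (1 - 0.999)  # ~1000x: every single request failing

tiers = [
    ("Severe", 14.4, 60),        # (name, burn rate threshold, long window in minutes)
    ("Significant", 6.0, 360),
    ("Moderate", 3.0, 1440),
]

for name, threshold, long_window_minutes in tiers:
    # Time for the windowed burn rate to climb up to the threshold
    detection_minutes = (threshold / MAX_BURN_RATE) * long_window_minutes
    print(f"{name}: fires after ~{detection_minutes:.1f} minutes of total outage")

# Output:
# Severe: fires after ~0.9 minutes of total outage
# Significant: fires after ~2.2 minutes of total outage
# Moderate: fires after ~4.3 minutes of total outage
```

The calculator below goes a step further and reports how much of the monthly budget has been consumed by the time each alert can fire.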
```python
# Burn Rate Threshold Calculator
# Helps determine appropriate thresholds for your SLO configuration

def calculate_burn_rate_threshold(slo_window_days: int, exhaustion_days: float) -> float:
    """
    Calculate the burn rate threshold that would exhaust budget in given days.

    Args:
        slo_window_days: Your SLO evaluation window (typically 28 or 30)
        exhaustion_days: Days until budget exhaustion at this rate

    Returns:
        Burn rate threshold value
    """
    return slo_window_days / exhaustion_days

def calculate_detection_percentage(
    burn_rate: float,
    short_window_hours: float,
    slo_window_days: int,
) -> float:
    """
    Calculate what percentage of budget could be consumed before detection.

    Args:
        burn_rate: The burn rate threshold being used
        short_window_hours: Short window size in hours
        slo_window_days: Your SLO evaluation window in days

    Returns:
        Percentage of budget potentially consumed before alert
    """
    slo_window_hours = slo_window_days * 24
    return (burn_rate * short_window_hours / slo_window_hours) * 100

# Example: Standard multi-window configuration for 30-day SLO
slo_window = 30  # days

print("=== Burn Rate Alert Configuration ===\n")

configurations = [
    {"name": "Severe (Page)", "exhaustion_days": 2, "short_hours": 1, "long_hours": 6},
    {"name": "Significant (Page)", "exhaustion_days": 5, "short_hours": 6, "long_hours": 24},
    {"name": "Moderate (Ticket)", "exhaustion_days": 10, "short_hours": 24, "long_hours": 72},
]

for config in configurations:
    threshold = calculate_burn_rate_threshold(slo_window, config["exhaustion_days"])
    budget_at_detection = calculate_detection_percentage(
        threshold, config["short_hours"], slo_window
    )
    print(f"Alert Level: {config['name']}")
    print(f"  Burn Rate Threshold: {threshold:.1f}x")
    print(f"  Short Window: {config['short_hours']} hours")
    print(f"  Long Window: {config['long_hours']} hours")
    print(f"  Budget consumed at detection: {budget_at_detection:.1f}%")
    print(f"  Days until exhaustion if sustained: {config['exhaustion_days']} days")
    print()

# Output:
# === Burn Rate Alert Configuration ===
#
# Alert Level: Severe (Page)
#   Burn Rate Threshold: 15.0x
#   Short Window: 1 hours
#   Long Window: 6 hours
#   Budget consumed at detection: 2.1%
#   Days until exhaustion if sustained: 2 days
#
# Alert Level: Significant (Page)
#   Burn Rate Threshold: 6.0x
#   Short Window: 6 hours
#   Long Window: 24 hours
#   Budget consumed at detection: 5.0%
#   Days until exhaustion if sustained: 5 days
#
# Alert Level: Moderate (Ticket)
#   Burn Rate Threshold: 3.0x
#   Short Window: 24 hours
#   Long Window: 72 hours
#   Budget consumed at detection: 10.0%
#   Days until exhaustion if sustained: 10 days
```

Translating burn rate theory into practice requires appropriate queries, alert rules, and integration with your observability stack. Here's how to implement burn rate alerting with common tools:
Prometheus/PromQL Implementation:
For an SLI measured as a ratio of successful requests to total requests:
```yaml
# Prometheus Alerting Rules for Burn Rate
# Based on multi-window, multi-burn-rate strategy

groups:
  - name: slo-burn-rate-alerts
    rules:
      # ==========================================
      # Define recording rules for cleaner alerts
      # ==========================================

      # Calculate error rate over 5 minutes
      - record: slo:error_rate:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)

      # Calculate error rate over 1 hour
      - record: slo:error_rate:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)

      # Calculate error rate over 6 hours
      - record: slo:error_rate:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
          /
          sum(rate(http_requests_total[6h])) by (service)

      # SLO target (99.9% = 0.001 error budget)
      - record: slo:error_budget:ratio
        expr: vector(0.001)

      # ==========================================
      # Burn rate calculation
      # ==========================================

      # 5-minute burn rate
      - record: slo:burn_rate:ratio_rate5m
        expr: |
          slo:error_rate:ratio_rate5m
          / on() group_left
          slo:error_budget:ratio

      # 1-hour burn rate
      - record: slo:burn_rate:ratio_rate1h
        expr: |
          slo:error_rate:ratio_rate1h
          / on() group_left
          slo:error_budget:ratio

      # 6-hour burn rate
      - record: slo:burn_rate:ratio_rate6h
        expr: |
          slo:error_rate:ratio_rate6h
          / on() group_left
          slo:error_budget:ratio

      # NOTE: the 30m, 2h, and 24h error-rate and burn-rate recording rules
      # referenced by the alerts below follow the same pattern as above.

      # ==========================================
      # Multi-window burn rate alerts
      # ==========================================

      # SEVERE: 14.4x burn rate = budget exhausts in ~2 days
      # Short window: 5 min, Long window: 1 hour
      - alert: SLOBurnRateSevere
        expr: |
          slo:burn_rate:ratio_rate5m > 14.4
          and
          slo:burn_rate:ratio_rate1h > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service }}: High SLO burn rate (severe)"
          description: |
            Service {{ $labels.service }} is burning error budget at
            {{ $value | printf "%.1f" }}x the sustainable rate.
            At this rate, the entire monthly budget will be exhausted in ~2 days.
            Immediate investigation required.
          runbook_url: https://runbooks.example.com/slo-burn-rate-severe

      # SIGNIFICANT: 6x burn rate = budget exhausts in ~5 days
      # Short window: 30 min, Long window: 6 hours
      - alert: SLOBurnRateSignificant
        expr: |
          slo:burn_rate:ratio_rate30m > 6
          and
          slo:burn_rate:ratio_rate6h > 6
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }}: Elevated SLO burn rate (significant)"
          description: |
            Service {{ $labels.service }} is burning error budget at
            {{ $value | printf "%.1f" }}x the sustainable rate.
            At this rate, the entire monthly budget will be exhausted in ~5 days.
            Investigation needed today.

      # MODERATE: 3x burn rate = budget exhausts in ~10 days
      # Short window: 2 hours, Long window: 24 hours
      - alert: SLOBurnRateModerate
        expr: |
          slo:burn_rate:ratio_rate2h > 3
          and
          slo:burn_rate:ratio_rate24h > 3
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.service }}: Slow SLO burn (moderate)"
          description: |
            Service {{ $labels.service }} is slowly burning error budget at
            {{ $value | printf "%.1f" }}x sustainable.
            At this rate, budget will be exhausted in ~10 days.
            Create ticket for investigation this week.
```

Key implementation notes:
Recording rules: Pre-compute error rates and burn rates as recording rules. This improves alert evaluation performance and enables cleaner alert expressions.
The for clause: Require burn rate to be sustained for a minimum duration before alerting. This filters transient spikes that self-resolve.
Label inheritance: Use group_left or similar to ensure service labels propagate correctly for alert routing.
Multiple SLIs: Create parallel rules for each SLI (availability, latency, error rate) and alert on any exceeding thresholds.
Alert fatigue prevention: Start with higher thresholds (more conservative) and tighten as you validate signal quality.
Tools like Sloth (github.com/slok/sloth), Pyrra, and Google Cloud SLO Monitoring can generate burn rate alerting rules automatically from SLO definitions. This reduces the manual configuration burden and ensures mathematical consistency. Consider using these tools rather than handcrafting all rules.
Burn rate alerts require different response strategies than traditional "something is broken" alerts. The signal isn't that the system is down—it's that the system is consuming reliability budget at an unsustainable pace. Response must be calibrated to severity.
Response by severity level:
| Severity | Burn Rate | Time to Exhaust | Response |
|---|---|---|---|
| Critical | > 14x | < 2 days | Page immediately, incident response |
| Warning | 6-14x | 2-5 days | Page, investigate immediately |
| Info | 3-6x | 5-10 days | Ticket, investigate same day |
| Low | 1-3x | 10-30 days | Ticket, investigate this week |
The burn rate investigation framework:
When a burn rate alert fires, follow this systematic investigation:
1. Validate the signal (2-5 minutes)
2. Assess impact and trajectory (5-10 minutes)
3. Identify the cause (10-30 minutes)
4. Decide on intervention (immediate)
The most insidious issues are slow burns that don't feel urgent in the moment. A 3x burn rate gives you 10 days—it doesn't feel like an emergency. But those 10 days pass quickly, and slow burns often mask systemic issues. Treat low-severity burn rate alerts as genuine work items requiring attention, not dismissable noise.
Burn rate alerting is not "set and forget." Initial configurations require iteration based on real-world signal quality. Here's how to tune your alerts:
Signal quality assessment:
Track for each alert configuration: how often it fires, what fraction of firings led to real investigation or remediation (precision), whether any genuine budget-threatening incidents went undetected (recall), and how long alerts take to acknowledge and resolve.
Target metrics: every page should be actionable, false positives should be rare, and no incident that meaningfully consumed budget should have gone unalerted.
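As a sketch of what this tracking can look like (the alert log structure and values here are hypothetical; the rule names match the Prometheus examples earlier), per-rule precision falls out of a simple record of firings:

```python
from collections import defaultdict

# Hypothetical alert log: (rule name, did the firing lead to real action?)
alert_log = [
    ("SLOBurnRateSevere", True),
    ("SLOBurnRateSevere", True),
    ("SLOBurnRateModerate", False),
    ("SLOBurnRateModerate", True),
    ("SLOBurnRateModerate", False),
]

counts = defaultdict(lambda: {"fired": 0, "actionable": 0})
for rule, actionable in alert_log:
    counts[rule]["fired"] += 1
    counts[rule]["actionable"] += int(actionable)

for rule, c in counts.items():
    precision = c["actionable"] / c["fired"]
    print(f"{rule}: fired {c['fired']}x, precision {precision:.0%}")

# Output:
# SLOBurnRateSevere: fired 2x, precision 100%
# SLOBurnRateModerate: fired 3x, precision 33%
```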
Iterative tuning process:
Week 1-2: Observation
Week 3-4: First adjustment
Ongoing: Continuous refinement
Service-specific considerations:
Different services may need different sensitivity: user-facing, revenue-critical services justify tighter thresholds and paging, while internal tools and batch systems can usually tolerate slower, ticket-based response.
Treat alert tuning as ongoing maintenance, not a one-time project. Budget 1-2 hours per month per critical service for alert review and adjustment. This investment pays off in higher signal quality and reduced alert fatigue.
Beyond basic multi-window alerting, sophisticated organizations implement additional patterns for more nuanced budget management:
Pattern 1: Burn rate forecasting
Rather than alerting when burn rate exceeds a threshold, predict when budget will exhaust based on current trajectory:
Predicted Exhaustion Date = Current Date + (Remaining Budget Fraction × SLO Window) / Current Burn Rate
Alert when predicted exhaustion is within the current SLO period. This catches slow burns earlier than threshold-based detection.
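A minimal forecasting sketch, assuming you can already query the remaining budget fraction and the recent burn rate from your monitoring stack:

```python
from datetime import datetime, timedelta

def predicted_exhaustion(
    remaining_budget_fraction: float,  # e.g. 0.6 means 60% of the budget is left
    current_burn_rate: float,          # multiple of the sustainable rate
    slo_window_days: int = 30,
) -> datetime:
    """Project when the error budget runs out if the current burn rate holds."""
    days_to_exhaustion = remaining_budget_fraction * slo_window_days / current_burn_rate
    return datetime.now() + timedelta(days=days_to_exhaustion)

# 60% of budget left, burning at 3x sustainable: 0.6 * 30 / 3 = 6 days to exhaustion
when = predicted_exhaustion(remaining_budget_fraction=0.6, current_burn_rate=3.0)
print(f"Budget projected to exhaust around {when:%Y-%m-%d}")
```

In this example a 3x burn rate, which on its own is only ticket severity, still gets surfaced because exhaustion lands six days out, well inside the 30-day window.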
Pattern 2: Trend-adjusted burn rate
If burn rate is increasing, a current rate of 3x might soon become 6x. Calculate burn rate acceleration:
Burn Acceleration = (Current Burn Rate - Previous Period Burn Rate) / Time Delta
Alert on positive acceleration even if absolute burn rate is acceptable—the situation is getting worse.
Pattern 3: Remaining budget alerting
Complement burn rate alerts with absolute budget-remaining alerts: for example, notify when the remaining budget drops below 50%, and escalate again at 25% and 10%.
This catches cases where many small burns have cumulatively depleted budget without any single incident being alarming.
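A sketch of the budget-remaining check for a request-based SLI, using the example thresholds above (the traffic numbers are hypothetical):

```python
def remaining_budget_fraction(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent for a request-based SLI."""
    allowed_bad = (1 - slo_target) * total_requests
    return 1 - (bad_requests / allowed_bad)

# Partway through the window: 280,000 failed requests out of 400 million served
remaining = remaining_budget_fraction(280_000, 400_000_000, slo_target=0.999)
print(f"Error budget remaining: {remaining:.0%}")

for threshold in (0.50, 0.25, 0.10):
    if remaining < threshold:
        print(f"Remaining budget below {threshold:.0%}, notify the owning team")

# Output:
# Error budget remaining: 30%
# Remaining budget below 50%, notify the owning team
```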
The standard multi-window approach covers most cases well. Add advanced patterns only when you have specific gaps the basic approach doesn't address. Each additional pattern adds complexity to understand, maintain, and debug. Complexity is a cost; ensure it provides value.
Burn rate alerting transforms SLO monitoring from passive observation to active management. By alerting on the rate of budget consumption rather than instantaneous state, you gain the ability to intervene before SLO violations occur.
You now understand burn rate alerting comprehensively—the mathematics, implementation strategies, response frameworks, and tuning practices. Next, we'll explore SLO-based alerting more broadly, covering how to integrate SLO awareness into your overall alerting strategy beyond just burn rate.