For decades, reliability engineering was trapped in a false dichotomy: either you prioritize reliability and sacrifice development velocity, or you move fast and accept that things will break. Teams faced constant tension between "shipping features" and "keeping the site up," with neither side having objective criteria for resolution.
Error budgets changed everything.
An error budget is the mathematical inverse of an SLO target. If your SLO is 99.9% availability, your error budget is 0.1%—the amount of unreliability you're explicitly permitted before violating your commitment. This seemingly simple reframing has profound implications:
Error budgets are the mechanism by which SLOs become actionable. Without them, SLOs are just numbers on a dashboard. With them, SLOs drive organizational behavior.
By the end of this page, you'll understand error budgets deeply: how to calculate them, how to track consumption, how to use them for decision-making, and how to build organizational processes around them. You'll learn the error budget policies used at Google, Spotify, and other SRE leaders to balance innovation with stability.
At its core, an error budget is simply the inverse of your SLO target, expressed as a quantity of allowable failure over your SLO window. Understanding the mathematics enables precise tracking and decision-making.
The fundamental formula:
Error Budget = 100% - SLO Target
For an SLO of 99.9% availability:
Error Budget = 100% - 99.9% = 0.1%
This percentage becomes concrete when applied to your SLO window and measured in practical units:
Time-based calculation (for availability SLOs):
| SLO Target | Error Budget % | Allowed Downtime (30 days) | Allowed Downtime (per day) |
|---|---|---|---|
| 99.0% | 1.0% | 7 hours 12 minutes | ~14.4 minutes |
| 99.5% | 0.5% | 3 hours 36 minutes | ~7.2 minutes |
| 99.9% | 0.1% | 43.2 minutes | ~1.44 minutes |
| 99.95% | 0.05% | 21.6 minutes | ~43 seconds |
| 99.99% | 0.01% | 4.32 minutes | ~8.6 seconds |
| 99.999% | 0.001% | 26 seconds | ~0.86 seconds |
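The table above can be reproduced with a few lines of arithmetic. The sketch below converts an SLO target into allowed downtime over a window; the function name and the 30-day default are illustrative choices, not a standard API.

```python
# Convert an SLO target into an error budget, expressed as allowed
# downtime over a rolling window. Mirrors the table above.

def allowed_downtime_seconds(slo_target: float, window_days: float = 30) -> float:
    """Seconds of downtime permitted over the window for a given SLO (e.g. 99.9)."""
    error_budget = (100.0 - slo_target) / 100.0
    return error_budget * window_days * 24 * 60 * 60

for slo in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_seconds(slo) / 60
    print(f"{slo}% SLO -> {minutes:.2f} minutes per 30 days")
```

For 99.9% this yields 43.20 minutes, matching the table.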
Request-based calculation (for latency/error rate SLOs):
For SLOs measured in requests rather than time, the error budget is expressed as a count of allowable bad requests:
Allowable Bad Requests = Total Requests × Error Budget %
For a service handling 10 million requests per month with a 99.9% success rate SLO:
Allowable Bad Requests = 10,000,000 × 0.001 = 10,000 requests
This means you can have up to 10,000 failed requests before violating your SLO.
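The request-based version is the same arithmetic applied to a request count. One practical detail worth encoding: rounding, rather than truncation, avoids floating-point artifacts (`100.0 - 99.9` is not exactly `0.1` in binary floating point).

```python
def allowable_bad_requests(total_requests: int, slo_target: float) -> int:
    """Count of failed requests tolerable before the SLO is violated.

    round() rather than int() guards against floating-point error,
    e.g. (100.0 - 99.9) / 100.0 is slightly below 0.001.
    """
    return round(total_requests * (100.0 - slo_target) / 100.0)

print(allowable_bad_requests(10_000_000, 99.9))  # 10000
```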
The rolling window consideration:
Most error budgets use rolling windows (typically 28 or 30 days) rather than calendar periods. This matters because:
- 28-day windows are common because they contain exactly 4 weeks, eliminating day-of-week variance.
- 30-day windows are simpler to communicate.
- Shorter windows (7 days) provide faster feedback but more noise.
- Longer windows (90 days) provide stability but delay signals.

Match window length to your deployment cadence and decision-making rhythm.
Tracking budget consumption:
Error budget consumption is typically expressed as a percentage of total budget used:
Budget Consumed = (Actual Errors / Allowable Errors) × 100%
Or equivalently:
Budget Consumed = (Actual Downtime / Allowable Downtime) × 100%
Tracking this over time creates an error budget burn chart—a visualization showing how quickly budget is being consumed and projecting when exhaustion might occur. This is analogous to financial burn rate for startups, applied to reliability.
Example scenario:
A service with a 99.9% SLO has a 43.2-minute monthly error budget. After 15 days, suppose 20 minutes of downtime have accrued: roughly 46% of the budget is consumed, and at that burn rate the budget would be exhausted around day 32. The service is on track to survive the window, but with little slack left for a major incident.
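The scenario arithmetic can be sketched directly. The 20-minutes-after-15-days figures are illustrative assumptions, and the projection naively extrapolates the current burn rate.

```python
# Burn projection for the 99.9% scenario: 43.2-minute budget,
# an assumed 20 minutes of downtime after 15 days.

def project_exhaustion_day(budget_minutes: float, consumed_minutes: float,
                           days_elapsed: float) -> float:
    """Day on which the budget runs out if the current burn rate continues."""
    burn_per_day = consumed_minutes / days_elapsed
    return budget_minutes / burn_per_day

budget = 43.2       # minutes, 99.9% SLO over 30 days
consumed = 20.0     # assumed downtime so far
print(f"Budget consumed: {consumed / budget * 100:.1f}%")
print(f"Projected exhaustion: day {project_exhaustion_day(budget, consumed, 15):.0f}")
```

Plotting consumed budget against days elapsed in this way is exactly the burn chart described above.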
Error budgets embody a radical philosophical shift in how we think about reliability. Understanding this philosophy is essential to using error budgets effectively.
The core insight: 100% is the wrong target
Traditional reliability thinking aimed for maximum uptime: more reliability was always better. Error budgets explicitly reject this premise. Reliability beyond the SLO target is a cost, not a virtue.
This is initially counterintuitive. How can more reliability be bad? The answer lies in opportunity cost.
Every percentage point of reliability beyond your SLO target represents engineering effort that could have been spent on features, security, performance, or technical debt. If users are equally happy at 99.9% and 99.99%, the engineering effort to achieve the latter is pure waste. Error budgets make this tradeoff visible and manageable.
Error budget as a shared resource:
Error budget is consumed by any activity that causes errors or unavailability: risky deployments, infrastructure failures, dependency outages, capacity problems, operational mistakes, and planned maintenance alike.
This creates a shared accountability model. The development team's risky deployment and the ops team's maintenance window both draw from the same pool. Neither can blame the other for budget exhaustion—it's a collective resource.
The velocity-reliability pendulum:
Error budgets create a self-regulating system: when budget is plentiful, teams ship aggressively and take risks; when it runs low, the same pre-agreed rules push them toward caution and reliability work until the budget recovers.
This pendulum naturally balances innovation and stability without requiring constant negotiation or management intervention.
Who "owns" the error budget:
A subtle but critical question is who controls how error budget is spent. There are several models:
Product-owned budget: Product management decides how to allocate budget between features (new deployments) and reliability work. This aligns with product teams having ultimate responsibility for user experience.
Engineering-owned budget: Engineering leadership allocates budget based on technical judgment. This prevents product pressure from overriding engineering safety concerns.
Jointly-owned budget: Both product and engineering must agree on budget allocation, with defined escalation paths for disagreement. This forces alignment but can slow decisions.
Most mature organizations use joint ownership with a tie-breaker rule (typically: when budget is healthy, product decides; when budget is threatened, engineering decides on protective measures).
An error budget policy is a pre-agreed set of rules defining what actions are triggered by different error budget states. Having explicit policies prevents ad-hoc negotiation during stressful situations and ensures consistent organizational responses.
The policy framework:
Error budget policies typically define:
| Budget State | Threshold | Permitted Activities | Required Actions |
|---|---|---|---|
| Healthy | < 50% consumed | Full velocity: features, experiments, risky changes | Normal development cadence |
| Caution | 50-75% consumed | Moderate velocity: features continue, reduce experiment scope | Increase deployment monitoring, review pending risky changes |
| At Risk | 75-90% consumed | Reduced velocity: only low-risk deployments | Reliability improvements prioritized, daily budget review |
| Critical | 90-100% consumed | Emergency only: only reliability fixes deploy | Development freeze, incident-level focus on recovery |
| Exhausted | 100% consumed | Zero tolerance: nothing deploys without VP approval | Post-mortem required, formal remediation plan |
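Encoding the state thresholds in code lets deployment tooling query the current policy state automatically. This is a minimal sketch of the table above; the names and structure are illustrative.

```python
# Policy states keyed by the upper bound of their consumption range,
# matching the thresholds in the table above.
POLICY_STATES = [
    (50.0, "healthy"),
    (75.0, "caution"),
    (90.0, "at_risk"),
    (100.0, "critical"),
]

def budget_state(consumed_pct: float) -> str:
    """Map percent of error budget consumed to a policy state."""
    for upper_bound, state in POLICY_STATES:
        if consumed_pct < upper_bound:
            return state
    return "exhausted"

print(budget_state(62.0))  # caution
```

A CI/CD pipeline could call such a function before each deployment and block or require approvals according to the returned state.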
Defining policy actions in detail:
Deployment restrictions: The most common policy action is restricting deployments as budget depletes. This might mean reducing deployment cadence, lengthening canary phases, limiting deployments to business hours, or requiring additional approvals for risky changes.
Reliability investment mandates: Policies can mandate reliability work when budget is threatened, such as initiating a reliability sprint, prioritizing the top known reliability improvements over feature work, or reviewing the largest budget consumers daily.
Escalation and visibility: Budget status can trigger management visibility, escalating from engineering managers through directors to executives as consumption worsens.
Every error budget policy needs an exception process for genuine business necessities. But exceptions are dangerous—too many, and the policy becomes meaningless. Require exceptions to be approved at a higher level than the normal decision-maker, documented with justification, and tracked. If you're granting exceptions regularly, your policy or your SLO target is wrong.
```yaml
# Example Error Budget Policy Document
# Service: Payment Processing API
# SLO: 99.95% availability, p99 latency < 500ms

service: payment-api
slo_window: 30 days rolling
last_reviewed: 2024-01-15
next_review: 2024-04-15

error_budget_policy:
  healthy:
    threshold: "< 50% budget consumed"
    deployment_policy:
      - Standard deployment cadence (up to 5 per day)
      - "Canary: 5% for 10 minutes"
      - "Rollback: automatic on error spike"
    permitted_activities:
      - Feature deployments
      - A/B experiments
      - Infrastructure migrations
      - Dependency upgrades
    review_cadence: Weekly error budget check-in

  caution:
    threshold: "50-75% budget consumed"
    deployment_policy:
      - Reduced deployment cadence (up to 3 per day)
      - "Canary: 5% for 30 minutes"
      - Business hours only (9am-4pm)
      - Mandatory rollback plan documented
    permitted_activities:
      - Feature deployments (with extra review)
      - Small, well-tested experiments only
      - Delayed infrastructure work
    required_actions:
      - Daily error budget review
      - Identify top error budget consumers
      - Defer risky changes to next healthy period
    escalation: Engineering Manager notified

  at_risk:
    threshold: "75-90% budget consumed"
    deployment_policy:
      - Emergency and reliability deployments only
      - "Canary: 1% for 60 minutes"
      - Business hours only with on-call engineer present
      - Explicit VP approval for feature deployments
    permitted_activities:
      - Bug fixes
      - Reliability improvements
      - Rollbacks
    required_actions:
      - Reliability sprint initiated
      - Top 3 reliability improvements prioritized
      - War room until budget recovers to caution level
    escalation: Director notified, included in leadership sync

  critical:
    threshold: "90-100% budget consumed"
    deployment_policy:
      - Complete feature freeze
      - Only reliability fixes and rollbacks permitted
      - All changes require on-call + engineering lead approval
    required_actions:
      - All hands on reliability work
      - Customer communication prepared (not sent unless SLA breach)
      - Post-mortem for budget depletion initiated
    escalation: VP notified, daily executive summary

  exhausted:
    threshold: "> 100% budget consumed (SLO violated)"
    deployment_policy:
      - Nothing deploys without VP written approval
      - All changes reviewed by SRE team
    required_actions:
      - Formal post-mortem within 48 hours
      - Remediation plan with timeline
      - Customer notification (per SLA requirements)
      - Executive briefing
    recovery_criteria:
      - Root cause identified and fixed
      - Prevention measures implemented
      - Budget returns to at_risk or better
    escalation: Executive leadership notified

exception_process:
  approver: VP of Engineering
  documentation: Required in deployment ticket
  tracking: All exceptions logged in reliability dashboard
  review: Monthly review of exceptions for policy refinement
```

The true power of error budgets emerges when they become the primary input for reliability-related decisions. Here are the key decision frameworks error budgets enable:
Decision 1: Ship or Wait?
When a feature is ready but carries deployment risk, the error budget provides the answer: if the budget is healthy, ship; if it is depleted, wait.
This removes subjective negotiation. The budget is an objective arbiter that both product and engineering agreed to respect.
Decision 2: Reliability vs. Feature Work
The classic engineering tradeoff becomes quantifiable: a healthy budget justifies prioritizing feature work, while rapid consumption signals that effort should shift toward reliability.
This isn't annual planning—it's continuous rebalancing based on real feedback. A team might shift priorities weekly based on budget trends.
Decision 3: Incident Response Investment
Error budgets clarify how much incident response effort is warranted: an incident that consumes a large fraction of the budget deserves deep investigation and follow-up work, while one that barely registers may not justify the same investment.
This prevents both over-reaction (treating every small incident as a crisis) and under-reaction (ignoring incidents that cumulatively deplete budget).
Decision 4: SRE Team Engagement
In organizations with dedicated SRE teams, error budget often governs SRE support levels: services that stay within budget retain full SRE engagement, while chronic overspenders risk having operational responsibility handed back to the development team until reliability improves.
One powerful mental model: calculate your 'cost per error minute' by dividing your monthly reliability engineering investment by your monthly error budget. If you spend $100K/month on reliability for a 43-minute budget, each minute of allowable downtime costs ~$2,300 in reliability investment. This helps calibrate whether a given activity's risk is worth its potential budget consumption.
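The heuristic from the callout is a one-line calculation; this sketch just makes the arithmetic explicit (the $100K figure is the callout's illustrative number, here applied to the exact 43.2-minute budget).

```python
def cost_per_error_minute(monthly_reliability_spend: float,
                          budget_minutes: float) -> float:
    """Reliability investment per minute of allowable downtime."""
    return monthly_reliability_spend / budget_minutes

# $100K/month reliability spend, 43.2-minute budget (99.9% SLO, 30 days)
print(f"${cost_per_error_minute(100_000, 43.2):,.0f} per allowable minute")
```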
Decision 5: Freeze or Continue During Incidents?
When an incident occurs, teams often debate whether to freeze all activity or continue (carefully) with planned work. Error budget provides guidance: with ample budget remaining, continue cautiously; with budget nearly exhausted, freeze everything non-essential until the incident is resolved.
Decision 6: Experiment Design
Error budgets should explicitly influence experiment design: size an experiment's blast radius and duration so that even a worst-case outcome consumes only a pre-agreed, acceptable slice of the budget.
Understanding what is consuming your error budget is as important as knowing how much is consumed. Effective error budget management requires categorizing and analyzing consumption patterns.
Consumption categories:
| Category | Examples | Controllability | Response Strategy |
|---|---|---|---|
| Deployment errors | Rollout bugs, configuration mistakes, regressions | High | Improve testing, canary analysis, rollback speed |
| Infrastructure incidents | Hardware failures, network issues, provider outages | Low | Redundancy, multi-region, provider SLA management |
| Dependency failures | Third-party API outages, upstream service issues | Medium | Fallbacks, caching, circuit breakers, vendor diversity |
| Capacity issues | Traffic spikes, resource exhaustion, scaling failures | High | Autoscaling, capacity planning, load shedding |
| Operational errors | Human mistakes, runbook failures, misconfigurations | High | Automation, guardrails, training, review processes |
| Planned maintenance | Upgrades, migrations, intentional degradation | High | Schedule during low-traffic, communicate, budget allocation |
Attribution and accountability:
For error budgets to drive improvement, consumption must be attributed accurately: each incident and degraded period should be tagged with its category, root cause, and owning team.
A team consistently burning budget on deployment errors should invest in CI/CD improvements. A team burning budget on dependency failures should invest in resilience patterns. The budget tells you where to focus.
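Attribution can start as simply as summing consumption by category. The incident records below are hypothetical, but the grouping pattern is the core of any attribution report.

```python
from collections import defaultdict

# Hypothetical incident records: (category, error budget minutes consumed).
# Categories follow the consumption table above.
incidents = [
    ("deployment", 12.0),
    ("dependency", 8.5),
    ("deployment", 6.0),
    ("capacity", 3.0),
]

by_category: defaultdict[str, float] = defaultdict(float)
for category, minutes in incidents:
    by_category[category] += minutes

# Report largest consumers first: this is where to focus improvement work.
for category, minutes in sorted(by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {minutes:.1f} min")
```

Here deployments dominate, pointing (per the table's response strategies) at testing, canary analysis, and rollback speed as the highest-leverage investments.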
Consumption forecasting:
Beyond tracking current consumption, mature teams forecast future consumption: projecting the current burn rate forward and factoring in planned risky work, so that exhaustion is anticipated rather than discovered.
Significant budget consumption that can't be attributed indicates observability gaps. If 30% of your error budget is consumed by 'unknown causes,' that's a signal to invest in monitoring and attribution before you can effectively reduce consumption. You can't fix what you can't measure.
Consumption efficiency:
Not all budget consumption is equal. Some activities provide business value for the budget they consume; others are pure waste:
High-value consumption: budget spent shipping valuable features, running informative experiments, and completing necessary migrations.

Low-value consumption: budget lost to preventable incidents, repeated known failure modes, and avoidable operational mistakes.
The goal isn't zero consumption—it's ensuring that consumed budget generates proportionate value. A team that uses 100% of budget to ship valuable features and experiments is performing optimally. A team that uses 50% of budget on preventable incidents is underperforming even though they're 'healthier.'
When error budget is exhausted or critically low, teams need a structured approach to recovery. This isn't about punishing teams—it's about providing a framework to return to healthy operations.
The recovery mindset:
Budget exhaustion is not a failure to be ashamed of—it's a signal that the balance between velocity and reliability needs recalibration. The goal of recovery is to restore the budget to a healthy level, fix the issues that consumed it, and return to sustainable velocity.
Recovery velocity:
With a rolling window, error budget naturally recovers as old incidents roll out of the window. The recovery rate depends on when the consumption occurred within the window and whether new consumption continues. For a 30-day window, an incident stops counting against the budget exactly 30 days after it happened: a budget exhausted by a single early spike recovers abruptly on that anniversary, while one drained by chronic daily errors recovers only as fast as those errors are fixed.
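The rolling-window mechanics can be made concrete with a small sketch: consumption is the sum of downtime over the trailing window, so a single incident's impact vanishes exactly one window-length after it occurred. The daily-history representation is an illustrative simplification.

```python
def consumed_in_window(daily_downtime: list[float], today: int,
                       window_days: int = 30) -> float:
    """Total downtime minutes within the trailing window ending at `today`."""
    start = max(0, today - window_days + 1)
    return sum(daily_downtime[start:today + 1])

# A single 40-minute incident on day 5, nothing else.
history = [0.0] * 60
history[5] = 40.0
print(consumed_in_window(history, 20))  # 40.0 -> incident still in window
print(consumed_in_window(history, 40))  # 0.0  -> rolled out after 30 days
```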
Accelerated recovery strategies:
While you can't change the math of rolling windows, you can accelerate effective recovery:
Prevent new consumption: The fastest way to recover is to stop consuming budget. Feature freezes serve this purpose.
Reduce ongoing consumption: Fix chronic small issues that continuously drain budget. A fix that saves 1 minute per day saves 30 minutes per month.
Improve incident response: Faster detection and mitigation means less budget consumed per incident.
Invest in reliability improvements: Use the recovery period productively rather than just waiting for the window to clear.
Teams can enter a doom loop where budget exhaustion → freeze → pressure to ship builds → risky deployment when freeze lifts → new incidents → budget exhaustion again. Break this cycle by using recovery periods productively for reliability work, not just waiting. The goal is that when velocity resumes, the system can sustain it.
Error budgets are conceptually elegant but organizationally challenging. Successful adoption requires navigating predictable obstacles:
Challenge 1: 'We don't have good enough measurement'
Teams often cite imperfect observability as a reason to delay error budget adoption. While measurement matters, perfect measurement isn't required to start: begin with rough proxies such as load balancer logs or synthetic checks, and refine the measurement as the practice matures.
Challenge 2: Gaming the system
Sophisticated teams may find ways to game error budgets: excluding inconvenient incidents from measurement, redefining SLIs to hide problems, or timing risky changes for just after old incidents roll out of the window.

Countermeasures: keep SLI definitions transparent and centrally reviewed, audit changes to measurement, and track policy exceptions so patterns of gaming become visible.
Challenge 3: Cultural resistance to 'allowed failures'
Some organizations have cultures where any failure is unacceptable. This prevents productive use of error budgets: leaders insist that users expect zero downtime, or that deliberately tolerating errors amounts to negligence.
These positions often collapse under scrutiny—no service truly has zero errors—but they require patient cultural change, often led by data showing that users actually tolerate more than assumed.
Don't try to implement error budgets organization-wide immediately. Start with one service and team that's on board. Prove the model creates better outcomes. Use that success to expand adoption. Forced adoption creates compliance without commitment; earned adoption creates genuine cultural change.
Error budgets transform SLOs from static targets into dynamic decision-making tools. They make reliability a shared resource, provide objective criteria for velocity-reliability tradeoffs, and create self-regulating systems that balance innovation with stability.
You now understand error budgets comprehensively—their mathematics, philosophy, policies, and practical application. Next, we'll explore burn rate alerting—the mechanism for detecting when error budget consumption is accelerating dangerously, enabling proactive intervention before budget exhaustion.