A system with perfect latency and infinite throughput is worthless if it's down. Availability measures whether your system is actually running and serving users when they need it. It's the fundamental promise you make: "Our service will be here when you need it."
Availability is often expressed as a percentage—99%, 99.9%, 99.99%—each additional "nine" representing an order of magnitude improvement in reliability. The difference between 99% and 99.99% uptime sounds small (just two digits!), but it's the difference between 87 hours of downtime per year and 52 minutes. For a payment system, an airline booking engine, or a hospital records system, those extra nines can mean the difference between a minor inconvenience and a crisis.
This page explores availability as a core performance metric: how to define it, measure it, set targets for it, and engineer systems that achieve it.
By the end of this page, you'll understand availability as a precise engineering concept. You'll know how to calculate "nines," set realistic SLOs, measure availability correctly, and reason about the trade-offs between availability and other system properties. You'll develop the reliability mindset that distinguishes engineers who build systems that stay up.
Availability seems simple—is the system up or down? But precision matters enormously when defining targets and measuring compliance.
The Basic Formula
Availability = Uptime / (Uptime + Downtime) × 100%
Or equivalently:
Availability = (Total Time - Downtime) / Total Time × 100%
What Counts as "Down"?
This seemingly simple question generates significant debate. Consider a service that:
- responds to every request, but takes 30 seconds instead of 300 milliseconds
- serves 90% of requests successfully while the other 10% fail
- works normally in most regions but is unreachable in one
Is it "up" or "down"? Neither answer is fully correct. This is why we need more nuanced definitions.
There are multiple availability definitions: Time-based (what percentage of time was it up?), Request-based (what percentage of requests succeeded?), and User-based (what percentage of users experienced success?). Different definitions can give dramatically different numbers for the same system state.
Time-Based Availability
The traditional approach: measure time periods when the system was available vs. unavailable.
Time Availability = (Planned Uptime - Unplanned Downtime) / Planned Uptime
"Planned uptime" excludes scheduled maintenance windows. A system available 99.9% of time during the hours it's supposed to be up achieves 99.9% availability even if it has 2 hours of weekly scheduled maintenance.
Request-Based Availability
More granular: measure successful requests vs. total requests.
Request Availability = Successful Requests / Total Requests × 100%
This captures partial outages that time-based metrics miss. If 10% of requests fail during an hour-long degradation, time-based says "1 hour down"; request-based says "90% available during that hour."
User-Based Availability
Most user-centric: what percentage of users had a successful experience?
User Availability = Users With Successful Sessions / Total Active Users
A regional outage affecting 5% of users is 95% user-available; the same outage by request count might be only 1% of requests if that region has low traffic.
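To make the difference concrete, here is a small sketch (the traffic and user counts are illustrative assumptions) that scores the same hypothetical one-hour regional incident under all three definitions:

```python
# Compare time-, request-, and user-based availability for one month
# containing a single 1-hour regional incident. All figures are
# illustrative assumptions, not measurements.

MINUTES_PER_MONTH = 30 * 24 * 60           # 43,200 minutes

# Time-based: the service was degraded for 60 minutes.
downtime_minutes = 60
time_based = (MINUTES_PER_MONTH - downtime_minutes) / MINUTES_PER_MONTH

# Request-based: failures concentrated in the bad hour.
total_requests = 100_000_000
failed_requests = 150_000
request_based = (total_requests - failed_requests) / total_requests

# User-based: only users in one low-traffic region were affected.
total_active_users = 2_000_000
affected_users = 100_000                    # 5% of users saw errors
user_based = (total_active_users - affected_users) / total_active_users

print(f"time-based:    {time_based:.4%}")    # 99.8611%
print(f"request-based: {request_based:.4%}")  # 99.8500%
print(f"user-based:    {user_based:.4%}")     # 95.0000%
```

Same incident, three very different numbers: be explicit about which definition your SLO uses.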
Availability is typically expressed in terms of "nines"—a shorthand for how many 9s appear in the percentage. Each additional nine represents 10× better reliability.
| Nines | Availability % | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|---|
| One nine | 90% | 36.5 days | 3 days | 16.8 hours |
| Two nines | 99% | 3.65 days | 7.2 hours | 1.68 hours |
| Three nines | 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes |
| Four nines | 99.99% | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| Five nines | 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| Six nines | 99.9999% | 31.5 seconds | 2.63 seconds | 0.6 seconds |
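The table values follow directly from the definition: allowed downtime is simply (1 − availability) × period. A quick sketch to reproduce them, assuming a 730-hour month:

```python
# Reproduce the downtime budgets in the table above.
PERIODS = {
    "year":  365 * 24 * 3600,   # seconds
    "month": 730 * 3600,        # ~30.4 days
    "week":  7 * 24 * 3600,
}

def downtime_budget(availability_pct: float) -> dict[str, float]:
    """Return allowed downtime in seconds for each period."""
    unavailability = 1 - availability_pct / 100
    return {name: secs * unavailability for name, secs in PERIODS.items()}

for nines in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget(nines)
    print(f"{nines}%: {budget['year'] / 3600:.2f} h/year, "
          f"{budget['month'] / 60:.1f} min/month")
```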
What Each Level Means Practically:
Two nines (99%): Acceptable for internal tools, development environments. About 3.65 days/year of downtime.
Three nines (99.9%): Standard for most web applications. About 8.8 hours/year of downtime. Allows for occasional outages and maintenance.
Four nines (99.99%): High availability target. About 52 minutes/year. Requires redundancy, automated failover, no scheduled downtime.
Five nines (99.999%): Mission-critical systems (banking, emergency services). About 5 minutes/year. Requires extensive redundancy, geographic distribution, instant failover.
Six nines (99.9999%): Telecom/defense grade. 31 seconds/year. Achieved only with massive redundancy and specialized hardware.
Each additional nine is roughly 10× harder and more expensive to achieve. Going from 99.9% to 99.99% often requires fundamental architectural changes (geographic redundancy, eliminating single points of failure). Going to 99.999% typically requires specialized hardware, custom software, and dedicated reliability engineering teams. Don't promise more nines than you're prepared to invest in.
Real systems are composed of multiple components. Understanding how component availability combines is essential for predicting system-level availability.
Serial Dependency (All Components Required)
When components are in series (A → B → C), all must be available for the system to work:
Availability_serial = A₁ × A₂ × A₃ × ... × Aₙ
Example: A web app depends on an API server (99.9%), database (99.95%), and cache (99.99%).
System Availability = 0.999 × 0.9995 × 0.9999 = 0.9984 = 99.84%
The chain is only as strong as its weakest link, and each component slightly reduces overall availability.
A system with 10 components each at 99.9% availability has composite availability of 0.999^10 = 99.0%. Those 10 dependencies cost you an entire nine. Every serial dependency degrades overall availability. This is why minimizing dependencies is critical for high-availability systems.
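A small sketch of serial composition, reproducing both examples above:

```python
import math

def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    return math.prod(components)

# The API server / database / cache example from above:
print(f"{serial_availability(0.999, 0.9995, 0.9999):.4%}")   # 99.8401%

# Ten dependencies at 99.9% each cost roughly a full nine:
print(f"{serial_availability(*[0.999] * 10):.4%}")           # 99.0045%
```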
Parallel Redundancy (Any Component Sufficient)
When components are redundant (either A₁ OR A₂ can serve), system fails only if ALL fail:
Unavailability_parallel = (1-A₁) × (1-A₂) × ... × (1-Aₙ)
Availability_parallel = 1 - Unavailability_parallel
Example: Two redundant API servers at 99.9% each:
P(both fail) = 0.001 × 0.001 = 0.000001
Availability = 1 - 0.000001 = 99.9999%
Redundancy dramatically improves availability—two 99.9% servers achieve six nines together!
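And the parallel case. One caveat worth encoding as a comment: the math assumes the replicas fail independently, which correlated failures (a shared bad deploy, shared power) undermine.

```python
def parallel_availability(*replicas: float) -> float:
    """Availability of a redundant group where any single replica suffices.
    Assumes failures are independent; correlated failures (shared power,
    a bad deploy pushed everywhere) erode the benefit."""
    p_all_fail = 1.0
    for a in replicas:
        p_all_fail *= (1 - a)
    return 1 - p_all_fail

# Two redundant 99.9% servers:
print(f"{parallel_availability(0.999, 0.999):.6%}")   # 99.999900%
```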
Combined Serial and Parallel
Real architectures combine both patterns:
[Load Balancer 99.99%] → [App Server 1 (99.9%) OR App Server 2 (99.9%)] → [Primary DB (99.95%) OR Replica DB (99.95%)]
Calculation:
- App server pair (parallel): 1 - (0.001 × 0.001) = 99.9999%
- Database pair (parallel): 1 - (0.0005 × 0.0005) = 99.999975%
- Full chain (serial): 0.9999 × 0.999999 × 0.99999975 ≈ 99.99%
The non-redundant load balancer becomes the bottleneck! This is why high-availability designs eliminate single points of failure everywhere.
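The same arithmetic as a runnable sketch, combining the serial and parallel rules:

```python
import math

def serial(*parts: float) -> float:
    """All parts required: multiply availabilities."""
    return math.prod(parts)

def parallel(*replicas: float) -> float:
    """Any replica suffices: multiply unavailabilities."""
    return 1 - math.prod(1 - a for a in replicas)

app_tier = parallel(0.999, 0.999)       # 99.9999%
db_tier = parallel(0.9995, 0.9995)      # 99.999975%
system = serial(0.9999, app_tier, db_tier)

print(f"{system:.4%}")   # 99.9899% -- capped by the single 99.99% load balancer
```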
Availability can be expressed in terms of failure and recovery times, which provides actionable targets for improvement.
Key Metrics
| Metric | Full Name | Definition | Improvement Path |
|---|---|---|---|
| MTTF | Mean Time To Failure | Average time between system going live and failing | Improve component reliability |
| MTTR | Mean Time To Recovery/Repair | Average time from failure detection to recovery | Faster detection, automated recovery |
| MTBF | Mean Time Between Failures | Average time between failures (MTTF + MTTR) | Both reliability and recovery |
The MTBF/MTTR Relationship to Availability
Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
Or equivalently:
Availability = Uptime / (Uptime + Downtime)
Two Paths to Higher Availability:
Increase MTTF (Fail Less Often)
Decrease MTTR (Recover Faster)
For most systems, reducing MTTR is more achievable than increasing MTTF. You can't always prevent failures, but you can always detect and recover faster. A system with MTTF=100 hours and MTTR=1 hour has 99% availability. Reducing MTTR to 10 minutes achieves 99.83%. Much more achievable than making the same system fail 10× less often.
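A quick check of those numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Same failure rate, different recovery speed:
print(f"{availability(100, 1):.2%}")        # 99.01%  (MTTR = 1 hour)
print(f"{availability(100, 10 / 60):.2%}")  # 99.83%  (MTTR = 10 minutes)
```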
Breaking Down MTTR
MTTR consists of several sub-components:
MTTR = MTTD + MTTI + MTTM + MTTV
Where:
- MTTD (Mean Time To Detect): failure occurs → monitoring notices it
- MTTI (Mean Time To Identify): detection → the cause is understood
- MTTM (Mean Time To Mitigate): diagnosis → impact is stopped (rollback, failover, hotfix)
- MTTV (Mean Time To Verify): mitigation → recovery is confirmed
Each component can be optimized:
- MTTD: alert on user-visible symptoms, not just internal causes
- MTTI: runbooks, structured logging, good observability
- MTTM: automated rollback, feature flags, one-command failover
- MTTV: automated health checks and end-to-end validation
Setting availability targets requires balancing reliability with practical constraints like development velocity, cost, and complexity.
Setting Realistic Availability SLOs
Your availability target should be based on user expectations, the business impact of downtime, and the engineering cost you can justify. Typical targets by system type:
| System Type | Typical Target | Allowed Annual Downtime | Justification |
|---|---|---|---|
| Internal tools | 99% | 3.65 days | Non-critical, users can wait |
| Content websites | 99.5% | 1.83 days | Annoying but not catastrophic if down |
| E-commerce | 99.9% | 8.76 hours | Direct revenue impact from downtime |
| SaaS applications | 99.9% - 99.95% | 4-8 hours | Customer trust, contractual obligations |
| Payment systems | 99.99% | 52 minutes | Transactions blocked, regulatory requirements |
| Emergency services | 99.999%+ | 5 minutes | Life safety implications |
Error Budgets: Operationalizing Availability
An error budget is the inverse of your availability target—the amount of downtime or failure you're "allowed" before violating your SLO.
Error Budget = 1 - SLO Target
For a 99.9% (three nines) SLO:
- Error budget = 0.1% of requests (one failure allowed per 1,000 requests)
- In time terms: roughly 43.8 minutes of downtime per month, or 8.76 hours per year
How Error Budgets Change Behavior:
- Budget remaining: ship features, deploy frequently, take calculated risks
- Budget exhausted: freeze risky releases and spend engineering time on reliability until the budget recovers
Error budgets create a self-regulating system: reliability issues consume budget, creating natural pressure to fix them before more features ship.
Treat your error budget as a currency. Spending it on new feature releases is fine—that's what it's for. But spending it on preventable outages is waste. Track where error budget is consumed. If 80% goes to deployment failures, invest in safer deploys. If 80% goes to infrastructure, invest in redundancy.
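A minimal sketch of tracking a request-based error budget for a three-nines SLO; the traffic and failure counts are illustrative assumptions:

```python
# Track a monthly error budget for a 99.9% request-based SLO.
SLO = 0.999
monthly_requests = 500_000_000

budget = monthly_requests * (1 - SLO)      # ~500,000 failed requests allowed
failed_so_far = 380_000                    # e.g., two bad deploys + one incident

remaining = budget - failed_so_far
print(f"budget consumed: {failed_so_far / budget:.0%}, "
      f"remaining: {remaining:,.0f} failed requests")

# Policy hook: once the budget is gone, risky releases stop.
if remaining <= 0:
    print("Error budget exhausted: freeze releases, prioritize reliability work.")
```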
Accurate availability measurement requires careful consideration of what to measure, from where, and how to handle edge cases.
What to Measure
Define specific success criteria:
A request is successful if:
- HTTP status code is 2xx or 3xx (not 4xx, 5xx)
- Response time is under SLO threshold (e.g., < 500ms)
- Response is well-formed (passes validation)
- Business logic completed correctly
Be explicit. A 200 OK response that returns invalid data should count as failure. A request that times out should count as failure even if the server eventually completes it.
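One way to encode criteria like these is a small classifier over request records; the field names and the 500 ms threshold below are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    status_code: int
    duration_ms: float
    response_valid: bool      # passed schema / business-logic validation
    timed_out: bool

LATENCY_SLO_MS = 500          # assumed threshold

def is_successful(r: RequestRecord) -> bool:
    if r.timed_out:
        return False                       # timeouts count as failures
    if not (200 <= r.status_code < 400):
        return False                       # 4xx/5xx count as failures
    if r.duration_ms > LATENCY_SLO_MS:
        return False                       # too slow = unavailable in practice
    return r.response_valid                # 200 OK with bad data is still a failure

records = [
    RequestRecord(200, 120, True, False),
    RequestRecord(200, 950, True, False),   # too slow
    RequestRecord(500, 80, False, False),   # server error
]
availability = sum(is_successful(r) for r in records) / len(records)
print(f"request availability: {availability:.1%}")   # 33.3%
```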
Server-side measurement misses failures the server never sees: network partitions, load balancer errors, DNS failures. Client-side measurement captures true user experience but is harder to instrument. Synthetic monitoring probes the system from outside but may not represent real user paths. Use multiple perspectives for a complete picture.
Synthetic Monitoring
Synthetic monitoring uses automated probes to test availability: scheduled requests against health endpoints or scripted user journeys, typically fired from multiple geographic locations (a minimal probe is sketched after this list).
Advantages:
- Consistent, controlled measurement even when real traffic is zero
- Catches outages before users report them
- Easy to compare across regions and over time
Disadvantages:
- Probes may not exercise the paths real users take
- A passing health check doesn't guarantee the full product works
- Probe locations and networks may differ from your users'
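A minimal probe might look like the sketch below; the health-check URL and probe interval are placeholder assumptions.

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"   # hypothetical endpoint
TIMEOUT_S = 5

def probe_once(url: str) -> bool:
    """Return True if the endpoint answered 2xx/3xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return 200 <= resp.status < 400
    except OSError:            # URLError, HTTPError, and timeouts are OSError subclasses
        return False

results = []
for _ in range(3):             # a real probe would run indefinitely
    results.append(probe_once(PROBE_URL))
    time.sleep(1)              # probe interval (e.g., 60s in production)

print(f"synthetic availability: {sum(results) / len(results):.1%}")
```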
Real User Monitoring (RUM)
Measure actual user requests by instrumenting real traffic at the client or the edge and recording success or failure for each one.
Advantages:
- Reflects what users actually experienced, across every real path
- Captures failures the server never sees (network, DNS, client-side errors)
- No scripted scenarios to build and maintain
Disadvantages:
- No data when traffic is low or absent (nights, new features, small regions)
- Noisy: flaky client devices and networks pollute the signal
- Harder to instrument consistently across platforms
| Approach | Perspective | Best For | Blind Spots |
|---|---|---|---|
| Server metrics | Internal | High-volume, detailed data | Doesn't see network/LB failures |
| Load balancer logs | Edge | All requests reaching edge | Doesn't see DNS/network failures |
| Synthetic probes | External | Consistent, controlled checks | May not match user experience |
| Real User Monitoring | True user | Actual user experience | No data when idle, client noise |
| Third-party monitoring | Independent | Unbiased external view | Limited depth, usually probe-based |
Achieving high availability is an engineering discipline with established practices: redundancy at every layer, automated failover, eliminating single points of failure, graceful degradation, and regularly testing that recovery actually works.
You don't know your system is resilient until it's tested. Chaos engineering practices like Netflix's Chaos Monkey intentionally cause failures in production to verify recovery mechanisms work. If you've never seen your failover activate, you don't know if it works.
Higher availability always comes at a cost. Understanding these trade-offs helps make informed decisions about reliability investments.
Availability vs. Consistency (CAP Theorem)
The CAP theorem states that distributed systems can't simultaneously guarantee all three of:
- Consistency: every read sees the most recent write
- Availability: every request receives a response
- Partition tolerance: the system keeps operating when the network splits
Since network partitions are unavoidable, you must choose between:
- CP: preserve consistency by rejecting some requests during a partition
- AP: stay available by serving requests anyway, accepting stale or divergent data
Most high-availability systems choose AP, accepting eventual consistency to remain available during partitions.
| Trade-off | Higher Availability Requires | Cost/Complexity |
|---|---|---|
| Availability vs. Cost | More redundancy, higher-tier infrastructure | $$$, larger operational burden |
| Availability vs. Consistency | Relaxed consistency (eventual) | Complex application logic for conflicts |
| Availability vs. Latency | Synchronous replication for durability | Added latency for strong guarantees |
| Availability vs. Velocity | More testing, slower deployments | Reduced feature delivery speed |
| Availability vs. Simplicity | Complex failover, monitoring, automation | Higher cognitive load, more failure modes |
The Cost Curve
Availability cost is non-linear:
- 99% → 99.9%: better monitoring, safer deployments, basic redundancy
- 99.9% → 99.99%: geographic redundancy, automated failover, no scheduled downtime
- 99.99% → 99.999%: specialized infrastructure, custom tooling, dedicated reliability engineering teams
Diminishing returns set in quickly. Ask: "What's the business value of each additional nine?" For most systems, the answer suggests stopping at 99.9% or 99.99%.
A common mistake is pursuing maximally high availability because it seems like the "right" thing to do. But 99.999% availability for an internal reporting tool is wasteful. Match your availability investment to actual business needs. That budget is better spent on features or other systems that genuinely need higher reliability.
Availability is the foundational promise that your system will be there when users need it. Let's consolidate what we've learned:
- Availability can be defined by time, by requests, or by users, and the three can disagree about the same incident
- "Nines" translate directly into downtime budgets; each additional nine is roughly 10× harder and more expensive
- Serial dependencies multiply away availability; parallel redundancy multiplies away unavailability
- Availability = MTTF / (MTTF + MTTR), and cutting MTTR is usually the more practical lever
- SLOs and error budgets turn availability targets into day-to-day engineering decisions
- Measure from multiple perspectives (server, edge, synthetic probes, real users) and be explicit about what counts as failure
- Every additional nine trades off against cost, consistency, latency, velocity, and simplicity
What's Next:
We've covered latency (speed), throughput (volume), percentiles (distributions), and now availability (reliability). The final dimension of performance metrics is resource utilization—understanding how efficiently your system uses its underlying hardware and how to interpret utilization data for capacity planning.
You now understand availability as a precise engineering concept. You can calculate nines, set appropriate SLOs, use error budgets, measure availability correctly, and reason about the trade-offs involved in achieving higher reliability. You have the foundation for building systems that stay up.