A system with perfect latency and infinite throughput is worthless if it's down. Availability measures whether your system is actually running and serving users when they need it. It's the fundamental promise you make: "Our service will be here when you need it."
Availability is often expressed as a percentage—99%, 99.9%, 99.99%—each additional "nine" representing an order of magnitude improvement in reliability. The difference between 99% and 99.99% uptime sounds small (just two digits!), but it's the difference between 87 hours of downtime per year and 52 minutes. For a payment system, an airline booking engine, or a hospital records system, those extra nines can mean the difference between a minor inconvenience and a crisis.
This page explores availability as a core performance metric: how to define it, measure it, set targets for it, and engineer systems that achieve it.
By the end of this page, you'll understand availability as a precise engineering concept. You'll know how to calculate "nines," set realistic SLOs, measure availability correctly, and reason about the trade-offs between availability and other system properties. You'll develop the reliability mindset that distinguishes engineers who build systems that stay up.
Availability seems simple—is the system up or down? But precision matters enormously when defining targets and measuring compliance.
The Basic Formula
Availability = Uptime / (Uptime + Downtime) × 100%
Or equivalently:
Availability = (Total Time - Downtime) / Total Time × 100%
What Counts as "Down"?
This seemingly simple question generates significant debate. Consider a service that:
- responds to every request, but takes 30 seconds instead of 300 milliseconds
- serves 90% of requests successfully while the other 10% fail
- works normally in most regions but is unreachable in one
Is it "up" or "down"? Neither answer is fully correct. This is why we need more nuanced definitions.
There are multiple availability definitions: Time-based (what percentage of time was it up?), Request-based (what percentage of requests succeeded?), and User-based (what percentage of users experienced success?). Different definitions can give dramatically different numbers for the same system state.
Time-Based Availability
The traditional approach: measure time periods when the system was available vs. unavailable.
Time Availability = (Planned Uptime - Unplanned Downtime) / Planned Uptime
"Planned uptime" excludes scheduled maintenance windows. A system available 99.9% of time during the hours it's supposed to be up achieves 99.9% availability even if it has 2 hours of weekly scheduled maintenance.
Request-Based Availability
More granular: measure successful requests vs. total requests.
Request Availability = Successful Requests / Total Requests × 100%
This captures partial outages that time-based metrics miss. If 10% of requests fail during an hour-long degradation, time-based says "1 hour down"; request-based says "90% available during that hour."
User-Based Availability
Most user-centric: what percentage of users had a successful experience?
User Availability = Users With Successful Sessions / Total Active Users
A regional outage affecting 5% of users is 95% user-available; the same outage by request count might be only 1% of requests if that region has low traffic.
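To make the difference concrete, here is a small sketch (the traffic and user counts are illustrative assumptions) that scores the same hypothetical one-hour regional incident under all three definitions:

```python
# Compare time-, request-, and user-based availability for one month
# containing a single 1-hour regional incident. All figures are
# illustrative assumptions, not measurements.

MINUTES_PER_MONTH = 30 * 24 * 60           # 43,200 minutes

# Time-based: the service was degraded for 60 minutes.
downtime_minutes = 60
time_based = (MINUTES_PER_MONTH - downtime_minutes) / MINUTES_PER_MONTH

# Request-based: failures concentrated in the bad hour.
total_requests = 100_000_000
failed_requests = 150_000
request_based = (total_requests - failed_requests) / total_requests

# User-based: only users in one low-traffic region were affected.
total_active_users = 2_000_000
affected_users = 100_000                    # 5% of users saw errors
user_based = (total_active_users - affected_users) / total_active_users

print(f"time-based:    {time_based:.4%}")    # 99.8611%
print(f"request-based: {request_based:.4%}")  # 99.8500%
print(f"user-based:    {user_based:.4%}")     # 95.0000%
```

Same incident, three very different numbers: be explicit about which definition your SLO uses.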
Availability is typically expressed in terms of "nines"—a shorthand for how many 9s appear in the percentage. Each additional nine represents 10× better reliability.
| Nines | Availability % | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|---|
| One nine | 90% | 36.5 days | 3 days | 16.8 hours |
| Two nines | 99% | 3.65 days | 7.2 hours | 1.68 hours |
| Three nines | 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes |
| Four nines | 99.99% | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| Five nines | 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| Six nines | 99.9999% | 31.5 seconds | 2.63 seconds | 0.6 seconds |
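The table values follow directly from the definition: allowed downtime is simply (1 − availability) × period. A quick sketch to reproduce them, assuming a 730-hour month:

```python
# Reproduce the downtime budgets in the table above.
PERIODS = {
    "year":  365 * 24 * 3600,   # seconds
    "month": 730 * 3600,        # ~30.4 days
    "week":  7 * 24 * 3600,
}

def downtime_budget(availability_pct: float) -> dict[str, float]:
    """Return allowed downtime in seconds for each period."""
    unavailability = 1 - availability_pct / 100
    return {name: secs * unavailability for name, secs in PERIODS.items()}

for nines in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_budget(nines)
    print(f"{nines}%: {budget['year'] / 3600:.2f} h/year, "
          f"{budget['month'] / 60:.1f} min/month")
```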
What Each Level Means Practically:
Two nines (99%): Acceptable for internal tools, development environments. About 3.65 days/year of downtime.
Three nines (99.9%): Standard for most web applications. About 8.8 hours/year of downtime. Allows for occasional outages and maintenance.
Four nines (99.99%): High availability target. About 52 minutes/year. Requires redundancy, automated failover, no scheduled downtime.
Five nines (99.999%): Mission-critical systems (banking, emergency services). About 5 minutes/year. Requires extensive redundancy, geographic distribution, instant failover.
Six nines (99.9999%): Telecom/defense grade. 31 seconds/year. Achieved only with massive redundancy and specialized hardware.
Each additional nine is roughly 10× harder and more expensive to achieve. Going from 99.9% to 99.99% often requires fundamental architectural changes (geographic redundancy, eliminating single points of failure). Going to 99.999% typically requires specialized hardware, custom software, and dedicated reliability engineering teams. Don't promise more nines than you're prepared to invest in.
Real systems are composed of multiple components. Understanding how component availability combines is essential for predicting system-level availability.
Serial Dependency (All Components Required)
When components are in series (A → B → C), all must be available for the system to work:
Availability_serial = A₁ × A₂ × A₃ × ... × Aₙ
Example: A web app depends on an API server (99.9%), database (99.95%), and cache (99.99%).
System Availability = 0.999 × 0.9995 × 0.9999 = 0.9984 = 99.84%
The chain is only as strong as its weakest link, and each component slightly reduces overall availability.
A system with 10 components each at 99.9% availability has composite availability of 0.999^10 = 99.0%. Those 10 dependencies cost you an entire nine. Every serial dependency degrades overall availability. This is why minimizing dependencies is critical for high-availability systems.
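A small sketch of serial composition, reproducing both examples above:

```python
import math

def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    return math.prod(components)

# The API server / database / cache example from above:
print(f"{serial_availability(0.999, 0.9995, 0.9999):.4%}")   # 99.8401%

# Ten dependencies at 99.9% each cost roughly a full nine:
print(f"{serial_availability(*[0.999] * 10):.4%}")           # 99.0045%
```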
Parallel Redundancy (Any Component Sufficient)
When components are redundant (either A₁ OR A₂ can serve), system fails only if ALL fail:
Unavailability_parallel = (1-A₁) × (1-A₂) × ... × (1-Aₙ)
Availability_parallel = 1 - Unavailability_parallel
Example: Two redundant API servers at 99.9% each:
P(both fail) = 0.001 × 0.001 = 0.000001
Availability = 1 - 0.000001 = 99.9999%
Redundancy dramatically improves availability—two 99.9% servers achieve six nines together!
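And the parallel case. One caveat worth encoding as a comment: the math assumes the replicas fail independently, which correlated failures (a shared bad deploy, shared power) undermine.

```python
def parallel_availability(*replicas: float) -> float:
    """Availability of a redundant group where any single replica suffices.
    Assumes failures are independent; correlated failures (shared power,
    a bad deploy pushed everywhere) erode the benefit."""
    p_all_fail = 1.0
    for a in replicas:
        p_all_fail *= (1 - a)
    return 1 - p_all_fail

# Two redundant 99.9% servers:
print(f"{parallel_availability(0.999, 0.999):.6%}")   # 99.999900%
```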
Combined Serial and Parallel
Real architectures combine both patterns:
[Load Balancer 99.99%] → [App Server 1 (99.9%) OR App Server 2 (99.9%)] → [Primary DB (99.95%) OR Replica DB (99.95%)]
Calculation:
- App server pair (parallel): 1 - (0.001 × 0.001) = 99.9999%
- Database pair (parallel): 1 - (0.0005 × 0.0005) = 99.999975%
- Full chain (serial): 0.9999 × 0.999999 × 0.99999975 ≈ 99.99%
The non-redundant load balancer becomes the bottleneck! This is why high-availability designs eliminate single points of failure everywhere.
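The same arithmetic as a runnable sketch, combining the serial and parallel rules:

```python
import math

def serial(*parts: float) -> float:
    """All parts required: multiply availabilities."""
    return math.prod(parts)

def parallel(*replicas: float) -> float:
    """Any replica suffices: multiply unavailabilities."""
    return 1 - math.prod(1 - a for a in replicas)

app_tier = parallel(0.999, 0.999)       # 99.9999%
db_tier = parallel(0.9995, 0.9995)      # 99.999975%
system = serial(0.9999, app_tier, db_tier)

print(f"{system:.4%}")   # 99.9899% -- capped by the single 99.99% load balancer
```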
Availability can be expressed in terms of failure and recovery times, which provides actionable targets for improvement.
Key Metrics
| Metric | Full Name | Definition | Improvement Path |
|---|---|---|---|
| MTTF | Mean Time To Failure | Average time between system going live and failing | Improve component reliability |
| MTTR | Mean Time To Recovery/Repair | Average time from failure detection to recovery | Faster detection, automated recovery |
| MTBF | Mean Time Between Failures | Average time between failures (MTTF + MTTR) | Both reliability and recovery |
The MTBF/MTTR Relationship to Availability
Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
Or equivalently:
Availability = Uptime / (Uptime + Downtime)
Two Paths to Higher Availability:
Increase MTTF (Fail Less Often)
Decrease MTTR (Recover Faster)
For most systems, reducing MTTR is more achievable than increasing MTTF. You can't always prevent failures, but you can always detect and recover faster. A system with MTTF=100 hours and MTTR=1 hour has 99% availability. Reducing MTTR to 10 minutes achieves 99.83%. Much more achievable than making the same system fail 10× less often.
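A quick check of those numbers:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Same failure rate, different recovery speed:
print(f"{availability(100, 1):.2%}")        # 99.01%  (MTTR = 1 hour)
print(f"{availability(100, 10 / 60):.2%}")  # 99.83%  (MTTR = 10 minutes)
```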
Breaking Down MTTR
MTTR consists of several sub-components:
MTTR = MTTD + MTTI + MTTM + MTTV
Where:
- MTTD (Mean Time To Detect): failure occurs → monitoring notices it
- MTTI (Mean Time To Identify): detection → the cause is understood
- MTTM (Mean Time To Mitigate): diagnosis → impact is stopped (rollback, failover, hotfix)
- MTTV (Mean Time To Verify): mitigation → recovery is confirmed
Each component can be optimized:
- MTTD: alert on user-visible symptoms, not just internal causes
- MTTI: runbooks, structured logging, good observability
- MTTM: automated rollback, feature flags, one-command failover
- MTTV: automated health checks and end-to-end validation
Setting availability targets requires balancing reliability with practical constraints like development velocity, cost, and complexity.
Setting Realistic Availability SLOs
Your availability target should be based on user expectations, the business impact of downtime, and the engineering cost you can justify. Typical targets by system type:
| System Type | Typical Target | Allowed Annual Downtime | Justification |
|---|---|---|---|
| Internal tools | 99% | 3.65 days | Non-critical, users can wait |
| Content websites | 99.5% | 1.83 days | Annoying but not catastrophic if down |
| E-commerce | 99.9% | 8.76 hours | Direct revenue impact from downtime |
| SaaS applications | 99.9% - 99.95% | 4-8 hours | Customer trust, contractual obligations |
| Payment systems | 99.99% | 52 minutes | Transactions blocked, regulatory requirements |
| Emergency services | 99.999%+ | 5 minutes | Life safety implications |
Error Budgets: Operationalizing Availability
An error budget is the inverse of your availability target—the amount of downtime or failure you're "allowed" before violating your SLO.
Error Budget = 1 - SLO Target
For a 99.9% (three nines) SLO:
- Error budget = 0.1% of requests (one failure allowed per 1,000 requests)
- In time terms: roughly 43.8 minutes of downtime per month, or 8.76 hours per year
How Error Budgets Change Behavior:
- Budget remaining: ship features, deploy frequently, take calculated risks
- Budget exhausted: freeze risky releases and spend engineering time on reliability until the budget recovers
Error budgets create a self-regulating system: reliability issues consume budget, creating natural pressure to fix them before more features ship.
Treat your error budget as a currency. Spending it on new feature releases is fine—that's what it's for. But spending it on preventable outages is waste. Track where error budget is consumed. If 80% goes to deployment failures, invest in safer deploys. If 80% goes to infrastructure, invest in redundancy.
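A minimal sketch of tracking a request-based error budget for a three-nines SLO; the traffic and failure counts are illustrative assumptions:

```python
# Track a monthly error budget for a 99.9% request-based SLO.
SLO = 0.999
monthly_requests = 500_000_000

budget = monthly_requests * (1 - SLO)      # ~500,000 failed requests allowed
failed_so_far = 380_000                    # e.g., two bad deploys + one incident

remaining = budget - failed_so_far
print(f"budget consumed: {failed_so_far / budget:.0%}, "
      f"remaining: {remaining:,.0f} failed requests")

# Policy hook: once the budget is gone, risky releases stop.
if remaining <= 0:
    print("Error budget exhausted: freeze releases, prioritize reliability work.")
```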
Accurate availability measurement requires careful consideration of what to measure, from where, and how to handle edge cases.
What to Measure
Define specific success criteria:
A request is successful if:
- HTTP status code is 2xx or 3xx (not 4xx, 5xx)
- Response time is under SLO threshold (e.g., < 500ms)
- Response is well-formed (passes validation)
- Business logic completed correctly
Be explicit. A 200 OK response that returns invalid data should count as failure. A request that times out should count as failure even if the server eventually completes it.
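One way to encode criteria like these is a small classifier over request records; the field names and the 500 ms threshold below are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    status_code: int
    duration_ms: float
    response_valid: bool      # passed schema / business-logic validation
    timed_out: bool

LATENCY_SLO_MS = 500          # assumed threshold

def is_successful(r: RequestRecord) -> bool:
    if r.timed_out:
        return False                       # timeouts count as failures
    if not (200 <= r.status_code < 400):
        return False                       # 4xx/5xx count as failures
    if r.duration_ms > LATENCY_SLO_MS:
        return False                       # too slow = unavailable in practice
    return r.response_valid                # 200 OK with bad data is still a failure

records = [
    RequestRecord(200, 120, True, False),
    RequestRecord(200, 950, True, False),   # too slow
    RequestRecord(500, 80, False, False),   # server error
]
availability = sum(is_successful(r) for r in records) / len(records)
print(f"request availability: {availability:.1%}")   # 33.3%
```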
Server-side measurement misses failures the server never sees: network partitions, load balancer errors, DNS failures. Client-side measurement captures true user experience but is harder to instrument. Synthetic monitoring probes the system from outside but may not represent real user paths. Use multiple perspectives for a complete picture.
Synthetic Monitoring
Synthetic monitoring uses automated probes to test availability: scheduled requests against health endpoints or scripted user journeys, typically fired from multiple geographic locations (a minimal probe is sketched after this list).
Advantages:
- Consistent, controlled measurement even when real traffic is zero
- Catches outages before users report them
- Easy to compare across regions and over time
Disadvantages:
- Probes may not exercise the paths real users take
- A passing health check doesn't guarantee the full product works
- Probe locations and networks may differ from your users'
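A minimal probe might look like the sketch below; the health-check URL and probe interval are placeholder assumptions.

```python
import time
import urllib.request

PROBE_URL = "https://example.com/healthz"   # hypothetical endpoint
TIMEOUT_S = 5

def probe_once(url: str) -> bool:
    """Return True if the endpoint answered 2xx/3xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return 200 <= resp.status < 400
    except OSError:            # URLError, HTTPError, and timeouts are OSError subclasses
        return False

results = []
for _ in range(3):             # a real probe would run indefinitely
    results.append(probe_once(PROBE_URL))
    time.sleep(1)              # probe interval (e.g., 60s in production)

print(f"synthetic availability: {sum(results) / len(results):.1%}")
```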
Real User Monitoring (RUM)
Measure actual user requests by instrumenting real traffic at the client or the edge and recording success or failure for each one.
Advantages:
- Reflects what users actually experienced, across every real path
- Captures failures the server never sees (network, DNS, client-side errors)
- No scripted scenarios to build and maintain
Disadvantages:
- No data when traffic is low or absent (nights, new features, small regions)
- Noisy: flaky client devices and networks pollute the signal
- Harder to instrument consistently across platforms
| Approach | Perspective | Best For | Blind Spots |
|---|---|---|---|
| Server metrics | Internal | High-volume, detailed data | Doesn't see network/LB failures |
| Load balancer logs | Edge | All requests reaching edge | Doesn't see DNS/network failures |
| Synthetic probes | External | Consistent, controlled checks | May not match user experience |
| Real User Monitoring | True user | Actual user experience | No data when idle, client noise |
| Third-party monitoring | Independent | Unbiased external view | Limited depth, usually probe-based |
Achieving high availability is an engineering discipline with established practices: redundancy at every layer, automated failover, eliminating single points of failure, graceful degradation, and regularly testing that recovery actually works.
You don't know your system is resilient until it's tested. Chaos engineering practices like Netflix's Chaos Monkey intentionally cause failures in production to verify recovery mechanisms work. If you've never seen your failover activate, you don't know if it works.
Higher availability always comes at a cost. Understanding these trade-offs helps make informed decisions about reliability investments.
Availability vs. Consistency (CAP Theorem)
The CAP theorem states that distributed systems can't simultaneously guarantee all three of:
- Consistency: every read sees the most recent write
- Availability: every request receives a response
- Partition tolerance: the system keeps operating when the network splits
Since network partitions are unavoidable, you must choose between:
- CP: preserve consistency by rejecting some requests during a partition
- AP: stay available by serving requests anyway, accepting stale or divergent data
Most high-availability systems choose AP, accepting eventual consistency to remain available during partitions.
| Trade-off | Higher Availability Requires | Cost/Complexity |
|---|---|---|
| Availability vs. Cost | More redundancy, higher-tier infrastructure | $$$, larger operational burden |
| Availability vs. Consistency | Relaxed consistency (eventual) | Complex application logic for conflicts |
| Availability vs. Latency | Synchronous replication for durability | Added latency for strong guarantees |
| Availability vs. Velocity | More testing, slower deployments | Reduced feature delivery speed |
| Availability vs. Simplicity | Complex failover, monitoring, automation | Higher cognitive load, more failure modes |
The Cost Curve
Availability cost is non-linear:
- 99% → 99.9%: better monitoring, safer deployments, basic redundancy
- 99.9% → 99.99%: geographic redundancy, automated failover, no scheduled downtime
- 99.99% → 99.999%: specialized infrastructure, custom tooling, dedicated reliability engineering teams
Diminishing returns set in quickly. Ask: "What's the business value of each additional nine?" For most systems, the answer suggests stopping at 99.9% or 99.99%.
A common mistake is pursuing maximally high availability because it seems like the "right" thing to do. But 99.999% availability for an internal reporting tool is wasteful. Match your availability investment to actual business needs. That budget is better spent on features or other systems that genuinely need higher reliability.
Availability is the foundational promise that your system will be there when users need it. Let's consolidate what we've learned:
- Availability can be defined by time, by requests, or by users, and the three can disagree about the same incident
- "Nines" translate directly into downtime budgets; each additional nine is roughly 10× harder and more expensive
- Serial dependencies multiply away availability; parallel redundancy multiplies away unavailability
- Availability = MTTF / (MTTF + MTTR), and cutting MTTR is usually the more practical lever
- SLOs and error budgets turn availability targets into day-to-day engineering decisions
- Measure from multiple perspectives (server, edge, synthetic probes, real users) and be explicit about what counts as failure
- Every additional nine trades off against cost, consistency, latency, velocity, and simplicity
What's Next:
We've covered latency (speed), throughput (volume), percentiles (distributions), and now availability (reliability). The final dimension of performance metrics is resource utilization—understanding how efficiently your system uses its underlying hardware and how to interpret utilization data for capacity planning.
You now understand availability as a precise engineering concept. You can calculate nines, set appropriate SLOs, use error budgets, measure availability correctly, and reason about the trade-offs involved in achieving higher reliability. You have the foundation for building systems that stay up.