At 2:47 AM on a Monday morning, a customer in Tokyo tries to complete a purchase on your e-commerce platform. Their payment goes through, but the confirmation page hangs indefinitely. The customer doesn't know if they were charged. They might try again, potentially paying twice. Or they might abandon the purchase entirely and never return.
This scenario illustrates why availability matters. In the CAP theorem, availability means that every request received by a non-failing node must result in a response—not just eventually, but within a reasonable time. The system cannot simply hang, time out, or refuse to answer. Availability is the promise that your system is always there when users need it.
By the end of this page, you will deeply understand what availability means in the CAP context, how it differs from the colloquial notion of 'uptime,' the strategies for achieving high availability, the metrics and SLAs that quantify availability, and why the tension between availability and consistency is the central challenge of distributed systems.
The CAP theorem defines availability precisely, and this precision matters.
CAP Availability (Formal Definition):
Every request received by a non-failing node in the system must result in a response.
Let's unpack this definition carefully:
"Every request": No exceptions. The system cannot reject requests or refuse to answer. Every valid request must receive a response.
"Received by a non-failing node": This qualification is crucial. A crashed node is offline—we don't expect responses from it. Availability applies to nodes that are operational and reachable.
"Must result in a response": The response doesn't have to be "success." It can be an error, a partial result, or stale data. But there must be a response. Hanging indefinitely violates availability.
What CAP availability notably does NOT guarantee:
CAP availability is a theoretical property focused on liveness—the system makes progress. Practical availability (what we measure with '99.99% uptime') is about successful, timely responses that serve user needs. A system can be CAP-available while returning stale data or errors—something that would hurt practical availability metrics.
| Aspect | CAP Availability | Practical Availability |
|---|---|---|
| Core requirement | Response from non-failing nodes | Successful responses within SLA |
| Response content | Any response (even stale/error) | Correct, useful responses |
| Timing constraint | Eventually (no hard bound) | Within specified latency (e.g., 200ms) |
| Scope | Individual node behavior | System-wide aggregate |
| Measurement | Binary (available or not) | Percentage (99.9%, 99.99%) |
Why the distinction matters:
When designing systems, you must be clear about which definition you're targeting. A system can achieve CAP availability by always returning cached (potentially stale) data, but this might violate your business requirements for data freshness. Conversely, a system that blocks until fresh data is available might provide better practical availability for correctness-dependent use cases, even if it technically violates CAP availability during partitions.
Understanding this distinction prevents a common mistake: believing that because you've achieved one form of availability, you've achieved the other.
In industry, availability is typically expressed as a percentage—the famous "nines" that appear in every SLA discussion. Understanding these numbers and their implications is essential for system design.
The Nines Table:
Each additional nine represents a 10x reduction in permitted downtime—and typically a 10x increase in cost and complexity to achieve.
| Availability | Downtime/Year | Downtime/Month | Downtime/Day | Example Use Case |
|---|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 minutes | Internal tools, dev environments |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 1.44 minutes | Standard business applications |
| 99.95% | 4.38 hours | 21.9 minutes | 43.2 seconds | Important business applications |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 8.64 seconds | E-commerce, financial services |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 864 milliseconds | Telecom, emergency services |
| 99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 86.4 milliseconds | Air traffic control, nuclear systems |
Moving from 99% to 99.9% is relatively straightforward—add redundancy, monitor better, automate failover. Moving from 99.99% to 99.999% requires fundamentally different architectures, extensive testing, and often accepting trade-offs in other areas. Each nine costs exponentially more than the last.
How to Calculate Availability:
Basic Formula:
Availability = (Total Time - Downtime) / Total Time × 100%
With MTBF and MTTR:
Availability = MTBF / (MTBF + MTTR)
Where:
- MTBF (Mean Time Between Failures): the average operating time between failures
- MTTR (Mean Time To Repair): the average time to detect and recover from a failure
Example: If MTBF = 720 hours (30 days) and MTTR = 1 hour: Availability = 720 / (720 + 1) ≈ 99.86%
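The MTBF/MTTR formula is easy to explore in code. A minimal sketch (function name is illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The example above: 30 days between failures, 1 hour to recover.
print(f"{availability(720, 1):.2%}")    # 99.86%

# Note that halving MTTR gains exactly as much as doubling MTBF:
print(f"{availability(720, 0.5):.2%}")  # 99.93%
print(f"{availability(1440, 1):.2%}")   # 99.93%
```

The last two lines print the same number, which illustrates why fast recovery can substitute for failure prevention.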
This formula reveals two paths to higher availability: increase MTBF (fail less often) or decrease MTTR (recover faster).
In practice, reducing MTTR often provides better ROI than increasing MTBF, because preventing all failures is impossible, but fast recovery is achievable.
```
SERIAL COMPONENTS (all must work):
┌─────────────────────────────────────────────────────┐
│  LB (99.99%) → App (99.9%) → DB (99.95%)            │
│                                                     │
│  Total = 0.9999 × 0.999 × 0.9995 = 99.84%           │
│  (less than any individual component!)              │
└─────────────────────────────────────────────────────┘

PARALLEL COMPONENTS (any one can work):
┌─────────────────────────────────────────────────────┐
│         ┌── App Server 1 (99.9%) ──┐                │
│  LB ────┤                          ├── DB           │
│         └── App Server 2 (99.9%) ──┘                │
│                                                     │
│  App layer = 1 - (0.001 × 0.001) = 99.9999%         │
│  (both must fail for layer to fail)                 │
└─────────────────────────────────────────────────────┘

KEY INSIGHT: Serial composition decreases availability.
Parallel composition increases it. This is why every
reliable system has redundancy at every layer.
```
Understanding what breaks availability is as important as knowing what enables it. These anti-patterns appear repeatedly in production incidents.
Anti-Pattern 1: Single Points of Failure (SPOF)
A SPOF is any component whose failure brings down the entire system. Common SPOFs include a single database server, a lone load balancer, one DNS provider, a shared message broker, and a hard dependency on a single external API.
Anti-Pattern 2: Synchronous Dependencies
When Service A calls Service B synchronously, A's availability is bounded by B's:
Availability(A) ≤ Availability(A_standalone) × Availability(B)
With 5 serial dependencies at 99.9% each:
System Availability = 0.999^5 = 99.5%
That's 43 hours of downtime per year—from services that are individually 99.9%!
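The serial and parallel composition rules from the diagrams above can be captured in two small helpers (a sketch; function names are illustrative):

```python
from math import prod

def serial(*avail: float) -> float:
    """All components must work: availabilities multiply."""
    return prod(avail)

def parallel(*avail: float) -> float:
    """Any one component suffices: the layer fails only if all fail."""
    return 1 - prod(1 - a for a in avail)

# Five serial 99.9% dependencies, as in the text:
print(f"{serial(*[0.999] * 5):.3%}")    # 99.501%

# Two redundant 99.9% app servers:
print(f"{parallel(0.999, 0.999):.4%}")  # 99.9999%
```

Running the numbers this way makes the asymmetry obvious: chaining services erodes availability multiplicatively, while redundancy multiplies the (tiny) failure probabilities instead.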
Anti-Pattern 3: The Thundering Herd
After an outage, all clients retry simultaneously. This surge overwhelms the recovering system, causing another failure. The system oscillates between up and down states.
Solution: Implement exponential backoff with jitter. Each client waits a random amount before retrying, spreading the load over time.
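The backoff-with-jitter solution can be sketched as a retry helper using "full jitter"—each delay is drawn uniformly from zero up to an exponentially growing cap (names and parameters are illustrative):

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base=0.1, cap=10.0):
    """Retry with full jitter: sleep uniform(0, min(cap, base * 2^attempt)).

    Randomizing the delay spreads retries out in time, so a
    recovering server is not hit by every client at once.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

A caller would wrap any flaky network call, e.g. `retry_with_jitter(lambda: fetch_order(order_id))`, where `fetch_order` is whatever operation might fail transiently.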
Anti-Pattern 4: Correlated Failures
Two "independent" systems fail together because they share a hidden dependency:
Solution: Audit dependencies rigorously. True independence is harder than it looks.
Adding redundancy doesn't automatically improve availability. Poorly implemented redundancy can reduce availability by adding complexity, increasing failure modes, and creating coordination overhead. Every redundant component needs health checks, failover logic, and careful testing. Without these, you're adding liability, not reliability.
High availability isn't a feature you add—it's an architectural property you design for from the beginning. These strategies form the foundation of available systems.
Strategy 1: Redundancy at Every Layer
Every component should have at least one backup ready to take over: redundant load balancers, multiple application servers, database replicas, and ideally multiple data centers or regions.
Strategy 2: Health Checks and Automatic Failover
Redundancy is worthless without detection and action: continuous health checks to detect failed nodes, and automatic failover to redirect traffic to healthy replicas.
Failover must be automatic and fast. Manual intervention that takes 15 minutes negates the value of 99.99% availability (which allows only 4.38 minutes of downtime per month).
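A minimal sketch of the detect-and-redirect loop, with the probe made injectable so the routing logic can be tested without real servers (all names and the `/health` endpoint are illustrative):

```python
import urllib.request

def healthy(url: str, timeout: float = 2.0) -> bool:
    """A node counts as healthy only if it answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, HTTP error
        return False

def pick_backend(backends, probe=healthy):
    """Route to the first healthy backend; None signals a total outage."""
    for url in backends:
        if probe(url + "/health"):
            return url
    return None
```

In a real system the probe runs continuously in the background and failover happens without a human in the loop; this sketch only shows the decision itself.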
Strategy 3: Graceful Degradation
Not all features are equally critical. Design your system to shed non-essential functionality under stress:
Example: E-commerce during Black Friday
| Load Level | Features Enabled |
|---|---|
| Normal | Everything: recommendations, reviews, wishlist, analytics |
| High | Core + recommendations (disable wishlist, analytics) |
| Critical | Core only: browse, cart, checkout |
| Extreme | Static pages, deferred checkout |
This approach maximizes availability of the most important functionality.
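The load-level table above maps naturally onto a feature-flag check. A sketch, assuming a process somewhere updates the current load level (feature and level names mirror the table):

```python
FEATURES_BY_LEVEL = {
    "normal":   {"browse", "cart", "checkout",
                 "recommendations", "reviews", "wishlist", "analytics"},
    "high":     {"browse", "cart", "checkout", "recommendations"},
    "critical": {"browse", "cart", "checkout"},
    "extreme":  {"browse"},  # static pages only; checkout deferred to a queue
}

def enabled(feature: str, load_level: str) -> bool:
    """Check a feature against the current load level."""
    return feature in FEATURES_BY_LEVEL[load_level]

assert enabled("wishlist", "normal")
assert not enabled("wishlist", "high")   # shed first under load
assert enabled("checkout", "critical")   # core commerce survives
```

The point of encoding the levels as data is that shedding becomes a one-line state change rather than an emergency deploy.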
High availability isn't proven until it's tested. Chaos engineering—deliberately injecting failures in production—validates that your redundancy and failover actually work. Netflix's Chaos Monkey randomly kills instances. AWS tests its services by simulating region-wide outages. If you haven't tested your failover, you don't have failover—you have hope.
Let's examine how major systems achieve high availability and what trade-offs they accept.
Case Study 1: AWS S3 (11 nines of durability)
S3 famously promises 99.999999999% durability. This means if you store 10 million objects, you might expect to lose one every 10,000 years. How? S3 redundantly stores every object across multiple devices in at least three Availability Zones, continuously verifies data integrity with checksums, and automatically repairs any corruption it detects from the redundant copies.
Note: S3's durability (11 nines) differs from its availability (99.99% for Standard, meaning ~52 minutes downtime/year). Durability means data isn't lost. Availability means you can access it.
Case Study 2: Google Spanner (External Consistency + High Availability)
Spanner achieves 99.999% availability while maintaining strong consistency. The secret: Google's private global network, synchronous Paxos replication across datacenters, and the TrueTime API, which uses GPS receivers and atomic clocks to tightly bound clock uncertainty.
This demonstrates that high consistency and high availability can coexist—if you're willing to pay the engineering cost.
| System | Availability Target | Key Strategy | Trade-off Accepted |
|---|---|---|---|
| AWS S3 | 99.99% | Multi-AZ replication, automatic failover | Higher latency for strong durability |
| Google Search | 99.999%+ | Massive redundancy, graceful degradation | Slightly stale results acceptable |
| Netflix | 99.99% | Multi-region active-active, chaos tested | Eventual consistency in recommendations |
| Stripe | 99.99% | Multi-cloud, idempotent APIs | Higher complexity, higher cost |
| Discord | 99.95% | Erlang/Elixir fault tolerance, hot code swap | Learning curve for unusual tech stack |
| Amazon DynamoDB | 99.999% | Multi-AZ with automatic failover | Limited query flexibility for consistency |
Case Study 3: The 2017 S3 Outage
On February 28, 2017, a single typo in an S3 maintenance command took down a significant portion of the internet for several hours. Services affected included Slack, Trello, Quora, and parts of AWS itself.
Lessons learned: even routine operational commands need guardrails; subsystems that haven't been fully restarted in years recover slowly; and status monitoring that depends on the failing system (the AWS status dashboard itself relied on S3) goes dark exactly when you need it.
Changes Amazon made: the capacity-removal tool was changed to remove servers more slowly and to refuse to take any subsystem below its minimum required capacity, recovery of the affected subsystems was made faster, and the status dashboard was decoupled from its dependency on S3.
AWS, Google, and other cloud providers experience outages despite billions of dollars in infrastructure. The lesson: design your systems to survive your dependency's failures. Multi-region, multi-cloud, and graceful degradation aren't paranoia—they're prudent engineering.
We've explored consistency and availability separately. Now we confront the heart of the CAP theorem: under network partitions, you cannot have both.
The Partition Scenario:
Imagine a two-node distributed database. A network partition occurs—the nodes can't communicate. A client sends a write to Node A. What happens?
If you choose Consistency: Node A refuses the write (or blocks) because it cannot coordinate with Node B. The data stays consistent, but the client gets an error or a timeout—the system is unavailable.
If you choose Availability: Node A accepts the write locally and responds with success. The client is served, but Node B knows nothing about the write—the replicas are now inconsistent.
This isn't a design flaw—it's a fundamental impossibility, conjectured by Brewer and formally proven by Gilbert and Lynch.
```
SCENARIO: Network partition between two data centers

   ┌─────────────┐    PARTITION    ┌─────────────┐
   │   Node A    │ ─ ─ ─ ✕ ─ ─ ─ ─ │   Node B    │
   │   (DC 1)    │                 │   (DC 2)    │
   └─────────────┘                 └─────────────┘
         │                               │
     Client X                        Client Y
   write(x=1)                      write(x=2)

OPTION 1: CHOOSE CONSISTENCY (CP)
├── Node A: "I can't reach Node B. I must refuse this write."
├── Node B: "I can't reach Node A. I must refuse this write."
├── Both clients get errors or timeouts
└── Result: CONSISTENT but UNAVAILABLE

OPTION 2: CHOOSE AVAILABILITY (AP)
├── Node A: "I'll accept write x=1 from Client X"
├── Node B: "I'll accept write x=2 from Client Y"
├── Both clients get success
└── After partition heals: x=1 or x=2? CONFLICT!
    Result: AVAILABLE but INCONSISTENT

THERE IS NO OPTION 3.
```
Why can't we have both?
The proof is elegant: assume a partition separates the nodes. One client writes a new value on one side, then another client reads on the other side. No message can cross the partition, so the node serving the read must either wait indefinitely for the partition to heal—violating availability—or answer with the old value—violating consistency. No protocol can escape this choice.
The key insight:
The CAP theorem doesn't say you can never have consistency and availability. It says you can't have both during a partition. In the absence of partitions, you can have both. This is why the PACELC theorem (which we'll cover later) extends CAP to describe behavior when partitions aren't occurring.
Some argue: 'Our network is reliable, so we don't need to worry about partitions.' This is dangerously wrong. Network partitions happen. Switches fail. Cables get cut. Misconfigurations create logical partitions. Cloud providers have region-wide network issues. Design for partitions, or your system will fail when they inevitably occur.
Given that we must make trade-offs, how do we architect systems that maximize availability while managing consistency challenges?
Pattern 1: Eventual Consistency with Conflict Resolution
Accept that replicas may temporarily diverge, but implement mechanisms to detect and resolve conflicts: last-writer-wins timestamps, version vectors that detect concurrent updates, CRDTs that merge automatically, or application-level merge logic (such as taking the union of two shopping carts).
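The simplest of these strategies, last-writer-wins, fits in a few lines. A sketch (types and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: float  # wall-clock write time; node_id breaks exact ties
    node_id: str

def last_writer_wins(a: Versioned, b: Versioned) -> Versioned:
    """Resolve a conflict by keeping the most recent write.

    Deterministic and order-independent, so both replicas converge
    on the same value after the partition heals -- but the losing
    write is silently discarded. Acceptable for profile fields;
    dangerous for counters or carts.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))

# The x=1 / x=2 conflict from the partition scenario:
x_on_a = Versioned("x=1", timestamp=100.0, node_id="A")
x_on_b = Versioned("x=2", timestamp=101.5, node_id="B")
assert last_writer_wins(x_on_a, x_on_b).value == "x=2"
assert last_writer_wins(x_on_b, x_on_a).value == "x=2"  # same either way
```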
Pattern 2: Read-Your-Writes Guarantee
Ensure clients always see their own writes, even if other clients see stale data: pin a client's reads to the replica that took its write, or have the client carry a version token and serve its reads only from replicas that have caught up to that version.
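The version-token variant can be sketched with toy replicas (all class and field names are illustrative, not a real client library):

```python
class Replica:
    """Toy replica: a monotonically increasing version plus a key-value map."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def put(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

    def get(self, key):
        return self.data.get(key)


class SessionStore:
    """Read-your-writes: the session remembers the version of its last
    write and reads only from replicas that have caught up to it."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, session, key, value):
        session["min_version"] = self.primary.put(key, value)

    def read(self, session, key):
        needed = session.get("min_version", 0)
        for replica in self.replicas:      # prefer cheap replica reads...
            if replica.version >= needed:  # ...but only if they've caught up
                return replica.get(key)
        return self.primary.get(key)       # fall back to the primary
```

A client that just wrote falls back to the primary until replication catches up, while sessions with no recent writes happily read (possibly stale) replica data.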
Pattern 3: Async Replication with Sync Fallback
Normally use async replication for speed, but allow callers to request sync replication for critical operations:
POST /orders
X-Consistency: strong // Use sync replication for this request
This lets you tune consistency per-operation rather than per-system.
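A sketch of how a handler might honor that hint (the `X-Consistency` header comes from the example above; `FakeDB` and the `wait_for_replicas` flag are illustrative stand-ins, not a real API):

```python
class FakeDB:
    """Stand-in database that records how each write was requested."""
    def __init__(self):
        self.calls = []

    def write(self, record, wait_for_replicas):
        self.calls.append((record, wait_for_replicas))


def create_order(order, headers, db):
    """Choose the replication mode per request, defaulting to async."""
    sync = headers.get("X-Consistency") == "strong"
    db.write(order, wait_for_replicas=sync)  # sync: slower but durable everywhere
    return {"status": "accepted", "durable_everywhere": sync}
```

Defaulting to async keeps the common path fast; only the callers that truly need strong guarantees pay the replication latency.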
Pattern 4: The Outbox Pattern for Cross-Service Reliability
When you need to update a database AND publish an event, use the Outbox Pattern: write the business data and the event to an "outbox" table in the same local transaction; a separate relay process then reads unpublished events from the outbox, publishes them to the message broker, and marks them as sent.
This guarantees that if the database update succeeds, the event will eventually be published—achieving reliability without distributed transactions.
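A sketch of the pattern with SQLite standing in for the service database and a plain callback standing in for the broker (table and event names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id, total):
    """Write the business row AND the event in ONE local transaction."""
    with conn:  # commit both or neither -- no distributed transaction needed
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (f"order_created:{order_id}",))

def relay(publish):
    """Background relay: drain unpublished events to the broker."""
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(event)  # at-least-once delivery: consumers must be idempotent
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the process crashes after the commit but before the relay runs, the event is still in the outbox and will be published on the next pass—which is exactly the "eventually published" guarantee the pattern provides.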
You cannot bolt availability onto a system later. Architectures that assume reliable networks, single databases, or synchronous processing cannot be made highly available without fundamental rewrite. Design for availability from day one, even if you don't implement all redundancy immediately.
We've deeply explored availability—the 'A' in CAP—understanding what it means, how to measure it, what breaks it, and how to achieve it. Let's consolidate the essential insights.
What's Next:
We've covered Consistency (C) and Availability (A). The next page explores Partition Tolerance—the 'P' in CAP. You'll learn why partition tolerance isn't really optional, which reframes CAP as a choice between CP and AP systems rather than a three-way trade-off.
You now understand availability in the context of the CAP theorem: what it means formally, how to measure it, common anti-patterns that break it, strategies to achieve it, and its fundamental tension with consistency. This knowledge is essential for making informed architectural decisions in distributed systems.