At 10:47 AM on a Tuesday, Amazon's inventory service becomes unreachable from one of its datacenters. Within seconds, millions of dollars in potential transactions hang in the balance. Every second of downtime represents lost sales, frustrated customers, and damaged trust.
The engineering team faces a choice: Should the system reject all requests until the partition heals, ensuring data consistency? Or should it continue serving requests from local data, potentially selling items that aren't actually in stock?
This scenario illustrates the essence of availability in distributed systems—the guarantee that every request receives a meaningful response. It's not just about uptime; it's about the promise that the system will always try to serve you, even when things go wrong.
Availability is the 'A' in CAP, and understanding its precise meaning reveals why perfect consistency and perfect availability cannot coexist in a partitioned network.
By the end of this page, you will understand the formal definition of availability in the CAP theorem, how availability metrics are measured in production systems, the engineering techniques that maximize availability, and why guaranteeing availability during partitions fundamentally conflicts with consistency.
Like consistency, 'availability' has multiple meanings in computing. The CAP theorem uses a very specific definition:
CAP Availability: Every request received by a non-failing node in the system must result in a response.
This definition is deceptively simple but carries profound implications:
What CAP Availability Does NOT Guarantee:
CAP availability is theoretical—it describes a guarantee that all non-failed nodes always respond. This is different from 'five nines' availability (99.999%), which measures what percentage of requests succeed over time. A system can have high SLA availability while not being CAP-available (if it rejects requests during partitions), and vice versa.
| Definition | Context | Measurement | Failure Behavior |
|---|---|---|---|
| CAP Availability | Distributed systems theory | Binary (available or not) | Non-failed nodes must respond |
| SLA Availability | Business metrics | Percentage uptime | % of successful requests |
| High Availability (HA) | System architecture | Ability to survive failures | Redundancy and failover |
| Fault Tolerance | System design | Continued operation despite faults | Graceful degradation |
The Response Requirement:
CAP availability requires that the system respond, but it doesn't specify the quality of the response. A CAP-available system might return stale data from a local replica, a cached or default value, or a partial result; the only requirement is that it responds.
This is where the tension with consistency emerges. A CAP-available system will respond, but that response might not be consistent with what other nodes would return. The system chooses to respond with something rather than wait indefinitely for consistency.
While CAP availability is a theoretical property, practical systems measure availability using quantitative metrics:
The 'Nines' of Availability:
Availability is commonly expressed as a percentage of uptime over a period (usually one year):
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Industry Example |
|---|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Managed internal service |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | Enterprise SaaS |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes | Cloud provider baseline |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | Financial systems |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | Telecommunication |
| 99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 0.605 seconds | Emergency services |
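The downtime budgets in the table follow directly from the percentage. As a sanity check, a small helper (illustrative, not from any library; it uses a 365-day year and an average month of year/12) converts an availability fraction into downtime budgets:

```python
def allowed_downtime(availability: float) -> dict[str, float]:
    """Convert an availability fraction into downtime budgets, in seconds.

    Uses a 365-day year; the 'month' is a calendar average (year / 12).
    """
    per_year = (1 - availability) * 365 * 24 * 3600
    return {
        "per_year": per_year,
        "per_month": per_year / 12,
        "per_week": per_year / 52,
    }

budget = allowed_downtime(0.999)                      # three nines
print(f"{budget['per_year'] / 3600:.2f} h/year")      # 8.76
print(f"{budget['per_month'] / 60:.1f} min/month")    # 43.8
print(f"{budget['per_week'] / 60:.1f} min/week")      # 10.1
```

The printed values match the three-nines row above; try 0.99999 to see how brutally small a five-nines budget is.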
Calculating Availability:
Availability = Uptime / (Uptime + Downtime)
Or with MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair):
Availability = MTBF / (MTBF + MTTR)
This formula reveals a crucial insight: availability improves by either increasing time between failures (better hardware, redundancy) or decreasing time to recover (automation, hot standbys).
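To see that insight numerically, here is a minimal sketch (the function name is our own) comparing two recovery times at the same failure rate:

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same failure rate (one failure per 1000 hours), different recovery speed:
print(f"{availability_from_mtbf(1000, 1.0):.4%}")  # manual recovery, ~1 hour  -> ~99.90%
print(f"{availability_from_mtbf(1000, 0.1):.4%}")  # automated failover, ~6 min -> ~99.99%
```

Cutting MTTR by 10x buys a full extra nine without touching the failure rate at all, which is why automation and hot standbys matter so much.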
Composite Availability:
Real systems are composed of multiple components. The overall availability depends on the architecture:
Serial Components (all must work):
A_total = A_1 × A_2 × ... × A_n
Example: If A = B = C = 99%, then A_total = 0.99³ = 97.03%
Parallel Components (any can substitute):
A_total = 1 - (1-A_1) × (1-A_2) × ... × (1-A_n)
Example: If A = B = 99%, then A_total = 1 - 0.01² = 99.99%
```python
def calculate_serial_availability(components: list[float]) -> float:
    """
    For components in series (all must work), multiply availabilities.

    Example: Database → Application Server → Load Balancer
    If each is 99.9% available, total = 0.999^3 = 99.7%
    """
    availability = 1.0
    for component in components:
        availability *= component
    return availability


def calculate_parallel_availability(components: list[float]) -> float:
    """
    For components in parallel (any can substitute), use complement multiplication.

    Example: Two database replicas, each 99.9%
    Unavailability = 0.001 * 0.001 = 0.000001
    Availability = 1 - 0.000001 = 99.9999%
    """
    unavailability = 1.0
    for component in components:
        unavailability *= (1 - component)
    return 1 - unavailability


def calculate_system_availability():
    """
    Real-world example: 3-tier web application with redundancy

    Architecture:
    ┌────────────────────────────────────────────────────┐
    │  Load Balancers (99.99% each)                      │
    │    ┌─────────┐    ┌─────────┐                      │
    │    │   LB1   │    │   LB2   │ ← Parallel           │
    │    └────┬────┘    └────┬────┘   (redundant)        │
    │         └──────┬───────┘                           │
    │                │                                   │
    │     ┌──────────▼──────────┐                        │
    │     │ App Servers (99.9%) │ ← 3 parallel           │
    │     │ ┌───┐ ┌───┐ ┌───┐   │                        │
    │     │ │AS1│ │AS2│ │AS3│   │                        │
    │     │ └───┘ └───┘ └───┘   │                        │
    │     └──────────┬──────────┘                        │
    │                │                                   │
    │     ┌──────────▼──────────┐                        │
    │     │  Database (99.95%)  │ ← Primary + Replica    │
    │     │ ┌───────┐┌────────┐ │                        │
    │     │ │Primary││Replica │ │                        │
    │     │ └───────┘└────────┘ │                        │
    │     └─────────────────────┘                        │
    └────────────────────────────────────────────────────┘
    """
    # Layer 1: Load Balancers (parallel)
    lb_single = 0.9999
    lb_availability = calculate_parallel_availability([lb_single, lb_single])
    print(f"Load Balancer Layer: {lb_availability:.6%}")
    # Result: 99.999999% (virtually perfect)

    # Layer 2: Application Servers (parallel, need 1 of 3)
    app_single = 0.999
    app_availability = calculate_parallel_availability([app_single] * 3)
    print(f"Application Layer: {app_availability:.6%}")
    # Result: 99.9999999% (3 replicas = very high availability)

    # Layer 3: Database (parallel with primary and replica)
    db_single = 0.9995
    db_availability = calculate_parallel_availability([db_single, db_single])
    print(f"Database Layer: {db_availability:.6%}")
    # Result: 99.999975%

    # Total system (serial across layers)
    total = calculate_serial_availability([
        lb_availability,
        app_availability,
        db_availability
    ])
    print(f"Total System: {total:.6%}")
    # Result: ~99.999975%

    # Without redundancy for comparison
    no_redundancy = calculate_serial_availability([
        0.9999,  # Single LB
        0.999,   # Single App Server
        0.9995   # Single Database
    ])
    print(f"Without Redundancy: {no_redundancy:.4%}")
    # Result: ~99.84%


if __name__ == "__main__":
    calculate_system_availability()
```

Each additional 'nine' of availability is exponentially harder to achieve. Going from 99% to 99.9% requires a 10x reduction in downtime. Going from 99.9% to 99.99% requires another 10x reduction. This is why high-availability systems rely on massive redundancy—single points of failure become unacceptable at these scales.
The true test of a distributed system's availability is not during normal operation—it's during a network partition. This is where the CAP theorem becomes relevant.
What is a Network Partition?
A network partition occurs when network failures prevent some nodes from communicating with others, dividing the cluster into isolated groups. Each group can still function internally but cannot reach nodes in other groups.
Common Causes of Partitions: switch and router failures, misconfigured firewalls or network ACLs, congested or severed inter-datacenter links, and long garbage-collection pauses that make a node appear unreachable even though it is running.
The Availability Decision During Partition:
When a partition occurs, nodes on each side of the partition face a fundamental question:
Should I continue serving requests with my local data, or should I refuse requests because I can't coordinate with other nodes?
Option 1: Prioritize Availability (AP System)
Option 2: Prioritize Consistency (CP System)
```
SCENARIO: 5-node cluster splits into groups of 3 and 2

Before Partition:
┌─────────────────────────────┐
│   [A]  [B]  [C]  [D]  [E]   │
│   Nodes communicate freely  │
│   Majority quorum = 3       │
└─────────────────────────────┘

During Partition:
┌─────────────┐  X  ┌─────────┐
│ [A] [B] [C] │     │ [D] [E] │
│  (3 nodes)  │     │(2 nodes)│
│ HAS QUORUM  │     │NO QUORUM│
└─────────────┘     └─────────┘

CP SYSTEM BEHAVIOR (e.g., ZooKeeper with quorum=3):
┌───────────────────────────────────────────────────────┐
│ Partition A (nodes A, B, C):                          │
│   ✓ Can accept writes (has quorum of 3)               │
│   ✓ Can serve consistent reads                        │
│                                                       │
│ Partition B (nodes D, E):                             │
│   ✗ Cannot accept writes (only 2, needs 3)            │
│   ✗ Rejects requests with "No quorum" error           │
│   ✗ System is UNAVAILABLE in this partition           │
└───────────────────────────────────────────────────────┘

AP SYSTEM BEHAVIOR (e.g., Cassandra with quorum=1):
┌───────────────────────────────────────────────────────┐
│ Partition A (nodes A, B, C):                          │
│   ✓ Accepts writes with local quorum                  │
│   ✓ Serves reads from local data                      │
│                                                       │
│ Partition B (nodes D, E):                             │
│   ✓ ALSO accepts writes!                              │
│   ✓ Serves reads from its (potentially stale) data    │
│                                                       │
│ PROBLEM: Two clients writing to the same key          │
│ can create conflicting versions:                      │
│   - Client 1 → Node A: SET x = "foo"                  │
│   - Client 2 → Node D: SET x = "bar"                  │
│   - When partition heals: x = "foo" or "bar"??        │
└───────────────────────────────────────────────────────┘

After Partition Heals:
┌─────────────────────────────┐
│   [A]  [B]  [C]  [D]  [E]   │
│                             │
│ CP: D and E catch up from   │
│     A, B, C. No conflicts.  │
│                             │
│ AP: Conflict resolution     │
│     needed. May use:        │
│     - Last-write-wins (LWW) │
│     - Vector clocks         │
│     - Application merge     │
└─────────────────────────────┘
```

During a network partition, you MUST choose: either some nodes cannot respond (lose availability), or nodes respond with potentially inconsistent data (lose consistency). There is no third option. This is not a bug in system design—it's a fundamental theorem about distributed systems.
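The quorum rule in this scenario reduces to a one-line majority check. A sketch, assuming a simple strict-majority policy:

```python
def has_quorum(live_nodes: int, cluster_size: int) -> bool:
    """A partition side may accept writes only if it holds a strict majority.

    Strict majority also prevents split-brain on an even split:
    2 of 4 is NOT a quorum, so neither half of a 2/2 split can write.
    """
    return live_nodes > cluster_size // 2

# The 5-node cluster above, split 3/2:
print(has_quorum(3, 5))  # True  -> this side keeps accepting writes in a CP system
print(has_quorum(2, 5))  # False -> this side rejects writes (unavailable, but consistent)
```

This is why CP systems usually run an odd number of nodes: an even cluster pays for an extra machine without raising the number of failures it can tolerate.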
While CAP availability is theoretical, practical high availability requires careful engineering across multiple dimensions:
Core Principles of High Availability:
Eliminate Single Points of Failure (SPOF)
Design for Fault Isolation
Automate Failure Detection and Recovery
Replication Strategies for Availability:
Active-Passive (Primary-Secondary):
Active-Active (Multi-Master):
Leader Election:
Active-active replication is powerful but comes with significant complexity. Systems like Amazon DynamoDB and Apache Cassandra use it effectively, but they require careful conflict resolution strategies (last-write-wins, vector clocks, CRDTs) and application-level awareness of eventual consistency. Don't use active-active unless you truly need it.
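As a taste of the simplest of those strategies, here is a minimal last-write-wins merge sketch (the types and names are ours, and real systems must also contend with clock skew, which this ignores):

```python
from dataclasses import dataclass


@dataclass
class VersionedValue:
    value: str
    timestamp: float  # wall-clock write time; assumes roughly synchronized clocks


def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: when the partition heals, keep the later-timestamped version.

    Simple and deterministic, but the losing write is silently discarded.
    """
    return a if a.timestamp >= b.timestamp else b


# The conflicting writes from the partition scenario:
x_in_partition_a = VersionedValue("foo", timestamp=1000.0)
x_in_partition_b = VersionedValue("bar", timestamp=1002.5)
print(lww_merge(x_in_partition_a, x_in_partition_b).value)  # "bar"
```

The silent data loss is exactly why vector clocks and CRDTs exist: they detect or sidestep the conflict instead of quietly dropping one side of it.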
Within CAP-available systems, there's an additional trade-off that CAP doesn't capture: the relationship between availability and latency. This is described by the PACELC theorem:
PACELC Theorem:
If there is a Partition (P), how does the system trade off Availability and Consistency (A and C); Else (E), when the system is running normally in the absence of partitions, how does the system trade off Latency (L) and Consistency (C)?
In other words: if a partition occurs, choose between availability and consistency; otherwise, choose between latency and consistency.
This extends CAP to acknowledge that even without partitions, you're making trade-offs. A strongly consistent system incurs latency costs even when all nodes are reachable.
| System | Under Partition (P) | Else/Normal (E) | Classification |
|---|---|---|---|
| DynamoDB | Availability | Latency | PA/EL |
| Cassandra | Availability | Latency | PA/EL |
| MongoDB | Availability (default) | Latency | PA/EL |
| Spanner | Consistency | Consistency | PC/EC |
| ZooKeeper | Consistency | Consistency | PC/EC |
| CockroachDB | Consistency | Consistency | PC/EC |
| RethinkDB | Consistency | Latency | PC/EL |
| VoltDB | Consistency | Latency | PC/EL |
The Latency Dimension:
Consider a write operation to a distributed database:
For Availability + Low Latency (PA/EL):
For Availability + Consistency (PA/EC):
For Consistency + Low Latency (PC/EL):
For Consistency + Consistency (PC/EC):
```python
import asyncio


class ConsistencyError(Exception):
    """Raised when a write cannot reach the required quorum."""


class DatabaseWriteStrategies:
    """
    Demonstrates the latency-consistency trade-off in write operations.

    Assumes self.local_node, self.any_remote_node, self.remote_nodes,
    self.all_nodes, and self.quorum_size are configured elsewhere.
    """

    def write_low_latency(self, data):
        """
        PA/EL Strategy: Prioritize latency over consistency.
        Write locally and return immediately.
        """
        # Write to local node only
        self.local_node.write(data)

        # Acknowledge to client immediately
        # Latency: ~1-5ms (local write only)
        return {"status": "acknowledged", "consistency": "eventual"}

        # Background replication happens asynchronously;
        # other nodes receive the update milliseconds to seconds later

    async def write_balanced(self, data):
        """
        Middle ground: Write to local and one remote before acknowledging.
        Provides better durability without full consistency.
        """
        # Write to local node
        self.local_node.write(data)

        # Wait for at least one remote node to acknowledge
        # Latency: ~10-100ms (depends on network distance)
        await self.any_remote_node.write(data)

        return {"status": "acknowledged", "consistency": "semi-sync"}

    async def write_strong_consistency(self, data):
        """
        PC/EC Strategy: Prioritize consistency over latency.
        Wait for a quorum of nodes before acknowledging.
        """
        # Write to local node
        self.local_node.write(data)

        # Wait for a majority of nodes to acknowledge
        # Latency: ~50-500ms (depends on slowest quorum node)
        acks = 1
        for node in self.remote_nodes:
            try:
                await asyncio.wait_for(node.write(data), timeout=1.0)
                acks += 1
                if acks >= self.quorum_size:
                    break
            except asyncio.TimeoutError:
                continue

        if acks >= self.quorum_size:
            return {"status": "committed", "consistency": "strong"}
        else:
            # Not enough acknowledgments - must roll back or retry
            raise ConsistencyError("Quorum not reached")

    async def write_ultimate_consistency(self, data):
        """
        PC/EC with synchronous replication to ALL nodes.
        Maximum latency, maximum consistency.
        """
        # Must successfully write to ALL nodes
        # Latency: determined by the slowest node
        # Availability: if ANY node fails, the write fails
        for node in self.all_nodes:
            await node.write(data)  # Fails entire write if any node unavailable

        return {"status": "committed", "consistency": "linearizable"}


# Typical latencies for a 5-node cluster spanning 2 datacenters:
#
# Strategy             | Latency   | Durability | Consistency
# ---------------------|-----------|------------|-------------
# Local only           | 1-5ms     | Low        | Eventual
# Local + 1 remote     | 50-100ms  | Medium     | Semi
# Majority quorum (3)  | 100-200ms | High       | Strong
# All nodes sync       | 200-500ms | Highest    | Linearizable
```

PACELC is more practical than CAP because partitions are relatively rare, but latency is constant. Most of the time, your system operates in the 'E' clause. Understanding this trade-off helps you choose the right database for your latency and consistency requirements during normal operation, not just during failures.
Understanding availability also means recognizing common mistakes that silently destroy it:
1. Hidden Single Points of Failure:
You might have redundant databases, but the single load balancer in front of them, the DNS entry pointing at them, or one shared configuration service can still take the whole path down.
All redundancy is useless if a single hidden component can take everything down.
2. Correlated Failures:
Redundancy only helps if failures are independent. Watch for replicas that share a rack, power feed, software version, or deployment pipeline: a fault in the shared element takes out every copy at once.
3. The Thundering Herd Problem:
When a failed component comes back online, every waiting client reconnects and retries simultaneously, and the resulting surge can overwhelm it all over again.
Solution: Jittered backoff, gradual connection draining, and capacity buffers.
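A common form of jittered backoff is "full jitter": draw the delay uniformly between zero and a capped exponential, so retries spread out instead of arriving in synchronized waves. A sketch (parameter values are illustrative):

```python
import random


def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' exponential backoff.

    Delay is drawn uniformly from [0, min(cap, base * 2^attempt)], so no two
    clients are likely to retry at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# Each client sleeps a different, growing-but-capped amount between retries:
for attempt in range(5):
    print(f"attempt {attempt}: sleep {jittered_backoff(attempt):.2f}s")
```

Without the jitter, every client that failed at the same moment would also retry at the same moment, recreating the herd on every backoff cycle.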
4. Cascading Failures:
When one component slows down, its callers block waiting on it, their request queues fill, and the backpressure spreads upstream until services exhaust their own threads and connections.
Solution: Timeouts, circuit breakers, and bulkheads prevent one slow component from taking down everything.
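A minimal circuit breaker sketch shows the idea (thresholds and timeouts are illustrative; production libraries add richer half-open probing, metrics, and per-endpoint state):

```python
import time


class CircuitBreaker:
    """After `threshold` consecutive failures, the circuit opens and calls
    fail fast instead of piling onto a struggling dependency. After
    `reset_timeout` seconds, one trial call is allowed through (half-open)."""

    def __init__(self, threshold: int = 3, reset_timeout: float = 30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast converts an unbounded wait on a sick dependency into an immediate, cheap error, which is exactly what stops the cascade.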
Most availability disasters happen because engineers assumed something would work that was never tested. Kill random services in production. Simulate network partitions. Inject latency. The only way to know your system is highly available is to prove it by breaking things in controlled ways.
We've explored the 'A' in the CAP theorem comprehensively. The key insights: CAP availability means every request to a non-failing node must receive a response; practical availability is measured in 'nines', each one exponentially harder to reach than the last; during a partition, a system must either respond with potentially inconsistent data or refuse to respond; and PACELC extends the trade-off to latency versus consistency during normal operation.
What's Next:
We've explored Consistency (C) and Availability (A). The next page examines Partition Tolerance (P)—the property that makes CAP interesting. You'll learn why almost every distributed system must be partition tolerant, making the real CAP choice between C and A.
You now understand availability in the context of distributed systems—its formal definition, practical measurement, the PACELC extension, and engineering techniques for high availability. This prepares you to understand why partition tolerance forces the CAP trade-off.