At 10:47 AM on a Tuesday, Amazon's inventory service becomes unreachable from one of its datacenters. Within seconds, millions of dollars in potential transactions hang in the balance. Every second of downtime represents lost sales, frustrated customers, and damaged trust.
The engineering team faces a choice: Should the system reject all requests until the partition heals, ensuring data consistency? Or should it continue serving requests from local data, potentially selling items that aren't actually in stock?
This scenario illustrates the essence of availability in distributed systems—the guarantee that every request receives a meaningful response. It's not just about uptime; it's about the promise that the system will always try to serve you, even when things go wrong.
Availability is the 'A' in CAP, and understanding its precise meaning reveals why perfect consistency and perfect availability cannot coexist in a partitioned network.
By the end of this page, you will understand the formal definition of availability in the CAP theorem, how availability metrics are measured in production systems, the engineering techniques that maximize availability, and why guaranteeing availability during partitions fundamentally conflicts with consistency.
Like consistency, 'availability' has multiple meanings in computing. The CAP theorem uses a very specific definition:
CAP Availability: Every request received by a non-failing node in the system must result in a response.
This definition is deceptively simple but carries profound implications:
What CAP Availability Does NOT Guarantee:
CAP availability is theoretical—it describes a guarantee that all non-failed nodes always respond. This is different from 'five nines' availability (99.999%), which measures what percentage of requests succeed over time. A system can have high SLA availability while not being CAP-available (if it rejects requests during partitions), and vice versa.
| Definition | Context | Measurement | Failure Behavior |
|---|---|---|---|
| CAP Availability | Distributed systems theory | Binary (available or not) | Non-failed nodes must respond |
| SLA Availability | Business metrics | Percentage uptime | % of successful requests |
| High Availability (HA) | System architecture | Ability to survive failures | Redundancy and failover |
| Fault Tolerance | System design | Continued operation despite faults | Graceful degradation |
The Response Requirement:
CAP availability requires that the system respond, but it doesn't specify the quality of the response. A CAP-available system might return stale data from a local replica, a cached or default value, or a partial result; the only requirement is that it responds.
This is where the tension with consistency emerges. A CAP-available system will respond, but that response might not be consistent with what other nodes would return. The system chooses to respond with something rather than wait indefinitely for consistency.
While CAP availability is a theoretical property, practical systems measure availability using quantitative metrics:
The 'Nines' of Availability:
Availability is commonly expressed as a percentage of uptime over a period (usually one year):
| Availability | Downtime/Year | Downtime/Month | Downtime/Week | Industry Example |
|---|---|---|---|---|
| 99% | 3.65 days | 7.2 hours | 1.68 hours | Managed internal service |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | Enterprise SaaS |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes | Cloud provider baseline |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | Financial systems |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | Telecommunication |
| 99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 0.605 seconds | Emergency services |
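The downtime budgets in the table follow directly from the percentage. As a sanity check, a small helper (illustrative, not from any library; it uses a 365-day year and an average month of year/12) converts an availability fraction into downtime budgets:

```python
def allowed_downtime(availability: float) -> dict[str, float]:
    """Convert an availability fraction into downtime budgets, in seconds.

    Uses a 365-day year; the 'month' is a calendar average (year / 12).
    """
    per_year = (1 - availability) * 365 * 24 * 3600
    return {
        "per_year": per_year,
        "per_month": per_year / 12,
        "per_week": per_year / 52,
    }

budget = allowed_downtime(0.999)                      # three nines
print(f"{budget['per_year'] / 3600:.2f} h/year")      # 8.76
print(f"{budget['per_month'] / 60:.1f} min/month")    # 43.8
print(f"{budget['per_week'] / 60:.1f} min/week")      # 10.1
```

The printed values match the three-nines row above; try 0.99999 to see how brutally small a five-nines budget is.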
Calculating Availability:
Availability = Uptime / (Uptime + Downtime)
Or with MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair):
Availability = MTBF / (MTBF + MTTR)
This formula reveals a crucial insight: availability improves by either increasing time between failures (better hardware, redundancy) or decreasing time to recover (automation, hot standbys).
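To see that insight numerically, here is a minimal sketch (the function name is our own) comparing two recovery times at the same failure rate:

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same failure rate (one failure per 1000 hours), different recovery speed:
print(f"{availability_from_mtbf(1000, 1.0):.4%}")  # manual recovery, ~1 hour  -> ~99.90%
print(f"{availability_from_mtbf(1000, 0.1):.4%}")  # automated failover, ~6 min -> ~99.99%
```

Cutting MTTR by 10x buys a full extra nine without touching the failure rate at all, which is why automation and hot standbys matter so much.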
Composite Availability:
Real systems are composed of multiple components. The overall availability depends on the architecture:
Serial Components (all must work):
A_total = A_1 × A_2 × ... × A_n
Example: If A = B = C = 99%, then A_total = 0.99³ = 97.03%
Parallel Components (any can substitute):
A_total = 1 - (1-A_1) × (1-A_2) × ... × (1-A_n)
Example: If A = B = 99%, then A_total = 1 - 0.01² = 99.99%
```python
def calculate_serial_availability(components: list[float]) -> float:
    """
    For components in series (all must work), multiply availabilities.

    Example: Database → Application Server → Load Balancer
    If each is 99.9% available, total = 0.999^3 = 99.7%
    """
    availability = 1.0
    for component in components:
        availability *= component
    return availability


def calculate_parallel_availability(components: list[float]) -> float:
    """
    For components in parallel (any can substitute), use complement multiplication.

    Example: Two database replicas, each 99.9%
    Unavailability = 0.001 * 0.001 = 0.000001
    Availability = 1 - 0.000001 = 99.9999%
    """
    unavailability = 1.0
    for component in components:
        unavailability *= (1 - component)
    return 1 - unavailability


def calculate_system_availability():
    """
    Real-world example: 3-tier web application with redundancy

    Architecture:
    ┌────────────────────────────────────────────────────┐
    │  Load Balancers (99.99% each)                      │
    │    ┌─────────┐    ┌─────────┐                      │
    │    │   LB1   │    │   LB2   │ ← Parallel           │
    │    └────┬────┘    └────┬────┘   (redundant)        │
    │         └──────┬───────┘                           │
    │                │                                   │
    │     ┌──────────▼──────────┐                        │
    │     │ App Servers (99.9%) │ ← 3 parallel           │
    │     │ ┌───┐ ┌───┐ ┌───┐   │                        │
    │     │ │AS1│ │AS2│ │AS3│   │                        │
    │     │ └───┘ └───┘ └───┘   │                        │
    │     └──────────┬──────────┘                        │
    │                │                                   │
    │     ┌──────────▼──────────┐                        │
    │     │  Database (99.95%)  │ ← Primary + Replica    │
    │     │ ┌───────┐┌────────┐ │                        │
    │     │ │Primary││Replica │ │                        │
    │     │ └───────┘└────────┘ │                        │
    │     └─────────────────────┘                        │
    └────────────────────────────────────────────────────┘
    """
    # Layer 1: Load Balancers (parallel)
    lb_single = 0.9999
    lb_availability = calculate_parallel_availability([lb_single, lb_single])
    print(f"Load Balancer Layer: {lb_availability:.6%}")
    # Result: 99.999999% (virtually perfect)

    # Layer 2: Application Servers (parallel, need 1 of 3)
    app_single = 0.999
    app_availability = calculate_parallel_availability([app_single] * 3)
    print(f"Application Layer: {app_availability:.6%}")
    # Result: 99.9999999% (3 replicas = very high availability)

    # Layer 3: Database (parallel with primary and replica)
    db_single = 0.9995
    db_availability = calculate_parallel_availability([db_single, db_single])
    print(f"Database Layer: {db_availability:.6%}")
    # Result: 99.999975%

    # Total system (serial across layers)
    total = calculate_serial_availability([
        lb_availability,
        app_availability,
        db_availability
    ])
    print(f"Total System: {total:.6%}")
    # Result: ~99.999975%

    # Without redundancy for comparison
    no_redundancy = calculate_serial_availability([
        0.9999,  # Single LB
        0.999,   # Single App Server
        0.9995   # Single Database
    ])
    print(f"Without Redundancy: {no_redundancy:.4%}")
    # Result: ~99.84%


if __name__ == "__main__":
    calculate_system_availability()
```

Each additional 'nine' of availability is exponentially harder to achieve. Going from 99% to 99.9% requires a 10x reduction in downtime. Going from 99.9% to 99.99% requires another 10x reduction. This is why high-availability systems rely on massive redundancy—single points of failure become unacceptable at these scales.
The true test of a distributed system's availability is not during normal operation—it's during a network partition. This is where the CAP theorem becomes relevant.
What is a Network Partition?
A network partition occurs when network failures prevent some nodes from communicating with others, dividing the cluster into isolated groups. Each group can still function internally but cannot reach nodes in other groups.
Common Causes of Partitions: switch and router failures, misconfigured firewalls or network ACLs, congested or severed inter-datacenter links, and long garbage-collection pauses that make a node appear unreachable even though it is running.
The Availability Decision During Partition:
When a partition occurs, nodes on each side of the partition face a fundamental question:
Should I continue serving requests with my local data, or should I refuse requests because I can't coordinate with other nodes?
Option 1: Prioritize Availability (AP System)
Option 2: Prioritize Consistency (CP System)
```
SCENARIO: 5-node cluster splits into groups of 3 and 2

Before Partition:
┌─────────────────────────────┐
│   [A]  [B]  [C]  [D]  [E]   │
│   Nodes communicate freely  │
│   Majority quorum = 3       │
└─────────────────────────────┘

During Partition:
┌─────────────┐  X  ┌─────────┐
│ [A] [B] [C] │     │ [D] [E] │
│  (3 nodes)  │     │(2 nodes)│
│ HAS QUORUM  │     │NO QUORUM│
└─────────────┘     └─────────┘

CP SYSTEM BEHAVIOR (e.g., ZooKeeper with quorum=3):
┌───────────────────────────────────────────────────────┐
│ Partition A (nodes A, B, C):                          │
│   ✓ Can accept writes (has quorum of 3)               │
│   ✓ Can serve consistent reads                        │
│                                                       │
│ Partition B (nodes D, E):                             │
│   ✗ Cannot accept writes (only 2, needs 3)            │
│   ✗ Rejects requests with "No quorum" error           │
│   ✗ System is UNAVAILABLE in this partition           │
└───────────────────────────────────────────────────────┘

AP SYSTEM BEHAVIOR (e.g., Cassandra with quorum=1):
┌───────────────────────────────────────────────────────┐
│ Partition A (nodes A, B, C):                          │
│   ✓ Accepts writes with local quorum                  │
│   ✓ Serves reads from local data                      │
│                                                       │
│ Partition B (nodes D, E):                             │
│   ✓ ALSO accepts writes!                              │
│   ✓ Serves reads from its (potentially stale) data    │
│                                                       │
│ PROBLEM: Two clients writing to the same key          │
│ can create conflicting versions:                      │
│   - Client 1 → Node A: SET x = "foo"                  │
│   - Client 2 → Node D: SET x = "bar"                  │
│   - When partition heals: x = "foo" or "bar"??        │
└───────────────────────────────────────────────────────┘

After Partition Heals:
┌─────────────────────────────┐
│   [A]  [B]  [C]  [D]  [E]   │
│                             │
│ CP: D and E catch up from   │
│     A, B, C. No conflicts.  │
│                             │
│ AP: Conflict resolution     │
│     needed. May use:        │
│     - Last-write-wins (LWW) │
│     - Vector clocks         │
│     - Application merge     │
└─────────────────────────────┘
```

During a network partition, you MUST choose: either some nodes cannot respond (lose availability), or nodes respond with potentially inconsistent data (lose consistency). There is no third option. This is not a bug in system design—it's a fundamental theorem about distributed systems.
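The quorum rule in this scenario reduces to a one-line majority check. A sketch, assuming a simple strict-majority policy:

```python
def has_quorum(live_nodes: int, cluster_size: int) -> bool:
    """A partition side may accept writes only if it holds a strict majority.

    Strict majority also prevents split-brain on an even split:
    2 of 4 is NOT a quorum, so neither half of a 2/2 split can write.
    """
    return live_nodes > cluster_size // 2

# The 5-node cluster above, split 3/2:
print(has_quorum(3, 5))  # True  -> this side keeps accepting writes in a CP system
print(has_quorum(2, 5))  # False -> this side rejects writes (unavailable, but consistent)
```

This is why CP systems usually run an odd number of nodes: an even cluster pays for an extra machine without raising the number of failures it can tolerate.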
While CAP availability is theoretical, practical high availability requires careful engineering across multiple dimensions:
Core Principles of High Availability:
Eliminate Single Points of Failure (SPOF)
Design for Fault Isolation
Automate Failure Detection and Recovery
Replication Strategies for Availability:
Active-Passive (Primary-Secondary):
Active-Active (Multi-Master):
Leader Election:
Active-active replication is powerful but comes with significant complexity. Systems like Amazon DynamoDB and Apache Cassandra use it effectively, but they require careful conflict resolution strategies (last-write-wins, vector clocks, CRDTs) and application-level awareness of eventual consistency. Don't use active-active unless you truly need it.
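As a taste of the simplest of those strategies, here is a minimal last-write-wins merge sketch (the types and names are ours, and real systems must also contend with clock skew, which this ignores):

```python
from dataclasses import dataclass


@dataclass
class VersionedValue:
    value: str
    timestamp: float  # wall-clock write time; assumes roughly synchronized clocks


def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Last-write-wins: when the partition heals, keep the later-timestamped version.

    Simple and deterministic, but the losing write is silently discarded.
    """
    return a if a.timestamp >= b.timestamp else b


# The conflicting writes from the partition scenario:
x_in_partition_a = VersionedValue("foo", timestamp=1000.0)
x_in_partition_b = VersionedValue("bar", timestamp=1002.5)
print(lww_merge(x_in_partition_a, x_in_partition_b).value)  # "bar"
```

The silent data loss is exactly why vector clocks and CRDTs exist: they detect or sidestep the conflict instead of quietly dropping one side of it.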
Within CAP-available systems, there's an additional trade-off that CAP doesn't capture: the relationship between availability and latency. This is described by the PACELC theorem:
PACELC Theorem:
If there is a Partition (P), how does the system trade off Availability and Consistency (A and C); Else (E), when the system is running normally in the absence of partitions, how does the system trade off Latency (L) and Consistency (C)?
In other words: if a partition occurs, choose between availability and consistency; otherwise, choose between latency and consistency.
This extends CAP to acknowledge that even without partitions, you're making trade-offs. A strongly consistent system incurs latency costs even when all nodes are reachable.
| System | Under Partition (P) | Else/Normal (E) | Classification |
|---|---|---|---|
| DynamoDB | Availability | Latency | PA/EL |
| Cassandra | Availability | Latency | PA/EL |
| MongoDB | Availability (default) | Latency | PA/EL |
| Spanner | Consistency | Consistency | PC/EC |
| ZooKeeper | Consistency | Consistency | PC/EC |
| CockroachDB | Consistency | Consistency | PC/EC |
| RethinkDB | Consistency | Latency | PC/EL |
| VoltDB | Consistency | Latency | PC/EL |
The Latency Dimension:
Consider a write operation to a distributed database:
For Availability + Low Latency (PA/EL):
For Availability + Consistency (PA/EC):
For Consistency + Low Latency (PC/EL):
For Consistency + Consistency (PC/EC):
```python
import asyncio


class ConsistencyError(Exception):
    """Raised when a write cannot reach the required quorum."""


class DatabaseWriteStrategies:
    """
    Demonstrates the latency-consistency trade-off in write operations.

    Assumes self.local_node, self.any_remote_node, self.remote_nodes,
    self.all_nodes, and self.quorum_size are configured elsewhere.
    """

    def write_low_latency(self, data):
        """
        PA/EL Strategy: Prioritize latency over consistency.
        Write locally and return immediately.
        """
        # Write to local node only
        self.local_node.write(data)

        # Acknowledge to client immediately
        # Latency: ~1-5ms (local write only)
        return {"status": "acknowledged", "consistency": "eventual"}

        # Background replication happens asynchronously;
        # other nodes receive the update milliseconds to seconds later

    async def write_balanced(self, data):
        """
        Middle ground: Write to local and one remote before acknowledging.
        Provides better durability without full consistency.
        """
        # Write to local node
        self.local_node.write(data)

        # Wait for at least one remote node to acknowledge
        # Latency: ~10-100ms (depends on network distance)
        await self.any_remote_node.write(data)

        return {"status": "acknowledged", "consistency": "semi-sync"}

    async def write_strong_consistency(self, data):
        """
        PC/EC Strategy: Prioritize consistency over latency.
        Wait for a quorum of nodes before acknowledging.
        """
        # Write to local node
        self.local_node.write(data)

        # Wait for a majority of nodes to acknowledge
        # Latency: ~50-500ms (depends on slowest quorum node)
        acks = 1
        for node in self.remote_nodes:
            try:
                await asyncio.wait_for(node.write(data), timeout=1.0)
                acks += 1
                if acks >= self.quorum_size:
                    break
            except asyncio.TimeoutError:
                continue

        if acks >= self.quorum_size:
            return {"status": "committed", "consistency": "strong"}
        else:
            # Not enough acknowledgments - must roll back or retry
            raise ConsistencyError("Quorum not reached")

    async def write_ultimate_consistency(self, data):
        """
        PC/EC with synchronous replication to ALL nodes.
        Maximum latency, maximum consistency.
        """
        # Must successfully write to ALL nodes
        # Latency: determined by the slowest node
        # Availability: if ANY node fails, the write fails
        for node in self.all_nodes:
            await node.write(data)  # Fails entire write if any node unavailable

        return {"status": "committed", "consistency": "linearizable"}


# Typical latencies for a 5-node cluster spanning 2 datacenters:
#
# Strategy             | Latency   | Durability | Consistency
# ---------------------|-----------|------------|-------------
# Local only           | 1-5ms     | Low        | Eventual
# Local + 1 remote     | 50-100ms  | Medium     | Semi
# Majority quorum (3)  | 100-200ms | High       | Strong
# All nodes sync       | 200-500ms | Highest    | Linearizable
```

PACELC is more practical than CAP because partitions are relatively rare, but latency is constant. Most of the time, your system operates in the 'E' clause. Understanding this trade-off helps you choose the right database for your latency and consistency requirements during normal operation, not just during failures.
Understanding availability also means recognizing common mistakes that silently destroy it:
1. Hidden Single Points of Failure:
You might have redundant databases, but the single load balancer in front of them, the DNS entry pointing at them, or one shared configuration service can still take the whole path down.
All redundancy is useless if a single hidden component can take everything down.
2. Correlated Failures:
Redundancy only helps if failures are independent. Watch for replicas that share a rack, power feed, software version, or deployment pipeline: a fault in the shared element takes out every copy at once.
3. The Thundering Herd Problem:
When a failed component comes back online, every waiting client reconnects and retries simultaneously, and the resulting surge can overwhelm it all over again.
Solution: Jittered backoff, gradual connection draining, and capacity buffers.
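A common form of jittered backoff is "full jitter": draw the delay uniformly between zero and a capped exponential, so retries spread out instead of arriving in synchronized waves. A sketch (parameter values are illustrative):

```python
import random


def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' exponential backoff.

    Delay is drawn uniformly from [0, min(cap, base * 2^attempt)], so no two
    clients are likely to retry at the same instant.
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# Each client sleeps a different, growing-but-capped amount between retries:
for attempt in range(5):
    print(f"attempt {attempt}: sleep {jittered_backoff(attempt):.2f}s")
```

Without the jitter, every client that failed at the same moment would also retry at the same moment, recreating the herd on every backoff cycle.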
4. Cascading Failures:
When one component slows down, its callers block waiting on it, their request queues fill, and the backpressure spreads upstream until services exhaust their own threads and connections.
Solution: Timeouts, circuit breakers, and bulkheads prevent one slow component from taking down everything.
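A minimal circuit breaker sketch shows the idea (thresholds and timeouts are illustrative; production libraries add richer half-open probing, metrics, and per-endpoint state):

```python
import time


class CircuitBreaker:
    """After `threshold` consecutive failures, the circuit opens and calls
    fail fast instead of piling onto a struggling dependency. After
    `reset_timeout` seconds, one trial call is allowed through (half-open)."""

    def __init__(self, threshold: int = 3, reset_timeout: float = 30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Failing fast converts an unbounded wait on a sick dependency into an immediate, cheap error, which is exactly what stops the cascade.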
Most availability disasters happen because engineers assumed something would work that was never tested. Kill random services in production. Simulate network partitions. Inject latency. The only way to know your system is highly available is to prove it by breaking things in controlled ways.
We've explored the 'A' in the CAP theorem comprehensively. The key insights: CAP availability means every request to a non-failing node must receive a response; practical availability is measured in 'nines', each one exponentially harder to reach than the last; during a partition, a system must either respond with potentially inconsistent data or refuse to respond; and PACELC extends the trade-off to latency versus consistency during normal operation.
What's Next:
We've explored Consistency (C) and Availability (A). The next page examines Partition Tolerance (P)—the property that makes CAP interesting. You'll learn why almost every distributed system must be partition tolerant, making the real CAP choice between C and A.
You now understand availability in the context of distributed systems—its formal definition, practical measurement, the PACELC extension, and engineering techniques for high availability. This prepares you to understand why partition tolerance forces the CAP trade-off.