At 2:47 AM on a Monday morning, a customer in Tokyo tries to complete a purchase on your e-commerce platform. Their payment goes through, but the confirmation page hangs indefinitely. The customer doesn't know if they were charged. They might try again, potentially paying twice. Or they might abandon the purchase entirely and never return.
This scenario illustrates why availability matters. In the CAP theorem, availability means that every request received by a non-failing node must result in a response—not just eventually, but within a reasonable time. The system cannot simply hang, time out, or refuse to answer. Availability is the promise that your system is always there when users need it.
By the end of this page, you will deeply understand what availability means in the CAP context, how it differs from the colloquial notion of 'uptime,' the strategies for achieving high availability, the metrics and SLAs that quantify availability, and why the tension between availability and consistency is the central challenge of distributed systems.
The CAP theorem defines availability precisely, and this precision matters.
CAP Availability (Formal Definition):
Every request received by a non-failing node in the system must result in a response.
Let's unpack this definition carefully:
"Every request": No exceptions. The system cannot reject requests or refuse to answer. Every valid request must receive a response.
"Received by a non-failing node": This qualification is crucial. A crashed node is offline—we don't expect responses from it. Availability applies to nodes that are operational and reachable.
"Must result in a response": The response doesn't have to be "success." It can be an error, a partial result, or stale data. But there must be a response. Hanging indefinitely violates availability.
What CAP availability notably does NOT guarantee:
CAP availability is a theoretical property focused on liveness—the system makes progress. Practical availability (what we measure with '99.99% uptime') is about successful, timely responses that serve user needs. A system can be CAP-available while returning stale data or errors—something that would hurt practical availability metrics.
| Aspect | CAP Availability | Practical Availability |
|---|---|---|
| Core requirement | Response from non-failing nodes | Successful responses within SLA |
| Response content | Any response (even stale/error) | Correct, useful responses |
| Timing constraint | Eventually (no hard bound) | Within specified latency (e.g., 200ms) |
| Scope | Individual node behavior | System-wide aggregate |
| Measurement | Binary (available or not) | Percentage (99.9%, 99.99%) |
Why the distinction matters:
When designing systems, you must be clear about which definition you're targeting. A system can achieve CAP availability by always returning cached (potentially stale) data, but this might violate your business requirements for data freshness. Conversely, a system that blocks until fresh data is available might provide better practical availability for correctness-dependent use cases, even if it technically violates CAP availability during partitions.
Understanding this distinction prevents a common mistake: believing that because you've achieved one form of availability, you've achieved the other.
In industry, availability is typically expressed as a percentage—the famous "nines" that appear in every SLA discussion. Understanding these numbers and their implications is essential for system design.
The Nines Table:
Each additional nine represents a 10x reduction in permitted downtime—and typically a 10x increase in cost and complexity to achieve.
| Availability | Downtime/Year | Downtime/Month | Downtime/Day | Example Use Case |
|---|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 minutes | Internal tools, dev environments |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 1.44 minutes | Standard business applications |
| 99.95% | 4.38 hours | 21.9 minutes | 43.2 seconds | Important business applications |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 8.64 seconds | E-commerce, financial services |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 864 milliseconds | Telecom, emergency services |
| 99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 86.4 milliseconds | Air traffic control, nuclear systems |
Moving from 99% to 99.9% is relatively straightforward—add redundancy, monitor better, automate failover. Moving from 99.99% to 99.999% requires fundamentally different architectures, extensive testing, and often accepting trade-offs in other areas. Each nine costs exponentially more than the last.
How to Calculate Availability:
Basic Formula:
Availability = (Total Time - Downtime) / Total Time × 100%
With MTBF and MTTR:
Availability = MTBF / (MTBF + MTTR)
Where:
- MTBF (Mean Time Between Failures): the average operating time between failures
- MTTR (Mean Time To Repair): the average time to detect and recover from a failure
Example: If MTBF = 720 hours (30 days) and MTTR = 1 hour: Availability = 720 / (720 + 1) ≈ 99.86%
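The MTBF/MTTR formula is easy to explore in code. A minimal sketch (function name is illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# The example above: 30 days between failures, 1 hour to recover.
print(f"{availability(720, 1):.2%}")    # 99.86%

# Note that halving MTTR gains exactly as much as doubling MTBF:
print(f"{availability(720, 0.5):.2%}")  # 99.93%
print(f"{availability(1440, 1):.2%}")   # 99.93%
```

The last two lines print the same number, which illustrates why fast recovery can substitute for failure prevention.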
This formula reveals two paths to higher availability: increase MTBF (fail less often) or decrease MTTR (recover faster).
In practice, reducing MTTR often provides better ROI than increasing MTBF, because preventing all failures is impossible, but fast recovery is achievable.
```
SERIAL COMPONENTS (all must work):
┌─────────────────────────────────────────────────────┐
│  LB (99.99%) → App (99.9%) → DB (99.95%)            │
│                                                     │
│  Total = 0.9999 × 0.999 × 0.9995 = 99.84%           │
│  (less than any individual component!)              │
└─────────────────────────────────────────────────────┘

PARALLEL COMPONENTS (any one can work):
┌─────────────────────────────────────────────────────┐
│         ┌── App Server 1 (99.9%) ──┐                │
│  LB ────┤                          ├── DB           │
│         └── App Server 2 (99.9%) ──┘                │
│                                                     │
│  App layer = 1 - (0.001 × 0.001) = 99.9999%         │
│  (both must fail for layer to fail)                 │
└─────────────────────────────────────────────────────┘

KEY INSIGHT: Serial composition decreases availability.
Parallel composition increases it. This is why every
reliable system has redundancy at every layer.
```
Understanding what breaks availability is as important as knowing what enables it. These anti-patterns appear repeatedly in production incidents.
Anti-Pattern 1: Single Points of Failure (SPOF)
A SPOF is any component whose failure brings down the entire system. Common SPOFs include a single database server, a lone load balancer, one DNS provider, a shared message broker, and a hard dependency on a single external API.
Anti-Pattern 2: Synchronous Dependencies
When Service A calls Service B synchronously, A's availability is bounded by B's:
Availability(A) ≤ Availability(A_standalone) × Availability(B)
With 5 serial dependencies at 99.9% each:
System Availability = 0.999^5 = 99.5%
That's 43 hours of downtime per year—from services that are individually 99.9%!
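The serial and parallel composition rules from the diagrams above can be captured in two small helpers (a sketch; function names are illustrative):

```python
from math import prod

def serial(*avail: float) -> float:
    """All components must work: availabilities multiply."""
    return prod(avail)

def parallel(*avail: float) -> float:
    """Any one component suffices: the layer fails only if all fail."""
    return 1 - prod(1 - a for a in avail)

# Five serial 99.9% dependencies, as in the text:
print(f"{serial(*[0.999] * 5):.3%}")    # 99.501%

# Two redundant 99.9% app servers:
print(f"{parallel(0.999, 0.999):.4%}")  # 99.9999%
```

Running the numbers this way makes the asymmetry obvious: chaining services erodes availability multiplicatively, while redundancy multiplies the (tiny) failure probabilities instead.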
Anti-Pattern 3: The Thundering Herd
After an outage, all clients retry simultaneously. This surge overwhelms the recovering system, causing another failure. The system oscillates between up and down states.
Solution: Implement exponential backoff with jitter. Each client waits a random amount before retrying, spreading the load over time.
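The backoff-with-jitter solution can be sketched as a retry helper using "full jitter"—each delay is drawn uniformly from zero up to an exponentially growing cap (names and parameters are illustrative):

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base=0.1, cap=10.0):
    """Retry with full jitter: sleep uniform(0, min(cap, base * 2^attempt)).

    Randomizing the delay spreads retries out in time, so a
    recovering server is not hit by every client at once.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

A caller would wrap any flaky network call, e.g. `retry_with_jitter(lambda: fetch_order(order_id))`, where `fetch_order` is whatever operation might fail transiently.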
Anti-Pattern 4: Correlated Failures
Two "independent" systems fail together because they share a hidden dependency:
Solution: Audit dependencies rigorously. True independence is harder than it looks.
Adding redundancy doesn't automatically improve availability. Poorly implemented redundancy can reduce availability by adding complexity, increasing failure modes, and creating coordination overhead. Every redundant component needs health checks, failover logic, and careful testing. Without these, you're adding liability, not reliability.
High availability isn't a feature you add—it's an architectural property you design for from the beginning. These strategies form the foundation of available systems.
Strategy 1: Redundancy at Every Layer
Every component should have at least one backup ready to take over: redundant load balancers, multiple application servers, database replicas, and ideally multiple data centers or regions.
Strategy 2: Health Checks and Automatic Failover
Redundancy is worthless without detection and action: continuous health checks to detect failed nodes, and automatic failover to redirect traffic to healthy replicas.
Failover must be automatic and fast. Manual intervention that takes 15 minutes negates the value of 99.99% availability (which allows only 4.38 minutes of downtime per month).
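A minimal sketch of the detect-and-redirect loop, with the probe made injectable so the routing logic can be tested without real servers (all names and the `/health` endpoint are illustrative):

```python
import urllib.request

def healthy(url: str, timeout: float = 2.0) -> bool:
    """A node counts as healthy only if it answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, DNS failure, HTTP error
        return False

def pick_backend(backends, probe=healthy):
    """Route to the first healthy backend; None signals a total outage."""
    for url in backends:
        if probe(url + "/health"):
            return url
    return None
```

In a real system the probe runs continuously in the background and failover happens without a human in the loop; this sketch only shows the decision itself.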
Strategy 3: Graceful Degradation
Not all features are equally critical. Design your system to shed non-essential functionality under stress:
Example: E-commerce during Black Friday
| Load Level | Features Enabled |
|---|---|
| Normal | Everything: recommendations, reviews, wishlist, analytics |
| High | Core + recommendations (disable wishlist, analytics) |
| Critical | Core only: browse, cart, checkout |
| Extreme | Static pages, deferred checkout |
This approach maximizes availability of the most important functionality.
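The load-level table above maps naturally onto a feature-flag check. A sketch, assuming a process somewhere updates the current load level (feature and level names mirror the table):

```python
FEATURES_BY_LEVEL = {
    "normal":   {"browse", "cart", "checkout",
                 "recommendations", "reviews", "wishlist", "analytics"},
    "high":     {"browse", "cart", "checkout", "recommendations"},
    "critical": {"browse", "cart", "checkout"},
    "extreme":  {"browse"},  # static pages only; checkout deferred to a queue
}

def enabled(feature: str, load_level: str) -> bool:
    """Check a feature against the current load level."""
    return feature in FEATURES_BY_LEVEL[load_level]

assert enabled("wishlist", "normal")
assert not enabled("wishlist", "high")   # shed first under load
assert enabled("checkout", "critical")   # core commerce survives
```

The point of encoding the levels as data is that shedding becomes a one-line state change rather than an emergency deploy.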
High availability isn't proven until it's tested. Chaos engineering—deliberately injecting failures in production—validates that your redundancy and failover actually work. Netflix's Chaos Monkey randomly kills instances. AWS tests its services by simulating region-wide outages. If you haven't tested your failover, you don't have failover—you have hope.
Let's examine how major systems achieve high availability and what trade-offs they accept.
Case Study 1: AWS S3 (11 nines of durability)
S3 famously promises 99.999999999% durability. This means if you store 10 million objects, you might expect to lose one every 10,000 years. How? S3 redundantly stores every object across multiple devices in at least three Availability Zones, continuously verifies data integrity with checksums, and automatically repairs any corruption it detects from the redundant copies.
Note: S3's durability (11 nines) differs from its availability (99.99% for Standard, meaning ~52 minutes downtime/year). Durability means data isn't lost. Availability means you can access it.
Case Study 2: Google Spanner (External Consistency + High Availability)
Spanner achieves 99.999% availability while maintaining strong consistency. The secret: Google's private global network, synchronous Paxos replication across datacenters, and the TrueTime API, which uses GPS receivers and atomic clocks to tightly bound clock uncertainty.
This demonstrates that high consistency and high availability can coexist—if you're willing to pay the engineering cost.
| System | Availability Target | Key Strategy | Trade-off Accepted |
|---|---|---|---|
| AWS S3 | 99.99% | Multi-AZ replication, automatic failover | Higher latency for strong durability |
| Google Search | 99.999%+ | Massive redundancy, graceful degradation | Slightly stale results acceptable |
| Netflix | 99.99% | Multi-region active-active, chaos tested | Eventual consistency in recommendations |
| Stripe | 99.99% | Multi-cloud, idempotent APIs | Higher complexity, higher cost |
| Discord | 99.95% | Erlang/Elixir fault tolerance, hot code swap | Learning curve for unusual tech stack |
| Amazon DynamoDB | 99.999% | Multi-AZ with automatic failover | Limited query flexibility for consistency |
Case Study 3: The 2017 S3 Outage
On February 28, 2017, a single typo in an S3 maintenance command took down a significant portion of the internet for several hours. Services affected included Slack, Trello, Quora, and parts of AWS itself.
Lessons learned: even routine operational commands need guardrails; subsystems that haven't been fully restarted in years recover slowly; and status monitoring that depends on the failing system (the AWS status dashboard itself relied on S3) goes dark exactly when you need it.
Changes Amazon made: the capacity-removal tool was changed to remove servers more slowly and to refuse to take any subsystem below its minimum required capacity, recovery of the affected subsystems was made faster, and the status dashboard was decoupled from its dependency on S3.
AWS, Google, and other cloud providers experience outages despite billions of dollars in infrastructure. The lesson: design your systems to survive your dependency's failures. Multi-region, multi-cloud, and graceful degradation aren't paranoia—they're prudent engineering.
We've explored consistency and availability separately. Now we confront the heart of the CAP theorem: under network partitions, you cannot have both.
The Partition Scenario:
Imagine a two-node distributed database. A network partition occurs—the nodes can't communicate. A client sends a write to Node A. What happens?
If you choose Consistency: Node A refuses the write (or blocks) because it cannot coordinate with Node B. The data stays consistent, but the client gets an error or a timeout—the system is unavailable.
If you choose Availability: Node A accepts the write locally and responds with success. The client is served, but Node B knows nothing about the write—the replicas are now inconsistent.
This isn't a design flaw—it's a fundamental impossibility, conjectured by Brewer and formally proven by Gilbert and Lynch.
```
SCENARIO: Network partition between two data centers

   ┌─────────────┐    PARTITION    ┌─────────────┐
   │   Node A    │ ─ ─ ─ ✕ ─ ─ ─ ─ │   Node B    │
   │   (DC 1)    │                 │   (DC 2)    │
   └─────────────┘                 └─────────────┘
         │                               │
     Client X                        Client Y
   write(x=1)                      write(x=2)

OPTION 1: CHOOSE CONSISTENCY (CP)
├── Node A: "I can't reach Node B. I must refuse this write."
├── Node B: "I can't reach Node A. I must refuse this write."
├── Both clients get errors or timeouts
└── Result: CONSISTENT but UNAVAILABLE

OPTION 2: CHOOSE AVAILABILITY (AP)
├── Node A: "I'll accept write x=1 from Client X"
├── Node B: "I'll accept write x=2 from Client Y"
├── Both clients get success
└── After partition heals: x=1 or x=2? CONFLICT!
    Result: AVAILABLE but INCONSISTENT

THERE IS NO OPTION 3.
```
Why can't we have both?
The proof is elegant: assume a partition separates the nodes. One client writes a new value on one side, then another client reads on the other side. No message can cross the partition, so the node serving the read must either wait indefinitely for the partition to heal—violating availability—or answer with the old value—violating consistency. No protocol can escape this choice.
The key insight:
The CAP theorem doesn't say you can never have consistency and availability. It says you can't have both during a partition. In the absence of partitions, you can have both. This is why the PACELC theorem (which we'll cover later) extends CAP to describe behavior when partitions aren't occurring.
Some argue: 'Our network is reliable, so we don't need to worry about partitions.' This is dangerously wrong. Network partitions happen. Switches fail. Cables get cut. Misconfigurations create logical partitions. Cloud providers have region-wide network issues. Design for partitions, or your system will fail when they inevitably occur.
Given that we must make trade-offs, how do we architect systems that maximize availability while managing consistency challenges?
Pattern 1: Eventual Consistency with Conflict Resolution
Accept that replicas may temporarily diverge, but implement mechanisms to detect and resolve conflicts: last-writer-wins timestamps, version vectors that detect concurrent updates, CRDTs that merge automatically, or application-level merge logic (such as taking the union of two shopping carts).
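The simplest of these strategies, last-writer-wins, fits in a few lines. A sketch (types and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: float  # wall-clock write time; node_id breaks exact ties
    node_id: str

def last_writer_wins(a: Versioned, b: Versioned) -> Versioned:
    """Resolve a conflict by keeping the most recent write.

    Deterministic and order-independent, so both replicas converge
    on the same value after the partition heals -- but the losing
    write is silently discarded. Acceptable for profile fields;
    dangerous for counters or carts.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))

# The x=1 / x=2 conflict from the partition scenario:
x_on_a = Versioned("x=1", timestamp=100.0, node_id="A")
x_on_b = Versioned("x=2", timestamp=101.5, node_id="B")
assert last_writer_wins(x_on_a, x_on_b).value == "x=2"
assert last_writer_wins(x_on_b, x_on_a).value == "x=2"  # same either way
```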
Pattern 2: Read-Your-Writes Guarantee
Ensure clients always see their own writes, even if other clients see stale data: pin a client's reads to the replica that took its write, or have the client carry a version token and serve its reads only from replicas that have caught up to that version.
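The version-token variant can be sketched with toy replicas (all class and field names are illustrative, not a real client library):

```python
class Replica:
    """Toy replica: a monotonically increasing version plus a key-value map."""
    def __init__(self):
        self.version = 0
        self.data = {}

    def put(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

    def get(self, key):
        return self.data.get(key)


class SessionStore:
    """Read-your-writes: the session remembers the version of its last
    write and reads only from replicas that have caught up to it."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, session, key, value):
        session["min_version"] = self.primary.put(key, value)

    def read(self, session, key):
        needed = session.get("min_version", 0)
        for replica in self.replicas:      # prefer cheap replica reads...
            if replica.version >= needed:  # ...but only if they've caught up
                return replica.get(key)
        return self.primary.get(key)       # fall back to the primary
```

A client that just wrote falls back to the primary until replication catches up, while sessions with no recent writes happily read (possibly stale) replica data.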
Pattern 3: Async Replication with Sync Fallback
Normally use async replication for speed, but allow callers to request sync replication for critical operations:
POST /orders
X-Consistency: strong // Use sync replication for this request
This lets you tune consistency per-operation rather than per-system.
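A sketch of how a handler might honor that hint (the `X-Consistency` header comes from the example above; `FakeDB` and the `wait_for_replicas` flag are illustrative stand-ins, not a real API):

```python
class FakeDB:
    """Stand-in database that records how each write was requested."""
    def __init__(self):
        self.calls = []

    def write(self, record, wait_for_replicas):
        self.calls.append((record, wait_for_replicas))


def create_order(order, headers, db):
    """Choose the replication mode per request, defaulting to async."""
    sync = headers.get("X-Consistency") == "strong"
    db.write(order, wait_for_replicas=sync)  # sync: slower but durable everywhere
    return {"status": "accepted", "durable_everywhere": sync}
```

Defaulting to async keeps the common path fast; only the callers that truly need strong guarantees pay the replication latency.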
Pattern 4: The Outbox Pattern for Cross-Service Reliability
When you need to update a database AND publish an event, use the Outbox Pattern: write the business data and the event to an "outbox" table in the same local transaction; a separate relay process then reads unpublished events from the outbox, publishes them to the message broker, and marks them as sent.
This guarantees that if the database update succeeds, the event will eventually be published—achieving reliability without distributed transactions.
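A sketch of the pattern with SQLite standing in for the service database and a plain callback standing in for the broker (table and event names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id, total):
    """Write the business row AND the event in ONE local transaction."""
    with conn:  # commit both or neither -- no distributed transaction needed
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute("INSERT INTO outbox (event) VALUES (?)",
                     (f"order_created:{order_id}",))

def relay(publish):
    """Background relay: drain unpublished events to the broker."""
    rows = conn.execute(
        "SELECT id, event FROM outbox WHERE published = 0").fetchall()
    for row_id, event in rows:
        publish(event)  # at-least-once delivery: consumers must be idempotent
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
```

If the process crashes after the commit but before the relay runs, the event is still in the outbox and will be published on the next pass—which is exactly the "eventually published" guarantee the pattern provides.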
You cannot bolt availability onto a system later. Architectures that assume reliable networks, single databases, or synchronous processing cannot be made highly available without fundamental rewrite. Design for availability from day one, even if you don't implement all redundancy immediately.
We've deeply explored availability—the 'A' in CAP—understanding what it means, how to measure it, what breaks it, and how to achieve it. Let's consolidate the essential insights.
What's Next:
We've covered Consistency (C) and Availability (A). The next page explores Partition Tolerance—the 'P' in CAP. You'll learn why partition tolerance isn't really optional, which reframes CAP as a choice between CP and AP systems rather than a three-way trade-off.
You now understand availability in the context of the CAP theorem: what it means formally, how to measure it, common anti-patterns that break it, strategies to achieve it, and its fundamental tension with consistency. This knowledge is essential for making informed architectural decisions in distributed systems.