In the digital economy, downtime is measured not in hours but in consequences. When a major cloud provider experiences an outage, stock prices fall, revenue vanishes, and headlines proclaim technological failure. When a payment system goes offline, transactions stop, businesses halt, and trust erodes. The systems we build are expected to be always on—not because it's a nice-to-have, but because modern society increasingly depends on software infrastructure.
Yet systems are composed of components that fail. Servers crash. Networks partition. Disks corrupt. Software contains bugs. Power supplies die. Data centers flood. The question is not whether failures will occur, but how the system behaves when they do.
This is the domain of reliability and availability engineering—the discipline of building systems that continue to function correctly even when individual components malfunction. It's the difference between a system that fails gracefully, perhaps with degraded functionality, and one that collapses entirely, taking user data or business operations with it.
This page explores the principles, patterns, and practices that transform fragile systems into resilient ones—the engineering foundations that allow systems to serve millions while components continuously fail beneath them.
By the end of this page, you will understand the precise definitions of reliability and availability, how to quantify and measure them, the architectural patterns that enable fault tolerance, and the operational practices that transform brittle systems into dependable infrastructure.
Before engineering for reliability and availability, we must define these terms precisely. They are often used interchangeably, but they measure different aspects of system behavior.
Availability measures the proportion of time a system is operational and accessible:
Availability = Uptime / (Uptime + Downtime)
Availability is typically expressed as a percentage, often referred to by the number of 'nines':
The Nines of Availability:
| Availability | Nines | Downtime/Year | Downtime/Month | Downtime/Day |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.2 hours | 14.4 minutes |
| 99.9% | Three nines | 8.76 hours | 43.2 minutes | 1.44 minutes |
| 99.99% | Four nines | 52.6 minutes | 4.32 minutes | 8.64 seconds |
| 99.999% | Five nines | 5.26 minutes | 25.9 seconds | 0.86 seconds |
| 99.9999% | Six nines | 31.5 seconds | 2.59 seconds | 0.0864 seconds |
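The downtime figures in the table follow directly from the formula. Here is a small Python sketch that converts an availability percentage into allowed downtime, using a 365-day year and a 30-day month as the table does:

```python
# Convert an availability target into allowed downtime per period.
SECONDS_PER_DAY = 24 * 60 * 60

def allowed_downtime(availability_pct: float) -> dict:
    """Return allowed downtime in seconds for a year, a 30-day month, and a day."""
    unavailable = 1 - availability_pct / 100
    return {
        "per_year": 365 * SECONDS_PER_DAY * unavailable,
        "per_month": 30 * SECONDS_PER_DAY * unavailable,
        "per_day": SECONDS_PER_DAY * unavailable,
    }

# Three nines: roughly 8.76 hours/year, 43.2 minutes/month, 1.44 minutes/day.
for period, seconds in allowed_downtime(99.9).items():
    print(f"{period}: {seconds / 60:.1f} minutes")
```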
Reliability measures the probability that a system performs its intended function correctly over a given time period:
Reliability = P(no failure during time interval T)
While availability asks 'Is the system up?', reliability asks 'Is the system producing correct results?'
The Distinction Matters: A system that answers every request but returns stale or wrong data is available yet unreliable; a system that always computes correct results but is frequently unreachable is reliable yet unavailable. Meaningful targets constrain both.
Key Reliability Metrics: MTBF (Mean Time Between Failures), the average operating time between failures; MTTR (Mean Time To Recovery), the average time to detect a failure and restore service; and MTTF (Mean Time To Failure), the expected lifetime of a component that is replaced rather than repaired.
Relationship:
Availability ≈ MTBF / (MTBF + MTTR)
This reveals two paths to higher availability: increase MTBF (fail less often) or decrease MTTR (recover faster).
In most distributed systems, MTTR dominates availability. Increasing MTBF is hard (you can't prevent hardware from occasionally failing), but reducing MTTR is achievable through automation, redundancy, and fast detection. The best systems assume failure and focus on recovery speed.
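To see why recovery speed matters so much, here is a quick illustration of the formula with made-up numbers rather than figures from the text:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Fails roughly monthly (~720 h MTBF) but takes 1 hour to recover.
print(f"{availability(720, 1.0):.5f}")     # ~0.99861 (99.86%)
# Fails roughly weekly (~168 h MTBF) but recovers in 2 minutes.
print(f"{availability(168, 2 / 60):.5f}")  # ~0.99980 (99.98%)
```

Even though the second service fails more than four times as often, its fast recovery makes it noticeably more available, which is why mature operations invest so heavily in reducing MTTR.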
To design reliable systems, you must understand how systems fail. Failures occur across multiple dimensions and manifest in different ways.
By Scope: A failure may affect a single process, a server, a rack, an availability zone, or an entire region; the wider the scope, the harder it is to contain.
By Duration: Failures may be transient (resolving on their own), intermittent (recurring unpredictably), or permanent (persisting until repaired).
By Failure Mode:
Crash Failure: Component stops completely
Omission Failure: Component fails to respond to some requests
Timing Failure: Component responds too slowly
Response Failure: Component responds incorrectly
Byzantine Failure: Component behaves arbitrarily, possibly maliciously
| Failure Mode | Detectability | Example | Detection Strategy |
|---|---|---|---|
| Crash | High | Server process dies | Heartbeats, health checks |
| Omission | Medium | Network packet loss | Timeouts, retries |
| Timing | Medium | Overloaded service | Latency monitoring, SLOs |
| Response | Low | Data corruption | Checksums, validation |
| Byzantine | Very Low | Malicious node | Consensus protocols, voting |
The most dangerous failures are those that go undetected. A service returning stale or incorrect data while appearing healthy can cause data corruption that spreads through the system before anyone notices. Invest in detection and validation, not just redundancy.
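As one hedged illustration of detection over redundancy, the sketch below attaches a checksum when data is written and verifies it when data is read, so silent corruption surfaces as a loud error instead of spreading; the `StoredRecord` type and the example payload are invented for illustration:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class StoredRecord:
    payload: dict
    checksum: str  # hex SHA-256 of the canonical JSON payload

def _digest(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def store(payload: dict) -> StoredRecord:
    """Attach a checksum at write time."""
    return StoredRecord(payload=payload, checksum=_digest(payload))

def load(record: StoredRecord) -> dict:
    """Verify the checksum at read time; corruption becomes a loud failure."""
    if _digest(record.payload) != record.checksum:
        raise ValueError("checksum mismatch: stored data is corrupt")
    return record.payload

record = store({"user_id": 42, "balance": 100})
record.payload["balance"] = 999            # simulate silent corruption
try:
    load(record)
except ValueError as err:
    print(f"detected response failure: {err}")  # caught before bad data spreads
```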
The fundamental technique for achieving reliability is redundancy—having more than one of everything, so that when one fails, another takes over. But redundancy must be implemented thoughtfully to be effective.
1. Active-Active (Hot Redundancy):
Example: Multiple web servers behind a load balancer
Trade-offs: No failover delay and every instance serves traffic, but it requires stateless services or carefully synchronized state and costs the most to run.
2. Active-Passive (Warm Redundancy):
Example: Database primary with replica promoted on failure
Trade-offs: Simpler to operate than active-active, but failover takes time (detection plus promotion) and the standby capacity sits mostly idle.
3. Cold Redundancy:
Example: Backup data center activated after regional failure
Trade-offs: Lowest ongoing cost, but recovery takes hours or longer and any data written since the last backup may be lost.
Redundancy must be applied at every level of the stack:
Application Layer: Multiple stateless service instances behind load balancers, spread across machines.
Database Layer: Primary-replica or multi-primary replication with automated failover.
Network Layer: Redundant load balancers, network paths, and DNS.
Storage Layer: RAID, erasure coding, or replicated object storage.
Infrastructure Layer: Multiple availability zones and, for the highest tiers, multiple regions.
At minimum, have N+1 capacity for critical components: if you need N servers to handle peak load, run N+1 so that one failure doesn't reduce capacity below requirements. For higher reliability, N+2 or more provides tolerance for concurrent failures and maintenance windows.
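A rough way to reason about how much redundancy buys, assuming component failures are independent (a simplification, since correlated failures are common in practice):

```python
def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of N independent replicas where any one can serve traffic."""
    return 1 - (1 - component_availability) ** replicas

# A single 99% server vs. two or three of them behind a load balancer.
for n in (1, 2, 3):
    print(f"{n} replica(s): {redundant_availability(0.99, n):.6f}")
# 1 replica(s): 0.990000
# 2 replica(s): 0.999900
# 3 replica(s): 0.999999
```

The caveat matters: replicas that share a rack, a power feed, or a bad deployment do not fail independently, which is why redundancy is applied across zones and regions as well as across servers.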
Beyond redundancy, specific architectural patterns enable systems to tolerate faults gracefully.
Without timeouts, a single slow dependency can lock up your entire system:
The Problem: A call to a slow or hung dependency blocks the caller; as blocked calls pile up, thread pools and connections are exhausted and the outage spreads upstream.
The Solution: Put an explicit timeout on every network call so callers fail fast, release their resources, and can fall back or retry.
Best Practices: Base timeouts on observed tail latency (slightly above p99 is a common starting point), propagate deadlines across service boundaries, and treat repeated timeouts as a failure signal rather than silently retrying; a minimal sketch follows.
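A minimal sketch of an explicit timeout on an outbound call, using only the Python standard library; the URL and the 2-second budget are illustrative:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url: str, timeout_seconds: float = 2.0) -> bytes:
    """Call a dependency with a hard deadline instead of waiting indefinitely."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.read()
    except (socket.timeout, urllib.error.URLError) as err:
        # Fail fast: surface the failure so callers can fall back instead of hanging.
        raise RuntimeError(f"call to {url} failed or exceeded {timeout_seconds}s") from err
```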
Many failures are transient. Retrying can succeed if the first attempt failed due to temporary issues:
Retry Strategies: Immediate retry, fixed delay, exponential backoff, and exponential backoff with jitter, which is the usual default in distributed systems.
Cautions: Retries multiply load exactly when a dependency is already struggling (a retry storm), and retrying non-idempotent operations can duplicate side effects such as charges or emails.
Best Practice: Retry only idempotent operations, cap the number of attempts, add jitter to the backoff, and pair retries with timeouts and circuit breakers, as sketched below.
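A sketch of exponential backoff with full jitter, assuming the wrapped operation is idempotent; the attempt count and base delay are illustrative defaults, not recommendations from the text:

```python
import random
import time

def retry_with_backoff(operation, max_attempts: int = 4, base_delay: float = 0.2):
    """Retry an idempotent callable with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the caller (or a circuit breaker) decide
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients don't hammer the dependency in lockstep.
            cap = base_delay * (2 ** (attempt - 1))
            time.sleep(random.uniform(0, cap))

# Usage: retry_with_backoff(lambda: flaky_service.get_user(42))
```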
Circuit breakers prevent cascading failures by stopping calls to failing services:
States:
Closed (Normal): Requests flow through to the dependency while failures are counted; once failures cross a threshold, the breaker trips open.
Open: Requests fail immediately without touching the dependency, giving it time to recover; after a cooldown period the breaker moves to half-open.
Half-Open: A limited number of trial requests are let through; success closes the breaker, another failure reopens it.
Benefits: Failing dependencies get room to recover, callers fail fast instead of stacking up on timeouts, and failures stop cascading through the call graph; a compact implementation sketch follows.
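A compact sketch of the three states above; the failure threshold and cooldown are illustrative and would normally be tuned per dependency:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open on repeated failures -> half-open after cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast without calling dependency")
            # Cooldown elapsed: half-open, allow a trial request through.
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = time.monotonic()   # trip (or re-trip) the breaker
            raise
        else:
            self.failure_count = 0
            self.opened_at = None                   # success closes the breaker
            return result
```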
Bulkheads isolate failures to prevent them from spreading:
Examples: A separate thread pool or connection pool per downstream dependency, per-tenant rate limits and quotas, and isolating independent workloads on separate instances or clusters.
Principle: Design systems so that a failure in one area cannot consume all resources or propagate to other areas.
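One way to sketch a bulkhead in Python is a bounded semaphore per dependency, so a slow payment provider can exhaust only its own slots and never the capacity reserved for other features; the pool sizes and names are illustrative:

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency; excess calls are rejected immediately."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            # The compartment is full: reject rather than let this dependency
            # consume threads needed by unrelated features.
            raise RuntimeError(f"bulkhead '{self.name}' is at capacity")
        try:
            return operation()
        finally:
            self._slots.release()

# Separate compartments: a hang in payments cannot starve search.
payments_bulkhead = Bulkhead("payments", max_concurrent=10)
search_bulkhead = Bulkhead("search", max_concurrent=50)
```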
These patterns work together: timeouts prevent hanging, retries handle transient failures, circuit breakers prevent cascades, bulkheads isolate damage. Combine them thoughtfully—each has costs (complexity, configuration) and benefits (resilience).
When systems face failure or overload, graceful degradation means providing reduced functionality rather than complete failure. It's the difference between a car that slows down when the engine struggles and one that explodes.
1. Feature Fallbacks:
When a non-critical feature fails, disable it rather than failing the entire request:
Key Principle: Identify your critical path (what users absolutely need) versus optional enrichments (nice-to-have). Protect the critical path; degrade enrichments.
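A sketch of protecting the critical path: the product page still renders if a (hypothetical) recommendations service fails, it simply renders without recommendations:

```python
import logging

logger = logging.getLogger(__name__)

def render_product_page(product_id: int, product_service, recommendation_service) -> dict:
    """Core data is required; recommendations are an enrichment we can drop."""
    page = {"product": product_service.get(product_id)}  # critical path: let failures propagate
    try:
        page["recommendations"] = recommendation_service.for_product(product_id)
    except Exception:
        # Optional enrichment failed: log it, degrade, and keep serving the page.
        logger.warning("recommendations unavailable for product %s", product_id)
        page["recommendations"] = []
    return page
```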
2. Static Fallbacks:
When dynamic systems fail, serve static content:
Implementation: Keep a last-known-good copy of rendered content (at the CDN, the edge, or in the application) and serve it whenever the dynamic path fails, as in the sketch below.
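One possible shape for this, keeping a last-known-good copy in memory and serving it when the live render fails; in production the cached copy would more likely live at the CDN or in a shared store:

```python
_last_good_response: dict[str, str] = {}  # key -> last successfully generated content

def serve(key: str, render_live) -> str:
    """Prefer fresh content; fall back to the last good copy if rendering fails."""
    try:
        content = render_live(key)
        _last_good_response[key] = content   # refresh the fallback copy on every success
        return content
    except Exception:
        if key in _last_good_response:
            return _last_good_response[key]  # stale but useful beats an error page
        return "<html><body>Temporarily unavailable, please retry shortly.</body></html>"
```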
3. Read-Only Mode:
When write infrastructure fails, allow reads to continue:
Implementation: Detect write-path failure (or flip a flag during planned maintenance), reject or queue writes with a clear message to the user, and continue serving reads from replicas, as sketched below.
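A sketch of a read-only switch at the application layer; the flag would typically be driven by health checks on the write path or flipped by an operator, and the service and database names here are invented:

```python
class ReadOnlyModeError(Exception):
    """Raised when a write is attempted while the system is degraded to read-only."""

class AccountService:
    def __init__(self, primary_db, replica_db):
        self.primary_db = primary_db   # handles writes; may be down or in maintenance
        self.replica_db = replica_db   # read replicas keep serving during degradation
        self.read_only = False         # flipped by health checks or an operator switch

    def get_account(self, account_id: int):
        return self.replica_db.fetch(account_id)            # reads always allowed

    def update_account(self, account_id: int, changes: dict):
        if self.read_only:
            raise ReadOnlyModeError("writes are temporarily disabled; please retry later")
        return self.primary_db.update(account_id, changes)
```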
4. Progressive Load Shedding:
When overloaded, strategically reject requests to protect the system:
Shedding Hierarchy (prioritized): Shed background and batch traffic first, then optional features such as recommendations and prefetching, then low-priority interactive requests, preserving core user transactions for as long as possible.
Implementation: Measure saturation (queue depth, CPU, in-flight requests) at admission time and reject lower-priority requests once thresholds are crossed; a sketch follows.
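A minimal admission-control sketch: each request carries a priority, and as measured utilization climbs, only progressively more important tiers are admitted; the thresholds are illustrative:

```python
# Lower number = more important. Tiers mirror the shedding hierarchy above.
PRIORITY_CHECKOUT, PRIORITY_BROWSE, PRIORITY_BATCH = 0, 1, 2

def should_admit(priority: int, utilization: float) -> bool:
    """Decide at the front door whether to accept a request, given current load (0.0-1.0)."""
    if utilization < 0.70:
        return True                           # healthy: accept everything
    if utilization < 0.85:
        return priority <= PRIORITY_BROWSE    # shed batch/background work first
    if utilization < 0.95:
        return priority <= PRIORITY_CHECKOUT  # keep only core transactions
    return False                              # emergency brake: reject everything briefly

# At 88% utilization, checkout requests pass while browsing and batch are rejected with 503s.
print(should_admit(PRIORITY_CHECKOUT, 0.88), should_admit(PRIORITY_BATCH, 0.88))
```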
Graceful degradation doesn't happen accidentally—it must be designed and tested. For each feature, ask: 'What happens if this fails?' Define explicit fallback behavior, implement it, and test it regularly. Systems that have never degraded don't degrade gracefully—they crash.
You cannot ensure reliability without visibility. Monitoring and alerting are the eyes and ears of operations, enabling rapid detection and response to issues.
The Four Golden Signals (Google SRE):
Latency: Time to service a request
Traffic: Demand on the system
Errors: Rate of failed requests
Saturation: How full the system is
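As an illustration of what tracking these signals can look like inside a single process (a real deployment would export them to a metrics system rather than hold them in memory):

```python
import time

class GoldenSignals:
    """Tiny in-memory tracker for the four golden signals of one service."""

    def __init__(self):
        self.request_count = 0   # traffic
        self.error_count = 0     # errors
        self.latencies = []      # latency samples, in seconds
        self.in_flight = 0       # saturation proxy: concurrent requests

    def observe(self, handler):
        """Wrap one request, recording all four signals."""
        self.request_count += 1
        self.in_flight += 1
        started = time.monotonic()
        try:
            return handler()
        except Exception:
            self.error_count += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - started)
            self.in_flight -= 1

    def snapshot(self) -> dict:
        ordered = sorted(self.latencies)
        p99_index = max(0, int(len(ordered) * 0.99) - 1)
        return {
            "traffic": self.request_count,
            "error_rate": self.error_count / max(1, self.request_count),
            "p99_latency_s": ordered[p99_index] if ordered else None,
            "in_flight": self.in_flight,
        }
```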
SLOs define reliability targets in measurable terms:
Components: An SLI (the indicator you measure, such as success rate or request latency), a target (such as 99.9%), and a measurement window (such as a rolling 30 days).
Example SLOs: '99.9% of requests succeed, measured over a rolling 30 days' or '95% of requests complete within 300 ms'.
Error Budget:
The complement of the SLO (100% minus the target) gives your error budget—how much failure is acceptable:
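A worked example of turning an SLO into an error budget, using an illustrative 99.9% success-rate SLO over a 30-day window and ten million requests:

```python
def error_budget(slo_target: float, window_days: int, total_requests: int) -> dict:
    """How much failure an SLO leaves you, in time and in requests."""
    budget_fraction = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return {
        "allowed_downtime_minutes": window_days * 24 * 60 * budget_fraction,
        "allowed_failed_requests": int(total_requests * budget_fraction),
    }

# 99.9% over 30 days with 10M requests:
#   ~43.2 minutes of full downtime, or 10,000 failed requests, before the SLO is breached.
print(error_budget(0.999, 30, 10_000_000))
```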
Alert on SLO Breach, Not Component Status: Page when the user-facing SLO is at risk (for example, when the error budget is burning unusually fast), not every time a single instance behind redundancy dies.
Avoid Alert Fatigue: Every page should be urgent, actionable, and rare; noisy alerts train responders to ignore the one that matters.
Alert Hierarchy: Page a human for user-impacting issues, open a ticket for problems that can wait until working hours, and leave the rest to dashboards and logs.
Runbooks: Every alert should link to a runbook that explains what the alert means, how to diagnose it, and how to mitigate it.
Monitoring systems must be more reliable than the systems they monitor. If your monitoring goes down, you won't know when your production systems fail. Treat monitoring infrastructure with the same rigor as production: redundancy, alerting on monitoring health, and independent failure domains.
Beyond handling routine failures, systems must prepare for catastrophic events that affect entire infrastructure regions. Disaster recovery (DR) ensures business continuity when the worst happens.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss, expressed as time; an RPO of five minutes means losing, at most, the last five minutes of writes.
RTO (Recovery Time Objective): The maximum acceptable time between the disaster and the restoration of service.
| Strategy | Cost | RTO | RPO | Description |
|---|---|---|---|---|
| Backup & Restore | Low | Hours-Days | Hours | Regular backups; restore to new infrastructure on disaster |
| Pilot Light | Low-Med | Hours | Minutes | Core systems warm; scale up on disaster |
| Warm Standby | Medium | Minutes | Seconds | Scaled-down copy running; scale up and switch traffic |
| Multi-Site Active-Active | High | Seconds | Zero | Full redundancy; traffic shifts automatically |
Backup Types: Full backups capture everything; incremental and differential backups capture only changes and are typically layered on periodic full backups to balance storage cost against restore time.
Replication: Synchronous replication keeps a remote copy continuously consistent (near-zero RPO) at the cost of write latency; asynchronous replication is cheaper and faster but can lose the most recent writes.
The 3-2-1 Rule: Keep at least 3 copies of your data, on 2 different types of storage media, with 1 copy stored offsite.
This protects against single-location disasters (fire, flood), storage media failures, and software corruption.
A DR plan that hasn't been tested is a hope, not a plan:
Game Days: Scheduled exercises in which the team simulates a disaster (a region loss, a corrupted database) and works the recovery end to end using the real runbooks and tooling.
Chaos Engineering: Deliberately injecting failures (terminated instances, network partitions, dependency outages) within a controlled blast radius to verify that resilience mechanisms behave as designed.
Documentation: Runbooks, escalation paths, and contact lists that are kept current, rehearsed, and stored somewhere that remains reachable when the primary infrastructure is down.
Disaster recovery investment follows the insurance model: you pay continuously for something you hope never to use. The cost must be balanced against the business impact of potential disaster. Critical systems (financial, healthcare) justify high DR investment; less critical systems may accept higher RPO/RTO to reduce costs.
We've explored the principles and practices that keep systems running when components fail—the foundation of dependable infrastructure.
What's Next:
Reliability and availability have costs—infrastructure, engineering time, operational overhead. The next page explores cost optimization at scale: how to balance performance, reliability, and budget to build systems that are not only dependable but economically sustainable.
You now understand the engineering foundations of reliable and available systems—not just the patterns, but the philosophy that failure is normal and systems must be designed to handle it gracefully. Next, we'll explore how to achieve reliability and scale while optimizing costs.