In distributed systems, failure is not exceptional—it is the norm. At scale, hardware fails constantly: disks, servers, network switches, power supplies, and even entire data centers experience outages. Software has bugs. Networks become congested or misconfigured. Human operators make mistakes.
Large distributed systems experience thousands of failures per day. Yet services like Google Search, Netflix, and Amazon remain available because they are designed with fault tolerance—the ability to continue operating correctly despite failures.
Fault tolerance is the second fundamental reason (after scalability) to build distributed systems. A properly designed distributed system can be more reliable than any of its individual components. This page examines how to achieve such resilience through redundancy, detection, and recovery mechanisms.
By the end of this page, you will understand the taxonomy of faults and failures, failure models from crash to Byzantine, redundancy strategies, failure detection mechanisms, recovery techniques, and the fundamental limits of fault tolerance. This knowledge is essential for building systems that users can rely on.
Precise terminology distinguishes related but distinct concepts:
Fault (Defect): The underlying cause of a problem. Examples: a bug in code, a faulty RAM chip, a misconfigured router. Faults may be dormant, producing no observable effect until activated.
Error (Incorrect State): An erroneous system state resulting from activation of a fault. The fault has manifested but may not yet impact the user. Example: a computation produced a wrong intermediate result due to a memory bit flip.
Failure (Service Deviation): The system deviates from its specified behavior in a way observable to users. The error has propagated to the user-visible level. Example: a web request returns incorrect data or times out.
The Progression:
Fault → (activation) → Error → (propagation) → Failure
Fault Tolerance Strategy:
Fault tolerance interrupts this chain: prevent faults where possible, detect errors and contain them before they propagate, and recover quickly so that errors never surface as user-visible failures.
Common categories of faults:
| Fault Category | Description | Examples |
|---|---|---|
| Hardware | Physical component malfunction | Disk failure, memory corruption, power loss |
| Software | Bugs in code or configuration | Race conditions, memory leaks, misconfiguration |
| Network | Communication infrastructure problems | Packet loss, latency spikes, partitions |
| Operational | Human errors in operation | Wrong deployment, configuration mistake, accidental deletion |
| Environmental | External factors affecting infrastructure | Power outage, cooling failure, natural disasters |
At sufficient scale, even 'rare' failures become routine. If a component has a 0.1% annual failure rate and you have 10,000 of them, you'll see roughly 10 failures per year—nearly one per month. At 100,000 components, it's 100 per year. For large services, multiple failures daily are normal. Fault tolerance isn't optional.
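A quick sketch of that arithmetic, using the 0.1% rate and fleet sizes from the paragraph above (the independence assumption is a simplification):

```typescript
// Expected annual failures for a fleet of identical components.
// annualFailureRate = 0.001 means a 0.1% chance a given component fails in a year.
function expectedAnnualFailures(componentCount: number, annualFailureRate: number): number {
  return componentCount * annualFailureRate;
}

// Probability that at least one component in the fleet fails during the year,
// assuming failures are independent.
function probabilityOfAnyFailure(componentCount: number, annualFailureRate: number): number {
  return 1 - Math.pow(1 - annualFailureRate, componentCount);
}

console.log(expectedAnnualFailures(10_000, 0.001));   // 10 failures per year
console.log(expectedAnnualFailures(100_000, 0.001));  // 100 failures per year
console.log(probabilityOfAnyFailure(10_000, 0.001));  // ~0.99995 (essentially certain)
```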
Failure models characterize how components can fail. Stronger failure models (more adversarial) require stronger (more expensive) fault tolerance mechanisms. Understanding the failure model your system assumes is critical for correct design.
Failure Model Hierarchy (weakest to strongest):
| Model | Redundancy Needed | How f Failures Are Tolerated | Common In |
|---|---|---|---|
| Crash-Stop | n = f + 1 replicas | Any f nodes can crash; one survivor suffices | Hardware failures |
| Crash-Recovery | Same, plus stable storage | Nodes recover with their persisted state | Database systems |
| Omission | n = 2f + 1 for detection | Lost messages masked via majority voting | Network issues |
| Byzantine | n = 3f + 1 replicas | Up to f (fewer than 1/3) can be arbitrarily faulty or adversarial | Security-critical, blockchain |
Practical Implications:
Most distributed systems assume crash-recovery or omission failure models: nodes may crash and later restart with their persisted state, and messages may be lost or delayed, but components are not assumed to behave maliciously.
Byzantine tolerance is required when components may lie or be compromised: blockchains and other open, decentralized networks, and security-critical systems where a node cannot be trusted simply because it responds.
Byzantine tolerance is expensive: 3f+1 nodes to tolerate f faulty nodes, with complex algorithms (PBFT, etc.). Most systems can't justify this overhead.
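As a minimal sketch of how those replica counts scale, using the n = f + 1, 2f + 1, and 3f + 1 formulas from the table above:

```typescript
// Minimum cluster sizes to tolerate f simultaneous faults under different failure models.
type FailureModel = 'crash-stop' | 'omission' | 'byzantine';

function minimumReplicas(model: FailureModel, f: number): number {
  switch (model) {
    case 'crash-stop': return f + 1;      // just need one surviving copy
    case 'omission':   return 2 * f + 1;  // majority voting masks missing messages
    case 'byzantine':  return 3 * f + 1;  // outvote up to f arbitrarily faulty replicas
  }
}

console.log(minimumReplicas('crash-stop', 2)); // 3
console.log(minimumReplicas('omission', 2));   // 5
console.log(minimumReplicas('byzantine', 2));  // 7
```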
Real systems experience 'gray failures'—partial failures that are harder to detect than clean crashes. A server might respond slowly, or serve only some requests correctly. A disk might usually read correctly but occasionally return corrupted data. These subtle failures fall between the clean failure models above and are notoriously difficult to detect and handle.
Redundancy is the foundation of fault tolerance. By having multiple copies of components, the system can continue operating when some copies fail. Different redundancy strategies offer different tradeoffs.
Physical Redundancy Configurations:
Active Replication (State Machine Replication): every replica processes every request in the same order; because all replicas are always up to date, any of them can answer and failover is effectively instant.
Passive Replication (Primary-Backup): a single primary processes requests and ships state updates to standby backups; when the primary fails, a backup is promoted, which costs detection and promotion time.
Quorum-Based Replication: each operation succeeds once a quorum (typically a majority) of replicas acknowledges it, so the system keeps operating as long as a quorum remains available (see the sketch below).
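A minimal sketch of the quorum idea, assuming a hypothetical `Replica` interface; the key property is that with N replicas, writes to W and reads from R always overlap when R + W > N:

```typescript
// Quorum replication sketch: a write succeeds once W of N replicas acknowledge it;
// a read consults R replicas and takes the value with the highest version.
// For simplicity this waits for all replies rather than returning at quorum.
interface Replica {
  write(key: string, value: string, version: number): Promise<void>;
  read(key: string): Promise<{ value: string; version: number } | null>;
}

class QuorumStore {
  constructor(
    private readonly replicas: Replica[],
    private readonly writeQuorum: number, // W
    private readonly readQuorum: number   // R
  ) {}

  async write(key: string, value: string, version: number): Promise<void> {
    const results = await Promise.allSettled(
      this.replicas.map(r => r.write(key, value, version))
    );
    const acks = results.filter(r => r.status === 'fulfilled').length;
    if (acks < this.writeQuorum) {
      throw new Error(`Write failed: only ${acks} of ${this.writeQuorum} required acks`);
    }
  }

  async read(key: string): Promise<string | null> {
    const settled = await Promise.allSettled(this.replicas.map(r => r.read(key)));
    if (settled.filter(r => r.status === 'fulfilled').length < this.readQuorum) {
      throw new Error('Read failed: quorum not reached');
    }
    const responses = settled.flatMap(r =>
      r.status === 'fulfilled' && r.value !== null ? [r.value] : []
    );
    // Newest version wins; a read quorum overlaps every write quorum when R + W > N.
    responses.sort((a, b) => b.version - a.version);
    return responses[0]?.value ?? null;
  }
}
```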
| Strategy | Failover Time | Resource Cost | Complexity | Best For |
|---|---|---|---|---|
| Active Replication | Instant | High (all processing) | Complex (determinism needed) | Critical path, low latency |
| Primary-Backup | Seconds | Low (passive backups) | Moderate | Most database systems |
| Quorum | Instant | Medium | Moderate | Distributed databases |
| Time Redundancy (Retry) | Request delay | Low | Low | Idempotent operations |
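The 'Time Redundancy (Retry)' row in the table above is the simplest strategy to apply in application code. Here is a minimal sketch of a retry helper with exponential backoff; the retry count and delays are illustrative, and retries are only safe for idempotent operations:

```typescript
// Time redundancy: retry a failed operation, backing off exponentially between attempts.
// Only safe when the operation is idempotent (repeating it has no additional effect).
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 4,
  baseDelayMs: number = 100
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Exponential backoff with jitter to avoid synchronized retry storms.
      const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```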
Before recovering from failures, systems must detect them. Failure detection in distributed systems is fundamentally challenging because a slow response and a crashed node are indistinguishable from the detector's perspective.
The FLP Impossibility Result:
Fischer, Lynch, and Paterson proved (1985) that in an asynchronous system with even one possible crash failure, there is no deterministic algorithm that solves consensus. A key implication: perfect failure detection is impossible in asynchronous systems.
Practical Failure Detection:
Since perfect detection is impossible, practical systems use unreliable failure detectors—mechanisms that may make mistakes but are useful nonetheless.
Failure Detector Properties:
Failure detectors are characterized by two properties:
Completeness: Every failed node is eventually suspected by every non-failed node.
Accuracy: Non-failed nodes are not incorrectly suspected.
Tradeoff: These properties are in tension: short timeouts detect real failures quickly (better completeness and detection time) but wrongly suspect slow, healthy nodes (worse accuracy); long timeouts reduce false suspicions but delay detection.
Practical Approach:
Most systems prioritize completeness (don't miss failures) and tolerate some false positives: heartbeats with conservative, often adaptive timeouts; retries before declaring a node dead; and corroboration from multiple peers before acting on a suspicion.
Production failure detectors often use multiple signals: heartbeat absence, failed requests, resource saturation, and peer reports. No single signal is definitive; the detector considers cumulative evidence. Systems like Consul and etcd use both heartbeats and application-level health checks.
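A minimal sketch of a heartbeat-based detector along these lines; the timeout value and the `onSuspect` callback are illustrative, and a production detector would combine this with the additional signals mentioned above:

```typescript
// Heartbeat-based failure detector: a peer is suspected if no heartbeat has arrived
// within timeoutMs. Suspicion is a guess, not a certainty; a slow peer is
// indistinguishable from a crashed one.
class HeartbeatDetector {
  private lastSeen = new Map<string, number>();

  constructor(
    private readonly timeoutMs: number = 5000,
    private readonly onSuspect: (nodeId: string) => void = () => {}
  ) {}

  // Called whenever a heartbeat message arrives from a peer.
  recordHeartbeat(nodeId: string): void {
    this.lastSeen.set(nodeId, Date.now());
  }

  // Called periodically (e.g., every second) to check for missing heartbeats.
  checkPeers(): string[] {
    const suspected: string[] = [];
    const now = Date.now();
    for (const [nodeId, seenAt] of this.lastSeen) {
      if (now - seenAt > this.timeoutMs) {
        suspected.push(nodeId);
        this.onSuspect(nodeId); // completeness: crashed peers are eventually suspected
      }
    }
    return suspected; // may include false positives (accuracy is best-effort)
  }
}
```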
Once failures are detected, systems must recover. Recovery mechanisms restore service and maintain consistency. Different strategies suit different failure scenarios.
Recovery in Practice: Database Failover
A typical database failover sequence: (1) the failure detector notices the primary has missed several heartbeats; (2) the remaining replicas confirm the suspicion and elect a new primary; (3) the most up-to-date replica is promoted; (4) clients are redirected to the new primary via DNS, a proxy, or a configuration update; (5) the old primary is fenced off so it cannot accept writes if it comes back; (6) when it returns, it rejoins as a replica and catches up.
Automatic vs. Manual Failover: automatic failover minimizes downtime but risks acting on a false suspicion (for example, during a network partition); manual failover is slower but keeps a human in the loop for risky decisions.
Many systems use a semi-automatic approach: detect automatically, but require human confirmation for irreversible actions.
Split-brain occurs when network partitions cause two replicas to both believe they are primary, accepting conflicting writes. This is catastrophic for consistency. Prevention requires coordination mechanisms: external witness, quorum-based election, fencing (STONITH—'Shoot The Other Node In The Head'), or consensus protocols like Raft.
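One way to make fencing concrete is with monotonically increasing fencing tokens (epochs): storage rejects writes from any primary holding a stale token. A minimal sketch, assuming a hypothetical coordination service hands out the tokens at election time:

```typescript
// Fencing-token sketch: each newly elected primary receives a strictly larger token.
// Storage rejects writes carrying a token older than the newest one it has seen,
// so a deposed primary that still believes it is in charge cannot corrupt state.
class FencedStorage {
  private highestTokenSeen = 0;
  private data = new Map<string, string>();

  write(key: string, value: string, fencingToken: number): void {
    if (fencingToken < this.highestTokenSeen) {
      throw new Error(
        `Rejected write with stale token ${fencingToken} (current epoch: ${this.highestTokenSeen})`
      );
    }
    this.highestTokenSeen = fencingToken;
    this.data.set(key, value);
  }
}

// Usage sketch: the old primary (token 7) is fenced off once the new primary (token 8) writes.
const storage = new FencedStorage();
storage.write('balance', '100', 8); // new primary succeeds
storage.write('balance', '90', 7);  // old primary's late write throws, preventing split-brain damage
```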
Common patterns for building fault-tolerant applications have emerged from decades of industry experience. These patterns encapsulate proven techniques for handling failures gracefully.
One widely used pattern is the circuit breaker: after repeated failures calling a dependency, the caller stops sending requests and fails fast, only periodically letting a probe request through to test whether the dependency has recovered.

```typescript
// Circuit Breaker Pattern Implementation
interface CircuitBreakerState {
  failures: number;
  lastFailureTime: number;
  state: 'CLOSED' | 'OPEN' | 'HALF_OPEN';
}

class CircuitBreaker {
  private state: CircuitBreakerState = {
    failures: 0,
    lastFailureTime: 0,
    state: 'CLOSED'
  };

  constructor(
    private readonly failureThreshold: number = 5,
    private readonly resetTimeout: number = 30000 // 30 seconds
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Check if circuit should be reset
    if (this.state.state === 'OPEN') {
      if (Date.now() - this.state.lastFailureTime > this.resetTimeout) {
        this.state.state = 'HALF_OPEN'; // Allow one probe request
      } else {
        throw new Error('Circuit is OPEN - failing fast');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.state.failures = 0;
    this.state.state = 'CLOSED'; // Reset on success
  }

  private onFailure(): void {
    this.state.failures++;
    this.state.lastFailureTime = Date.now();
    if (this.state.failures >= this.failureThreshold) {
      this.state.state = 'OPEN'; // Open circuit
    }
  }
}
```

For strongly consistent fault tolerance, replicas must agree on the order of operations despite failures. This is the consensus problem—and it's at the heart of building reliable distributed systems.
Replicated State Machine (RSM):
A Replicated State Machine ensures that all replicas of a service start from the same initial state and apply the same operations in the same order.
If operations are deterministic, identical ordering guarantees identical states. This is the foundation of fault-tolerant services.
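A minimal sketch of the idea: given the same ordered log of deterministic operations, every replica computes the same state. The key-value operation set here is illustrative, not any particular system's API:

```typescript
// Replicated state machine sketch: replicas apply the same deterministic operations
// in the same log order, so their resulting states are identical.
type Operation =
  | { kind: 'set'; key: string; value: string }
  | { kind: 'delete'; key: string };

class KeyValueStateMachine {
  private state = new Map<string, string>();

  // Deterministic: the resulting state depends only on the sequence of operations applied.
  apply(op: Operation): void {
    if (op.kind === 'set') {
      this.state.set(op.key, op.value);
    } else {
      this.state.delete(op.key);
    }
  }

  get(key: string): string | undefined {
    return this.state.get(key);
  }
}

// Two replicas applying the same agreed-upon log end up in the same state.
const log: Operation[] = [
  { kind: 'set', key: 'x', value: '1' },
  { kind: 'set', key: 'x', value: '2' },
  { kind: 'delete', key: 'x' },
];
const replicaA = new KeyValueStateMachine();
const replicaB = new KeyValueStateMachine();
log.forEach(op => replicaA.apply(op));
log.forEach(op => replicaB.apply(op));
// replicaA.get('x') === replicaB.get('x') === undefined
```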
The Role of Consensus:
Consensus protocols ensure all non-faulty replicas agree on the same values: which operations are in the replicated log, in what order they appear, and (in leader-based protocols) which node is currently the leader.
Even if some replicas crash or messages are lost, the rest agree and continue.
Consensus in Practice:
Consensus protocols are rarely implemented directly by application developers. Instead, they're embedded in infrastructure: coordination services such as ZooKeeper (ZAB), etcd and Consul (Raft), and the replication layers of distributed databases.
Applications use these systems for leader election, configuration, distributed locks, and coordinated state—delegating the complexity of consensus to proven implementations.
Consensus requires communication between replicas, adding latency to every write. For replicas across geographic regions, this can mean 100ms+ per commit. Strong consistency comes at a performance cost. Most systems offer tunable consistency: synchronous commit for critical operations, asynchronous for performance-sensitive paths.
Fault tolerance mechanisms are only as good as their testing. Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions.
The Problem:
Traditional testing focuses on expected behavior. But distributed systems fail in unexpected ways: cascading failures, retry storms, slow (rather than dead) dependencies, and emergent interactions that only appear under real production traffic.
The Chaos Engineering Approach: define the system's steady state (its normal, measurable behavior), hypothesize that it will persist under failure, inject real faults such as terminated instances, added latency, or network partitions, and compare what actually happens against the hypothesis (a minimal injection sketch follows below).
Tools and Platforms: Netflix's Chaos Monkey and the broader Simian Army, commercial platforms such as Gremlin, and Kubernetes-native tools such as LitmusChaos.
Best Practices: start in a staging environment before touching production, minimize the blast radius of each experiment, automate experiments so they run continuously, and always have an abort and rollback plan.
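A minimal sketch of the injection idea at the application level; the failure and latency rates are illustrative, and real platforms typically inject faults at the infrastructure level rather than wrapping calls like this:

```typescript
// Chaos-injection sketch: wrap a dependency call and, with small probability,
// inject extra latency or an outright failure so the caller's fault handling
// (retries, timeouts, circuit breakers, fallbacks) is exercised continuously.
async function withChaos<T>(
  call: () => Promise<T>,
  failureRate: number = 0.01,  // 1% of calls fail outright
  latencyRate: number = 0.05   // 5% of calls get 2s of extra latency
): Promise<T> {
  if (Math.random() < failureRate) {
    throw new Error('Chaos: injected failure');
  }
  if (Math.random() < latencyRate) {
    await new Promise(resolve => setTimeout(resolve, 2000)); // injected latency
  }
  return call();
}
```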
Netflix runs chaos experiments in production continuously. Their philosophy: 'The best way to avoid failure is to fail constantly.' By making failure routine, engineers build systems that handle it gracefully, and operators become skilled at responding. Chaos becomes a feature, not a bug.
Fault tolerance transforms the liability of having many components into an asset: properly designed distributed systems are more reliable than any individual component. The key insights: failures follow the fault → error → failure chain, and tolerance means breaking that chain; at scale, failures are routine, so redundancy, detection, and recovery must be designed in from the start; the failure model you assume determines how much redundancy you need; consensus underpins strongly consistent replication; and chaos engineering is how you verify that these mechanisms actually work.
Looking Ahead:
With fault tolerance understood, we examine consistency challenges—the difficulties of maintaining correct data state across a distributed system. Fault tolerance and consistency are deeply intertwined: replication for fault tolerance creates consistency challenges, and consistency requirements constrain fault tolerance designs.
You now understand fault tolerance comprehensively: failure taxonomy, failure models, redundancy strategies, detection mechanisms, recovery techniques, fault tolerance patterns, consensus fundamentals, and chaos engineering. This knowledge enables you to build and evaluate systems that remain reliable despite inevitable failures.