Every production system you've ever used has failed. The difference between systems that seem reliable and those that don't isn't the absence of failure—it's how those failures are handled, masked, and recovered from. At Google's scale, hardware failures occur literally every second. At Amazon, network partitions are measured not in 'if' but 'when' and 'how often.'
The first step toward building truly resilient systems is developing an intimate understanding of how systems fail. Not a theoretical awareness, but a deep, practical taxonomy of failure modes that allows you to anticipate problems before they occur and design defenses before you need them.
By the end of this page, you will have mastered the complete taxonomy of system failures: hardware failures from bit flips to datacenter outages, software failures from memory leaks to cascading crashes, and network failures from latency spikes to complete partitions. You'll understand not just what fails, but why, how to detect it, and most importantly—how to think about failure as a design input rather than an afterthought.
Before diving into failure categories, we must establish a fundamental mindset shift. Traditional software development treats failure as an exception—something that happens at the edges, something to be prevented entirely. This perspective works for single-machine programs but catastrophically fails for distributed systems.
The distributed systems perspective:
In distributed systems, failure isn't exceptional—it's the norm. Consider the math: if a single component has 99.9% uptime, a system with 1,000 such components will experience at least one component failure 63% of the time. With 10,000 components? Virtually constant failure somewhere in the system.
This isn't pessimism—it's reality at scale. And once you internalize this reality, your entire approach to system design transforms.
| Per-Component Reliability | P(≥1 failure), 100 Components | P(≥1 failure), 1,000 Components | P(≥1 failure), 10,000 Components |
|---|---|---|---|
| 99.99% | 0.99% | 9.52% | 63.2% |
| 99.9% | 9.52% | 63.2% | 99.995% |
| 99% | 63.4% | 99.996% | ~100% |
| 95% | 99.4% | ~100% | ~100% |
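To make the arithmetic concrete: the probability that at least one of n independent components is failed is 1 - r^n, where r is the per-component reliability. A minimal Python sketch that reproduces the table above:

```python
# Probability that at least one of n independent components fails,
# given each component is healthy with probability r.
def p_any_failure(r: float, n: int) -> float:
    return 1.0 - r ** n

for r in (0.9999, 0.999, 0.99, 0.95):
    row = ", ".join(f"{p_any_failure(r, n):.3%}" for n in (100, 1_000, 10_000))
    print(f"{r:.2%} per component -> {row}")
```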
At hyperscale, individual component reliability becomes almost irrelevant. What matters is how your system behaves when components fail—because they will, constantly. Design for failure, not against it.
Hardware failures represent the physical foundation of all computing failures. Despite remarkable advances in reliability, hardware remains subject to the laws of physics: materials degrade, components wear out, and random events (cosmic rays, power fluctuations) introduce errors. Understanding hardware failure modes is essential because they're fundamentally different from software failures—they can't be 'fixed' with a patch.
Failure Characteristics:
Hardware failures exhibit distinct patterns that inform how we design around them:
| Component | Annual Failure Rate | Detection Method | Typical Recovery |
|---|---|---|---|
| HDD | 2-8% | SMART monitoring, I/O errors | Replace drive, rebuild from replica |
| SSD | 0.5-3% | SMART, wear leveling metrics | Replace drive, rebuild from replica |
| RAM | 0.2-0.5% | ECC errors, memtest | Replace DIMM, reboot |
| CPU | 0.01-0.1% | Machine check exceptions | Replace server |
| NIC | 0.5-1% | Packet loss, CRC errors | Failover to backup NIC |
| PSU | 1-3% | Voltage monitoring | Automatic failover to redundant PSU |
| Fan | 5-10% | RPM monitoring, temperature | Replace fan, thermal throttling |
Google's famous 2007 study of 100,000+ drives found that SMART data is a poor predictor of failure—36% of failed drives showed no SMART warnings. Age correlates with failure more strongly than usage. The study fundamentally changed how the industry thinks about disk reliability.
Correlated Hardware Failures:
Perhaps more dangerous than individual failures are correlated failures—events that take out multiple components simultaneously. These defeat the basic assumption behind redundancy: that failures are independent.
Examples include:
- A top-of-rack switch or power distribution unit failure that takes down every server in a rack
- A cooling failure that overheats an entire aisle or zone
- A bad firmware or kernel update rolled out across the fleet
- A defective manufacturing batch that causes many drives to wear out in the same window
- A datacenter-wide power or network outage
Sophisticated systems intentionally spread replicas across failure domains (different racks, power circuits, cooling zones, even hardware batches) to minimize correlation.
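As an illustration, here is a minimal sketch of rack-aware replica placement; the `Node` type and `rack` label are hypothetical, and production systems (HDFS rack awareness, Cassandra's NetworkTopologyStrategy, and similar) use considerably richer topology models:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    rack: str  # failure-domain label; could equally be a power circuit or zone

def place_replicas(nodes: list[Node], replicas: int) -> list[Node]:
    """Choose replica hosts round-robin across racks, so copies land in
    distinct failure domains before any domain is reused."""
    by_rack: dict[str, list[Node]] = defaultdict(list)
    for node in nodes:
        by_rack[node.rack].append(node)
    racks = list(by_rack.values())
    chosen: list[Node] = []
    max_depth = max((len(r) for r in racks), default=0)
    for depth in range(max_depth):
        for rack_nodes in racks:
            if depth < len(rack_nodes) and len(chosen) < replicas:
                chosen.append(rack_nodes[depth])
    return chosen
```

With nodes spread over three racks, `place_replicas(nodes, 3)` puts each copy on a different rack; only when replicas outnumber racks does any rack receive a second copy.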
Software failures are fundamentally different from hardware failures. Hardware fails due to physical processes: wear, random events, environmental factors. Software fails because of bugs: logical errors frozen into code that deterministically produce incorrect behavior under specific conditions. This has profound implications:
- Identical replicas running the same code hit the same bug under the same conditions, so simple redundancy offers little protection against software faults
- A bug can lie dormant for years until a particular input, timing, or load pattern triggers it
- Once found, a software failure can be reproduced, diagnosed, and permanently fixed with a patch
- A fix (or a new bug) reaches the entire fleet at deployment speed, for better or worse
These characteristics make software failures simultaneously easier to prevent (through testing, reviews) and more dangerous (a single bug can crash an entire distributed system).
State Corruption:
Among the most dangerous software failures are those that corrupt persistent state. When bugs write incorrect data to databases, they don't just cause immediate problems—they create time bombs. The corrupted data may be read later, causing secondary failures. Worse, backups may propagate the corruption, making recovery difficult or impossible.
Examples include:
- A migration script that writes values in the wrong unit or currency for a subset of rows
- A race condition that interleaves two writes and leaves a record internally inconsistent
- A serialization bug that truncates or mangles fields on the way to disk
- An off-by-one error that associates data with the wrong customer
State corruption often requires complex remediation: identifying affected records, determining correct values, and carefully applying fixes without causing new problems.
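One common defense, sketched below with assumed invariants for a hypothetical `Order` record, is to validate business rules at write time and store a content checksum that can be re-verified on read. Neither catches every bug, but both shrink the window between corruption and detection:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class Order:
    order_id: str
    quantity: int
    unit_price_cents: int
    total_cents: int

def validate(order: Order) -> None:
    """Refuse to persist records that violate invariants, rather than
    writing a time bomb into the database."""
    if order.quantity <= 0:
        raise ValueError(f"{order.order_id}: quantity must be positive")
    if order.total_cents != order.quantity * order.unit_price_cents:
        raise ValueError(f"{order.order_id}: total != quantity * unit price")

def checksum(order: Order) -> str:
    """Content hash stored alongside the record; recomputing it on read
    detects corruption that happened after the write."""
    payload = json.dumps(asdict(order), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```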
The most dangerous software failures are those where the system continues running while producing incorrect results. A service that crashes is visible; a service that silently returns wrong answers for 0.1% of requests can corrupt downstream data for months before detection.
Network failures occupy a unique position in the failure taxonomy. Unlike hardware failures (which are local) or software failures (which are deterministic), network failures are characterized by their uncertainty and asymmetry. A network problem between two nodes doesn't just prevent communication—it creates ambiguity about state.
The fundamental network uncertainty:
When you send a message and don't receive a response, you cannot distinguish between:
- The request was lost in transit and never arrived
- The request arrived, but the remote node crashed before processing it
- The remote node processed the request, but crashed before replying
- The remote node processed the request, but the reply was lost or is still in flight
- Everything worked, just more slowly than your timeout allowed
This uncertainty is fundamental to networked systems and cannot be eliminated—only managed.
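The standard way to manage (not eliminate) the ambiguity is to time out, make the operation idempotent, and retry. The sketch below assumes a hypothetical payment endpoint that deduplicates on an `Idempotency-Key` header; the URL, endpoint, and header handling are illustrative, not any specific provider's API:

```python
import time
import uuid

import requests  # any HTTP client with timeouts works

def charge_with_retries(url: str, amount_cents: int, attempts: int = 3) -> dict:
    """Reuse one idempotency key across retries: if the server already
    processed an earlier attempt whose reply we never saw, it returns the
    original result instead of charging twice."""
    key = str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            resp = requests.post(
                url,
                json={"amount_cents": amount_cents},
                headers={"Idempotency-Key": key},
                timeout=2.0,  # past this point the outcome is simply unknown
            )
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            # Lost request? Crashed server? Lost reply? We cannot tell, and
            # retrying is safe only because the operation is idempotent.
            time.sleep(0.5 * 2 ** attempt)
    raise RuntimeError("payment outcome unknown after retries")
```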
Gray Failures:
Some of the most challenging network problems are 'gray failures'—partial failures that don't cleanly classify as 'working' or 'failed.' Consider a network link with 10% packet loss. TCP connections still work (with high latency from retransmissions). Health checks might pass. But actual application performance is severely degraded.
Gray failures often manifest as:
- Elevated packet loss or retransmission rates on a specific link or path
- Latency that looks fine at the median but is terrible at the 99th percentile
- Intermittent timeouts that affect some clients or some routes but not others
- Asymmetric connectivity, where A can reach B but B cannot reach A
- Degraded throughput while health checks continue to pass
These are particularly dangerous because monitoring systems designed to detect binary failure (working/not-working) may miss them entirely.
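A practical mitigation is to alert on tail latency and error-rate trends rather than on liveness alone. A minimal sketch, with thresholds that are purely illustrative:

```python
from statistics import quantiles

def is_degraded(latencies_ms: list[float], errors: int, total: int,
                p99_budget_ms: float = 250.0, max_error_rate: float = 0.001) -> bool:
    """Flag a dependency as degraded on p99 latency or error rate, even
    though every individual up/down probe may still be passing."""
    if total == 0 or len(latencies_ms) < 2:
        return False
    p99 = quantiles(latencies_ms, n=100)[98]  # 99th-percentile latency
    return p99 > p99_budget_ms or (errors / total) > max_error_rate
```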
| Environment | Partition Frequency | Typical Duration | Common Causes |
|---|---|---|---|
| Single Datacenter | Rare (~yearly) | Minutes | Switch failures, misconfigurations |
| Multi-Datacenter | Occasional (~monthly) | Minutes to hours | WAN link failures, routing issues |
| Hybrid Cloud | Frequent (~weekly) | Seconds to minutes | VPN issues, cloud connectivity |
| Global Distribution | Expected (~daily) | Variable | Internet routing, congestion |
| Edge/IoT | Constant (~continuous) | Variable | Last-mile issues, mobile networks |
The Two Generals' Problem shows that guaranteed agreement over an unreliable network is impossible: no protocol can ensure that two parties commit to the same action when the messages between them can be lost. Every practical distributed system is an engineering compromise around this fundamental impossibility.
Real-world outages rarely involve a single failure. They typically result from combinations of failures, or chains of causation where one failure triggers another. Understanding these interactions is crucial because resilience measures for individual failures may not protect against multi-failure scenarios.
The Multi-Factor Reality:
Postmortem analyses of major outages consistently reveal multi-factor causation: a latent defect or misconfiguration, a triggering event such as a traffic spike or a routine deployment, and a safeguard (monitoring, failover, rate limiting) that was missing or failed to engage.
Major outages are like the Swiss cheese model of accident causation: each defensive layer has holes (vulnerabilities), and catastrophe occurs when holes align. Building resilient systems requires multiple, diverse layers of defense with non-aligned failure modes.
Failure Amplification:
Certain failure combinations amplify each other dramatically:
Load + Latency: High load increases latency. Higher latency increases concurrency (more in-flight requests). Higher concurrency increases load. The spiral continues until the system collapses.
Failure + Retries: Failures cause retries. Retries add load. Added load increases failures. More failures trigger more retries. Traffic grows exponentially until the system dies.
Partial Failure + Load Shedding: When some nodes fail, load shifts to survivors. Survivors become overloaded. Overloaded nodes start failing. More load shifts to fewer survivors. Collapse accelerates.
Designing for fault tolerance means not just handling individual failures, but breaking these amplification loops.
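As one concrete example, the failure-plus-retries loop can be broken by capping retries with exponential backoff, jitter, and a retry budget. A minimal sketch with illustrative parameters (production implementations typically track the budget over a sliding time window):

```python
import random
import time

class RetryBudget:
    """Caps retries to a fraction of observed requests so that a failing
    dependency sees at most a bounded amount of extra load."""
    def __init__(self, max_retry_ratio: float = 0.1):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def allow_retry(self) -> bool:
        if self.retries < self.max_retry_ratio * max(self.requests, 1):
            self.retries += 1
            return True
        return False

def call_with_backoff(operation, budget: RetryBudget, attempts: int = 4):
    """Exponential backoff with full jitter; gives up early once the budget
    says the system is already saturated with retries."""
    budget.record_request()
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1 or not budget.allow_retry():
                raise
            time.sleep(random.uniform(0, 0.2 * 2 ** attempt))
```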
To design for failure, we need systematic frameworks for thinking about failure characteristics. Several dimensions prove particularly useful:
Failure Duration: Transient failures resolve on their own (a dropped packet, a momentary spike); intermittent failures come and go unpredictably (a flaky cable, a race condition); permanent failures persist until something repairs them (a dead drive, a corrupted file).
Failure Detection: Fail-stop failures announce themselves by halting cleanly; fail-slow failures keep running at severely degraded speed; silent or Byzantine failures keep running while producing wrong or arbitrary results, and are the hardest to detect.
Failure Scope: Failures range from a single process or component, to a whole node, a rack, a datacenter, or an entire region; the scope determines how far away your redundancy needs to live.
| Failure Type | Detection Difficulty | Recovery Complexity | Design Priority |
|---|---|---|---|
| Hardware (Fail-Stop) | Easy | Moderate | High (common) |
| Hardware (Fail-Slow) | Moderate | Moderate | High (dangerous) |
| Software (Crash) | Easy | Easy | High (common) |
| Software (Corruption) | Very Hard | Very Hard | Critical (dangerous) |
| Network (Partition) | Easy | Hard | Critical |
| Network (Latency) | Moderate | Moderate | High |
| Byzantine (Any) | Extremely Hard | Extremely Hard | Context-dependent |
The MTBF/MTTR Framework:
Two key metrics for understanding failure impact:
MTBF (Mean Time Between Failures): Average time between failures. Higher is better. Depends on component quality and environmental factors.
MTTR (Mean Time To Repair): Average time to restore service after failure. Lower is better. Depends on detection speed, automation, and component replaceability.
System availability can be expressed as:
Availability = MTBF / (MTBF + MTTR)
This formula reveals an important insight: you can improve availability either by making failures less frequent (increasing MTBF) OR by making recovery faster (decreasing MTTR). For many systems, reducing MTTR is more practical than increasing MTBF.
Netflix's approach: rather than trying to prevent all failures, assume failures will happen and optimize for fast recovery. A system with 1-hour MTBF but 30-second MTTR achieves 99.2% availability. A system with 1-week MTBF but 4-hour MTTR achieves only about 97.7% availability.
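The same arithmetic, using the availability formula above (the figures are the two scenarios from the Netflix comparison):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Frequent failures, fast recovery: 1-hour MTBF, 30-second MTTR.
print(f"{availability(1.0, 30 / 3600):.2%}")   # ~99.17%
# Rare failures, slow recovery: 1-week MTBF, 4-hour MTTR.
print(f"{availability(7 * 24, 4):.2%}")        # ~97.67%
```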
Developing an intuitive sense for failure modes is essential for system design. This intuition allows you to anticipate problems during design reviews, quickly diagnose production issues, and evaluate the true resilience of a system.
Developing Failure Intuition:
Read postmortems: Every major tech company publishes postmortems. These are invaluable for understanding real-world failure patterns.
Practice failure analysis: For any system design, ask 'What happens when X fails?' for every component. Then ask 'What happens when X and Y fail together?'
Run failure experiments: Chaos engineering isn't just for validating systems; it also builds intuition about how failures propagate (see the sketch after this list).
Study failure history: Your own organization's incidents teach you about your specific failure patterns and blind spots.
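A fault-injection wrapper does not need a dedicated platform to be useful; even a few lines, like the illustrative decorator below, let you watch how timeouts, retries, and fallbacks behave when a dependency misbehaves:

```python
import random
import time
from functools import wraps

def inject_faults(error_rate: float = 0.05, slow_rate: float = 0.10,
                  delay_s: float = 0.5):
    """Decorator that makes some calls fail outright and others respond
    slowly, mimicking the crash and gray-failure modes described above."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise ConnectionError("injected fault")   # simulated crash
            if roll < error_rate + slow_rate:
                time.sleep(delay_s)                        # simulated slowness
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.10)
def fetch_profile(user_id: str) -> dict:
    return {"user_id": user_id, "name": "example"}
```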
A pre-mortem inverts the postmortem: before launching, imagine the system has catastrophically failed. What caused it? This mental exercise surfaces failure modes that optimistic planning overlooks, and it is remarkably effective at finding design flaws.
We've established the comprehensive taxonomy of system failures. Let's consolidate the essential knowledge:
- At scale, failure is the norm: with enough components, something is always broken, so design for failure rather than against it
- Hardware fails physically (drives, RAM, PSUs, fans) and cannot be patched; correlated failures across racks, power, and cooling defeat naive redundancy
- Software fails deterministically through bugs; crashes are visible, but state corruption and silently wrong answers are far more dangerous
- Network failures create uncertainty: lost messages, gray failures, and partitions cannot be eliminated, only managed
- Real outages are multi-factor, and amplification loops (load, retries, load shedding) turn small failures into collapses
- Classify failures by duration, detectability, and scope; improve availability by raising MTBF or, often more practically, by cutting MTTR
What's next:
Now that we understand the taxonomy of individual failures, we'll explore partial failures—the characteristic challenge of distributed systems. Partial failures, where some parts of a system fail while others continue operating, create unique challenges that don't exist in single-machine computing.
You now have a comprehensive understanding of the failure taxonomy: hardware, software, and network failures, their characteristics, and how they interact. This foundation is essential for the fault tolerance patterns we'll study throughout this chapter.