Every production incident, every outage postmortem, every late-night page can be traced back to one or more fundamental failure modes. The systems we build—no matter how carefully designed—exist within an environment where failure is not just possible but inevitable. Networks partition. Servers crash. Disks fill up. Clocks drift. Dependencies become unavailable.
The question isn't whether these failures will occur, but whether your system will handle them gracefully when they do. Failure injection—the deliberate introduction of controlled failures into a running system—provides the means to answer this question proactively, on your terms, during business hours, rather than at 3 AM when an actual incident strikes.
By the end of this page, you will understand the complete taxonomy of injectable failures—from network-level disruptions through application-level faults to infrastructure resource exhaustion. You will learn to categorize real-world production incidents into these failure types and develop intuition for which categories pose the greatest risk to your specific systems.
Before diving into specific failure types, we must understand why we inject failures in the first place. This practice rests on a fundamental insight from chaos engineering: systems behave differently under failure conditions than they do under normal operation, and the only way to truly understand that behavior is to observe it directly.
Traditional testing methodologies—unit tests, integration tests, even sophisticated end-to-end tests—typically operate under the assumption that infrastructure behaves correctly. The network delivers packets. The database accepts writes. The cache returns values. But production environments are harsh, and these assumptions routinely break.
Failure injection bridges this gap by forcing systems to confront the very conditions that production will eventually impose. The goal is not to break things randomly but to explore specific hypotheses about system behavior under adverse conditions.
Failure injection is not chaos for chaos's sake. Each injected failure should test a specific hypothesis: 'If service X becomes unavailable, service Y will degrade gracefully and continue serving read requests.' Without a hypothesis, you're just breaking things and hoping to learn something—a far less effective approach than scientific experimentation.
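The hypothesis-first discipline can be made concrete in code. The sketch below is a minimal illustration, not a real chaos framework: the `ChaosExperiment` class, its field names, and the steady-state check are assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """Minimal structure for a hypothesis-driven failure injection run (illustrative sketch)."""
    hypothesis: str                      # e.g. "If service X is unavailable, Y still serves reads"
    steady_state: Callable[[], bool]     # measurable health check, e.g. p99 latency and error rate
    inject: Callable[[], None]           # introduce the failure
    rollback: Callable[[], None]         # remove the failure, even if checks fail

    def run(self) -> bool:
        # Never start an experiment against an already-unhealthy system.
        if not self.steady_state():
            raise RuntimeError("steady state not met before injection; aborting")
        try:
            self.inject()
            return self.steady_state()   # the hypothesis holds only if steady state survives the fault
        finally:
            self.rollback()
```

The essential properties are that every run starts from a verified steady state, names a falsifiable hypothesis, and always rolls the fault back.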
The Failure Spectrum
Failures exist on a spectrum from complete and obvious to subtle and insidious:
Hard failures: Complete unavailability—a server that's entirely unreachable, a database that refuses all connections. These are often the easiest failures to handle because they're unambiguous. The system knows the dependency is dead.
Soft failures: Partial degradation—a server that responds slowly, a database that accepts some writes but rejects others, a network that drops occasional packets. These are far more dangerous because they're ambiguous. The system may not realize it's sick.
Byzantine failures: Incorrect behavior—a server that returns wrong data, a clock that reports the wrong time, a disk that confirms writes it never persisted. These are the most dangerous because they violate fundamental assumptions the system makes about its environment.
Effective failure injection exercises the entire spectrum, not just the easy cases. Your system might handle a dead database beautifully while completely failing when that same database becomes slow.
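One low-risk way to build intuition for the spectrum is a test double whose failure mode you can dial between hard, soft, and Byzantine. The class below is a hypothetical sketch for local experiments; the mode names and delay values simply mirror the categories described above.

```python
import random
import time


class FakeDependency:
    """Test double covering the failure spectrum: 'healthy', 'hard' (refuses),
    'soft' (responds slowly), or 'byzantine' (returns wrong data)."""

    def __init__(self, mode: str = "healthy"):
        self.mode = mode

    def lookup(self, key: str) -> str:
        if self.mode == "hard":
            # Unambiguous: callers see an immediate, obvious error.
            raise ConnectionError("dependency unreachable")
        if self.mode == "soft":
            # Ambiguous: the call eventually succeeds but ties up threads and connections.
            time.sleep(random.uniform(5, 30))
        if self.mode == "byzantine":
            # Most dangerous: no error at all, just a silently wrong answer.
            return "stale-or-corrupt-value"
        return f"value-for-{key}"
```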
We can organize injectable failures into five major categories, each targeting different layers of the distributed systems stack. Understanding this taxonomy is essential because it ensures you're not testing the same failure modes repeatedly while leaving others completely unexplored.
| Category | Target Layer | Common Examples | Typical Impact |
|---|---|---|---|
| Network Failures | OSI Layers 3-4 | Latency, packet loss, partitions, DNS failures | Communication breakdown between services |
| Service Failures | Application Layer | Process crashes, exception injection, response errors | Dependency unavailability, cascading failures |
| Resource Exhaustion | Infrastructure Layer | CPU saturation, memory pressure, disk full, FD exhaustion | Performance degradation, OOM kills, service crashes |
| Clock/Time Failures | System Services | Clock skew, time jumps, NTP failures | Distributed coordination failures, data inconsistency |
| Data/State Failures | Storage Layer | Corruption, stale reads, split-brain | Data loss, incorrect behavior, trust violations |
Each category targets fundamentally different assumptions your system makes about its environment:
Network failures test the assumption that services can communicate reliably. In a distributed system, the network is the only thing connecting your services—when it fails, partitioned components must decide independently how to proceed.
Service failures test the assumption that dependencies will be available when needed. Every external call—to a database, cache, API, or microservice—is a potential failure point that your error handling must address.
Resource exhaustion tests the assumption that infrastructure resources are abundant. Modern systems are designed to scale, but resource limits—memory, CPU, disk, file descriptors—impose hard boundaries that can cause unexpected failures.
Clock/time failures test the assumption that time is consistent and reliable. Distributed systems often depend on time for ordering events, expiring cache entries, and coordinating distributed algorithms.
Data/state failures test the assumption that storage systems preserve and return correct data. When these assumptions break, the consequences can range from minor inconsistencies to catastrophic data loss.
Network failures are perhaps the most common category of injectable failures because networks are inherently unreliable. Even within a single data center, network issues occur regularly due to hardware failures, configuration errors, capacity constraints, and software bugs in network equipment.
Many systems handle complete network failures reasonably well—a connection that fails immediately triggers retry logic. But latency is insidious. A request that hangs for 60 seconds consumes resources, holds open connections, and blocks threads. The slow dependency often causes more damage than the dead one. Always test latency injection, not just connection failures.
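On Linux, the standard way to inject latency is the kernel's netem queueing discipline driven by tc. The wrapper below is a hedged sketch: it assumes root privileges, the iproute2 `tc` binary, and that the interface you pass actually carries the traffic you want to slow down.

```python
import subprocess


def inject_latency(interface: str, delay_ms: int, jitter_ms: int = 0) -> None:
    """Add a fixed delay (plus optional jitter) to all egress traffic on `interface`.
    Requires root and Linux's iproute2 `tc` tool."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        cmd.append(f"{jitter_ms}ms")
    subprocess.run(cmd, check=True)


def clear_latency(interface: str) -> None:
    """Remove the netem qdisc, restoring normal traffic. Always run this in rollback."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


# Hypothetical usage: add 200ms ± 50ms to everything leaving eth0, then clean up.
# inject_latency("eth0", delay_ms=200, jitter_ms=50)
# ... observe timeouts, thread pools, and connection counts ...
# clear_latency("eth0")
```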
Service failures target the application layer—the actual software components that make up your distributed system. While network failures test infrastructure, service failures test application logic and error handling code paths.
The Hierarchy of Service Failures
Service failures can be organized by their scope and detectability:
| Scope | Detectability | Example | Handling Complexity |
|---|---|---|---|
| Single instance | High | One pod crashes | Low—orchestrator restarts |
| All instances | High | Entire service down | Medium—circuit breaker triggers |
| Partial instances | Medium | 50% of pods degraded | High—load balancer health checks may not detect |
| Intermittent | Low | Occasional request failures | Very high—hard to reproduce, debug |
| Semantic | Very Low | Wrong data returned | Extremely high—may not be detected at all |
The most dangerous failures are those with low detectability. Your monitoring might show all services as healthy while users experience consistent errors from the subset of traffic routed to degraded instances.
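A simple way to exercise several rows of this table in application code is a wrapper that makes a dependency call fail loudly for some requests and return a wrong-but-plausible result for others. The decorator below is an illustrative sketch; the rates, the exception type, and the `get_price` example are assumptions.

```python
import functools
import random


class InjectedFault(Exception):
    """Raised instead of a real dependency error during an experiment."""


def inject_failures(error_rate: float = 0.0, wrong_rate: float = 0.0, wrong_result=None):
    """Fail `error_rate` of calls with an exception (high detectability) and silently
    return `wrong_result` for `wrong_rate` of calls (the low-detectability semantic case)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            if roll < error_rate + wrong_rate:
                return wrong_result          # no exception, no log line: just wrong data
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage: 5% of price lookups error out, another 1% silently return 0.0.
@inject_failures(error_rate=0.05, wrong_rate=0.01, wrong_result=0.0)
def get_price(sku: str) -> float:
    return 19.99   # stand-in for a real downstream call
```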
Resource exhaustion failures are among the most insidious because they rarely break anything immediately. Instead, they cause gradual degradation that compounds over time, making them difficult to detect and diagnose before serious problems emerge.
Resource exhaustion failures rarely occur in isolation. Memory pressure leads to increased garbage collection, which consumes CPU. CPU saturation leads to slower request processing, which increases queue depths and connection counts. Disk I/O saturation leads to write stalls, which cause requests to back up. Understanding these cascades is essential for interpreting chaos experiment results.
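These cascades are easy to reproduce with crude pressure generators. The helpers below are an illustrative sketch for isolated environments only: they pin CPU cores with busy-looping worker processes and hold a memory ballast, letting you watch garbage collection time, queue depth, and latency respond. The core counts and sizes are placeholder values.

```python
import multiprocessing
import time


def _burn(deadline: float) -> None:
    # Busy-loop on one core until the deadline passes.
    while time.monotonic() < deadline:
        pass


def saturate_cpu(cores: int, duration_s: float) -> None:
    """Hold roughly `cores` CPUs at 100% for `duration_s` seconds."""
    deadline = time.monotonic() + duration_s
    workers = [multiprocessing.Process(target=_burn, args=(deadline,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def hold_memory(megabytes: int, duration_s: float) -> None:
    """Allocate and hold a memory ballast, then release it."""
    ballast = bytearray(megabytes * 1024 * 1024)
    time.sleep(duration_s)
    del ballast


if __name__ == "__main__":
    saturate_cpu(cores=2, duration_s=30)        # watch run-queue length and request latency
    hold_memory(megabytes=512, duration_s=30)   # watch GC activity, swap, and OOM scores
```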
Time-related failures are among the most subtle and dangerous because distributed systems often make implicit assumptions about time that developers don't even realize exist. When clocks drift or jump, these hidden assumptions can cause surprising and difficult-to-debug failures.
Why Clock Failures Matter
Many distributed systems depend on time for correctness:
Cache entries, sessions, and tokens expire based on timestamps and TTLs.
Event ordering, log merging, and conflict resolution frequently rely on timestamps.
Distributed locks, leases, and leader election assume nodes agree on how much time has passed.
Timeout-based failure detection assumes local clocks advance at a steady rate.
These dependencies are often implicit in library code or infrastructure components, making them easy to overlook during system design.
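A few lines of code make the hidden time dependency visible. The sketch below simulates skew with an in-process offset rather than touching the real system clock; the token-expiry check and the 300-second skew value are illustrative assumptions.

```python
import time

CLOCK_SKEW_S = 0.0   # injected offset; a real experiment might skew the node's clock or break NTP


def skewed_now() -> float:
    """Stand-in for time.time() on a node whose clock has drifted."""
    return time.time() + CLOCK_SKEW_S


def is_token_valid(issued_at: float, ttl_s: float) -> bool:
    """Naive expiry check that silently trusts the local clock."""
    return skewed_now() - issued_at < ttl_s


issued_at = time.time()                       # issued by a server with a correct clock
print(is_token_valid(issued_at, ttl_s=60))    # True: clocks agree

CLOCK_SKEW_S = 300.0                          # validating node runs 5 minutes fast
print(is_token_valid(issued_at, ttl_s=60))    # False: valid tokens rejected, requests fail

CLOCK_SKEW_S = -300.0                         # validating node runs 5 minutes slow
print(is_token_valid(issued_at, ttl_s=60))    # stays True well past expiry: a security window opens
```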
Data and state failures are the most severe category because they can result in permanent data loss or corruption. These failures test the most fundamental assumptions about storage system behavior—that writes persist and reads return the correct data.
Data failure injection requires extreme caution. Unlike network or service failures, data corruption can be permanent if not carefully isolated. Always conduct data failure experiments against isolated test data, never production data. Ensure you have verified backup and recovery procedures before experimenting with any form of data failure.
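With those precautions in place, stale reads are among the safer data failures to simulate, because the injector sits in front of an isolated test store rather than mutating real data. The class below is a hypothetical sketch: it freezes a snapshot to stand in for a lagging replica and serves it for a fraction of reads.

```python
import copy
import random


class StaleReadInjector:
    """Wraps a dict-like *test* store and answers some reads from a frozen snapshot,
    simulating a lagging replica. Never point this at production data."""

    def __init__(self, store: dict, stale_read_rate: float = 0.2):
        self._store = store
        self._snapshot = copy.deepcopy(store)   # the 'replica' stops receiving updates here
        self._rate = stale_read_rate

    def put(self, key, value):
        self._store[key] = value                # writes reach the 'primary' only

    def get(self, key):
        source = self._snapshot if random.random() < self._rate else self._store
        return source.get(key)


# Hypothetical usage: does the application detect or tolerate reading its own writes late?
kv = StaleReadInjector({"balance:alice": 100}, stale_read_rate=0.5)
kv.put("balance:alice", 50)
print([kv.get("balance:alice") for _ in range(5)])   # mixture of 100 (stale) and 50 (fresh)
```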
The true value of understanding failure taxonomies becomes clear when you analyze past production incidents. Almost every outage can be traced to one or more failure categories from our taxonomy. By mapping incidents to failure types, you can identify which categories pose the greatest risk to your specific systems and prioritize your chaos engineering efforts accordingly.
| Incident | Primary Failure Type | Secondary Failure | What Chaos Testing Would Have Revealed |
|---|---|---|---|
| AWS us-east-1 2011 | Resource Exhaustion | Network Partition | EBS control plane couldn't handle failover traffic volume |
| GitHub 2018 split-brain | Network Partition | Data State Failure | MySQL failover left replicas with conflicting data |
| Cloudflare 2019 | Resource Exhaustion | CPU Saturation | Regex in WAF rule caused catastrophic backtracking |
| Facebook 2021 | Network Failure | DNS Failure | BGP withdrawal made DNS servers unreachable |
| Knight Capital 2012 | Service Failure | Partial Deployment | Old code on subset of servers caused $440M loss in 45 minutes |
| Leap second bugs | Clock/Time | Time Jump | Many systems couldn't handle 61-second minutes |
Creating Your Failure Risk Profile
Not all failure types pose equal risk to all systems. A system heavily dependent on distributed coordination is more vulnerable to clock skew than a simple stateless web server. A system with complex service dependencies is more vulnerable to cascade failures than a monolith.
To create your failure risk profile:
Analyze your architecture — What assumptions does each component make? What dependencies does it have? Where are the single points of failure?
Review past incidents — What failure types have caused outages before? These are proven risk areas.
Assess consequence severity — For each failure type, what's the business impact? Data loss is usually worse than temporary unavailability.
Evaluate detection capability — How quickly would you detect each failure type? Low-detectability failures deserve more attention.
Prioritize experiments — Focus chaos engineering efforts on high-risk, high-impact, low-visibility failure types first; the scoring sketch after this list shows one simple way to rank categories.
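One lightweight way to turn these steps into an ordered experiment backlog is a score per failure category. The sketch below is an assumption-laden illustration: the 1-5 scales, the formula, and the example numbers are placeholders to be replaced with your own assessment.

```python
from dataclasses import dataclass


@dataclass
class FailureRisk:
    category: str
    likelihood: int     # 1-5, from architecture review and incident history
    impact: int         # 1-5, business consequence if it happens
    detectability: int  # 1-5, where 5 means you would notice immediately

    @property
    def priority(self) -> float:
        # Higher likelihood and impact raise priority; strong detection lowers it.
        return self.likelihood * self.impact / self.detectability


risks = [
    FailureRisk("Network partition",   likelihood=4, impact=4, detectability=3),
    FailureRisk("Clock skew",          likelihood=2, impact=5, detectability=1),
    FailureRisk("Disk full",           likelihood=3, impact=3, detectability=4),
    FailureRisk("Dependency slowdown", likelihood=4, impact=4, detectability=2),
]

for risk in sorted(risks, key=lambda r: r.priority, reverse=True):
    print(f"{risk.category:20s} priority={risk.priority:.1f}")
```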
We've established a comprehensive taxonomy of injectable failures, organized into five major categories that span the entire distributed systems stack. This taxonomy provides the foundation for systematic chaos engineering—ensuring you test all the ways your system can fail, not just the obvious ones.
What's Next:
With the complete taxonomy established, we'll now deep-dive into the most common and impactful category: Network Failures. You'll learn specific techniques for injecting latency, packet loss, partitions, and DNS failures, along with the observable effects each produces and what they reveal about your system's resilience.
You now understand the complete taxonomy of injectable failures in distributed systems. This framework will guide your chaos engineering efforts, ensuring comprehensive coverage across all failure modes. Next, we'll explore network failure injection in depth.