Every production incident, every outage postmortem, every late-night page can be traced back to one or more fundamental failure modes. The systems we build—no matter how carefully designed—exist within an environment where failure is not just possible but inevitable. Networks partition. Servers crash. Disks fill up. Clocks drift. Dependencies become unavailable.
The question isn't whether these failures will occur, but whether your system will handle them gracefully when they do. Failure injection—the deliberate introduction of controlled failures into a running system—provides the means to answer this question proactively, on your terms, during business hours, rather than at 3 AM when an actual incident strikes.
By the end of this page, you will understand the complete taxonomy of injectable failures—from network-level disruptions through application-level faults to infrastructure resource exhaustion. You will learn to categorize real-world production incidents into these failure types and develop intuition for which categories pose the greatest risk to your specific systems.
Before diving into specific failure types, we must understand why we inject failures in the first place. This practice rests on a fundamental insight from chaos engineering: systems behave differently under failure conditions than they do under normal operation, and the only way to truly understand that behavior is to observe it directly.
Traditional testing methodologies—unit tests, integration tests, even sophisticated end-to-end tests—typically operate under the assumption that infrastructure behaves correctly. The network delivers packets. The database accepts writes. The cache returns values. But production environments are harsh, and these assumptions routinely break.
Failure injection bridges this gap by forcing systems to confront the very conditions that production will eventually impose. The goal is not to break things randomly but to explore specific hypotheses about system behavior under adverse conditions.
Failure injection is not chaos for chaos's sake. Each injected failure should test a specific hypothesis: 'If service X becomes unavailable, service Y will degrade gracefully and continue serving read requests.' Without a hypothesis, you're just breaking things and hoping to learn something—a far less effective approach than scientific experimentation.
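The hypothesis-first discipline can be made concrete in code. The sketch below is a minimal illustration, not a real chaos framework: the `ChaosExperiment` class, its field names, and the steady-state check are assumptions chosen for clarity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    """Minimal structure for a hypothesis-driven failure injection run (illustrative sketch)."""
    hypothesis: str                      # e.g. "If service X is unavailable, Y still serves reads"
    steady_state: Callable[[], bool]     # measurable health check, e.g. p99 latency and error rate
    inject: Callable[[], None]           # introduce the failure
    rollback: Callable[[], None]         # remove the failure, even if checks fail

    def run(self) -> bool:
        # Never start an experiment against an already-unhealthy system.
        if not self.steady_state():
            raise RuntimeError("steady state not met before injection; aborting")
        try:
            self.inject()
            return self.steady_state()   # the hypothesis holds only if steady state survives the fault
        finally:
            self.rollback()
```

The essential properties are that every run starts from a verified steady state, names a falsifiable hypothesis, and always rolls the fault back.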
The Failure Spectrum
Failures exist on a spectrum from complete and obvious to subtle and insidious:
Hard failures: Complete unavailability—a server that's entirely unreachable, a database that refuses all connections. These are often the easiest failures to handle because they're unambiguous. The system knows the dependency is dead.
Soft failures: Partial degradation—a server that responds slowly, a database that accepts some writes but rejects others, a network that drops occasional packets. These are far more dangerous because they're ambiguous. The system may not realize it's sick.
Byzantine failures: Incorrect behavior—a server that returns wrong data, a clock that reports the wrong time, a disk that confirms writes it never persisted. These are the most dangerous because they violate fundamental assumptions the system makes about its environment.
Effective failure injection exercises the entire spectrum, not just the easy cases. Your system might handle a dead database beautifully while completely failing when that same database becomes slow.
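One low-risk way to build intuition for the spectrum is a test double whose failure mode you can dial between hard, soft, and Byzantine. The class below is a hypothetical sketch for local experiments; the mode names and delay values simply mirror the categories described above.

```python
import random
import time


class FakeDependency:
    """Test double covering the failure spectrum: 'healthy', 'hard' (refuses),
    'soft' (responds slowly), or 'byzantine' (returns wrong data)."""

    def __init__(self, mode: str = "healthy"):
        self.mode = mode

    def lookup(self, key: str) -> str:
        if self.mode == "hard":
            # Unambiguous: callers see an immediate, obvious error.
            raise ConnectionError("dependency unreachable")
        if self.mode == "soft":
            # Ambiguous: the call eventually succeeds but ties up threads and connections.
            time.sleep(random.uniform(5, 30))
        if self.mode == "byzantine":
            # Most dangerous: no error at all, just a silently wrong answer.
            return "stale-or-corrupt-value"
        return f"value-for-{key}"
```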
We can organize injectable failures into five major categories, each targeting different layers of the distributed systems stack. Understanding this taxonomy is essential because it ensures you're not testing the same failure modes repeatedly while leaving others completely unexplored.
| Category | Target Layer | Common Examples | Typical Impact |
|---|---|---|---|
| Network Failures | OSI Layers 3-4 | Latency, packet loss, partitions, DNS failures | Communication breakdown between services |
| Service Failures | Application Layer | Process crashes, exception injection, response errors | Dependency unavailability, cascading failures |
| Resource Exhaustion | Infrastructure Layer | CPU saturation, memory pressure, disk full, FD exhaustion | Performance degradation, OOM kills, service crashes |
| Clock/Time Failures | System Services | Clock skew, time jumps, NTP failures | Distributed coordination failures, data inconsistency |
| Data/State Failures | Storage Layer | Corruption, stale reads, split-brain | Data loss, incorrect behavior, trust violations |
Each category targets fundamentally different assumptions your system makes about its environment:
Network failures test the assumption that services can communicate reliably. In a distributed system, the network is the only thing connecting your services—when it fails, partitioned components must decide independently how to proceed.
Service failures test the assumption that dependencies will be available when needed. Every external call—to a database, cache, API, or microservice—is a potential failure point that your error handling must address.
Resource exhaustion tests the assumption that infrastructure resources are abundant. Modern systems are designed to scale, but resource limits—memory, CPU, disk, file descriptors—impose hard boundaries that can cause unexpected failures.
Clock/time failures test the assumption that time is consistent and reliable. Distributed systems often depend on time for ordering events, expiring cache entries, and coordinating distributed algorithms.
Data/state failures test the assumption that storage systems preserve and return correct data. When these assumptions break, the consequences can range from minor inconsistencies to catastrophic data loss.
Network failures are perhaps the most common category of injectable failures because networks are inherently unreliable. Even within a single data center, network issues occur regularly due to hardware failures, configuration errors, capacity constraints, and software bugs in network equipment.
Many systems handle complete network failures reasonably well—a connection that fails immediately triggers retry logic. But latency is insidious. A request that hangs for 60 seconds consumes resources, holds open connections, and blocks threads. The slow dependency often causes more damage than the dead one. Always test latency injection, not just connection failures.
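On Linux, the standard way to inject latency is the kernel's netem queueing discipline driven by tc. The wrapper below is a hedged sketch: it assumes root privileges, the iproute2 `tc` binary, and that the interface you pass actually carries the traffic you want to slow down.

```python
import subprocess


def inject_latency(interface: str, delay_ms: int, jitter_ms: int = 0) -> None:
    """Add a fixed delay (plus optional jitter) to all egress traffic on `interface`.
    Requires root and Linux's iproute2 `tc` tool."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        cmd.append(f"{jitter_ms}ms")
    subprocess.run(cmd, check=True)


def clear_latency(interface: str) -> None:
    """Remove the netem qdisc, restoring normal traffic. Always run this in rollback."""
    subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


# Hypothetical usage: add 200ms ± 50ms to everything leaving eth0, then clean up.
# inject_latency("eth0", delay_ms=200, jitter_ms=50)
# ... observe timeouts, thread pools, and connection counts ...
# clear_latency("eth0")
```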
Service failures target the application layer—the actual software components that make up your distributed system. While network failures test infrastructure, service failures test application logic and error handling code paths.
The Hierarchy of Service Failures
Service failures can be organized by their scope and detectability:
| Scope | Detectability | Example | Handling Complexity |
|---|---|---|---|
| Single instance | High | One pod crashes | Low—orchestrator restarts |
| All instances | High | Entire service down | Medium—circuit breaker triggers |
| Partial instances | Medium | 50% of pods degraded | High—load balancer health checks may not detect |
| Intermittent | Low | Occasional request failures | Very high—hard to reproduce, debug |
| Semantic | Very Low | Wrong data returned | Extremely high—may not be detected at all |
The most dangerous failures are those with low detectability. Your monitoring might show all services as healthy while users experience consistent errors from the subset of traffic routed to degraded instances.
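A simple way to exercise several rows of this table in application code is a wrapper that makes a dependency call fail loudly for some requests and return a wrong-but-plausible result for others. The decorator below is an illustrative sketch; the rates, the exception type, and the `get_price` example are assumptions.

```python
import functools
import random


class InjectedFault(Exception):
    """Raised instead of a real dependency error during an experiment."""


def inject_failures(error_rate: float = 0.0, wrong_rate: float = 0.0, wrong_result=None):
    """Fail `error_rate` of calls with an exception (high detectability) and silently
    return `wrong_result` for `wrong_rate` of calls (the low-detectability semantic case)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            roll = random.random()
            if roll < error_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            if roll < error_rate + wrong_rate:
                return wrong_result          # no exception, no log line: just wrong data
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical usage: 5% of price lookups error out, another 1% silently return 0.0.
@inject_failures(error_rate=0.05, wrong_rate=0.01, wrong_result=0.0)
def get_price(sku: str) -> float:
    return 19.99   # stand-in for a real downstream call
```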
Resource exhaustion failures are among the most insidious because they rarely break anything immediately. Instead, they cause gradual degradation that compounds over time, making them difficult to detect and diagnose before serious problems emerge.
Resource exhaustion failures rarely occur in isolation. Memory pressure leads to increased garbage collection, which consumes CPU. CPU saturation leads to slower request processing, which increases queue depths and connection counts. Disk I/O saturation leads to write stalls, which cause requests to back up. Understanding these cascades is essential for interpreting chaos experiment results.
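These cascades are easy to reproduce with crude pressure generators. The helpers below are an illustrative sketch for isolated environments only: they pin CPU cores with busy-looping worker processes and hold a memory ballast, letting you watch garbage collection time, queue depth, and latency respond. The core counts and sizes are placeholder values.

```python
import multiprocessing
import time


def _burn(deadline: float) -> None:
    # Busy-loop on one core until the deadline passes.
    while time.monotonic() < deadline:
        pass


def saturate_cpu(cores: int, duration_s: float) -> None:
    """Hold roughly `cores` CPUs at 100% for `duration_s` seconds."""
    deadline = time.monotonic() + duration_s
    workers = [multiprocessing.Process(target=_burn, args=(deadline,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


def hold_memory(megabytes: int, duration_s: float) -> None:
    """Allocate and hold a memory ballast, then release it."""
    ballast = bytearray(megabytes * 1024 * 1024)
    time.sleep(duration_s)
    del ballast


if __name__ == "__main__":
    saturate_cpu(cores=2, duration_s=30)        # watch run-queue length and request latency
    hold_memory(megabytes=512, duration_s=30)   # watch GC activity, swap, and OOM scores
```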
Time-related failures are among the most subtle and dangerous because distributed systems often make implicit assumptions about time that developers don't even realize exist. When clocks drift or jump, these hidden assumptions can cause surprising and difficult-to-debug failures.
Why Clock Failures Matter
Many distributed systems depend on time for correctness:
Cache entries, sessions, and tokens expire based on timestamps and TTLs.
Event ordering, log merging, and conflict resolution frequently rely on timestamps.
Distributed locks, leases, and leader election assume nodes agree on how much time has passed.
Timeout-based failure detection assumes local clocks advance at a steady rate.
These dependencies are often implicit in library code or infrastructure components, making them easy to overlook during system design.
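A few lines of code make the hidden time dependency visible. The sketch below simulates skew with an in-process offset rather than touching the real system clock; the token-expiry check and the 300-second skew value are illustrative assumptions.

```python
import time

CLOCK_SKEW_S = 0.0   # injected offset; a real experiment might skew the node's clock or break NTP


def skewed_now() -> float:
    """Stand-in for time.time() on a node whose clock has drifted."""
    return time.time() + CLOCK_SKEW_S


def is_token_valid(issued_at: float, ttl_s: float) -> bool:
    """Naive expiry check that silently trusts the local clock."""
    return skewed_now() - issued_at < ttl_s


issued_at = time.time()                       # issued by a server with a correct clock
print(is_token_valid(issued_at, ttl_s=60))    # True: clocks agree

CLOCK_SKEW_S = 300.0                          # validating node runs 5 minutes fast
print(is_token_valid(issued_at, ttl_s=60))    # False: valid tokens rejected, requests fail

CLOCK_SKEW_S = -300.0                         # validating node runs 5 minutes slow
print(is_token_valid(issued_at, ttl_s=60))    # stays True well past expiry: a security window opens
```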
Data and state failures are the most severe category because they can result in permanent data loss or corruption. These failures test the most fundamental assumptions about storage system behavior—that writes persist and reads return the correct data.
Data failure injection requires extreme caution. Unlike network or service failures, data corruption can be permanent if not carefully isolated. Always conduct data failure experiments against isolated test data, never production data. Ensure you have verified backup and recovery procedures before experimenting with any form of data failure.
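With those precautions in place, stale reads are among the safer data failures to simulate, because the injector sits in front of an isolated test store rather than mutating real data. The class below is a hypothetical sketch: it freezes a snapshot to stand in for a lagging replica and serves it for a fraction of reads.

```python
import copy
import random


class StaleReadInjector:
    """Wraps a dict-like *test* store and answers some reads from a frozen snapshot,
    simulating a lagging replica. Never point this at production data."""

    def __init__(self, store: dict, stale_read_rate: float = 0.2):
        self._store = store
        self._snapshot = copy.deepcopy(store)   # the 'replica' stops receiving updates here
        self._rate = stale_read_rate

    def put(self, key, value):
        self._store[key] = value                # writes reach the 'primary' only

    def get(self, key):
        source = self._snapshot if random.random() < self._rate else self._store
        return source.get(key)


# Hypothetical usage: does the application detect or tolerate reading its own writes late?
kv = StaleReadInjector({"balance:alice": 100}, stale_read_rate=0.5)
kv.put("balance:alice", 50)
print([kv.get("balance:alice") for _ in range(5)])   # mixture of 100 (stale) and 50 (fresh)
```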
The true value of understanding failure taxonomies becomes clear when you analyze past production incidents. Almost every outage can be traced to one or more failure categories from our taxonomy. By mapping incidents to failure types, you can identify which categories pose the greatest risk to your specific systems and prioritize your chaos engineering efforts accordingly.
| Incident | Primary Failure Type | Secondary Failure | What Chaos Testing Would Have Revealed |
|---|---|---|---|
| AWS us-east-1 2011 | Resource Exhaustion | Network Partition | EBS control plane couldn't handle failover traffic volume |
| GitHub 2018 split-brain | Network Partition | Data State Failure | MySQL failover left replicas with conflicting data |
| Cloudflare 2019 | Resource Exhaustion | CPU Saturation | Regex in WAF rule caused catastrophic backtracking |
| Facebook 2021 | Network Failure | DNS Failure | BGP withdrawal made DNS servers unreachable |
| Knight Capital 2012 | Service Failure | Partial Deployment | Old code on subset of servers caused $440M loss in 45 minutes |
| Leap second bugs | Clock/Time | Time Jump | Many systems couldn't handle 61-second minutes |
Creating Your Failure Risk Profile
Not all failure types pose equal risk to all systems. A system heavily dependent on distributed coordination is more vulnerable to clock skew than a simple stateless web server. A system with complex service dependencies is more vulnerable to cascade failures than a monolith.
To create your failure risk profile:
Analyze your architecture — What assumptions does each component make? What dependencies does it have? Where are the single points of failure?
Review past incidents — What failure types have caused outages before? These are proven risk areas.
Assess consequence severity — For each failure type, what's the business impact? Data loss is usually worse than temporary unavailability.
Evaluate detection capability — How quickly would you detect each failure type? Low-detectability failures deserve more attention.
Prioritize experiments — Focus chaos engineering efforts on high-risk, high-impact, low-visibility failure types first; the scoring sketch after this list shows one simple way to rank categories.
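One lightweight way to turn these steps into an ordered experiment backlog is a score per failure category. The sketch below is an assumption-laden illustration: the 1-5 scales, the formula, and the example numbers are placeholders to be replaced with your own assessment.

```python
from dataclasses import dataclass


@dataclass
class FailureRisk:
    category: str
    likelihood: int     # 1-5, from architecture review and incident history
    impact: int         # 1-5, business consequence if it happens
    detectability: int  # 1-5, where 5 means you would notice immediately

    @property
    def priority(self) -> float:
        # Higher likelihood and impact raise priority; strong detection lowers it.
        return self.likelihood * self.impact / self.detectability


risks = [
    FailureRisk("Network partition",   likelihood=4, impact=4, detectability=3),
    FailureRisk("Clock skew",          likelihood=2, impact=5, detectability=1),
    FailureRisk("Disk full",           likelihood=3, impact=3, detectability=4),
    FailureRisk("Dependency slowdown", likelihood=4, impact=4, detectability=2),
]

for risk in sorted(risks, key=lambda r: r.priority, reverse=True):
    print(f"{risk.category:20s} priority={risk.priority:.1f}")
```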
We've established a comprehensive taxonomy of injectable failures, organized into five major categories that span the entire distributed systems stack. This taxonomy provides the foundation for systematic chaos engineering—ensuring you test all the ways your system can fail, not just the obvious ones.
What's Next:
With the complete taxonomy established, we'll now deep-dive into the most common and impactful category: Network Failures. You'll learn specific techniques for injecting latency, packet loss, partitions, and DNS failures, along with the observable effects each produces and what they reveal about your system's resilience.
You now understand the complete taxonomy of injectable failures in distributed systems. This framework will guide your chaos engineering efforts, ensuring comprehensive coverage across all failure modes. Next, we'll explore network failure injection in depth.