The second principle of chaos engineering might sound simple: Vary real-world events. But within this principle lies one of the deepest challenges of the practice. What constitutes a "real-world event"? How do you decide which failures to simulate? And critically, how do you avoid the trap of only testing for failures you've already imagined?
The naive approach—flipping random switches and seeing what breaks—fails spectacularly. It generates noise without insight, creates incidents without learning, and burns organizational trust without building resilience. The principled approach requires understanding the taxonomy of failures, prioritizing based on likelihood and impact, and designing experiments that reveal systemic weaknesses rather than just confirming known problems.
This page teaches you to think like a chaos engineer: systematically identifying the real-world events that could compromise your system, then designing experiments that test your hypotheses about how the system should respond.
By the end of this page, you will understand the taxonomy of real-world failures, how to prioritize which failures to simulate, the difference between fault injection and realistic event simulation, how to model failures at different system layers, and techniques for discovering failure modes you haven't yet considered.
Before injecting chaos, you need a mental model of what kinds of failures actually occur in distributed systems. Production failures don't arrive labeled and categorized—they emerge from complex interactions. But we can organize them into a taxonomy that helps ensure comprehensive coverage.
The four domains of failure:
Failures in distributed systems generally fall into four domains, each requiring different simulation approaches:
| Domain | Examples | Characteristics | Detection Difficulty |
|---|---|---|---|
| Infrastructure | Server crash, disk failure, power outage, hypervisor issues | Binary (working or not), typically fast detection | Easy—monitoring sees it immediately |
| Network | Partition, latency, packet loss, DNS failure, MTU issues | Often partial or intermittent, affects subsets of traffic | Medium—may appear as application errors |
| Application | Memory leaks, deadlocks, resource exhaustion, logic bugs | Gradual degradation, often time-dependent | Hard—requires deep observability |
| Dependency | Third-party API failure, upstream service degradation, data corruption from partner | Outside your control, often unexpected | Variable—depends on integration monitoring |
Why this taxonomy matters:
Each domain requires different injection techniques and reveals different resilience gaps:
Infrastructure failures test your basic redundancy and failover mechanisms. These are the 'table stakes' of distributed systems resilience—if you can't survive a server crash, nothing else matters.
Network failures expose assumptions about connectivity. Many applications are written assuming reliable networks, and even those designed for unreliable networks often have edge cases that only appear under partial connectivity.
Application failures reveal problems that can't be solved by adding redundancy. A bug that corrupts data will corrupt that data across all your replicas. Memory leaks affect all instances running the same code.
Dependency failures test your isolation strategies. How much does your system degrade when services you don't control behave unexpectedly?
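To make the taxonomy actionable, it helps to tag every planned experiment with its domain so coverage gaps stand out. The sketch below is a minimal, hypothetical illustration (the enum, dataclass, and catalog entries are placeholders, not part of any particular tool):

```python
from dataclasses import dataclass
from enum import Enum


class FailureDomain(Enum):
    INFRASTRUCTURE = "infrastructure"
    NETWORK = "network"
    APPLICATION = "application"
    DEPENDENCY = "dependency"


@dataclass
class ExperimentIdea:
    name: str
    domain: FailureDomain
    description: str


# A small, illustrative catalog; a real one comes from your own
# architecture review and incident history.
catalog = [
    ExperimentIdea("instance-termination", FailureDomain.INFRASTRUCTURE,
                   "Terminate one instance behind the load balancer"),
    ExperimentIdea("cross-az-latency", FailureDomain.NETWORK,
                   "Add 100ms latency to cross-AZ traffic"),
    ExperimentIdea("memory-pressure", FailureDomain.APPLICATION,
                   "Consume 80% of heap on one instance"),
    ExperimentIdea("payment-api-errors", FailureDomain.DEPENDENCY,
                   "Return 503 for 10% of payment-provider calls"),
]

# Quick coverage check: which domains have no planned experiment?
covered = {idea.domain for idea in catalog}
uncovered = [d.name for d in FailureDomain if d not in covered]
print("Uncovered domains:", uncovered or "none")
```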
While infrastructure failures get the most attention in chaos engineering discussions, real-world incidents are far more likely to stem from application and dependency issues. A study of postmortems at large tech companies found that configuration changes, dependency failures, and software bugs cause significantly more outages than hardware failures. Design your chaos experiments accordingly.
Real-world failures rarely manifest as clean, complete outages. More often, they're partial, gradual, or intermittent. Effective chaos engineering must model this complexity.
Failure severity itself spans a spectrum, from complete outages through partial degradation to subtle misbehavior that standard monitoring misses.
Network failure patterns deserve special attention:
Network issues are particularly insidious because they come in many forms, each requiring different handling:
Full partition: Complete loss of connectivity between components. Easy to detect, hard to resolve without redundant paths.
Asymmetric partition: A can reach B, but B cannot reach A. This breaks many distributed protocols that assume symmetric connectivity.
Partial packet loss: 10-50% of packets dropped. TCP handles this through retransmission, but at high loss rates, throughput collapses.
Latency injection: Artificially adding delay to packets. Even small delays (50-100ms) can devastate systems designed for low-latency operation.
Bandwidth restriction: Throttling throughput to simulate congested links. Reveals assumptions about available bandwidth.
TCP issues: Connection resets, FIN storms, SYN floods. These lower-level issues often bypass application-level error handling.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
import random


class FailurePattern(Enum):
    CONSTANT = "constant"        # Steady failure rate
    BURST = "burst"              # Periodic spikes of failure
    PROGRESSIVE = "progressive"  # Gradually worsening
    CHAOTIC = "chaotic"          # Random, unpredictable


@dataclass
class NetworkFailureModel:
    """
    Models realistic network failure characteristics for chaos experiments.

    Real failures rarely follow simple patterns - this class captures
    the complexity of production network issues.
    """

    # What type of failure to simulate
    failure_type: str  # "latency", "packet_loss", "partition", "bandwidth"

    # The pattern of failure over time
    pattern: FailurePattern

    # Severity parameters (interpretation depends on failure_type)
    # For latency: additional ms delay
    # For packet_loss: percentage 0-100
    # For partition: N/A (binary)
    # For bandwidth: max bytes/second
    base_severity: float
    max_severity: float

    # Which components/routes are affected
    # Realistic failures rarely affect all traffic uniformly
    affected_routes: List[str]  # e.g., ["service-a->service-b", "service-a->database"]

    # Bidirectional or asymmetric?
    symmetric: bool = True

    # Time characteristics
    duration_seconds: int = 60
    ramp_up_seconds: int = 0  # For progressive pattern

    def get_severity_at_time(self, elapsed_seconds: float) -> float:
        """
        Calculate failure severity at a given point in time.
        This models realistic failure progression.
        """
        if self.pattern == FailurePattern.CONSTANT:
            return self.base_severity

        elif self.pattern == FailurePattern.BURST:
            # Bursts every 10 seconds lasting 2 seconds
            in_burst = (elapsed_seconds % 10) < 2
            return self.max_severity if in_burst else 0

        elif self.pattern == FailurePattern.PROGRESSIVE:
            # Linear ramp from base to max over ramp_up period
            if elapsed_seconds < self.ramp_up_seconds:
                progress = elapsed_seconds / self.ramp_up_seconds
                return self.base_severity + (self.max_severity - self.base_severity) * progress
            return self.max_severity

        elif self.pattern == FailurePattern.CHAOTIC:
            # Random severity each time - models unstable network
            return random.uniform(self.base_severity, self.max_severity)

        return self.base_severity

    def should_affect_request(self, route: str, direction: str) -> bool:
        """
        Determine if a specific request should be affected by this failure.
        Enables simulation of partial failures.
        """
        if route not in self.affected_routes:
            return False
        if not self.symmetric and direction == "response":
            return False
        return True


# Example: Modeling a realistic cloud provider network issue
def create_az_network_degradation() -> NetworkFailureModel:
    """
    Models the kind of network degradation that occurs during cloud
    provider availability zone issues - progressive latency increase
    affecting cross-AZ traffic.
    """
    return NetworkFailureModel(
        failure_type="latency",
        pattern=FailurePattern.PROGRESSIVE,
        base_severity=10,   # Start with 10ms added latency
        max_severity=500,   # Ramp to 500ms
        affected_routes=[
            "us-east-1a->us-east-1b",
            "us-east-1a->us-east-1c"
        ],
        symmetric=False,    # Often asymmetric in real incidents
        duration_seconds=300,
        ramp_up_seconds=60  # Takes 1 minute to reach full severity
    )


def create_partial_packet_loss() -> NetworkFailureModel:
    """
    Models intermittent packet loss that degrades TCP performance
    without causing complete connection failures.
    """
    return NetworkFailureModel(
        failure_type="packet_loss",
        pattern=FailurePattern.BURST,
        base_severity=0,   # No loss normally
        max_severity=30,   # 30% loss during bursts
        affected_routes=["*->database-primary"],
        symmetric=True,
        duration_seconds=120
    )
```

The universe of possible failures is infinite. You can't test them all. The art of chaos engineering lies in prioritization—focusing on scenarios that provide the highest learning-to-risk ratio.
The prioritization framework:
Consider each potential chaos experiment along two dimensions:
Likelihood: How probable is this failure in production? Failures that have already happened are more likely to recur. Failures that industry peers report are worth attention.
Impact: If this failure occurs, how severe is the consequence? Total outage is worse than degraded performance. Customer-facing impact is worse than internal-only impact.
Plot potential experiments on a likelihood-impact matrix to identify priorities:
| Likelihood | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Likelihood | Test, but not critical | High priority | CRITICAL - test immediately |
| Medium Likelihood | Lower priority | Test after high-priority items | High priority |
| Low Likelihood | Skip unless low-cost to test | Consider for advanced program | Test if cost-effective |
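In code, this matrix reduces to a simple scoring exercise. The sketch below is a minimal illustration (the ordinal scores, weighting, and example candidates are assumptions you would tune to your own risk tolerance):

```python
from dataclasses import dataclass

# Ordinal scores for the two matrix dimensions (1 = low, 3 = high).
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"low": 1, "medium": 2, "high": 3}


@dataclass
class CandidateExperiment:
    name: str
    likelihood: str  # "low" | "medium" | "high"
    impact: str      # "low" | "medium" | "high"

    @property
    def priority_score(self) -> int:
        # Simple product; weight impact more heavily if your risk
        # tolerance demands it.
        return LIKELIHOOD[self.likelihood] * IMPACT[self.impact]


candidates = [
    CandidateExperiment("single instance termination", "high", "medium"),
    CandidateExperiment("cross-region network partition", "low", "high"),
    CandidateExperiment("dependency returns intermittent 500s", "high", "high"),
    CandidateExperiment("disk-full on a worker node", "medium", "low"),
]

# Highest-scoring experiments are run first.
for exp in sorted(candidates, key=lambda e: e.priority_score, reverse=True):
    print(f"{exp.priority_score}  {exp.name}  ({exp.likelihood}/{exp.impact})")
```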
Sources of prioritization intelligence:
To accurately assess likelihood and impact, gather data from multiple sources:
Historical incident data: What failures have actually occurred? How often? With what impact? Past incidents are the strongest predictor of future incidents.
Architecture analysis: Where are the single points of failure? Which dependencies are most critical? Architectural review reveals theoretical weak points.
Industry postmortems: Companies like Google, Facebook, and Amazon publish detailed incident reports. Learn from others' failures.
Dependency risk assessment: Third-party services and open-source components have their own failure histories. Review their status pages and incident reports.
Team intuition: Experienced engineers often have gut feelings about which components are 'scary.' These intuitions often reflect real risk.
Observability data: Metrics that show high variance or frequent alerts may indicate components under stress.
Teams sometimes want to start with exciting scenarios: simultaneous multi-region failure, database corruption, coordinated attacks. These are important eventually, but extraordinarily unlikely to be your next production incident. Master the common failures before tackling the exotic ones. If you can't survive a single instance termination, you're not ready for multi-region chaos.
Early chaos engineering focused heavily on binary failures—server up or down, service available or unavailable. But production incidents increasingly involve gray failures: situations where components are technically operational but behaving badly.
The gray failure problem:
A server returning HTTP 200 on health checks while silently dropping 50% of requests. A database query that works for small datasets but times out on production-scale data. A service that performs well under normal load but collapses under burst traffic.
These failures are particularly dangerous because they slip past the defenses built for binary failures: health checks keep passing, so load balancers keep routing traffic to degraded instances, and monitoring tuned to detect complete outages never fires.
```typescript
import { Request, Response, NextFunction } from 'express';

/**
 * Gray failure injection middleware for Express applications.
 * Simulates partial failures that pass health checks but degrade service.
 */

interface GrayFailureConfig {
  // Percentage of requests to affect (0-100)
  affectedPercentage: number;

  // Types of gray failure to inject
  failureType: 'slow' | 'error' | 'corrupt' | 'timeout';

  // Latency to add for 'slow' type (ms)
  additionalLatencyMs?: number;

  // Error code to return for 'error' type
  errorCode?: number;

  // Routes to affect (or '*' for all non-health routes)
  affectedRoutes: string[] | '*';

  // Routes that should NEVER be affected (health checks)
  protectedRoutes: string[];
}

const createGrayFailureMiddleware = (config: GrayFailureConfig) => {
  return async (req: Request, res: Response, next: NextFunction) => {
    // Never affect protected routes (health checks)
    if (config.protectedRoutes.some(route => req.path.includes(route))) {
      return next();
    }

    // Check if this route should be affected
    const shouldAffectRoute =
      config.affectedRoutes === '*' ||
      config.affectedRoutes.some(route => req.path.includes(route));

    if (!shouldAffectRoute) {
      return next();
    }

    // Probabilistically inject failure
    if (Math.random() * 100 > config.affectedPercentage) {
      return next(); // This request escapes unaffected
    }

    // Inject the configured failure
    switch (config.failureType) {
      case 'slow':
        // Add artificial latency
        await sleep(config.additionalLatencyMs || 5000);
        return next();

      case 'error':
        // Return error response
        return res.status(config.errorCode || 500).json({
          error: 'Internal server error',
          // Real errors often have misleading messages
          message: 'Request processed successfully' // Intentionally wrong
        });

      case 'corrupt':
        // Let request process, but corrupt the response
        const originalJson = res.json.bind(res);
        res.json = (body: any) => {
          // Subtle corruption - might go unnoticed
          if (typeof body === 'object' && body !== null) {
            return originalJson({
              ...body,
              // Random field corruption
              ...(body.userId && { userId: body.userId + 1 }),
              ...(body.amount && { amount: body.amount * 0.99 }),
            });
          }
          return originalJson(body);
        };
        return next();

      case 'timeout':
        // Never respond - connection eventually times out
        // The health check isn't affected, so this instance
        // keeps receiving traffic
        return; // Intentionally not calling next() or sending response

      default:
        return next();
    }
  };
};

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// Example configuration: 20% of requests to /api/orders are slow
const orderSlownessConfig: GrayFailureConfig = {
  affectedPercentage: 20,
  failureType: 'slow',
  additionalLatencyMs: 3000,
  affectedRoutes: ['/api/orders', '/api/checkout'],
  protectedRoutes: ['/health', '/ready', '/metrics']
};

// Example configuration: 5% of responses have subtle data corruption
const dataCorruptionConfig: GrayFailureConfig = {
  affectedPercentage: 5,
  failureType: 'corrupt',
  affectedRoutes: '*',
  protectedRoutes: ['/health', '/ready', '/metrics', '/api/auth']
};

export { createGrayFailureMiddleware, GrayFailureConfig };
```

Gray failure experiments often reveal that current monitoring is insufficient. If you inject 20% error rates and your dashboard shows 100% success, that's a major finding—your observability has blind spots. Many organizations discover through chaos that their monitoring only catches complete outages, not partial degradation.
Many real-world failures only manifest under specific conditions: high load, resource constraints, or time-sensitive situations. Effective chaos engineering must simulate these conditions, not just component failures.
Load-dependent failures:
Systems often work perfectly at 1x load but fail spectacularly at 3x load. This isn't just about capacity—it reveals algorithmic inefficiencies (O(n²) operations that were fine for n=1000 but fail at n=10000), resource exhaustion patterns, and contention issues.
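A load-ramp probe makes these inflection points visible. The sketch below is a rough, hypothetical harness (the target URL, concurrency numbers, and percentile math are placeholders; point it only at a test environment):

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target; use a test environment, never production,
# without the safeguards discussed later in this page.
TARGET_URL = "http://localhost:8080/api/orders"
BASELINE_CONCURRENCY = 10


def hit_endpoint(_: int) -> float:
    """Issue one request and return its latency in seconds (inf on error)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5):
            return time.monotonic() - start
    except Exception:
        return float("inf")


def run_load_step(multiplier: int, requests_per_worker: int = 20) -> None:
    """Run one load step at N times baseline concurrency and report results."""
    workers = BASELINE_CONCURRENCY * multiplier
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(hit_endpoint, range(workers * requests_per_worker)))
    ok = [l for l in latencies if l != float("inf")]
    error_rate = 1 - len(ok) / len(latencies)
    p99 = sorted(ok)[int(len(ok) * 0.99) - 1] if ok else float("inf")
    print(f"{multiplier}x load: error_rate={error_rate:.1%}, p99={p99:.3f}s")


# Step through 1x, 2x, 3x load and watch where latency or errors inflect.
for m in (1, 2, 3):
    run_load_step(m)
```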
Time-related failures:
Some failures are inherently time-bound. Certificates expire. Tokens time out. Scheduled jobs run (or don't run). These temporal aspects require specific testing:
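Before reaching for infrastructure-level clock manipulation, you can unit-test time assumptions directly. This minimal sketch (the SessionToken type is hypothetical) shows how a token's validity window behaves when evaluated against a skewed clock:

```python
import time
from dataclasses import dataclass


@dataclass
class SessionToken:
    """Hypothetical token with an absolute validity window."""
    issued_at: float   # Unix timestamp when the token was issued
    ttl_seconds: int   # How long the token stays valid

    def is_valid(self, now: float) -> bool:
        # Valid only while 'now' falls inside [issued_at, issued_at + ttl]
        return self.issued_at <= now <= self.issued_at + self.ttl_seconds


token = SessionToken(issued_at=time.time(), ttl_seconds=300)

# Evaluate the same freshly issued token against skewed clocks:
# a backward jump makes it appear not yet valid, a forward jump expired.
for skew in (0, -600, +600):
    simulated_now = time.time() + skew
    print(f"clock skew {skew:+5d}s -> token valid: {token.is_valid(simulated_now)}")
```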
The " clock skew" category deserves special attention:
```python
import subprocess
import time
from datetime import datetime, timedelta
from contextlib import contextmanager
from typing import Generator
import logging

logger = logging.getLogger(__name__)


class ClockSkewExperiment:
    """
    Simulates clock skew between distributed system components.

    Clock skew causes subtle but severe bugs:
    - Certificate validation failures
    - Distributed lock issues
    - Event ordering problems
    - Token/session expiration issues

    WARNING: This modifies system time. Use only in isolated environments.
    """

    def __init__(self, target_host: str):
        self.target_host = target_host
        self.original_time = None

    @contextmanager
    def skewed_clock(self, skew_seconds: int) -> Generator[None, None, None]:
        """
        Context manager that skews the clock on target host for the
        duration of the context, then restores it.

        Args:
            skew_seconds: Positive for future, negative for past
        """
        try:
            self._record_original_time()
            self._apply_skew(skew_seconds)
            logger.info(
                f"Clock on {self.target_host} skewed by {skew_seconds}s"
            )
            yield
        finally:
            self._restore_time()
            logger.info(f"Clock on {self.target_host} restored")

    def _record_original_time(self):
        """Record current time for later restoration."""
        result = subprocess.run(
            ['ssh', self.target_host, 'date +%s'],
            capture_output=True, text=True
        )
        self.original_time = int(result.stdout.strip())

    def _apply_skew(self, skew_seconds: int):
        """Set system time on target host."""
        new_time = self.original_time + skew_seconds
        # Note: Requires root/sudo access on target
        subprocess.run([
            'ssh', self.target_host,
            f'sudo date -s @{new_time}'
        ])

    def _restore_time(self):
        """Restore original time on target host."""
        if self.original_time:
            # Sync with NTP
            subprocess.run([
                'ssh', self.target_host,
                'sudo systemctl restart systemd-timesyncd'
            ])


def run_clock_skew_scenarios():
    """
    Common clock skew scenarios that reveal real bugs.
    """
    scenarios = [
        {
            "name": "Backward clock jump",
            "description": "Clock suddenly moves 5 minutes into past",
            "skew_seconds": -300,
            "expected_issues": [
                "Rate limiters may over-restrict",
                "Recently issued tokens appear expired",
                "Distributed locks may conflict"
            ]
        },
        {
            "name": "Forward clock jump",
            "description": "Clock suddenly moves 5 minutes into future",
            "skew_seconds": 300,
            "expected_issues": [
                "Certificates may appear expired",
                "Scheduled jobs may fire early",
                "Cache TTLs may expire prematurely"
            ]
        },
        {
            "name": "Gradual drift",
            "description": "Clock slowly drifts over time",
            "skew_seconds": 60,  # Applied incrementally
            "expected_issues": [
                "Distributed consensus may fail",
                "Log ordering becomes incorrect",
                "Database replication may lag"
            ]
        },
        {
            "name": "Cross-zone skew",
            "description": "Different AZs have different times",
            "skew_seconds": 30,  # On subset of nodes
            "expected_issues": [
                "Leader election instability",
                "Partition detection false positives",
                "Event ordering inconsistencies"
            ]
        }
    ]

    return scenarios
```

The most damaging production incidents aren't simple component failures—they're cascading failures where one failure triggers a chain reaction that brings down seemingly unrelated systems. Simulating these cascades requires thinking about failure propagation paths.
Common cascade patterns include retry storms that amplify load, connection pool exhaustion that propagates latency upstream, and circuit breakers that fail to open in time; the experiment definition below traces one such chain from a single database slowdown to the circuit breakers that should contain it.
Designing cascade experiments:
Simulating cascades requires multi-component experiments:
```yaml
# Cascade Failure Experiment Definition
name: "Database Slowdown Cascade"
description: |
  Simulates a cascade where database latency increase causes
  connection pool exhaustion, leading to service failures,
  creating retry storms, and potentially overwhelming
  upstream load balancers.

trigger:
  component: database-primary
  failure_type: latency_injection
  severity: 500ms  # 10x normal latency
  duration: 5m

expected_propagation:
  - step: 1
    component: database-connection-pools
    effect: "Connections held 10x longer, pool usage increases"
    time_range: "0-30s"
    metric: "connection_pool_utilization"
    expected_value: ">80%"

  - step: 2
    component: api-services
    effect: "Requests wait for connections, latency increases"
    time_range: "30-60s"
    metric: "api_p99_latency_ms"
    expected_value: ">2000"

  - step: 3
    component: client-applications
    effect: "Timeouts trigger retries, amplifying load"
    time_range: "60-120s"
    metric: "request_rate"
    expected_value: "+50% over baseline"

  - step: 4
    component: load-balancer
    effect: "Circuit breaker should engage"
    time_range: "120s+"
    metric: "circuit_breaker_state"
    expected_value: "OPEN"

containment_hypothesis:
  description: |
    The cascade should be contained by circuit breakers opening
    at step 4. Services behind open circuits should return
    fallback responses. Database should recover as retry
    pressure drops.
  success_criteria:
    - "No complete service outage"
    - "Circuit breakers open within 120s"
    - "Error rate at load balancer < 50%"
    - "Recovery within 60s of trigger removal"

abort_conditions:
  - "Error rate > 90% for > 30s"
  - "Complete loss of service availability"
  - "Customer impact detected via SLI breach"

rollback:
  action: "Remove latency injection"
  verification: "Confirm database latency returns to baseline"
  expected_recovery_time: 60s
```

Cascade experiments are inherently risky because they can spiral beyond expectations. Start with short durations and have abort conditions ready. Monitor closely. Many teams discover that their cascade containment mechanisms don't work as expected—which is exactly why these experiments are valuable.
Modern systems depend on services outside your control: cloud provider APIs, payment processors, authentication providers, CDNs, third-party data feeds. These dependencies introduce failure modes you can't prevent—only prepare for.
Simulating external failures:
You typically can't inject failure into third-party services directly. Instead, you simulate their failure at the boundary:
| External Service | Failure Pattern | Simulation Technique | What It Reveals |
|---|---|---|---|
| Payment Processor | 5-second response times | Proxy adds latency to Stripe/Braintree endpoints | Checkout timeout handling, retry behavior |
| Auth Provider | Intermittent 500 errors | Proxy returns errors for 30% of OAuth requests | Login degradation, session handling |
| CDN | Complete unavailability | Block CDN domain at DNS level | Asset fallbacks, loading behavior |
| Email Service | Rate limiting | Mock returns 429 after N requests | Queue handling, backpressure |
| Cloud APIs | Partial region failure | Block specific AWS/GCP/Azure endpoints | Multi-region fallback behavior |
```python
from mitmproxy import http
import random
import time
from typing import Optional, Dict
import json
import logging

logger = logging.getLogger(__name__)


class ExternalDependencyFailureInjector:
    """
    mitmproxy addon that injects failures for external service simulation.

    Use with: mitmproxy -s external-dependency-proxy.py

    This allows simulating third-party service failures without
    modifying those services or fully mocking them.
    """

    def __init__(self):
        self.failure_configs: Dict[str, dict] = {
            # Stripe API simulation
            "api.stripe.com": {
                "enabled": True,
                "latency_ms": 5000,    # Add 5s latency
                "error_rate": 0.1,     # 10% errors
                "error_code": 503,
                "timeout_rate": 0.05,  # 5% complete timeouts
            },
            # Auth0 simulation
            "*.auth0.com": {
                "enabled": True,
                "latency_ms": 0,
                "error_rate": 0.2,     # 20% errors - auth degradation
                "error_code": 500,
                "timeout_rate": 0.0,
            },
            # AWS S3 simulation
            "*.s3.amazonaws.com": {
                "enabled": True,
                "latency_ms": 200,
                "error_rate": 0.05,
                "error_code": 503,
                "timeout_rate": 0.0,
                # Rate limit simulation
                "rate_limit_after": 100,  # Requests per minute
                "rate_limit_error": 429,
            }
        }
        self.request_counts: Dict[str, int] = {}

    def request(self, flow: http.HTTPFlow) -> None:
        """Called for each request through the proxy."""
        host = flow.request.pretty_host
        config = self._get_config_for_host(host)

        if not config or not config.get("enabled"):
            return

        # Simulate latency
        if config.get("latency_ms", 0) > 0:
            time.sleep(config["latency_ms"] / 1000.0)
            logger.info(f"Injected {config['latency_ms']}ms latency for {host}")

        # Simulate timeout (don't respond at all)
        if random.random() < config.get("timeout_rate", 0):
            flow.kill()
            logger.info(f"Simulated timeout for {host}")
            return

        # Check rate limiting
        self.request_counts[host] = self.request_counts.get(host, 0) + 1
        if self.request_counts.get(host, 0) > config.get("rate_limit_after", float('inf')):
            flow.response = http.Response.make(
                config.get("rate_limit_error", 429),
                json.dumps({"error": "Rate limit exceeded"}),
                {"Content-Type": "application/json"}
            )
            logger.info(f"Rate limited request to {host}")
            return

        # Simulate random errors
        if random.random() < config.get("error_rate", 0):
            flow.response = http.Response.make(
                config.get("error_code", 500),
                json.dumps({
                    "error": "Service temporarily unavailable",
                    "status": "error"
                }),
                {"Content-Type": "application/json"}
            )
            logger.info(f"Injected error for {host}")
            return

    def _get_config_for_host(self, host: str) -> Optional[dict]:
        """Find matching configuration for host, supporting wildcards."""
        # Exact match
        if host in self.failure_configs:
            return self.failure_configs[host]

        # Wildcard match
        for pattern, config in self.failure_configs.items():
            if pattern.startswith("*."):
                domain = pattern[2:]
                if host.endswith(domain):
                    return config

        return None


# Create the addon instance
addons = [ExternalDependencyFailureInjector()]
```

Be especially careful with payment processor and billing service chaos. Use sandbox/test environments and test API keys. Never inject failures that could affect real financial transactions. Many companies have separate chaos engineering environments specifically for testing payment flows.
The second principle of chaos engineering—varying real-world events—transforms abstract resilience goals into concrete experiments. The key insight is that effective chaos comes from systematically cataloging, prioritizing, and realistically modeling failures, not from random disruption.
What's next:
With hypotheses formed and failure scenarios designed, we're ready to tackle the third principle: Run Experiments in Production. This is where chaos engineering gets real—and where the careful preparation from the first two principles pays dividends.
You now understand the taxonomy of failures, how to model realistic failure patterns, prioritization frameworks, and techniques for simulating everything from gray failures to cascading failures to external dependency issues. Next, we'll explore how to safely run these experiments in production environments.