The second principle of chaos engineering might sound simple: Vary real-world events. But within this principle lies one of the deepest challenges of the practice. What constitutes a "real-world event"? How do you decide which failures to simulate? And critically, how do you avoid the trap of only testing for failures you've already imagined?
The naive approach—flipping random switches and seeing what breaks—fails spectacularly. It generates noise without insight, creates incidents without learning, and burns organizational trust without building resilience. The principled approach requires understanding the taxonomy of failures, prioritizing based on likelihood and impact, and designing experiments that reveal systemic weaknesses rather than just confirming known problems.
This page teaches you to think like a chaos engineer: systematically identifying the real-world events that could compromise your system, then designing experiments that test your hypotheses about how the system should respond.
By the end of this page, you will understand the taxonomy of real-world failures, how to prioritize which failures to simulate, the difference between fault injection and realistic event simulation, how to model failures at different system layers, and techniques for discovering failure modes you haven't yet considered.
Before injecting chaos, you need a mental model of what kinds of failures actually occur in distributed systems. Production failures don't arrive labeled and categorized—they emerge from complex interactions. But we can organize them into a taxonomy that helps ensure comprehensive coverage.
The four domains of failure:
Failures in distributed systems generally fall into four domains, each requiring different simulation approaches:
| Domain | Examples | Characteristics | Detection Difficulty |
|---|---|---|---|
| Infrastructure | Server crash, disk failure, power outage, hypervisor issues | Binary (working or not), typically fast detection | Easy—monitoring sees it immediately |
| Network | Partition, latency, packet loss, DNS failure, MTU issues | Often partial or intermittent, affects subsets of traffic | Medium—may appear as application errors |
| Application | Memory leaks, deadlocks, resource exhaustion, logic bugs | Gradual degradation, often time-dependent | Hard—requires deep observability |
| Dependency | Third-party API failure, upstream service degradation, data corruption from partner | Outside your control, often unexpected | Variable—depends on integration monitoring |
Why this taxonomy matters:
Each domain requires different injection techniques and reveals different resilience gaps:
Infrastructure failures test your basic redundancy and failover mechanisms. These are the 'table stakes' of distributed systems resilience—if you can't survive a server crash, nothing else matters.
Network failures expose assumptions about connectivity. Many applications are written assuming reliable networks, and even those designed for unreliable networks often have edge cases that only appear under partial connectivity.
Application failures reveal problems that can't be solved by adding redundancy. A bug that corrupts data will corrupt that data across all your replicas. Memory leaks affect all instances running the same code.
Dependency failures test your isolation strategies. How much does your system degrade when services you don't control behave unexpectedly?
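To make the taxonomy actionable, it helps to tag every planned experiment with its domain so coverage gaps stand out. The sketch below is a minimal, hypothetical illustration (the enum, dataclass, and catalog entries are placeholders, not part of any particular tool):

```python
from dataclasses import dataclass
from enum import Enum


class FailureDomain(Enum):
    INFRASTRUCTURE = "infrastructure"
    NETWORK = "network"
    APPLICATION = "application"
    DEPENDENCY = "dependency"


@dataclass
class ExperimentIdea:
    name: str
    domain: FailureDomain
    description: str


# A small, illustrative catalog; a real one comes from your own
# architecture review and incident history.
catalog = [
    ExperimentIdea("instance-termination", FailureDomain.INFRASTRUCTURE,
                   "Terminate one instance behind the load balancer"),
    ExperimentIdea("cross-az-latency", FailureDomain.NETWORK,
                   "Add 100ms latency to cross-AZ traffic"),
    ExperimentIdea("memory-pressure", FailureDomain.APPLICATION,
                   "Consume 80% of heap on one instance"),
    ExperimentIdea("payment-api-errors", FailureDomain.DEPENDENCY,
                   "Return 503 for 10% of payment-provider calls"),
]

# Quick coverage check: which domains have no planned experiment?
covered = {idea.domain for idea in catalog}
uncovered = [d.name for d in FailureDomain if d not in covered]
print("Uncovered domains:", uncovered or "none")
```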
While infrastructure failures get the most attention in chaos engineering discussions, real-world incidents are far more likely to stem from application and dependency issues. A study of postmortems at large tech companies found that configuration changes, dependency failures, and software bugs cause significantly more outages than hardware failures. Design your chaos experiments accordingly.
Real-world failures rarely manifest as clean, complete outages. More often, they're partial, gradual, or intermittent. Effective chaos engineering must model this complexity.
Failure severity itself spans a spectrum, from complete outages through partial degradation to subtle misbehavior that standard monitoring misses.
Network failure patterns deserve special attention:
Network issues are particularly insidious because they come in many forms, each requiring different handling:
Full partition: Complete loss of connectivity between components. Easy to detect, hard to resolve without redundant paths.
Asymmetric partition: A can reach B, but B cannot reach A. This breaks many distributed protocols that assume symmetric connectivity.
Partial packet loss: 10-50% of packets dropped. TCP handles this through retransmission, but at high loss rates, throughput collapses.
Latency injection: Artificially adding delay to packets. Even small delays (50-100ms) can devastate systems designed for low-latency operation.
Bandwidth restriction: Throttling throughput to simulate congested links. Reveals assumptions about available bandwidth.
TCP issues: Connection resets, FIN storms, SYN floods. These lower-level issues often bypass application-level error handling.
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
import random


class FailurePattern(Enum):
    CONSTANT = "constant"        # Steady failure rate
    BURST = "burst"              # Periodic spikes of failure
    PROGRESSIVE = "progressive"  # Gradually worsening
    CHAOTIC = "chaotic"          # Random, unpredictable


@dataclass
class NetworkFailureModel:
    """
    Models realistic network failure characteristics for chaos experiments.

    Real failures rarely follow simple patterns - this class captures
    the complexity of production network issues.
    """

    # What type of failure to simulate
    failure_type: str  # "latency", "packet_loss", "partition", "bandwidth"

    # The pattern of failure over time
    pattern: FailurePattern

    # Severity parameters (interpretation depends on failure_type)
    # For latency: additional ms delay
    # For packet_loss: percentage 0-100
    # For partition: N/A (binary)
    # For bandwidth: max bytes/second
    base_severity: float
    max_severity: float

    # Which components/routes are affected
    # Realistic failures rarely affect all traffic uniformly
    affected_routes: List[str]  # e.g., ["service-a->service-b", "service-a->database"]

    # Bidirectional or asymmetric?
    symmetric: bool = True

    # Time characteristics
    duration_seconds: int = 60
    ramp_up_seconds: int = 0  # For progressive pattern

    def get_severity_at_time(self, elapsed_seconds: float) -> float:
        """
        Calculate failure severity at a given point in time.
        This models realistic failure progression.
        """
        if self.pattern == FailurePattern.CONSTANT:
            return self.base_severity

        elif self.pattern == FailurePattern.BURST:
            # Bursts every 10 seconds lasting 2 seconds
            in_burst = (elapsed_seconds % 10) < 2
            return self.max_severity if in_burst else 0

        elif self.pattern == FailurePattern.PROGRESSIVE:
            # Linear ramp from base to max over ramp_up period
            if elapsed_seconds < self.ramp_up_seconds:
                progress = elapsed_seconds / self.ramp_up_seconds
                return self.base_severity + (self.max_severity - self.base_severity) * progress
            return self.max_severity

        elif self.pattern == FailurePattern.CHAOTIC:
            # Random severity each time - models unstable network
            return random.uniform(self.base_severity, self.max_severity)

        return self.base_severity

    def should_affect_request(self, route: str, direction: str) -> bool:
        """
        Determine if a specific request should be affected by this failure.
        Enables simulation of partial failures.
        """
        if route not in self.affected_routes:
            return False
        if not self.symmetric and direction == "response":
            return False
        return True


# Example: Modeling a realistic cloud provider network issue
def create_az_network_degradation() -> NetworkFailureModel:
    """
    Models the kind of network degradation that occurs during cloud
    provider availability zone issues - progressive latency increase
    affecting cross-AZ traffic.
    """
    return NetworkFailureModel(
        failure_type="latency",
        pattern=FailurePattern.PROGRESSIVE,
        base_severity=10,   # Start with 10ms added latency
        max_severity=500,   # Ramp to 500ms
        affected_routes=[
            "us-east-1a->us-east-1b",
            "us-east-1a->us-east-1c"
        ],
        symmetric=False,    # Often asymmetric in real incidents
        duration_seconds=300,
        ramp_up_seconds=60  # Takes 1 minute to reach full severity
    )


def create_partial_packet_loss() -> NetworkFailureModel:
    """
    Models intermittent packet loss that degrades TCP performance
    without causing complete connection failures.
    """
    return NetworkFailureModel(
        failure_type="packet_loss",
        pattern=FailurePattern.BURST,
        base_severity=0,   # No loss normally
        max_severity=30,   # 30% loss during bursts
        affected_routes=["*->database-primary"],
        symmetric=True,
        duration_seconds=120
    )
```

The universe of possible failures is infinite. You can't test them all. The art of chaos engineering lies in prioritization—focusing on scenarios that provide the highest learning-to-risk ratio.
The prioritization framework:
Consider each potential chaos experiment along two dimensions:
Likelihood: How probable is this failure in production? Failures that have already happened are more likely to recur. Failures that industry peers report are worth attention.
Impact: If this failure occurs, how severe is the consequence? Total outage is worse than degraded performance. Customer-facing impact is worse than internal-only impact.
Plot potential experiments on a likelihood-impact matrix to identify priorities:
| Likelihood | Low Impact | Medium Impact | High Impact |
|---|---|---|---|
| High Likelihood | Test, but not critical | High priority | CRITICAL - test immediately |
| Medium Likelihood | Lower priority | Test after high-priority items | High priority |
| Low Likelihood | Skip unless low-cost to test | Consider for advanced program | Test if cost-effective |
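In code, this matrix reduces to a simple scoring exercise. The sketch below is a minimal illustration (the ordinal scores, weighting, and example candidates are assumptions you would tune to your own risk tolerance):

```python
from dataclasses import dataclass

# Ordinal scores for the two matrix dimensions (1 = low, 3 = high).
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"low": 1, "medium": 2, "high": 3}


@dataclass
class CandidateExperiment:
    name: str
    likelihood: str  # "low" | "medium" | "high"
    impact: str      # "low" | "medium" | "high"

    @property
    def priority_score(self) -> int:
        # Simple product; weight impact more heavily if your risk
        # tolerance demands it.
        return LIKELIHOOD[self.likelihood] * IMPACT[self.impact]


candidates = [
    CandidateExperiment("single instance termination", "high", "medium"),
    CandidateExperiment("cross-region network partition", "low", "high"),
    CandidateExperiment("dependency returns intermittent 500s", "high", "high"),
    CandidateExperiment("disk-full on a worker node", "medium", "low"),
]

# Highest-scoring experiments are run first.
for exp in sorted(candidates, key=lambda e: e.priority_score, reverse=True):
    print(f"{exp.priority_score}  {exp.name}  ({exp.likelihood}/{exp.impact})")
```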
Sources of prioritization intelligence:
To accurately assess likelihood and impact, gather data from multiple sources:
Historical incident data: What failures have actually occurred? How often? With what impact? Past incidents are the strongest predictor of future incidents.
Architecture analysis: Where are the single points of failure? Which dependencies are most critical? Architectural review reveals theoretical weak points.
Industry postmortems: Companies like Google, Facebook, and Amazon publish detailed incident reports. Learn from others' failures.
Dependency risk assessment: Third-party services and open-source components have their own failure histories. Review their status pages and incident reports.
Team intuition: Experienced engineers often have gut feelings about which components are 'scary.' These intuitions often reflect real risk.
Observability data: Metrics that show high variance or frequent alerts may indicate components under stress.
Teams sometimes want to start with exciting scenarios: simultaneous multi-region failure, database corruption, coordinated attacks. These are important eventually, but extraordinarily unlikely to be your next production incident. Master the common failures before tackling the exotic ones. If you can't survive a single instance termination, you're not ready for multi-region chaos.
Early chaos engineering focused heavily on binary failures—server up or down, service available or unavailable. But production incidents increasingly involve gray failures: situations where components are technically operational but behaving badly.
The gray failure problem:
A server returning HTTP 200 on health checks while silently dropping 50% of requests. A database query that works for small datasets but times out on production-scale data. A service that performs well under normal load but collapses under burst traffic.
These failures are particularly dangerous because they slip past the defenses built for binary failures: health checks keep passing, so load balancers keep routing traffic to degraded instances, and monitoring tuned to detect complete outages never fires.
```typescript
import { Request, Response, NextFunction } from 'express';

/**
 * Gray failure injection middleware for Express applications.
 * Simulates partial failures that pass health checks but degrade service.
 */

interface GrayFailureConfig {
  // Percentage of requests to affect (0-100)
  affectedPercentage: number;

  // Types of gray failure to inject
  failureType: 'slow' | 'error' | 'corrupt' | 'timeout';

  // Latency to add for 'slow' type (ms)
  additionalLatencyMs?: number;

  // Error code to return for 'error' type
  errorCode?: number;

  // Routes to affect (or '*' for all non-health routes)
  affectedRoutes: string[] | '*';

  // Routes that should NEVER be affected (health checks)
  protectedRoutes: string[];
}

const createGrayFailureMiddleware = (config: GrayFailureConfig) => {
  return async (req: Request, res: Response, next: NextFunction) => {
    // Never affect protected routes (health checks)
    if (config.protectedRoutes.some(route => req.path.includes(route))) {
      return next();
    }

    // Check if this route should be affected
    const shouldAffectRoute =
      config.affectedRoutes === '*' ||
      config.affectedRoutes.some(route => req.path.includes(route));

    if (!shouldAffectRoute) {
      return next();
    }

    // Probabilistically inject failure
    if (Math.random() * 100 > config.affectedPercentage) {
      return next(); // This request escapes unaffected
    }

    // Inject the configured failure
    switch (config.failureType) {
      case 'slow':
        // Add artificial latency
        await sleep(config.additionalLatencyMs || 5000);
        return next();

      case 'error':
        // Return error response
        return res.status(config.errorCode || 500).json({
          error: 'Internal server error',
          // Real errors often have misleading messages
          message: 'Request processed successfully' // Intentionally wrong
        });

      case 'corrupt':
        // Let request process, but corrupt the response
        const originalJson = res.json.bind(res);
        res.json = (body: any) => {
          // Subtle corruption - might go unnoticed
          if (typeof body === 'object' && body !== null) {
            return originalJson({
              ...body,
              // Random field corruption
              ...(body.userId && { userId: body.userId + 1 }),
              ...(body.amount && { amount: body.amount * 0.99 }),
            });
          }
          return originalJson(body);
        };
        return next();

      case 'timeout':
        // Never respond - connection eventually times out
        // The health check isn't affected, so this instance
        // keeps receiving traffic
        return; // Intentionally not calling next() or sending response

      default:
        return next();
    }
  };
};

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// Example configuration: 20% of requests to /api/orders are slow
const orderSlownessConfig: GrayFailureConfig = {
  affectedPercentage: 20,
  failureType: 'slow',
  additionalLatencyMs: 3000,
  affectedRoutes: ['/api/orders', '/api/checkout'],
  protectedRoutes: ['/health', '/ready', '/metrics']
};

// Example configuration: 5% of responses have subtle data corruption
const dataCorruptionConfig: GrayFailureConfig = {
  affectedPercentage: 5,
  failureType: 'corrupt',
  affectedRoutes: '*',
  protectedRoutes: ['/health', '/ready', '/metrics', '/api/auth']
};

export { createGrayFailureMiddleware, GrayFailureConfig };
```

Gray failure experiments often reveal that current monitoring is insufficient. If you inject 20% error rates and your dashboard shows 100% success, that's a major finding—your observability has blind spots. Many organizations discover through chaos that their monitoring only catches complete outages, not partial degradation.
Many real-world failures only manifest under specific conditions: high load, resource constraints, or time-sensitive situations. Effective chaos engineering must simulate these conditions, not just component failures.
Load-dependent failures:
Systems often work perfectly at 1x load but fail spectacularly at 3x load. This isn't just about capacity—it reveals algorithmic inefficiencies (O(n²) operations that were fine for n=1000 but fail at n=10000), resource exhaustion patterns, and contention issues.
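A load-ramp probe makes these inflection points visible. The sketch below is a rough, hypothetical harness (the target URL, concurrency numbers, and percentile math are placeholders; point it only at a test environment):

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target; use a test environment, never production,
# without the safeguards discussed later in this page.
TARGET_URL = "http://localhost:8080/api/orders"
BASELINE_CONCURRENCY = 10


def hit_endpoint(_: int) -> float:
    """Issue one request and return its latency in seconds (inf on error)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5):
            return time.monotonic() - start
    except Exception:
        return float("inf")


def run_load_step(multiplier: int, requests_per_worker: int = 20) -> None:
    """Run one load step at N times baseline concurrency and report results."""
    workers = BASELINE_CONCURRENCY * multiplier
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(hit_endpoint, range(workers * requests_per_worker)))
    ok = [l for l in latencies if l != float("inf")]
    error_rate = 1 - len(ok) / len(latencies)
    p99 = sorted(ok)[int(len(ok) * 0.99) - 1] if ok else float("inf")
    print(f"{multiplier}x load: error_rate={error_rate:.1%}, p99={p99:.3f}s")


# Step through 1x, 2x, 3x load and watch where latency or errors inflect.
for m in (1, 2, 3):
    run_load_step(m)
```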
Time-related failures:
Some failures are inherently time-bound. Certificates expire. Tokens time out. Scheduled jobs run (or don't run). These temporal aspects require specific testing:
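Before reaching for infrastructure-level clock manipulation, you can unit-test time assumptions directly. This minimal sketch (the SessionToken type is hypothetical) shows how a token's validity window behaves when evaluated against a skewed clock:

```python
import time
from dataclasses import dataclass


@dataclass
class SessionToken:
    """Hypothetical token with an absolute validity window."""
    issued_at: float   # Unix timestamp when the token was issued
    ttl_seconds: int   # How long the token stays valid

    def is_valid(self, now: float) -> bool:
        # Valid only while 'now' falls inside [issued_at, issued_at + ttl]
        return self.issued_at <= now <= self.issued_at + self.ttl_seconds


token = SessionToken(issued_at=time.time(), ttl_seconds=300)

# Evaluate the same freshly issued token against skewed clocks:
# a backward jump makes it appear not yet valid, a forward jump expired.
for skew in (0, -600, +600):
    simulated_now = time.time() + skew
    print(f"clock skew {skew:+5d}s -> token valid: {token.is_valid(simulated_now)}")
```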
The " clock skew" category deserves special attention:
```python
import subprocess
import time
from datetime import datetime, timedelta
from contextlib import contextmanager
from typing import Generator
import logging

logger = logging.getLogger(__name__)


class ClockSkewExperiment:
    """
    Simulates clock skew between distributed system components.

    Clock skew causes subtle but severe bugs:
    - Certificate validation failures
    - Distributed lock issues
    - Event ordering problems
    - Token/session expiration issues

    WARNING: This modifies system time. Use only in isolated environments.
    """

    def __init__(self, target_host: str):
        self.target_host = target_host
        self.original_time = None

    @contextmanager
    def skewed_clock(self, skew_seconds: int) -> Generator[None, None, None]:
        """
        Context manager that skews the clock on target host for the
        duration of the context, then restores it.

        Args:
            skew_seconds: Positive for future, negative for past
        """
        try:
            self._record_original_time()
            self._apply_skew(skew_seconds)
            logger.info(
                f"Clock on {self.target_host} skewed by {skew_seconds}s"
            )
            yield
        finally:
            self._restore_time()
            logger.info(f"Clock on {self.target_host} restored")

    def _record_original_time(self):
        """Record current time for later restoration."""
        result = subprocess.run(
            ['ssh', self.target_host, 'date +%s'],
            capture_output=True, text=True
        )
        self.original_time = int(result.stdout.strip())

    def _apply_skew(self, skew_seconds: int):
        """Set system time on target host."""
        new_time = self.original_time + skew_seconds
        # Note: Requires root/sudo access on target
        subprocess.run([
            'ssh', self.target_host,
            f'sudo date -s @{new_time}'
        ])

    def _restore_time(self):
        """Restore original time on target host."""
        if self.original_time:
            # Sync with NTP
            subprocess.run([
                'ssh', self.target_host,
                'sudo systemctl restart systemd-timesyncd'
            ])


def run_clock_skew_scenarios():
    """
    Common clock skew scenarios that reveal real bugs.
    """
    scenarios = [
        {
            "name": "Backward clock jump",
            "description": "Clock suddenly moves 5 minutes into past",
            "skew_seconds": -300,
            "expected_issues": [
                "Rate limiters may over-restrict",
                "Recently issued tokens appear expired",
                "Distributed locks may conflict"
            ]
        },
        {
            "name": "Forward clock jump",
            "description": "Clock suddenly moves 5 minutes into future",
            "skew_seconds": 300,
            "expected_issues": [
                "Certificates may appear expired",
                "Scheduled jobs may fire early",
                "Cache TTLs may expire prematurely"
            ]
        },
        {
            "name": "Gradual drift",
            "description": "Clock slowly drifts over time",
            "skew_seconds": 60,  # Applied incrementally
            "expected_issues": [
                "Distributed consensus may fail",
                "Log ordering becomes incorrect",
                "Database replication may lag"
            ]
        },
        {
            "name": "Cross-zone skew",
            "description": "Different AZs have different times",
            "skew_seconds": 30,  # On subset of nodes
            "expected_issues": [
                "Leader election instability",
                "Partition detection false positives",
                "Event ordering inconsistencies"
            ]
        }
    ]

    return scenarios
```

The most damaging production incidents aren't simple component failures—they're cascading failures where one failure triggers a chain reaction that brings down seemingly unrelated systems. Simulating these cascades requires thinking about failure propagation paths.
Common cascade patterns include retry storms that amplify load, connection pool exhaustion that propagates latency upstream, and circuit breakers that fail to open in time; the experiment definition below traces one such chain from a single database slowdown to the circuit breakers that should contain it.
Designing cascade experiments:
Simulating cascades requires multi-component experiments:
```yaml
# Cascade Failure Experiment Definition
name: "Database Slowdown Cascade"
description: |
  Simulates a cascade where database latency increase causes
  connection pool exhaustion, leading to service failures,
  creating retry storms, and potentially overwhelming
  upstream load balancers.

trigger:
  component: database-primary
  failure_type: latency_injection
  severity: 500ms  # 10x normal latency
  duration: 5m

expected_propagation:
  - step: 1
    component: database-connection-pools
    effect: "Connections held 10x longer, pool usage increases"
    time_range: "0-30s"
    metric: "connection_pool_utilization"
    expected_value: ">80%"

  - step: 2
    component: api-services
    effect: "Requests wait for connections, latency increases"
    time_range: "30-60s"
    metric: "api_p99_latency_ms"
    expected_value: ">2000"

  - step: 3
    component: client-applications
    effect: "Timeouts trigger retries, amplifying load"
    time_range: "60-120s"
    metric: "request_rate"
    expected_value: "+50% over baseline"

  - step: 4
    component: load-balancer
    effect: "Circuit breaker should engage"
    time_range: "120s+"
    metric: "circuit_breaker_state"
    expected_value: "OPEN"

containment_hypothesis:
  description: |
    The cascade should be contained by circuit breakers opening
    at step 4. Services behind open circuits should return
    fallback responses. Database should recover as retry
    pressure drops.
  success_criteria:
    - "No complete service outage"
    - "Circuit breakers open within 120s"
    - "Error rate at load balancer < 50%"
    - "Recovery within 60s of trigger removal"

abort_conditions:
  - "Error rate > 90% for > 30s"
  - "Complete loss of service availability"
  - "Customer impact detected via SLI breach"

rollback:
  action: "Remove latency injection"
  verification: "Confirm database latency returns to baseline"
  expected_recovery_time: 60s
```

Cascade experiments are inherently risky because they can spiral beyond expectations. Start with short durations and have abort conditions ready. Monitor closely. Many teams discover that their cascade containment mechanisms don't work as expected—which is exactly why these experiments are valuable.
Modern systems depend on services outside your control: cloud provider APIs, payment processors, authentication providers, CDNs, third-party data feeds. These dependencies introduce failure modes you can't prevent—only prepare for.
Simulating external failures:
You typically can't inject failure into third-party services directly. Instead, you simulate their failure at the boundary:
| External Service | Failure Pattern | Simulation Technique | What It Reveals |
|---|---|---|---|
| Payment Processor | 5-second response times | Proxy adds latency to Stripe/Braintree endpoints | Checkout timeout handling, retry behavior |
| Auth Provider | Intermittent 500 errors | Proxy returns errors for 30% of OAuth requests | Login degradation, session handling |
| CDN | Complete unavailability | Block CDN domain at DNS level | Asset fallbacks, loading behavior |
| Email Service | Rate limiting | Mock returns 429 after N requests | Queue handling, backpressure |
| Cloud APIs | Partial region failure | Block specific AWS/GCP/Azure endpoints | Multi-region fallback behavior |
```python
from mitmproxy import http
import random
import time
from typing import Optional, Dict
import json
import logging

logger = logging.getLogger(__name__)


class ExternalDependencyFailureInjector:
    """
    mitmproxy addon that injects failures for external service simulation.

    Use with: mitmproxy -s external-dependency-proxy.py

    This allows simulating third-party service failures without
    modifying those services or fully mocking them.
    """

    def __init__(self):
        self.failure_configs: Dict[str, dict] = {
            # Stripe API simulation
            "api.stripe.com": {
                "enabled": True,
                "latency_ms": 5000,    # Add 5s latency
                "error_rate": 0.1,     # 10% errors
                "error_code": 503,
                "timeout_rate": 0.05,  # 5% complete timeouts
            },
            # Auth0 simulation
            "*.auth0.com": {
                "enabled": True,
                "latency_ms": 0,
                "error_rate": 0.2,     # 20% errors - auth degradation
                "error_code": 500,
                "timeout_rate": 0.0,
            },
            # AWS S3 simulation
            "*.s3.amazonaws.com": {
                "enabled": True,
                "latency_ms": 200,
                "error_rate": 0.05,
                "error_code": 503,
                "timeout_rate": 0.0,
                # Rate limit simulation
                "rate_limit_after": 100,  # Requests per minute
                "rate_limit_error": 429,
            }
        }
        self.request_counts: Dict[str, int] = {}

    def request(self, flow: http.HTTPFlow) -> None:
        """Called for each request through the proxy."""
        host = flow.request.pretty_host
        config = self._get_config_for_host(host)

        if not config or not config.get("enabled"):
            return

        # Simulate latency
        if config.get("latency_ms", 0) > 0:
            time.sleep(config["latency_ms"] / 1000.0)
            logger.info(f"Injected {config['latency_ms']}ms latency for {host}")

        # Simulate timeout (don't respond at all)
        if random.random() < config.get("timeout_rate", 0):
            flow.kill()
            logger.info(f"Simulated timeout for {host}")
            return

        # Check rate limiting
        self.request_counts[host] = self.request_counts.get(host, 0) + 1
        if self.request_counts.get(host, 0) > config.get("rate_limit_after", float('inf')):
            flow.response = http.Response.make(
                config.get("rate_limit_error", 429),
                json.dumps({"error": "Rate limit exceeded"}),
                {"Content-Type": "application/json"}
            )
            logger.info(f"Rate limited request to {host}")
            return

        # Simulate random errors
        if random.random() < config.get("error_rate", 0):
            flow.response = http.Response.make(
                config.get("error_code", 500),
                json.dumps({
                    "error": "Service temporarily unavailable",
                    "status": "error"
                }),
                {"Content-Type": "application/json"}
            )
            logger.info(f"Injected error for {host}")
            return

    def _get_config_for_host(self, host: str) -> Optional[dict]:
        """Find matching configuration for host, supporting wildcards."""
        # Exact match
        if host in self.failure_configs:
            return self.failure_configs[host]

        # Wildcard match
        for pattern, config in self.failure_configs.items():
            if pattern.startswith("*."):
                domain = pattern[2:]
                if host.endswith(domain):
                    return config

        return None


# Create the addon instance
addons = [ExternalDependencyFailureInjector()]
```

Be especially careful with payment processor and billing service chaos. Use sandbox/test environments and test API keys. Never inject failures that could affect real financial transactions. Many companies have separate chaos engineering environments specifically for testing payment flows.
The second principle of chaos engineering—varying real-world events—transforms abstract resilience goals into concrete experiments. The key insight is that effective chaos comes from systematically cataloging, prioritizing, and realistically modeling failures, not from random disruption.
What's next:
With hypotheses formed and failure scenarios designed, we're ready to tackle the third principle: Run Experiments in Production. This is where chaos engineering gets real—and where the careful preparation from the first two principles pays dividends.
You now understand the taxonomy of failures, how to model realistic failure patterns, prioritization frameworks, and techniques for simulating everything from gray failures to cascading failures to external dependency issues. Next, we'll explore how to safely run these experiments in production environments.