The fifth and final principle of chaos engineering is perhaps the most important for organizational adoption: Minimize blast radius.
Without this principle, the first four principles become reckless. Running experiments in production without blast radius control is simply creating incidents. Automating experiments without scope limits is automating outages. The power of chaos engineering comes from its ability to find weaknesses without causing the very disasters it's meant to prevent.
Blast radius minimization is what makes chaos engineering feel safe enough to practice. It's the guardrail that enables engineers to say "yes, let's try that" instead of "that's too risky." It's the reason Netflix can run Chaos Monkey continuously in production while maintaining world-class availability. Done properly, the blast radius of an experiment should be smaller than the blast radius of the bugs it would discover.
By the end of this page, you will understand the dimensions of blast radius, techniques for progressive scope expansion, containment strategies for when things go wrong, how to calculate acceptable blast radius, and how to build confidence through incremental experimentation.
Blast radius is the maximum scope of impact an experiment can have. It's measured across multiple dimensions, and controlling all dimensions is necessary for safe experimentation.
The dimensions of blast radius:
Blast radius isn't a single number—it's a multi-dimensional constraint that defines the boundaries of your experiment:
| Dimension | Description | Example Controls |
|---|---|---|
| User scope | Which users are affected | Internal only, beta users, region-specific, percentage of all users |
| Traffic scope | What percentage of requests are affected | 1%, 5%, 10%, 50%, 100% of requests |
| Component scope | Which system components are targeted | Single instance, single service, multiple services, entire stack |
| Geographic scope | Which regions or data centers are affected | Single AZ, single region, multi-region |
| Temporal scope | How long the experiment runs | Seconds, minutes, hours, continuous |
| Severity scope | How severe the injected failure is | Latency (mild), errors (moderate), complete failure (severe) |
| Feature scope | Which functionality is affected | Read operations, write operations, payment flows |
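One way to keep every dimension visible is to declare them together as a single object rather than scattering them across scripts. The sketch below is illustrative only; the field names and defaults are assumptions, not the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadiusSpec:
    """Declares every blast radius dimension for one experiment run."""
    user_scope: str = "internal_only"            # internal, beta, region-specific, all
    traffic_percentage: float = 1.0              # % of requests routed through the chaos path
    component_scope: str = "single_instance"     # instance, service, multiple services, stack
    regions: tuple = ("us-east-1",)              # geographic scope
    duration_seconds: int = 60                   # temporal scope
    severity: str = "latency"                    # latency (mild), errors (moderate), outage (severe)
    feature_scope: tuple = ("read_operations",)  # which functionality is touched


# An early-stage experiment keeps every dimension near its minimum
first_run = BlastRadiusSpec()
print(first_run)
```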
The blast radius budget:
Think of blast radius as a budget that you allocate across dimensions based on risk tolerance and learning objectives. A new experiment might start near the minimum on every dimension: internal users only, a single instance, a single region, and a duration measured in seconds.
As confidence grows, you can expand one dimension at a time while keeping others constrained. This progressive expansion is key to safe chaos.
Blast radius dimensions multiply, not add. If you affect 10% of traffic × 10% of instances × US-East only, your effective blast radius is far smaller than any single dimension suggests. Use this to your advantage by keeping multiple dimensions small simultaneously.
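As a rough worked example of that multiplication (the regional traffic share is an assumed number, purely for illustration):

```python
# Fraction of the whole system each dimension exposes to the experiment
traffic_fraction = 0.10    # 10% of requests routed through the chaos path
instance_fraction = 0.10   # 10% of instances targeted
region_fraction = 0.40     # assume us-east-1 serves ~40% of global traffic

effective_blast_radius = traffic_fraction * instance_fraction * region_fraction
print(f"Effective blast radius: {effective_blast_radius:.2%} of all requests")
# Effective blast radius: 0.40% of all requests
```

Any single dimension at 10% sounds significant; the intersection of all three is well under half a percent.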
The safest chaos programs start with minimal blast radius and expand only after demonstrating success. This progressive expansion builds both system confidence and organizational trust.
The expansion ladder:
For each experiment type, define an expansion path:
```yaml
# Blast Radius Expansion Configuration
experiment_type: instance_termination

progression_levels:
  - level: 1
    name: "Minimal Validation"
    requirements:
      - "New experiment type OR significant system changes"
    blast_radius:
      traffic_percentage: 0
      instance_count: 1
      regions: ["us-east-1"]
      user_segments: ["internal_only"]
      duration_seconds: 60
    graduation_criteria:
      - "Zero customer impact detected"
      - "System recovered within 30 seconds"
      - "No unexpected propagation"

  - level: 2
    name: "Early Validation"
    requirements:
      - "Level 1 passed 3 consecutive times"
    blast_radius:
      traffic_percentage: 1
      instance_count: 1
      regions: ["us-east-1"]
      user_segments: ["beta_users"]
      duration_seconds: 120
    graduation_criteria:
      - "Error rate increase < 0.1%"
      - "Latency P99 increase < 50ms"
      - "Recovery within SLO"

  - level: 3
    name: "Moderate Validation"
    requirements:
      - "Level 2 passed 5 consecutive times"
      - "No related incidents in past 30 days"
    blast_radius:
      traffic_percentage: 5
      instance_count: 2
      regions: ["us-east-1", "us-west-2"]
      user_segments: ["all_except_enterprise"]
      duration_seconds: 300
    graduation_criteria:
      - "SLO maintained throughout"
      - "No manual intervention required"
      - "Automated analysis confirms hypothesis"

  - level: 4
    name: "Production Validation"
    requirements:
      - "Level 3 passed 10 consecutive times"
      - "Executive approval for this experiment class"
    blast_radius:
      traffic_percentage: 10
      instance_count: "25% of fleet"
      regions: ["all_production"]
      user_segments: ["all_users"]
      duration_seconds: 600
    graduation_criteria:
      - "Full production traffic handled"
      - "Error budget not consumed"
      - "Team comfortable with experiment"

  - level: 5
    name: "Continuous Validation"
    requirements:
      - "Level 4 passed 20 consecutive times"
      - "Automated abort proven reliable"
    blast_radius:
      traffic_percentage: 10
      instance_count: "25% of fleet"
      regions: ["all_production"]
      user_segments: ["all_users"]
      duration_seconds: "continuous"
      schedule: "random within business hours"
    graduation_criteria:
      - "Running for 30 days without incident"
      - "Team confidence high"

demotion_triggers:
  - "Experiment caused customer-visible impact"
  - "Abort mechanism failed"
  - "Unexpected cascade detected"
  - "Major system architecture change"
```

Key principles of progressive expansion:
One dimension at a time: When expanding, increase only one dimension while keeping others constant. Going from Level 2 to Level 3 might increase traffic from 1% to 5% but keep duration the same.
Graduation requires consistency: A single success shouldn't trigger expansion. Require multiple consecutive successes to prove the result wasn't luck.
Demotion is possible: If an experiment causes unexpected impact at a higher level, demote it to a lower level and restart the progression.
Architecture changes reset: Major system changes (new services, changed dependencies, different infrastructure) may require re-validation from lower levels.
Document the progression: Track what level each experiment has reached and why. This becomes institutional knowledge about system resilience.
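To make "graduation requires consistency" and "demotion is possible" concrete, here is a minimal sketch of a progression tracker. The pass counts mirror the configuration above; everything else (names, structure) is an assumption, not a prescribed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressionTracker:
    """Tracks where an experiment sits on the blast radius expansion ladder."""
    level: int = 1
    max_level: int = 5
    consecutive_passes: int = 0
    # Consecutive clean runs needed to graduate from each level (matches the YAML above)
    graduation_threshold: dict = field(default_factory=lambda: {1: 3, 2: 5, 3: 10, 4: 20})

    def record_run(self, passed: bool, customer_impact: bool = False) -> int:
        """Record one experiment run and return the (possibly new) level."""
        if customer_impact or not passed:
            # Demotion: drop a level and restart the streak at the lower level
            self.level = max(1, self.level - 1)
            self.consecutive_passes = 0
            return self.level

        self.consecutive_passes += 1
        needed = self.graduation_threshold.get(self.level)
        if needed and self.consecutive_passes >= needed and self.level < self.max_level:
            self.level += 1                # Graduate one level at a time
            self.consecutive_passes = 0    # A new level starts a new streak
        return self.level

    def reset_for_architecture_change(self) -> None:
        """Major system changes send the experiment back to re-validation."""
        self.level = 1
        self.consecutive_passes = 0
```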
Think of blast radius as a dial you turn slowly, not a switch you flip. The difference between "safe" and "reckless" chaos is often just how quickly you turned the dial. Patient, progressive expansion discovers the same weaknesses with far less risk.
Even with careful blast radius control, experiments can exceed expected scope. Containment strategies limit damage when things go wrong.
Defense in depth for chaos:
Like security, chaos containment works best with multiple layers. Each layer catches failures that slip through previous layers.
```go
package containment

import (
	"context"
	"sync"
	"time"
)

// ContainmentLevel represents the severity of containment needed
type ContainmentLevel int

const (
	ContainmentNone      ContainmentLevel = 0
	ContainmentInjection ContainmentLevel = 1
	ContainmentCircuit   ContainmentLevel = 2
	ContainmentAbort     ContainmentLevel = 3
	ContainmentManual    ContainmentLevel = 4
	ContainmentEmergency ContainmentLevel = 5
)

// ContainmentOrchestrator manages multi-layer containment
type ContainmentOrchestrator struct {
	mu              sync.Mutex
	currentLevel    ContainmentLevel
	injector        InjectionController
	circuitBreakers CircuitBreakerController
	abortController AbortController
	notifier        NotificationService

	// Track containment escalations
	escalationHistory []EscalationEvent
}

// EscalationEvent records when containment was triggered
type EscalationEvent struct {
	Timestamp time.Time
	Level     ContainmentLevel
	Reason    string
	Automated bool
}

// NewContainmentOrchestrator creates orchestrator with all layers
func NewContainmentOrchestrator(
	injector InjectionController,
	circuits CircuitBreakerController,
	abort AbortController,
	notifier NotificationService,
) *ContainmentOrchestrator {
	return &ContainmentOrchestrator{
		currentLevel:      ContainmentNone,
		injector:          injector,
		circuitBreakers:   circuits,
		abortController:   abort,
		notifier:          notifier,
		escalationHistory: []EscalationEvent{},
	}
}

// Escalate moves to a higher containment level
func (c *ContainmentOrchestrator) Escalate(level ContainmentLevel, reason string, automated bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if level <= c.currentLevel {
		return nil // Already at this level or higher
	}

	// Record the escalation
	c.escalationHistory = append(c.escalationHistory, EscalationEvent{
		Timestamp: time.Now(),
		Level:     level,
		Reason:    reason,
		Automated: automated,
	})

	// Execute containment actions for each level up to target
	for l := c.currentLevel + 1; l <= level; l++ {
		if err := c.executeContainmentLevel(l, reason); err != nil {
			// Even on error, we want to try higher levels
			c.notifier.Alert("Containment level %d failed: %v", l, err)
		}
	}

	c.currentLevel = level
	return nil
}

func (c *ContainmentOrchestrator) executeContainmentLevel(level ContainmentLevel, reason string) error {
	switch level {
	case ContainmentInjection:
		// Stop new injections, but let running ones complete
		c.notifier.Info("Containment Level 1: Halting new chaos injections")
		return c.injector.PauseNewInjections()

	case ContainmentCircuit:
		// Trigger circuit breakers on affected paths
		c.notifier.Warn("Containment Level 2: Opening circuit breakers")
		return c.circuitBreakers.OpenAffectedCircuits()

	case ContainmentAbort:
		// Abort all active experiments
		c.notifier.Warn("Containment Level 3: Aborting all experiments - %s", reason)
		return c.abortController.AbortAll()

	case ContainmentManual:
		// Request human intervention
		c.notifier.Critical("Containment Level 4: MANUAL INTERVENTION REQUIRED - %s", reason)
		c.abortController.AbortAll() // Best effort
		return nil

	case ContainmentEmergency:
		// Full system safe mode
		c.notifier.Critical("Containment Level 5: EMERGENCY - Enabling system safe mode")
		c.injector.PauseNewInjections()
		c.abortController.AbortAll()
		c.circuitBreakers.OpenAll()
		return nil
	}
	return nil
}

// MonitorAndContain continuously checks conditions and escalates as needed
func (c *ContainmentOrchestrator) MonitorAndContain(ctx context.Context, thresholds ContainmentThresholds) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.checkAndEscalate(thresholds)
		}
	}
}

func (c *ContainmentOrchestrator) checkAndEscalate(thresholds ContainmentThresholds) {
	// Get current system state
	errorRate := c.getErrorRate()
	latencyP99 := c.getLatencyP99()
	activeExperiments := c.getActiveExperimentCount()

	// Check thresholds in order of severity
	if errorRate > thresholds.EmergencyErrorRate {
		c.Escalate(ContainmentEmergency, "Error rate > emergency threshold", true)
		return
	}

	if errorRate > thresholds.AbortErrorRate {
		c.Escalate(ContainmentAbort, "Error rate > abort threshold", true)
		return
	}

	if latencyP99 > thresholds.CircuitLatency {
		c.Escalate(ContainmentCircuit, "Latency > circuit threshold", true)
		return
	}

	// Multiple experiments might increase risk
	if activeExperiments > thresholds.MaxConcurrentExperiments {
		c.Escalate(ContainmentInjection, "Too many concurrent experiments", true)
	}
}

// Reset returns to no containment (after recovery)
func (c *ContainmentOrchestrator) Reset() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.currentLevel = ContainmentNone
	c.notifier.Info("Containment reset - normal operations resumed")
}

// Stubs for demonstration
func (c *ContainmentOrchestrator) getErrorRate() float64         { return 0.01 }
func (c *ContainmentOrchestrator) getLatencyP99() float64        { return 100 }
func (c *ContainmentOrchestrator) getActiveExperimentCount() int { return 1 }

// ContainmentThresholds defines when to escalate
type ContainmentThresholds struct {
	CircuitLatency           float64 // ms
	AbortErrorRate           float64 // 0-1
	EmergencyErrorRate       float64 // 0-1
	MaxConcurrentExperiments int
}
```

Containment mechanisms are only valuable if they work when needed. Periodically test each layer: Does the abort actually stop injection? Do circuit breakers trip at the right thresholds? Can you reach the big red button quickly? A containment layer you've never tested provides false confidence.
How much blast radius is acceptable? This isn't arbitrary—it should be calculated based on your system's characteristics and business context.
The error budget frame:
If your SLO allows 0.1% downtime per month (99.9% availability), you have an error budget of approximately 43 minutes. Chaos experiments should consume only a small fraction of this budget—perhaps 10% or less. At 10%, that gives you roughly 4 minutes of chaos-attributable impact per month to spread across every experiment you run.
This frame forces discipline. You can't run 10% blast radius experiments frequently if each one risks minutes of impact.
```python
from dataclasses import dataclass
from typing import Optional
from enum import Enum


class SLOTarget(Enum):
    THREE_NINES = 0.999        # 43.8 min/month downtime
    THREE_FIVE_NINES = 0.9995  # 21.9 min/month
    FOUR_NINES = 0.9999        # 4.38 min/month
    FIVE_NINES = 0.99999       # 26 sec/month


@dataclass
class BlastRadiusCalculation:
    """Results of blast radius calculation."""
    max_traffic_percentage: float
    max_duration_seconds: int
    max_weekly_experiments: int
    rationale: str


class BlastRadiusCalculator:
    """
    Calculate acceptable blast radius based on SLOs and error budget.

    The key insight: chaos experiments should consume only a fraction
    of your error budget, leaving room for real failures.
    """

    def __init__(
        self,
        slo_target: SLOTarget,
        chaos_budget_fraction: float = 0.1  # 10% of error budget for chaos
    ):
        self.slo_target = slo_target
        self.chaos_budget_fraction = chaos_budget_fraction

        # Calculate available chaos budget
        self.monthly_error_budget_minutes = self._calculate_error_budget()
        self.weekly_chaos_budget_minutes = (
            self.monthly_error_budget_minutes * chaos_budget_fraction / 4
        )

    def _calculate_error_budget(self) -> float:
        """Calculate allowed downtime per month in minutes."""
        allowed_downtime_fraction = 1 - self.slo_target.value
        minutes_per_month = 30 * 24 * 60  # ~43,200 minutes
        return allowed_downtime_fraction * minutes_per_month

    def calculate_for_experiment(
        self,
        experiment_type: str,
        expected_impact_fraction: float,  # What fraction of affected requests fail
        experiments_per_week: int = 5
    ) -> BlastRadiusCalculation:
        """
        Calculate maximum safe blast radius for an experiment type.

        Args:
            experiment_type: Name of the experiment
            expected_impact_fraction: Fraction of affected requests that fail
                (e.g., latency injection might be 0 if it stays under timeout,
                instance termination might be 0.5 if half requests fail during failover)
            experiments_per_week: How often we want to run this experiment
        """
        # Budget per experiment in minutes
        budget_per_experiment = self.weekly_chaos_budget_minutes / experiments_per_week

        # If the experiment causes X% of affected requests to fail,
        # and we can afford Y minutes of full outage equivalent,
        # then max duration × traffic% × impact = Y minutes

        # Solve for traffic% given target duration
        target_duration_minutes = 5  # 5 minute max duration
        if expected_impact_fraction > 0:
            max_traffic_percentage = min(
                100,
                (budget_per_experiment / target_duration_minutes / expected_impact_fraction) * 100
            )
        else:
            max_traffic_percentage = 100  # No actual failures expected

        # Also calculate max duration for 10% traffic
        if expected_impact_fraction > 0:
            max_duration_at_10_percent = (
                budget_per_experiment / (0.10 * expected_impact_fraction)
            )
        else:
            max_duration_at_10_percent = 60  # Cap at 60 minutes

        return BlastRadiusCalculation(
            max_traffic_percentage=round(max_traffic_percentage, 2),
            max_duration_seconds=int(min(max_duration_at_10_percent, 60) * 60),
            max_weekly_experiments=experiments_per_week,
            rationale=self._generate_rationale(
                experiment_type, expected_impact_fraction, budget_per_experiment
            )
        )

    def _generate_rationale(
        self,
        experiment_type: str,
        impact_fraction: float,
        budget_minutes: float
    ) -> str:
        return f"""
Calculation for: {experiment_type}
SLO Target: {self.slo_target.name} ({self.slo_target.value * 100}%)
Monthly error budget: {self.monthly_error_budget_minutes:.1f} minutes
Weekly chaos budget ({self.chaos_budget_fraction*100}%): {self.weekly_chaos_budget_minutes:.1f} minutes
Per-experiment budget: {budget_minutes:.2f} minutes

Expected impact fraction: {impact_fraction*100}% of affected requests fail

This budget assumes chaos-related impact is completely isolated.
Unexpected cascading effects would consume additional budget.
""".strip()


# Example usage
def demonstrate_calculation():
    # High availability system (99.99%)
    calc = BlastRadiusCalculator(
        slo_target=SLOTarget.FOUR_NINES,
        chaos_budget_fraction=0.10
    )

    # Instance termination with ~50% request impact during failover
    instance_result = calc.calculate_for_experiment(
        experiment_type="Instance Termination",
        expected_impact_fraction=0.5,  # Half of requests fail during 30s failover
        experiments_per_week=10
    )
    print(f"Instance Termination: max {instance_result.max_traffic_percentage}% traffic")

    # Latency injection with no actual failures (just slowness)
    latency_result = calc.calculate_for_experiment(
        experiment_type="200ms Latency Injection",
        expected_impact_fraction=0.0,  # Requests still succeed
        experiments_per_week=20
    )
    print(f"Latency Injection: max {latency_result.max_traffic_percentage}% traffic")

    # Network partition with high impact
    partition_result = calc.calculate_for_experiment(
        experiment_type="Network Partition",
        expected_impact_fraction=1.0,  # All affected requests fail
        experiments_per_week=2
    )
    print(f"Network Partition: max {partition_result.max_traffic_percentage}% traffic")

    return instance_result, latency_result, partition_result
```

Factors that affect acceptable blast radius:
Minimizing blast radius isn't just about limiting scope—it's about ensuring that whatever scope you affect is truly isolated from the rest of the system. Isolation prevents experiment impacts from cascading beyond intended boundaries.
Technical isolation approaches:
| Technique | How It Works | Isolation Strength | Complexity |
|---|---|---|---|
| Traffic splitting | Route percentage of traffic through chaos path | Medium - requests interact with shared resources | Low |
| User segmentation | Affect specific user groups based on attributes | Medium - users share infrastructure | Low |
| Instance targeting | Inject chaos into specific instances only | Medium - shared dependencies not isolated | Low |
| Namespace isolation | Run chaos experiments in separate K8s namespace | High - network policies can enforce boundaries | Medium |
| Tenancy isolation | Experiment in dedicated tenant/shard | High - data and compute separated | High |
| Shadow environment | Run experiments on traffic replicas, not live | Very High - no production impact possible | Very High |
```typescript
import { Request, Response, NextFunction } from 'express';

/**
 * Traffic isolation for chaos experiments.
 * Determines whether a request should be routed through
 * the chaos path or the normal path.
 */

interface IsolationConfig {
  // Percentage of traffic to route to chaos (0-100)
  trafficPercentage: number;

  // User segments to include (empty = all)
  includedUserSegments: string[];

  // User segments to always exclude
  excludedUserSegments: string[];

  // Specific user IDs to whitelist for chaos
  whitelistedUserIds: string[];

  // Specific user IDs to blacklist from chaos
  blacklistedUserIds: string[];

  // Geographic regions to target
  targetRegions: string[];

  // Request paths to target (or exclude)
  includedPaths: string[];
  excludedPaths: string[];

  // Mark header on affected requests for observability
  markHeader: string;
}

interface IsolationResult {
  routeToChaos: boolean;
  reason: string;
}

class ChaosIsolator {
  private config: IsolationConfig;
  private hashSeed: number;

  constructor(config: IsolationConfig) {
    this.config = config;
    this.hashSeed = Date.now(); // Consistent for experiment duration
  }

  /**
   * Determine if a request should be routed to the chaos path.
   * Uses stable hashing so the same user gets consistent treatment.
   */
  shouldRouteToChaos(req: Request): IsolationResult {
    const userId = req.headers['x-user-id'] as string || 'anonymous';
    const userSegment = req.headers['x-user-segment'] as string || 'default';
    const region = req.headers['x-region'] as string || 'unknown';
    const path = req.path;

    // Check blacklist first (never route these to chaos)
    if (this.config.blacklistedUserIds.includes(userId)) {
      return { routeToChaos: false, reason: 'User blacklisted' };
    }

    // Check excluded segments
    if (this.config.excludedUserSegments.includes(userSegment)) {
      return { routeToChaos: false, reason: 'Segment excluded' };
    }

    // Check excluded paths
    for (const excludedPath of this.config.excludedPaths) {
      if (path.startsWith(excludedPath)) {
        return { routeToChaos: false, reason: 'Path excluded' };
      }
    }

    // Check whitelist (always route if whitelisted)
    if (this.config.whitelistedUserIds.includes(userId)) {
      return { routeToChaos: true, reason: 'User whitelisted' };
    }

    // Check region targeting
    if (this.config.targetRegions.length > 0 &&
        !this.config.targetRegions.includes(region)) {
      return { routeToChaos: false, reason: 'Region not targeted' };
    }

    // Check segment targeting
    if (this.config.includedUserSegments.length > 0 &&
        !this.config.includedUserSegments.includes(userSegment)) {
      return { routeToChaos: false, reason: 'Segment not targeted' };
    }

    // Check path targeting
    if (this.config.includedPaths.length > 0) {
      const pathMatch = this.config.includedPaths.some(p => path.startsWith(p));
      if (!pathMatch) {
        return { routeToChaos: false, reason: 'Path not targeted' };
      }
    }

    // Stable percentage-based routing
    const userHash = this.stableHash(userId);
    const percentage = userHash % 100;

    if (percentage < this.config.trafficPercentage) {
      return { routeToChaos: true, reason: `In ${this.config.trafficPercentage}% sample` };
    }

    return { routeToChaos: false, reason: 'Not in sample' };
  }

  /**
   * Generate stable hash for user ID.
   * Same user always gets same result within an experiment.
   */
  private stableHash(input: string): number {
    let hash = this.hashSeed;
    for (let i = 0; i < input.length; i++) {
      const char = input.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash; // Convert to 32bit integer
    }
    return Math.abs(hash);
  }
}

/**
 * Express middleware for chaos isolation
 */
function createChaosIsolationMiddleware(config: IsolationConfig) {
  const isolator = new ChaosIsolator(config);

  return (req: Request, res: Response, next: NextFunction) => {
    const result = isolator.shouldRouteToChaos(req);

    // Mark request for observability
    req.headers[config.markHeader] = result.routeToChaos ? 'true' : 'false';
    req.headers['x-chaos-routing-reason'] = result.reason;

    // Store decision for downstream use
    (req as any).chaosRouting = result;

    next();
  };
}

export { ChaosIsolator, createChaosIsolationMiddleware, IsolationConfig };
```

Many organizations implement a 'VIP protection' pattern: high-value customers (enterprise accounts, top spenders) are never included in chaos experiments, regardless of other settings. This is implemented via the blacklist mechanism. While it means you're not testing the system with exactly the same traffic mix as production, it protects your most important relationships while you build confidence.
Blast radius minimization includes not just limiting impact during the experiment, but ensuring rapid recovery after the experiment ends—whether through normal completion or abort.
Recovery phases: stopping the injection, verifying it is actually removed, letting the system return to steady state, and confirming that metrics are back at baseline.
Measuring recovery:
Recovery time is a critical metric. Track at least three recovery measures: time to detect (TTD) that something is wrong, time to mitigate (TTM) the impact, and time to recover (TTR) to full steady state.
These times inform both experiment design and system resilience assessment. If TTR is 30 minutes for a 5-minute experiment, your blast radius was effectively 6x larger than planned. A small tracking sketch follows the target table below.
| System Tier | Target TTD | Target TTM | Target TTR |
|---|---|---|---|
| Tier 0 (Critical) | < 30 seconds | < 1 minute | < 5 minutes |
| Tier 1 (Core) | < 2 minutes | < 5 minutes | < 15 minutes |
| Tier 2 (Important) | < 5 minutes | < 10 minutes | < 30 minutes |
| Tier 3 (Standard) | < 15 minutes | < 30 minutes | < 60 minutes |
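To make these measures concrete, here is a small sketch that derives TTD, TTM, and TTR from experiment timestamps and checks them against the Tier 1 targets in the table above. The timestamps, the baseline convention (measured from when impact begins), and the helper names are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Tier 1 (Core) targets from the table above, in seconds
TARGETS = {"ttd": 120, "ttm": 300, "ttr": 900}


def recovery_measures(impact_start, detected, mitigated, recovered):
    """Compute time-to-detect, time-to-mitigate, and time-to-recover in seconds."""
    measures = {
        "ttd": (detected - impact_start).total_seconds(),
        "ttm": (mitigated - impact_start).total_seconds(),
        "ttr": (recovered - impact_start).total_seconds(),
    }
    missed = {name: value for name, value in measures.items() if value > TARGETS[name]}
    return measures, missed


# Hypothetical experiment that began causing impact at 2:05 PM
t0 = datetime(2024, 1, 1, 14, 5, 0)
measures, missed = recovery_measures(
    impact_start=t0,
    detected=t0 + timedelta(seconds=45),
    mitigated=t0 + timedelta(minutes=4),
    recovered=t0 + timedelta(minutes=12),
)
print(measures)  # {'ttd': 45.0, 'ttm': 240.0, 'ttr': 720.0}
print(missed)    # {} -> every Tier 1 target met
```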
One of the most dangerous failure modes in chaos engineering is the 'zombie injection': an experiment that was supposed to end but didn't. The abort was triggered, but the injection persisted. Always verify that injections are actually removed, not just that the abort command was sent. Implement idempotent cleanup and verification steps.
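A minimal sketch of that "verify, don't just trust the abort" idea, assuming hypothetical `is_active` and `remove` callables that wrap whatever your injection tooling actually exposes:

```python
import time


def remove_injection(injection_id: str, is_active, remove, retries: int = 3) -> bool:
    """
    Idempotently remove a chaos injection and verify it is actually gone.

    is_active(injection_id) -> bool and remove(injection_id) are stand-ins for
    your injection tool's API; remove must be safe to call more than once.
    """
    for attempt in range(retries):
        if not is_active(injection_id):
            return True              # Already gone: nothing to do (idempotent)
        remove(injection_id)         # Issue the removal
        time.sleep(2 ** attempt)     # Give the system a moment, back off on retries
        if not is_active(injection_id):
            return True              # Verified removed, not just "abort command sent"
    return False                     # Still active after retries: escalate to a human
```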
Technical blast radius control enables chaos engineering. But organizational confidence enables chaos engineering to be adopted and sustained. The two are deeply connected.
The trust cycle:
Blast radius control and organizational trust reinforce each other:
Small Blast Radius → Safe Experiments → Successful Results →
→ Increased Trust → Approval for Larger Blast Radius →
→ More Learning → More Trust → ...
Violating this cycle—running large blast radius experiments before earning trust—doesn't just risk technical problems. It risks killing the chaos program entirely as stakeholders withdraw support.
Stakeholder communication for blast radius:
Different stakeholders need different framing:
For executives: "We're running controlled experiments that affect less than 1% of traffic for 5 minutes. The expected learning is worth far more than the tiny risk. Each experiment we run makes our next real incident smaller."
For product managers: "Your feature is resilient to database failures—we tested it. Here's our plan to test the next three failure modes. We'll keep blast radius under 5% until we're confident."
For on-call engineers: "There's a chaos experiment running from 2-2:15 PM affecting the order service. You might see latency spikes. Here's how to abort if needed. Here's the dashboard to watch."
For customer support: "For the next 15 minutes, about 1% of users might see slower checkout. If customers report issues, check [dashboard]. This is a planned test to make the system more reliable."
Every chaos program faces a critical test: the first time an experiment causes unexpected impact. How you handle this moment determines the program's future. If you've maintained small blast radius, communicated well, and have good abort mechanisms, the impact will be small and recovery will be fast. Your response—acknowledging the issue, explaining causes, describing fixes—builds or erodes trust. Prepare for this moment before it happens.
The fifth and final principle of chaos engineering—minimizing blast radius—is the safety framework that makes all the other principles possible.
Conclusion: The Five Principles Together
With all five principles mastered, you have the complete framework for practicing chaos engineering:

1. Build a hypothesis around steady-state behavior.
2. Vary real-world events.
3. Run experiments in production.
4. Automate experiments to run continuously.
5. Minimize blast radius.
These principles work together. You can't run production experiments without blast radius control. You can't automate without steady state definition. Each principle strengthens the others.
You have now mastered all five principles of chaos engineering. You understand the scientific foundation, the failure taxonomy, production experimentation, automation strategy, and safety controls. In the next modules, you'll apply these principles to specific failure injection techniques and organizational implementation.