Network failure injection tests the infrastructure layer—the pipes through which your services communicate. Service failure injection targets the services themselves—the application processes that receive requests, process data, and produce responses.
Service failures are arguably more diverse than network failures because they encompass everything that can go wrong within an application: crashed processes, unhandled exceptions, error responses, unavailable dependencies, and inconsistent deployments.
Every external dependency your service relies on—databases, caches, APIs, microservices—represents a service that can fail. Understanding how your system behaves when these failures occur is crucial for building resilient distributed systems.
By the end of this page, you will understand how to inject and analyze service failures: process termination, exception injection, error response manipulation, and dependency unavailability. You'll learn to test circuit breakers, fallback mechanisms, retry logic, and graceful degradation—the defensive mechanisms that determine whether a dependency failure becomes a cascading outage.
Process termination is the most straightforward service failure: the service stops running. This can happen gracefully (SIGTERM allowing cleanup) or abruptly (SIGKILL forcing immediate termination). In production, processes terminate for many reasons: deployments and autoscaling replace instances, the kernel's OOM killer reclaims memory, and unhandled exceptions crash the process outright.
Graceful vs. Abrupt Termination:
The behavior differs significantly:
| Termination Type | Signal | What Happens | Time for Cleanup |
|---|---|---|---|
| Graceful | SIGTERM | Process receives signal, can handle in-flight requests | Configurable (typically 30s) |
| Abrupt | SIGKILL | Process immediately terminated | None |
| OOM Kill | SIGKILL | Kernel selects and kills process | None |
| Crash | N/A | Process terminates from exception | None |
Testing both types is essential because they exercise different code paths. Graceful termination tests your shutdown hooks and connection draining. Abrupt termination tests how other services handle sudden disconnection and whether in-flight requests are handled correctly.
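Graceful termination is only as good as the shutdown hook behind it. Below is a minimal sketch of a SIGTERM handler for a Node.js service that stops accepting new connections and drains in-flight requests; the Express setup, port, and 30-second budget are illustrative assumptions rather than anything prescribed above.

```typescript
// Minimal graceful-shutdown sketch for a Node.js HTTP service.
// The app, port, and grace period are illustrative assumptions.
import express from 'express';

const app = express();
app.get('/healthz', (_req, res) => res.send('ok'));

const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections');

  // Stop accepting new connections; let in-flight requests finish.
  server.close(() => {
    console.log('All connections drained, exiting');
    process.exit(0);
  });

  // Safety net: if draining exceeds the grace period, exit anyway
  // (Kubernetes follows up with SIGKILL at roughly the same point).
  setTimeout(() => process.exit(1), 30_000).unref();
});
```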
```bash
# Basic process termination
kill -SIGTERM <pid>   # Graceful shutdown
kill -SIGKILL <pid>   # Abrupt termination

# Kubernetes pod termination
kubectl delete pod <pod-name>                           # Graceful
kubectl delete pod <pod-name> --grace-period=0 --force  # Abrupt

# Docker container termination
docker stop <container>   # Graceful (SIGTERM + timeout + SIGKILL)
docker kill <container>   # Abrupt (SIGKILL)

# Kill random pod from a deployment
RANDOM_POD=$(kubectl get pods -l app=my-service -o name | shuf -n 1)
kubectl delete $RANDOM_POD

# Continuous pod killing (kill one every 30 seconds)
while true; do
  RANDOM_POD=$(kubectl get pods -l app=my-service -o name | shuf -n 1)
  kubectl delete $RANDOM_POD --grace-period=0 --force
  echo "Killed: $RANDOM_POD"
  sleep 30
done

# Kill on a schedule using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-chaos
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: my-service
  scheduler:
    cron: "*/2 * * * *"  # Every 2 minutes
EOF
```

| Observation | What It Indicates | Expected Behavior |
|---|---|---|
| Error spike then recovery | Load balancer detected failure and rerouted | Normal redundancy working |
| Prolonged errors after kill | Health check or service discovery too slow | Tune health check intervals |
| Duplicate processing after restart | In-flight requests retried | Ensure idempotency |
| Data inconsistency after restart | Transaction wasn't completed | Review transaction handling |
| Cascading failures to upstream | No circuit breaker protection | Implement circuit breakers |
| Connection errors in dependent services | Connections not properly drained | Implement graceful shutdown |
The most common issue revealed by process termination testing is improper handling of in-flight requests. When a service dies mid-request, clients may retry. If the operation isn't idempotent, retries can cause duplicate charges, duplicate emails, or duplicate database entries. Ensure all mutating operations have idempotency keys.
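One way to make retries safe is to key each mutating request with an idempotency token supplied by the client. The sketch below illustrates the idea with an Express endpoint; the in-memory map stands in for a shared store such as Redis, and the endpoint and helper names are hypothetical.

```typescript
// Sketch of idempotency-key handling for a mutating endpoint.
// The Map stands in for a shared store; names are illustrative.
import express from 'express';

const app = express();
app.use(express.json());

const processed = new Map<string, unknown>(); // key -> previous response

app.post('/charges', async (req, res) => {
  const key = req.header('Idempotency-Key');
  if (!key) {
    return res.status(400).json({ error: 'Idempotency-Key header required' });
  }

  // A retry arriving after the original request succeeded gets the
  // stored result instead of triggering a second charge.
  if (processed.has(key)) {
    return res.status(200).json(processed.get(key));
  }

  const result = await chargeCustomer(req.body); // hypothetical helper
  processed.set(key, result);
  res.status(201).json(result);
});

async function chargeCustomer(payload: unknown) {
  return { chargeId: 'ch_123', payload }; // placeholder implementation
}
```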
Exception injection causes services to throw exceptions at specific code paths without terminating the entire process. This is more surgical than process termination—it tests error handling for specific operations while keeping the service otherwise functional.
Why Exception Injection Matters:
In a well-designed service, most code paths have error handling. But exception handling code is often the least-tested code in any codebase. Developers write happy-path tests, maybe some obvious error cases, but rarely test every exception path systematically.
Exception injection reveals how that least-tested code actually behaves: whether errors are handled cleanly, swallowed silently, or allowed to crash the process.
```typescript
// Example: Exception injection using a wrapper/interceptor pattern
// This shows how to build exception injection into an application

interface ExceptionConfig {
  probability: number;    // 0-1, chance of throwing
  exceptionType: string;  // Type of exception to throw
  targetMethod?: string;  // Which method to target
  targetUser?: string;    // Optionally target specific user
}

class ExceptionInjector {
  private config: ExceptionConfig | null = null;

  configure(config: ExceptionConfig) {
    this.config = config;
  }

  maybeThrow(context: { method: string; userId?: string }) {
    if (!this.config) return;

    // Check if this context matches targeting criteria
    if (this.config.targetMethod && this.config.targetMethod !== context.method) {
      return;
    }
    if (this.config.targetUser && this.config.targetUser !== context.userId) {
      return;
    }

    // Probabilistic exception
    if (Math.random() < this.config.probability) {
      throw new Error(`Injected exception: ${this.config.exceptionType}`);
    }
  }

  clear() {
    this.config = null;
  }
}

// Usage in a service method
const injector = new ExceptionInjector();

async function getUserOrders(userId: string) {
  // Exception injection point
  injector.maybeThrow({ method: 'getUserOrders', userId });

  // Normal operation
  return await database.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
}

// Configure via API during chaos experiment
app.post('/chaos/exceptions', (req, res) => {
  injector.configure(req.body);
  res.json({ status: 'configured' });
});
```

Integration with Chaos Platforms:
Mature chaos engineering platforms provide built-in exception injection:
| Tool | Exception Injection Method | Configuration |
|---|---|---|
| Gremlin | Application-level SDK with dashboard control | Code instrumentation + API |
| Chaos Monkey for Spring Boot | Annotations on methods | YAML configuration |
| Failure Flags (LaunchDarkly) | Feature flags that throw exceptions | Dashboard + SDK |
| Byteman (Java) | Bytecode manipulation at runtime | Script language |
| Custom middleware | HTTP interceptor that can inject failures | Request headers |
For production chaos engineering, prefer platforms that allow centralized control, audit logging, and automatic experiment termination.
Exception injection is an excellent opportunity to verify your monitoring and alerting. When you inject exceptions, you should see corresponding alerts fire, dashboards update, and logs capture the error details. If your observability doesn't detect injected exceptions, it won't detect real ones either.
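A simple way to make this check repeatable is to script it: enable injection, drive traffic, and assert that the failure shows up in telemetry. The sketch below assumes the `/chaos/exceptions` endpoint from the injector example earlier, plus a Prometheus-style `/metrics` endpoint and a metric name that are assumptions about your service, not facts from this page.

```typescript
// Experiment script: inject exceptions, then verify they are observable.
// Endpoint paths and the metric name are illustrative assumptions.
import axios from 'axios';

async function runObservabilityCheck(baseUrl: string) {
  // 1. Inject an exception into one method for every request
  await axios.post(`${baseUrl}/chaos/exceptions`, {
    probability: 1.0,
    exceptionType: 'InjectedDatabaseError',
    targetMethod: 'getUserOrders',
  });

  // 2. Drive some traffic at the affected endpoint (errors expected)
  for (let i = 0; i < 20; i++) {
    await axios.get(`${baseUrl}/users/123/orders`).catch(() => undefined);
  }

  // 3. Verify the failures are visible in exported metrics
  const metrics = await axios.get(`${baseUrl}/metrics`);
  if (!String(metrics.data).includes('http_requests_errors_total')) {
    throw new Error('Injected exceptions are not visible in metrics');
  }
  console.log('Injected exceptions observed in telemetry');
}
```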
Error response manipulation makes services return HTTP error codes or application-level error responses. Unlike exception injection (which happens within the service), error response manipulation tests how calling services handle errors from their dependencies.
HTTP Error Categories:
Different error codes should trigger different behaviors:
| Code Range | Meaning | Expected Client Behavior |
|---|---|---|
| 4xx | Client error | Generally don't retry (except 429) |
| 400 | Bad request | Log error, possibly fix and retry |
| 401/403 | Auth failure | Re-authenticate, then retry |
| 404 | Not found | Cache negative result |
| 429 | Rate limited | Backoff and retry |
| 5xx | Server error | Retry with backoff |
| 500 | Internal error | Retry with backoff |
| 502/503/504 | Upstream issues | Retry with backoff |
Testing each category verifies that your services implement appropriate retry policies and error handling for each case.
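As one possible shape for such a policy, the sketch below retries 429 and 5xx responses with exponential backoff while treating other 4xx responses as permanent failures; the attempt count, timeout, and delays are illustrative assumptions.

```typescript
// Status-aware retry policy mirroring the table above.
// Thresholds and backoff values are illustrative assumptions.
import axios, { AxiosError } from 'axios';

function isRetryable(status: number): boolean {
  if (status === 429) return true; // rate limited: back off and retry
  if (status >= 500) return true;  // server errors: retry with backoff
  return false;                    // other 4xx: don't retry
}

async function getWithRetry(url: string, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await axios.get(url, { timeout: 5000 });
    } catch (err) {
      const status = (err as AxiosError).response?.status;
      // No response at all (timeout, connection reset) is treated as retryable
      const retryable = status === undefined || isRetryable(status);
      if (!retryable || attempt >= maxAttempts) throw err;

      // Exponential backoff: 200ms, 400ms, 800ms, ...
      const delayMs = 200 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```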
```yaml
# Istio VirtualService for HTTP fault injection
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-fault
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 10      # 10% of requests
        httpStatus: 503  # Return 503 Service Unavailable
    route:
    - destination:
        host: payment-service
---
# Return 429 with delay for rate limit testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-rate-limit-test
spec:
  hosts:
  - external-api
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 2s
      abort:
        percentage:
          value: 50
        httpStatus: 429
    route:
    - destination:
        host: external-api
---
# Chaos Mesh HTTPChaos for Kubernetes
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-abort-chaos
spec:
  mode: all
  selector:
    labelSelectors:
      app: order-service
  target: Response
  port: 8080
  path: /api/orders/*
  method: POST
  abort: true
  code: 500
  duration: "5m"
```

Cascading Error Propagation:
When injecting errors, observe how they propagate through your service graph: does the failure stay contained to the dependency's direct callers, or does it surface in services several hops away?
What to Look For:
Error injection often reveals that services leak internal details in error responses. A 500 error might include stack traces, database query details, or internal service names. These are security vulnerabilities. Verify that error responses to external clients contain safe, generic error messages.
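A common mitigation is a catch-all error handler that logs full detail internally but returns only a generic payload to clients. The sketch below shows one way to do this in Express; the field names and correlation header are assumptions.

```typescript
// Sketch of an Express error handler that keeps internal details out of
// client-facing responses; field names are illustrative assumptions.
import express, { NextFunction, Request, Response } from 'express';

const app = express();

app.use((err: Error, req: Request, res: Response, _next: NextFunction) => {
  // Full detail goes to logs/observability, never to the client
  console.error('request failed', { path: req.path, error: err.stack });

  // Clients get a stable, generic payload plus a correlation id
  res.status(500).json({
    error: 'internal_error',
    requestId: req.header('x-request-id') ?? 'unknown',
  });
});
```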
Dependency unavailability testing makes specific downstream dependencies unreachable or unresponsive while keeping the service under test running. This is different from network partitions (which affect all traffic) and from process termination (which kills the dependency entirely). The goal is to test how your service behaves when one specific dependency fails.
Common Dependency Types:
Modern services typically have multiple dependency types:
| Dependency Type | Examples | Typical Failure Mode | Impact When Unavailable |
|---|---|---|---|
| Database | PostgreSQL, MySQL, DynamoDB | Connection failure or slowness | Data access fails |
| Cache | Redis, Memcached | Connection failure | Increased DB load, latency |
| Message Queue | Kafka, RabbitMQ, SQS | Publish/consume failure | Async processing stops |
| External API | Payment gateway, Email service | HTTP errors or timeout | Feature unavailable |
| Internal Service | Auth service, User service | Any of the above | Cascading failure risk |
| Configuration | Feature flags, Config service | Read failure | Stale or default config |
Each dependency type requires different handling when unavailable, and testing should verify appropriate behavior for each.
```bash
# Block traffic to specific dependency (database example)
# Find database IP if using hostname
DB_IP=$(dig +short database.internal)

# Block all traffic to database
sudo iptables -A OUTPUT -d $DB_IP -j DROP

# Block only database port (allow other traffic to same host)
sudo iptables -A OUTPUT -d $DB_IP -p tcp --dport 5432 -j DROP

# Using Toxiproxy for application-level dependency blocking
# First, configure app to connect via Toxiproxy
toxiproxy-cli create postgres -l localhost:25432 -u database.internal:5432
toxiproxy-cli create redis -l localhost:26379 -u redis.internal:6379

# Disable specific dependency
toxiproxy-cli toggle postgres   # Stop proxying (dependency unavailable)

# Add latency before making unavailable (simulates slow death)
toxiproxy-cli toxic add postgres -t latency -a latency=5000
sleep 10
toxiproxy-cli toggle postgres

# Kubernetes: Scale deployment to 0 replicas
kubectl scale deployment payment-service --replicas=0

# Kubernetes: NetworkPolicy that blocks order-service from reaching
# payment-service. NetworkPolicies are allow-lists, so we allow ingress
# to payment-service from every pod EXCEPT order-service.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-payment-service
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchExpressions:
        - key: app
          operator: NotIn
          values:
          - order-service
EOF

# Delete the NetworkPolicy to restore connectivity
kubectl delete networkpolicy block-payment-service
```

| Dependency | Minimal Viable Behavior | Good Behavior | Excellent Behavior |
|---|---|---|---|
| Primary DB | Return error to user | Serve stale reads from replica | Serve cached data + queue writes |
| Cache | Fall back to direct DB queries | Degrade gracefully with slower responses | Local in-memory cache as L2 |
| Auth Service | Return auth error | Honor cached/unexpired tokens | Allow graceful window after expiry |
| Payment Gateway | Return payment error | Queue transaction for retry | Offer alternative payment method |
| Config Service | Use last known config | Fall back to bundled defaults | Continue with safe defaults + alert |
Not all dependencies are equal. Classify each as critical (service cannot function without it), degradable (service can function with reduced capability), or optional (nice to have but not essential). Dependency unavailability testing should verify that your classification is correct—that degradable dependencies truly allow degradation, and critical dependencies truly are critical.
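For a degradable dependency such as a cache, the test should confirm that requests still succeed, just more slowly. The sketch below shows the pattern for a read path; the `redis` and `db` clients and the key naming are assumptions standing in for real drivers.

```typescript
// Sketch of treating a cache as a degradable dependency: if Redis is
// unreachable, fall back to the database instead of failing the request.
// Assumed clients; in a real service these would be ioredis / a DB driver.
declare const redis: {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
};
declare const db: {
  query(sql: string, params: unknown[]): Promise<unknown>;
};

async function getProfile(userId: string) {
  try {
    const cached = await redis.get(`profile:${userId}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    // Cache unavailable: log it, emit a metric, and degrade to the database
    console.warn('cache unavailable, falling back to DB', err);
  }

  const profile = await db.query('SELECT * FROM profiles WHERE user_id = ?', [userId]);

  // Best-effort repopulation; never let a cache write failure break the request
  redis.set(`profile:${userId}`, JSON.stringify(profile)).catch(() => undefined);
  return profile;
}
```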
Cascading failures occur when a failure in one service propagates to cause failures in dependent services, which in turn cause failures in their dependents, potentially bringing down an entire system. These are among the most dangerous failure modes because they amplify a single point of failure into a system-wide outage.
The Cascade Mechanism:
A cascade typically starts with one slow or failing dependency. Callers waiting on it hold connections and threads, their queues back up, and they begin failing their own callers; the pattern repeats up the call graph until user-facing services are affected.
Why Cascades Happen:
Cascading failures typically occur because services lack adequate timeouts, circuit breakers, bulkheads, backpressure, and deadline propagation, as the example below illustrates.
```typescript
// Example: Implementing cascade prevention mechanisms

import CircuitBreaker from 'opossum';

// 1. Timeouts - Never wait forever
const httpClient = axios.create({
  timeout: 5000, // 5 second timeout
});

// 2. Circuit Breaker - Stop calling failing services
const breaker = new CircuitBreaker(callPaymentService, {
  timeout: 3000,                // Trip after 3s timeout
  errorThresholdPercentage: 50, // Trip at 50% errors
  resetTimeout: 30000,          // Try again after 30s
});

breaker.fallback(() => {
  // Return cached/default response when circuit is open
  return { status: 'pending', source: 'fallback' };
});

// 3. Bulkhead - Isolate failure domains
const bulkheadedPool = {
  payments: new ThreadPool(10),     // Max 10 concurrent payment calls
  inventory: new ThreadPool(10),    // Separate pool for inventory
  notifications: new ThreadPool(5), // Lower priority, smaller pool
};

// 4. Backpressure - Reject when overloaded
app.use((req, res, next) => {
  if (requestQueue.length > MAX_QUEUE_SIZE) {
    return res.status(503).json({ error: 'Service overloaded', retryAfter: 30 });
  }
  next();
});

// 5. Deadline propagation - Don't start what you can't finish
async function processOrder(order: Order, deadline: Date) {
  const remaining = deadline.getTime() - Date.now();
  if (remaining < MIN_PROCESSING_TIME) {
    throw new DeadlineExceededException('Not enough time to process');
  }

  // Propagate reduced deadline to downstream calls
  const downstreamDeadline = new Date(deadline.getTime() - BUFFER_MS);
  await paymentService.charge(order.payment, { deadline: downstreamDeadline });
}
```

Measuring Cascade Resistance:
During cascade failure experiments, measure how far the initial failure spreads, how quickly each affected service detects and isolates it, and how long the system takes to recover once the fault is removed.
Healthy system behaviors: the failure stays contained to the failing service's direct callers, circuit breakers open and serve fallbacks, and normal operation resumes shortly after the dependency recovers.
When a service fails, clients retry. If retries are immediate and aggressive, the failing service faces even more load than normal—potentially preventing recovery. Combined with cascading failures, retry storms can turn a minor hiccup into a prolonged outage. Always use exponential backoff with jitter for retries.
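A minimal sketch of backoff with full jitter is shown below; the base delay, cap, and attempt limit are illustrative assumptions.

```typescript
// Retry helper with exponential backoff and "full jitter": each wait is a
// random value between 0 and an exponentially growing cap, so retrying
// clients spread out instead of returning in synchronized waves.
async function retryWithJitter<T>(
  fn: () => Promise<T>,
  { maxAttempts = 5, baseMs = 100, capMs = 10_000 } = {},
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
      const sleepMs = Math.random() * ceiling; // full jitter
      await new Promise((resolve) => setTimeout(resolve, sleepMs));
    }
  }
}

// Usage (hypothetical client): retryWithJitter(() => paymentClient.charge(order))
```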
Deployment failures represent a special category of service failure where different instances of the same service run different versions or configurations. These are dangerous because they can cause subtle inconsistencies that are difficult to detect and debug.
Deployment Failure Types:
| Failure Type | Description | Risk |
|---|---|---|
| Partial rollout failure | Only some instances get the new version | Version-dependent behavior differences |
| Configuration drift | Instances have different configs | Inconsistent behavior across requests |
| Failed rollback | Rollback doesn't fully complete | Mix of old and new versions |
| Database mismatch | Code and schema versions don't match | Query failures, data corruption |
| Contract violation | New version breaks API contract | Dependent services fail |
| Feature flag split | Different flag states across instances | Inconsistent feature availability |
```bash
# Kubernetes: Simulate stuck partial deployment
# by manually setting different images for pods

# First, deploy version 1.0
kubectl set image deployment/my-service my-service=myapp:1.0

# Scale to ensure multiple replicas
kubectl scale deployment/my-service --replicas=6

# Manually patch half the pods to run version 1.1
# (simulates partial rollout failure)
PODS=$(kubectl get pods -l app=my-service -o name | head -3)
for pod in $PODS; do
  kubectl patch $pod -p '{"spec":{"containers":[{"name":"my-service","image":"myapp:1.1"}]}}'
done
```

```yaml
# Istio: Traffic splitting between versions for testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 50
    - destination:
        host: my-service
        subset: v2  # Simulated "bad" version
      weight: 50
---
# DestinationRule for version subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-versions
spec:
  host: my-service
  subsets:
  - name: v1
    labels:
      version: v1.0.0
  - name: v2
    labels:
      version: v1.1.0-broken
```

Deployment failure testing is most valuable when you're changing APIs or data formats. Before deploying breaking changes, simulate a mixed-version environment to verify that both old and new versions can coexist during the deployment window. This is essential for zero-downtime deployments.
Various tools support service-level failure injection, from simple scripts to sophisticated chaos platforms:
| Tool | Scope | Service Failure Capabilities | Best For |
|---|---|---|---|
| Chaos Monkey | AWS, K8s | Random instance termination | Long-running chaos, Netflix-style |
| Gremlin | Multi-platform | Process kill, resource attacks, state | Enterprise with approval workflows |
| Chaos Mesh | Kubernetes | Pod chaos, HTTP chaos, IO chaos | Native K8s, open source |
| LitmusChaos | Kubernetes | Wide variety of fault types | K8s with workflow engine |
| Toxiproxy | TCP proxy | Connection manipulation | Application-level testing |
| Chaos Toolkit | Multi-platform | Extensible chaos experiments | Custom experiment definitions |
| AWS FIS | AWS | AWS service-specific faults | AWS-native environments |
```yaml
# Comprehensive Pod Chaos example using Chaos Mesh

# Random pod kill every 2 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-chaos
spec:
  action: pod-kill
  mode: one  # Kill one random pod
  selector:
    namespaces:
    - production
    labelSelectors:
      app: my-service
  scheduler:
    cron: "*/2 * * * *"  # Every 2 minutes
---
# Pod failure (make container fail)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-chaos
spec:
  action: pod-failure
  mode: fixed-percent
  value: "30"  # Fail 30% of pods
  duration: "5m"
  selector:
    labelSelectors:
      app: payment-service
---
# Container kill (kill specific container, pod restarts it)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-chaos
spec:
  action: container-kill
  mode: one
  containerNames:
  - main-container  # Target specific container in pod
  selector:
    labelSelectors:
      app: multi-container-service
  duration: "10m"
```

Service failure injection tests the application layer—how your code handles errors, unavailable dependencies, and unexpected conditions. Unlike network failures that test infrastructure, service failures test your error handling, resilience patterns, and graceful degradation logic.
What's Next:
With network and service failures covered, we'll now examine Resource Exhaustion—CPU saturation, memory pressure, disk exhaustion, and file descriptor limits. These failure modes cause gradual degradation rather than immediate failure, making them particularly challenging to detect and handle.
You now understand how to inject and analyze service failures—process termination, exception injection, error responses, dependency unavailability, and cascading failures. These techniques reveal weaknesses in your application's error handling and resilience patterns. Next, we'll explore resource exhaustion scenarios.