Network failure injection tests the infrastructure layer—the pipes through which your services communicate. Service failure injection targets the services themselves—the application processes that receive requests, process data, and produce responses.
Service failures are arguably more diverse than network failures because they encompass everything that can go wrong within an application: crashed processes, unhandled exceptions, error responses, unavailable dependencies, and inconsistent deployments.
Every external dependency your service relies on—databases, caches, APIs, microservices—represents a service that can fail. Understanding how your system behaves when these failures occur is crucial for building resilient distributed systems.
By the end of this page, you will understand how to inject and analyze service failures: process termination, exception injection, error response manipulation, and dependency unavailability. You'll learn to test circuit breakers, fallback mechanisms, retry logic, and graceful degradation—the defensive mechanisms that determine whether a dependency failure becomes a cascading outage.
Process termination is the most straightforward service failure: the service stops running. This can happen gracefully (SIGTERM allowing cleanup) or abruptly (SIGKILL forcing immediate termination). In production, processes terminate for many reasons: deployments and autoscaling replace instances, the kernel's OOM killer reclaims memory, and unhandled exceptions crash the process outright.
Graceful vs. Abrupt Termination:
The behavior differs significantly:
| Termination Type | Signal | What Happens | Time for Cleanup |
|---|---|---|---|
| Graceful | SIGTERM | Process receives signal, can handle in-flight requests | Configurable (typically 30s) |
| Abrupt | SIGKILL | Process immediately terminated | None |
| OOM Kill | SIGKILL | Kernel selects and kills process | None |
| Crash | N/A | Process terminates from exception | None |
Testing both types is essential because they exercise different code paths. Graceful termination tests your shutdown hooks and connection draining. Abrupt termination tests how other services handle sudden disconnection and whether in-flight requests are handled correctly.
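Graceful termination is only as good as the shutdown hook behind it. Below is a minimal sketch of a SIGTERM handler for a Node.js service that stops accepting new connections and drains in-flight requests; the Express setup, port, and 30-second budget are illustrative assumptions rather than anything prescribed above.

```typescript
// Minimal graceful-shutdown sketch for a Node.js HTTP service.
// The app, port, and grace period are illustrative assumptions.
import express from 'express';

const app = express();
app.get('/healthz', (_req, res) => res.send('ok'));

const server = app.listen(8080);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, draining connections');

  // Stop accepting new connections; let in-flight requests finish.
  server.close(() => {
    console.log('All connections drained, exiting');
    process.exit(0);
  });

  // Safety net: if draining exceeds the grace period, exit anyway
  // (Kubernetes follows up with SIGKILL at roughly the same point).
  setTimeout(() => process.exit(1), 30_000).unref();
});
```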
```bash
# Basic process termination
kill -SIGTERM <pid>   # Graceful shutdown
kill -SIGKILL <pid>   # Abrupt termination

# Kubernetes pod termination
kubectl delete pod <pod-name>                           # Graceful
kubectl delete pod <pod-name> --grace-period=0 --force  # Abrupt

# Docker container termination
docker stop <container>   # Graceful (SIGTERM + timeout + SIGKILL)
docker kill <container>   # Abrupt (SIGKILL)

# Kill random pod from a deployment
RANDOM_POD=$(kubectl get pods -l app=my-service -o name | shuf -n 1)
kubectl delete $RANDOM_POD

# Continuous pod killing (kill one every 30 seconds)
while true; do
  RANDOM_POD=$(kubectl get pods -l app=my-service -o name | shuf -n 1)
  kubectl delete $RANDOM_POD --grace-period=0 --force
  echo "Killed: $RANDOM_POD"
  sleep 30
done

# Kill on a schedule using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-chaos
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: my-service
  scheduler:
    cron: "*/2 * * * *"  # Every 2 minutes
EOF
```

| Observation | What It Indicates | Expected Behavior |
|---|---|---|
| Error spike then recovery | Load balancer detected failure and rerouted | Normal redundancy working |
| Prolonged errors after kill | Health check or service discovery too slow | Tune health check intervals |
| Duplicate processing after restart | In-flight requests retried | Ensure idempotency |
| Data inconsistency after restart | Transaction wasn't completed | Review transaction handling |
| Cascading failures to upstream | No circuit breaker protection | Implement circuit breakers |
| Connection errors in dependent services | Connections not properly drained | Implement graceful shutdown |
The most common issue revealed by process termination testing is improper handling of in-flight requests. When a service dies mid-request, clients may retry. If the operation isn't idempotent, retries can cause duplicate charges, duplicate emails, or duplicate database entries. Ensure all mutating operations have idempotency keys.
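One way to make retries safe is to key each mutating request with an idempotency token supplied by the client. The sketch below illustrates the idea with an Express endpoint; the in-memory map stands in for a shared store such as Redis, and the endpoint and helper names are hypothetical.

```typescript
// Sketch of idempotency-key handling for a mutating endpoint.
// The Map stands in for a shared store; names are illustrative.
import express from 'express';

const app = express();
app.use(express.json());

const processed = new Map<string, unknown>(); // key -> previous response

app.post('/charges', async (req, res) => {
  const key = req.header('Idempotency-Key');
  if (!key) {
    return res.status(400).json({ error: 'Idempotency-Key header required' });
  }

  // A retry arriving after the original request succeeded gets the
  // stored result instead of triggering a second charge.
  if (processed.has(key)) {
    return res.status(200).json(processed.get(key));
  }

  const result = await chargeCustomer(req.body); // hypothetical helper
  processed.set(key, result);
  res.status(201).json(result);
});

async function chargeCustomer(payload: unknown) {
  return { chargeId: 'ch_123', payload }; // placeholder implementation
}
```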
Exception injection causes services to throw exceptions at specific code paths without terminating the entire process. This is more surgical than process termination—it tests error handling for specific operations while keeping the service otherwise functional.
Why Exception Injection Matters:
In a well-designed service, most code paths have error handling. But exception handling code is often the least-tested code in any codebase. Developers write happy-path tests, maybe some obvious error cases, but rarely test every exception path systematically.
Exception injection reveals how that least-tested code actually behaves: whether errors are handled cleanly, swallowed silently, or allowed to crash the process.
```typescript
// Example: Exception injection using a wrapper/interceptor pattern
// This shows how to build exception injection into an application

interface ExceptionConfig {
  probability: number;    // 0-1, chance of throwing
  exceptionType: string;  // Type of exception to throw
  targetMethod?: string;  // Which method to target
  targetUser?: string;    // Optionally target specific user
}

class ExceptionInjector {
  private config: ExceptionConfig | null = null;

  configure(config: ExceptionConfig) {
    this.config = config;
  }

  maybeThrow(context: { method: string; userId?: string }) {
    if (!this.config) return;

    // Check if this context matches targeting criteria
    if (this.config.targetMethod && this.config.targetMethod !== context.method) {
      return;
    }
    if (this.config.targetUser && this.config.targetUser !== context.userId) {
      return;
    }

    // Probabilistic exception
    if (Math.random() < this.config.probability) {
      throw new Error(`Injected exception: ${this.config.exceptionType}`);
    }
  }

  clear() {
    this.config = null;
  }
}

// Usage in a service method
const injector = new ExceptionInjector();

async function getUserOrders(userId: string) {
  // Exception injection point
  injector.maybeThrow({ method: 'getUserOrders', userId });

  // Normal operation
  return await database.query('SELECT * FROM orders WHERE user_id = ?', [userId]);
}

// Configure via API during chaos experiment
app.post('/chaos/exceptions', (req, res) => {
  injector.configure(req.body);
  res.json({ status: 'configured' });
});
```

Integration with Chaos Platforms:
Mature chaos engineering platforms provide built-in exception injection:
| Tool | Exception Injection Method | Configuration |
|---|---|---|
| Gremlin | Application-level SDK with dashboard control | Code instrumentation + API |
| Chaos Monkey for Spring Boot | Annotations on methods | YAML configuration |
| Failure Flags (LaunchDarkly) | Feature flags that throw exceptions | Dashboard + SDK |
| Byteman (Java) | Bytecode manipulation at runtime | Script language |
| Custom middleware | HTTP interceptor that can inject failures | Request headers |
For production chaos engineering, prefer platforms that allow centralized control, audit logging, and automatic experiment termination.
Exception injection is an excellent opportunity to verify your monitoring and alerting. When you inject exceptions, you should see corresponding alerts fire, dashboards update, and logs capture the error details. If your observability doesn't detect injected exceptions, it won't detect real ones either.
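A simple way to make this check repeatable is to script it: enable injection, drive traffic, and assert that the failure shows up in telemetry. The sketch below assumes the `/chaos/exceptions` endpoint from the injector example earlier, plus a Prometheus-style `/metrics` endpoint and a metric name that are assumptions about your service, not facts from this page.

```typescript
// Experiment script: inject exceptions, then verify they are observable.
// Endpoint paths and the metric name are illustrative assumptions.
import axios from 'axios';

async function runObservabilityCheck(baseUrl: string) {
  // 1. Inject an exception into one method for every request
  await axios.post(`${baseUrl}/chaos/exceptions`, {
    probability: 1.0,
    exceptionType: 'InjectedDatabaseError',
    targetMethod: 'getUserOrders',
  });

  // 2. Drive some traffic at the affected endpoint (errors expected)
  for (let i = 0; i < 20; i++) {
    await axios.get(`${baseUrl}/users/123/orders`).catch(() => undefined);
  }

  // 3. Verify the failures are visible in exported metrics
  const metrics = await axios.get(`${baseUrl}/metrics`);
  if (!String(metrics.data).includes('http_requests_errors_total')) {
    throw new Error('Injected exceptions are not visible in metrics');
  }
  console.log('Injected exceptions observed in telemetry');
}
```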
Error response manipulation makes services return HTTP error codes or application-level error responses. Unlike exception injection (which happens within the service), error response manipulation tests how calling services handle errors from their dependencies.
HTTP Error Categories:
Different error codes should trigger different behaviors:
| Code Range | Meaning | Expected Client Behavior |
|---|---|---|
| 4xx | Client error | Generally don't retry (except 429) |
| 400 | Bad request | Log error, possibly fix and retry |
| 401/403 | Auth failure | Re-authenticate, then retry |
| 404 | Not found | Cache negative result |
| 429 | Rate limited | Backoff and retry |
| 5xx | Server error | Retry with backoff |
| 500 | Internal error | Retry with backoff |
| 502/503/504 | Upstream issues | Retry with backoff |
Testing each category verifies that your services implement appropriate retry policies and error handling for each case.
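As one possible shape for such a policy, the sketch below retries 429 and 5xx responses with exponential backoff while treating other 4xx responses as permanent failures; the attempt count, timeout, and delays are illustrative assumptions.

```typescript
// Status-aware retry policy mirroring the table above.
// Thresholds and backoff values are illustrative assumptions.
import axios, { AxiosError } from 'axios';

function isRetryable(status: number): boolean {
  if (status === 429) return true; // rate limited: back off and retry
  if (status >= 500) return true;  // server errors: retry with backoff
  return false;                    // other 4xx: don't retry
}

async function getWithRetry(url: string, maxAttempts = 3) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await axios.get(url, { timeout: 5000 });
    } catch (err) {
      const status = (err as AxiosError).response?.status;
      // No response at all (timeout, connection reset) is treated as retryable
      const retryable = status === undefined || isRetryable(status);
      if (!retryable || attempt >= maxAttempts) throw err;

      // Exponential backoff: 200ms, 400ms, 800ms, ...
      const delayMs = 200 * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```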
```yaml
# Istio VirtualService for HTTP fault injection
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service-fault
spec:
  hosts:
  - payment-service
  http:
  - fault:
      abort:
        percentage:
          value: 10      # 10% of requests
        httpStatus: 503  # Return 503 Service Unavailable
    route:
    - destination:
        host: payment-service
---
# Return 429 with delay for rate limit testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-rate-limit-test
spec:
  hosts:
  - external-api
  http:
  - fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 2s
      abort:
        percentage:
          value: 50
        httpStatus: 429
    route:
    - destination:
        host: external-api
---
# Chaos Mesh HTTPChaos for Kubernetes
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: http-abort-chaos
spec:
  mode: all
  selector:
    labelSelectors:
      app: order-service
  target: Response
  port: 8080
  path: /api/orders/*
  method: POST
  abort: true
  code: 500
  duration: "5m"
```

Cascading Error Propagation:
When injecting errors, observe how they propagate through your service graph: does the failure stay contained to the dependency's direct callers, or does it surface in services several hops away?
What to Look For:
Error injection often reveals that services leak internal details in error responses. A 500 error might include stack traces, database query details, or internal service names. These are security vulnerabilities. Verify that error responses to external clients contain safe, generic error messages.
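A common mitigation is a catch-all error handler that logs full detail internally but returns only a generic payload to clients. The sketch below shows one way to do this in Express; the field names and correlation header are assumptions.

```typescript
// Sketch of an Express error handler that keeps internal details out of
// client-facing responses; field names are illustrative assumptions.
import express, { NextFunction, Request, Response } from 'express';

const app = express();

app.use((err: Error, req: Request, res: Response, _next: NextFunction) => {
  // Full detail goes to logs/observability, never to the client
  console.error('request failed', { path: req.path, error: err.stack });

  // Clients get a stable, generic payload plus a correlation id
  res.status(500).json({
    error: 'internal_error',
    requestId: req.header('x-request-id') ?? 'unknown',
  });
});
```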
Dependency unavailability testing makes specific downstream dependencies unreachable or unresponsive while keeping the service under test running. This is different from network partitions (which affect all traffic) and from process termination (which kills the dependency entirely). The goal is to test how your service behaves when one specific dependency fails.
Common Dependency Types:
Modern services typically have multiple dependency types:
| Dependency Type | Examples | Typical Failure Mode | Impact When Unavailable |
|---|---|---|---|
| Database | PostgreSQL, MySQL, DynamoDB | Connection failure or slowness | Data access fails |
| Cache | Redis, Memcached | Connection failure | Increased DB load, latency |
| Message Queue | Kafka, RabbitMQ, SQS | Publish/consume failure | Async processing stops |
| External API | Payment gateway, Email service | HTTP errors or timeout | Feature unavailable |
| Internal Service | Auth service, User service | Any of the above | Cascading failure risk |
| Configuration | Feature flags, Config service | Read failure | Stale or default config |
Each dependency type requires different handling when unavailable, and testing should verify appropriate behavior for each.
```bash
# Block traffic to specific dependency (database example)
# Find database IP if using hostname
DB_IP=$(dig +short database.internal)

# Block all traffic to database
sudo iptables -A OUTPUT -d $DB_IP -j DROP

# Block only database port (allow other traffic to same host)
sudo iptables -A OUTPUT -d $DB_IP -p tcp --dport 5432 -j DROP

# Using Toxiproxy for application-level dependency blocking
# First, configure app to connect via Toxiproxy
toxiproxy-cli create postgres -l localhost:25432 -u database.internal:5432
toxiproxy-cli create redis -l localhost:26379 -u redis.internal:6379

# Disable specific dependency
toxiproxy-cli toggle postgres   # Stop proxying (dependency unavailable)

# Add latency before making unavailable (simulates slow death)
toxiproxy-cli toxic add postgres -t latency -a latency=5000
sleep 10
toxiproxy-cli toggle postgres

# Kubernetes: Scale deployment to 0 replicas
kubectl scale deployment payment-service --replicas=0

# Kubernetes: NetworkPolicy that blocks order-service from reaching
# payment-service. NetworkPolicies are allow-lists, so we allow ingress
# to payment-service from every pod EXCEPT order-service.
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-payment-service
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchExpressions:
        - key: app
          operator: NotIn
          values:
          - order-service
EOF

# Delete the NetworkPolicy to restore connectivity
kubectl delete networkpolicy block-payment-service
```

| Dependency | Minimal Viable Behavior | Good Behavior | Excellent Behavior |
|---|---|---|---|
| Primary DB | Return error to user | Serve stale reads from replica | Serve cached data + queue writes |
| Cache | Fall back to direct DB queries | Degrade gracefully with slower responses | Local in-memory cache as L2 |
| Auth Service | Return auth error | Honor cached/unexpired tokens | Allow graceful window after expiry |
| Payment Gateway | Return payment error | Queue transaction for retry | Offer alternative payment method |
| Config Service | Use last known config | Fall back to bundled defaults | Continue with safe defaults + alert |
Not all dependencies are equal. Classify each as critical (service cannot function without it), degradable (service can function with reduced capability), or optional (nice to have but not essential). Dependency unavailability testing should verify that your classification is correct—that degradable dependencies truly allow degradation, and critical dependencies truly are critical.
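For a degradable dependency such as a cache, the test should confirm that requests still succeed, just more slowly. The sketch below shows the pattern for a read path; the `redis` and `db` clients and the key naming are assumptions standing in for real drivers.

```typescript
// Sketch of treating a cache as a degradable dependency: if Redis is
// unreachable, fall back to the database instead of failing the request.
// Assumed clients; in a real service these would be ioredis / a DB driver.
declare const redis: {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<unknown>;
};
declare const db: {
  query(sql: string, params: unknown[]): Promise<unknown>;
};

async function getProfile(userId: string) {
  try {
    const cached = await redis.get(`profile:${userId}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    // Cache unavailable: log it, emit a metric, and degrade to the database
    console.warn('cache unavailable, falling back to DB', err);
  }

  const profile = await db.query('SELECT * FROM profiles WHERE user_id = ?', [userId]);

  // Best-effort repopulation; never let a cache write failure break the request
  redis.set(`profile:${userId}`, JSON.stringify(profile)).catch(() => undefined);
  return profile;
}
```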
Cascading failures occur when a failure in one service propagates to cause failures in dependent services, which in turn cause failures in their dependents, potentially bringing down an entire system. These are among the most dangerous failure modes because they amplify a single point of failure into a system-wide outage.
The Cascade Mechanism:
A cascade typically starts with one slow or failing dependency. Callers waiting on it hold connections and threads, their queues back up, and they begin failing their own callers; the pattern repeats up the call graph until user-facing services are affected.
Why Cascades Happen:
Cascading failures typically occur because services lack adequate timeouts, circuit breakers, bulkheads, backpressure, and deadline propagation, as the example below illustrates.
```typescript
// Example: Implementing cascade prevention mechanisms

import CircuitBreaker from 'opossum';

// 1. Timeouts - Never wait forever
const httpClient = axios.create({
  timeout: 5000, // 5 second timeout
});

// 2. Circuit Breaker - Stop calling failing services
const breaker = new CircuitBreaker(callPaymentService, {
  timeout: 3000,                // Trip after 3s timeout
  errorThresholdPercentage: 50, // Trip at 50% errors
  resetTimeout: 30000,          // Try again after 30s
});

breaker.fallback(() => {
  // Return cached/default response when circuit is open
  return { status: 'pending', source: 'fallback' };
});

// 3. Bulkhead - Isolate failure domains
const bulkheadedPool = {
  payments: new ThreadPool(10),     // Max 10 concurrent payment calls
  inventory: new ThreadPool(10),    // Separate pool for inventory
  notifications: new ThreadPool(5), // Lower priority, smaller pool
};

// 4. Backpressure - Reject when overloaded
app.use((req, res, next) => {
  if (requestQueue.length > MAX_QUEUE_SIZE) {
    return res.status(503).json({ error: 'Service overloaded', retryAfter: 30 });
  }
  next();
});

// 5. Deadline propagation - Don't start what you can't finish
async function processOrder(order: Order, deadline: Date) {
  const remaining = deadline.getTime() - Date.now();
  if (remaining < MIN_PROCESSING_TIME) {
    throw new DeadlineExceededException('Not enough time to process');
  }

  // Propagate reduced deadline to downstream calls
  const downstreamDeadline = new Date(deadline.getTime() - BUFFER_MS);
  await paymentService.charge(order.payment, { deadline: downstreamDeadline });
}
```

Measuring Cascade Resistance:
During cascade failure experiments, measure how far the initial failure spreads, how quickly each affected service detects and isolates it, and how long the system takes to recover once the fault is removed.
Healthy system behaviors: the failure stays contained to the failing service's direct callers, circuit breakers open and serve fallbacks, and normal operation resumes shortly after the dependency recovers.
When a service fails, clients retry. If retries are immediate and aggressive, the failing service faces even more load than normal—potentially preventing recovery. Combined with cascading failures, retry storms can turn a minor hiccup into a prolonged outage. Always use exponential backoff with jitter for retries.
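A minimal sketch of backoff with full jitter is shown below; the base delay, cap, and attempt limit are illustrative assumptions.

```typescript
// Retry helper with exponential backoff and "full jitter": each wait is a
// random value between 0 and an exponentially growing cap, so retrying
// clients spread out instead of returning in synchronized waves.
async function retryWithJitter<T>(
  fn: () => Promise<T>,
  { maxAttempts = 5, baseMs = 100, capMs = 10_000 } = {},
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
      const sleepMs = Math.random() * ceiling; // full jitter
      await new Promise((resolve) => setTimeout(resolve, sleepMs));
    }
  }
}

// Usage (hypothetical client): retryWithJitter(() => paymentClient.charge(order))
```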
Deployment failures represent a special category of service failure where different instances of the same service run different versions or configurations. These are dangerous because they can cause subtle inconsistencies that are difficult to detect and debug.
Deployment Failure Types:
| Failure Type | Description | Risk |
|---|---|---|
| Partial rollout failure | Only some instances get the new version | Version-dependent behavior differences |
| Configuration drift | Instances have different configs | Inconsistent behavior across requests |
| Failed rollback | Rollback doesn't fully complete | Mix of old and new versions |
| Database mismatch | Code and schema versions don't match | Query failures, data corruption |
| Contract violation | New version breaks API contract | Dependent services fail |
| Feature flag split | Different flag states across instances | Inconsistent feature availability |
```bash
# Kubernetes: Simulate stuck partial deployment
# by manually setting different images for pods

# First, deploy version 1.0
kubectl set image deployment/my-service my-service=myapp:1.0

# Scale to ensure multiple replicas
kubectl scale deployment/my-service --replicas=6

# Manually patch half the pods to run version 1.1
# (simulates partial rollout failure)
PODS=$(kubectl get pods -l app=my-service -o name | head -3)
for pod in $PODS; do
  kubectl patch $pod -p '{"spec":{"containers":[{"name":"my-service","image":"myapp:1.1"}]}}'
done
```

```yaml
# Istio: Traffic splitting between versions for testing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service-canary
spec:
  hosts:
  - my-service
  http:
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 50
    - destination:
        host: my-service
        subset: v2  # Simulated "bad" version
      weight: 50
---
# DestinationRule for version subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-service-versions
spec:
  host: my-service
  subsets:
  - name: v1
    labels:
      version: v1.0.0
  - name: v2
    labels:
      version: v1.1.0-broken
```

Deployment failure testing is most valuable when you're changing APIs or data formats. Before deploying breaking changes, simulate a mixed-version environment to verify that both old and new versions can coexist during the deployment window. This is essential for zero-downtime deployments.
Various tools support service-level failure injection, from simple scripts to sophisticated chaos platforms:
| Tool | Scope | Service Failure Capabilities | Best For |
|---|---|---|---|
| Chaos Monkey | AWS, K8s | Random instance termination | Long-running chaos, Netflix-style |
| Gremlin | Multi-platform | Process kill, resource attacks, state | Enterprise with approval workflows |
| Chaos Mesh | Kubernetes | Pod chaos, HTTP chaos, IO chaos | Native K8s, open source |
| LitmusChaos | Kubernetes | Wide variety of fault types | K8s with workflow engine |
| Toxiproxy | TCP proxy | Connection manipulation | Application-level testing |
| Chaos Toolkit | Multi-platform | Extensible chaos experiments | Custom experiment definitions |
| AWS FIS | AWS | AWS service-specific faults | AWS-native environments |
```yaml
# Comprehensive Pod Chaos example using Chaos Mesh

# Random pod kill every 2 minutes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-chaos
spec:
  action: pod-kill
  mode: one  # Kill one random pod
  selector:
    namespaces:
    - production
    labelSelectors:
      app: my-service
  scheduler:
    cron: "*/2 * * * *"  # Every 2 minutes
---
# Pod failure (make container fail)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-chaos
spec:
  action: pod-failure
  mode: fixed-percent
  value: "30"  # Fail 30% of pods
  duration: "5m"
  selector:
    labelSelectors:
      app: payment-service
---
# Container kill (kill specific container, pod restarts it)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: container-kill-chaos
spec:
  action: container-kill
  mode: one
  containerNames:
  - main-container  # Target specific container in pod
  selector:
    labelSelectors:
      app: multi-container-service
  duration: "10m"
```

Service failure injection tests the application layer—how your code handles errors, unavailable dependencies, and unexpected conditions. Unlike network failures that test infrastructure, service failures test your error handling, resilience patterns, and graceful degradation logic.
What's Next:
With network and service failures covered, we'll now examine Resource Exhaustion—CPU saturation, memory pressure, disk exhaustion, and file descriptor limits. These failure modes cause gradual degradation rather than immediate failure, making them particularly challenging to detect and handle.
You now understand how to inject and analyze service failures—process termination, exception injection, error responses, dependency unavailability, and cascading failures. These techniques reveal weaknesses in your application's error handling and resilience patterns. Next, we'll explore resource exhaustion scenarios.