In distributed systems, failure is not a possibility—it is a certainty. Networks partition. Servers crash. Disks fail. Cloud providers have outages. Third-party APIs return errors. The only question is: when these failures occur, will your system degrade gracefully, or will it collapse catastrophically?
Failure scenario testing is the practice of systematically asking 'What happens when X fails?' for every component, dependency, and interaction in your system. It's uncomfortable work—you're deliberately looking for weaknesses in something you've designed. But this discomfort is infinitely preferable to discovering those weaknesses in production.
The distributed systems mantra: Design for failure. Expect failure. Test failure. Recover from failure. Principal engineers don't just hope their systems are resilient—they prove it through rigorous failure analysis.
By the end of this page, you will understand how to systematically test system designs against failure scenarios. You'll learn to classify failures, apply Failure Mode and Effects Analysis (FMEA), design for graceful degradation, prevent cascading failures, and use the techniques that principal engineers apply to ensure systems survive real-world chaos.
Before testing failure scenarios, you need a vocabulary for describing failures. Not all failures are created equal—they differ in scope, duration, detectability, and impact.
Failure Dimensions
| Dimension | Categories | Examples |
|---|---|---|
| Scope | Single component, Multi-component, Systemic | One server crash, Database cluster failure, Regional outage |
| Duration | Transient, Intermittent, Permanent | Network hiccup, Unstable node, Hardware failure |
| Detectability | Fail-stop, Fail-silent, Byzantine | Process crash, Unresponsive service, Corrupt data |
| Timing | Synchronous, Asynchronous | Request timeout, Delayed batch failure |
| Causality | Independent, Correlated, Cascading | Random disk failure, Overlapping maintenance, Domino effect |
The Failure Severity Matrix
Combining failure probability with impact yields a prioritization framework:
| | Low Impact | Medium Impact | High Impact | Critical Impact |
|---|---|---|---|---|
| Very Likely | Monitor | Address Soon | Address Now | Emergency |
| Likely | Accept | Monitor | Address Soon | Address Now |
| Unlikely | Accept | Accept | Monitor | Address Soon |
| Very Unlikely | Accept | Accept | Accept | Monitor |
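If you want design reviews to apply the matrix consistently, it can be encoded as a simple lookup. A minimal sketch—the label names and the `priorityFor` helper are illustrative, not part of any standard tooling:

```typescript
// The severity matrix as a lookup table. Labels mirror the table above.
type Probability = 'very_likely' | 'likely' | 'unlikely' | 'very_unlikely';
type Impact = 'low' | 'medium' | 'high' | 'critical';
type Priority = 'Accept' | 'Monitor' | 'Address Soon' | 'Address Now' | 'Emergency';

const severityMatrix: Record<Probability, Record<Impact, Priority>> = {
  very_likely:   { low: 'Monitor', medium: 'Address Soon', high: 'Address Now', critical: 'Emergency' },
  likely:        { low: 'Accept',  medium: 'Monitor',      high: 'Address Soon', critical: 'Address Now' },
  unlikely:      { low: 'Accept',  medium: 'Accept',       high: 'Monitor',      critical: 'Address Soon' },
  very_unlikely: { low: 'Accept',  medium: 'Accept',       high: 'Accept',       critical: 'Monitor' },
};

function priorityFor(probability: Probability, impact: Impact): Priority {
  return severityMatrix[probability][impact];
}

// Example: a likely failure with critical impact must be addressed now.
console.log(priorityFor('likely', 'critical')); // "Address Now"
```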
Common Failure Modes in Distributed Systems
While crash failures are easy to detect (the component is simply gone), Byzantine failures—where a component produces incorrect or inconsistent results—are far more dangerous. A misbehaving cache returning stale data, a clock-skewed server creating future-dated records, or a compromised service issuing false requests can cause subtle corruption that takes weeks to detect and months to repair.
FMEA is a structured approach to identifying potential failures, their causes, and their effects. Originally developed for aerospace and automotive engineering, it's equally applicable to distributed systems.
The FMEA Process
For each component, list its plausible failure modes, then rate each mode's Severity, Probability, and Detectability on a 1–10 scale (higher is worse; for detectability, higher means harder to detect). The Risk Priority Number (RPN) is the product of the three ratings, and the highest-RPN items are mitigated first. The table below applies this to an example order-processing system:
| Component | Failure Mode | Effect | Severity | Probability | Detectability | RPN | Mitigation |
|---|---|---|---|---|---|---|---|
| Order DB | Primary node crash | Writes fail, possible data loss | 9 | 3 | 2 | 54 | Multi-AZ replication, auto-failover |
| Order DB | Replication lag > 1s | Stale reads, inventory inconsistency | 6 | 5 | 4 | 120 | Monitoring, critical-read routing to primary |
| Payment Gateway | API timeout | Order stuck in pending | 7 | 4 | 2 | 56 | Timeout < SLA, retry with exponential backoff |
| Payment Gateway | Complete outage | Cannot process new orders | 9 | 2 | 1 | 18 | Circuit breaker, degraded mode queue |
| Inventory Service | Stale cache | Overselling | 8 | 5 | 6 | 240 | Event-driven invalidation, reservation pattern |
| Message Queue | Consumer lag | Delayed order processing | 5 | 4 | 3 | 60 | Autoscaling consumers, lag alerting |
| Load Balancer | Health check false positive | Good server removed | 6 | 3 | 5 | 90 | Multiple health checks, gradual removal |
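Keeping the RPN arithmetic in code helps the FMEA stay honest as rows are added or re-rated. A small sketch—the row shape and function names are illustrative:

```typescript
// Illustrative FMEA row model; computeRpn() applies RPN = severity x probability x detectability.
interface FmeaRow {
  component: string;
  failureMode: string;
  severity: number;       // 1-10: how bad is the effect?
  probability: number;    // 1-10: how likely is the failure?
  detectability: number;  // 1-10: how hard is it to detect? (higher = harder)
  mitigation: string;
}

function computeRpn(row: FmeaRow): number {
  return row.severity * row.probability * row.detectability;
}

// Sort rows so the riskiest failure modes surface first.
function prioritize(rows: FmeaRow[]): FmeaRow[] {
  return [...rows].sort((a, b) => computeRpn(b) - computeRpn(a));
}

const staleCache: FmeaRow = {
  component: 'Inventory Service',
  failureMode: 'Stale cache',
  severity: 8,
  probability: 5,
  detectability: 6,
  mitigation: 'Event-driven invalidation, reservation pattern',
};

console.log(computeRpn(staleCache)); // 240 — the top item in the table above
```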
Interpreting the FMEA
The highest RPN items demand immediate attention:
Inventory Service stale cache (RPN 240): The combination of high probability and poor detectability makes this dangerous. Event-driven invalidation should be a design requirement, not an optimization.
Order DB replication lag (RPN 120): This is tricky because the components appear healthy even while causing problems. Proactive monitoring and routing critical reads to the primary are essential.
Load Balancer false positive (RPN 90): A misconfigured health check can take down healthy servers, causing an outage from a 'working' system.
An FMEA isn't a one-time exercise—it should be updated whenever the system changes. New components introduce new failure modes. Changed dependencies alter effects. Principal engineers maintain FMEA documents as first-class artifacts alongside architecture documentation.
The most catastrophic system failures are cascading failures—where an initial failure triggers a chain reaction that brings down the entire system. These are particularly insidious because each link in the chain might seem reasonable in isolation.
The Anatomy of a Cascade
A cascade typically follows a recognizable arc: an initial failure (a slow database, a crashed node) increases latency or load on its immediate callers, their queues and connection pools fill, retries and load redistribution amplify the pressure, and the failure propagates outward until the whole system is saturated.
Cascade Amplification Mechanisms
Several patterns commonly amplify failures into cascades:
| Pattern | Description | Example | Mitigation |
|---|---|---|---|
| Retry storms | Failures trigger retries, multiplying load | 1 failure → 3 retries → 3× load | Exponential backoff, jitter, retry budgets |
| Connection starvation | Slow responses hold connections open | DB slowdown → all connections blocked | Aggressive timeouts, connection limits per host |
| Thread pool exhaustion | Blocked threads can't handle new requests | 10 slow requests → 10 threads blocked | Async I/O, bulkheads, bounded queues |
| Memory pressure | Failed requests consume memory before cleanup | OOM cascading across cluster | Request size limits, streaming, back-pressure |
| Load redistribution | Failed node's load shifts to survivors | 1 of 3 nodes fails → 50% more load per survivor | Capacity headroom, graceful degradation |
| Positive feedback loops | Failure worsens the condition causing failure | Slow GC → more memory → slower GC | Circuit breakers, load shedding |
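The retry-storm row above names exponential backoff, jitter, and retry budgets as mitigations. A minimal sketch of how they fit together—the names and numbers are illustrative, and production systems usually rely on a library for this:

```typescript
// Retry helper with exponential backoff, full jitter, and a shared retry budget.
interface RetryOptions {
  maxAttempts: number;   // total attempts, including the first
  baseDelayMs: number;   // delay before the first retry
  maxDelayMs: number;    // cap on any single backoff
}

// A shared budget caps retries across the whole process so a widespread
// outage cannot multiply load into a retry storm.
let retryBudget = 100;

async function withRetries<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxAttempts - 1 || retryBudget <= 0) break;
      retryBudget--;
      // Exponential backoff with full jitter: random delay in [0, cappedBackoff).
      const cappedBackoff = Math.min(opts.maxDelayMs, opts.baseDelayMs * 2 ** attempt);
      const delay = Math.random() * cappedBackoff;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Full jitter spreads retries out in time, so thousands of clients recovering from the same outage don't hammer the dependency in lockstep.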
Designing Cascade Breakers
Every design should include explicit mechanisms to prevent cascades: circuit breakers that fail fast instead of waiting on a sick dependency, bulkheads that isolate resource pools, aggressive timeouts, and load shedding under pressure. A minimal circuit breaker is sketched after the callout below.
For every component in your design, ask: 'If this component becomes slow (not just failed, but 10× slower than normal), what happens to the rest of the system?' Slow failures are more dangerous than complete failures because they hold resources while appearing to work.
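As one example of a cascade breaker, here is a minimal circuit breaker sketch; the thresholds and state handling are simplified for illustration:

```typescript
// After `failureThreshold` consecutive failures the breaker opens and fails fast,
// then allows a single trial call after `resetTimeoutMs` (half-open state).
class CircuitBreaker {
  private failures = 0;
  private state: 'closed' | 'open' | 'half_open' = 'closed';
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast instead of waiting on a sick dependency');
      }
      this.state = 'half_open'; // allow one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

In a real system the breaker would also emit state-change metrics so operators can see when, and how often, it trips.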
Modern systems depend on numerous external services—cloud infrastructure, third-party APIs, SaaS platforms. Each dependency introduces failure scenarios that must be explicitly addressed in the design.
The Dependency Failure Matrix
For every external dependency, enumerate the failure modes and define the system's response:
| Dependency | Failure Mode | Impact | Response Strategy | Fallback |
|---|---|---|---|---|
| Payment Provider | Complete outage | Cannot process payments | Queue for later processing | Notify user, allow delayed payment |
| Payment Provider | Intermittent 5XX errors | Some payments fail | Retry with exponential backoff | None (retry handles it) |
| Payment Provider | Latency > 10s | Request timeouts | Circuit breaker opens | Queue for async retry |
| Auth Provider (OAuth) | Token endpoint down | Cannot authenticate new users | Cache tokens, extend expiry | Allow existing sessions |
| Email Service | API unresponsive | Notifications delayed | Queue in persistent storage | Process when available |
| CDN | Regional outage | Static assets unavailable | Failover to backup CDN | Serve from origin (degraded) |
| Cloud Object Storage | High latency | Slow uploads/downloads | Client-side retry, streaming | None (wait or fail) |
Dependency Criticality Levels
Not all dependencies are equally critical. Classify each one as critical (the system cannot function without it), important (the system degrades without it), or optional (only non-essential features suffer), and let that classification drive the fallback strategy—as in the dependency model below.
```typescript
// Structured dependency health modeling for design validation
interface DependencyHealth {
  name: string;
  type: 'database' | 'cache' | 'api' | 'queue' | 'storage';
  criticality: 'critical' | 'important' | 'optional';

  // Failure scenarios
  failureModes: FailureMode[];

  // Recovery configuration
  circuitBreaker: CircuitBreakerConfig;
  fallbackStrategy: FallbackStrategy;
}

interface FailureMode {
  scenario: string;
  probability: 'high' | 'medium' | 'low';
  impact: 'catastrophic' | 'severe' | 'moderate' | 'minor';
  detection: 'instant' | 'delayed' | 'manual';
  mitigation: string;
}

interface CircuitBreakerConfig {
  failureThreshold: number;  // Number of failures before opening
  successThreshold: number;  // Successes needed to close
  timeout: number;           // Half-open timeout (ms)
  windowDuration: number;    // Sliding window (ms)
}

interface FallbackStrategy {
  type: 'cache' | 'queue' | 'default_value' | 'degraded_mode' | 'fail_fast';
  config: Record<string, unknown>;
}

// Example: Payment service dependency
const paymentServiceDependency: DependencyHealth = {
  name: 'PaymentGateway',
  type: 'api',
  criticality: 'critical',
  failureModes: [
    {
      scenario: 'Complete API outage',
      probability: 'low',
      impact: 'severe',
      detection: 'instant',
      mitigation: 'Queue payments for async retry, notify operations',
    },
    {
      scenario: 'Latency spike > 5s',
      probability: 'medium',
      impact: 'moderate',
      detection: 'instant',
      mitigation: 'Circuit breaker opens, queue for retry',
    },
    {
      scenario: 'Intermittent 5XX errors',
      probability: 'medium',
      impact: 'minor',
      detection: 'instant',
      mitigation: 'Automatic retry with exponential backoff',
    },
    {
      scenario: 'Rate limiting (429)',
      probability: 'high',
      impact: 'moderate',
      detection: 'instant',
      mitigation: 'Request queuing with rate-aware scheduling',
    },
  ],
  circuitBreaker: {
    failureThreshold: 5,
    successThreshold: 2,
    timeout: 30000,
    windowDuration: 60000,
  },
  fallbackStrategy: {
    type: 'queue',
    config: {
      queueName: 'payment-retry-queue',
      maxRetries: 10,
      retryBackoff: 'exponential',
      deadLetterQueue: 'payment-failed-dlq',
    },
  },
};

// Validate dependency health configuration
function validateDependencyHealth(dep: DependencyHealth): string[] {
  const issues: string[] = [];

  if (dep.criticality === 'critical' && dep.fallbackStrategy.type === 'fail_fast') {
    issues.push(
      `Critical dependency '${dep.name}' has fail_fast fallback - ` +
      'consider queue or degraded_mode strategy'
    );
  }

  if (dep.failureModes.length === 0) {
    issues.push(`Dependency '${dep.name}' has no defined failure modes`);
  }

  const highImpactModes = dep.failureModes.filter(f =>
    f.impact === 'catastrophic' && f.detection !== 'instant'
  );
  if (highImpactModes.length > 0) {
    issues.push(
      `Dependency '${dep.name}' has catastrophic failures with delayed detection`
    );
  }

  return issues;
}
```

Graceful degradation is the principle that a system should provide reduced functionality rather than complete failure when components fail. This requires explicit design—it doesn't happen by accident.
Degradation Levels
Define what 'reduced functionality' means for your system:
| Level | Trigger | Capabilities Available | Capabilities Degraded | User Experience |
|---|---|---|---|---|
| Normal | All systems healthy | Full functionality | None | Optimal |
| Degraded-1 | Recommendation service down | Browse, search, purchase | Personalized recommendations | Generic 'Popular Items' shown |
| Degraded-2 | Search service down | Category browsing, purchase | Search functionality | Search box hidden, category browsing promoted |
| Degraded-3 | Payment service slow | Browse, cart management | Checkout speed | Checkout queued, user notified of delay |
| Degraded-4 | Inventory service inconsistent | Browse, limited purchase | Real-time availability | Availability shown as 'Contact for availability' |
| Emergency | Primary database overloaded | Read-only mode | Purchases, cart changes | Maintenance banner, apology |
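One way to make the table operational is to derive the current level from component health signals. A minimal sketch—the health fields and the ordering of checks are assumptions for illustration:

```typescript
// Map component health to the degradation levels in the table above.
interface SystemHealth {
  recommendationsUp: boolean;
  searchUp: boolean;
  paymentsFast: boolean;
  inventoryConsistent: boolean;
  primaryDbHealthy: boolean;
}

type DegradationLevel =
  | 'normal' | 'degraded-1' | 'degraded-2' | 'degraded-3' | 'degraded-4' | 'emergency';

function currentLevel(h: SystemHealth): DegradationLevel {
  // Check the most severe conditions first so the worst applicable level wins.
  if (!h.primaryDbHealthy) return 'emergency';
  if (!h.inventoryConsistent) return 'degraded-4';
  if (!h.paymentsFast) return 'degraded-3';
  if (!h.searchUp) return 'degraded-2';
  if (!h.recommendationsUp) return 'degraded-1';
  return 'normal';
}
```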
Designing Degradation Paths
For each non-critical feature, explicitly design what happens when its supporting components fail—a concrete fallback, not an error page.
Amazon famously operates on the principle that customers should always be able to complete a purchase, even if it means operating with degraded data. If the recommendation service is down, show popular items. If real-time inventory is unavailable, accept the order and reconcile later. The worst outcome is a customer who couldn't buy—inventory issues can be fixed after the sale.
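Here is a sketch of one such degradation path from the table—falling back from personalized recommendations to popular items. The function names and the 300 ms budget are hypothetical:

```typescript
// If the recommendation service fails or is slow, fall back to a popular-items
// list rather than failing the page.
interface Product { id: string; name: string }

// Hypothetical data sources: a personalized recommender and a popularity cache.
async function fetchRecommendations(userId: string): Promise<Product[]> {
  // ...call the recommendation service; may be slow or throw during an outage
  return [{ id: 'r1', name: `Picked for ${userId}` }];
}
async function fetchPopularItems(): Promise<Product[]> {
  return [{ id: 'p1', name: 'Popular item' }];
}

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error('timeout')), ms)),
  ]);
}

async function recommendationsOrFallback(userId: string): Promise<Product[]> {
  try {
    // Bound the wait: a slow recommendation call must not slow down the whole page.
    return await withTimeout(fetchRecommendations(userId), 300);
  } catch {
    // Degraded-1 from the table above: generic popular items instead of personalization.
    return fetchPopularItems();
  }
}
```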
Principal engineers systematically play the 'What-If Game'—walking through the design and asking failure questions at every component. This structured exercise surfaces hidden assumptions and missing safeguards.
The What-If Protocol
For each component in the system, ask: What if it crashes outright? What if it becomes slow rather than failing? What if a write to its datastore fails? What if the message broker it publishes to is unreachable? What if it exhausts memory or connections?
Documenting What-If Analysis
Capture the results in a structured format:
| What If... | Expected Behavior | Verification | Gap? | Action Required |
|---|---|---|---|---|
| ...it crashes | K8s restarts within 30s, LB routes to healthy pods | Integration test with pod kill | No | |
| ...it's slow (10× latency) | Circuit breaker opens in API Gateway after 5 failures | Load test with injected latency | Yes | Implement circuit breaker |
| ...DB write fails | Transaction rolls back, error returned to caller, logged | Unit test + chaos test | No | |
| ...Kafka is unreachable | Order completes, event buffered locally, sent on recovery | Chaos test | Yes | Add local buffer |
| ...memory exhausted | Container killed, restarted, work lost | Load test to OOM | Yes | Add resource limits + backpressure |
Notice the 'Verification' column—every What-If claim should eventually be tested. Design assumptions without tests are just hopes. Chaos engineering practices like controlled fault injection turn What-If answers into proven system properties.
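A minimal sketch of the kind of fault-injection wrapper that turns a What-If row into a repeatable test; the configuration shape and names are illustrative, not a real chaos-engineering tool:

```typescript
// Wrap a dependency call with configurable errors and added latency in test environments.
interface FaultConfig {
  errorRate: number;      // fraction of calls that fail outright (0..1)
  extraLatencyMs: number; // added latency to simulate a slow dependency
}

function injectFaults<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  config: FaultConfig,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    // Simulate the "10x slower" scenario from the What-If table.
    await new Promise(resolve => setTimeout(resolve, config.extraLatencyMs));
    if (Math.random() < config.errorRate) {
      throw new Error('Injected fault: dependency unavailable');
    }
    return fn(...args);
  };
}

// Usage: wrap the real client in tests and assert that the circuit breaker opens,
// fallbacks engage, and timeouts fire exactly as the design claims.
```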
Single-point failures are relatively easy to handle—the harder question is what happens when multiple things fail simultaneously. These scenarios are less probable but often catastrophic when they occur.
Correlated Failure Scenarios
Multi-point failures aren't random—they tend to share a common cause: components sitting in the same rack, availability zone, or cloud provider; the same bad release deployed to every instance; overlapping maintenance windows; or a traffic spike that stresses every replica at once.
The Failure Pair Matrix
For critical systems, explicitly analyze pairs of failures:
| If A Fails... | And B Also Fails... | System Impact | Mitigation |
|---|---|---|---|
| Primary DB | Secondary DB | Complete data unavailability | Third replica in different region |
| Primary DB | Cache | All reads go to DB under load | Multiple cache replicas |
| API Gateway | Auth Service | No requests processed | Gateway has cached auth tokens |
| Region US-East | Region EU-West | Depends on traffic distribution | US-West handles overflow |
| Payment Service | Order Queue | Cannot process or queue payments | Local queue with persistent storage |
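For a handful of critical components, pair enumeration is cheap to automate so that no combination is silently skipped. A sketch, with example component names:

```typescript
// Enumerate all pairs of critical components so every combination
// gets an explicit row in the failure-pair matrix.
function failurePairs(components: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < components.length; i++) {
    for (let j = i + 1; j < components.length; j++) {
      pairs.push([components[i], components[j]]);
    }
  }
  return pairs;
}

const critical = ['Primary DB', 'Secondary DB', 'Cache', 'API Gateway', 'Auth Service'];
for (const [a, b] of failurePairs(critical)) {
  console.log(`What if ${a} and ${b} fail together?`);
}
```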
Multi-region and multi-cloud architectures protect against regional outages, but come with significant complexity and cost. The decision to invest in this level of redundancy should be explicit: What's the cost of a complete cloud provider outage? For how long? Is that risk acceptable given the investment required to mitigate it?
Failure scenario testing transforms optimistic designs into resilient architectures. By systematically asking 'What happens when this fails?', you discover weaknesses before production discovers them for you.
What's Next
With failure scenarios analyzed, we move to another critical dimension of design validation: edge case handling. While failure scenarios address 'what breaks,' edge cases address 'what's weird'—the unusual inputs, exceptional conditions, and boundary situations that cause subtle bugs and unexpected behavior.
You now understand how to systematically test system designs against failure scenarios. You can classify failures, apply FMEA, design cascade breakers, plan graceful degradation, and use the What-If methodology to validate resilience. Next, we'll examine how to handle edge cases in your design.