In distributed systems, failure is not a matter of if, but when. Hardware fails, networks partition, dependencies time out, and software has bugs. The difference between systems that survive and systems that collapse lies not in preventing failures—which is impossible—but in how they handle failures when they occur.
Principal engineers design systems with failure as a first-class concern. Every component, every interaction, every data flow must answer the question: What happens when this fails? This page provides a comprehensive framework for building resilient systems that degrade gracefully under stress.
By the end of this page, you will understand the taxonomy of system failures, master resilience patterns (circuit breakers, bulkheads, retries), design for graceful degradation, implement effective error handling strategies, and apply chaos engineering principles to validate your failure handling.
Before handling failures, we must understand them systematically. Failures fall into distinct categories, each requiring different handling strategies.
| Failure Type | Description | Examples | Detection Method |
|---|---|---|---|
| Crash Failures | Component stops completely | OOM kill, process crash, hardware failure | Health checks, heartbeats |
| Omission Failures | Component fails to respond | Network timeout, dropped messages | Timeouts, acknowledgment tracking |
| Timing Failures | Response outside acceptable time | Slow queries, GC pauses | Latency monitoring, deadline tracking |
| Response Failures | Incorrect response returned | Wrong data, corrupted payload | Checksums, validation, testing |
| Byzantine Failures | Arbitrary/malicious behavior | Compromised nodes, data corruption | Voting, consensus, cryptographic verification |
| Cascade Failures | One failure triggers others | Resource exhaustion spreading | Dependency monitoring, circuit breakers |
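As a concrete illustration of the detection column, the sketch below shows a minimal heartbeat-based detector for crash and omission failures. The `HeartbeatMonitor` name, the 5-second interval, and the three-missed-beats threshold are illustrative assumptions, not part of any specific system.

```python
# Minimal heartbeat-based failure detector (illustrative sketch).
# A node is suspected dead after `max_missed` intervals without a heartbeat.
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    interval_seconds: float = 5.0   # expected heartbeat period
    max_missed: int = 3             # missed beats before suspecting a node
    _last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever a heartbeat arrives from a node."""
        self._last_seen[node_id] = time.time()

    def suspected_failures(self) -> List[str]:
        """Return nodes that have missed too many heartbeats."""
        cutoff = time.time() - self.interval_seconds * self.max_missed
        return [node for node, seen in self._last_seen.items() if seen < cutoff]


# Usage sketch
monitor = HeartbeatMonitor(interval_seconds=5.0, max_missed=3)
monitor.record_heartbeat("api-1")
# ... later, in a periodic check loop:
for node in monitor.suspected_failures():
    print(f"Node {node} suspected crashed or partitioned (crash/omission failure)")
```

Note that a missed heartbeat cannot distinguish a crashed node from a partitioned or merely slow one; that ambiguity is why the later patterns (timeouts, retries, circuit breakers) matter.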
The Failure Cascade Problem
The most dangerous failures are those that cascade. A single component failure triggers a chain reaction:
Database slow → Connection pool exhausted → API timeouts → Client retries → More load on database → Database slower → Complete system failure
This cascading behavior is why simple 'retry on failure' logic often makes things worse. Proper failure handling must limit blast radius and prevent amplification.
The fallacies of distributed computing, attributed to Peter Deutsch and his colleagues at Sun, remind us what we cannot assume: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, transport cost is zero, the network is homogeneous. Design failure handling as if every one of these assumptions will eventually be violated, because it will be.
The Circuit Breaker pattern prevents cascading failures by stopping calls to a failing service, allowing it time to recover. Named after electrical circuit breakers, it 'trips' when failures exceed a threshold.
Circuit Breaker States:
```python
# Production-Grade Circuit Breaker Implementation
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from threading import Lock
from typing import Callable, Optional, TypeVar


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery


@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration."""
    failure_threshold: int = 5            # Failures before opening
    success_threshold: int = 3            # Successes to close from half-open
    timeout_seconds: float = 30.0         # Time in open before half-open
    half_open_max_calls: int = 3          # Max concurrent half-open calls
    failure_rate_threshold: float = 0.5   # Failure rate to trigger (0-1)
    min_calls_for_rate: int = 10          # Min calls before rate matters


T = TypeVar('T')


class CircuitBreaker:
    """
    Thread-safe circuit breaker with multiple tripping strategies.

    Usage:
        breaker = CircuitBreaker("payment-service")
        try:
            result = breaker.call(lambda: payment_client.charge(amount))
        except CircuitOpenError:
            return fallback_response()
    """

    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._success_count = 0
        self._total_calls = 0
        self._last_failure_time: Optional[datetime] = None
        self._half_open_calls = 0
        self._lock = Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            self._check_state_transition()
            return self._state

    def _check_state_transition(self) -> None:
        """Check if circuit should transition states."""
        if self._state == CircuitState.OPEN:
            if self._last_failure_time:
                elapsed = datetime.now() - self._last_failure_time
                if elapsed.total_seconds() >= self.config.timeout_seconds:
                    self._transition_to(CircuitState.HALF_OPEN)

    def _transition_to(self, new_state: CircuitState) -> None:
        """Transition to new state with logging."""
        old_state = self._state
        self._state = new_state
        if new_state == CircuitState.CLOSED:
            self._failure_count = 0
            self._success_count = 0
            self._total_calls = 0
        elif new_state == CircuitState.HALF_OPEN:
            self._half_open_calls = 0
            self._success_count = 0
        print(f"[CircuitBreaker:{self.name}] {old_state.value} -> {new_state.value}")

    def _should_trip(self) -> bool:
        """Determine if circuit should trip to OPEN."""
        # Consecutive failure threshold
        if self._failure_count >= self.config.failure_threshold:
            return True
        # Failure rate threshold (only after minimum calls)
        if self._total_calls >= self.config.min_calls_for_rate:
            failure_rate = self._failure_count / self._total_calls
            if failure_rate >= self.config.failure_rate_threshold:
                return True
        return False

    def call(self, func: Callable[[], T], fallback: Callable[[], T] = None) -> T:
        """
        Execute function through circuit breaker.

        Args:
            func: The function to execute
            fallback: Optional fallback if circuit is open

        Raises:
            CircuitOpenError: If circuit is open and no fallback provided
        """
        with self._lock:
            self._check_state_transition()

            if self._state == CircuitState.OPEN:
                if fallback:
                    return fallback()
                raise CircuitOpenError(f"Circuit {self.name} is OPEN")

            if self._state == CircuitState.HALF_OPEN:
                if self._half_open_calls >= self.config.half_open_max_calls:
                    if fallback:
                        return fallback()
                    raise CircuitOpenError(
                        f"Circuit {self.name} half-open limit reached"
                    )
                self._half_open_calls += 1

        try:
            result = func()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise

    def _record_success(self) -> None:
        """Record successful call."""
        with self._lock:
            self._success_count += 1
            self._total_calls += 1
            if self._state == CircuitState.HALF_OPEN:
                if self._success_count >= self.config.success_threshold:
                    self._transition_to(CircuitState.CLOSED)

    def _record_failure(self) -> None:
        """Record failed call."""
        with self._lock:
            self._failure_count += 1
            self._total_calls += 1
            self._last_failure_time = datetime.now()
            if self._state == CircuitState.HALF_OPEN:
                self._transition_to(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                if self._should_trip():
                    self._transition_to(CircuitState.OPEN)


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open."""
    pass


# Example usage
payment_breaker = CircuitBreaker(
    "payment-service",
    CircuitBreakerConfig(
        failure_threshold=5,
        timeout_seconds=30,
        success_threshold=3
    )
)


def process_payment(amount: float) -> dict:
    """Process payment with circuit breaker protection."""
    def call_payment_service():
        # Actual payment service call
        return {"status": "success", "amount": amount}

    def fallback():
        # Queue for later processing
        return {"status": "queued", "message": "Payment queued for retry"}

    return payment_breaker.call(call_payment_service, fallback)
```

Don't implement circuit breakers from scratch in production. Use battle-tested libraries: Resilience4j (Java), Polly (.NET), Hystrix (Java, now in maintenance mode), pybreaker (Python), or service mesh capabilities (Istio, Linkerd). These provide observability, thread safety, and edge-case handling.
Retries are the most common failure handling mechanism, but naive retries cause more harm than good. Proper retry implementation requires understanding when to retry, how to backoff, and when to give up.
Exponential Backoff with Jitter
The gold standard for retry timing. Each retry waits longer than the last, and jitter prevents synchronized retries from multiple clients.
wait_time = min(cap, base * 2^(attempt - 1)) + random(0, jitter), where attempt = 1, 2, 3, ...
Example with base=100ms, cap=30s, jitter=100ms:
Attempt 1: 100-200ms
Attempt 2: 200-300ms
Attempt 3: 400-500ms
Attempt 4: 800-900ms
Attempt 5: 1600-1700ms
...
Attempt N: 30000-30100ms (capped)
Why jitter matters: Without jitter, if 1000 clients timeout simultaneously and all retry after exactly 1 second, you get 1000 simultaneous requests again—a thundering herd. Jitter spreads retries across time.
```python
# Comprehensive Retry Implementation
import random
import time
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional, Set, Type, TypeVar

T = TypeVar('T')


@dataclass
class RetryConfig:
    """Configuration for retry behavior."""
    max_attempts: int = 3
    base_delay_ms: float = 100
    max_delay_ms: float = 30000
    exponential_base: float = 2
    jitter_ms: float = 100
    retryable_exceptions: Optional[Set[Type[Exception]]] = None

    def __post_init__(self):
        if self.retryable_exceptions is None:
            self.retryable_exceptions = {
                ConnectionError,
                TimeoutError,
                IOError,
            }


def calculate_delay(attempt: int, config: RetryConfig) -> float:
    """Calculate delay (in milliseconds) with exponential backoff and jitter."""
    exponential_delay = config.base_delay_ms * (config.exponential_base ** attempt)
    capped_delay = min(exponential_delay, config.max_delay_ms)
    jitter = random.uniform(0, config.jitter_ms)
    return capped_delay + jitter


def is_retryable(exception: Exception, config: RetryConfig) -> bool:
    """Determine if exception is retryable."""
    return any(
        isinstance(exception, exc_type)
        for exc_type in config.retryable_exceptions
    )


def with_retry(config: RetryConfig = None):
    """
    Decorator for retrying functions with exponential backoff.

    Usage:
        @with_retry(RetryConfig(max_attempts=5))
        def call_external_service():
            ...
    """
    if config is None:
        config = RetryConfig()

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            last_exception = None
            for attempt in range(config.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if not is_retryable(e, config):
                        raise  # Non-retryable, fail immediately
                    if attempt < config.max_attempts - 1:
                        delay_ms = calculate_delay(attempt, config)
                        print(f"Retry {attempt + 1}/{config.max_attempts} "
                              f"after {delay_ms:.0f}ms: {e}")
                        time.sleep(delay_ms / 1000)
                    else:
                        print(f"All {config.max_attempts} attempts failed")
                        raise last_exception
        return wrapper
    return decorator


# Retry budget pattern - limit total retries system-wide
class RetryBudget:
    """
    Limits retries globally to prevent retry storms.

    Instead of each request getting N retries, the system
    gets a budget of retries per time window.
    """

    def __init__(self,
                 budget_per_second: float = 0.1,     # 10% of traffic can be retries
                 min_retries_per_second: int = 10):  # But always allow some
        self.budget_ratio = budget_per_second
        self.min_retries = min_retries_per_second
        self._request_count = 0
        self._retry_count = 0
        self._window_start = time.time()
        self._window_seconds = 1.0

    def _reset_window_if_needed(self):
        now = time.time()
        if now - self._window_start >= self._window_seconds:
            self._request_count = 0
            self._retry_count = 0
            self._window_start = now

    def record_request(self):
        self._reset_window_if_needed()
        self._request_count += 1

    def can_retry(self) -> bool:
        """Check if retry budget allows another retry."""
        self._reset_window_if_needed()
        # Always allow minimum retries
        if self._retry_count < self.min_retries:
            return True
        # Check if under budget
        if self._request_count == 0:
            return True
        retry_ratio = self._retry_count / self._request_count
        return retry_ratio < self.budget_ratio

    def record_retry(self) -> bool:
        """Record a retry, return True if allowed."""
        if self.can_retry():
            self._retry_count += 1
            return True
        return False


# Example: Using retry budget
retry_budget = RetryBudget(budget_per_second=0.1)


def make_request_with_budget(url: str) -> dict:
    retry_budget.record_request()
    for attempt in range(3):
        try:
            return http_get(url)  # hypothetical HTTP client call
        except TimeoutError:
            if attempt < 2 and retry_budget.record_retry():
                # calculate_delay returns milliseconds; sleep takes seconds
                time.sleep(calculate_delay(attempt, RetryConfig()) / 1000)
                continue
            raise
    raise RuntimeError("Should not reach here")
```

If each of 5 services in a call chain makes up to 3 attempts, a single user request can trigger 3^5 = 243 calls at the bottom of the chain. Implement retry budgets, coordinate retries at the edge (not at each layer), and use circuit breakers to prevent this amplification.
The Bulkhead pattern, named after ship compartments that prevent a single breach from sinking the entire vessel, isolates components so that failure in one doesn't exhaust resources needed by others.
Common Bulkhead Implementations:
| Isolation Level | Blast Radius | Resource Overhead | Operational Complexity | Example Use Case |
|---|---|---|---|---|
| Thread Pool | Per-dependency | Low | Low | Slow external API isolation |
| Connection Pool | Per-dependency | Low-Medium | Low | Different DB operation types |
| Process | Per-workload | Medium | Medium | Batch vs online processing |
| Container | Per-service | Medium | Medium | Microservice isolation |
| VM/Instance | Per-tier | High | High | Customer tier isolation |
| Region/Cluster | Per-deployment | Very High | High | Complete failure domain isolation |
```python
# Thread Pool Bulkhead Implementation
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from dataclasses import dataclass
from typing import Callable, Dict, TypeVar
import threading

T = TypeVar('T')


@dataclass
class BulkheadConfig:
    """Configuration for a bulkhead."""
    max_concurrent: int = 10
    max_waiting: int = 5
    timeout_seconds: float = 30.0


class Bulkhead:
    """
    Thread pool bulkhead for isolating concurrent operations.

    Usage:
        payment_bulkhead = Bulkhead("payments", BulkheadConfig(max_concurrent=10))
        result = payment_bulkhead.execute(lambda: payment_service.charge(amount))
    """

    def __init__(self, name: str, config: BulkheadConfig = None):
        self.name = name
        self.config = config or BulkheadConfig()
        self._executor = ThreadPoolExecutor(
            max_workers=self.config.max_concurrent,
            thread_name_prefix=f"bulkhead-{name}"
        )
        self._semaphore = threading.Semaphore(
            self.config.max_concurrent + self.config.max_waiting
        )
        self._active = 0
        self._waiting = 0
        self._lock = threading.Lock()

    def execute(self, func: Callable[[], T]) -> T:
        """
        Execute function within bulkhead constraints.

        Raises:
            BulkheadFullError: If bulkhead is at capacity
            TimeoutError: If execution times out
        """
        # Check if we can acquire a slot
        acquired = self._semaphore.acquire(blocking=False)
        if not acquired:
            raise BulkheadFullError(
                f"Bulkhead {self.name} is full: "
                f"{self._active} active, {self._waiting} waiting"
            )

        try:
            with self._lock:
                self._waiting += 1

            future = self._executor.submit(func)

            with self._lock:
                self._waiting -= 1
                self._active += 1

            try:
                return future.result(timeout=self.config.timeout_seconds)
            except TimeoutError:
                future.cancel()
                raise
            finally:
                with self._lock:
                    self._active -= 1
        finally:
            self._semaphore.release()

    def metrics(self) -> dict:
        """Get current bulkhead metrics."""
        with self._lock:
            return {
                "name": self.name,
                "max_concurrent": self.config.max_concurrent,
                "max_waiting": self.config.max_waiting,
                "active": self._active,
                "waiting": self._waiting,
                "available": self.config.max_concurrent - self._active,
            }


class BulkheadFullError(Exception):
    """Raised when bulkhead cannot accept more work."""
    pass


class BulkheadRegistry:
    """
    Central registry for managing multiple bulkheads.
    Provides a single point for monitoring and configuration.
    """

    def __init__(self):
        self._bulkheads: Dict[str, Bulkhead] = {}
        self._lock = threading.Lock()

    def get_or_create(self, name: str, config: BulkheadConfig = None) -> Bulkhead:
        """Get existing bulkhead or create new one."""
        with self._lock:
            if name not in self._bulkheads:
                self._bulkheads[name] = Bulkhead(name, config)
            return self._bulkheads[name]

    def all_metrics(self) -> list:
        """Get metrics for all bulkheads."""
        with self._lock:
            return [b.metrics() for b in self._bulkheads.values()]


# Global registry
bulkheads = BulkheadRegistry()

# Usage example - different bulkheads for different dependencies
payment_bulkhead = bulkheads.get_or_create(
    "payments", BulkheadConfig(max_concurrent=10, timeout_seconds=30)
)

inventory_bulkhead = bulkheads.get_or_create(
    "inventory", BulkheadConfig(max_concurrent=50, timeout_seconds=5)
)


def checkout(cart) -> dict:
    """Checkout with bulkhead isolation."""
    # If payments is slow, it won't affect inventory checks
    inventory_result = inventory_bulkhead.execute(
        lambda: check_inventory(cart.items)
    )
    if not inventory_result.available:
        return {"error": "Items not available"}

    payment_result = payment_bulkhead.execute(
        lambda: process_payment(cart.total)
    )
    return {"order_id": payment_result.order_id}
```

Size bulkheads based on observed usage patterns. Monitor queue depth and rejected calls. Too small means unnecessary rejections during normal traffic; too large means no isolation benefit. Start conservative and tune based on production metrics.
Graceful degradation means providing reduced but functional service when components fail. Instead of complete failure, the system continues with diminished capabilities.
Design Principles for Graceful Degradation:
| Feature | Normal Behavior | Degraded Behavior | Impact |
|---|---|---|---|
| Product Recommendations | ML-personalized suggestions | Trending/popular items | Lower conversion, but functional |
| Search | Semantic + ML ranking | Keyword match only | Less relevant results |
| User Authentication | Full OAuth flow | Token validation only | No new logins, existing sessions work |
| Real-time Inventory | Live stock counts | Cached counts + buffer | Occasional oversell |
| Dynamic Pricing | ML-based pricing | Static price list | Suboptimal pricing |
| Comments/Reviews | Full social features | Read-only mode | No new content, viewing works |
Implementing Feature Degradation
A structured approach to feature degradation uses feature flags and dependency health monitoring:
```typescript
// Graceful Degradation Framework
interface DegradationLevel {
  level: 'normal' | 'degraded' | 'minimal' | 'emergency';
  enabledFeatures: string[];
  disabledFeatures: string[];
}

interface DependencyHealth {
  name: string;
  healthy: boolean;
  latencyP99Ms: number;
  errorRate: number;
}

class DegradationController {
  private currentLevel: DegradationLevel['level'] = 'normal';
  private dependencies: Map<string, DependencyHealth> = new Map();

  // Define what features are available at each level
  private readonly levels: Record<string, DegradationLevel> = {
    normal: {
      level: 'normal',
      enabledFeatures: ['ml_search', 'personalization', 'real_time_inventory',
                        'dynamic_pricing', 'reviews', 'recommendations'],
      disabledFeatures: [],
    },
    degraded: {
      level: 'degraded',
      enabledFeatures: ['basic_search', 'cached_inventory', 'static_pricing',
                        'reviews_readonly', 'popular_items'],
      disabledFeatures: ['ml_search', 'personalization', 'real_time_inventory',
                         'dynamic_pricing', 'recommendations'],
    },
    minimal: {
      level: 'minimal',
      enabledFeatures: ['basic_search', 'cached_inventory', 'static_pricing'],
      disabledFeatures: ['reviews', 'recommendations', 'personalization',
                         'real_time_inventory', 'dynamic_pricing'],
    },
    emergency: {
      level: 'emergency',
      enabledFeatures: ['static_catalog', 'cached_content'],
      disabledFeatures: ['search', 'checkout', 'user_accounts'],
    },
  };

  updateDependencyHealth(health: DependencyHealth): void {
    this.dependencies.set(health.name, health);
    this.evaluateDegradationLevel();
  }

  private evaluateDegradationLevel(): void {
    const healths = Array.from(this.dependencies.values());

    // Calculate overall system health score
    const unhealthyCount = healths.filter(h => !h.healthy).length;
    const avgErrorRate = healths.reduce((sum, h) => sum + h.errorRate, 0) / healths.length;
    const avgLatency = healths.reduce((sum, h) => sum + h.latencyP99Ms, 0) / healths.length;

    // Determine appropriate degradation level
    let newLevel: DegradationLevel['level'];
    if (unhealthyCount === 0 && avgErrorRate < 0.01 && avgLatency < 100) {
      newLevel = 'normal';
    } else if (unhealthyCount <= 2 && avgErrorRate < 0.05) {
      newLevel = 'degraded';
    } else if (unhealthyCount <= 4 && avgErrorRate < 0.20) {
      newLevel = 'minimal';
    } else {
      newLevel = 'emergency';
    }

    if (newLevel !== this.currentLevel) {
      console.log(`Degradation level changing: ${this.currentLevel} -> ${newLevel}`);
      this.currentLevel = newLevel;
      this.emitLevelChange(newLevel);
    }
  }

  isFeatureEnabled(featureName: string): boolean {
    const level = this.levels[this.currentLevel];
    return level.enabledFeatures.includes(featureName);
  }

  getCurrentLevel(): DegradationLevel {
    return this.levels[this.currentLevel];
  }

  private emitLevelChange(level: string): void {
    // Emit metric, send alert, update feature flags
    metrics.emit('degradation_level', { level });
  }
}

// Usage in application code
const degradation = new DegradationController();

async function getProductRecommendations(userId: string): Promise<Product[]> {
  if (degradation.isFeatureEnabled('recommendations')) {
    try {
      return await mlRecommendationService.getPersonalized(userId);
    } catch (error) {
      // Fall through to fallback
    }
  }
  if (degradation.isFeatureEnabled('popular_items')) {
    return await getPopularItems();
  }
  return []; // No recommendations available
}
```

When degrading, inform users. 'Search is temporarily simplified' is better than silently returning poor results. Users are more forgiving of known limitations than unexplained bad experiences. Consider degradation indicators in your UI design.
Timeouts are the most fundamental failure handling mechanism. They prevent unbounded waits and enable the system to fail fast. However, incorrect timeout configuration causes more outages than it prevents.
Types of Timeouts:
| Operation Type | Connection Timeout | Read Timeout | Notes |
|---|---|---|---|
| Health Check | 500ms | 1s | Must be fast to detect issues quickly |
| Cache Read | 100ms | 500ms | Cache should be fast; fail to origin if slow |
| Database Query | 1s | 5-30s | Depends on query complexity; consider statement timeout |
| Internal Service Call | 1s | 5s | Add ~10% to P99 of called service |
| External API Call | 5s | 30s | External services are less predictable |
| File Upload | 5s | 5 minutes | Large uploads need long timeouts |
| Batch Job | N/A | Hours | Use heartbeats instead of timeouts |
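As a sketch of how connection and read timeouts are set separately in practice, the example below uses Python's `requests` library, which accepts a `(connect, read)` timeout tuple; the endpoints and exact values are hypothetical and simply mirror the table above.

```python
# Setting connection vs. read timeouts explicitly (sketch using `requests`).
# Endpoints are hypothetical; values mirror the table above.
import requests

# Cache read: fail fast and fall back to the origin if the cache is slow
try:
    resp = requests.get("http://cache.internal/item/42", timeout=(0.1, 0.5))
    item = resp.json()
except requests.Timeout:
    # Cache exceeded its budget; go to the origin with its own, larger budget
    resp = requests.get("https://origin.example.com/item/42", timeout=(1, 5))
    item = resp.json()

# External API call: less predictable, so allow a longer read timeout
resp = requests.post(
    "https://api.partner.example.com/charge",
    json={"amount": 100},
    timeout=(5, 30),  # 5s to establish the connection, 30s to read the response
)
```

Setting the two values separately matters because a connection that cannot even be established should fail much faster than a request that is legitimately still streaming a response.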
The Timeout Chain Problem
In a microservices architecture, timeouts must be coordinated across the call chain:
Client (30s timeout)
→ API Gateway (25s timeout)
→ Service A (20s timeout)
→ Service B (15s timeout)
→ Database (10s timeout)
Rules for timeout chains:
Outer timeout > Inner timeout — The caller must wait longer than the callee. Otherwise, the caller times out while the callee is still working.
Include retry time in outer timeout — If inner operation has 3 retries with 5s timeout each, outer timeout must be > 15s.
Account for all layers — Network latency, serialization, queue time all add to effective timeout.
Deadline propagation — Pass absolute deadline, not timeout. Each layer knows how much time remains.
```python
# Deadline Propagation Pattern
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RequestContext:
    """Context propagated through service calls."""
    request_id: str
    deadline: datetime  # Absolute time by which request must complete

    @property
    def remaining_ms(self) -> float:
        """Milliseconds remaining before deadline."""
        remaining = (self.deadline - datetime.now()).total_seconds() * 1000
        return max(0, remaining)

    @property
    def is_expired(self) -> bool:
        """Check if deadline has passed."""
        return datetime.now() >= self.deadline

    def with_reduced_deadline(self, buffer_ms: float = 100) -> 'RequestContext':
        """
        Create child context with slightly earlier deadline.
        Buffer ensures we have time to handle timeout gracefully.
        """
        new_deadline = self.deadline - timedelta(milliseconds=buffer_ms)
        return RequestContext(
            request_id=self.request_id,
            deadline=new_deadline
        )


class DeadlineExceededError(Exception):
    """Raised when operation deadline is exceeded."""
    pass


def check_deadline(ctx: RequestContext) -> None:
    """Check deadline and raise if exceeded."""
    if ctx.is_expired:
        raise DeadlineExceededError(
            f"Deadline exceeded for request {ctx.request_id}"
        )


# Example: Deadline-aware service call
async def process_order(ctx: RequestContext, order: dict) -> dict:
    """Process order with deadline propagation."""
    # Check deadline before starting
    check_deadline(ctx)

    # Validate inventory with remaining time
    child_ctx = ctx.with_reduced_deadline(100)  # Reserve 100ms
    inventory_result = await inventory_service.check(
        child_ctx, order['items'],
        timeout_ms=child_ctx.remaining_ms
    )

    check_deadline(ctx)

    # Process payment with remaining time
    child_ctx = ctx.with_reduced_deadline(100)
    payment_result = await payment_service.charge(
        child_ctx, order['total'],
        timeout_ms=child_ctx.remaining_ms
    )

    return {
        "order_id": generate_order_id(),
        "status": "completed"
    }


# HTTP header for deadline propagation
# Request: X-Deadline: 2024-01-15T10:30:00.000Z
# Each service subtracts a buffer and passes the reduced deadline downstream
```

gRPC has built-in deadline propagation via context: set the deadline at the edge and it automatically propagates through all downstream calls, with the framework handling timeout calculation at each hop. Consider gRPC for internal service communication where deadline propagation is critical.
Failure handling code is often the least tested code in a system—it runs rarely in production, and developers can't easily trigger it in development. Chaos engineering solves this by deliberately injecting failures in a controlled manner.
| Experiment | What It Tests | How to Inject | Expected Result |
|---|---|---|---|
| Service kill | Restart resilience | Kill process/container | Other instances handle load |
| Network latency | Timeout handling | tc netem delay | Graceful degradation |
| Network partition | Split-brain handling | iptables rules | Consistent data, no corruption |
| CPU stress | Resource exhaustion | stress-ng | Request queuing, not crashes |
| Memory pressure | OOM handling | stress-ng --vm | Graceful restart, no data loss |
| Disk full | Storage exhaustion | dd to fill disk | Alerts fire, writes fail gracefully |
| DNS failure | DNS dependency | Modify /etc/hosts | Fallback to cached resolution |
| Certificate expiry | TLS handling | Use expired cert | Clear error, not hanging |
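For a taste of what fault injection can look like at the application level, here is a minimal sketch of a decorator that randomly injects latency and errors so failure-handling code paths can be exercised in tests. The function names, rates, and thresholds are illustrative assumptions; production experiments should use the OS- and network-level tools listed in the table.

```python
# Minimal application-level fault injection (illustrative sketch).
# Real chaos tooling injects faults at the OS/network level (tc, iptables,
# stress-ng); this decorator only simulates latency and errors in-process.
import random
import time
from functools import wraps


def inject_faults(latency_ms: float = 200, latency_prob: float = 0.1,
                  error_prob: float = 0.05):
    """Randomly add latency or raise ConnectionError around a callable."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < latency_prob:
                time.sleep(latency_ms / 1000)  # simulated network delay
            if random.random() < error_prob:
                raise ConnectionError("chaos: injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_ms=500, latency_prob=0.2, error_prob=0.1)
def fetch_inventory(item_id: str) -> dict:
    return {"item_id": item_id, "available": True}  # stand-in for a real call
```

Wrapping dependencies this way in a test environment lets you verify that retries, timeouts, and circuit breakers actually engage before running larger experiments against real infrastructure.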
Chaos Engineering in Practice
Popular tools for chaos engineering include Chaos Monkey (Netflix), Gremlin, Chaos Mesh, LitmusChaos, and Toxiproxy for network-level fault injection.
GameDay: Structured Chaos
GameDays are scheduled chaos exercises run with the whole team: define a hypothesis, agree on scope and abort criteria, inject the failure, observe how the system and the on-call process respond, and document the findings as concrete action items.
Chaos engineering is controlled experimentation, not reckless action. Every experiment has a hypothesis, instrumentation, blast radius limits, and abort criteria. If you're 'just breaking things to see what happens,' you're not doing chaos engineering—you're causing incidents.
Failure handling transforms brittle systems into resilient ones. Let's consolidate the essential principles:
For every service dependency ask:
✓ What's the circuit breaker configuration?
✓ What's the retry strategy?
✓ What's the timeout, and how was it determined?
✓ What's the fallback if this fails?
✓ How is this isolated from other dependencies?
✓ How do we know when this is failing?
What's Next:
With failure handling patterns established, the next page covers Trade-off Discussion—the critical skill of articulating design decisions, evaluating alternatives, and communicating the reasoning behind architectural choices.
You now have a comprehensive framework for handling failures in distributed systems. This knowledge, combined with bottleneck identification and component scaling, enables you to build systems that survive and thrive in the face of inevitable failures.