In distributed systems, failure is not a matter of if, but when. Hardware fails, networks partition, dependencies time out, and software has bugs. The difference between systems that survive and systems that collapse lies not in preventing failures—which is impossible—but in how they handle failures when they occur.
Principal engineers design systems with failure as a first-class concern. Every component, every interaction, every data flow must answer the question: What happens when this fails? This page provides a comprehensive framework for building resilient systems that degrade gracefully under stress.
By the end of this page, you will understand the taxonomy of system failures, master resilience patterns (circuit breakers, bulkheads, retries), design for graceful degradation, implement effective error handling strategies, and apply chaos engineering principles to validate your failure handling.
Before handling failures, we must understand them systematically. Failures fall into distinct categories, each requiring different handling strategies.
| Failure Type | Description | Examples | Detection Method |
|---|---|---|---|
| Crash Failures | Component stops completely | OOM kill, process crash, hardware failure | Health checks, heartbeats |
| Omission Failures | Component fails to respond | Network timeout, dropped messages | Timeouts, acknowledgment tracking |
| Timing Failures | Response outside acceptable time | Slow queries, GC pauses | Latency monitoring, deadline tracking |
| Response Failures | Incorrect response returned | Wrong data, corrupted payload | Checksums, validation, testing |
| Byzantine Failures | Arbitrary/malicious behavior | Compromised nodes, data corruption | Voting, consensus, cryptographic verification |
| Cascade Failures | One failure triggers others | Resource exhaustion spreading | Dependency monitoring, circuit breakers |
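As a concrete illustration of the detection column, the sketch below shows a minimal heartbeat-based detector for crash and omission failures. The `HeartbeatMonitor` name, the 5-second interval, and the three-missed-beats threshold are illustrative assumptions, not part of any specific system.

```python
# Minimal heartbeat-based failure detector (illustrative sketch).
# A node is suspected dead after `max_missed` intervals without a heartbeat.
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    interval_seconds: float = 5.0   # expected heartbeat period
    max_missed: int = 3             # missed beats before suspecting a node
    _last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever a heartbeat arrives from a node."""
        self._last_seen[node_id] = time.time()

    def suspected_failures(self) -> List[str]:
        """Return nodes that have missed too many heartbeats."""
        cutoff = time.time() - self.interval_seconds * self.max_missed
        return [node for node, seen in self._last_seen.items() if seen < cutoff]


# Usage sketch
monitor = HeartbeatMonitor(interval_seconds=5.0, max_missed=3)
monitor.record_heartbeat("api-1")
# ... later, in a periodic check loop:
for node in monitor.suspected_failures():
    print(f"Node {node} suspected crashed or partitioned (crash/omission failure)")
```

Note that a missed heartbeat cannot distinguish a crashed node from a partitioned or merely slow one; that ambiguity is why the later patterns (timeouts, retries, circuit breakers) matter.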
The Failure Cascade Problem
The most dangerous failures are those that cascade. A single component failure triggers a chain reaction:
Database slow → Connection pool exhausted → API timeouts → Client retries → More load on database → Database slower → Complete system failure
This cascading behavior is why simple 'retry on failure' logic often makes things worse. Proper failure handling must limit blast radius and prevent amplification.
The fallacies of distributed computing, attributed to Peter Deutsch and his colleagues at Sun, remind us what we cannot assume: the network is reliable, latency is zero, bandwidth is infinite, the network is secure, topology doesn't change, there is one administrator, transport cost is zero, the network is homogeneous. Design failure handling as if every one of these assumptions will eventually be violated, because it will be.
The Circuit Breaker pattern prevents cascading failures by stopping calls to a failing service, allowing it time to recover. Named after electrical circuit breakers, it 'trips' when failures exceed a threshold.
Circuit Breaker States:
```python
# Production-Grade Circuit Breaker Implementation
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from threading import Lock
from typing import Callable, Optional, TypeVar


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery


@dataclass
class CircuitBreakerConfig:
    """Circuit breaker configuration."""
    failure_threshold: int = 5            # Failures before opening
    success_threshold: int = 3            # Successes to close from half-open
    timeout_seconds: float = 30.0         # Time in open before half-open
    half_open_max_calls: int = 3          # Max concurrent half-open calls
    failure_rate_threshold: float = 0.5   # Failure rate to trigger (0-1)
    min_calls_for_rate: int = 10          # Min calls before rate matters


T = TypeVar('T')


class CircuitBreaker:
    """
    Thread-safe circuit breaker with multiple tripping strategies.

    Usage:
        breaker = CircuitBreaker("payment-service")
        try:
            result = breaker.call(lambda: payment_client.charge(amount))
        except CircuitOpenError:
            return fallback_response()
    """

    def __init__(self, name: str, config: CircuitBreakerConfig = None):
        self.name = name
        self.config = config or CircuitBreakerConfig()
        self._state = CircuitState.CLOSED
        self._failure_count = 0
        self._success_count = 0
        self._total_calls = 0
        self._last_failure_time: Optional[datetime] = None
        self._half_open_calls = 0
        self._lock = Lock()

    @property
    def state(self) -> CircuitState:
        with self._lock:
            self._check_state_transition()
            return self._state

    def _check_state_transition(self) -> None:
        """Check if circuit should transition states."""
        if self._state == CircuitState.OPEN:
            if self._last_failure_time:
                elapsed = datetime.now() - self._last_failure_time
                if elapsed.total_seconds() >= self.config.timeout_seconds:
                    self._transition_to(CircuitState.HALF_OPEN)

    def _transition_to(self, new_state: CircuitState) -> None:
        """Transition to new state with logging."""
        old_state = self._state
        self._state = new_state
        if new_state == CircuitState.CLOSED:
            self._failure_count = 0
            self._success_count = 0
            self._total_calls = 0
        elif new_state == CircuitState.HALF_OPEN:
            self._half_open_calls = 0
            self._success_count = 0
        print(f"[CircuitBreaker:{self.name}] {old_state.value} -> {new_state.value}")

    def _should_trip(self) -> bool:
        """Determine if circuit should trip to OPEN."""
        # Consecutive failure threshold
        if self._failure_count >= self.config.failure_threshold:
            return True
        # Failure rate threshold (only after minimum calls)
        if self._total_calls >= self.config.min_calls_for_rate:
            failure_rate = self._failure_count / self._total_calls
            if failure_rate >= self.config.failure_rate_threshold:
                return True
        return False

    def call(self, func: Callable[[], T], fallback: Callable[[], T] = None) -> T:
        """
        Execute function through circuit breaker.

        Args:
            func: The function to execute
            fallback: Optional fallback if circuit is open

        Raises:
            CircuitOpenError: If circuit is open and no fallback provided
        """
        with self._lock:
            self._check_state_transition()

            if self._state == CircuitState.OPEN:
                if fallback:
                    return fallback()
                raise CircuitOpenError(f"Circuit {self.name} is OPEN")

            if self._state == CircuitState.HALF_OPEN:
                if self._half_open_calls >= self.config.half_open_max_calls:
                    if fallback:
                        return fallback()
                    raise CircuitOpenError(
                        f"Circuit {self.name} half-open limit reached"
                    )
                self._half_open_calls += 1

        try:
            result = func()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise

    def _record_success(self) -> None:
        """Record successful call."""
        with self._lock:
            self._success_count += 1
            self._total_calls += 1
            if self._state == CircuitState.HALF_OPEN:
                if self._success_count >= self.config.success_threshold:
                    self._transition_to(CircuitState.CLOSED)

    def _record_failure(self) -> None:
        """Record failed call."""
        with self._lock:
            self._failure_count += 1
            self._total_calls += 1
            self._last_failure_time = datetime.now()
            if self._state == CircuitState.HALF_OPEN:
                self._transition_to(CircuitState.OPEN)
            elif self._state == CircuitState.CLOSED:
                if self._should_trip():
                    self._transition_to(CircuitState.OPEN)


class CircuitOpenError(Exception):
    """Raised when circuit breaker is open."""
    pass


# Example usage
payment_breaker = CircuitBreaker(
    "payment-service",
    CircuitBreakerConfig(
        failure_threshold=5,
        timeout_seconds=30,
        success_threshold=3
    )
)


def process_payment(amount: float) -> dict:
    """Process payment with circuit breaker protection."""
    def call_payment_service():
        # Actual payment service call
        return {"status": "success", "amount": amount}

    def fallback():
        # Queue for later processing
        return {"status": "queued", "message": "Payment queued for retry"}

    return payment_breaker.call(call_payment_service, fallback)
```

Don't implement circuit breakers from scratch in production. Use battle-tested libraries: Resilience4j (Java), Polly (.NET), Hystrix (Java, now in maintenance mode), pybreaker (Python), or service mesh capabilities (Istio, Linkerd). These provide observability, thread safety, and edge-case handling.
Retries are the most common failure handling mechanism, but naive retries cause more harm than good. Proper retry implementation requires understanding when to retry, how to backoff, and when to give up.
Exponential Backoff with Jitter
The gold standard for retry timing. Each retry waits longer than the last, and jitter prevents synchronized retries from multiple clients.
wait_time = min(cap, base * 2^(attempt - 1)) + random(0, jitter), where attempt = 1, 2, 3, ...
Example with base=100ms, cap=30s, jitter=100ms:
Attempt 1: 100-200ms
Attempt 2: 200-300ms
Attempt 3: 400-500ms
Attempt 4: 800-900ms
Attempt 5: 1600-1700ms
...
Attempt N: 30000-30100ms (capped)
Why jitter matters: Without jitter, if 1000 clients timeout simultaneously and all retry after exactly 1 second, you get 1000 simultaneous requests again—a thundering herd. Jitter spreads retries across time.
```python
# Comprehensive Retry Implementation
import random
import time
from dataclasses import dataclass
from functools import wraps
from typing import Callable, Optional, Set, Type, TypeVar

T = TypeVar('T')


@dataclass
class RetryConfig:
    """Configuration for retry behavior."""
    max_attempts: int = 3
    base_delay_ms: float = 100
    max_delay_ms: float = 30000
    exponential_base: float = 2
    jitter_ms: float = 100
    retryable_exceptions: Optional[Set[Type[Exception]]] = None

    def __post_init__(self):
        if self.retryable_exceptions is None:
            self.retryable_exceptions = {
                ConnectionError,
                TimeoutError,
                IOError,
            }


def calculate_delay(attempt: int, config: RetryConfig) -> float:
    """Calculate delay (in milliseconds) with exponential backoff and jitter."""
    exponential_delay = config.base_delay_ms * (config.exponential_base ** attempt)
    capped_delay = min(exponential_delay, config.max_delay_ms)
    jitter = random.uniform(0, config.jitter_ms)
    return capped_delay + jitter


def is_retryable(exception: Exception, config: RetryConfig) -> bool:
    """Determine if exception is retryable."""
    return any(
        isinstance(exception, exc_type)
        for exc_type in config.retryable_exceptions
    )


def with_retry(config: RetryConfig = None):
    """
    Decorator for retrying functions with exponential backoff.

    Usage:
        @with_retry(RetryConfig(max_attempts=5))
        def call_external_service():
            ...
    """
    if config is None:
        config = RetryConfig()

    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            last_exception = None
            for attempt in range(config.max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if not is_retryable(e, config):
                        raise  # Non-retryable, fail immediately
                    if attempt < config.max_attempts - 1:
                        delay_ms = calculate_delay(attempt, config)
                        print(f"Retry {attempt + 1}/{config.max_attempts} "
                              f"after {delay_ms:.0f}ms: {e}")
                        time.sleep(delay_ms / 1000)
                    else:
                        print(f"All {config.max_attempts} attempts failed")
                        raise last_exception
        return wrapper
    return decorator


# Retry budget pattern - limit total retries system-wide
class RetryBudget:
    """
    Limits retries globally to prevent retry storms.

    Instead of each request getting N retries, the system
    gets a budget of retries per time window.
    """

    def __init__(self,
                 budget_per_second: float = 0.1,     # 10% of traffic can be retries
                 min_retries_per_second: int = 10):  # But always allow some
        self.budget_ratio = budget_per_second
        self.min_retries = min_retries_per_second
        self._request_count = 0
        self._retry_count = 0
        self._window_start = time.time()
        self._window_seconds = 1.0

    def _reset_window_if_needed(self):
        now = time.time()
        if now - self._window_start >= self._window_seconds:
            self._request_count = 0
            self._retry_count = 0
            self._window_start = now

    def record_request(self):
        self._reset_window_if_needed()
        self._request_count += 1

    def can_retry(self) -> bool:
        """Check if retry budget allows another retry."""
        self._reset_window_if_needed()
        # Always allow minimum retries
        if self._retry_count < self.min_retries:
            return True
        # Check if under budget
        if self._request_count == 0:
            return True
        retry_ratio = self._retry_count / self._request_count
        return retry_ratio < self.budget_ratio

    def record_retry(self) -> bool:
        """Record a retry, return True if allowed."""
        if self.can_retry():
            self._retry_count += 1
            return True
        return False


# Example: Using retry budget
retry_budget = RetryBudget(budget_per_second=0.1)


def make_request_with_budget(url: str) -> dict:
    retry_budget.record_request()
    for attempt in range(3):
        try:
            return http_get(url)  # hypothetical HTTP client call
        except TimeoutError:
            if attempt < 2 and retry_budget.record_retry():
                # calculate_delay returns milliseconds; sleep takes seconds
                time.sleep(calculate_delay(attempt, RetryConfig()) / 1000)
                continue
            raise
    raise RuntimeError("Should not reach here")
```

If each of 5 services in a call chain makes up to 3 attempts, a single user request can trigger 3^5 = 243 calls at the bottom of the chain. Implement retry budgets, coordinate retries at the edge (not at each layer), and use circuit breakers to prevent this amplification.
The Bulkhead pattern, named after ship compartments that prevent a single breach from sinking the entire vessel, isolates components so that failure in one doesn't exhaust resources needed by others.
Common Bulkhead Implementations:
| Isolation Level | Blast Radius | Resource Overhead | Operational Complexity | Example Use Case |
|---|---|---|---|---|
| Thread Pool | Per-dependency | Low | Low | Slow external API isolation |
| Connection Pool | Per-dependency | Low-Medium | Low | Different DB operation types |
| Process | Per-workload | Medium | Medium | Batch vs online processing |
| Container | Per-service | Medium | Medium | Microservice isolation |
| VM/Instance | Per-tier | High | High | Customer tier isolation |
| Region/Cluster | Per-deployment | Very High | High | Complete failure domain isolation |
```python
# Thread Pool Bulkhead Implementation
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from dataclasses import dataclass
from typing import Callable, Dict, TypeVar
import threading

T = TypeVar('T')


@dataclass
class BulkheadConfig:
    """Configuration for a bulkhead."""
    max_concurrent: int = 10
    max_waiting: int = 5
    timeout_seconds: float = 30.0


class Bulkhead:
    """
    Thread pool bulkhead for isolating concurrent operations.

    Usage:
        payment_bulkhead = Bulkhead("payments", BulkheadConfig(max_concurrent=10))
        result = payment_bulkhead.execute(lambda: payment_service.charge(amount))
    """

    def __init__(self, name: str, config: BulkheadConfig = None):
        self.name = name
        self.config = config or BulkheadConfig()
        self._executor = ThreadPoolExecutor(
            max_workers=self.config.max_concurrent,
            thread_name_prefix=f"bulkhead-{name}"
        )
        self._semaphore = threading.Semaphore(
            self.config.max_concurrent + self.config.max_waiting
        )
        self._active = 0
        self._waiting = 0
        self._lock = threading.Lock()

    def execute(self, func: Callable[[], T]) -> T:
        """
        Execute function within bulkhead constraints.

        Raises:
            BulkheadFullError: If bulkhead is at capacity
            TimeoutError: If execution times out
        """
        # Check if we can acquire a slot
        acquired = self._semaphore.acquire(blocking=False)
        if not acquired:
            raise BulkheadFullError(
                f"Bulkhead {self.name} is full: "
                f"{self._active} active, {self._waiting} waiting"
            )

        try:
            with self._lock:
                self._waiting += 1

            future = self._executor.submit(func)

            with self._lock:
                self._waiting -= 1
                self._active += 1

            try:
                return future.result(timeout=self.config.timeout_seconds)
            except TimeoutError:
                future.cancel()
                raise
            finally:
                with self._lock:
                    self._active -= 1
        finally:
            self._semaphore.release()

    def metrics(self) -> dict:
        """Get current bulkhead metrics."""
        with self._lock:
            return {
                "name": self.name,
                "max_concurrent": self.config.max_concurrent,
                "max_waiting": self.config.max_waiting,
                "active": self._active,
                "waiting": self._waiting,
                "available": self.config.max_concurrent - self._active,
            }


class BulkheadFullError(Exception):
    """Raised when bulkhead cannot accept more work."""
    pass


class BulkheadRegistry:
    """
    Central registry for managing multiple bulkheads.
    Provides a single point for monitoring and configuration.
    """

    def __init__(self):
        self._bulkheads: Dict[str, Bulkhead] = {}
        self._lock = threading.Lock()

    def get_or_create(self, name: str, config: BulkheadConfig = None) -> Bulkhead:
        """Get existing bulkhead or create new one."""
        with self._lock:
            if name not in self._bulkheads:
                self._bulkheads[name] = Bulkhead(name, config)
            return self._bulkheads[name]

    def all_metrics(self) -> list:
        """Get metrics for all bulkheads."""
        with self._lock:
            return [b.metrics() for b in self._bulkheads.values()]


# Global registry
bulkheads = BulkheadRegistry()

# Usage example - different bulkheads for different dependencies
payment_bulkhead = bulkheads.get_or_create(
    "payments", BulkheadConfig(max_concurrent=10, timeout_seconds=30)
)

inventory_bulkhead = bulkheads.get_or_create(
    "inventory", BulkheadConfig(max_concurrent=50, timeout_seconds=5)
)


def checkout(cart) -> dict:
    """Checkout with bulkhead isolation."""
    # If payments is slow, it won't affect inventory checks
    inventory_result = inventory_bulkhead.execute(
        lambda: check_inventory(cart.items)
    )
    if not inventory_result.available:
        return {"error": "Items not available"}

    payment_result = payment_bulkhead.execute(
        lambda: process_payment(cart.total)
    )
    return {"order_id": payment_result.order_id}
```

Size bulkheads based on observed usage patterns. Monitor queue depth and rejected calls. Too small means unnecessary rejections during normal traffic; too large means no isolation benefit. Start conservative and tune based on production metrics.
Graceful degradation means providing reduced but functional service when components fail. Instead of complete failure, the system continues with diminished capabilities.
Design Principles for Graceful Degradation:
| Feature | Normal Behavior | Degraded Behavior | Impact |
|---|---|---|---|
| Product Recommendations | ML-personalized suggestions | Trending/popular items | Lower conversion, but functional |
| Search | Semantic + ML ranking | Keyword match only | Less relevant results |
| User Authentication | Full OAuth flow | Token validation only | No new logins, existing sessions work |
| Real-time Inventory | Live stock counts | Cached counts + buffer | Occasional oversell |
| Dynamic Pricing | ML-based pricing | Static price list | Suboptimal pricing |
| Comments/Reviews | Full social features | Read-only mode | No new content, viewing works |
Implementing Feature Degradation
A structured approach to feature degradation uses feature flags and dependency health monitoring:
```typescript
// Graceful Degradation Framework
interface DegradationLevel {
  level: 'normal' | 'degraded' | 'minimal' | 'emergency';
  enabledFeatures: string[];
  disabledFeatures: string[];
}

interface DependencyHealth {
  name: string;
  healthy: boolean;
  latencyP99Ms: number;
  errorRate: number;
}

class DegradationController {
  private currentLevel: DegradationLevel['level'] = 'normal';
  private dependencies: Map<string, DependencyHealth> = new Map();

  // Define what features are available at each level
  private readonly levels: Record<string, DegradationLevel> = {
    normal: {
      level: 'normal',
      enabledFeatures: ['ml_search', 'personalization', 'real_time_inventory',
                        'dynamic_pricing', 'reviews', 'recommendations'],
      disabledFeatures: [],
    },
    degraded: {
      level: 'degraded',
      enabledFeatures: ['basic_search', 'cached_inventory', 'static_pricing',
                        'reviews_readonly', 'popular_items'],
      disabledFeatures: ['ml_search', 'personalization', 'real_time_inventory',
                         'dynamic_pricing', 'recommendations'],
    },
    minimal: {
      level: 'minimal',
      enabledFeatures: ['basic_search', 'cached_inventory', 'static_pricing'],
      disabledFeatures: ['reviews', 'recommendations', 'personalization',
                         'real_time_inventory', 'dynamic_pricing'],
    },
    emergency: {
      level: 'emergency',
      enabledFeatures: ['static_catalog', 'cached_content'],
      disabledFeatures: ['search', 'checkout', 'user_accounts'],
    },
  };

  updateDependencyHealth(health: DependencyHealth): void {
    this.dependencies.set(health.name, health);
    this.evaluateDegradationLevel();
  }

  private evaluateDegradationLevel(): void {
    const healths = Array.from(this.dependencies.values());

    // Calculate overall system health score
    const unhealthyCount = healths.filter(h => !h.healthy).length;
    const avgErrorRate = healths.reduce((sum, h) => sum + h.errorRate, 0) / healths.length;
    const avgLatency = healths.reduce((sum, h) => sum + h.latencyP99Ms, 0) / healths.length;

    // Determine appropriate degradation level
    let newLevel: DegradationLevel['level'];
    if (unhealthyCount === 0 && avgErrorRate < 0.01 && avgLatency < 100) {
      newLevel = 'normal';
    } else if (unhealthyCount <= 2 && avgErrorRate < 0.05) {
      newLevel = 'degraded';
    } else if (unhealthyCount <= 4 && avgErrorRate < 0.20) {
      newLevel = 'minimal';
    } else {
      newLevel = 'emergency';
    }

    if (newLevel !== this.currentLevel) {
      console.log(`Degradation level changing: ${this.currentLevel} -> ${newLevel}`);
      this.currentLevel = newLevel;
      this.emitLevelChange(newLevel);
    }
  }

  isFeatureEnabled(featureName: string): boolean {
    const level = this.levels[this.currentLevel];
    return level.enabledFeatures.includes(featureName);
  }

  getCurrentLevel(): DegradationLevel {
    return this.levels[this.currentLevel];
  }

  private emitLevelChange(level: string): void {
    // Emit metric, send alert, update feature flags
    metrics.emit('degradation_level', { level });
  }
}

// Usage in application code
const degradation = new DegradationController();

async function getProductRecommendations(userId: string): Promise<Product[]> {
  if (degradation.isFeatureEnabled('recommendations')) {
    try {
      return await mlRecommendationService.getPersonalized(userId);
    } catch (error) {
      // Fall through to fallback
    }
  }
  if (degradation.isFeatureEnabled('popular_items')) {
    return await getPopularItems();
  }
  return []; // No recommendations available
}
```

When degrading, inform users. 'Search is temporarily simplified' is better than silently returning poor results. Users are more forgiving of known limitations than unexplained bad experiences. Consider degradation indicators in your UI design.
Timeouts are the most fundamental failure handling mechanism. They prevent unbounded waits and enable the system to fail fast. However, incorrect timeout configuration causes more outages than it prevents.
Types of Timeouts:
| Operation Type | Connection Timeout | Read Timeout | Notes |
|---|---|---|---|
| Health Check | 500ms | 1s | Must be fast to detect issues quickly |
| Cache Read | 100ms | 500ms | Cache should be fast; fail to origin if slow |
| Database Query | 1s | 5-30s | Depends on query complexity; consider statement timeout |
| Internal Service Call | 1s | 5s | Add ~10% to P99 of called service |
| External API Call | 5s | 30s | External services are less predictable |
| File Upload | 5s | 5 minutes | Large uploads need long timeouts |
| Batch Job | N/A | Hours | Use heartbeats instead of timeouts |
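As a sketch of how connection and read timeouts are set separately in practice, the example below uses Python's `requests` library, which accepts a `(connect, read)` timeout tuple; the endpoints and exact values are hypothetical and simply mirror the table above.

```python
# Setting connection vs. read timeouts explicitly (sketch using `requests`).
# Endpoints are hypothetical; values mirror the table above.
import requests

# Cache read: fail fast and fall back to the origin if the cache is slow
try:
    resp = requests.get("http://cache.internal/item/42", timeout=(0.1, 0.5))
    item = resp.json()
except requests.Timeout:
    # Cache exceeded its budget; go to the origin with its own, larger budget
    resp = requests.get("https://origin.example.com/item/42", timeout=(1, 5))
    item = resp.json()

# External API call: less predictable, so allow a longer read timeout
resp = requests.post(
    "https://api.partner.example.com/charge",
    json={"amount": 100},
    timeout=(5, 30),  # 5s to establish the connection, 30s to read the response
)
```

Setting the two values separately matters because a connection that cannot even be established should fail much faster than a request that is legitimately still streaming a response.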
The Timeout Chain Problem
In a microservices architecture, timeouts must be coordinated across the call chain:
Client (30s timeout)
→ API Gateway (25s timeout)
→ Service A (20s timeout)
→ Service B (15s timeout)
→ Database (10s timeout)
Rules for timeout chains:
Outer timeout > Inner timeout — The caller must wait longer than the callee. Otherwise, the caller times out while the callee is still working.
Include retry time in outer timeout — If inner operation has 3 retries with 5s timeout each, outer timeout must be > 15s.
Account for all layers — Network latency, serialization, queue time all add to effective timeout.
Deadline propagation — Pass absolute deadline, not timeout. Each layer knows how much time remains.
```python
# Deadline Propagation Pattern
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RequestContext:
    """Context propagated through service calls."""
    request_id: str
    deadline: datetime  # Absolute time by which request must complete

    @property
    def remaining_ms(self) -> float:
        """Milliseconds remaining before deadline."""
        remaining = (self.deadline - datetime.now()).total_seconds() * 1000
        return max(0, remaining)

    @property
    def is_expired(self) -> bool:
        """Check if deadline has passed."""
        return datetime.now() >= self.deadline

    def with_reduced_deadline(self, buffer_ms: float = 100) -> 'RequestContext':
        """
        Create child context with slightly earlier deadline.
        Buffer ensures we have time to handle timeout gracefully.
        """
        new_deadline = self.deadline - timedelta(milliseconds=buffer_ms)
        return RequestContext(
            request_id=self.request_id,
            deadline=new_deadline
        )


class DeadlineExceededError(Exception):
    """Raised when operation deadline is exceeded."""
    pass


def check_deadline(ctx: RequestContext) -> None:
    """Check deadline and raise if exceeded."""
    if ctx.is_expired:
        raise DeadlineExceededError(
            f"Deadline exceeded for request {ctx.request_id}"
        )


# Example: Deadline-aware service call
async def process_order(ctx: RequestContext, order: dict) -> dict:
    """Process order with deadline propagation."""
    # Check deadline before starting
    check_deadline(ctx)

    # Validate inventory with remaining time
    child_ctx = ctx.with_reduced_deadline(100)  # Reserve 100ms
    inventory_result = await inventory_service.check(
        child_ctx, order['items'],
        timeout_ms=child_ctx.remaining_ms
    )

    check_deadline(ctx)

    # Process payment with remaining time
    child_ctx = ctx.with_reduced_deadline(100)
    payment_result = await payment_service.charge(
        child_ctx, order['total'],
        timeout_ms=child_ctx.remaining_ms
    )

    return {
        "order_id": generate_order_id(),
        "status": "completed"
    }


# HTTP header for deadline propagation
# Request: X-Deadline: 2024-01-15T10:30:00.000Z
# Each service subtracts a buffer and passes the reduced deadline downstream
```

gRPC has built-in deadline propagation via context: set the deadline at the edge and it automatically propagates through all downstream calls, with the framework handling timeout calculation at each hop. Consider gRPC for internal service communication where deadline propagation is critical.
Failure handling code is often the least tested code in a system—it runs rarely in production, and developers can't easily trigger it in development. Chaos engineering solves this by deliberately injecting failures in a controlled manner.
| Experiment | What It Tests | How to Inject | Expected Result |
|---|---|---|---|
| Service kill | Restart resilience | Kill process/container | Other instances handle load |
| Network latency | Timeout handling | tc netem delay | Graceful degradation |
| Network partition | Split-brain handling | iptables rules | Consistent data, no corruption |
| CPU stress | Resource exhaustion | stress-ng | Request queuing, not crashes |
| Memory pressure | OOM handling | stress-ng --vm | Graceful restart, no data loss |
| Disk full | Storage exhaustion | dd to fill disk | Alerts fire, writes fail gracefully |
| DNS failure | DNS dependency | Modify /etc/hosts | Fallback to cached resolution |
| Certificate expiry | TLS handling | Use expired cert | Clear error, not hanging |
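For a taste of what fault injection can look like at the application level, here is a minimal sketch of a decorator that randomly injects latency and errors so failure-handling code paths can be exercised in tests. The function names, rates, and thresholds are illustrative assumptions; production experiments should use the OS- and network-level tools listed in the table.

```python
# Minimal application-level fault injection (illustrative sketch).
# Real chaos tooling injects faults at the OS/network level (tc, iptables,
# stress-ng); this decorator only simulates latency and errors in-process.
import random
import time
from functools import wraps


def inject_faults(latency_ms: float = 200, latency_prob: float = 0.1,
                  error_prob: float = 0.05):
    """Randomly add latency or raise ConnectionError around a callable."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < latency_prob:
                time.sleep(latency_ms / 1000)  # simulated network delay
            if random.random() < error_prob:
                raise ConnectionError("chaos: injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(latency_ms=500, latency_prob=0.2, error_prob=0.1)
def fetch_inventory(item_id: str) -> dict:
    return {"item_id": item_id, "available": True}  # stand-in for a real call
```

Wrapping dependencies this way in a test environment lets you verify that retries, timeouts, and circuit breakers actually engage before running larger experiments against real infrastructure.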
Chaos Engineering in Practice
Popular tools for chaos engineering include Chaos Monkey (Netflix), Gremlin, Chaos Mesh, LitmusChaos, and Toxiproxy for network-level fault injection.
GameDay: Structured Chaos
GameDays are scheduled chaos exercises run with the whole team: define a hypothesis, agree on scope and abort criteria, inject the failure, observe how the system and the on-call process respond, and document the findings as concrete action items.
Chaos engineering is controlled experimentation, not reckless action. Every experiment has a hypothesis, instrumentation, blast radius limits, and abort criteria. If you're 'just breaking things to see what happens,' you're not doing chaos engineering—you're causing incidents.
Failure handling transforms brittle systems into resilient ones. Let's consolidate the essential principles:
For every service dependency ask:
✓ What's the circuit breaker configuration?
✓ What's the retry strategy?
✓ What's the timeout, and how was it determined?
✓ What's the fallback if this fails?
✓ How is this isolated from other dependencies?
✓ How do we know when this is failing?
What's Next:
With failure handling patterns established, the next page covers Trade-off Discussion—the critical skill of articulating design decisions, evaluating alternatives, and communicating the reasoning behind architectural choices.
You now have a comprehensive framework for handling failures in distributed systems. This knowledge, combined with bottleneck identification and component scaling, enables you to build systems that survive and thrive in the face of inevitable failures.