In a monolithic application, a method call either succeeds, throws an exception, or hangs indefinitely (a bug). In distributed systems, an entirely new category of failures emerges: partial failures. Service A might successfully process a request, but Service B's response is lost in transit. Or Service B processes the request but takes so long that Service A times out and assumes failure.
This is not a pathological scenario—it's the normal operating mode of distributed systems. Networks fail, services restart, databases have maintenance windows, and cloud providers experience outages. Building reliable systems means accepting that failure is not exceptional; it's expected.
The question isn't "how do we prevent failures?" but rather "how do we design systems that behave predictably when failures inevitably occur?"
By the end of this page, you will understand the unique failure modes of distributed systems, master patterns like circuit breakers, retries, and bulkheads, learn to propagate errors meaningfully across service boundaries, and design systems that degrade gracefully rather than failing catastrophically.
Before designing error handling strategies, we must understand the unique failure modes of distributed systems. These failures don't exist in monolithic applications and require fundamentally different handling approaches.
Connection Failure: Service cannot establish a TCP connection
Request Timeout: Connection established but no response received
Partial Response: Response truncated or corrupted
Server Errors (5xx): Service understood request but couldn't fulfill it
Client Errors (4xx): Request was problematic
The most insidious failure mode is timeout with unknown outcome. When Service A calls Service B and times out, there are three possibilities:
Service B never received the request (retrying is safe).
Service B received and processed the request, but the response was lost (retrying duplicates the work).
Service B is still processing the request and will finish after Service A has already given up.
You cannot distinguish these cases from Service A's perspective. This fundamental uncertainty drives the need for idempotency and compensating transactions.
| Failure Type | Retryable? | Safe to Retry Without Idempotency? | Typical Action |
|---|---|---|---|
| Connection refused | Yes | Yes | Immediate retry (different instance) |
| DNS failure | Usually | Yes | Retry after delay (DNS may recover) |
| Connection timeout | Yes | Yes | Retry (operation never reached server) |
| Read timeout | Maybe | NO — Uncertain state | Retry only if idempotent |
| 503 Service Unavailable | Yes | Usually Yes | Retry with backoff |
| 500 Internal Server Error | Maybe | Maybe | Depends on error type |
| 400 Bad Request | No | N/A | Fix request or fail permanently |
| 401 Unauthorized | Maybe | Yes | Refresh credentials, retry once |
| 429 Too Many Requests | Yes | Yes | Back off per Retry-After header |
Read timeouts are the most dangerous failure mode. If you time out while waiting for a response, you have zero knowledge about whether the operation completed. Naively retrying can cause duplicate orders, duplicate payments, or double inventory decrements. Every mutating operation in a distributed system should be designed for idempotent retry.
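One common way to make a mutating call safe to retry is an idempotency key: the client generates a unique key per logical operation and sends it with every attempt, and the server stores the result keyed by it so repeated attempts return the original outcome instead of re-executing. A minimal sketch, assuming hypothetical `ChargeRequest`/`ChargeResult` types and an in-memory store (a real system would persist keys in a database or cache):

```typescript
// Minimal idempotency sketch. The types and in-memory Map are illustrative
// assumptions, not a specific payment API.
import { randomUUID } from 'crypto';

interface ChargeRequest { customerId: string; amountCents: number; }
interface ChargeResult { chargeId: string; status: 'succeeded' | 'failed'; }

// Server side: remember the outcome of each idempotency key
const processedCharges = new Map<string, ChargeResult>();

async function chargeOnce(key: string, req: ChargeRequest): Promise<ChargeResult> {
  const existing = processedCharges.get(key);
  if (existing) return existing; // Replay: return the stored result, don't charge again

  const result: ChargeResult = { // Execute the real charge exactly once
    chargeId: randomUUID(),
    status: 'succeeded',
  };
  processedCharges.set(key, result);
  return result;
}

// Client side: reuse the SAME key across retries of the same logical operation
async function chargeWithRetry(req: ChargeRequest): Promise<ChargeResult> {
  const idempotencyKey = randomUUID(); // Generated once per order, not per attempt
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await chargeOnce(idempotencyKey, req);
    } catch {
      // A read timeout here is now safe: replaying the key cannot double-charge
    }
  }
  throw new Error('Charge failed after retries');
}
```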
Retries are the first line of defense against transient failures. But naive retries can amplify problems instead of solving them. Effective retry strategies require careful consideration of timing, backoff, and limits.
Fixed Delay: Wait the same amount between retries
Exponential Backoff: Double the delay each retry (1s, 2s, 4s, 8s...)
Exponential Backoff with Jitter: Add randomness to break synchronization
Decorrelated Jitter: Even more aggressive randomization
sleep = min(cap, random_between(base, sleep * 3))
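The comprehensive implementation below uses plain exponential backoff with jitter; for completeness, here is a small sketch of the decorrelated jitter formula above (the base and cap values are illustrative):

```typescript
// Decorrelated jitter: each delay is drawn from [base, previousDelay * 3], capped
function nextDecorrelatedDelay(previousMs: number, baseMs = 100, capMs = 30_000): number {
  const raw = baseMs + Math.random() * (previousMs * 3 - baseMs);
  return Math.min(capMs, Math.max(baseMs, raw));
}

// Example: generate the first five delays
let delay = 100;
for (let i = 0; i < 5; i++) {
  delay = nextDecorrelatedDelay(delay);
  console.log(`attempt ${i + 1}: wait ${delay.toFixed(0)}ms`);
}
```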
```typescript
// Comprehensive retry implementation with exponential backoff and jitter
interface RetryConfig {
  maxRetries: number;           // Maximum retry attempts
  baseDelayMs: number;          // Base delay for exponential backoff
  maxDelayMs: number;           // Cap on delay
  jitterFactor: number;         // 0-1, how much randomness to add
  retryableErrors: Set<string>; // Error types that should trigger retry
}

const defaultConfig: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 100,
  maxDelayMs: 30000,
  jitterFactor: 0.5,
  retryableErrors: new Set([
    'ECONNRESET',
    'ETIMEDOUT',
    'ECONNREFUSED',
    'SERVICE_UNAVAILABLE',
    'TOO_MANY_REQUESTS',
  ]),
};

async function withRetry<T>(
  operation: () => Promise<T>,
  config: Partial<RetryConfig> = {}
): Promise<T> {
  const cfg = { ...defaultConfig, ...config };
  let lastError: Error;

  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      // Check if error is retryable
      if (!isRetryable(error, cfg.retryableErrors)) {
        throw error; // Non-retryable, fail immediately
      }

      // Check if we have retries left
      if (attempt === cfg.maxRetries) {
        throw new Error(
          `All ${cfg.maxRetries} retries exhausted: ${lastError.message}`
        );
      }

      // Calculate delay with exponential backoff + jitter
      const exponentialDelay = cfg.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, cfg.maxDelayMs);
      const jitter = cappedDelay * cfg.jitterFactor * Math.random();
      const finalDelay = cappedDelay + jitter;

      console.log(
        `Retry ${attempt + 1}/${cfg.maxRetries} after ${finalDelay.toFixed(0)}ms`
      );
      await sleep(finalDelay);
    }
  }

  throw lastError!;
}

function isRetryable(error: unknown, retryableErrors: Set<string>): boolean {
  if (error instanceof HttpError) {
    // Rate limit: always retry (after backoff)
    if (error.status === 429) return true;
    // Server errors: retry (service might recover)
    if (error.status >= 500 && error.status <= 599) return true;
    // Client errors: don't retry (our request is wrong)
    if (error.status >= 400 && error.status <= 499) return false;
  }

  if (error instanceof Error) {
    // Check error code (Node.js network errors)
    const code = (error as NodeJS.ErrnoException).code;
    if (code && retryableErrors.has(code)) return true;
  }

  return false;
}

// Usage
const result = await withRetry(
  () => orderService.createOrder(orderData),
  { maxRetries: 3, baseDelayMs: 100 }
);
```

Instead of per-request retry limits, use retry budgets: 'no more than 20% of requests should be retries.' This prevents retry amplification where retries cause more load than original requests. Google SRE practices emphasize budget-based approaches over per-call configurations.
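A minimal sketch of a retry budget: track original calls versus retries in a sliding window and refuse to retry once retries exceed the budgeted fraction. The 20% ratio and window size below are illustrative assumptions, not prescribed values.

```typescript
// Minimal retry budget sketch: allow retries only while they stay under
// a fixed fraction of recent traffic. Window size and ratio are assumptions.
class RetryBudget {
  private calls: number[] = [];   // Timestamps of original calls
  private retries: number[] = []; // Timestamps of retries

  constructor(
    private readonly maxRetryRatio = 0.2, // Retries may be at most 20% of calls
    private readonly windowMs = 10_000
  ) {}

  recordCall(): void {
    this.calls.push(Date.now());
  }

  canRetry(): boolean {
    this.prune();
    const total = this.calls.length;
    if (total === 0) return false;
    return this.retries.length / total < this.maxRetryRatio;
  }

  recordRetry(): void {
    this.retries.push(Date.now());
  }

  private prune(): void {
    const cutoff = Date.now() - this.windowMs;
    this.calls = this.calls.filter(t => t > cutoff);
    this.retries = this.retries.filter(t => t > cutoff);
  }
}

// Usage inside a retry loop: skip the retry when the budget is exhausted
const budget = new RetryBudget();
budget.recordCall();
if (budget.canRetry()) {
  budget.recordRetry();
  // ...re-issue the request here
}
```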
Circuit breakers prevent cascade failures by stopping calls to failing services. Named after electrical circuit breakers that prevent overload, they provide fast failure and automatic recovery.
CLOSED (Normal Operation): Requests flow through normally while failures are counted; crossing the failure threshold trips the circuit to OPEN.
OPEN (Blocking Requests): Calls fail immediately without reaching the downstream service; after the open timeout, the circuit moves to HALF-OPEN.
HALF-OPEN (Testing Recovery): A limited number of trial requests are allowed through; enough successes close the circuit, and any failure reopens it.
This state machine prevents a failing service from being overwhelmed while simultaneously failing fast for callers.
| Parameter | Typical Value | Purpose |
|---|---|---|
| Failure Threshold | 5-10 failures | Errors before opening |
| Failure Rate Threshold | 50% failure rate | Error percentage before opening |
| Measurement Window | 10 seconds | Time window for counting failures |
| Open Timeout | 30-60 seconds | Time before trying HALF-OPEN |
| Half-Open Trials | 3 requests | Successful calls needed to close |
| Minimum Throughput | 10 requests | Min calls before rate calculation |
```typescript
// Production-grade circuit breaker implementation
enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN',
}

interface CircuitBreakerConfig {
  failureThreshold: number;     // Failures before opening
  failureRateThreshold: number; // Failure rate (0-1) before opening
  successThreshold: number;     // Successes in half-open before closing
  openTimeout: number;          // Ms before trying half-open
  windowSize: number;           // Sliding window size in ms
  minimumThroughput: number;    // Min calls before rate calculation
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number[] = [];  // Failure timestamps
  private successes: number[] = []; // Success timestamps
  private lastFailure: number = 0;
  private halfOpenSuccesses: number = 0;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (!this.canExecute()) {
      throw new CircuitOpenError(
        `Circuit breaker '${this.name}' is OPEN. ` +
        `Retry after ${this.getRetryAfter()}ms`
      );
    }

    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private canExecute(): boolean {
    this.pruneOldCalls();

    switch (this.state) {
      case CircuitState.CLOSED:
        return true;

      case CircuitState.OPEN:
        if (Date.now() - this.lastFailure >= this.config.openTimeout) {
          this.state = CircuitState.HALF_OPEN;
          this.halfOpenSuccesses = 0;
          console.log(`Circuit '${this.name}' → HALF_OPEN`);
          return true;
        }
        return false;

      case CircuitState.HALF_OPEN:
        return true;
    }
  }

  private recordSuccess(): void {
    const now = Date.now();
    this.successes.push(now);

    if (this.state === CircuitState.HALF_OPEN) {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.failures = [];
        console.log(`Circuit '${this.name}' → CLOSED`);
      }
    }
  }

  private recordFailure(): void {
    const now = Date.now();
    this.failures.push(now);
    this.lastFailure = now;

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      console.log(`Circuit '${this.name}' → OPEN (half-open failure)`);
      return;
    }

    if (this.shouldOpen()) {
      this.state = CircuitState.OPEN;
      console.log(`Circuit '${this.name}' → OPEN`);
    }
  }

  private shouldOpen(): boolean {
    if (this.state !== CircuitState.CLOSED) return false;
    this.pruneOldCalls();

    const totalCalls = this.failures.length + this.successes.length;
    if (totalCalls < this.config.minimumThroughput) return false;

    // Check absolute threshold
    if (this.failures.length >= this.config.failureThreshold) return true;

    // Check rate threshold
    const failureRate = this.failures.length / totalCalls;
    return failureRate >= this.config.failureRateThreshold;
  }

  private pruneOldCalls(): void {
    const cutoff = Date.now() - this.config.windowSize;
    this.failures = this.failures.filter(t => t > cutoff);
    this.successes = this.successes.filter(t => t > cutoff);
  }

  private getRetryAfter(): number {
    const elapsed = Date.now() - this.lastFailure;
    return Math.max(0, this.config.openTimeout - elapsed);
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Usage with circuit breaker per downstream service
const paymentCircuit = new CircuitBreaker('payment-service', {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  successThreshold: 3,
  openTimeout: 30000,
  windowSize: 60000,
  minimumThroughput: 10,
});

async function processPayment(order: Order): Promise<PaymentResult> {
  return paymentCircuit.execute(() =>
    paymentService.charge(order.customerId, order.total)
  );
}
```

Create separate circuit breakers for each downstream service. If the payment service is failing, you don't want the inventory service circuit to open. Bulkhead isolation combined with per-service circuits prevents localized failures from becoming system-wide outages.
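One convenient way to enforce "one circuit per downstream service" is a small registry that lazily creates and caches a breaker per service name. A minimal sketch building on the CircuitBreaker class above; the registry itself and its default thresholds are assumptions, not part of any specific library:

```typescript
// Hypothetical per-service circuit breaker registry. Reuses the CircuitBreaker
// class from the previous example; default thresholds are illustrative.
const defaultCircuitConfig: CircuitBreakerConfig = {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  successThreshold: 3,
  openTimeout: 30000,
  windowSize: 60000,
  minimumThroughput: 10,
};

class CircuitBreakerRegistry {
  private breakers = new Map<string, CircuitBreaker>();

  forService(name: string, overrides: Partial<CircuitBreakerConfig> = {}): CircuitBreaker {
    let breaker = this.breakers.get(name);
    if (!breaker) {
      breaker = new CircuitBreaker(name, { ...defaultCircuitConfig, ...overrides });
      this.breakers.set(name, breaker);
    }
    return breaker;
  }
}

// Usage: each downstream service trips independently
const circuits = new CircuitBreakerRegistry();
await circuits.forService('inventory-service').execute(() =>
  inventoryService.reserve(order.items)
);
await circuits.forService('payment-service', { openTimeout: 60000 }).execute(() =>
  paymentService.charge(order.customerId, order.total)
);
```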
Bulkheads isolate failures to prevent them from consuming all system resources. Named after ship compartments that contain flooding, bulkheads in software partition resources so a failure in one area doesn't sink the entire ship.
Thread Pool Isolation: Separate thread pools for each downstream dependency. If the payment service is slow, only its dedicated threads block; every other dependency keeps its capacity.
Connection Pool Isolation: Separate connection pools per downstream service. One service's connection leak doesn't starve others (see the agent sketch after this list).
Process/Container Isolation: Different functions run in separate processes/containers. A memory leak in one doesn't crash the others.
Queue Isolation: Separate queues for different priority traffic. Bulk operations don't block interactive requests.
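As a sketch of connection pool isolation in Node.js, each downstream service can get its own keep-alive `http.Agent` with its own socket cap, so a connection leak or slowdown in one service cannot starve the others. The socket limits and service hostnames below are illustrative assumptions:

```typescript
// Connection pool isolation sketch using Node's http.Agent
import http from 'http';

const agents = {
  payment: new http.Agent({ keepAlive: true, maxSockets: 20 }),
  inventory: new http.Agent({ keepAlive: true, maxSockets: 50 }),
  shipping: new http.Agent({ keepAlive: true, maxSockets: 30 }),
};

// Pass the per-service agent to the HTTP client for that dependency
function getJson<T>(host: string, path: string, agent: http.Agent): Promise<T> {
  return new Promise((resolve, reject) => {
    const req = http.get({ host, path, agent }, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body) as T));
    });
    req.on('error', reject);
  });
}

// Usage: inventory calls draw from the inventory pool only
const availability = await getJson('inventory-service', '/availability/42', agents.inventory);
```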
```typescript
// Bulkhead using semaphore pattern for concurrency limiting
class Bulkhead {
  private readonly permits: number;
  private available: number;
  private readonly queue: Array<{
    resolve: () => void;
    reject: (error: Error) => void;
    timeout: NodeJS.Timeout;
  }> = [];

  constructor(
    private readonly name: string,
    permits: number,
    private readonly queueLimit: number = 100,
    private readonly queueTimeout: number = 5000
  ) {
    this.permits = permits;
    this.available = permits;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await operation();
    } finally {
      this.release();
    }
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }

    if (this.queue.length >= this.queueLimit) {
      throw new BulkheadFullError(
        `Bulkhead '${this.name}' is full. ` +
        `Permits: ${this.permits}, Queue: ${this.queue.length}`
      );
    }

    return new Promise<void>((resolve, reject) => {
      const timeout = setTimeout(() => {
        const index = this.queue.findIndex(w => w.resolve === resolve);
        if (index !== -1) {
          this.queue.splice(index, 1);
          reject(new BulkheadTimeoutError(
            `Bulkhead '${this.name}' queue timeout after ${this.queueTimeout}ms`
          ));
        }
      }, this.queueTimeout);

      this.queue.push({ resolve, reject, timeout });
    });
  }

  private release(): void {
    if (this.queue.length > 0) {
      const waiter = this.queue.shift()!;
      clearTimeout(waiter.timeout);
      waiter.resolve();
    } else {
      this.available++;
    }
  }

  getMetrics(): BulkheadMetrics {
    return {
      name: this.name,
      permits: this.permits,
      available: this.available,
      queueSize: this.queue.length,
      queueLimit: this.queueLimit,
    };
  }
}

// Create bulkheads per downstream service
const bulkheads = {
  payment: new Bulkhead('payment', 20),     // Max 20 concurrent payment calls
  inventory: new Bulkhead('inventory', 50), // Max 50 concurrent inventory calls
  shipping: new Bulkhead('shipping', 30),   // Max 30 concurrent shipping calls
};

async function processOrder(order: Order): Promise<void> {
  // Each call respects its bulkhead limit
  // Payment service slowdown won't block inventory checks
  const [inventoryResult, _] = await Promise.all([
    bulkheads.inventory.execute(() =>
      inventoryService.reserve(order.items)
    ),
    // Payment might be slow, but won't consume inventory bulkhead
  ]);

  // Sequential payment (after inventory confirmed)
  const paymentResult = await bulkheads.payment.execute(() =>
    paymentService.charge(order.customerId, order.total)
  );

  // Shipping can proceed independently
  await bulkheads.shipping.execute(() =>
    shippingService.createShipment(order)
  );
}
```

When errors occur deep in a service chain (A → B → C → D), how should they propagate back to the original caller? Poor error propagation leads to debugging nightmares.
1. Preserve Original Error Information: Wrap errors rather than replacing them. The root cause should be discoverable.
2. Add Context at Each Layer: Each service should add its perspective, such as what operation failed and what inputs were provided.
3. Translate to Appropriate Level: Internal errors shouldn't leak implementation details. Map to domain-appropriate error types.
4. Include Correlation IDs: Every request should carry a correlation ID for distributed tracing (see the sketch after this list).
5. Distinguish Client vs Server Errors: Clearly indicate whether the caller did something wrong (4xx) or the system failed (5xx).
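To make principle 4 concrete, here is a minimal sketch of correlation ID propagation using Express-style middleware and Node's built-in `fetch`. The `x-correlation-id` header name is a common convention, and the middleware, route, and downstream URL are illustrative assumptions:

```typescript
// Illustrative correlation ID propagation (Express-style middleware).
import express from 'express';
import { randomUUID } from 'crypto';

const CORRELATION_HEADER = 'x-correlation-id';
const app = express();

// Inbound: reuse the caller's correlation ID or create one at the edge
app.use((req, res, next) => {
  const correlationId = req.header(CORRELATION_HEADER) ?? randomUUID();
  res.locals.correlationId = correlationId;         // Available to handlers and loggers
  res.setHeader(CORRELATION_HEADER, correlationId); // Echo back to the caller
  next();
});

// Outbound: forward the same ID on every downstream call
async function callDownstream(url: string, correlationId: string): Promise<unknown> {
  const response = await fetch(url, {
    headers: { [CORRELATION_HEADER]: correlationId },
  });
  if (!response.ok) {
    // Log with the correlation ID so the failure can be traced across services
    console.error(`Downstream call failed (${response.status})`, { correlationId });
    throw new Error(`Downstream call failed: ${response.status}`);
  }
  return response.json();
}

app.get('/orders/:id', async (req, res) => {
  const correlationId = res.locals.correlationId as string;
  const availability = await callDownstream(
    `http://inventory-service/availability/${req.params.id}`, // hypothetical URL
    correlationId
  );
  res.json(availability);
});
```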
```typescript
// Structured error response format
interface ServiceError {
  // Machine-readable code for programmatic handling
  code: string;
  // Human-readable message (can be shown to users for 4xx)
  message: string;
  // Detailed description (for developers/logs)
  details?: string;
  // Field-level errors for validation failures
  fieldErrors?: Array<{
    field: string;
    code: string;
    message: string;
  }>;
  // For debugging (not exposed to external clients)
  internal?: {
    correlationId: string;
    service: string;
    timestamp: string;
    cause?: ServiceError; // Chain of errors
    stackTrace?: string;  // Only in development
  };
}

// Error class with context chaining
class ServiceException extends Error {
  constructor(
    public readonly code: string,
    message: string,
    public readonly status: number,
    public readonly cause?: Error,
    public readonly fieldErrors?: Array<{ field: string; code: string; message: string }>
  ) {
    super(message);
    this.name = 'ServiceException';
  }

  toResponse(correlationId: string, includeInternal: boolean): ServiceError {
    const response: ServiceError = {
      code: this.code,
      message: this.message,
      fieldErrors: this.fieldErrors,
    };

    if (includeInternal) {
      response.internal = {
        correlationId,
        service: process.env.SERVICE_NAME || 'unknown',
        timestamp: new Date().toISOString(),
        stackTrace: this.stack,
      };
      if (this.cause instanceof ServiceException) {
        response.internal.cause = this.cause.toResponse(correlationId, true);
      }
    }

    return response;
  }
}

// Error translation between services
function translateDownstreamError(
  error: unknown,
  operation: string
): ServiceException {
  if (error instanceof ServiceException) {
    // Wrap downstream error with context
    return new ServiceException(
      'DOWNSTREAM_ERROR',
      `${operation} failed: ${error.message}`,
      error.status >= 500 ? 502 : error.status, // Map 5xx to 502 (Bad Gateway)
      error
    );
  }

  if (error instanceof HttpError) {
    // HTTP error from downstream service
    const isRetryable = error.status >= 500 || error.status === 429;
    return new ServiceException(
      isRetryable ? 'DOWNSTREAM_UNAVAILABLE' : 'DOWNSTREAM_REJECTED',
      `${operation} failed with status ${error.status}`,
      isRetryable ? 503 : 502,
      error
    );
  }

  if (error instanceof TimeoutError) {
    return new ServiceException(
      'DOWNSTREAM_TIMEOUT',
      `${operation} timed out`,
      504, // Gateway Timeout
      error
    );
  }

  // Unknown error type
  return new ServiceException(
    'INTERNAL_ERROR',
    `${operation} failed unexpectedly`,
    500,
    error instanceof Error ? error : new Error(String(error))
  );
}

// Usage in service layer
async function createOrder(request: CreateOrderRequest): Promise<Order> {
  // Validate input
  const validationErrors = validateOrderRequest(request);
  if (validationErrors.length > 0) {
    throw new ServiceException(
      'VALIDATION_ERROR',
      'Order validation failed',
      400,
      undefined,
      validationErrors
    );
  }

  try {
    // Check inventory (downstream call)
    const inventory = await inventoryService.check(request.items);
    if (!inventory.allAvailable) {
      throw new ServiceException(
        'INSUFFICIENT_INVENTORY',
        'Some items are not available',
        409, // Conflict
        undefined,
        inventory.unavailableItems.map(item => ({
          field: `items[${item.productId}]`,
          code: 'INSUFFICIENT_STOCK',
          message: `Only ${item.available} available`,
        }))
      );
    }
  } catch (error) {
    if (error instanceof ServiceException) throw error;
    throw translateDownstreamError(error, 'Inventory check');
  }

  try {
    // Process payment (downstream call)
    await paymentService.charge(request.customerId, calculateTotal(request));
  } catch (error) {
    if (error instanceof ServiceException) throw error;
    throw translateDownstreamError(error, 'Payment processing');
  }

  // Create order (local operation)
  return orderRepository.create(request);
}
```

Internal error details (stack traces, database errors, infrastructure info) must never reach external clients. Use error translation layers to map internal errors to safe external responses. Log the full details server-side with correlation IDs for debugging.
When downstream services fail, complete failure isn't always the best option. Graceful degradation provides reduced functionality rather than no functionality.
Fallback to Cache: Return cached data when live data is unavailable. Stale data is often better than no data.
Fallback to Default: Return sensible defaults when specific data is unavailable.
Fallback to Degraded Functionality: Skip optional features when their services are unavailable.
Fallback to Queue: Queue operations for later processing when services are unavailable (see the sketch after this list).
Static Fallback: Return pre-computed static responses.
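As a sketch of the queue fallback, a write operation can be captured and replayed when the downstream service recovers. The `notificationService` client, in-memory queue, and drain interval below are illustrative; production systems would use a durable queue such as a message broker or an outbox table:

```typescript
// Illustrative queue fallback: if the notification service is down, enqueue the
// work for later instead of failing the request.
interface PendingNotification { orderId: string; email: string; enqueuedAt: number; }

const pendingNotifications: PendingNotification[] = [];

async function sendOrQueueNotification(orderId: string, email: string): Promise<void> {
  try {
    await notificationService.sendOrderConfirmation(orderId, email); // hypothetical client
  } catch {
    // Degrade: accept the order anyway and retry the notification later
    pendingNotifications.push({ orderId, email, enqueuedAt: Date.now() });
    console.warn('Notification queued for later delivery', { orderId });
  }
}

// Background drain loop: replay queued work when the service recovers
async function drainNotificationQueue(): Promise<void> {
  while (pendingNotifications.length > 0) {
    const next = pendingNotifications[0];
    try {
      await notificationService.sendOrderConfirmation(next.orderId, next.email);
      pendingNotifications.shift(); // Only remove after successful delivery
    } catch {
      break; // Still failing; try again on the next drain cycle
    }
  }
}

setInterval(() => void drainNotificationQueue(), 30_000); // hypothetical drain interval
```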
```typescript
// Graceful degradation with fallback strategies
interface DegradationPolicy<T> {
  strategy: 'cache' | 'default' | 'skip' | 'queue';
  fallback?: T | (() => T | Promise<T>);
  shouldDegrade: (error: Error) => boolean;
}

async function withDegradation<T>(
  operation: () => Promise<T>,
  policy: DegradationPolicy<T>,
  context: { operationName: string; correlationId: string }
): Promise<{ result: T; degraded: boolean }> {
  try {
    const result = await operation();
    return { result, degraded: false };
  } catch (error) {
    if (!policy.shouldDegrade(error as Error)) {
      throw error; // Not a degradable error
    }

    console.warn(
      `Operation '${context.operationName}' degrading: ${(error as Error).message}`,
      { correlationId: context.correlationId }
    );

    const fallbackResult = await resolveFallback(policy);
    return { result: fallbackResult, degraded: true };
  }
}

async function resolveFallback<T>(policy: DegradationPolicy<T>): Promise<T> {
  if (policy.fallback === undefined) {
    throw new Error(`No fallback defined for strategy: ${policy.strategy}`);
  }

  return typeof policy.fallback === 'function'
    ? await (policy.fallback as () => T | Promise<T>)()
    : policy.fallback;
}

// Product page with multiple degradation strategies
async function getProductPage(productId: string): Promise<ProductPage> {
  const correlationId = generateCorrelationId();

  // Core product data - no degradation, must succeed
  const product = await productService.getProduct(productId);

  // Reviews - degrade to empty if service unavailable
  const { result: reviews, degraded: reviewsDegraded } = await withDegradation(
    () => reviewService.getProductReviews(productId),
    {
      strategy: 'default',
      fallback: { reviews: [], averageRating: null, totalCount: 0 },
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getProductReviews', correlationId }
  );

  // Recommendations - degrade to popular items
  const { result: recommendations, degraded: recsDegraded } = await withDegradation(
    () => recommendationService.getForProduct(productId),
    {
      strategy: 'cache',
      fallback: () => getCachedPopularProducts(),
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getRecommendations', correlationId }
  );

  // Inventory - degrade to showing "check availability"
  const { result: inventory, degraded: inventoryDegraded } = await withDegradation(
    () => inventoryService.getAvailability(productId),
    {
      strategy: 'default',
      fallback: { available: null, message: 'Check availability in store' },
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getInventory', correlationId }
  );

  return {
    product,
    reviews,
    recommendations,
    inventory,
    _degraded: {
      reviews: reviewsDegraded,
      recommendations: recsDegraded,
      inventory: inventoryDegraded,
    },
  };
}
```

When serving degraded responses, include metadata indicating what's degraded. This helps UIs display appropriate warnings ("Prices may not be current") and helps monitoring systems track degradation frequency. The response should make degradation explicit, not invisible.
Error handling in distributed systems is fundamentally different from monolithic applications. Partial failures, network uncertainty, and cascade effects require deliberate architectural patterns rather than simple exception handling.
Let's consolidate the key insights:
Partial failure is the normal operating mode of distributed systems; design for it rather than around it.
A read timeout leaves the outcome unknown, so every mutating operation must be safe to retry idempotently.
Retry with exponential backoff and jitter, and cap total retry load with budgets to avoid amplification.
Circuit breakers fail fast and give struggling services room to recover; use one per downstream dependency.
Bulkheads partition resources so one slow dependency can't exhaust the whole system.
Propagate errors with context and correlation IDs, translating internal details into safe external responses.
Prefer graceful degradation, with explicit metadata, over complete failure.
What's next:
As services evolve, their APIs must change. But changing APIs in a distributed system risks breaking all dependent services. The next page explores service versioning strategies that enable evolution without breaking changes—keeping the promise of independent deployability.
You now understand how to handle errors gracefully in distributed systems. You can implement retry strategies, circuit breakers, and bulkheads, propagate errors meaningfully across service boundaries, and design systems that degrade gracefully rather than failing completely. Next, we'll tackle service versioning—evolving APIs without breaking consumers.