Robust systems don't just detect and report errors—they recover from them. While users see a momentary pause, behind the scenes your system retries failed operations, activates fallback mechanisms, isolates failing components, and maintains service continuity despite component failures.
Error recovery is the difference between a system that requires human intervention for every hiccup and a system that handles transient failures autonomously, pages engineers only for genuine outages, and maintains high availability even when dependencies fail.
This page examines the full spectrum of recovery strategies: from simple retries to sophisticated circuit breakers, from graceful degradation to compensating transactions. You'll learn when each strategy applies, how to implement them correctly, and how to combine them into defense-in-depth resilience.
By the end of this page, you will understand:

- Retry patterns with exponential backoff and jitter
- Circuit breaker implementation and configuration
- Fallback strategies and graceful degradation
- Compensation and rollback for failed distributed operations
- Timeout management and deadline propagation
- A decision framework for choosing appropriate recovery strategies
Not all errors can or should be recovered from. Before implementing recovery logic, you must classify errors and match them to appropriate strategies.
Error Classification for Recovery
Errors fall into categories based on their recoverability:
| Category | Characteristics | Recovery Approach | Examples |
|---|---|---|---|
| Transient | Temporary condition that will likely resolve on retry | Retry with backoff | Network timeout, connection reset, 503 Service Unavailable |
| Retriable After Action | Failed but may succeed after taking a specific action | Action + Retry | Token expired (refresh + retry), rate limit (wait + retry) |
| Non-Retriable | Will fail the same way on retry; different approach needed | Fallback or fail | Invalid input, resource not found, permission denied |
| Systemic | Component or dependency is down; retries won't help soon | Circuit break, fallback | Database down, external service outage |
| Degraded | Full service unavailable, but partial service possible | Graceful degradation | Some features unavailable, cached data acceptable |
| Compensable | Operation partially succeeded; needs rollback | Compensation transaction | Payment succeeded but inventory reservation failed |
```typescript
/**
 * Error classification system that determines appropriate recovery strategy.
 */

// Backoff configuration shape (inferred from its usage below).
interface BackoffConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitter: number;
}

interface RecoveryDecision {
  strategy: 'retry' | 'retry-with-action' | 'fallback' | 'circuit-break' | 'fail' | 'compensate';
  maxRetries?: number;
  backoffConfig?: BackoffConfig;
  fallbackType?: 'cached' | 'default' | 'degraded' | 'none';
  action?: () => Promise<void>;
  compensationRequired?: boolean;
}

/**
 * Classify an error and determine recovery strategy.
 */
function classifyError(error: Error, context: OperationContext): RecoveryDecision {
  // ============================================
  // TRANSIENT ERRORS: Retry with backoff
  // ============================================
  if (isTransientError(error)) {
    return {
      strategy: 'retry',
      maxRetries: 3,
      backoffConfig: { initialDelayMs: 100, maxDelayMs: 5000, multiplier: 2, jitter: 0.2 }
    };
  }

  // ============================================
  // AUTHENTICATION ERRORS: Refresh and retry
  // ============================================
  if (isAuthenticationExpiredError(error)) {
    return {
      strategy: 'retry-with-action',
      maxRetries: 1,
      action: async () => {
        await refreshAuthenticationToken();
      }
    };
  }

  // ============================================
  // RATE LIMITING: Wait and retry
  // ============================================
  if (isRateLimitError(error)) {
    const retryAfter = extractRetryAfterHeader(error);
    return {
      strategy: 'retry',
      maxRetries: 2,
      backoffConfig: {
        initialDelayMs: (retryAfter || 60) * 1000,
        maxDelayMs: 120000,
        multiplier: 1,
        jitter: 0.1
      }
    };
  }

  // ============================================
  // EXTERNAL SERVICE DOWN: Circuit break and fallback
  // ============================================
  if (isExternalServiceError(error)) {
    return {
      strategy: 'circuit-break',
      fallbackType: context.allowDegradedMode ? 'cached' : 'none'
    };
  }

  // ============================================
  // VALIDATION ERRORS: No retry, fail fast
  // ============================================
  if (isValidationError(error) || isNotFoundError(error) || isPermissionError(error)) {
    return { strategy: 'fail', fallbackType: 'none' };
  }

  // ============================================
  // PARTIAL FAILURE: Compensate
  // ============================================
  if (isPartialFailureError(error)) {
    return { strategy: 'compensate', compensationRequired: true };
  }

  // ============================================
  // UNKNOWN: Conservative default
  // ============================================
  return { strategy: 'fail', fallbackType: 'none' };
}

// Classification helper functions
function isTransientError(error: Error): boolean {
  if ('isTransient' in error && (error as any).isTransient) return true;

  // Network-level errors
  if (error.message.includes('ECONNRESET')) return true;
  if (error.message.includes('ETIMEDOUT')) return true;
  if (error.message.includes('ECONNREFUSED')) return true;

  // HTTP errors that are typically transient
  if ('statusCode' in error) {
    const status = (error as any).statusCode;
    if (status === 502 || status === 503 || status === 504) return true;
    if (status === 429) return true; // Rate limit
  }

  return false;
}

function isAuthenticationExpiredError(error: Error): boolean {
  if ('statusCode' in error && (error as any).statusCode === 401) return true;
  if (error.message.includes('token expired')) return true;
  if (error.message.includes('jwt expired')) return true;
  return false;
}

function isRateLimitError(error: Error): boolean {
  if ('statusCode' in error && (error as any).statusCode === 429) return true;
  if (error.name === 'RateLimitException') return true;
  return false;
}

function isExternalServiceError(error: Error): boolean {
  return error instanceof ExternalServiceException;
}

function isValidationError(error: Error): boolean {
  return error instanceof ValidationException ||
    ('statusCode' in error && (error as any).statusCode === 400);
}
```

Before implementing retry logic, ensure the operation is idempotent—performing it multiple times has the same effect as performing it once. Non-idempotent operations (payments, order creation) need idempotency keys or careful design to prevent duplicate effects when retried.
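To make the idempotency requirement concrete, here is a minimal sketch of an idempotency-key wrapper. The `idempotencyStore` and its `get`/`set` methods are hypothetical (any key-value store with atomic set-if-absent would do); the key format is also an illustrative assumption.

```typescript
// Sketch: make a non-idempotent operation safe to retry by recording results
// under a caller-supplied idempotency key. `idempotencyStore` is a hypothetical
// key-value store; concurrent callers would also need an atomic set-if-absent.
async function withIdempotencyKey<T>(
  key: string,
  operation: () => Promise<T>
): Promise<T> {
  // If this key was already processed, return the recorded result instead of re-executing
  const existing = await idempotencyStore.get(key);
  if (existing !== undefined) {
    return existing as T;
  }

  // Execute once and persist the outcome before returning it
  const result = await operation();
  await idempotencyStore.set(key, result);
  return result;
}

// Usage: the same key is reused across retries, so at most one charge occurs
// await withIdempotencyKey(`charge:${orderId}`, () => paymentClient.charge(order));
```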
Retrying is the simplest recovery strategy—attempt the operation again in hope of success. But naive retry implementation can cause more harm than good: retry storms that overwhelm recovering services, or immediate retries that fail before transient conditions resolve.
Exponential Backoff with Jitter
The gold standard for retry timing is exponential backoff with jitter:
```typescript
/**
 * Production-ready retry implementation with exponential backoff and jitter.
 */
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitter: number;                       // 0-1: portion of delay to randomize
  retryOn?: (error: Error) => boolean;  // Which errors to retry
}

interface RetryResult<T> {
  success: boolean;
  result?: T;
  error?: Error;
  attempts: number;
  totalDurationMs: number;
}

/**
 * Execute an operation with configurable retry behavior.
 */
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context?: { correlationId: string; operationName: string }
): Promise<RetryResult<T>> {
  const startTime = Date.now();
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const result = await operation();

      // Log successful retry if this wasn't the first attempt
      if (attempt > 0) {
        logger.info('Operation succeeded after retry', {
          ...context,
          attempt: attempt + 1,
          totalAttempts: attempt + 1,
          durationMs: Date.now() - startTime
        });
      }

      return { success: true, result, attempts: attempt + 1, totalDurationMs: Date.now() - startTime };
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));

      // Check if this error is retryable
      const shouldRetry = config.retryOn ? config.retryOn(lastError) : isTransientError(lastError);

      if (!shouldRetry || attempt >= config.maxRetries) {
        // No more retries: log and return failure
        logger.warn('Operation failed after retries', {
          ...context,
          attempt: attempt + 1,
          totalAttempts: attempt + 1,
          error: lastError.message,
          willRetry: false
        });

        return { success: false, error: lastError, attempts: attempt + 1, totalDurationMs: Date.now() - startTime };
      }

      // Calculate delay with exponential backoff and jitter
      const delay = calculateDelay(attempt, config);

      logger.debug('Retrying operation', {
        ...context,
        attempt: attempt + 1,
        nextAttempt: attempt + 2,
        delayMs: delay,
        error: lastError.message
      });

      await sleep(delay);
    }
  }

  // Should not reach here, but TypeScript needs this
  return { success: false, error: lastError, attempts: config.maxRetries + 1, totalDurationMs: Date.now() - startTime };
}

/**
 * Calculate delay with exponential backoff and jitter.
 */
function calculateDelay(attempt: number, config: RetryConfig): number {
  // Exponential backoff
  const exponentialDelay = config.initialDelayMs * Math.pow(config.multiplier, attempt);

  // Cap at maximum
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

  // Add jitter (randomize ±jitter% of the delay)
  const jitterRange = cappedDelay * config.jitter;
  const jitter = (Math.random() * 2 - 1) * jitterRange;

  return Math.max(0, Math.round(cappedDelay + jitter));
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage example
const result = await withRetry(
  () => externalPaymentService.processPayment(paymentRequest),
  {
    maxRetries: 3,
    initialDelayMs: 200,
    maxDelayMs: 5000,
    multiplier: 2,
    jitter: 0.25,
    retryOn: (error) => {
      // Only retry on transient errors, not validation failures
      return isTransientError(error) && !isValidationError(error);
    }
  },
  { correlationId, operationName: 'ProcessPayment' }
);

if (!result.success) {
  // All retries exhausted - handle the failure
  throw new PaymentProcessingException('Payment failed after multiple attempts', result.error);
}
```

The Retry Budget Pattern
For high-volume systems, individual retry configs aren't enough. A retry budget limits the total percentage of requests that can be retries, preventing retry amplification during outages:
```typescript
/**
 * Retry budget prevents retry storms by limiting the fraction of
 * requests that can be retries.
 */
class RetryBudget {
  private window: { timestamp: number; isRetry: boolean }[] = [];
  private readonly windowSizeMs = 10000; // 10-second window
  private readonly maxRetryRatio: number;

  constructor(maxRetryRatio: number = 0.2) {
    // Maximum 20% of traffic can be retries
    this.maxRetryRatio = maxRetryRatio;
  }

  /**
   * Record a request attempt.
   */
  recordAttempt(isRetry: boolean): void {
    const now = Date.now();
    this.window.push({ timestamp: now, isRetry });
    this.pruneOldEntries(now);
  }

  /**
   * Check if we have budget for another retry.
   */
  canRetry(): boolean {
    this.pruneOldEntries(Date.now());
    if (this.window.length === 0) return true;

    const retries = this.window.filter(e => e.isRetry).length;
    const total = this.window.length;
    const currentRatio = retries / total;

    return currentRatio < this.maxRetryRatio;
  }

  private pruneOldEntries(now: number): void {
    const cutoff = now - this.windowSizeMs;
    this.window = this.window.filter(e => e.timestamp >= cutoff);
  }
}

// Integration with retry logic
const retryBudget = new RetryBudget(0.2);

async function withBudgetedRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig
): Promise<RetryResult<T>> {
  let attempt = 0;
  let lastError: Error | undefined;

  while (attempt <= config.maxRetries) {
    const isRetry = attempt > 0;

    // Check budget before retrying
    if (isRetry && !retryBudget.canRetry()) {
      logger.warn('Retry budget exhausted, failing without retry');
      return {
        success: false,
        error: lastError || new Error('Retry budget exhausted'),
        attempts: attempt,
        totalDurationMs: 0
      };
    }

    retryBudget.recordAttempt(isRetry);

    try {
      const result = await operation();
      return { success: true, result, attempts: attempt + 1, totalDurationMs: 0 };
    } catch (error) {
      lastError = error as Error;
      attempt++;

      if (attempt <= config.maxRetries) {
        await sleep(calculateDelay(attempt - 1, config));
      }
    }
  }

  return { success: false, error: lastError, attempts: attempt, totalDurationMs: 0 };
}
```

When a dependency is down, continuing to send requests wastes resources and delays failure detection. The circuit breaker pattern prevents this by tracking failure rates and "opening the circuit" when a threshold is exceeded, failing fast without attempting the operation.
Circuit Breaker States
```typescript
/**
 * Production-ready circuit breaker implementation.
 */
interface CircuitBreakerConfig {
  failureThreshold: number;  // Failures to open circuit
  successThreshold: number;  // Successes in half-open to close
  halfOpenRequests: number;  // Max requests in half-open state
  resetTimeoutMs: number;    // How long circuit stays open
  rollingWindowMs: number;   // Time window for counting failures
}

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = [];
  private halfOpenSuccesses = 0;
  private halfOpenFailures = 0;
  private halfOpenRequests = 0;
  private lastOpenedAt?: number;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {}

  /**
   * Execute operation through the circuit breaker.
   */
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if circuit allows this request
    this.evaluateState();

    if (this.state === 'OPEN') {
      throw new CircuitOpenException(this.name, this.getTimeUntilReset());
    }

    if (this.state === 'HALF_OPEN') {
      if (this.halfOpenRequests >= this.config.halfOpenRequests) {
        throw new CircuitOpenException(this.name, this.getTimeUntilReset());
      }
      this.halfOpenRequests++;
    }

    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  /**
   * Record a successful operation.
   */
  private recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.config.successThreshold) {
        this.transitionTo('CLOSED');
      }
    }
    // In CLOSED state, success doesn't affect failure count
  }

  /**
   * Record a failed operation.
   */
  private recordFailure(): void {
    const now = Date.now();

    if (this.state === 'CLOSED') {
      this.failures.push(now);
      this.pruneOldFailures(now);

      if (this.failures.length >= this.config.failureThreshold) {
        this.transitionTo('OPEN');
      }
    } else if (this.state === 'HALF_OPEN') {
      this.halfOpenFailures++;
      // Any failure in half-open immediately re-opens
      this.transitionTo('OPEN');
    }
  }

  /**
   * Evaluate and potentially transition state.
   */
  private evaluateState(): void {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - (this.lastOpenedAt || 0);
      if (elapsed >= this.config.resetTimeoutMs) {
        this.transitionTo('HALF_OPEN');
      }
    }
  }

  private transitionTo(newState: CircuitState): void {
    const oldState = this.state;
    this.state = newState;

    logger.info('Circuit breaker state transition', {
      circuitName: this.name,
      oldState,
      newState,
      failures: this.failures.length
    });

    if (newState === 'OPEN') {
      this.lastOpenedAt = Date.now();
      metrics.increment('circuit_breaker.opened', { circuit: this.name });
    } else if (newState === 'CLOSED') {
      this.failures = [];
      this.halfOpenSuccesses = 0;
      this.halfOpenFailures = 0;
      this.halfOpenRequests = 0;
      metrics.increment('circuit_breaker.closed', { circuit: this.name });
    } else if (newState === 'HALF_OPEN') {
      this.halfOpenSuccesses = 0;
      this.halfOpenFailures = 0;
      this.halfOpenRequests = 0;
      metrics.increment('circuit_breaker.half_open', { circuit: this.name });
    }
  }

  private pruneOldFailures(now: number): void {
    const cutoff = now - this.config.rollingWindowMs;
    this.failures = this.failures.filter(ts => ts >= cutoff);
  }

  private getTimeUntilReset(): number {
    if (!this.lastOpenedAt) return 0;
    return Math.max(0, this.config.resetTimeoutMs - (Date.now() - this.lastOpenedAt));
  }

  /**
   * Get current circuit state for monitoring.
   */
  getState(): { state: CircuitState; failures: number; halfOpenSuccesses: number } {
    return {
      state: this.state,
      failures: this.failures.length,
      halfOpenSuccesses: this.halfOpenSuccesses
    };
  }
}

class CircuitOpenException extends Error {
  constructor(
    public readonly circuitName: string,
    public readonly resetInMs: number
  ) {
    super(`Circuit ${circuitName} is open. Retry after ${resetInMs}ms`);
    this.name = 'CircuitOpenException';
  }
}

// Usage: One circuit breaker per external dependency
const paymentServiceCircuit = new CircuitBreaker('payment-service', {
  failureThreshold: 5,    // Open after 5 failures
  successThreshold: 2,    // Close after 2 successes in half-open
  halfOpenRequests: 3,    // Allow 3 test requests in half-open
  resetTimeoutMs: 30000,  // Stay open for 30 seconds
  rollingWindowMs: 60000  // Count failures in 60-second window
});

async function processPayment(request: PaymentRequest): Promise<PaymentResult> {
  try {
    return await paymentServiceCircuit.execute(() =>
      paymentClient.process(request)
    );
  } catch (error) {
    if (error instanceof CircuitOpenException) {
      // Fast failure - use fallback
      logger.warn('Payment service circuit open, using fallback');
      return handlePaymentFallback(request, error);
    }
    throw error;
  }
}
```

Create separate circuit breakers for each external dependency, not one global breaker. This ensures that a problem with the payment service doesn't affect calls to the inventory service. For multiple instances of the same service, consider whether to use per-instance or per-service circuits based on your load balancing strategy.
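One way to keep circuits isolated per dependency is a small registry keyed by dependency name. The sketch below assumes the `CircuitBreaker` and `CircuitBreakerConfig` types defined above; the registry class itself and its default settings are illustrative, not part of the original example.

```typescript
// Sketch: per-dependency circuit breakers, created lazily on first use.
// Assumes the CircuitBreaker class from the example above.
class CircuitBreakerRegistry {
  private breakers = new Map<string, CircuitBreaker>();

  constructor(private readonly defaults: CircuitBreakerConfig) {}

  // Return the breaker for a dependency, creating it with defaults (plus overrides) if needed
  getOrCreate(dependencyName: string, overrides?: Partial<CircuitBreakerConfig>): CircuitBreaker {
    let breaker = this.breakers.get(dependencyName);
    if (!breaker) {
      breaker = new CircuitBreaker(dependencyName, { ...this.defaults, ...overrides });
      this.breakers.set(dependencyName, breaker);
    }
    return breaker;
  }
}

// Usage: payment and inventory calls trip independently
const circuits = new CircuitBreakerRegistry({
  failureThreshold: 5,
  successThreshold: 2,
  halfOpenRequests: 3,
  resetTimeoutMs: 30000,
  rollingWindowMs: 60000
});

// await circuits.getOrCreate('inventory-service').execute(() => inventoryClient.check(items));
```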
When operations can't complete normally, fallbacks provide alternative behavior that maintains service continuity, possibly with reduced functionality.
Types of Fallback Strategies
| Strategy | Behavior | Use When | Trade-offs |
|---|---|---|---|
| Cached Value | Return previously cached successful response | Data staleness is acceptable for short periods | May return outdated information |
| Default Value | Return a predetermined default | Safe default exists; partial is better than nothing | May not reflect actual state |
| Degraded Feature | Disable non-essential feature, continue with core | Feature is enhancement, not critical | Reduced functionality |
| Alternative Service | Use backup provider or read replica | Redundant systems exist | May have different performance or cost |
| Queue for Later | Accept request, process asynchronously when restored | Operation can be eventually consistent | Delay in completion; needs queue infrastructure |
| Static Response | Return static/pre-computed response | Dynamic data not critical | Same response for all, may be inappropriate |
```typescript
/**
 * Comprehensive fallback implementation patterns.
 */

/**
 * Operation wrapper that supports multiple fallback strategies.
 */
class ResilientOperation<T> {
  private fallbacks: Array<{
    name: string;
    condition?: (error: Error) => boolean;
    handler: (error: Error) => Promise<T> | T;
  }> = [];

  constructor(
    private readonly operation: () => Promise<T>,
    private readonly operationName: string
  ) {}

  /**
   * Add a fallback strategy.
   */
  withFallback(
    name: string,
    handler: (error: Error) => Promise<T> | T,
    condition?: (error: Error) => boolean
  ): this {
    this.fallbacks.push({ name, handler, condition });
    return this;
  }

  /**
   * Execute with fallback chain.
   */
  async execute(): Promise<T> {
    try {
      return await this.operation();
    } catch (primaryError) {
      logger.warn(`Primary operation failed: ${this.operationName}`, {
        error: primaryError instanceof Error ? primaryError.message : String(primaryError)
      });

      // Try fallbacks in order
      for (const fallback of this.fallbacks) {
        // Check condition if specified
        if (fallback.condition && !fallback.condition(primaryError as Error)) {
          continue;
        }

        try {
          logger.info(`Attempting fallback: ${fallback.name}`);
          const result = await fallback.handler(primaryError as Error);

          metrics.increment('operation.fallback.success', {
            operation: this.operationName,
            fallback: fallback.name
          });

          return result;
        } catch (fallbackError) {
          logger.warn(`Fallback failed: ${fallback.name}`, {
            error: fallbackError instanceof Error ? fallbackError.message : String(fallbackError)
          });
          // Continue to next fallback
        }
      }

      // All fallbacks exhausted
      throw primaryError;
    }
  }
}

// ============================================
// EXAMPLE: Product recommendations with fallbacks
// ============================================

interface ProductRecommendations {
  products: Product[];
  source: 'ml-service' | 'cached' | 'popular' | 'empty';
}

async function getRecommendations(userId: string): Promise<ProductRecommendations> {
  return new ResilientOperation(
    // Primary: Real-time ML recommendations
    async () => ({
      products: await mlRecommendationService.getPersonalized(userId),
      source: 'ml-service' as const
    }),
    'getRecommendations'
  )
    // Fallback 1: Cached recommendations for this user
    .withFallback('cached-personal', async () => {
      const cached = await cache.get(`recommendations:${userId}`);
      if (!cached) throw new Error('No cached recommendations');
      return { products: cached, source: 'cached' as const };
    })
    // Fallback 2: Popular products (same for all users)
    .withFallback('popular-products', async () => ({
      products: await productService.getPopular(10),
      source: 'popular' as const
    }))
    // Fallback 3: Empty list (graceful degradation)
    .withFallback('empty', () => ({
      products: [],
      source: 'empty' as const
    }))
    .execute();
}

// ============================================
// EXAMPLE: Configuration with fallback chain
// ============================================

async function getConfiguration(key: string): Promise<ConfigValue> {
  const configService = new ResilientOperation(
    // Primary: Remote config service
    () => remoteConfigClient.get(key),
    'getConfiguration'
  )
    // Fallback 1: Local cache
    .withFallback('local-cache', async () => {
      const cached = localConfigCache.get(key);
      if (!cached) throw new Error('Not in cache');
      return cached;
    })
    // Fallback 2: Environment variable
    .withFallback('env-var', () => {
      const envValue = process.env[`CONFIG_${key.toUpperCase()}`];
      if (!envValue) throw new Error('Not in env');
      return { value: envValue, source: 'environment' };
    })
    // Fallback 3: Compiled default
    .withFallback('default', () => {
      const defaultValue = DEFAULT_CONFIG[key];
      if (defaultValue === undefined) throw new Error('No default');
      return { value: defaultValue, source: 'default' };
    });

  return configService.execute();
}

// ============================================
// EXAMPLE: Queue-based fallback for writes
// ============================================

async function saveUserPreference(userId: string, pref: Preference): Promise<void> {
  try {
    await preferenceService.save(userId, pref);
  } catch (error) {
    if (isTransientError(error as Error)) {
      // Queue for later processing
      await deferredOperationQueue.enqueue({
        type: 'save-preference',
        payload: { userId, preference: pref },
        maxRetries: 5,
        expiresAt: Date.now() + 24 * 60 * 60 * 1000 // 24 hours
      });

      logger.info('Preference save queued for later', { userId });
      // Return success - from user's perspective, operation accepted
      return;
    }
    throw error;
  }
}
```

When returning fallback data, signal this to the caller so they can display appropriately. A recommendation marked 'source: popular' can show '...' instead of 'Recommended for you'. This maintains honesty with users while providing continuity.
Every external call should have a timeout. Without timeouts, a single hung dependency can exhaust your connection pool, block your thread pool, and cascade into complete system failure.
The Deadline Propagation Pattern
In distributed systems, a top-level request timeout must be propagated to all downstream calls. If a user request has 3 seconds to complete, and you've already spent 2 seconds, downstream calls must know they have only 1 second remaining.
```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

/**
 * Deadline propagation for timeout management across service calls.
 */

/**
 * Deadline context that flows through the request.
 */
interface DeadlineContext {
  deadlineEpochMs: number;
  operationStart: number;
  parentOperationName?: string;
}

const deadlineStorage = new AsyncLocalStorage<DeadlineContext>();

/**
 * Get remaining time until deadline.
 */
function getTimeRemaining(): number {
  const ctx = deadlineStorage.getStore();
  if (!ctx) return Infinity; // No deadline set
  return Math.max(0, ctx.deadlineEpochMs - Date.now());
}

/**
 * Check if deadline has passed.
 */
function isDeadlineExceeded(): boolean {
  return getTimeRemaining() <= 0;
}

/**
 * Middleware that extracts or creates deadline context.
 */
function deadlineMiddleware(req: Request, res: Response, next: NextFunction) {
  // Try to get deadline from upstream caller
  const upstreamDeadline = req.headers['x-deadline-ms'];

  // Or use default timeout for this service
  const deadline = upstreamDeadline
    ? parseInt(upstreamDeadline as string, 10)
    : Date.now() + 30000; // 30 second default

  const ctx: DeadlineContext = {
    deadlineEpochMs: deadline,
    operationStart: Date.now(),
    parentOperationName: req.path
  };

  deadlineStorage.run(ctx, () => next());
}

/**
 * Wrap an async operation with deadline enforcement.
 */
async function withDeadline<T>(
  operation: () => Promise<T>,
  operationName: string,
  reserveMs: number = 100 // Reserve time for response processing
): Promise<T> {
  const remaining = getTimeRemaining() - reserveMs;

  if (remaining <= 0) {
    throw new DeadlineExceededException(
      operationName,
      'Deadline already exceeded before operation started'
    );
  }

  return Promise.race([
    operation(),
    createTimeoutPromise(remaining, operationName)
  ]);
}

function createTimeoutPromise<T>(timeoutMs: number, operationName: string): Promise<T> {
  return new Promise((_, reject) => {
    setTimeout(() => {
      reject(new DeadlineExceededException(operationName, `Operation timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
}

/**
 * HTTP client that propagates deadlines to downstream services.
 */
class DeadlineAwareHttpClient {
  async request<T>(url: string, options: RequestInit = {}): Promise<T> {
    const remaining = getTimeRemaining();

    if (remaining <= 0) {
      throw new DeadlineExceededException('http-request', 'Deadline exceeded before making request');
    }

    const headers = new Headers(options.headers);
    headers.set('X-Deadline-Ms', String(deadlineStorage.getStore()?.deadlineEpochMs));

    // Set fetch timeout to remaining time (minus small buffer)
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), remaining - 50);

    try {
      const response = await fetch(url, { ...options, headers, signal: controller.signal });

      if (!response.ok) {
        throw new HttpError(response.status, await response.text());
      }

      return response.json();
    } finally {
      clearTimeout(timeout);
    }
  }
}

class DeadlineExceededException extends Error {
  constructor(
    public readonly operation: string,
    message: string
  ) {
    super(message);
    this.name = 'DeadlineExceededException';
  }
}

// ============================================
// Usage: Multi-step operation with deadline budget
// ============================================

async function processOrder(order: Order): Promise<ProcessedOrder> {
  // Total operation timeout: 5 seconds
  // But we're within a request that may have its own deadline
  const effectiveTimeout = Math.min(5000, getTimeRemaining());

  return await withDeadline(async () => {
    // Step 1: Validate inventory (budget: 1 second)
    const inventory = await withDeadline(
      () => inventoryService.check(order.items),
      'inventory-check',
      100
    );

    // Step 2: Process payment (~2 seconds budget)
    const payment = await withDeadline(
      () => paymentService.charge(order.payment),
      'payment-processing',
      100
    );

    // Step 3: Create shipment (remaining time)
    const shipment = await withDeadline(
      () => shippingService.create(order.address),
      'shipment-creation',
      100
    );

    return { order, inventory, payment, shipment };
  }, 'process-order');
}
```

In distributed systems, operations that span multiple services can't use traditional database transactions. When a multi-step operation fails partway through, previously successful steps may need to be undone using compensating transactions.
The Saga Pattern
A saga is a sequence of local transactions where each transaction publishes events triggering the next. If any step fails, compensating transactions undo the preceding successful steps in reverse order.
```typescript
/**
 * Saga pattern implementation for distributed transaction compensation.
 */
interface SagaStep<TContext> {
  name: string;
  execute: (context: TContext) => Promise<void>;
  compensate: (context: TContext) => Promise<void>;
}

class SagaOrchestrator<TContext extends { sagaId: string }> {
  private steps: SagaStep<TContext>[] = [];
  private executedSteps: string[] = [];

  addStep(step: SagaStep<TContext>): this {
    this.steps.push(step);
    return this;
  }

  /**
   * Execute the saga with automatic compensation on failure.
   */
  async execute(context: TContext): Promise<void> {
    this.executedSteps = [];

    for (const step of this.steps) {
      try {
        logger.info(`Saga step starting: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name
        });

        await step.execute(context);
        this.executedSteps.push(step.name);

        logger.info(`Saga step completed: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name
        });
      } catch (error) {
        logger.error(`Saga step failed: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name,
          error: error instanceof Error ? error.message : String(error)
        });

        // Compensate all executed steps in reverse order
        await this.compensate(context);

        throw new SagaFailedException(
          context.sagaId,
          step.name,
          error instanceof Error ? error : new Error(String(error))
        );
      }
    }
  }

  /**
   * Compensate executed steps in reverse order.
   */
  private async compensate(context: TContext): Promise<void> {
    const stepsToCompensate = [...this.executedSteps].reverse();

    for (const stepName of stepsToCompensate) {
      const step = this.steps.find(s => s.name === stepName);
      if (!step) continue;

      try {
        logger.info(`Compensating step: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName
        });

        await step.compensate(context);

        logger.info(`Compensation completed: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName
        });
      } catch (compensationError) {
        // Compensation failure is critical - needs manual intervention
        logger.fatal(`Compensation failed: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName,
          error: compensationError
        });

        // Record for manual remediation
        await deadLetterQueue.enqueue({
          type: 'compensation-failure',
          sagaId: context.sagaId,
          step: stepName,
          context,
          error: compensationError
        });
        // Continue compensating other steps
      }
    }
  }
}

class SagaFailedException extends Error {
  constructor(
    public readonly sagaId: string,
    public readonly failedStep: string,
    public readonly cause: Error
  ) {
    super(`Saga ${sagaId} failed at step '${failedStep}'`);
    this.name = 'SagaFailedException';
  }
}

// ============================================
// Example: Order processing saga
// ============================================

interface OrderSagaContext {
  sagaId: string;
  orderId: string;
  userId: string;
  items: OrderItem[];
  paymentDetails: PaymentDetails;
  shippingAddress: Address;
  // Populated during execution for compensation
  reservationId?: string;
  paymentId?: string;
  shipmentId?: string;
}

const orderSaga = new SagaOrchestrator<OrderSagaContext>()
  // Step 1: Reserve inventory
  .addStep({
    name: 'reserve-inventory',
    execute: async (ctx) => {
      const reservation = await inventoryService.reserve(ctx.items);
      ctx.reservationId = reservation.id;
    },
    compensate: async (ctx) => {
      if (ctx.reservationId) {
        await inventoryService.releaseReservation(ctx.reservationId);
      }
    }
  })
  // Step 2: Process payment
  .addStep({
    name: 'process-payment',
    execute: async (ctx) => {
      const payment = await paymentService.charge(ctx.paymentDetails);
      ctx.paymentId = payment.id;
    },
    compensate: async (ctx) => {
      if (ctx.paymentId) {
        await paymentService.refund(ctx.paymentId);
      }
    }
  })
  // Step 3: Create shipment
  .addStep({
    name: 'create-shipment',
    execute: async (ctx) => {
      const shipment = await shippingService.create({
        orderId: ctx.orderId,
        items: ctx.items,
        address: ctx.shippingAddress
      });
      ctx.shipmentId = shipment.id;
    },
    compensate: async (ctx) => {
      if (ctx.shipmentId) {
        await shippingService.cancel(ctx.shipmentId);
      }
    }
  })
  // Step 4: Confirm order (no compensation - this is the commit point)
  .addStep({
    name: 'confirm-order',
    execute: async (ctx) => {
      await orderService.confirm(ctx.orderId);
    },
    compensate: async () => {
      // No compensation needed - if we fail before confirm,
      // previous steps will be compensated.
      // If confirm succeeds, the saga is complete.
    }
  });

// Usage
async function createOrder(request: CreateOrderRequest): Promise<Order> {
  const order = await orderService.create(request);

  const sagaContext: OrderSagaContext = {
    sagaId: generateSagaId(),
    orderId: order.id,
    userId: request.userId,
    items: request.items,
    paymentDetails: request.payment,
    shippingAddress: request.address
  };

  try {
    await orderSaga.execute(sagaContext);
    return order;
  } catch (error) {
    // Saga failed and compensated - order is cancelled
    await orderService.markFailed(order.id, (error as Error).message);
    throw error;
  }
}
```

Compensation might fail due to the same issues (network, timeout) that caused the original failure. Design compensating operations to be idempotent and retriable. When compensation repeatedly fails, route to a dead-letter queue for manual intervention—never silently ignore compensation failures.
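As one way to make a compensating action idempotent and retriable, it can be wrapped with the `withRetry` helper from the retry section and guarded against double execution. The sketch below is a minimal illustration assuming the saga types above; `paymentService.getRefundStatus` is a hypothetical lookup used only to show the idempotency guard.

```typescript
// Sketch: an idempotent, retried compensation for the payment step.
// Assumes withRetry and OrderSagaContext from earlier in this page;
// paymentService.getRefundStatus is a hypothetical idempotency check.
const compensatePayment = async (ctx: OrderSagaContext): Promise<void> => {
  if (!ctx.paymentId) return; // Nothing was charged, nothing to undo

  const outcome = await withRetry(
    async () => {
      // Idempotency guard: skip the refund if it was already issued on a previous attempt
      const status = await paymentService.getRefundStatus(ctx.paymentId!);
      if (status === 'refunded') return;
      await paymentService.refund(ctx.paymentId!);
    },
    { maxRetries: 3, initialDelayMs: 200, maxDelayMs: 5000, multiplier: 2, jitter: 0.2 }
  );

  if (!outcome.success) {
    // Surface the failure so the orchestrator routes it to the dead-letter queue
    throw outcome.error ?? new Error('Payment compensation failed');
  }
};
```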
Graceful degradation means intentionally reducing functionality to maintain core service during stress or component failures. Rather than complete failure, the system continues operating with reduced capabilities.
Degradation Levels
Define tiers of functionality that can be progressively disabled:
```typescript
/**
 * Graceful degradation system with configurable feature tiers.
 */

enum DegradationLevel {
  NORMAL = 0,     // All features available
  REDUCED = 1,    // Non-essential features disabled
  ESSENTIAL = 2,  // Only critical features available
  MINIMAL = 3     // Absolute minimum viable service
}

interface FeatureDefinition {
  name: string;
  description: string;
  minLevel: DegradationLevel;  // Disable at this level and above
  dependencies: string[];      // Other services this feature needs
  fallback?: () => any;        // Fallback behavior when disabled
}

class DegradationManager {
  private currentLevel: DegradationLevel = DegradationLevel.NORMAL;
  private features: Map<string, FeatureDefinition> = new Map();
  private circuitBreakers: Map<string, CircuitBreaker> = new Map();
  // Assumed helper: a sliding-window error rate tracker (implementation not shown)
  private errorRateTracker = new ErrorRateTracker();

  registerFeature(feature: FeatureDefinition): void {
    this.features.set(feature.name, feature);
  }

  /**
   * Check if a feature is currently available.
   */
  isFeatureAvailable(featureName: string): boolean {
    const feature = this.features.get(featureName);
    if (!feature) return false;

    // Check explicit degradation level
    if (this.currentLevel >= feature.minLevel) {
      return false;
    }

    // Check dependencies' circuits
    for (const dep of feature.dependencies) {
      const circuit = this.circuitBreakers.get(dep);
      if (circuit?.getState().state === 'OPEN') {
        return false;
      }
    }

    return true;
  }

  /**
   * Execute operation with degradation-aware behavior.
   */
  async executeWithDegradation<T>(
    featureName: string,
    operation: () => Promise<T>
  ): Promise<T | null> {
    const feature = this.features.get(featureName);

    if (!this.isFeatureAvailable(featureName)) {
      if (feature?.fallback) {
        return feature.fallback();
      }
      return null;
    }

    try {
      return await operation();
    } catch (error) {
      // On error, consider activating degradation
      this.handleFeatureError(featureName, error as Error);

      if (feature?.fallback) {
        return feature.fallback();
      }
      throw error;
    }
  }

  /**
   * Manually set degradation level (e.g., during incidents).
   */
  setDegradationLevel(level: DegradationLevel): void {
    const previous = this.currentLevel;
    this.currentLevel = level;

    logger.warn('Degradation level changed', {
      previous: DegradationLevel[previous],
      current: DegradationLevel[level]
    });

    // Notify monitoring
    metrics.gauge('degradation.level', level);
  }

  /**
   * Current level, for inclusion in API responses and dashboards.
   */
  getCurrentLevel(): DegradationLevel {
    return this.currentLevel;
  }

  /**
   * Auto-adjust degradation based on error rates.
   */
  private handleFeatureError(featureName: string, error: Error): void {
    // Track error rate and potentially auto-degrade
    const errorRate = this.errorRateTracker.record(featureName, error);

    if (errorRate > 0.5 && this.currentLevel < DegradationLevel.REDUCED) {
      logger.warn('Auto-degrading due to high error rate', {
        feature: featureName,
        errorRate
      });
      this.setDegradationLevel(DegradationLevel.REDUCED);
    }
  }
}

// ============================================
// Example: E-commerce with degradation tiers
// ============================================

const degradationManager = new DegradationManager();

// Register features with their degradation thresholds
degradationManager.registerFeature({
  name: 'personalized-recommendations',
  description: 'ML-powered product recommendations',
  minLevel: DegradationLevel.REDUCED,  // Disable early
  dependencies: ['recommendation-service'],
  fallback: () => ({ products: [], source: 'disabled' })
});

degradationManager.registerFeature({
  name: 'live-inventory',
  description: 'Real-time inventory checks',
  minLevel: DegradationLevel.ESSENTIAL,
  dependencies: ['inventory-service'],
  fallback: () => ({ available: true, cached: true })  // Optimistic
});

degradationManager.registerFeature({
  name: 'payment-processing',
  description: 'Process payments',
  minLevel: DegradationLevel.MINIMAL,  // Only disable in extreme cases
  dependencies: ['payment-service'],
  fallback: undefined  // No fallback - must fail if unavailable
});

// Usage in API handler
async function getProductPage(productId: string): Promise<ProductPageResponse> {
  const product = await productService.get(productId);

  // Recommendations - degradable
  const recommendations = await degradationManager.executeWithDegradation(
    'personalized-recommendations',
    () => recommendationService.getFor(productId)
  );

  // Inventory - degradable with cached fallback
  const inventory = await degradationManager.executeWithDegradation(
    'live-inventory',
    () => inventoryService.check(productId)
  );

  return {
    product,
    recommendations: recommendations || { products: [], source: 'unavailable' },
    inventory: inventory || { available: true, cached: true },
    degradationLevel: degradationManager.getCurrentLevel()
  };
}
```

When operating in degraded mode, consider informing users. A subtle banner saying 'Some features are temporarily limited' manages expectations better than features silently disappearing. Include degradation status in API responses so clients can adapt their UI accordingly.
Error recovery transforms systems from fragile to resilient. By combining retry strategies, circuit breakers, fallbacks, and compensation, you build systems that handle failures gracefully and maintain availability despite component problems.
Module Complete
You've now mastered error handling at boundaries—from exception translation through layers, to user-facing messages, comprehensive logging, and active recovery strategies. These patterns combine to create systems that handle errors not as exceptional disasters but as expected events with planned responses.
Apply these patterns to build systems that operators trust, users appreciate, and developers can debug efficiently. Error handling isn't just about catching exceptions—it's about designing for graceful behavior when the world doesn't cooperate.
You now understand the complete landscape of error recovery: retry patterns with budgets, circuit breakers for failing dependencies, fallback chains for continuity, timeout management with deadline propagation, compensating transactions for distributed consistency, and graceful degradation for stress scenarios. Apply these patterns to build truly resilient systems.