In distributed systems, failure isn't an edge case—it's the normal operating condition. Networks fail. Services crash. Databases become unavailable. Timeouts occur. And in the complex orchestration of a saga spanning multiple services, the probability of encountering at least one failure approaches certainty.

The Saga pattern acknowledges this reality by making failure handling a first-class architectural concern. Rather than hoping failures won't happen, sagas are designed from the ground up to detect, categorize, respond to, and recover from failures gracefully.

Key Insight: A well-designed saga isn't one that never fails—it's one that fails gracefully, recovers automatically when possible, and provides operators with clear information when manual intervention is needed.
By the end of this page, you will understand the taxonomy of saga failures, implement sophisticated retry and timeout strategies, design for observability, handle partial failures, and build saga architectures that are resilient to the realities of distributed systems.
Not all failures are created equal. Understanding the different types of failures helps you design appropriate handling strategies for each.
| Failure Type | Characteristics | Examples | Recovery Strategy |
|---|---|---|---|
| Transient | Temporary; will likely succeed on retry | Network timeout, database deadlock, rate limiting | Retry with backoff |
| Intermittent | Failures that occur sporadically | Memory pressure, garbage collection pauses | Retry + monitoring |
| Business Logic | Valid rejection based on business rules | Insufficient funds, inventory depleted | Compensate and notify |
| Permanent Infrastructure | Service or dependency completely down | Service crash, data center outage | Wait + fallback + escalate |
| Data Corruption | Data in unexpected/invalid state | Schema mismatch, corrupted records | Manual intervention |
| Saga Logic Bug | Defect in saga implementation | Null pointer, incorrect state transitions | Fix and redeploy + manual recovery |
The Failure Spectrum:
```typescript
// Failure Classification System

enum FailureCategory {
  TRANSIENT = 'TRANSIENT',
  INTERMITTENT = 'INTERMITTENT',
  BUSINESS_LOGIC = 'BUSINESS_LOGIC',
  PERMANENT_INFRASTRUCTURE = 'PERMANENT_INFRASTRUCTURE',
  DATA_CORRUPTION = 'DATA_CORRUPTION',
  SAGA_BUG = 'SAGA_BUG'
}

interface ClassifiedFailure {
  category: FailureCategory;
  isRetryable: boolean;
  requiresCompensation: boolean;
  requiresEscalation: boolean;
  suggestedAction: string;
}

class FailureClassifier {
  classify(error: Error, context: SagaContext): ClassifiedFailure {
    // Business logic failures - explicit rejections
    if (error instanceof InsufficientFundsError ||
        error instanceof InventoryDepletedError ||
        error instanceof CustomerBlockedError) {
      return {
        category: FailureCategory.BUSINESS_LOGIC,
        isRetryable: false,
        requiresCompensation: true,
        requiresEscalation: false,
        suggestedAction: 'Execute compensating transactions and notify customer'
      };
    }

    // Data corruption
    if (error instanceof SchemaValidationError ||
        error instanceof InvalidStateError ||
        error.message.includes('constraint violation')) {
      return {
        category: FailureCategory.DATA_CORRUPTION,
        isRetryable: false,
        requiresCompensation: false, // May not be safe to compensate
        requiresEscalation: true,
        suggestedAction: 'Pause saga, escalate to engineering for investigation'
      };
    }

    // Transient failures - typically network/database issues
    if (error.message.includes('ECONNRESET') ||
        error.message.includes('timeout') ||
        error.message.includes('deadlock') ||
        error instanceof RateLimitError ||
        this.isHttpRetryable(error)) {
      return {
        category: FailureCategory.TRANSIENT,
        isRetryable: true,
        requiresCompensation: false,
        requiresEscalation: false,
        suggestedAction: 'Retry with exponential backoff'
      };
    }

    // Permanent infrastructure
    if (error.message.includes('service unavailable') ||
        error.message.includes('503') ||
        error.message.includes('connection refused')) {
      return {
        category: FailureCategory.PERMANENT_INFRASTRUCTURE,
        isRetryable: true,           // But with longer delays
        requiresCompensation: false, // Wait for recovery
        requiresEscalation: true,    // Alert operators
        suggestedAction: 'Enable circuit breaker, alert on-call, retry periodically'
      };
    }

    // Default: assume saga bug
    return {
      category: FailureCategory.SAGA_BUG,
      isRetryable: false,
      requiresCompensation: false,
      requiresEscalation: true,
      suggestedAction: 'Pause saga, capture full context, escalate to engineering'
    };
  }

  private isHttpRetryable(error: Error): boolean {
    const retryableStatusCodes = [408, 429, 500, 502, 503, 504];
    const match = error.message.match(/status (\d+)/);
    if (match) {
      return retryableStatusCodes.includes(parseInt(match[1], 10));
    }
    return false;
  }
}
```

Retry logic is the first line of defense against transient failures. However, naive retry implementations can cause more harm than good—overwhelming already-struggling services or creating thundering herd problems.
```typescript
// Production-Grade Retry Implementation

interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  jitterFactor: number; // 0-1, how much randomness to add
  retryableExceptions: Array<new (...args: unknown[]) => Error>;
}

class ExponentialBackoffRetry {
  private readonly defaultPolicy: RetryPolicy = {
    maxAttempts: 5,
    initialDelayMs: 100,
    maxDelayMs: 30000,
    backoffMultiplier: 2,
    jitterFactor: 0.2,
    retryableExceptions: [NetworkError, TimeoutError, RateLimitError]
  };

  async execute<T>(
    operation: () => Promise<T>,
    policy: Partial<RetryPolicy> = {}
  ): Promise<T> {
    const config = { ...this.defaultPolicy, ...policy };
    let lastError: Error | undefined;
    let delay = config.initialDelayMs;

    for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;

        if (!this.isRetryable(error, config.retryableExceptions)) {
          throw error;
        }

        if (attempt === config.maxAttempts) {
          break;
        }

        // Calculate delay with jitter
        const jitter = 1 + (Math.random() * 2 - 1) * config.jitterFactor;
        const sleepTime = Math.min(delay * jitter, config.maxDelayMs);

        console.log(
          `Retry attempt ${attempt}/${config.maxAttempts} after ${Math.round(sleepTime)}ms`
        );

        await this.sleep(sleepTime);
        delay *= config.backoffMultiplier;
      }
    }

    throw new RetryExhaustedError(
      `Operation failed after ${config.maxAttempts} attempts`,
      lastError
    );
  }

  private isRetryable(
    error: unknown,
    retryableExceptions: Array<new (...args: unknown[]) => Error>
  ): boolean {
    return retryableExceptions.some(exc => error instanceof exc);
  }

  // protected (not private) so subclasses like RateLimitAwareRetry can reuse it
  protected sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry with respect for rate limits
class RateLimitAwareRetry extends ExponentialBackoffRetry {
  async execute<T>(
    operation: () => Promise<T>,
    policy: Partial<RetryPolicy> = {}
  ): Promise<T> {
    try {
      return await super.execute(operation, policy);
    } catch (error) {
      // Check for rate limit response with Retry-After header
      if (error instanceof RateLimitError && error.retryAfterSeconds) {
        console.log(`Rate limited. Waiting ${error.retryAfterSeconds}s as instructed`);
        await this.sleep(error.retryAfterSeconds * 1000);
        return await super.execute(operation, policy);
      }
      throw error;
    }
  }
}

// Saga step with built-in retry
class RetryableSagaStep<TInput, TOutput> {
  constructor(
    private step: SagaStep<TInput, TOutput>,
    private retryPolicy: RetryPolicy,
    private retry: ExponentialBackoffRetry
  ) {}

  async execute(input: TInput, context: SagaContext): Promise<TOutput> {
    return await this.retry.execute(
      () => this.step.execute(input, context),
      this.retryPolicy
    );
  }

  async compensate(context: CompensationContext): Promise<void> {
    return await this.retry.execute(
      () => this.step.compensate(context),
      // Compensations get more aggressive retry
      {
        ...this.retryPolicy,
        maxAttempts: this.retryPolicy.maxAttempts * 2,
        maxDelayMs: this.retryPolicy.maxDelayMs * 2
      }
    );
  }
}
```

Timeouts are essential for saga liveness—without them, a stuck service can hang the entire saga indefinitely. But timeout management in sagas is more nuanced than simple read/write timeouts.
Types of Timeouts in Sagas:
| Timeout Type | What It Protects Against | Typical Value | Action on Timeout |
|---|---|---|---|
| Step Execution Timeout | Single step taking too long | 5-30 seconds | Retry or compensate |
| Step Response Timeout | Waiting for async response | 30-300 seconds | Retry or compensate |
| Saga Overall Timeout | Entire saga taking too long | 5-60 minutes | Force compensation |
| Compensation Timeout | Compensation step hanging | 30-120 seconds | Retry or escalate |
| Idle Timeout | Saga inactive (stuck) | 5-30 minutes | Resume or escalate |
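As a rough starting point, the ranges in the table might translate into a configuration like the sketch below. It matches the TimeoutConfig interface used in the listing that follows; the specific values are illustrative assumptions, not recommendations, and should be tuned per saga type and environment.

```typescript
// Illustrative defaults only — tune per saga type and per environment.
const orderSagaTimeouts: TimeoutConfig = {
  stepExecutionMs: 15_000,      // single step: within the 5-30 s range
  stepResponseMs: 120_000,      // async reply: within the 30-300 s range
  sagaOverallMs: 30 * 60_000,   // whole saga: within the 5-60 min range
  compensationMs: 60_000,       // compensation step: within the 30-120 s range
  idleMs: 15 * 60_000           // no progress at all: within the 5-30 min range
};
```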
```typescript
// Comprehensive Timeout Management

interface TimeoutConfig {
  stepExecutionMs: number;
  stepResponseMs: number;
  sagaOverallMs: number;
  compensationMs: number;
  idleMs: number;
}

class TimeoutManager {
  private sagaTimers: Map<string, NodeJS.Timeout> = new Map();
  private stepTimers: Map<string, NodeJS.Timeout> = new Map();

  constructor(
    private config: TimeoutConfig,
    private sagaRepository: SagaRepository,
    private alertService: AlertService
  ) {}

  // Start overall saga timeout
  startSagaTimeout(sagaId: string, onTimeout: () => Promise<void>): void {
    const timer = setTimeout(async () => {
      console.warn(`Saga ${sagaId} exceeded overall timeout of ${this.config.sagaOverallMs}ms`);

      await this.alertService.warn({
        type: 'SAGA_TIMEOUT',
        sagaId,
        message: 'Saga exceeded maximum execution time'
      });

      await onTimeout();
    }, this.config.sagaOverallMs);

    this.sagaTimers.set(sagaId, timer);
  }

  // Clear saga timeout on completion
  clearSagaTimeout(sagaId: string): void {
    const timer = this.sagaTimers.get(sagaId);
    if (timer) {
      clearTimeout(timer);
      this.sagaTimers.delete(sagaId);
    }
  }

  // Execute step with timeout
  async executeWithTimeout<T>(
    stepName: string,
    operation: () => Promise<T>,
    timeoutMs: number = this.config.stepExecutionMs
  ): Promise<T> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        reject(new StepTimeoutError(
          `Step '${stepName}' timed out after ${timeoutMs}ms`
        ));
      }, timeoutMs);

      operation()
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }

  // Wait for async response with timeout
  async waitForResponse<T>(
    sagaId: string,
    stepName: string,
    responsePromise: Promise<T>,
    timeoutMs: number = this.config.stepResponseMs
  ): Promise<T> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(async () => {
        // Log timeout for investigation
        await this.sagaRepository.update(sagaId, {
          lastTimeoutAt: new Date(),
          lastTimeoutStep: stepName
        });

        reject(new ResponseTimeoutError(
          `No response for step '${stepName}' after ${timeoutMs}ms`
        ));
      }, timeoutMs);

      responsePromise
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }
}

// Deadline propagation for nested calls
class DeadlinePropagation {
  private static DEADLINE_HEADER = 'X-Saga-Deadline';

  // Set deadline in outgoing request
  static setDeadline(headers: Record<string, string>, deadlineMs: number): void {
    const deadline = Date.now() + deadlineMs;
    headers[this.DEADLINE_HEADER] = deadline.toString();
  }

  // Extract deadline from incoming request
  static getDeadline(headers: Record<string, string>): number | null {
    const deadline = headers[this.DEADLINE_HEADER];
    return deadline ? parseInt(deadline, 10) : null;
  }

  // Calculate remaining time until deadline
  static getRemainingTime(headers: Record<string, string>): number {
    const deadline = this.getDeadline(headers);
    if (!deadline) return Infinity;
    return Math.max(0, deadline - Date.now());
  }

  // Check if deadline has passed
  static isExpired(headers: Record<string, string>): boolean {
    const remaining = this.getRemainingTime(headers);
    return remaining <= 0;
  }
}

// Service that respects propagated deadlines
class DeadlineAwareService {
  async processOrder(request: OrderRequest): Promise<OrderResult> {
    // Check if deadline already expired
    if (DeadlinePropagation.isExpired(request.headers)) {
      throw new DeadlineExceededError('Request deadline already passed');
    }

    const remainingTime = DeadlinePropagation.getRemainingTime(request.headers);

    // Set local timeout based on remaining deadline
    return await new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        reject(new DeadlineExceededError('Processing exceeded deadline'));
      }, Math.min(remainingTime, 30000)); // Cap at 30s max

      this.doProcessOrder(request)
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }
}
```

A timeout is relative ('30 seconds from now'). A deadline is absolute ('complete by 2024-01-15T10:30:00Z'). Deadlines propagate better across service boundaries because they account for time already spent. If the overall deadline is 30 seconds and 10 seconds were spent on network and Service A, Service B knows it has only 20 seconds, not the full 30.
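As a quick illustration of that arithmetic, here is a sketch using the DeadlinePropagation helper above. The httpClient and the service URLs are hypothetical placeholders; the point is that the caller stamps one absolute deadline and every downstream hop derives its local budget from whatever time is left.

```typescript
// Caller: allow 30 s total for the whole chain.
const headers: Record<string, string> = {};
DeadlinePropagation.setDeadline(headers, 30_000);
await httpClient.post('https://service-a.internal/steps/reserve', payload, { headers });

// Inside Service A, some time later, before calling Service B:
const budgetMs = DeadlinePropagation.getRemainingTime(headers); // e.g. ~20 s if 10 s were already spent
if (budgetMs <= 0) {
  throw new DeadlineExceededError('No time left to call Service B');
}
await httpClient.post('https://service-b.internal/steps/charge', payload, {
  headers,                             // forward the same absolute deadline
  timeout: Math.min(budgetMs, 10_000)  // never wait longer than the remaining budget
});
```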
When a service is failing consistently, continuing to send requests makes things worse—it overwhelms the struggling service and delays saga execution. Circuit breakers prevent this cascade.

Circuit Breaker States:

- CLOSED — Normal operation; requests flow through
- OPEN — Service is failing; requests fail immediately without calling the service
- HALF-OPEN — Testing recovery; limited requests allowed through to check whether the service has recovered
```typescript
// Circuit Breaker Implementation for Saga Steps

enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN'
}

interface CircuitBreakerConfig {
  failureThreshold: number;  // Number of failures to open circuit
  successThreshold: number;  // Successes in half-open to close
  openTimeoutMs: number;     // Time before trying half-open
  windowSizeMs: number;      // Rolling window for failure counting
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount: number = 0;
  private successCount: number = 0;
  private lastFailureTime: number = 0;
  private failures: number[] = []; // Timestamps of failures

  constructor(
    private serviceName: string,
    private config: CircuitBreakerConfig = {
      failureThreshold: 5,
      successThreshold: 3,
      openTimeoutMs: 30000,
      windowSizeMs: 60000
    }
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if we should allow the request
    if (!this.shouldAllowRequest()) {
      throw new CircuitOpenError(
        `Circuit breaker for ${this.serviceName} is OPEN`
      );
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private shouldAllowRequest(): boolean {
    switch (this.state) {
      case CircuitState.CLOSED:
        return true;

      case CircuitState.OPEN:
        // Check if we should transition to half-open
        if (Date.now() - this.lastFailureTime >= this.config.openTimeoutMs) {
          this.transitionTo(CircuitState.HALF_OPEN);
          return true;
        }
        return false;

      case CircuitState.HALF_OPEN:
        // Allow limited requests in half-open state
        return true;
    }
  }

  private onSuccess(): void {
    switch (this.state) {
      case CircuitState.CLOSED:
        // Reset failure count on success
        this.pruneOldFailures();
        break;

      case CircuitState.HALF_OPEN:
        this.successCount++;
        if (this.successCount >= this.config.successThreshold) {
          this.transitionTo(CircuitState.CLOSED);
        }
        break;

      case CircuitState.OPEN:
        // Shouldn't happen, but handle gracefully
        break;
    }
  }

  private onFailure(): void {
    this.lastFailureTime = Date.now();
    this.failures.push(this.lastFailureTime);

    switch (this.state) {
      case CircuitState.CLOSED:
        this.pruneOldFailures();
        if (this.failures.length >= this.config.failureThreshold) {
          this.transitionTo(CircuitState.OPEN);
        }
        break;

      case CircuitState.HALF_OPEN:
        // Any failure in half-open reopens the circuit
        this.transitionTo(CircuitState.OPEN);
        break;

      case CircuitState.OPEN:
        // Already open, just update last failure time
        break;
    }
  }

  private pruneOldFailures(): void {
    const cutoff = Date.now() - this.config.windowSizeMs;
    this.failures = this.failures.filter(t => t > cutoff);
  }

  private transitionTo(newState: CircuitState): void {
    console.log(
      `Circuit breaker for ${this.serviceName}: ${this.state} -> ${newState}`
    );
    this.state = newState;
    this.successCount = 0;

    // Emit metrics/alerts on state change
    if (newState === CircuitState.OPEN) {
      this.emitAlert(`Circuit opened for ${this.serviceName}`);
    }
  }

  private emitAlert(message: string): void {
    // Integration with alerting system
    console.warn(`[CIRCUIT_BREAKER] ${message}`);
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Saga step wrapped with circuit breaker
class CircuitProtectedSagaStep<TInput, TOutput> {
  private circuitBreaker: CircuitBreaker;

  constructor(
    private step: SagaStep<TInput, TOutput>,
    private serviceName: string
  ) {
    this.circuitBreaker = new CircuitBreaker(serviceName);
  }

  async execute(input: TInput, context: SagaContext): Promise<TOutput> {
    try {
      return await this.circuitBreaker.execute(
        () => this.step.execute(input, context)
      );
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        // Transform to saga-friendly error
        throw new ServiceUnavailableError(
          `Service ${this.serviceName} is currently unavailable (circuit open)`,
          { isRetryable: true, retryAfterMs: 30000 }
        );
      }
      throw error;
    }
  }
}
```

Some of the most challenging scenarios involve partial failures—when a step executes somewhere between 0% and 100% successfully. These require careful handling to avoid both data loss and duplicate processing.
```typescript
// Strategies for Handling Partial Failures

// 1. IDEMPOTENCY KEYS for "at least once" safety
class IdempotentOperation {
  async execute<T>(
    idempotencyKey: string,
    operation: () => Promise<T>
  ): Promise<T> {
    // Check if already executed
    const existing = await this.idempotencyStore.find(idempotencyKey);

    if (existing) {
      if (existing.status === 'COMPLETED') {
        return existing.result as T;
      }
      if (existing.status === 'IN_PROGRESS') {
        throw new OperationInProgressError(idempotencyKey);
      }
    }

    // Mark as in progress
    await this.idempotencyStore.create({
      key: idempotencyKey,
      status: 'IN_PROGRESS',
      startedAt: new Date()
    });

    try {
      const result = await operation();

      // Mark as completed with result
      await this.idempotencyStore.update(idempotencyKey, {
        status: 'COMPLETED',
        result,
        completedAt: new Date()
      });

      return result;
    } catch (error) {
      // Mark as failed (or delete to allow retry)
      await this.idempotencyStore.update(idempotencyKey, {
        status: 'FAILED',
        error: error.message,
        failedAt: new Date()
      });
      throw error;
    }
  }
}

// 2. VERIFICATION STEP for ambiguous outcomes
class PaymentStepWithVerification {
  private readonly verificationAttempts = 3;
  private readonly verificationDelayMs = 5000;

  async processPayment(orderId: string, amount: number): Promise<PaymentResult> {
    const idempotencyKey = `payment-${orderId}`;

    try {
      return await this.paymentGateway.charge({
        idempotencyKey,
        amount,
        metadata: { orderId }
      });
    } catch (error) {
      if (this.isAmbiguousFailure(error)) {
        // Timeout or network error - payment might have succeeded
        return await this.verifyPaymentStatus(idempotencyKey, orderId);
      }
      throw error;
    }
  }

  private isAmbiguousFailure(error: Error): boolean {
    return error instanceof TimeoutError ||
      error.message.includes('ECONNRESET') ||
      error.message.includes('network error');
  }

  private async verifyPaymentStatus(
    idempotencyKey: string,
    orderId: string
  ): Promise<PaymentResult> {
    for (let attempt = 1; attempt <= this.verificationAttempts; attempt++) {
      await this.sleep(this.verificationDelayMs);

      try {
        // Check if payment actually went through
        const status = await this.paymentGateway.getPaymentByIdempotencyKey(
          idempotencyKey
        );

        if (status) {
          // Payment was processed!
          return status;
        }
      } catch (verifyError) {
        if (attempt === this.verificationAttempts) {
          throw new AmbiguousPaymentStateError(
            `Cannot determine payment status for order ${orderId}`,
            { requiresManualVerification: true }
          );
        }
      }
    }

    // Verification exhausted, payment not found - likely didn't process
    throw new PaymentNotProcessedError(orderId);
  }
}

// 3. BATCH WITH INDIVIDUAL TRACKING
class BatchInventoryReservation {
  async reserveItems(
    orderId: string,
    items: OrderItem[]
  ): Promise<BatchReservationResult> {
    const results: ItemReservationResult[] = [];
    const reservedItems: string[] = [];
    const failedItems: { item: OrderItem; error: Error }[] = [];

    for (const item of items) {
      try {
        const reservationId = await this.inventoryService.reserve(
          item.productId,
          item.quantity,
          orderId
        );

        reservedItems.push(reservationId);
        results.push({
          productId: item.productId,
          status: 'RESERVED',
          reservationId
        });
      } catch (error) {
        failedItems.push({ item, error: error as Error });
        results.push({
          productId: item.productId,
          status: 'FAILED',
          error: error.message
        });
      }
    }

    // Decision: all-or-nothing or partial success?
    if (failedItems.length > 0) {
      // All-or-nothing: release all successful reservations
      for (const reservationId of reservedItems) {
        await this.inventoryService.release(reservationId);
      }

      throw new BatchReservationFailedError(
        `Failed to reserve ${failedItems.length} of ${items.length} items`,
        { failedItems, results }
      );
    }

    return { success: true, reservationIds: reservedItems, results };
  }
}

// 4. SAGA STATE RECOVERY AFTER CRASH
class SagaRecovery {
  async recoverStuckSagas(): Promise<void> {
    // Find sagas that were in progress when system crashed
    const stuckSagas = await this.sagaRepository.findByStatus({
      status: 'IN_PROGRESS',
      lastUpdatedBefore: new Date(Date.now() - 5 * 60 * 1000) // 5 min stale
    });

    for (const saga of stuckSagas) {
      console.log(`Recovering stuck saga: ${saga.id}`);

      try {
        await this.determineSagaState(saga);
      } catch (error) {
        console.error(`Failed to recover saga ${saga.id}:`, error);
        await this.escalateSaga(saga, error as Error);
      }
    }
  }

  private async determineSagaState(saga: SagaInstance): Promise<void> {
    const lastStep = saga.completedSteps[saga.completedSteps.length - 1];

    if (!lastStep) {
      // Saga never started - safe to restart
      await this.restartSaga(saga);
      return;
    }

    // Verify if last step actually completed
    const stepCompleted = await this.verifyStepCompletion(saga, lastStep);

    if (stepCompleted) {
      // Continue from next step
      await this.continueSaga(saga, saga.completedSteps.length);
    } else {
      // Last step is ambiguous - need manual inspection
      await this.escalateSaga(saga, new Error(
        `Ambiguous state at step ${lastStep}`
      ));
    }
  }
}
```

When sagas fail, operators need visibility into what went wrong, where, and why. Comprehensive observability is essential for debugging and maintaining saga systems.
```typescript
// Comprehensive Saga Observability System

interface SagaMetrics {
  sagasStarted: Counter;
  sagasCompleted: Counter;
  sagasFailed: Counter;
  sagasCompensated: Counter;
  sagaDuration: Histogram;
  stepDuration: Histogram;
  stepRetries: Counter;
  compensationDuration: Histogram;
}

class ObservableSagaExecutor {
  private metrics: SagaMetrics;
  private tracer: Tracer;

  async executeSaga<T>(
    sagaId: string,
    definition: SagaDefinition<T>,
    input: T
  ): Promise<SagaResult<T>> {
    // Start distributed trace
    const span = this.tracer.startSpan('saga.execute', {
      attributes: {
        'saga.id': sagaId,
        'saga.type': definition.name,
        'saga.input': JSON.stringify(input)
      }
    });

    this.metrics.sagasStarted.inc({ saga_type: definition.name });
    const startTime = Date.now();

    // Structured logging
    this.logger.info('Saga started', {
      sagaId,
      sagaType: definition.name,
      input: this.sanitizeForLogging(input)
    });

    // Declared outside try so the compensation path can see the latest step output
    let data = input;

    try {
      for (let i = 0; i < definition.steps.length; i++) {
        const step = definition.steps[i];
        data = await this.executeStep(sagaId, step, data, i + 1, span);
      }

      // Success metrics
      const duration = Date.now() - startTime;
      this.metrics.sagasCompleted.inc({ saga_type: definition.name });
      this.metrics.sagaDuration.observe(
        { saga_type: definition.name, outcome: 'success' },
        duration
      );

      this.logger.info('Saga completed', {
        sagaId,
        sagaType: definition.name,
        durationMs: duration
      });

      span.setStatus({ code: SpanStatusCode.OK });
      span.end();

      return { success: true, data };
    } catch (error) {
      const duration = Date.now() - startTime;

      this.metrics.sagasFailed.inc({
        saga_type: definition.name,
        failure_step: (error as SagaError).stepName || 'unknown'
      });
      this.metrics.sagaDuration.observe(
        { saga_type: definition.name, outcome: 'failed' },
        duration
      );

      this.logger.error('Saga failed', {
        sagaId,
        sagaType: definition.name,
        error: error.message,
        stack: error.stack,
        durationMs: duration
      });

      // Attempt compensation
      const compensationResult = await this.compensateSaga(
        sagaId,
        definition,
        data,
        span
      );

      if (compensationResult.success) {
        this.metrics.sagasCompensated.inc({ saga_type: definition.name });
      }

      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.end();

      return {
        success: false,
        error: error.message,
        compensated: compensationResult.success
      };
    }
  }

  private async executeStep<T>(
    sagaId: string,
    step: SagaStep<T, T>,
    data: T,
    stepNumber: number,
    parentSpan: Span
  ): Promise<T> {
    const stepSpan = this.tracer.startSpan(`saga.step.${step.name}`, {
      parent: parentSpan,
      attributes: {
        'saga.step.name': step.name,
        'saga.step.number': stepNumber
      }
    });

    const startTime = Date.now();
    let attempt = 0;

    while (true) {
      attempt++;

      try {
        const result = await step.execute(data, { sagaId, stepNumber });

        const duration = Date.now() - startTime;
        this.metrics.stepDuration.observe(
          { saga_step: step.name, outcome: 'success' },
          duration
        );

        this.logger.debug('Step completed', {
          sagaId,
          stepName: step.name,
          attempt,
          durationMs: duration
        });

        stepSpan.end();
        return result;
      } catch (error) {
        if (this.isRetryable(error) && attempt < step.maxRetries) {
          this.metrics.stepRetries.inc({ saga_step: step.name });

          this.logger.warn('Step failed, retrying', {
            sagaId,
            stepName: step.name,
            attempt,
            maxRetries: step.maxRetries,
            error: error.message
          });

          await this.backoff(attempt);
          continue;
        }

        const duration = Date.now() - startTime;
        this.metrics.stepDuration.observe(
          { saga_step: step.name, outcome: 'failed' },
          duration
        );

        stepSpan.recordException(error as Error);
        stepSpan.end();

        throw Object.assign(error, { stepName: step.name });
      }
    }
  }
}

// Alert Conditions
class SagaAlertManager {
  private alertRules: AlertRule[] = [
    {
      name: 'high_saga_failure_rate',
      condition: (metrics) =>
        metrics.sagasFailed.rate('5m') / metrics.sagasStarted.rate('5m') > 0.1,
      severity: 'critical',
      message: 'Saga failure rate exceeds 10% in last 5 minutes'
    },
    {
      name: 'stuck_sagas',
      condition: async (metrics, repo) =>
        (await repo.countByStatus('IN_PROGRESS', { staleMins: 30 })) > 10,
      severity: 'warning',
      message: 'More than 10 sagas stuck for over 30 minutes'
    },
    {
      name: 'compensation_failures',
      condition: (metrics) => metrics.compensationFailures.count('1h') > 5,
      severity: 'critical',
      message: 'Multiple compensation failures in last hour - data may be inconsistent'
    },
    {
      name: 'circuit_breaker_open',
      condition: (_, services) =>
        services.some(s => s.circuitBreaker.getState() === 'OPEN'),
      severity: 'warning',
      message: 'One or more service circuit breakers are open'
    }
  ];
}
```

When all automatic recovery attempts fail, sagas must have a graceful path to human intervention. Dead Letter Queues (DLQs) provide this mechanism.
```typescript
// Dead Letter Queue and Manual Intervention System

interface DeadLetterSaga {
  sagaId: string;
  sagaType: string;
  failedStep: string;
  error: string;
  data: unknown;
  completedSteps: string[];
  failedAt: Date;
  attempts: number;
  lastAttemptAt: Date;
  status: 'PENDING_REVIEW' | 'IN_REVIEW' | 'RESOLVED' | 'ABANDONED';
  assignedTo?: string;
  resolution?: SagaResolution;
}

type SagaResolution =
  | { type: 'RETRY'; notes: string }
  | { type: 'MANUAL_COMPLETE'; notes: string }
  | { type: 'MANUAL_COMPENSATE'; notes: string }
  | { type: 'ABANDON'; reason: string };

class DeadLetterQueue {
  async moveToDeadLetter(saga: SagaInstance, error: Error): Promise<void> {
    const dlSaga: DeadLetterSaga = {
      sagaId: saga.id,
      sagaType: saga.type,
      failedStep: saga.currentStep,
      error: error.message,
      data: saga.data,
      completedSteps: saga.completedSteps,
      failedAt: new Date(),
      attempts: saga.attempts,
      lastAttemptAt: new Date(),
      status: 'PENDING_REVIEW'
    };

    await this.deadLetterRepository.create(dlSaga);

    // Update original saga status
    await this.sagaRepository.update(saga.id, {
      status: 'IN_DLQ',
      movedToDlqAt: new Date()
    });

    // Create incident ticket
    await this.ticketService.create({
      type: 'SAGA_DEAD_LETTER',
      priority: 'HIGH',
      title: `Saga ${saga.id} requires manual intervention`,
      description: `
        Saga Type: ${saga.type}
        Failed Step: ${saga.currentStep}
        Error: ${error.message}
        Completed Steps: ${saga.completedSteps.join(' -> ')}
        Data: ${JSON.stringify(saga.data, null, 2)}
      `,
      sagaId: saga.id
    });

    // Alert on-call
    await this.alertService.page({
      severity: 'high',
      message: `Saga moved to DLQ: ${saga.type}.${saga.currentStep}`,
      context: { sagaId: saga.id }
    });
  }

  async retryFromDlq(sagaId: string, operatorId: string): Promise<void> {
    const dlSaga = await this.deadLetterRepository.findById(sagaId);
    if (!dlSaga) {
      throw new NotFoundError(`Dead letter saga ${sagaId} not found`);
    }

    // Log the manual intervention
    await this.auditLog.record({
      action: 'DLQ_RETRY',
      sagaId,
      operatorId,
      timestamp: new Date()
    });

    // Update status
    await this.deadLetterRepository.update(sagaId, {
      status: 'IN_REVIEW',
      assignedTo: operatorId
    });

    // Attempt to resume saga
    const originalSaga = await this.sagaRepository.findById(sagaId);

    try {
      await this.sagaExecutor.resumeSaga(originalSaga);

      // Success - mark as resolved
      await this.deadLetterRepository.update(sagaId, {
        status: 'RESOLVED',
        resolution: { type: 'RETRY', notes: `Retried by ${operatorId}` }
      });
    } catch (retryError) {
      // Still failing - update DLQ record
      await this.deadLetterRepository.update(sagaId, {
        status: 'PENDING_REVIEW',
        lastAttemptAt: new Date(),
        attempts: dlSaga.attempts + 1,
        error: retryError.message
      });
      throw retryError;
    }
  }

  async manuallyComplete(
    sagaId: string,
    operatorId: string,
    notes: string
  ): Promise<void> {
    // Mark saga as complete without executing remaining steps
    await this.sagaRepository.update(sagaId, {
      status: 'MANUALLY_COMPLETED',
      completedAt: new Date(),
      completedBy: operatorId,
      notes
    });

    await this.deadLetterRepository.update(sagaId, {
      status: 'RESOLVED',
      resolution: { type: 'MANUAL_COMPLETE', notes }
    });

    await this.auditLog.record({
      action: 'MANUAL_SAGA_COMPLETE',
      sagaId,
      operatorId,
      notes,
      timestamp: new Date()
    });
  }

  async manuallyCompensate(
    sagaId: string,
    operatorId: string,
    notes: string
  ): Promise<void> {
    const saga = await this.sagaRepository.findById(sagaId);

    // Execute compensations for completed steps, most recent first
    for (const stepName of saga.completedSteps.reverse()) {
      const step = this.getStepDefinition(saga.type, stepName);
      await step.compensate({ sagaId, ...saga.data });
    }

    await this.sagaRepository.update(sagaId, {
      status: 'MANUALLY_COMPENSATED',
      compensatedAt: new Date(),
      compensatedBy: operatorId
    });

    await this.deadLetterRepository.update(sagaId, {
      status: 'RESOLVED',
      resolution: { type: 'MANUAL_COMPENSATE', notes }
    });
  }
}

// Admin UI for DLQ management
class DlqDashboard {
  async getStats(): Promise<DlqStats> {
    const pending = await this.dlqRepo.countByStatus('PENDING_REVIEW');
    const inReview = await this.dlqRepo.countByStatus('IN_REVIEW');
    const byType = await this.dlqRepo.groupBy('sagaType');
    const avgAge = await this.dlqRepo.avgAgeMinutes('PENDING_REVIEW');

    return { pending, inReview, byType, avgAgeMins: avgAge };
  }

  async getQueue(filters: DlqFilters): Promise<DeadLetterSaga[]> {
    return this.dlqRepo.find({
      status: filters.status,
      sagaType: filters.sagaType,
      olderThan: filters.olderThan,
      orderBy: 'failedAt',
      limit: 50
    });
  }
}
```

Dead Letter Queues should be actively monitored and processed. A growing DLQ indicates systemic problems. Track DLQ depth as a key operational metric and set alerts when it grows beyond thresholds.
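A minimal sketch of that threshold check, reusing the DlqDashboard above. The DlqDepthMonitor class, the alert payload shape, and the threshold values are illustrative assumptions, not part of the pattern itself.

```typescript
// Hypothetical periodic check; thresholds are illustrative, not prescriptive.
class DlqDepthMonitor {
  constructor(
    private dashboard: DlqDashboard,
    private alertService: AlertService,
    private maxPending = 25,       // alert when the backlog grows past this
    private maxAvgAgeMins = 60     // or when items sit unreviewed too long
  ) {}

  async check(): Promise<void> {
    const stats = await this.dashboard.getStats();

    if (stats.pending > this.maxPending || stats.avgAgeMins > this.maxAvgAgeMins) {
      await this.alertService.warn({
        type: 'DLQ_BACKLOG',
        message: `DLQ backlog: ${stats.pending} pending, avg age ${Math.round(stats.avgAgeMins)} min`,
        context: { byType: stats.byType }
      });
    }
  }
}
```

Run a check like this on a schedule (for example, every few minutes) so a growing backlog surfaces as an alert rather than as a pile of stale tickets.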
Handling failures gracefully is what separates production-grade saga implementations from toy examples. The essential takeaways: classify failures before reacting to them, retry only what is genuinely transient (with backoff and jitter), bound every wait with timeouts and propagated deadlines, shed load with circuit breakers when a dependency is struggling, verify ambiguous outcomes instead of guessing, instrument sagas end to end, and give operators a dead letter path when automation runs out.
What's Next:

We've covered the theory and failure handling. The final page brings everything together with Saga Implementation—production-ready patterns, framework recommendations, and real-world deployment considerations.
You now understand how to make sagas resilient to the inevitable failures of distributed systems. The patterns covered here—retry strategies, circuit breakers, timeout management, and dead letter queues—are the difference between sagas that work in dev and sagas that work in production.