In distributed systems, failure isn't an edge case—it's the normal operating condition. Networks fail. Services crash. Databases become unavailable. Timeouts occur. And in the complex orchestration of a saga spanning multiple services, the probability of encountering at least one failure approaches certainty.

The Saga pattern acknowledges this reality by making failure handling a first-class architectural concern. Rather than hoping failures won't happen, sagas are designed from the ground up to detect, categorize, respond to, and recover from failures gracefully.

Key Insight: A well-designed saga isn't one that never fails—it's one that fails gracefully, recovers automatically when possible, and provides operators with clear information when manual intervention is needed.
By the end of this page, you will understand the taxonomy of saga failures, implement sophisticated retry and timeout strategies, design for observability, handle partial failures, and build saga architectures that are resilient to the realities of distributed systems.
Not all failures are created equal. Understanding the different types of failures helps you design appropriate handling strategies for each.
| Failure Type | Characteristics | Examples | Recovery Strategy |
|---|---|---|---|
| Transient | Temporary; will likely succeed on retry | Network timeout, database deadlock, rate limiting | Retry with backoff |
| Intermittent | Failures that occur sporadically | Memory pressure, garbage collection pauses | Retry + monitoring |
| Business Logic | Valid rejection based on business rules | Insufficient funds, inventory depleted | Compensate and notify |
| Permanent Infrastructure | Service or dependency completely down | Service crash, data center outage | Wait + fallback + escalate |
| Data Corruption | Data in unexpected/invalid state | Schema mismatch, corrupted records | Manual intervention |
| Saga Logic Bug | Defect in saga implementation | Null pointer, incorrect state transitions | Fix and redeploy + manual recovery |
The Failure Spectrum:
```typescript
// Failure Classification System

enum FailureCategory {
  TRANSIENT = 'TRANSIENT',
  INTERMITTENT = 'INTERMITTENT',
  BUSINESS_LOGIC = 'BUSINESS_LOGIC',
  PERMANENT_INFRASTRUCTURE = 'PERMANENT_INFRASTRUCTURE',
  DATA_CORRUPTION = 'DATA_CORRUPTION',
  SAGA_BUG = 'SAGA_BUG'
}

interface ClassifiedFailure {
  category: FailureCategory;
  isRetryable: boolean;
  requiresCompensation: boolean;
  requiresEscalation: boolean;
  suggestedAction: string;
}

class FailureClassifier {
  classify(error: Error, context: SagaContext): ClassifiedFailure {
    // Business logic failures - explicit rejections
    if (error instanceof InsufficientFundsError ||
        error instanceof InventoryDepletedError ||
        error instanceof CustomerBlockedError) {
      return {
        category: FailureCategory.BUSINESS_LOGIC,
        isRetryable: false,
        requiresCompensation: true,
        requiresEscalation: false,
        suggestedAction: 'Execute compensating transactions and notify customer'
      };
    }

    // Data corruption
    if (error instanceof SchemaValidationError ||
        error instanceof InvalidStateError ||
        error.message.includes('constraint violation')) {
      return {
        category: FailureCategory.DATA_CORRUPTION,
        isRetryable: false,
        requiresCompensation: false, // May not be safe to compensate
        requiresEscalation: true,
        suggestedAction: 'Pause saga, escalate to engineering for investigation'
      };
    }

    // Transient failures - typically network/database issues
    if (error.message.includes('ECONNRESET') ||
        error.message.includes('timeout') ||
        error.message.includes('deadlock') ||
        error instanceof RateLimitError ||
        this.isHttpRetryable(error)) {
      return {
        category: FailureCategory.TRANSIENT,
        isRetryable: true,
        requiresCompensation: false,
        requiresEscalation: false,
        suggestedAction: 'Retry with exponential backoff'
      };
    }

    // Permanent infrastructure
    if (error.message.includes('service unavailable') ||
        error.message.includes('503') ||
        error.message.includes('connection refused')) {
      return {
        category: FailureCategory.PERMANENT_INFRASTRUCTURE,
        isRetryable: true,           // But with longer delays
        requiresCompensation: false, // Wait for recovery
        requiresEscalation: true,    // Alert operators
        suggestedAction: 'Enable circuit breaker, alert on-call, retry periodically'
      };
    }

    // Default: assume saga bug
    return {
      category: FailureCategory.SAGA_BUG,
      isRetryable: false,
      requiresCompensation: false,
      requiresEscalation: true,
      suggestedAction: 'Pause saga, capture full context, escalate to engineering'
    };
  }

  private isHttpRetryable(error: Error): boolean {
    const retryableStatusCodes = [408, 429, 500, 502, 503, 504];
    const match = error.message.match(/status (\d+)/);
    if (match) {
      return retryableStatusCodes.includes(parseInt(match[1], 10));
    }
    return false;
  }
}
```

Retry logic is the first line of defense against transient failures. However, naive retry implementations can cause more harm than good—overwhelming already-struggling services or creating thundering herd problems.
```typescript
// Production-Grade Retry Implementation

interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  jitterFactor: number; // 0-1, how much randomness to add
  retryableExceptions: Array<new (...args: unknown[]) => Error>;
}

class ExponentialBackoffRetry {
  private readonly defaultPolicy: RetryPolicy = {
    maxAttempts: 5,
    initialDelayMs: 100,
    maxDelayMs: 30000,
    backoffMultiplier: 2,
    jitterFactor: 0.2,
    retryableExceptions: [NetworkError, TimeoutError, RateLimitError]
  };

  async execute<T>(
    operation: () => Promise<T>,
    policy: Partial<RetryPolicy> = {}
  ): Promise<T> {
    const config = { ...this.defaultPolicy, ...policy };
    let lastError: Error | undefined;
    let delay = config.initialDelayMs;

    for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error as Error;

        if (!this.isRetryable(error, config.retryableExceptions)) {
          throw error;
        }

        if (attempt === config.maxAttempts) {
          break;
        }

        // Calculate delay with jitter
        const jitter = 1 + (Math.random() * 2 - 1) * config.jitterFactor;
        const sleepTime = Math.min(delay * jitter, config.maxDelayMs);

        console.log(
          `Retry attempt ${attempt}/${config.maxAttempts} after ${Math.round(sleepTime)}ms`
        );

        await this.sleep(sleepTime);
        delay *= config.backoffMultiplier;
      }
    }

    throw new RetryExhaustedError(
      `Operation failed after ${config.maxAttempts} attempts`,
      lastError
    );
  }

  private isRetryable(
    error: unknown,
    retryableExceptions: Array<new (...args: unknown[]) => Error>
  ): boolean {
    return retryableExceptions.some(exc => error instanceof exc);
  }

  // protected (not private) so subclasses like RateLimitAwareRetry can reuse it
  protected sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry with respect for rate limits
class RateLimitAwareRetry extends ExponentialBackoffRetry {
  async execute<T>(
    operation: () => Promise<T>,
    policy: Partial<RetryPolicy> = {}
  ): Promise<T> {
    try {
      return await super.execute(operation, policy);
    } catch (error) {
      // Check for rate limit response with Retry-After header
      if (error instanceof RateLimitError && error.retryAfterSeconds) {
        console.log(`Rate limited. Waiting ${error.retryAfterSeconds}s as instructed`);
        await this.sleep(error.retryAfterSeconds * 1000);
        return await super.execute(operation, policy);
      }
      throw error;
    }
  }
}

// Saga step with built-in retry
class RetryableSagaStep<TInput, TOutput> {
  constructor(
    private step: SagaStep<TInput, TOutput>,
    private retryPolicy: RetryPolicy,
    private retry: ExponentialBackoffRetry
  ) {}

  async execute(input: TInput, context: SagaContext): Promise<TOutput> {
    return await this.retry.execute(
      () => this.step.execute(input, context),
      this.retryPolicy
    );
  }

  async compensate(context: CompensationContext): Promise<void> {
    return await this.retry.execute(
      () => this.step.compensate(context),
      // Compensations get more aggressive retry
      {
        ...this.retryPolicy,
        maxAttempts: this.retryPolicy.maxAttempts * 2,
        maxDelayMs: this.retryPolicy.maxDelayMs * 2
      }
    );
  }
}
```

Timeouts are essential for saga liveness—without them, a stuck service can hang the entire saga indefinitely. But timeout management in sagas is more nuanced than simple read/write timeouts.
Types of Timeouts in Sagas:
| Timeout Type | What It Protects Against | Typical Value | Action on Timeout |
|---|---|---|---|
| Step Execution Timeout | Single step taking too long | 5-30 seconds | Retry or compensate |
| Step Response Timeout | Waiting for async response | 30-300 seconds | Retry or compensate |
| Saga Overall Timeout | Entire saga taking too long | 5-60 minutes | Force compensation |
| Compensation Timeout | Compensation step hanging | 30-120 seconds | Retry or escalate |
| Idle Timeout | Saga inactive (stuck) | 5-30 minutes | Resume or escalate |
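As a rough starting point, the ranges in the table might translate into a configuration like the sketch below. It matches the TimeoutConfig interface used in the listing that follows; the specific values are illustrative assumptions, not recommendations, and should be tuned per saga type and environment.

```typescript
// Illustrative defaults only — tune per saga type and per environment.
const orderSagaTimeouts: TimeoutConfig = {
  stepExecutionMs: 15_000,      // single step: within the 5-30 s range
  stepResponseMs: 120_000,      // async reply: within the 30-300 s range
  sagaOverallMs: 30 * 60_000,   // whole saga: within the 5-60 min range
  compensationMs: 60_000,       // compensation step: within the 30-120 s range
  idleMs: 15 * 60_000           // no progress at all: within the 5-30 min range
};
```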
```typescript
// Comprehensive Timeout Management

interface TimeoutConfig {
  stepExecutionMs: number;
  stepResponseMs: number;
  sagaOverallMs: number;
  compensationMs: number;
  idleMs: number;
}

class TimeoutManager {
  private sagaTimers: Map<string, NodeJS.Timeout> = new Map();
  private stepTimers: Map<string, NodeJS.Timeout> = new Map();

  constructor(
    private config: TimeoutConfig,
    private sagaRepository: SagaRepository,
    private alertService: AlertService
  ) {}

  // Start overall saga timeout
  startSagaTimeout(sagaId: string, onTimeout: () => Promise<void>): void {
    const timer = setTimeout(async () => {
      console.warn(`Saga ${sagaId} exceeded overall timeout of ${this.config.sagaOverallMs}ms`);

      await this.alertService.warn({
        type: 'SAGA_TIMEOUT',
        sagaId,
        message: 'Saga exceeded maximum execution time'
      });

      await onTimeout();
    }, this.config.sagaOverallMs);

    this.sagaTimers.set(sagaId, timer);
  }

  // Clear saga timeout on completion
  clearSagaTimeout(sagaId: string): void {
    const timer = this.sagaTimers.get(sagaId);
    if (timer) {
      clearTimeout(timer);
      this.sagaTimers.delete(sagaId);
    }
  }

  // Execute step with timeout
  async executeWithTimeout<T>(
    stepName: string,
    operation: () => Promise<T>,
    timeoutMs: number = this.config.stepExecutionMs
  ): Promise<T> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        reject(new StepTimeoutError(
          `Step '${stepName}' timed out after ${timeoutMs}ms`
        ));
      }, timeoutMs);

      operation()
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }

  // Wait for async response with timeout
  async waitForResponse<T>(
    sagaId: string,
    stepName: string,
    responsePromise: Promise<T>,
    timeoutMs: number = this.config.stepResponseMs
  ): Promise<T> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(async () => {
        // Log timeout for investigation
        await this.sagaRepository.update(sagaId, {
          lastTimeoutAt: new Date(),
          lastTimeoutStep: stepName
        });

        reject(new ResponseTimeoutError(
          `No response for step '${stepName}' after ${timeoutMs}ms`
        ));
      }, timeoutMs);

      responsePromise
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }
}

// Deadline propagation for nested calls
class DeadlinePropagation {
  private static DEADLINE_HEADER = 'X-Saga-Deadline';

  // Set deadline in outgoing request
  static setDeadline(headers: Record<string, string>, deadlineMs: number): void {
    const deadline = Date.now() + deadlineMs;
    headers[this.DEADLINE_HEADER] = deadline.toString();
  }

  // Extract deadline from incoming request
  static getDeadline(headers: Record<string, string>): number | null {
    const deadline = headers[this.DEADLINE_HEADER];
    return deadline ? parseInt(deadline, 10) : null;
  }

  // Calculate remaining time until deadline
  static getRemainingTime(headers: Record<string, string>): number {
    const deadline = this.getDeadline(headers);
    if (!deadline) return Infinity;
    return Math.max(0, deadline - Date.now());
  }

  // Check if deadline has passed
  static isExpired(headers: Record<string, string>): boolean {
    const remaining = this.getRemainingTime(headers);
    return remaining <= 0;
  }
}

// Service that respects propagated deadlines
class DeadlineAwareService {
  async processOrder(request: OrderRequest): Promise<OrderResult> {
    // Check if deadline already expired
    if (DeadlinePropagation.isExpired(request.headers)) {
      throw new DeadlineExceededError('Request deadline already passed');
    }

    const remainingTime = DeadlinePropagation.getRemainingTime(request.headers);

    // Set local timeout based on remaining deadline
    return await new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        reject(new DeadlineExceededError('Processing exceeded deadline'));
      }, Math.min(remainingTime, 30000)); // Cap at 30s max

      this.doProcessOrder(request)
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }
}
```

A timeout is relative ('30 seconds from now'). A deadline is absolute ('complete by 2024-01-15T10:30:00Z'). Deadlines propagate better across service boundaries because they account for time already spent. If the overall deadline is 30 seconds and 10 seconds were spent on network and Service A, Service B knows it has only 20 seconds, not the full 30.
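As a quick illustration of that arithmetic, here is a sketch using the DeadlinePropagation helper above. The httpClient and the service URLs are hypothetical placeholders; the point is that the caller stamps one absolute deadline and every downstream hop derives its local budget from whatever time is left.

```typescript
// Caller: allow 30 s total for the whole chain.
const headers: Record<string, string> = {};
DeadlinePropagation.setDeadline(headers, 30_000);
await httpClient.post('https://service-a.internal/steps/reserve', payload, { headers });

// Inside Service A, some time later, before calling Service B:
const budgetMs = DeadlinePropagation.getRemainingTime(headers); // e.g. ~20 s if 10 s were already spent
if (budgetMs <= 0) {
  throw new DeadlineExceededError('No time left to call Service B');
}
await httpClient.post('https://service-b.internal/steps/charge', payload, {
  headers,                             // forward the same absolute deadline
  timeout: Math.min(budgetMs, 10_000)  // never wait longer than the remaining budget
});
```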
When a service is failing consistently, continuing to send requests makes things worse—it overwhelms the struggling service and delays saga execution. Circuit breakers prevent this cascade.

Circuit Breaker States:

- CLOSED — Normal operation; requests flow through
- OPEN — Service is failing; requests fail immediately without calling the service
- HALF-OPEN — Testing recovery; limited requests allowed through to check whether the service has recovered
```typescript
// Circuit Breaker Implementation for Saga Steps

enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN'
}

interface CircuitBreakerConfig {
  failureThreshold: number;  // Number of failures to open circuit
  successThreshold: number;  // Successes in half-open to close
  openTimeoutMs: number;     // Time before trying half-open
  windowSizeMs: number;      // Rolling window for failure counting
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failureCount: number = 0;
  private successCount: number = 0;
  private lastFailureTime: number = 0;
  private failures: number[] = []; // Timestamps of failures

  constructor(
    private serviceName: string,
    private config: CircuitBreakerConfig = {
      failureThreshold: 5,
      successThreshold: 3,
      openTimeoutMs: 30000,
      windowSizeMs: 60000
    }
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if we should allow the request
    if (!this.shouldAllowRequest()) {
      throw new CircuitOpenError(
        `Circuit breaker for ${this.serviceName} is OPEN`
      );
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private shouldAllowRequest(): boolean {
    switch (this.state) {
      case CircuitState.CLOSED:
        return true;

      case CircuitState.OPEN:
        // Check if we should transition to half-open
        if (Date.now() - this.lastFailureTime >= this.config.openTimeoutMs) {
          this.transitionTo(CircuitState.HALF_OPEN);
          return true;
        }
        return false;

      case CircuitState.HALF_OPEN:
        // Allow limited requests in half-open state
        return true;
    }
  }

  private onSuccess(): void {
    switch (this.state) {
      case CircuitState.CLOSED:
        // Reset failure count on success
        this.pruneOldFailures();
        break;

      case CircuitState.HALF_OPEN:
        this.successCount++;
        if (this.successCount >= this.config.successThreshold) {
          this.transitionTo(CircuitState.CLOSED);
        }
        break;

      case CircuitState.OPEN:
        // Shouldn't happen, but handle gracefully
        break;
    }
  }

  private onFailure(): void {
    this.lastFailureTime = Date.now();
    this.failures.push(this.lastFailureTime);

    switch (this.state) {
      case CircuitState.CLOSED:
        this.pruneOldFailures();
        if (this.failures.length >= this.config.failureThreshold) {
          this.transitionTo(CircuitState.OPEN);
        }
        break;

      case CircuitState.HALF_OPEN:
        // Any failure in half-open reopens the circuit
        this.transitionTo(CircuitState.OPEN);
        break;

      case CircuitState.OPEN:
        // Already open, just update last failure time
        break;
    }
  }

  private pruneOldFailures(): void {
    const cutoff = Date.now() - this.config.windowSizeMs;
    this.failures = this.failures.filter(t => t > cutoff);
  }

  private transitionTo(newState: CircuitState): void {
    console.log(
      `Circuit breaker for ${this.serviceName}: ${this.state} -> ${newState}`
    );
    this.state = newState;
    this.successCount = 0;

    // Emit metrics/alerts on state change
    if (newState === CircuitState.OPEN) {
      this.emitAlert(`Circuit opened for ${this.serviceName}`);
    }
  }

  private emitAlert(message: string): void {
    // Integration with alerting system
    console.warn(`[CIRCUIT_BREAKER] ${message}`);
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Saga step wrapped with circuit breaker
class CircuitProtectedSagaStep<TInput, TOutput> {
  private circuitBreaker: CircuitBreaker;

  constructor(
    private step: SagaStep<TInput, TOutput>,
    private serviceName: string
  ) {
    this.circuitBreaker = new CircuitBreaker(serviceName);
  }

  async execute(input: TInput, context: SagaContext): Promise<TOutput> {
    try {
      return await this.circuitBreaker.execute(
        () => this.step.execute(input, context)
      );
    } catch (error) {
      if (error instanceof CircuitOpenError) {
        // Transform to saga-friendly error
        throw new ServiceUnavailableError(
          `Service ${this.serviceName} is currently unavailable (circuit open)`,
          { isRetryable: true, retryAfterMs: 30000 }
        );
      }
      throw error;
    }
  }
}
```

Some of the most challenging scenarios involve partial failures—when a step executes somewhere between 0% and 100% successfully. These require careful handling to avoid both data loss and duplicate processing.
```typescript
// Strategies for Handling Partial Failures

// 1. IDEMPOTENCY KEYS for "at least once" safety
class IdempotentOperation {
  async execute<T>(
    idempotencyKey: string,
    operation: () => Promise<T>
  ): Promise<T> {
    // Check if already executed
    const existing = await this.idempotencyStore.find(idempotencyKey);

    if (existing) {
      if (existing.status === 'COMPLETED') {
        return existing.result as T;
      }
      if (existing.status === 'IN_PROGRESS') {
        throw new OperationInProgressError(idempotencyKey);
      }
    }

    // Mark as in progress
    await this.idempotencyStore.create({
      key: idempotencyKey,
      status: 'IN_PROGRESS',
      startedAt: new Date()
    });

    try {
      const result = await operation();

      // Mark as completed with result
      await this.idempotencyStore.update(idempotencyKey, {
        status: 'COMPLETED',
        result,
        completedAt: new Date()
      });

      return result;
    } catch (error) {
      // Mark as failed (or delete to allow retry)
      await this.idempotencyStore.update(idempotencyKey, {
        status: 'FAILED',
        error: error.message,
        failedAt: new Date()
      });
      throw error;
    }
  }
}

// 2. VERIFICATION STEP for ambiguous outcomes
class PaymentStepWithVerification {
  private readonly verificationAttempts = 3;
  private readonly verificationDelayMs = 5000;

  async processPayment(orderId: string, amount: number): Promise<PaymentResult> {
    const idempotencyKey = `payment-${orderId}`;

    try {
      return await this.paymentGateway.charge({
        idempotencyKey,
        amount,
        metadata: { orderId }
      });
    } catch (error) {
      if (this.isAmbiguousFailure(error)) {
        // Timeout or network error - payment might have succeeded
        return await this.verifyPaymentStatus(idempotencyKey, orderId);
      }
      throw error;
    }
  }

  private isAmbiguousFailure(error: Error): boolean {
    return error instanceof TimeoutError ||
      error.message.includes('ECONNRESET') ||
      error.message.includes('network error');
  }

  private async verifyPaymentStatus(
    idempotencyKey: string,
    orderId: string
  ): Promise<PaymentResult> {
    for (let attempt = 1; attempt <= this.verificationAttempts; attempt++) {
      await this.sleep(this.verificationDelayMs);

      try {
        // Check if payment actually went through
        const status = await this.paymentGateway.getPaymentByIdempotencyKey(
          idempotencyKey
        );

        if (status) {
          // Payment was processed!
          return status;
        }
      } catch (verifyError) {
        if (attempt === this.verificationAttempts) {
          throw new AmbiguousPaymentStateError(
            `Cannot determine payment status for order ${orderId}`,
            { requiresManualVerification: true }
          );
        }
      }
    }

    // Verification exhausted, payment not found - likely didn't process
    throw new PaymentNotProcessedError(orderId);
  }
}

// 3. BATCH WITH INDIVIDUAL TRACKING
class BatchInventoryReservation {
  async reserveItems(
    orderId: string,
    items: OrderItem[]
  ): Promise<BatchReservationResult> {
    const results: ItemReservationResult[] = [];
    const reservedItems: string[] = [];
    const failedItems: { item: OrderItem; error: Error }[] = [];

    for (const item of items) {
      try {
        const reservationId = await this.inventoryService.reserve(
          item.productId,
          item.quantity,
          orderId
        );

        reservedItems.push(reservationId);
        results.push({
          productId: item.productId,
          status: 'RESERVED',
          reservationId
        });
      } catch (error) {
        failedItems.push({ item, error: error as Error });
        results.push({
          productId: item.productId,
          status: 'FAILED',
          error: error.message
        });
      }
    }

    // Decision: all-or-nothing or partial success?
    if (failedItems.length > 0) {
      // All-or-nothing: release all successful reservations
      for (const reservationId of reservedItems) {
        await this.inventoryService.release(reservationId);
      }

      throw new BatchReservationFailedError(
        `Failed to reserve ${failedItems.length} of ${items.length} items`,
        { failedItems, results }
      );
    }

    return { success: true, reservationIds: reservedItems, results };
  }
}

// 4. SAGA STATE RECOVERY AFTER CRASH
class SagaRecovery {
  async recoverStuckSagas(): Promise<void> {
    // Find sagas that were in progress when system crashed
    const stuckSagas = await this.sagaRepository.findByStatus({
      status: 'IN_PROGRESS',
      lastUpdatedBefore: new Date(Date.now() - 5 * 60 * 1000) // 5 min stale
    });

    for (const saga of stuckSagas) {
      console.log(`Recovering stuck saga: ${saga.id}`);

      try {
        await this.determineSagaState(saga);
      } catch (error) {
        console.error(`Failed to recover saga ${saga.id}:`, error);
        await this.escalateSaga(saga, error as Error);
      }
    }
  }

  private async determineSagaState(saga: SagaInstance): Promise<void> {
    const lastStep = saga.completedSteps[saga.completedSteps.length - 1];

    if (!lastStep) {
      // Saga never started - safe to restart
      await this.restartSaga(saga);
      return;
    }

    // Verify if last step actually completed
    const stepCompleted = await this.verifyStepCompletion(saga, lastStep);

    if (stepCompleted) {
      // Continue from next step
      await this.continueSaga(saga, saga.completedSteps.length);
    } else {
      // Last step is ambiguous - need manual inspection
      await this.escalateSaga(saga, new Error(
        `Ambiguous state at step ${lastStep}`
      ));
    }
  }
}
```

When sagas fail, operators need visibility into what went wrong, where, and why. Comprehensive observability is essential for debugging and maintaining saga systems.
```typescript
// Comprehensive Saga Observability System

interface SagaMetrics {
  sagasStarted: Counter;
  sagasCompleted: Counter;
  sagasFailed: Counter;
  sagasCompensated: Counter;
  sagaDuration: Histogram;
  stepDuration: Histogram;
  stepRetries: Counter;
  compensationDuration: Histogram;
}

class ObservableSagaExecutor {
  private metrics: SagaMetrics;
  private tracer: Tracer;

  async executeSaga<T>(
    sagaId: string,
    definition: SagaDefinition<T>,
    input: T
  ): Promise<SagaResult<T>> {
    // Start distributed trace
    const span = this.tracer.startSpan('saga.execute', {
      attributes: {
        'saga.id': sagaId,
        'saga.type': definition.name,
        'saga.input': JSON.stringify(input)
      }
    });

    this.metrics.sagasStarted.inc({ saga_type: definition.name });
    const startTime = Date.now();

    // Structured logging
    this.logger.info('Saga started', {
      sagaId,
      sagaType: definition.name,
      input: this.sanitizeForLogging(input)
    });

    // Declared outside try so the compensation path can see the latest step output
    let data = input;

    try {
      for (let i = 0; i < definition.steps.length; i++) {
        const step = definition.steps[i];
        data = await this.executeStep(sagaId, step, data, i + 1, span);
      }

      // Success metrics
      const duration = Date.now() - startTime;
      this.metrics.sagasCompleted.inc({ saga_type: definition.name });
      this.metrics.sagaDuration.observe(
        { saga_type: definition.name, outcome: 'success' },
        duration
      );

      this.logger.info('Saga completed', {
        sagaId,
        sagaType: definition.name,
        durationMs: duration
      });

      span.setStatus({ code: SpanStatusCode.OK });
      span.end();

      return { success: true, data };
    } catch (error) {
      const duration = Date.now() - startTime;

      this.metrics.sagasFailed.inc({
        saga_type: definition.name,
        failure_step: (error as SagaError).stepName || 'unknown'
      });
      this.metrics.sagaDuration.observe(
        { saga_type: definition.name, outcome: 'failed' },
        duration
      );

      this.logger.error('Saga failed', {
        sagaId,
        sagaType: definition.name,
        error: error.message,
        stack: error.stack,
        durationMs: duration
      });

      // Attempt compensation
      const compensationResult = await this.compensateSaga(
        sagaId,
        definition,
        data,
        span
      );

      if (compensationResult.success) {
        this.metrics.sagasCompensated.inc({ saga_type: definition.name });
      }

      span.recordException(error as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.end();

      return {
        success: false,
        error: error.message,
        compensated: compensationResult.success
      };
    }
  }

  private async executeStep<T>(
    sagaId: string,
    step: SagaStep<T, T>,
    data: T,
    stepNumber: number,
    parentSpan: Span
  ): Promise<T> {
    const stepSpan = this.tracer.startSpan(`saga.step.${step.name}`, {
      parent: parentSpan,
      attributes: {
        'saga.step.name': step.name,
        'saga.step.number': stepNumber
      }
    });

    const startTime = Date.now();
    let attempt = 0;

    while (true) {
      attempt++;

      try {
        const result = await step.execute(data, { sagaId, stepNumber });

        const duration = Date.now() - startTime;
        this.metrics.stepDuration.observe(
          { saga_step: step.name, outcome: 'success' },
          duration
        );

        this.logger.debug('Step completed', {
          sagaId,
          stepName: step.name,
          attempt,
          durationMs: duration
        });

        stepSpan.end();
        return result;
      } catch (error) {
        if (this.isRetryable(error) && attempt < step.maxRetries) {
          this.metrics.stepRetries.inc({ saga_step: step.name });

          this.logger.warn('Step failed, retrying', {
            sagaId,
            stepName: step.name,
            attempt,
            maxRetries: step.maxRetries,
            error: error.message
          });

          await this.backoff(attempt);
          continue;
        }

        const duration = Date.now() - startTime;
        this.metrics.stepDuration.observe(
          { saga_step: step.name, outcome: 'failed' },
          duration
        );

        stepSpan.recordException(error as Error);
        stepSpan.end();

        throw Object.assign(error, { stepName: step.name });
      }
    }
  }
}

// Alert Conditions
class SagaAlertManager {
  private alertRules: AlertRule[] = [
    {
      name: 'high_saga_failure_rate',
      condition: (metrics) =>
        metrics.sagasFailed.rate('5m') / metrics.sagasStarted.rate('5m') > 0.1,
      severity: 'critical',
      message: 'Saga failure rate exceeds 10% in last 5 minutes'
    },
    {
      name: 'stuck_sagas',
      condition: async (metrics, repo) =>
        (await repo.countByStatus('IN_PROGRESS', { staleMins: 30 })) > 10,
      severity: 'warning',
      message: 'More than 10 sagas stuck for over 30 minutes'
    },
    {
      name: 'compensation_failures',
      condition: (metrics) => metrics.compensationFailures.count('1h') > 5,
      severity: 'critical',
      message: 'Multiple compensation failures in last hour - data may be inconsistent'
    },
    {
      name: 'circuit_breaker_open',
      condition: (_, services) =>
        services.some(s => s.circuitBreaker.getState() === 'OPEN'),
      severity: 'warning',
      message: 'One or more service circuit breakers are open'
    }
  ];
}
```

When all automatic recovery attempts fail, sagas must have a graceful path to human intervention. Dead Letter Queues (DLQs) provide this mechanism.
```typescript
// Dead Letter Queue and Manual Intervention System

interface DeadLetterSaga {
  sagaId: string;
  sagaType: string;
  failedStep: string;
  error: string;
  data: unknown;
  completedSteps: string[];
  failedAt: Date;
  attempts: number;
  lastAttemptAt: Date;
  status: 'PENDING_REVIEW' | 'IN_REVIEW' | 'RESOLVED' | 'ABANDONED';
  assignedTo?: string;
  resolution?: SagaResolution;
}

type SagaResolution =
  | { type: 'RETRY'; notes: string }
  | { type: 'MANUAL_COMPLETE'; notes: string }
  | { type: 'MANUAL_COMPENSATE'; notes: string }
  | { type: 'ABANDON'; reason: string };

class DeadLetterQueue {
  async moveToDeadLetter(saga: SagaInstance, error: Error): Promise<void> {
    const dlSaga: DeadLetterSaga = {
      sagaId: saga.id,
      sagaType: saga.type,
      failedStep: saga.currentStep,
      error: error.message,
      data: saga.data,
      completedSteps: saga.completedSteps,
      failedAt: new Date(),
      attempts: saga.attempts,
      lastAttemptAt: new Date(),
      status: 'PENDING_REVIEW'
    };

    await this.deadLetterRepository.create(dlSaga);

    // Update original saga status
    await this.sagaRepository.update(saga.id, {
      status: 'IN_DLQ',
      movedToDlqAt: new Date()
    });

    // Create incident ticket
    await this.ticketService.create({
      type: 'SAGA_DEAD_LETTER',
      priority: 'HIGH',
      title: `Saga ${saga.id} requires manual intervention`,
      description: `
        Saga Type: ${saga.type}
        Failed Step: ${saga.currentStep}
        Error: ${error.message}
        Completed Steps: ${saga.completedSteps.join(' -> ')}
        Data: ${JSON.stringify(saga.data, null, 2)}
      `,
      sagaId: saga.id
    });

    // Alert on-call
    await this.alertService.page({
      severity: 'high',
      message: `Saga moved to DLQ: ${saga.type}.${saga.currentStep}`,
      context: { sagaId: saga.id }
    });
  }

  async retryFromDlq(sagaId: string, operatorId: string): Promise<void> {
    const dlSaga = await this.deadLetterRepository.findById(sagaId);
    if (!dlSaga) {
      throw new NotFoundError(`Dead letter saga ${sagaId} not found`);
    }

    // Log the manual intervention
    await this.auditLog.record({
      action: 'DLQ_RETRY',
      sagaId,
      operatorId,
      timestamp: new Date()
    });

    // Update status
    await this.deadLetterRepository.update(sagaId, {
      status: 'IN_REVIEW',
      assignedTo: operatorId
    });

    // Attempt to resume saga
    const originalSaga = await this.sagaRepository.findById(sagaId);

    try {
      await this.sagaExecutor.resumeSaga(originalSaga);

      // Success - mark as resolved
      await this.deadLetterRepository.update(sagaId, {
        status: 'RESOLVED',
        resolution: { type: 'RETRY', notes: `Retried by ${operatorId}` }
      });
    } catch (retryError) {
      // Still failing - update DLQ record
      await this.deadLetterRepository.update(sagaId, {
        status: 'PENDING_REVIEW',
        lastAttemptAt: new Date(),
        attempts: dlSaga.attempts + 1,
        error: retryError.message
      });
      throw retryError;
    }
  }

  async manuallyComplete(
    sagaId: string,
    operatorId: string,
    notes: string
  ): Promise<void> {
    // Mark saga as complete without executing remaining steps
    await this.sagaRepository.update(sagaId, {
      status: 'MANUALLY_COMPLETED',
      completedAt: new Date(),
      completedBy: operatorId,
      notes
    });

    await this.deadLetterRepository.update(sagaId, {
      status: 'RESOLVED',
      resolution: { type: 'MANUAL_COMPLETE', notes }
    });

    await this.auditLog.record({
      action: 'MANUAL_SAGA_COMPLETE',
      sagaId,
      operatorId,
      notes,
      timestamp: new Date()
    });
  }

  async manuallyCompensate(
    sagaId: string,
    operatorId: string,
    notes: string
  ): Promise<void> {
    const saga = await this.sagaRepository.findById(sagaId);

    // Execute compensations for completed steps, most recent first
    for (const stepName of saga.completedSteps.reverse()) {
      const step = this.getStepDefinition(saga.type, stepName);
      await step.compensate({ sagaId, ...saga.data });
    }

    await this.sagaRepository.update(sagaId, {
      status: 'MANUALLY_COMPENSATED',
      compensatedAt: new Date(),
      compensatedBy: operatorId
    });

    await this.deadLetterRepository.update(sagaId, {
      status: 'RESOLVED',
      resolution: { type: 'MANUAL_COMPENSATE', notes }
    });
  }
}

// Admin UI for DLQ management
class DlqDashboard {
  async getStats(): Promise<DlqStats> {
    const pending = await this.dlqRepo.countByStatus('PENDING_REVIEW');
    const inReview = await this.dlqRepo.countByStatus('IN_REVIEW');
    const byType = await this.dlqRepo.groupBy('sagaType');
    const avgAge = await this.dlqRepo.avgAgeMinutes('PENDING_REVIEW');

    return { pending, inReview, byType, avgAgeMins: avgAge };
  }

  async getQueue(filters: DlqFilters): Promise<DeadLetterSaga[]> {
    return this.dlqRepo.find({
      status: filters.status,
      sagaType: filters.sagaType,
      olderThan: filters.olderThan,
      orderBy: 'failedAt',
      limit: 50
    });
  }
}
```

Dead Letter Queues should be actively monitored and processed. A growing DLQ indicates systemic problems. Track DLQ depth as a key operational metric and set alerts when it grows beyond thresholds.
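A minimal sketch of that threshold check, reusing the DlqDashboard above. The DlqDepthMonitor class, the alert payload shape, and the threshold values are illustrative assumptions, not part of the pattern itself.

```typescript
// Hypothetical periodic check; thresholds are illustrative, not prescriptive.
class DlqDepthMonitor {
  constructor(
    private dashboard: DlqDashboard,
    private alertService: AlertService,
    private maxPending = 25,       // alert when the backlog grows past this
    private maxAvgAgeMins = 60     // or when items sit unreviewed too long
  ) {}

  async check(): Promise<void> {
    const stats = await this.dashboard.getStats();

    if (stats.pending > this.maxPending || stats.avgAgeMins > this.maxAvgAgeMins) {
      await this.alertService.warn({
        type: 'DLQ_BACKLOG',
        message: `DLQ backlog: ${stats.pending} pending, avg age ${Math.round(stats.avgAgeMins)} min`,
        context: { byType: stats.byType }
      });
    }
  }
}
```

Run a check like this on a schedule (for example, every few minutes) so a growing backlog surfaces as an alert rather than as a pile of stale tickets.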
Handling failures gracefully is what separates production-grade saga implementations from toy examples. The essential takeaways: classify failures before reacting to them, retry only what is genuinely transient (with backoff and jitter), bound every wait with timeouts and propagated deadlines, shed load with circuit breakers when a dependency is struggling, verify ambiguous outcomes instead of guessing, instrument sagas end to end, and give operators a dead letter path when automation runs out.
What's Next:

We've covered the theory and failure handling. The final page brings everything together with Saga Implementation—production-ready patterns, framework recommendations, and real-world deployment considerations.
You now understand how to make sagas resilient to the inevitable failures of distributed systems. The patterns covered here—retry strategies, circuit breakers, timeout management, and dead letter queues—are the difference between sagas that work in dev and sagas that work in production.