Robust systems don't just detect and report errors—they recover from them. While users see a momentary pause, behind the scenes your system retries failed operations, activates fallback mechanisms, isolates failing components, and maintains service continuity despite component failures.
Error recovery is the difference between a system that requires human intervention for every hiccup and a system that handles transient failures autonomously, pages engineers only for genuine outages, and maintains high availability even when dependencies fail.
This page examines the full spectrum of recovery strategies: from simple retries to sophisticated circuit breakers, from graceful degradation to compensating transactions. You'll learn when each strategy applies, how to implement them correctly, and how to combine them into defense-in-depth resilience.
By the end of this page, you will understand:

- Retry patterns with exponential backoff and jitter
- Circuit breaker implementation and configuration
- Fallback strategies and graceful degradation
- Compensation and rollback for failed distributed operations
- Timeout management and deadline propagation
- A decision framework for choosing appropriate recovery strategies
Not all errors can or should be recovered from. Before implementing recovery logic, you must classify errors and match them to appropriate strategies.
Error Classification for Recovery
Errors fall into categories based on their recoverability:
| Category | Characteristics | Recovery Approach | Examples |
|---|---|---|---|
| Transient | Temporary condition that will likely resolve on retry | Retry with backoff | Network timeout, connection reset, 503 Service Unavailable |
| Retriable After Action | Failed but may succeed after taking a specific action | Action + Retry | Token expired (refresh + retry), rate limit (wait + retry) |
| Non-Retriable | Will fail the same way on retry; different approach needed | Fallback or fail | Invalid input, resource not found, permission denied |
| Systemic | Component or dependency is down; retries won't help soon | Circuit break, fallback | Database down, external service outage |
| Degraded | Full service unavailable, but partial service possible | Graceful degradation | Some features unavailable, cached data acceptable |
| Compensable | Operation partially succeeded; needs rollback | Compensation transaction | Payment succeeded but inventory reservation failed |
```typescript
/**
 * Error classification system that determines appropriate recovery strategy.
 */

// Backoff configuration shape (inferred from its usage below).
interface BackoffConfig {
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitter: number;
}

interface RecoveryDecision {
  strategy: 'retry' | 'retry-with-action' | 'fallback' | 'circuit-break' | 'fail' | 'compensate';
  maxRetries?: number;
  backoffConfig?: BackoffConfig;
  fallbackType?: 'cached' | 'default' | 'degraded' | 'none';
  action?: () => Promise<void>;
  compensationRequired?: boolean;
}

/**
 * Classify an error and determine recovery strategy.
 */
function classifyError(error: Error, context: OperationContext): RecoveryDecision {
  // ============================================
  // TRANSIENT ERRORS: Retry with backoff
  // ============================================
  if (isTransientError(error)) {
    return {
      strategy: 'retry',
      maxRetries: 3,
      backoffConfig: { initialDelayMs: 100, maxDelayMs: 5000, multiplier: 2, jitter: 0.2 }
    };
  }

  // ============================================
  // AUTHENTICATION ERRORS: Refresh and retry
  // ============================================
  if (isAuthenticationExpiredError(error)) {
    return {
      strategy: 'retry-with-action',
      maxRetries: 1,
      action: async () => {
        await refreshAuthenticationToken();
      }
    };
  }

  // ============================================
  // RATE LIMITING: Wait and retry
  // ============================================
  if (isRateLimitError(error)) {
    const retryAfter = extractRetryAfterHeader(error);
    return {
      strategy: 'retry',
      maxRetries: 2,
      backoffConfig: {
        initialDelayMs: (retryAfter || 60) * 1000,
        maxDelayMs: 120000,
        multiplier: 1,
        jitter: 0.1
      }
    };
  }

  // ============================================
  // EXTERNAL SERVICE DOWN: Circuit break and fallback
  // ============================================
  if (isExternalServiceError(error)) {
    return {
      strategy: 'circuit-break',
      fallbackType: context.allowDegradedMode ? 'cached' : 'none'
    };
  }

  // ============================================
  // VALIDATION ERRORS: No retry, fail fast
  // ============================================
  if (isValidationError(error) || isNotFoundError(error) || isPermissionError(error)) {
    return { strategy: 'fail', fallbackType: 'none' };
  }

  // ============================================
  // PARTIAL FAILURE: Compensate
  // ============================================
  if (isPartialFailureError(error)) {
    return { strategy: 'compensate', compensationRequired: true };
  }

  // ============================================
  // UNKNOWN: Conservative default
  // ============================================
  return { strategy: 'fail', fallbackType: 'none' };
}

// Classification helper functions
function isTransientError(error: Error): boolean {
  if ('isTransient' in error && (error as any).isTransient) return true;

  // Network-level errors
  if (error.message.includes('ECONNRESET')) return true;
  if (error.message.includes('ETIMEDOUT')) return true;
  if (error.message.includes('ECONNREFUSED')) return true;

  // HTTP errors that are typically transient
  if ('statusCode' in error) {
    const status = (error as any).statusCode;
    if (status === 502 || status === 503 || status === 504) return true;
    if (status === 429) return true; // Rate limit
  }

  return false;
}

function isAuthenticationExpiredError(error: Error): boolean {
  if ('statusCode' in error && (error as any).statusCode === 401) return true;
  if (error.message.includes('token expired')) return true;
  if (error.message.includes('jwt expired')) return true;
  return false;
}

function isRateLimitError(error: Error): boolean {
  if ('statusCode' in error && (error as any).statusCode === 429) return true;
  if (error.name === 'RateLimitException') return true;
  return false;
}

function isExternalServiceError(error: Error): boolean {
  return error instanceof ExternalServiceException;
}

function isValidationError(error: Error): boolean {
  return error instanceof ValidationException ||
    ('statusCode' in error && (error as any).statusCode === 400);
}
```

Before implementing retry logic, ensure the operation is idempotent—performing it multiple times has the same effect as performing it once. Non-idempotent operations (payments, order creation) need idempotency keys or careful design to prevent duplicate effects when retried.
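To make the idempotency requirement concrete, here is a minimal sketch of an idempotency-key wrapper. The `idempotencyStore` and its `get`/`set` methods are hypothetical (any key-value store with atomic set-if-absent would do); the key format is also an illustrative assumption.

```typescript
// Sketch: make a non-idempotent operation safe to retry by recording results
// under a caller-supplied idempotency key. `idempotencyStore` is a hypothetical
// key-value store; concurrent callers would also need an atomic set-if-absent.
async function withIdempotencyKey<T>(
  key: string,
  operation: () => Promise<T>
): Promise<T> {
  // If this key was already processed, return the recorded result instead of re-executing
  const existing = await idempotencyStore.get(key);
  if (existing !== undefined) {
    return existing as T;
  }

  // Execute once and persist the outcome before returning it
  const result = await operation();
  await idempotencyStore.set(key, result);
  return result;
}

// Usage: the same key is reused across retries, so at most one charge occurs
// await withIdempotencyKey(`charge:${orderId}`, () => paymentClient.charge(order));
```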
Retrying is the simplest recovery strategy—attempt the operation again in hope of success. But naive retry implementation can cause more harm than good: retry storms that overwhelm recovering services, or immediate retries that fail before transient conditions resolve.
Exponential Backoff with Jitter
The gold standard for retry timing is exponential backoff with jitter:
```typescript
/**
 * Production-ready retry implementation with exponential backoff and jitter.
 */
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  jitter: number;                       // 0-1: portion of delay to randomize
  retryOn?: (error: Error) => boolean;  // Which errors to retry
}

interface RetryResult<T> {
  success: boolean;
  result?: T;
  error?: Error;
  attempts: number;
  totalDurationMs: number;
}

/**
 * Execute an operation with configurable retry behavior.
 */
async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context?: { correlationId: string; operationName: string }
): Promise<RetryResult<T>> {
  const startTime = Date.now();
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const result = await operation();

      // Log successful retry if this wasn't the first attempt
      if (attempt > 0) {
        logger.info('Operation succeeded after retry', {
          ...context,
          attempt: attempt + 1,
          totalAttempts: attempt + 1,
          durationMs: Date.now() - startTime
        });
      }

      return { success: true, result, attempts: attempt + 1, totalDurationMs: Date.now() - startTime };
    } catch (error) {
      lastError = error instanceof Error ? error : new Error(String(error));

      // Check if this error is retryable
      const shouldRetry = config.retryOn ? config.retryOn(lastError) : isTransientError(lastError);

      if (!shouldRetry || attempt >= config.maxRetries) {
        // No more retries: log and return failure
        logger.warn('Operation failed after retries', {
          ...context,
          attempt: attempt + 1,
          totalAttempts: attempt + 1,
          error: lastError.message,
          willRetry: false
        });

        return { success: false, error: lastError, attempts: attempt + 1, totalDurationMs: Date.now() - startTime };
      }

      // Calculate delay with exponential backoff and jitter
      const delay = calculateDelay(attempt, config);

      logger.debug('Retrying operation', {
        ...context,
        attempt: attempt + 1,
        nextAttempt: attempt + 2,
        delayMs: delay,
        error: lastError.message
      });

      await sleep(delay);
    }
  }

  // Should not reach here, but TypeScript needs this
  return { success: false, error: lastError, attempts: config.maxRetries + 1, totalDurationMs: Date.now() - startTime };
}

/**
 * Calculate delay with exponential backoff and jitter.
 */
function calculateDelay(attempt: number, config: RetryConfig): number {
  // Exponential backoff
  const exponentialDelay = config.initialDelayMs * Math.pow(config.multiplier, attempt);

  // Cap at maximum
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

  // Add jitter (randomize ±jitter% of the delay)
  const jitterRange = cappedDelay * config.jitter;
  const jitter = (Math.random() * 2 - 1) * jitterRange;

  return Math.max(0, Math.round(cappedDelay + jitter));
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage example
const result = await withRetry(
  () => externalPaymentService.processPayment(paymentRequest),
  {
    maxRetries: 3,
    initialDelayMs: 200,
    maxDelayMs: 5000,
    multiplier: 2,
    jitter: 0.25,
    retryOn: (error) => {
      // Only retry on transient errors, not validation failures
      return isTransientError(error) && !isValidationError(error);
    }
  },
  { correlationId, operationName: 'ProcessPayment' }
);

if (!result.success) {
  // All retries exhausted - handle the failure
  throw new PaymentProcessingException('Payment failed after multiple attempts', result.error);
}
```

The Retry Budget Pattern
For high-volume systems, individual retry configs aren't enough. A retry budget limits the total percentage of requests that can be retries, preventing retry amplification during outages:
```typescript
/**
 * Retry budget prevents retry storms by limiting the fraction of
 * requests that can be retries.
 */
class RetryBudget {
  private window: { timestamp: number; isRetry: boolean }[] = [];
  private readonly windowSizeMs = 10000; // 10-second window
  private readonly maxRetryRatio: number;

  constructor(maxRetryRatio: number = 0.2) {
    // Maximum 20% of traffic can be retries
    this.maxRetryRatio = maxRetryRatio;
  }

  /**
   * Record a request attempt.
   */
  recordAttempt(isRetry: boolean): void {
    const now = Date.now();
    this.window.push({ timestamp: now, isRetry });
    this.pruneOldEntries(now);
  }

  /**
   * Check if we have budget for another retry.
   */
  canRetry(): boolean {
    this.pruneOldEntries(Date.now());
    if (this.window.length === 0) return true;

    const retries = this.window.filter(e => e.isRetry).length;
    const total = this.window.length;
    const currentRatio = retries / total;

    return currentRatio < this.maxRetryRatio;
  }

  private pruneOldEntries(now: number): void {
    const cutoff = now - this.windowSizeMs;
    this.window = this.window.filter(e => e.timestamp >= cutoff);
  }
}

// Integration with retry logic
const retryBudget = new RetryBudget(0.2);

async function withBudgetedRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig
): Promise<RetryResult<T>> {
  let attempt = 0;
  let lastError: Error | undefined;

  while (attempt <= config.maxRetries) {
    const isRetry = attempt > 0;

    // Check budget before retrying
    if (isRetry && !retryBudget.canRetry()) {
      logger.warn('Retry budget exhausted, failing without retry');
      return {
        success: false,
        error: lastError || new Error('Retry budget exhausted'),
        attempts: attempt,
        totalDurationMs: 0
      };
    }

    retryBudget.recordAttempt(isRetry);

    try {
      const result = await operation();
      return { success: true, result, attempts: attempt + 1, totalDurationMs: 0 };
    } catch (error) {
      lastError = error as Error;
      attempt++;

      if (attempt <= config.maxRetries) {
        await sleep(calculateDelay(attempt - 1, config));
      }
    }
  }

  return { success: false, error: lastError, attempts: attempt, totalDurationMs: 0 };
}
```

When a dependency is down, continuing to send requests wastes resources and delays failure detection. The circuit breaker pattern prevents this by tracking failure rates and "opening the circuit" when a threshold is exceeded, failing fast without attempting the operation.
Circuit Breaker States
```typescript
/**
 * Production-ready circuit breaker implementation.
 */
interface CircuitBreakerConfig {
  failureThreshold: number;  // Failures to open circuit
  successThreshold: number;  // Successes in half-open to close
  halfOpenRequests: number;  // Max requests in half-open state
  resetTimeoutMs: number;    // How long circuit stays open
  rollingWindowMs: number;   // Time window for counting failures
}

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = [];
  private halfOpenSuccesses = 0;
  private halfOpenFailures = 0;
  private halfOpenRequests = 0;
  private lastOpenedAt?: number;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {}

  /**
   * Execute operation through the circuit breaker.
   */
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if circuit allows this request
    this.evaluateState();

    if (this.state === 'OPEN') {
      throw new CircuitOpenException(this.name, this.getTimeUntilReset());
    }

    if (this.state === 'HALF_OPEN') {
      if (this.halfOpenRequests >= this.config.halfOpenRequests) {
        throw new CircuitOpenException(this.name, this.getTimeUntilReset());
      }
      this.halfOpenRequests++;
    }

    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  /**
   * Record a successful operation.
   */
  private recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.config.successThreshold) {
        this.transitionTo('CLOSED');
      }
    }
    // In CLOSED state, success doesn't affect failure count
  }

  /**
   * Record a failed operation.
   */
  private recordFailure(): void {
    const now = Date.now();

    if (this.state === 'CLOSED') {
      this.failures.push(now);
      this.pruneOldFailures(now);

      if (this.failures.length >= this.config.failureThreshold) {
        this.transitionTo('OPEN');
      }
    } else if (this.state === 'HALF_OPEN') {
      this.halfOpenFailures++;
      // Any failure in half-open immediately re-opens
      this.transitionTo('OPEN');
    }
  }

  /**
   * Evaluate and potentially transition state.
   */
  private evaluateState(): void {
    if (this.state === 'OPEN') {
      const elapsed = Date.now() - (this.lastOpenedAt || 0);
      if (elapsed >= this.config.resetTimeoutMs) {
        this.transitionTo('HALF_OPEN');
      }
    }
  }

  private transitionTo(newState: CircuitState): void {
    const oldState = this.state;
    this.state = newState;

    logger.info('Circuit breaker state transition', {
      circuitName: this.name,
      oldState,
      newState,
      failures: this.failures.length
    });

    if (newState === 'OPEN') {
      this.lastOpenedAt = Date.now();
      metrics.increment('circuit_breaker.opened', { circuit: this.name });
    } else if (newState === 'CLOSED') {
      this.failures = [];
      this.halfOpenSuccesses = 0;
      this.halfOpenFailures = 0;
      this.halfOpenRequests = 0;
      metrics.increment('circuit_breaker.closed', { circuit: this.name });
    } else if (newState === 'HALF_OPEN') {
      this.halfOpenSuccesses = 0;
      this.halfOpenFailures = 0;
      this.halfOpenRequests = 0;
      metrics.increment('circuit_breaker.half_open', { circuit: this.name });
    }
  }

  private pruneOldFailures(now: number): void {
    const cutoff = now - this.config.rollingWindowMs;
    this.failures = this.failures.filter(ts => ts >= cutoff);
  }

  private getTimeUntilReset(): number {
    if (!this.lastOpenedAt) return 0;
    return Math.max(0, this.config.resetTimeoutMs - (Date.now() - this.lastOpenedAt));
  }

  /**
   * Get current circuit state for monitoring.
   */
  getState(): { state: CircuitState; failures: number; halfOpenSuccesses: number } {
    return {
      state: this.state,
      failures: this.failures.length,
      halfOpenSuccesses: this.halfOpenSuccesses
    };
  }
}

class CircuitOpenException extends Error {
  constructor(
    public readonly circuitName: string,
    public readonly resetInMs: number
  ) {
    super(`Circuit ${circuitName} is open. Retry after ${resetInMs}ms`);
    this.name = 'CircuitOpenException';
  }
}

// Usage: One circuit breaker per external dependency
const paymentServiceCircuit = new CircuitBreaker('payment-service', {
  failureThreshold: 5,    // Open after 5 failures
  successThreshold: 2,    // Close after 2 successes in half-open
  halfOpenRequests: 3,    // Allow 3 test requests in half-open
  resetTimeoutMs: 30000,  // Stay open for 30 seconds
  rollingWindowMs: 60000  // Count failures in 60-second window
});

async function processPayment(request: PaymentRequest): Promise<PaymentResult> {
  try {
    return await paymentServiceCircuit.execute(() =>
      paymentClient.process(request)
    );
  } catch (error) {
    if (error instanceof CircuitOpenException) {
      // Fast failure - use fallback
      logger.warn('Payment service circuit open, using fallback');
      return handlePaymentFallback(request, error);
    }
    throw error;
  }
}
```

Create separate circuit breakers for each external dependency, not one global breaker. This ensures that a problem with the payment service doesn't affect calls to the inventory service. For multiple instances of the same service, consider whether to use per-instance or per-service circuits based on your load balancing strategy.
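One way to keep circuits isolated per dependency is a small registry keyed by dependency name. The sketch below assumes the `CircuitBreaker` and `CircuitBreakerConfig` types defined above; the registry class itself and its default settings are illustrative, not part of the original example.

```typescript
// Sketch: per-dependency circuit breakers, created lazily on first use.
// Assumes the CircuitBreaker class from the example above.
class CircuitBreakerRegistry {
  private breakers = new Map<string, CircuitBreaker>();

  constructor(private readonly defaults: CircuitBreakerConfig) {}

  // Return the breaker for a dependency, creating it with defaults (plus overrides) if needed
  getOrCreate(dependencyName: string, overrides?: Partial<CircuitBreakerConfig>): CircuitBreaker {
    let breaker = this.breakers.get(dependencyName);
    if (!breaker) {
      breaker = new CircuitBreaker(dependencyName, { ...this.defaults, ...overrides });
      this.breakers.set(dependencyName, breaker);
    }
    return breaker;
  }
}

// Usage: payment and inventory calls trip independently
const circuits = new CircuitBreakerRegistry({
  failureThreshold: 5,
  successThreshold: 2,
  halfOpenRequests: 3,
  resetTimeoutMs: 30000,
  rollingWindowMs: 60000
});

// await circuits.getOrCreate('inventory-service').execute(() => inventoryClient.check(items));
```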
When operations can't complete normally, fallbacks provide alternative behavior that maintains service continuity, possibly with reduced functionality.
Types of Fallback Strategies
| Strategy | Behavior | Use When | Trade-offs |
|---|---|---|---|
| Cached Value | Return previously cached successful response | Data staleness is acceptable for short periods | May return outdated information |
| Default Value | Return a predetermined default | Safe default exists; partial is better than nothing | May not reflect actual state |
| Degraded Feature | Disable non-essential feature, continue with core | Feature is enhancement, not critical | Reduced functionality |
| Alternative Service | Use backup provider or read replica | Redundant systems exist | May have different performance or cost |
| Queue for Later | Accept request, process asynchronously when restored | Operation can be eventually consistent | Delay in completion; needs queue infrastructure |
| Static Response | Return static/pre-computed response | Dynamic data not critical | Same response for all, may be inappropriate |
```typescript
/**
 * Comprehensive fallback implementation patterns.
 */

/**
 * Operation wrapper that supports multiple fallback strategies.
 */
class ResilientOperation<T> {
  private fallbacks: Array<{
    name: string;
    condition?: (error: Error) => boolean;
    handler: (error: Error) => Promise<T> | T;
  }> = [];

  constructor(
    private readonly operation: () => Promise<T>,
    private readonly operationName: string
  ) {}

  /**
   * Add a fallback strategy.
   */
  withFallback(
    name: string,
    handler: (error: Error) => Promise<T> | T,
    condition?: (error: Error) => boolean
  ): this {
    this.fallbacks.push({ name, handler, condition });
    return this;
  }

  /**
   * Execute with fallback chain.
   */
  async execute(): Promise<T> {
    try {
      return await this.operation();
    } catch (primaryError) {
      logger.warn(`Primary operation failed: ${this.operationName}`, {
        error: primaryError instanceof Error ? primaryError.message : String(primaryError)
      });

      // Try fallbacks in order
      for (const fallback of this.fallbacks) {
        // Check condition if specified
        if (fallback.condition && !fallback.condition(primaryError as Error)) {
          continue;
        }

        try {
          logger.info(`Attempting fallback: ${fallback.name}`);
          const result = await fallback.handler(primaryError as Error);

          metrics.increment('operation.fallback.success', {
            operation: this.operationName,
            fallback: fallback.name
          });

          return result;
        } catch (fallbackError) {
          logger.warn(`Fallback failed: ${fallback.name}`, {
            error: fallbackError instanceof Error ? fallbackError.message : String(fallbackError)
          });
          // Continue to next fallback
        }
      }

      // All fallbacks exhausted
      throw primaryError;
    }
  }
}

// ============================================
// EXAMPLE: Product recommendations with fallbacks
// ============================================

interface ProductRecommendations {
  products: Product[];
  source: 'ml-service' | 'cached' | 'popular' | 'empty';
}

async function getRecommendations(userId: string): Promise<ProductRecommendations> {
  return new ResilientOperation(
    // Primary: Real-time ML recommendations
    async () => ({
      products: await mlRecommendationService.getPersonalized(userId),
      source: 'ml-service' as const
    }),
    'getRecommendations'
  )
    // Fallback 1: Cached recommendations for this user
    .withFallback('cached-personal', async () => {
      const cached = await cache.get(`recommendations:${userId}`);
      if (!cached) throw new Error('No cached recommendations');
      return { products: cached, source: 'cached' as const };
    })
    // Fallback 2: Popular products (same for all users)
    .withFallback('popular-products', async () => ({
      products: await productService.getPopular(10),
      source: 'popular' as const
    }))
    // Fallback 3: Empty list (graceful degradation)
    .withFallback('empty', () => ({
      products: [],
      source: 'empty' as const
    }))
    .execute();
}

// ============================================
// EXAMPLE: Configuration with fallback chain
// ============================================

async function getConfiguration(key: string): Promise<ConfigValue> {
  const configService = new ResilientOperation(
    // Primary: Remote config service
    () => remoteConfigClient.get(key),
    'getConfiguration'
  )
    // Fallback 1: Local cache
    .withFallback('local-cache', async () => {
      const cached = localConfigCache.get(key);
      if (!cached) throw new Error('Not in cache');
      return cached;
    })
    // Fallback 2: Environment variable
    .withFallback('env-var', () => {
      const envValue = process.env[`CONFIG_${key.toUpperCase()}`];
      if (!envValue) throw new Error('Not in env');
      return { value: envValue, source: 'environment' };
    })
    // Fallback 3: Compiled default
    .withFallback('default', () => {
      const defaultValue = DEFAULT_CONFIG[key];
      if (defaultValue === undefined) throw new Error('No default');
      return { value: defaultValue, source: 'default' };
    });

  return configService.execute();
}

// ============================================
// EXAMPLE: Queue-based fallback for writes
// ============================================

async function saveUserPreference(userId: string, pref: Preference): Promise<void> {
  try {
    await preferenceService.save(userId, pref);
  } catch (error) {
    if (isTransientError(error as Error)) {
      // Queue for later processing
      await deferredOperationQueue.enqueue({
        type: 'save-preference',
        payload: { userId, preference: pref },
        maxRetries: 5,
        expiresAt: Date.now() + 24 * 60 * 60 * 1000 // 24 hours
      });

      logger.info('Preference save queued for later', { userId });
      // Return success - from user's perspective, operation accepted
      return;
    }
    throw error;
  }
}
```

When returning fallback data, signal this to the caller so they can display appropriately. A recommendation marked 'source: popular' can show '...' instead of 'Recommended for you'. This maintains honesty with users while providing continuity.
Every external call should have a timeout. Without timeouts, a single hung dependency can exhaust your connection pool, block your thread pool, and cascade into complete system failure.
The Deadline Propagation Pattern
In distributed systems, a top-level request timeout must be propagated to all downstream calls. If a user request has 3 seconds to complete, and you've already spent 2 seconds, downstream calls must know they have only 1 second remaining.
```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

/**
 * Deadline propagation for timeout management across service calls.
 */

/**
 * Deadline context that flows through the request.
 */
interface DeadlineContext {
  deadlineEpochMs: number;
  operationStart: number;
  parentOperationName?: string;
}

const deadlineStorage = new AsyncLocalStorage<DeadlineContext>();

/**
 * Get remaining time until deadline.
 */
function getTimeRemaining(): number {
  const ctx = deadlineStorage.getStore();
  if (!ctx) return Infinity; // No deadline set
  return Math.max(0, ctx.deadlineEpochMs - Date.now());
}

/**
 * Check if deadline has passed.
 */
function isDeadlineExceeded(): boolean {
  return getTimeRemaining() <= 0;
}

/**
 * Middleware that extracts or creates deadline context.
 */
function deadlineMiddleware(req: Request, res: Response, next: NextFunction) {
  // Try to get deadline from upstream caller
  const upstreamDeadline = req.headers['x-deadline-ms'];

  // Or use default timeout for this service
  const deadline = upstreamDeadline
    ? parseInt(upstreamDeadline as string, 10)
    : Date.now() + 30000; // 30 second default

  const ctx: DeadlineContext = {
    deadlineEpochMs: deadline,
    operationStart: Date.now(),
    parentOperationName: req.path
  };

  deadlineStorage.run(ctx, () => next());
}

/**
 * Wrap an async operation with deadline enforcement.
 */
async function withDeadline<T>(
  operation: () => Promise<T>,
  operationName: string,
  reserveMs: number = 100 // Reserve time for response processing
): Promise<T> {
  const remaining = getTimeRemaining() - reserveMs;

  if (remaining <= 0) {
    throw new DeadlineExceededException(
      operationName,
      'Deadline already exceeded before operation started'
    );
  }

  return Promise.race([
    operation(),
    createTimeoutPromise(remaining, operationName)
  ]);
}

function createTimeoutPromise<T>(timeoutMs: number, operationName: string): Promise<T> {
  return new Promise((_, reject) => {
    setTimeout(() => {
      reject(new DeadlineExceededException(operationName, `Operation timed out after ${timeoutMs}ms`));
    }, timeoutMs);
  });
}

/**
 * HTTP client that propagates deadlines to downstream services.
 */
class DeadlineAwareHttpClient {
  async request<T>(url: string, options: RequestInit = {}): Promise<T> {
    const remaining = getTimeRemaining();

    if (remaining <= 0) {
      throw new DeadlineExceededException('http-request', 'Deadline exceeded before making request');
    }

    const headers = new Headers(options.headers);
    headers.set('X-Deadline-Ms', String(deadlineStorage.getStore()?.deadlineEpochMs));

    // Set fetch timeout to remaining time (minus small buffer)
    const controller = new AbortController();
    const timeout = setTimeout(() => controller.abort(), remaining - 50);

    try {
      const response = await fetch(url, { ...options, headers, signal: controller.signal });

      if (!response.ok) {
        throw new HttpError(response.status, await response.text());
      }

      return response.json();
    } finally {
      clearTimeout(timeout);
    }
  }
}

class DeadlineExceededException extends Error {
  constructor(
    public readonly operation: string,
    message: string
  ) {
    super(message);
    this.name = 'DeadlineExceededException';
  }
}

// ============================================
// Usage: Multi-step operation with deadline budget
// ============================================

async function processOrder(order: Order): Promise<ProcessedOrder> {
  // Total operation timeout: 5 seconds
  // But we're within a request that may have its own deadline
  const effectiveTimeout = Math.min(5000, getTimeRemaining());

  return await withDeadline(async () => {
    // Step 1: Validate inventory (budget: 1 second)
    const inventory = await withDeadline(
      () => inventoryService.check(order.items),
      'inventory-check',
      100
    );

    // Step 2: Process payment (~2 seconds budget)
    const payment = await withDeadline(
      () => paymentService.charge(order.payment),
      'payment-processing',
      100
    );

    // Step 3: Create shipment (remaining time)
    const shipment = await withDeadline(
      () => shippingService.create(order.address),
      'shipment-creation',
      100
    );

    return { order, inventory, payment, shipment };
  }, 'process-order');
}
```

In distributed systems, operations that span multiple services can't use traditional database transactions. When a multi-step operation fails partway through, previously successful steps may need to be undone using compensating transactions.
The Saga Pattern
A saga is a sequence of local transactions where each transaction publishes events triggering the next. If any step fails, compensating transactions undo the preceding successful steps in reverse order.
```typescript
/**
 * Saga pattern implementation for distributed transaction compensation.
 */
interface SagaStep<TContext> {
  name: string;
  execute: (context: TContext) => Promise<void>;
  compensate: (context: TContext) => Promise<void>;
}

class SagaOrchestrator<TContext extends { sagaId: string }> {
  private steps: SagaStep<TContext>[] = [];
  private executedSteps: string[] = [];

  addStep(step: SagaStep<TContext>): this {
    this.steps.push(step);
    return this;
  }

  /**
   * Execute the saga with automatic compensation on failure.
   */
  async execute(context: TContext): Promise<void> {
    this.executedSteps = [];

    for (const step of this.steps) {
      try {
        logger.info(`Saga step starting: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name
        });

        await step.execute(context);
        this.executedSteps.push(step.name);

        logger.info(`Saga step completed: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name
        });
      } catch (error) {
        logger.error(`Saga step failed: ${step.name}`, {
          sagaId: context.sagaId,
          step: step.name,
          error: error instanceof Error ? error.message : String(error)
        });

        // Compensate all executed steps in reverse order
        await this.compensate(context);

        throw new SagaFailedException(
          context.sagaId,
          step.name,
          error instanceof Error ? error : new Error(String(error))
        );
      }
    }
  }

  /**
   * Compensate executed steps in reverse order.
   */
  private async compensate(context: TContext): Promise<void> {
    const stepsToCompensate = [...this.executedSteps].reverse();

    for (const stepName of stepsToCompensate) {
      const step = this.steps.find(s => s.name === stepName);
      if (!step) continue;

      try {
        logger.info(`Compensating step: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName
        });

        await step.compensate(context);

        logger.info(`Compensation completed: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName
        });
      } catch (compensationError) {
        // Compensation failure is critical - needs manual intervention
        logger.fatal(`Compensation failed: ${stepName}`, {
          sagaId: context.sagaId,
          step: stepName,
          error: compensationError
        });

        // Record for manual remediation
        await deadLetterQueue.enqueue({
          type: 'compensation-failure',
          sagaId: context.sagaId,
          step: stepName,
          context,
          error: compensationError
        });
        // Continue compensating other steps
      }
    }
  }
}

class SagaFailedException extends Error {
  constructor(
    public readonly sagaId: string,
    public readonly failedStep: string,
    public readonly cause: Error
  ) {
    super(`Saga ${sagaId} failed at step '${failedStep}'`);
    this.name = 'SagaFailedException';
  }
}

// ============================================
// Example: Order processing saga
// ============================================

interface OrderSagaContext {
  sagaId: string;
  orderId: string;
  userId: string;
  items: OrderItem[];
  paymentDetails: PaymentDetails;
  shippingAddress: Address;
  // Populated during execution for compensation
  reservationId?: string;
  paymentId?: string;
  shipmentId?: string;
}

const orderSaga = new SagaOrchestrator<OrderSagaContext>()
  // Step 1: Reserve inventory
  .addStep({
    name: 'reserve-inventory',
    execute: async (ctx) => {
      const reservation = await inventoryService.reserve(ctx.items);
      ctx.reservationId = reservation.id;
    },
    compensate: async (ctx) => {
      if (ctx.reservationId) {
        await inventoryService.releaseReservation(ctx.reservationId);
      }
    }
  })
  // Step 2: Process payment
  .addStep({
    name: 'process-payment',
    execute: async (ctx) => {
      const payment = await paymentService.charge(ctx.paymentDetails);
      ctx.paymentId = payment.id;
    },
    compensate: async (ctx) => {
      if (ctx.paymentId) {
        await paymentService.refund(ctx.paymentId);
      }
    }
  })
  // Step 3: Create shipment
  .addStep({
    name: 'create-shipment',
    execute: async (ctx) => {
      const shipment = await shippingService.create({
        orderId: ctx.orderId,
        items: ctx.items,
        address: ctx.shippingAddress
      });
      ctx.shipmentId = shipment.id;
    },
    compensate: async (ctx) => {
      if (ctx.shipmentId) {
        await shippingService.cancel(ctx.shipmentId);
      }
    }
  })
  // Step 4: Confirm order (no compensation - this is the commit point)
  .addStep({
    name: 'confirm-order',
    execute: async (ctx) => {
      await orderService.confirm(ctx.orderId);
    },
    compensate: async () => {
      // No compensation needed - if we fail before confirm,
      // previous steps will be compensated.
      // If confirm succeeds, the saga is complete.
    }
  });

// Usage
async function createOrder(request: CreateOrderRequest): Promise<Order> {
  const order = await orderService.create(request);

  const sagaContext: OrderSagaContext = {
    sagaId: generateSagaId(),
    orderId: order.id,
    userId: request.userId,
    items: request.items,
    paymentDetails: request.payment,
    shippingAddress: request.address
  };

  try {
    await orderSaga.execute(sagaContext);
    return order;
  } catch (error) {
    // Saga failed and compensated - order is cancelled
    await orderService.markFailed(order.id, (error as Error).message);
    throw error;
  }
}
```

Compensation might fail due to the same issues (network, timeout) that caused the original failure. Design compensating operations to be idempotent and retriable. When compensation repeatedly fails, route to a dead-letter queue for manual intervention—never silently ignore compensation failures.
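As one way to make a compensating action idempotent and retriable, it can be wrapped with the `withRetry` helper from the retry section and guarded against double execution. The sketch below is a minimal illustration assuming the saga types above; `paymentService.getRefundStatus` is a hypothetical lookup used only to show the idempotency guard.

```typescript
// Sketch: an idempotent, retried compensation for the payment step.
// Assumes withRetry and OrderSagaContext from earlier in this page;
// paymentService.getRefundStatus is a hypothetical idempotency check.
const compensatePayment = async (ctx: OrderSagaContext): Promise<void> => {
  if (!ctx.paymentId) return; // Nothing was charged, nothing to undo

  const outcome = await withRetry(
    async () => {
      // Idempotency guard: skip the refund if it was already issued on a previous attempt
      const status = await paymentService.getRefundStatus(ctx.paymentId!);
      if (status === 'refunded') return;
      await paymentService.refund(ctx.paymentId!);
    },
    { maxRetries: 3, initialDelayMs: 200, maxDelayMs: 5000, multiplier: 2, jitter: 0.2 }
  );

  if (!outcome.success) {
    // Surface the failure so the orchestrator routes it to the dead-letter queue
    throw outcome.error ?? new Error('Payment compensation failed');
  }
};
```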
Graceful degradation means intentionally reducing functionality to maintain core service during stress or component failures. Rather than complete failure, the system continues operating with reduced capabilities.
Degradation Levels
Define tiers of functionality that can be progressively disabled:
```typescript
/**
 * Graceful degradation system with configurable feature tiers.
 */

enum DegradationLevel {
  NORMAL = 0,     // All features available
  REDUCED = 1,    // Non-essential features disabled
  ESSENTIAL = 2,  // Only critical features available
  MINIMAL = 3     // Absolute minimum viable service
}

interface FeatureDefinition {
  name: string;
  description: string;
  minLevel: DegradationLevel;  // Disable at this level and above
  dependencies: string[];      // Other services this feature needs
  fallback?: () => any;        // Fallback behavior when disabled
}

class DegradationManager {
  private currentLevel: DegradationLevel = DegradationLevel.NORMAL;
  private features: Map<string, FeatureDefinition> = new Map();
  private circuitBreakers: Map<string, CircuitBreaker> = new Map();
  // Assumed helper: a sliding-window error rate tracker (implementation not shown)
  private errorRateTracker = new ErrorRateTracker();

  registerFeature(feature: FeatureDefinition): void {
    this.features.set(feature.name, feature);
  }

  /**
   * Check if a feature is currently available.
   */
  isFeatureAvailable(featureName: string): boolean {
    const feature = this.features.get(featureName);
    if (!feature) return false;

    // Check explicit degradation level
    if (this.currentLevel >= feature.minLevel) {
      return false;
    }

    // Check dependencies' circuits
    for (const dep of feature.dependencies) {
      const circuit = this.circuitBreakers.get(dep);
      if (circuit?.getState().state === 'OPEN') {
        return false;
      }
    }

    return true;
  }

  /**
   * Execute operation with degradation-aware behavior.
   */
  async executeWithDegradation<T>(
    featureName: string,
    operation: () => Promise<T>
  ): Promise<T | null> {
    const feature = this.features.get(featureName);

    if (!this.isFeatureAvailable(featureName)) {
      if (feature?.fallback) {
        return feature.fallback();
      }
      return null;
    }

    try {
      return await operation();
    } catch (error) {
      // On error, consider activating degradation
      this.handleFeatureError(featureName, error as Error);

      if (feature?.fallback) {
        return feature.fallback();
      }
      throw error;
    }
  }

  /**
   * Manually set degradation level (e.g., during incidents).
   */
  setDegradationLevel(level: DegradationLevel): void {
    const previous = this.currentLevel;
    this.currentLevel = level;

    logger.warn('Degradation level changed', {
      previous: DegradationLevel[previous],
      current: DegradationLevel[level]
    });

    // Notify monitoring
    metrics.gauge('degradation.level', level);
  }

  /**
   * Current level, for inclusion in API responses and dashboards.
   */
  getCurrentLevel(): DegradationLevel {
    return this.currentLevel;
  }

  /**
   * Auto-adjust degradation based on error rates.
   */
  private handleFeatureError(featureName: string, error: Error): void {
    // Track error rate and potentially auto-degrade
    const errorRate = this.errorRateTracker.record(featureName, error);

    if (errorRate > 0.5 && this.currentLevel < DegradationLevel.REDUCED) {
      logger.warn('Auto-degrading due to high error rate', {
        feature: featureName,
        errorRate
      });
      this.setDegradationLevel(DegradationLevel.REDUCED);
    }
  }
}

// ============================================
// Example: E-commerce with degradation tiers
// ============================================

const degradationManager = new DegradationManager();

// Register features with their degradation thresholds
degradationManager.registerFeature({
  name: 'personalized-recommendations',
  description: 'ML-powered product recommendations',
  minLevel: DegradationLevel.REDUCED,  // Disable early
  dependencies: ['recommendation-service'],
  fallback: () => ({ products: [], source: 'disabled' })
});

degradationManager.registerFeature({
  name: 'live-inventory',
  description: 'Real-time inventory checks',
  minLevel: DegradationLevel.ESSENTIAL,
  dependencies: ['inventory-service'],
  fallback: () => ({ available: true, cached: true })  // Optimistic
});

degradationManager.registerFeature({
  name: 'payment-processing',
  description: 'Process payments',
  minLevel: DegradationLevel.MINIMAL,  // Only disable in extreme cases
  dependencies: ['payment-service'],
  fallback: undefined  // No fallback - must fail if unavailable
});

// Usage in API handler
async function getProductPage(productId: string): Promise<ProductPageResponse> {
  const product = await productService.get(productId);

  // Recommendations - degradable
  const recommendations = await degradationManager.executeWithDegradation(
    'personalized-recommendations',
    () => recommendationService.getFor(productId)
  );

  // Inventory - degradable with cached fallback
  const inventory = await degradationManager.executeWithDegradation(
    'live-inventory',
    () => inventoryService.check(productId)
  );

  return {
    product,
    recommendations: recommendations || { products: [], source: 'unavailable' },
    inventory: inventory || { available: true, cached: true },
    degradationLevel: degradationManager.getCurrentLevel()
  };
}
```

When operating in degraded mode, consider informing users. A subtle banner saying 'Some features are temporarily limited' manages expectations better than features silently disappearing. Include degradation status in API responses so clients can adapt their UI accordingly.
Error recovery transforms systems from fragile to resilient. By combining retry strategies, circuit breakers, fallbacks, and compensation, you build systems that handle failures gracefully and maintain availability despite component problems.
Module Complete
You've now mastered error handling at boundaries—from exception translation through layers, to user-facing messages, comprehensive logging, and active recovery strategies. These patterns combine to create systems that handle errors not as exceptional disasters but as expected events with planned responses.
Apply these patterns to build systems that operators trust, users appreciate, and developers can debug efficiently. Error handling isn't just about catching exceptions—it's about designing for graceful behavior when the world doesn't cooperate.
You now understand the complete landscape of error recovery: retry patterns with budgets, circuit breakers for failing dependencies, fallback chains for continuity, timeout management with deadline propagation, compensating transactions for distributed consistency, and graceful degradation for stress scenarios. Apply these patterns to build truly resilient systems.