Knowing when to retry is only half the equation. Equally critical is knowing how long to wait between retry attempts. Retry too quickly, and you hammer a service that's already struggling, potentially preventing its recovery. Wait too long, and you sacrifice availability unnecessarily, leaving users staring at spinners while a recovered service sits idle.
Exponential backoff is the elegant solution to this timing challenge. Rather than using fixed delays or ad-hoc timing, exponential backoff provides a mathematically principled approach that balances responsive recovery with resource protection. It's the standard retry timing strategy across cloud platforms, network protocols, and distributed systems—from TCP congestion control to AWS SDK retry policies to Kubernetes pod restart strategies.
This page explores exponential backoff in depth: the mathematical foundations, the intuition behind why it works, implementation patterns, configuration parameters, and real-world tuning strategies. By the end, you'll understand not just how to implement exponential backoff but why it's the right default for most retry scenarios.
By the end of this page, you will understand the mathematical model of exponential backoff, why linear and fixed delays fail at scale, how to implement backoff correctly, key configuration parameters and their trade-offs, and advanced techniques like capped backoff and decorrelated delays.
Before diving into exponential backoff, let's understand why simpler approaches fail. Many developers' first instinct is to implement fixed delays:
failed → wait 1 second → retry → failed → wait 1 second → retry → ...
Or linear delays:
failed → wait 1s → retry → failed → wait 2s → retry → failed → wait 3s → ...
Both approaches have fundamental problems when applied to distributed systems at scale.
Fixed Delay Problems
Fixed delays create retry synchronization: multiple clients that fail at the same moment will all retry at the same moment. Suppose 1,000 clients hit a failing service simultaneously and each waits exactly 1 second before retrying. The service then receives a coordinated spike of 1,000 retries every second, making recovery extremely difficult. It oscillates between "attempting to recover" and "overwhelmed by synchronized retries."
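The synchronization effect is easy to see in a toy simulation. The sketch below is illustrative only (the client count and delay values are assumptions): it computes when a cohort of clients that all failed at t=0 will retry. With a fixed 1-second delay, every client retries at the same instants forever; with exponential delays the spikes spread further apart over time—though note the clients are still aligned with each other, which is the synchronization problem jitter (covered later) addresses.

```typescript
// Toy model: N clients all fail at t=0 and follow the same retry schedule.

function retryTimesFixed(delayMs: number, retries: number): number[] {
  // Fixed delay: retries land at t = delay, 2*delay, 3*delay, ...
  return Array.from({ length: retries }, (_, i) => delayMs * (i + 1));
}

function retryTimesExponential(baseMs: number, retries: number): number[] {
  // Exponential delay: retries land at t = base, base + 2*base, base + 2*base + 4*base, ...
  const times: number[] = [];
  let t = 0;
  for (let i = 0; i < retries; i++) {
    t += baseMs * Math.pow(2, i);
    times.push(t);
  }
  return times;
}

// 1000 synchronized clients: every retry instant carries the full herd.
const clients = 1000;
console.log(`Fixed:       spikes of ${clients} at`, retryTimesFixed(1000, 5));
// → [1000, 2000, 3000, 4000, 5000] — a full-strength spike every second
console.log(`Exponential: spikes of ${clients} at`, retryTimesExponential(100, 5));
// → [100, 300, 700, 1500, 3100] — spikes spread out, load pressure decays
```

Exponential spacing reduces the *frequency* of the spikes but not their *size*: without randomization, all 1,000 clients still arrive together at each instant.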
The Fundamental Insight
Each consecutive failure provides additional evidence that the service needs time to recover. This evidence should be weighted exponentially:
The probability of a true transient failure decreases roughly exponentially with each retry attempt. Therefore, the delay should increase exponentially to match this updated probability assessment.
From a Bayesian perspective, each failed retry updates our prior belief about the nature of the failure. We start with a prior that the failure is transient (brief, will resolve quickly). Each failure reduces this probability, shifting our belief toward a more persistent condition. Exponential backoff encodes this belief update into our retry timing—longer waits reflect lower confidence that immediate retry will succeed.
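This belief update can be made concrete with a small worked example. The numbers below are invented for illustration (a 90% prior that the failure is transient, plus assumed failure likelihoods under each hypothesis); the point is the shape of the decline, not the exact values.

```typescript
// Illustrative Bayesian update — all probabilities here are made up for the
// example. If the issue is transient, a retry still fails with probability
// 0.3; if the outage is persistent, a retry fails with probability 0.95.

function updateTransientBelief(
  priorTransient: number,
  pFailGivenTransient: number,   // P(retry fails | transient)
  pFailGivenPersistent: number   // P(retry fails | persistent)
): number {
  // Bayes' rule: posterior = prior * likelihood / total evidence
  const evidence =
    priorTransient * pFailGivenTransient +
    (1 - priorTransient) * pFailGivenPersistent;
  return (priorTransient * pFailGivenTransient) / evidence;
}

let belief = 0.9; // prior: 90% of failures are transient
for (let failure = 1; failure <= 4; failure++) {
  belief = updateTransientBelief(belief, 0.3, 0.95);
  console.log(`After failure ${failure}: P(transient) ≈ ${belief.toFixed(2)}`);
}
// P(transient) falls roughly geometrically: ~0.74 → ~0.47 → ~0.22 → ~0.08
```

The posterior drops by a roughly constant factor per failure, which is exactly the behavior exponential delay growth is matching.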
Exponential backoff follows a simple mathematical model. Let's build it from first principles.
The Basic Formula
The delay before retry attempt n (where n starts at 0 for the first retry) is:
delay(n) = baseDelay × multiplier^n
Where:
- `baseDelay` is the initial wait time (typically 100ms to 1s)
- `multiplier` is the growth factor (typically 2)
- `n` is the retry attempt number (0-indexed)

Example with baseDelay=100ms, multiplier=2:
| Attempt | Formula | Delay |
|---|---|---|
| 0 (1st retry) | 100 × 2⁰ | 100ms |
| 1 (2nd retry) | 100 × 2¹ | 200ms |
| 2 (3rd retry) | 100 × 2² | 400ms |
| 3 (4th retry) | 100 × 2³ | 800ms |
| 4 (5th retry) | 100 × 2⁴ | 1600ms |
| 5 (6th retry) | 100 × 2⁵ | 3200ms |
Notice how quickly the delays grow. After just 5 retry attempts, we're waiting over 3 seconds—time for significant service recovery. After 10 attempts, we'd be waiting over 100 seconds (1.7 minutes).
```typescript
// Basic exponential backoff implementation
interface BackoffConfig {
  baseDelayMs: number;  // Initial delay (e.g., 100ms)
  multiplier: number;   // Growth factor (typically 2)
  maxDelayMs: number;   // Cap to prevent absurdly long waits
  maxAttempts: number;  // Maximum retry attempts
}

function calculateBackoffDelay(
  attemptNumber: number, // 0-indexed: 0 = first retry
  config: BackoffConfig
): number {
  // Calculate exponential delay
  const exponentialDelay =
    config.baseDelayMs * Math.pow(config.multiplier, attemptNumber);

  // Apply maximum cap
  return Math.min(exponentialDelay, config.maxDelayMs);
}

// Example usage
const config: BackoffConfig = {
  baseDelayMs: 100,
  multiplier: 2,
  maxDelayMs: 30000, // Cap at 30 seconds
  maxAttempts: 8,
};

// See how delays grow
for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
  const delay = calculateBackoffDelay(attempt, config);
  console.log(`Attempt ${attempt + 1}: wait ${delay}ms before retry`);
}

// Output:
// Attempt 1: wait 100ms before retry
// Attempt 2: wait 200ms before retry
// Attempt 3: wait 400ms before retry
// Attempt 4: wait 800ms before retry
// Attempt 5: wait 1600ms before retry
// Attempt 6: wait 3200ms before retry
// Attempt 7: wait 6400ms before retry
// Attempt 8: wait 12800ms before retry
```

Why Base 2?
The multiplier of 2 (doubling) is standard but not mandatory. Doubling backs off quickly enough to relieve pressure within a few attempts, yet slowly enough that delays stay within a useful range for a typical retry budget. Compare alternative multipliers:
| Attempt | Multiplier 1.5 | Multiplier 2 | Multiplier 3 |
|---|---|---|---|
| 1 | 100ms | 100ms | 100ms |
| 2 | 150ms | 200ms | 300ms |
| 3 | 225ms | 400ms | 900ms |
| 4 | 338ms | 800ms | 2.7s |
| 5 | 506ms | 1.6s | 8.1s |
| 6 | 759ms | 3.2s | 24.3s |
| 7 | 1.1s | 6.4s | 72.9s |
| 8 | 1.7s | 12.8s | 218.7s (3.6min) |
When choosing a multiplier, consider the total time budget. With multiplier 2 and 8 attempts, total wait time before final attempt is about 25.5 seconds (sum of all delays). With multiplier 3, the same 8 attempts exhaust over 5 minutes. Ensure your timeout budget and user expectations align with your multiplier choice.
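One way to sanity-check a multiplier against a time budget is to sum the delay series directly. A minimal sketch (no delay cap applied here):

```typescript
// Total wait time across all retry delays for a given configuration.
// Equivalent to the geometric series base * (m^attempts - 1) / (m - 1).
function totalWaitMs(baseMs: number, multiplier: number, attempts: number): number {
  let total = 0;
  for (let i = 0; i < attempts; i++) {
    total += baseMs * Math.pow(multiplier, i);
  }
  return total;
}

console.log(totalWaitMs(100, 2, 8));   // 25500  (~25.5s, as stated above)
console.log(totalWaitMs(100, 3, 8));   // 328000 (~5.5 minutes)
console.log(totalWaitMs(100, 1.5, 8)); // ~4926  (under 5 seconds)
```

Running this against a candidate configuration before deploying it is a cheap way to catch policies whose worst-case wait blows past the caller's deadline.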
Pure exponential growth becomes impractical quickly. With baseDelay=100ms and multiplier=2, attempt 20 would wait about 29 hours (100ms × 2²⁰). Clearly, we need a cap.
Maximum Delay Cap
The delay should be capped at a reasonable maximum:
delay(n) = min(baseDelay × multiplier^n, maxDelay)
Once the cap is reached, subsequent retries use the capped value:
| Attempt | Uncapped | Capped (max=30s) |
|---|---|---|
| 5 | 3.2s | 3.2s |
| 6 | 6.4s | 6.4s |
| 7 | 12.8s | 12.8s |
| 8 | 25.6s | 25.6s |
| 9 | 51.2s | 30s |
| 10 | 102.4s | 30s |
Choosing Maximum Delay
The right maximum depends on your use case: interactive requests usually cap delays at a few seconds to protect latency, while background work can tolerate caps of minutes.
```typescript
// Complete exponential backoff with cap
interface CappedBackoffConfig {
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
  maxAttempts: number;
  totalTimeoutMs?: number; // Optional: total budget across all retries
}

class ExponentialBackoff {
  private attempt: number = 0;
  private totalElapsed: number = 0;
  private startTime: number = Date.now();

  constructor(private config: CappedBackoffConfig) {}

  /**
   * Returns next delay duration, or null if retries exhausted
   */
  nextDelay(): number | null {
    // Check attempt limit
    if (this.attempt >= this.config.maxAttempts) {
      return null;
    }

    // Check total time budget if configured
    if (this.config.totalTimeoutMs) {
      const elapsed = Date.now() - this.startTime;
      if (elapsed >= this.config.totalTimeoutMs) {
        return null;
      }
    }

    // Calculate delay
    const exponentialDelay =
      this.config.baseDelayMs * Math.pow(this.config.multiplier, this.attempt);
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);

    // If total timeout configured, cap delay to remaining time
    if (this.config.totalTimeoutMs) {
      const remaining =
        this.config.totalTimeoutMs - (Date.now() - this.startTime);
      if (cappedDelay > remaining) {
        return null; // Not enough time for this retry
      }
    }

    this.attempt++;
    return cappedDelay;
  }

  /**
   * Record that a delay was executed (for tracking)
   */
  recordWait(actualDelayMs: number): void {
    this.totalElapsed += actualDelayMs;
  }

  /**
   * Reset for reuse
   */
  reset(): void {
    this.attempt = 0;
    this.totalElapsed = 0;
    this.startTime = Date.now();
  }

  /**
   * Current attempt number
   */
  get currentAttempt(): number {
    return this.attempt;
  }

  /**
   * Total time spent waiting
   */
  get totalWaitTime(): number {
    return this.totalElapsed;
  }
}

// Usage example
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  config: CappedBackoffConfig
): Promise<T> {
  const backoff = new ExponentialBackoff(config);
  let lastError: Error | undefined;

  while (true) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;
      const delay = backoff.nextDelay();
      if (delay === null) {
        throw new Error(
          `Retry exhausted after ${backoff.currentAttempt} attempts ` +
          `(${backoff.totalWaitTime}ms total wait): ${lastError.message}`
        );
      }
      console.log(
        `Attempt ${backoff.currentAttempt} failed, ` +
        `waiting ${delay}ms before retry...`
      );
      await sleep(delay);
      backoff.recordWait(delay);
    }
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Total Time Budget Consideration
Beyond per-retry caps, consider the total time budget. A request with a 30-second deadline shouldn't initiate 8 retry attempts that would take 25 seconds before even starting the last attempt. Options include enforcing a shared total-timeout budget (the totalTimeoutMs field above) or shrinking the attempt count so the worst-case cumulative wait fits the deadline.
Always enforce either maxAttempts or totalTimeout (preferably both). Without limits, retries continue indefinitely, consuming resources, accumulating memory, and never completing the operation. In production, this leads to memory leaks, thread starvation, and zombie requests that never resolve.
Let's build a production-ready exponential backoff implementation that incorporates all best practices: capping, retryability classification, observability, and cancellation support.
```typescript
// Production-ready exponential backoff with full features
interface RetryOptions {
  // Backoff configuration
  baseDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  maxAttempts: number;

  // Error classification
  isRetryable: (error: Error) => boolean;

  // Observability
  onRetry?: (attempt: number, delay: number, error: Error) => void;
  onExhausted?: (totalAttempts: number, totalWaitMs: number, lastError: Error) => void;

  // Advanced options
  respectRetryAfter?: boolean;
  abortSignal?: AbortSignal;
}

interface RetryResult<T> {
  success: boolean;
  value?: T;
  error?: Error;
  attempts: number;
  totalWaitMs: number;
}

/**
 * Execute an operation with exponential backoff retry
 */
async function executeWithRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions
): Promise<RetryResult<T>> {
  let attempts = 0;
  let totalWaitMs = 0;
  let lastError: Error | undefined;

  while (attempts < options.maxAttempts) {
    // Check for cancellation
    if (options.abortSignal?.aborted) {
      return {
        success: false,
        error: new Error('Operation cancelled'),
        attempts,
        totalWaitMs,
      };
    }

    attempts++;

    try {
      const result = await operation(attempts);
      return { success: true, value: result, attempts, totalWaitMs };
    } catch (error) {
      lastError = error as Error;

      // Check if retryable
      if (!options.isRetryable(lastError)) {
        return { success: false, error: lastError, attempts, totalWaitMs };
      }

      // Check if more attempts available
      if (attempts >= options.maxAttempts) {
        break;
      }

      // Calculate delay
      let delay = calculateDelay(attempts - 1, options);

      // Check for Retry-After header
      if (options.respectRetryAfter) {
        const retryAfter = extractRetryAfter(lastError);
        if (retryAfter) {
          delay = Math.max(delay, retryAfter);
        }
      }

      // Notify observer
      options.onRetry?.(attempts, delay, lastError);

      // Wait
      await sleep(delay, options.abortSignal);
      totalWaitMs += delay;
    }
  }

  // Exhausted
  options.onExhausted?.(attempts, totalWaitMs, lastError!);
  return { success: false, error: lastError, attempts, totalWaitMs };
}

function calculateDelay(attemptIndex: number, options: RetryOptions): number {
  const exponential =
    options.baseDelayMs * Math.pow(options.multiplier, attemptIndex);
  return Math.min(exponential, options.maxDelayMs);
}

function extractRetryAfter(error: Error): number | null {
  // Implementation depends on your HTTP client
  const response = (error as any).response;
  const retryAfterHeader = response?.headers?.['retry-after'];
  if (!retryAfterHeader) return null;

  // Could be seconds or HTTP date
  const seconds = parseInt(retryAfterHeader, 10);
  if (!isNaN(seconds)) return seconds * 1000;

  const date = new Date(retryAfterHeader);
  if (!isNaN(date.getTime())) {
    return Math.max(0, date.getTime() - Date.now());
  }
  return null;
}

function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    const timeout = setTimeout(resolve, ms);
    signal?.addEventListener('abort', () => {
      clearTimeout(timeout);
      reject(new Error('Sleep aborted'));
    });
  });
}

// Example usage
async function exampleUsage() {
  const result = await executeWithRetry(
    async (attempt) => {
      console.log(`Executing attempt ${attempt}`);
      const response = await fetch('https://api.example.com/data');
      if (!response.ok) {
        throw new HttpError(response.status, await response.text());
      }
      return response.json();
    },
    {
      baseDelayMs: 100,
      maxDelayMs: 10000,
      multiplier: 2,
      maxAttempts: 5,
      isRetryable: (error) => {
        if (error instanceof HttpError) {
          return [408, 429, 502, 503, 504].includes(error.status);
        }
        return error.name === 'NetworkError';
      },
      onRetry: (attempt, delay, error) => {
        console.log(`Attempt ${attempt} failed: ${error.message}. ` +
          `Retrying in ${delay}ms`);
      },
      onExhausted: (attempts, wait, error) => {
        console.error(`Failed after ${attempts} attempts ` +
          `(${wait}ms total): ${error.message}`);
      },
      respectRetryAfter: true,
    }
  );

  if (result.success) {
    console.log('Success:', result.value);
  } else {
    console.error('Failed:', result.error);
  }
}

class HttpError extends Error {
  constructor(public status: number, message: string) {
    super(`HTTP ${status}: ${message}`);
    this.name = 'HttpError';
  }
}
```

The default values (100ms base, 2x multiplier, 30s max) work well for many cases, but different scenarios require different tuning.
Tuning for User-Facing Latency
For user-facing requests where latency is critical, use small base delays and few attempts. The table below gives starting points for this and other common scenarios:
| Use Case | Base Delay | Multiplier | Max Delay | Max Attempts |
|---|---|---|---|---|
| User-facing API call | 50-100ms | 1.5-2 | 5-10s | 3-4 |
| Background job processing | 500ms-2s | 2 | 5-15min | 10-20 |
| External API integration | 1-5s | 2 | 5-10min | 5-8 |
| Database reconnection | 100-500ms | 2 | 30s-2min | 5-10 |
| Message queue consumer | 100ms-1s | 2 | 30-60s | 5-8 |
| Microservice communication | 100-200ms | 2 | 10-30s | 4-6 |
| WebSocket reconnection | 1s | 2 | 2-5min | Unlimited* |
*WebSocket reconnection often uses infinite retries with capped delay, as maintaining the connection is essential and the alternative is complete disconnection.
Tuning for External Dependencies
When calling external APIs or services you don't control, favor longer base delays (seconds rather than milliseconds), honor any Retry-After headers the service returns, and stay within documented rate limits—aggressive retries against a third party can get your client throttled or banned.
Tuning for Heavy Load Recovery
When retrying services that may be overwhelmed, back off more aggressively: a larger multiplier (e.g., 3) and a higher delay cap shed load faster, giving the struggling service room to recover.
```typescript
// Configuration presets for common scenarios
const BackoffPresets = {
  // Fast user-facing calls
  userFacing: {
    baseDelayMs: 50,
    multiplier: 1.5,
    maxDelayMs: 5000,
    maxAttempts: 4,
    // Total max wait: ~240ms
  },

  // Standard microservice communication
  internalService: {
    baseDelayMs: 100,
    multiplier: 2,
    maxDelayMs: 10000,
    maxAttempts: 5,
    // Total max wait: ~1.5s
  },

  // External API with rate limiting
  externalApi: {
    baseDelayMs: 2000,
    multiplier: 2,
    maxDelayMs: 300000, // 5 minutes
    maxAttempts: 6,
    respectRetryAfter: true,
    // Total max wait: ~1min
  },

  // Background job processing
  backgroundJob: {
    baseDelayMs: 1000,
    multiplier: 2,
    maxDelayMs: 900000, // 15 minutes
    maxAttempts: 15,
    // Will retry for a long time
  },

  // Database connection retry
  databaseConnection: {
    baseDelayMs: 200,
    multiplier: 2,
    maxDelayMs: 60000, // 1 minute
    maxAttempts: 10,
    // Reasonable for DB reconnection
  },

  // Aggressive backoff for overloaded service
  overloadedService: {
    baseDelayMs: 500,
    multiplier: 3, // Aggressive multiplier
    maxDelayMs: 120000, // 2 minutes
    maxAttempts: 8,
    // Grows very fast to reduce pressure
  },
} as const;

// Helper to calculate total max wait time for a configuration
function calculateTotalMaxWait(config: {
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
  maxAttempts: number;
}): number {
  let total = 0;
  for (let i = 0; i < config.maxAttempts - 1; i++) {
    const delay = Math.min(
      config.baseDelayMs * Math.pow(config.multiplier, i),
      config.maxDelayMs
    );
    total += delay;
  }
  return total;
}

// Log configurations with their max wait times
for (const [name, config] of Object.entries(BackoffPresets)) {
  const maxWait = calculateTotalMaxWait(config);
  console.log(`${name}: max wait = ${(maxWait / 1000).toFixed(1)}s`);
}
```

When in doubt, start with conservative settings (higher delays, fewer attempts). It's easier to tighten retry policies after observing they're too slow than to recover from retry storms caused by overly aggressive policies. Use metrics to track retry success rates by attempt number, then tune to balance recovery rate against total latency.
Standard exponential backoff has a predictability problem. If you know the baseDelay and multiplier, you can calculate exact retry times. This can still lead to synchronized retries when many clients fail together, even with different delays, because they all follow the same deterministic formula.
Decorrelated backoff addresses this by breaking the direct correlation between successive delays. Instead of delay(n) = base × multiplier^n, decorrelated backoff uses:
delay(n) = min(maxDelay, random(baseDelay, previousDelay × 3))
The key insight: the next delay is a random value between the base delay and three times the previous delay. This preserves roughly exponential average growth while making each client's sequence unpredictable, so clients that failed together naturally drift apart instead of retrying in lockstep.
```typescript
// Decorrelated backoff implementation
interface DecorrelatedBackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
}

class DecorrelatedBackoff {
  private previousDelay: number;

  constructor(private config: DecorrelatedBackoffConfig) {
    this.previousDelay = config.baseDelayMs;
  }

  /**
   * Calculate next delay using decorrelated algorithm
   */
  nextDelay(): number {
    // Random delay between base and 3x previous
    const minDelay = this.config.baseDelayMs;
    const maxDelay = this.previousDelay * 3;

    // Random value in range [minDelay, maxDelay]
    const nextDelay = minDelay + Math.random() * (maxDelay - minDelay);

    // Cap at maximum
    this.previousDelay = Math.min(nextDelay, this.config.maxDelayMs);
    return this.previousDelay;
  }

  /**
   * Reset to initial state
   */
  reset(): void {
    this.previousDelay = this.config.baseDelayMs;
  }
}

// Demonstration: generate sequence of delays
function demonstrateDecorrelatedBackoff() {
  const backoff = new DecorrelatedBackoff({
    baseDelayMs: 100,
    maxDelayMs: 30000,
  });

  console.log('Decorrelated backoff sequence:');
  for (let i = 0; i < 10; i++) {
    const delay = backoff.nextDelay();
    console.log(`  Attempt ${i + 1}: ${delay.toFixed(0)}ms`);
  }
}

// Compare multiple sequences to show decorrelation
function compareSequences() {
  console.log('Three independent sequences (showing decorrelation):');
  for (let seq = 1; seq <= 3; seq++) {
    const backoff = new DecorrelatedBackoff({
      baseDelayMs: 100,
      maxDelayMs: 30000,
    });
    const delays: number[] = [];
    for (let i = 0; i < 6; i++) {
      delays.push(Math.round(backoff.nextDelay()));
    }
    console.log(`  Sequence ${seq}: ${delays.join(' -> ')} ms`);
  }
}

demonstrateDecorrelatedBackoff();
compareSequences();

// Sample output (randomized — values will differ on every run):
// Decorrelated backoff sequence:
//   Attempt 1: 178ms
//   Attempt 2: 312ms
//   Attempt 3: 534ms
//   Attempt 4: 1245ms
//   Attempt 5: 2156ms
//   Attempt 6: 5432ms
//   ...
//
// Three independent sequences (showing decorrelation):
//   Sequence 1: 167 -> 389 -> 892 -> 2134 -> 5678 -> 12345 ms
//   Sequence 2: 234 -> 456 -> 567 -> 1456 -> 3456 -> 8765 ms
//   Sequence 3: 145 -> 278 -> 712 -> 1823 -> 4321 -> 9876 ms
```

Decorrelated backoff is particularly useful when you have many clients hitting the same endpoints and need to spread retries naturally. It's the default in the AWS SDK. However, for simpler scenarios or when you need predictable timing for testing/debugging, standard exponential backoff with jitter (covered next page) is often preferred.
Exponential backoff and circuit breakers are complementary patterns that work together for comprehensive fault tolerance. Understanding their interaction is essential.
The Relationship
They operate at different time scales: retries with backoff make per-request decisions over milliseconds to seconds, while a circuit breaker aggregates failures across many requests and makes service-level decisions over seconds to minutes.
Correct Integration Pattern
The circuit breaker should wrap the retry logic, not be inside it:
```typescript
// CORRECT: Circuit breaker wraps retry logic
async function correctPattern<T>(operation: () => Promise<T>): Promise<T> {
  // Circuit breaker check first
  if (circuitBreaker.isOpen()) {
    throw new CircuitOpenError('Circuit is open, failing fast');
  }

  try {
    // Retry logic inside circuit breaker context
    const result = await executeWithRetry(operation, backoffConfig);
    circuitBreaker.recordSuccess();
    return result;
  } catch (error) {
    circuitBreaker.recordFailure();
    throw error;
  }
}

// INCORRECT: Retry logic wraps circuit breaker
async function incorrectPattern<T>(operation: () => Promise<T>): Promise<T> {
  return executeWithRetry(async () => {
    // This is wrong: we'd retry even when circuit is open!
    if (circuitBreaker.isOpen()) {
      throw new CircuitOpenError('Circuit open');
    }
    return await operation();
  }, backoffConfig);
}

// The issue with the incorrect pattern:
// - When circuit opens, each retry attempt immediately fails
// - Backoff waits between attempts that can't possibly succeed
// - Wastes time without providing any benefit
// - Circuit cooldown may end mid-retry sequence, causing inconsistent behavior

// Complete integrated example
class ResilientClient {
  private circuitBreaker: CircuitBreaker;
  private backoffConfig: BackoffConfig;

  constructor(
    private serviceName: string,
    circuitBreakerConfig: CircuitBreakerConfig,
    backoffConfig: BackoffConfig
  ) {
    this.circuitBreaker = new CircuitBreaker(circuitBreakerConfig);
    this.backoffConfig = backoffConfig;
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    // 1. Check circuit state
    const circuitState = this.circuitBreaker.getState();
    if (circuitState === 'OPEN') {
      throw new CircuitOpenError(
        `Circuit for ${this.serviceName} is open. ` +
        `Will retry after ${this.circuitBreaker.getRemainingCooldown()}ms`
      );
    }

    // 2. If half-open, allow limited testing
    const retryConfig = circuitState === 'HALF_OPEN'
      ? { ...this.backoffConfig, maxAttempts: 1 } // Single attempt for probing
      : this.backoffConfig;

    try {
      // 3. Execute with retry (inside circuit context)
      const result = await executeWithRetry(operation, retryConfig);

      // 4. Record success (may close circuit if half-open)
      this.circuitBreaker.recordSuccess();
      return result;
    } catch (error) {
      // 5. Record failure (may open circuit)
      this.circuitBreaker.recordFailure(error as Error);
      throw error;
    }
  }
}
```

Key Integration Principles
When combining retries with circuit breakers, be aware of amplification. If each request retries 5 times and you have 100 concurrent requests, the failing service sees 500 requests before the circuit opens. Coordinate retry budgets with circuit breaker thresholds to prevent excessive load during the detection window.
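The amplification math is worth making explicit. A back-of-the-envelope sketch (not a real load model—it assumes every in-flight request burns its full retry budget while the dependency is down):

```typescript
// Worst-case request amplification before the circuit breaker reacts:
// every concurrent caller sends its full allotment of attempts.
function worstCaseLoad(concurrentRequests: number, attemptsPerRequest: number): number {
  return concurrentRequests * attemptsPerRequest;
}

console.log(worstCaseLoad(100, 5)); // 500 requests hit the failing service
```

If the circuit breaker's failure threshold is higher than this worst case, the breaker may never trip during the window where it matters most—so size retry budgets and breaker thresholds together.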
Exponential backoff is the foundational retry timing strategy—mathematically principled, empirically proven, and universally adopted across distributed systems.
What's Next:
Exponential backoff solves the timing problem, but it doesn't fully address the synchronization problem. When many clients fail simultaneously, even exponential backoff can produce synchronized retry patterns. The next page explores jitter—random variance added to delays—and its critical role in preventing the thundering herd phenomenon.
You now understand the mathematical foundations of exponential backoff, why it outperforms simpler alternatives, how to implement it correctly with caps and limits, and how to tune parameters for different scenarios. This prepares you for the next critical concept: adding jitter to prevent synchronized retry storms.