Imagine a concert venue that experiences a brief power outage. The music stops, and 10,000 fans rush to customer service desks to demand refunds. If they all arrive at the same moment—a thundering herd—the desks are immediately overwhelmed, staff can't help anyone, and the situation escalates from inconvenient to chaotic.
Now imagine if those 10,000 fans naturally spread their arrivals over 20 minutes. Some check their tickets first, others discuss with friends, some head to restrooms first. The same number of people, the same requests, but the customer service desks can handle the load because arrivals are distributed over time.
The thundering herd problem in distributed systems is precisely this phenomenon. When a service experiences a transient failure affecting many clients simultaneously, the uniform retry behavior of those clients can create a massive synchronized spike that prevents recovery. Even with exponential backoff, if every client starts from the same moment and uses the same delay formula, they'll all retry at nearly the same times.
Jitter—random variance added to retry delays—is the solution. By introducing randomness, we transform synchronized retry storms into distributed retry flows, giving services the breathing room they need to recover.
By the end of this page, you will understand the thundering herd problem and its impact on distributed systems, master different jitter strategies (full jitter, equal jitter, decorrelated jitter), know how to implement jitter correctly, understand the mathematical basis for jitter effectiveness, and recognize when jitter is critical versus optional.
The thundering herd problem occurs whenever many clients attempt the same action simultaneously, overwhelming the target system. In the context of retries, it manifests when many clients fail at similar times and then retry at similar times.
How Synchronized Retries Form
Consider a scenario without jitter: a shared dependency fails, and every client observes the failure at the same instant, so every client's retry clock starts from the same moment.
Why Exponential Backoff Alone Isn't Enough
Exponential backoff increases delays over time, which helps reduce overall retry pressure. However, it doesn't address synchronization:
| Time (ms) | Client 1 | Client 2 | Client 3 | ... | Client 5000 |
|---|---|---|---|---|---|
| 0 | Fail | Fail | Fail | ... | Fail |
| 100 | Retry | Retry | Retry | ... | Retry |
| 300 | Retry | Retry | Retry | ... | Retry |
| 700 | Retry | Retry | Retry | ... | Retry |
| 1500 | Retry | Retry | Retry | ... | Retry |
Even with backoff, all clients fail at T=0 and compute identical delays (100, 200, 400, 800 ms, ...), so they all retry at the same cumulative times (T=100, 300, 700, 1500 ms). They remain synchronized because the backoff formula is deterministic.
The solution: introduce randomness so each client's retry schedule is unique, spreading the load across time even when failures occur simultaneously.
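The contrast is easy to demonstrate in a few lines. The sketch below is illustrative (the helper names are not from any SDK): 1,000 clients all fail at the same moment, and we compare their fourth-retry delays under deterministic backoff versus full jitter.

```typescript
// Sketch: clients that fail together compute their next retry delay.
// Deterministic backoff gives every client the same delay; full jitter
// draws each client's delay uniformly from [0, exponential delay).

function deterministicDelay(baseMs: number, attempt: number): number {
  return baseMs * Math.pow(2, attempt);
}

function fullJitterDelay(baseMs: number, attempt: number): number {
  return Math.random() * deterministicDelay(baseMs, attempt);
}

const clients = 1000;
const attempt = 3; // exponential delay = 100 * 2^3 = 800ms

const synced = Array.from({ length: clients }, () => deterministicDelay(100, attempt));
const jittered = Array.from({ length: clients }, () => fullJitterDelay(100, attempt));

// All synchronized clients share a single retry moment...
console.log(new Set(synced).size); // 1
// ...while jittered clients spread across the whole 0-800ms window,
// with effectively one distinct retry time per client.
console.log(new Set(jittered).size);
```

With 1,000 identical delays the server sees one spike; with 1,000 distinct delays it sees a roughly uniform trickle across the window, which is exactly the de-synchronization the rest of this page quantifies.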
In 2021, a major cloud provider's configuration change caused a 2-minute authentication service outage. Without jitter, the millions of affected clients created synchronized retry waves that extended the recovery time to over 4 hours. Post-incident analysis showed the authentication service recovered within 5 minutes, but retry storms kept re-failing it. Adding jitter to client SDKs was the primary remediation.
Jitter is random variance applied to retry delays to prevent synchronization. There are several strategies, each with different characteristics.
Full Jitter
The most aggressive approach: the actual delay is a random value between 0 and the calculated exponential delay.
delay = random(0, baseDelay × multiplier^n)
Equal Jitter
A balanced approach: half the delay is guaranteed, half is random.
temp = baseDelay × multiplier^n
delay = temp/2 + random(0, temp/2)
Decorrelated Jitter
As we saw in the previous page, decorrelated jitter bases each delay on the previous delay:
delay = min(maxDelay, random(baseDelay, previousDelay × 3))
| Strategy | Formula | Min Delay | Max Delay | Expected Value |
|---|---|---|---|---|
| No Jitter | 800 | 800ms | 800ms | 800ms |
| Full Jitter | random(0, 800) | 0ms | 800ms | 400ms |
| Equal Jitter | 400 + random(0, 400) | 400ms | 800ms | 600ms |
| Decorrelated | random(100, prev×3) | 100ms | min(prev×3, cap) | varies |
```typescript
// Jitter strategy implementations
type JitterStrategy = 'none' | 'full' | 'equal' | 'decorrelated';

interface JitterConfig {
  strategy: JitterStrategy;
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

class JitterCalculator {
  private previousDelay: number;

  constructor(private config: JitterConfig) {
    this.previousDelay = config.baseDelayMs;
  }

  /**
   * Calculate jittered delay for the given attempt
   */
  calculateDelay(attemptIndex: number): number {
    switch (this.config.strategy) {
      case 'none':
        return this.noJitter(attemptIndex);
      case 'full':
        return this.fullJitter(attemptIndex);
      case 'equal':
        return this.equalJitter(attemptIndex);
      case 'decorrelated':
        return this.decorrelatedJitter();
    }
  }

  /**
   * No jitter: pure exponential backoff
   */
  private noJitter(attemptIndex: number): number {
    const delay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    return Math.min(delay, this.config.maxDelayMs);
  }

  /**
   * Full jitter: random between 0 and exponential delay
   * Most aggressive spread, but may allow very short delays
   */
  private fullJitter(attemptIndex: number): number {
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);
    // Random between 0 and capped delay
    return Math.random() * cappedDelay;
  }

  /**
   * Equal jitter: half guaranteed, half random
   * Balanced approach with guaranteed minimum
   */
  private equalJitter(attemptIndex: number): number {
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);
    const halfDelay = cappedDelay / 2;
    // Guaranteed half + random half
    return halfDelay + Math.random() * halfDelay;
  }

  /**
   * Decorrelated jitter: based on previous delay
   * Self-regulating growth with natural randomization
   */
  private decorrelatedJitter(): number {
    const minDelay = this.config.baseDelayMs;
    const maxDelay = this.previousDelay * 3;
    // Random between base and 3x previous
    const delay = minDelay + Math.random() * (maxDelay - minDelay);
    // Cap and store for next iteration
    this.previousDelay = Math.min(delay, this.config.maxDelayMs);
    return this.previousDelay;
  }

  /**
   * Reset state (for decorrelated jitter)
   */
  reset(): void {
    this.previousDelay = this.config.baseDelayMs;
  }
}

// Demonstration function
function demonstrateJitterSpread() {
  const config: Omit<JitterConfig, 'strategy'> = {
    baseDelayMs: 100,
    multiplier: 2,
    maxDelayMs: 30000,
  };

  const strategies: JitterStrategy[] = ['none', 'full', 'equal', 'decorrelated'];
  const clientCount = 1000;
  const attemptIndex = 3; // 4th retry attempt, exponential delay = 800ms

  for (const strategy of strategies) {
    const calculator = new JitterCalculator({ ...config, strategy });
    const delays: number[] = [];

    for (let i = 0; i < clientCount; i++) {
      delays.push(calculator.calculateDelay(attemptIndex));
      if (strategy === 'decorrelated') {
        calculator.reset();
      }
    }

    const avgDelay = delays.reduce((a, b) => a + b) / delays.length;
    const minDelay = Math.min(...delays);
    const maxDelay = Math.max(...delays);
    const stdDev = Math.sqrt(
      delays.reduce((sum, d) => sum + Math.pow(d - avgDelay, 2), 0) / delays.length
    );

    console.log(`${strategy}: avg=${avgDelay.toFixed(0)}ms, ` +
      `min=${minDelay.toFixed(0)}ms, max=${maxDelay.toFixed(0)}ms, ` +
      `stddev=${stdDev.toFixed(0)}ms`);
  }
}
```

AWS's analysis of jitter strategies concluded that full jitter provides the best de-synchronization and overall performance for most workloads. While it can produce short delays, the benefits of maximum spread outweigh the occasional fast retry. The AWS SDK uses full jitter by default.
Understanding why jitter works requires examining how retry attempts distribute over time.
Without Jitter: Synchronized Spikes
With deterministic exponential backoff, if N clients fail at time T=0, they all compute identical delay schedules. The load on the server follows a step function, with N requests arriving simultaneously at deterministic intervals. The server sees:
Load at T=100ms: N requests (spike)
Load at T=100-300ms: 0 requests
Load at T=300ms: N requests (spike)
Load at T=300-700ms: 0 requests
...
With Full Jitter: Distributed Load
With full jitter, each client's delay is drawn uniformly at random from the retry window. With N clients and a uniform distribution over a 100 ms first window, the expected arrival rate is approximately N/100 requests per millisecond. The server sees:
Load at T=0-100ms: ~N/100 requests per ms (smooth curve)
Load at T=100-300ms: first retries completing, second retries starting (mixed)
...
Quantifying the Improvement
Let's quantify the peak load difference:
| Metric | No Jitter | Full Jitter | Improvement |
|---|---|---|---|
| Peak requests/ms (N=10,000) | 10,000 (instantaneous) | ~100 | 100x reduction |
| Time to clear first retry | ~1ms (all at once) | ~100ms (spread) | Smooth flow |
| Server headroom needed | Must handle N simultaneous | Handle N/delay | Much lower |
| Recovery opportunity | None (constant spikes) | Between bursts | Continuous |
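The peak-load figures in the table can be reproduced with a short simulation. This is a sketch with illustrative parameters (10,000 clients, a 100 ms first retry window): each client's first retry is dropped into a 1 ms bucket, and we compare the tallest bucket with and without full jitter.

```typescript
// Sketch: peak arrivals per millisecond for N clients whose first retry
// is either exactly at T=100ms (no jitter) or uniform in [0, 100ms) (full jitter).

const N = 10000;
const windowMs = 100;

function peakPerMs(delays: number[]): number {
  const buckets = new Array<number>(windowMs + 1).fill(0);
  for (const d of delays) buckets[Math.min(Math.floor(d), windowMs)]++;
  return Math.max(...buckets);
}

const noJitterArrivals = Array.from({ length: N }, () => windowMs);           // everyone at T=100ms
const fullJitterArrivals = Array.from({ length: N }, () => Math.random() * windowMs);

console.log(peakPerMs(noJitterArrivals));   // 10000: the whole herd lands in one bucket
console.log(peakPerMs(fullJitterArrivals)); // roughly N/window, i.e. on the order of 100
```

The roughly 100x reduction in the table falls straight out: the same number of retries arrives either way, but jitter spreads them over the window instead of concentrating them in a single instant.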
The Key Insight
Jitter transforms retry behavior from discrete, synchronized events into continuous, distributed flows. Instead of spike → idle → spike → idle, the server sees steady (though elevated) traffic that it can handle sustainably.
Standard Deviation as Spread Metric
The effectiveness of jitter can be measured by the standard deviation of retry times:
When many clients use jitter, the aggregate retry load approximates a smooth continuous distribution (by the Central Limit Theorem). Even if individual client behavior is random, the aggregate becomes predictable and manageable. This is why jitter becomes more important as scale increases—more clients means smoother aggregate behavior with jitter, but also larger synchronized spikes without it.
Correct jitter implementation requires attention to randomness quality, integration with backoff logic, and handling of edge cases.
```typescript
// Production-ready jittered backoff implementation
interface JitteredBackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  maxAttempts: number;
  jitter: 'none' | 'full' | 'equal';
  // Optional: minimum delay regardless of jitter
  minDelayMs?: number;
}

class JitteredExponentialBackoff {
  private attempt: number = 0;
  private totalWaitMs: number = 0;

  constructor(private config: JitteredBackoffConfig) {}

  /**
   * Get the next delay with jitter applied
   * Returns null if max attempts exceeded
   */
  nextDelay(): number | null {
    if (this.attempt >= this.config.maxAttempts) {
      return null;
    }

    // Calculate base exponential delay
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, this.attempt);

    // Apply cap
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);

    // Apply jitter
    let jitteredDelay: number;
    switch (this.config.jitter) {
      case 'full':
        jitteredDelay = Math.random() * cappedDelay;
        break;
      case 'equal':
        jitteredDelay = (cappedDelay / 2) + Math.random() * (cappedDelay / 2);
        break;
      case 'none':
      default:
        jitteredDelay = cappedDelay;
    }

    // Apply minimum if configured
    if (this.config.minDelayMs) {
      jitteredDelay = Math.max(jitteredDelay, this.config.minDelayMs);
    }

    this.attempt++;
    return Math.round(jitteredDelay);
  }

  recordWait(delayMs: number): void {
    this.totalWaitMs += delayMs;
  }

  get currentAttempt(): number {
    return this.attempt;
  }

  get totalWait(): number {
    return this.totalWaitMs;
  }

  reset(): void {
    this.attempt = 0;
    this.totalWaitMs = 0;
  }
}

/**
 * Higher-order function for retrying with jittered backoff
 */
async function withJitteredRetry<T>(
  operation: () => Promise<T>,
  config: JitteredBackoffConfig,
  isRetryable: (error: Error) => boolean = () => true
): Promise<T> {
  const backoff = new JitteredExponentialBackoff(config);
  let lastError: Error | undefined;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (!isRetryable(lastError)) {
        throw lastError;
      }

      const delay = backoff.nextDelay();
      if (delay === null) {
        break;
      }

      console.log(
        `Attempt ${attempt + 1}/${config.maxAttempts} failed. ` +
        `Retrying in ${delay}ms (jitter: ${config.jitter})...`
      );

      await sleep(delay);
      backoff.recordWait(delay);
    }
  }

  throw new Error(
    `Operation failed after ${backoff.currentAttempt} attempts ` +
    `(${backoff.totalWait}ms total wait): ${lastError?.message}`
  );
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage examples
async function examples() {
  // User-facing API call: fast retries with full jitter
  const userFacingConfig: JitteredBackoffConfig = {
    baseDelayMs: 50,
    maxDelayMs: 5000,
    multiplier: 2,
    maxAttempts: 4,
    jitter: 'full',
    minDelayMs: 10, // Never less than 10ms
  };

  // Background job: more retries with equal jitter
  const backgroundConfig: JitteredBackoffConfig = {
    baseDelayMs: 500,
    maxDelayMs: 60000,
    multiplier: 2,
    maxAttempts: 10,
    jitter: 'equal',
  };

  // External API: respect their rate limits more carefully
  const externalApiConfig: JitteredBackoffConfig = {
    baseDelayMs: 1000,
    maxDelayMs: 300000, // 5 minutes
    multiplier: 2,
    maxAttempts: 6,
    jitter: 'equal', // More predictable than full jitter
    minDelayMs: 500, // Never hammer the API
  };
}
```

Some implementations seed their random number generators with predictable values (like process ID or current time at startup). If many instances start simultaneously (e.g., during deployment), their "random" sequences may align. Use the unseeded default random source, or ensure seed sources are truly unpredictable.
When services return Retry-After headers, the interaction with jitter requires careful consideration. The server is providing explicit guidance about when retries are welcome, but we still need to prevent synchronized stampedes at the specified time.
The Problem with Exact Retry-After
If 10,000 clients all receive Retry-After: 30 at the same time, and all retry exactly 30 seconds later, we've simply scheduled a thundering herd for T+30 seconds.
Solution: Add Jitter to Retry-After
Treat the Retry-After value as a floor, then add jitter:
actualDelay = retryAfterSeconds + jitter(0, jitterWindow)
The jitter window depends on the Retry-After duration:
The longer the server asks you to wait, the more likely you can spread the load without missing the recovery window.
```typescript
// Combining Retry-After with jitter
interface RetryAfterConfig {
  // Jitter percentage by Retry-After duration tier
  shortWindowJitterPercent: number;  // <= 60s
  mediumWindowJitterPercent: number; // 60s - 300s
  longWindowJitterPercent: number;   // > 300s
  // Caps
  maxAdditionalJitterMs: number;
}

const defaultRetryAfterConfig: RetryAfterConfig = {
  shortWindowJitterPercent: 20,
  mediumWindowJitterPercent: 30,
  longWindowJitterPercent: 50,
  maxAdditionalJitterMs: 60000, // Never add more than 1 minute
};

function calculateDelayWithRetryAfter(
  exponentialDelayMs: number,
  retryAfterMs: number | undefined,
  config: RetryAfterConfig = defaultRetryAfterConfig
): number {
  // If no Retry-After, use standard jittered exponential backoff
  if (!retryAfterMs) {
    return Math.random() * exponentialDelayMs;
  }

  // Use Retry-After as minimum, then add jitter
  const retryAfterSeconds = retryAfterMs / 1000;

  let jitterPercent: number;
  if (retryAfterSeconds <= 60) {
    jitterPercent = config.shortWindowJitterPercent;
  } else if (retryAfterSeconds <= 300) {
    jitterPercent = config.mediumWindowJitterPercent;
  } else {
    jitterPercent = config.longWindowJitterPercent;
  }

  // Calculate jitter to add
  const maxJitter = Math.min(
    retryAfterMs * (jitterPercent / 100),
    config.maxAdditionalJitterMs
  );
  const additionalJitter = Math.random() * maxJitter;

  return retryAfterMs + additionalJitter;
}

// Example: Parse Retry-After header and apply jitter
function parseAndApplyRetryAfter(
  response: Response,
  fallbackDelayMs: number
): number {
  const retryAfterHeader = response.headers.get('Retry-After');

  if (!retryAfterHeader) {
    // No Retry-After, fall back to jittered exponential
    return Math.random() * fallbackDelayMs;
  }

  let retryAfterMs: number;

  // Retry-After can be seconds (integer) or an HTTP date
  const seconds = parseInt(retryAfterHeader, 10);
  if (!isNaN(seconds)) {
    retryAfterMs = seconds * 1000;
  } else {
    // Try parsing as HTTP date
    const date = new Date(retryAfterHeader);
    if (isNaN(date.getTime())) {
      // Invalid format, use fallback
      return Math.random() * fallbackDelayMs;
    }
    retryAfterMs = Math.max(0, date.getTime() - Date.now());
  }

  return calculateDelayWithRetryAfter(fallbackDelayMs, retryAfterMs);
}

// Production retry wrapper with Retry-After support
async function retryWithRetryAfterSupport<T>(
  operation: () => Promise<T>,
  config: {
    baseDelayMs: number;
    maxDelayMs: number;
    multiplier: number;
    maxAttempts: number;
  }
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      // Calculate fallback exponential delay
      const exponentialDelay = config.baseDelayMs *
        Math.pow(config.multiplier, attempt);
      const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

      // Check for Retry-After in error response
      const retryAfterMs = extractRetryAfterFromError(lastError);

      // Calculate actual delay with jitter
      const actualDelay = calculateDelayWithRetryAfter(
        cappedDelay,
        retryAfterMs
      );

      console.log(
        `Attempt ${attempt + 1} failed. ` +
        `Retry-After: ${retryAfterMs ? retryAfterMs + 'ms' : 'none'}. ` +
        `Actual wait: ${Math.round(actualDelay)}ms`
      );

      await sleep(actualDelay);
    }
  }

  throw lastError;
}

function extractRetryAfterFromError(error: Error): number | undefined {
  const response = (error as any).response;
  const header = response?.headers?.['retry-after'];
  if (typeof header === 'string') {
    const seconds = parseInt(header, 10);
    if (!isNaN(seconds)) return seconds * 1000;
  }
  return undefined;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Never retry before the Retry-After time—the server explicitly told you not to. But always add jitter on top to prevent synchronized retry waves at the exact Retry-After moment. Think of Retry-After as "don't retry before this" rather than "retry exactly at this time."
While jitter is generally recommended, its importance varies by scenario. Understanding when jitter is critical helps prioritize implementation efforts.
Jitter Is Critical When:
- Many clients share the same failure trigger (a common dependency, a deployment, a regional outage)
- Client counts are large, so synchronized retries produce significant load spikes
- Clients run identical SDK or retry logic, making their delay formulas deterministic and aligned
- The target service recovers slowly or has little capacity headroom
Jitter Is Less Critical When:
- Only a handful of clients exist, so even synchronized retries stay small
- Client failures are naturally uncorrelated in time
- The target service has ample headroom to absorb a synchronized burst
| System Type | Jitter Priority |
|---|---|
| Mobile app backend | Critical |
| Web application backend | Critical |
| Internal microservices | High |
| Background job system | High |
| Single-instance admin tool | Optional |
| Internal data pipeline | Medium |
| CLI tooling | Low |
| Client Scale | Jitter Priority |
|---|---|
| 1-10 clients | Optional |
| 10-100 clients | Recommended |
| 100-1,000 clients | High |
| 1,000-10,000 clients | Critical |
| 10,000+ clients | Absolutely Critical |
Jitter has minimal downsides (slightly less predictable timing, marginally more complex code) and significant upsides when problems occur. Even in low-priority scenarios, adding jitter is cheap insurance against future scale increases or unexpected correlated failures.
Beyond HTTP request retries, jitter principles apply to many distributed systems scenarios.
Scheduled Job Execution
If thousands of scheduled jobs are configured to run at exactly "midnight," they'll all start simultaneously. Adding jitter spreads the load:
```typescript
// Instead of running exactly at the configured time
const scheduledTime = parseSchedule(config.cronExpression);

// Add jitter proportional to the schedule interval
const jitterWindow = getScheduleInterval(config.cronExpression) * 0.1; // 10%
const jitteredTime = scheduledTime + Math.random() * jitterWindow;
```
Cache Expiration
When many cache entries share the same TTL and were set at similar times, they expire simultaneously, causing a "cache stampede" where all clients hit the database at once:
```typescript
// Bad: all entries expire at exactly the same time
const ttl = 3600; // 1 hour

// Good: jitter the TTL
const baseTTL = 3600;
const jitteredTTL = baseTTL + Math.random() * 600; // 1 hour plus up to 10 extra minutes
```
```typescript
// Jitter applied to various distributed systems contexts

// 1. Scheduled Job Jitter
class JitteredScheduler {
  /**
   * Add jitter to scheduled execution time
   * Spreads jobs configured for the same time across a window
   */
  getJitteredStartTime(
    scheduledTime: Date,
    scheduleIntervalMs: number,
    jitterPercent: number = 10
  ): Date {
    const jitterWindow = scheduleIntervalMs * (jitterPercent / 100);
    const jitterMs = Math.random() * jitterWindow;
    return new Date(scheduledTime.getTime() + jitterMs);
  }
}

// 2. Cache TTL Jitter
class JitteredCache<T> {
  constructor(
    private cache: Map<string, { value: T; expiresAt: number }> = new Map(),
    private jitterPercent: number = 20
  ) {}

  /**
   * Set with jittered TTL to prevent cache stampede
   */
  set(key: string, value: T, baseTTLMs: number): void {
    const jitter = baseTTLMs * (this.jitterPercent / 100);
    const jitteredTTL = baseTTLMs + Math.random() * jitter;
    this.cache.set(key, {
      value,
      expiresAt: Date.now() + jitteredTTL,
    });
  }

  get(key: string): T | undefined {
    const entry = this.cache.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.cache.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// 3. Health Check Intervals
class JitteredHealthChecker {
  private intervalId?: ReturnType<typeof setTimeout>;

  constructor(
    private checkFn: () => Promise<boolean>,
    private baseIntervalMs: number,
    private jitterPercent: number = 30
  ) {}

  start(): void {
    const scheduleNext = () => {
      // Jitter each interval to desynchronize from other checkers
      const jitter = this.baseIntervalMs * (this.jitterPercent / 100);
      const nextInterval = this.baseIntervalMs + Math.random() * jitter;
      this.intervalId = setTimeout(async () => {
        await this.checkFn();
        scheduleNext();
      }, nextInterval);
    };

    // Also jitter initial startup
    const initialDelay = Math.random() * this.baseIntervalMs;
    setTimeout(scheduleNext, initialDelay);
  }

  stop(): void {
    if (this.intervalId) {
      clearTimeout(this.intervalId);
    }
  }
}

// 4. Connection Pool Reconnection
class JitteredConnectionPool {
  private connections: Connection[] = [];

  /**
   * When a connection fails, reconnect with jitter
   * Prevents all failed connections from reconnecting simultaneously
   */
  async handleConnectionFailure(conn: Connection): Promise<void> {
    // Base delay with exponential backoff
    const baseDelay = 1000 * Math.pow(2, conn.reconnectAttempts);
    const cappedDelay = Math.min(baseDelay, 60000);

    // Add full jitter
    const jitteredDelay = Math.random() * cappedDelay;

    console.log(
      `Connection ${conn.id} failed. ` +
      `Reconnecting in ${Math.round(jitteredDelay)}ms`
    );

    await sleep(jitteredDelay);
    await this.reconnect(conn);
  }

  private async reconnect(conn: Connection): Promise<void> {
    // Reconnection logic
  }
}

// 5. Batch Processing Staggering
class JitteredBatchProcessor<T> {
  /**
   * Process batches with jittered start times
   * Prevents synchronized batch processing across workers
   */
  async processBatchesWithJitter(
    batches: T[][],
    processor: (batch: T[]) => Promise<void>,
    maxJitterMs: number = 5000
  ): Promise<void> {
    const promises = batches.map(async (batch, index) => {
      // Stagger starts with jitter
      const jitter = Math.random() * maxJitterMs;
      await sleep(jitter);
      console.log(`Starting batch ${index} after ${Math.round(jitter)}ms jitter`);
      return processor(batch);
    });

    await Promise.all(promises);
  }
}

interface Connection {
  id: string;
  reconnectAttempts: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Whenever you have multiple independent entities doing the same thing at configured times or intervals, add jitter. This applies to scheduled jobs, cache TTLs, health checks, connection renewals, token refresh, DNS lookups, and any periodic operation. The thundering herd problem is universal; the jitter remedy should be too.
Jitter is the essential companion to exponential backoff. While backoff spaces retries over time, jitter prevents synchronized retry storms that can extend outages from seconds to hours.
What's Next:
We've covered when to retry, how to space retries with exponential backoff, and how to desynchronize with jitter. The next page addresses maximum retry attempts—how to determine when to stop retrying and the critical role this plays in maintaining system stability and resource management.
You now understand the thundering herd problem, different jitter strategies and their trade-offs, how to implement jitter correctly, and how to apply jitter principles beyond just retries. Combined with exponential backoff, you can now design retry systems that recover gracefully from failures without amplifying them.