Imagine a concert venue that experiences a brief power outage. The music stops, and 10,000 fans rush to customer service desks to demand refunds. If they all arrive at the same moment—a thundering herd—the desks are immediately overwhelmed, staff can't help anyone, and the situation escalates from inconvenient to chaotic.
Now imagine if those 10,000 fans naturally spread their arrivals over 20 minutes. Some check their tickets first, others discuss with friends, some head to restrooms first. The same number of people, the same requests, but the customer service desks can handle the load because arrivals are distributed over time.
The thundering herd problem in distributed systems is precisely this phenomenon. When a service experiences a transient failure affecting many clients simultaneously, the uniform retry behavior of those clients can create a massive synchronized spike that prevents recovery. Even with exponential backoff, if every client starts from the same moment and uses the same delay formula, they'll all retry at nearly the same times.
Jitter—random variance added to retry delays—is the solution. By introducing randomness, we transform synchronized retry storms into distributed retry flows, giving services the breathing room they need to recover.
By the end of this page, you will understand the thundering herd problem and its impact on distributed systems, master different jitter strategies (full jitter, equal jitter, decorrelated jitter), know how to implement jitter correctly, understand the mathematical basis for jitter effectiveness, and recognize when jitter is critical versus optional.
The thundering herd problem occurs whenever many clients attempt the same action simultaneously, overwhelming the target system. In the context of retries, it manifests when many clients fail at similar times and then retry at similar times.
How Synchronized Retries Form
Consider a scenario without jitter: a shared dependency fails, and every client observes the failure at the same instant, so every client's retry clock starts from the same moment.
Why Exponential Backoff Alone Isn't Enough
Exponential backoff increases delays over time, which helps reduce overall retry pressure. However, it doesn't address synchronization:
| Time (ms) | Client 1 | Client 2 | Client 3 | ... | Client 5000 |
|---|---|---|---|---|---|
| 0 | Fail | Fail | Fail | ... | Fail |
| 100 | Retry | Retry | Retry | ... | Retry |
| 300 | Retry | Retry | Retry | ... | Retry |
| 700 | Retry | Retry | Retry | ... | Retry |
| 1500 | Retry | Retry | Retry | ... | Retry |
Even with backoff, all clients fail at T=0 and compute identical delays (100, 200, 400, 800 ms, ...), so they all retry at the same cumulative times (T=100, 300, 700, 1500 ms). They remain synchronized because the backoff formula is deterministic.
The solution: introduce randomness so each client's retry schedule is unique, spreading the load across time even when failures occur simultaneously.
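The contrast is easy to demonstrate in a few lines. The sketch below is illustrative (the helper names are not from any SDK): 1,000 clients all fail at the same moment, and we compare their fourth-retry delays under deterministic backoff versus full jitter.

```typescript
// Sketch: clients that fail together compute their next retry delay.
// Deterministic backoff gives every client the same delay; full jitter
// draws each client's delay uniformly from [0, exponential delay).

function deterministicDelay(baseMs: number, attempt: number): number {
  return baseMs * Math.pow(2, attempt);
}

function fullJitterDelay(baseMs: number, attempt: number): number {
  return Math.random() * deterministicDelay(baseMs, attempt);
}

const clients = 1000;
const attempt = 3; // exponential delay = 100 * 2^3 = 800ms

const synced = Array.from({ length: clients }, () => deterministicDelay(100, attempt));
const jittered = Array.from({ length: clients }, () => fullJitterDelay(100, attempt));

// All synchronized clients share a single retry moment...
console.log(new Set(synced).size); // 1
// ...while jittered clients spread across the whole 0-800ms window,
// with effectively one distinct retry time per client.
console.log(new Set(jittered).size);
```

With 1,000 identical delays the server sees one spike; with 1,000 distinct delays it sees a roughly uniform trickle across the window, which is exactly the de-synchronization the rest of this page quantifies.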
In 2021, a major cloud provider's configuration change caused a 2-minute authentication service outage. Without jitter, the millions of affected clients created synchronized retry waves that extended the recovery time to over 4 hours. Post-incident analysis showed the authentication service recovered within 5 minutes, but retry storms kept re-failing it. Adding jitter to client SDKs was the primary remediation.
Jitter is random variance applied to retry delays to prevent synchronization. There are several strategies, each with different characteristics.
Full Jitter
The most aggressive approach: the actual delay is a random value between 0 and the calculated exponential delay.
delay = random(0, baseDelay × multiplier^n)
Equal Jitter
A balanced approach: half the delay is guaranteed, half is random.
temp = baseDelay × multiplier^n
delay = temp/2 + random(0, temp/2)
Decorrelated Jitter
As we saw in the previous page, decorrelated jitter bases each delay on the previous delay:
delay = min(maxDelay, random(baseDelay, previousDelay × 3))
| Strategy | Formula | Min Delay | Max Delay | Expected Value |
|---|---|---|---|---|
| No Jitter | 800 | 800ms | 800ms | 800ms |
| Full Jitter | random(0, 800) | 0ms | 800ms | 400ms |
| Equal Jitter | 400 + random(0, 400) | 400ms | 800ms | 600ms |
| Decorrelated | random(100, prev×3) | 100ms | min(prev×3, cap) | varies |
```typescript
// Jitter strategy implementations
type JitterStrategy = 'none' | 'full' | 'equal' | 'decorrelated';

interface JitterConfig {
  strategy: JitterStrategy;
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

class JitterCalculator {
  private previousDelay: number;

  constructor(private config: JitterConfig) {
    this.previousDelay = config.baseDelayMs;
  }

  /**
   * Calculate jittered delay for the given attempt
   */
  calculateDelay(attemptIndex: number): number {
    switch (this.config.strategy) {
      case 'none':
        return this.noJitter(attemptIndex);
      case 'full':
        return this.fullJitter(attemptIndex);
      case 'equal':
        return this.equalJitter(attemptIndex);
      case 'decorrelated':
        return this.decorrelatedJitter();
    }
  }

  /**
   * No jitter: pure exponential backoff
   */
  private noJitter(attemptIndex: number): number {
    const delay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    return Math.min(delay, this.config.maxDelayMs);
  }

  /**
   * Full jitter: random between 0 and exponential delay
   * Most aggressive spread, but may allow very short delays
   */
  private fullJitter(attemptIndex: number): number {
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);
    // Random between 0 and capped delay
    return Math.random() * cappedDelay;
  }

  /**
   * Equal jitter: half guaranteed, half random
   * Balanced approach with guaranteed minimum
   */
  private equalJitter(attemptIndex: number): number {
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, attemptIndex);
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);
    const halfDelay = cappedDelay / 2;
    // Guaranteed half + random half
    return halfDelay + Math.random() * halfDelay;
  }

  /**
   * Decorrelated jitter: based on previous delay
   * Self-regulating growth with natural randomization
   */
  private decorrelatedJitter(): number {
    const minDelay = this.config.baseDelayMs;
    const maxDelay = this.previousDelay * 3;
    // Random between base and 3x previous
    const delay = minDelay + Math.random() * (maxDelay - minDelay);
    // Cap and store for next iteration
    this.previousDelay = Math.min(delay, this.config.maxDelayMs);
    return this.previousDelay;
  }

  /**
   * Reset state (for decorrelated jitter)
   */
  reset(): void {
    this.previousDelay = this.config.baseDelayMs;
  }
}

// Demonstration function
function demonstrateJitterSpread() {
  const config: Omit<JitterConfig, 'strategy'> = {
    baseDelayMs: 100,
    multiplier: 2,
    maxDelayMs: 30000,
  };

  const strategies: JitterStrategy[] = ['none', 'full', 'equal', 'decorrelated'];
  const clientCount = 1000;
  const attemptIndex = 3; // 4th retry attempt, exponential delay = 800ms

  for (const strategy of strategies) {
    const calculator = new JitterCalculator({ ...config, strategy });
    const delays: number[] = [];

    for (let i = 0; i < clientCount; i++) {
      delays.push(calculator.calculateDelay(attemptIndex));
      if (strategy === 'decorrelated') {
        calculator.reset();
      }
    }

    const avgDelay = delays.reduce((a, b) => a + b) / delays.length;
    const minDelay = Math.min(...delays);
    const maxDelay = Math.max(...delays);
    const stdDev = Math.sqrt(
      delays.reduce((sum, d) => sum + Math.pow(d - avgDelay, 2), 0) / delays.length
    );

    console.log(`${strategy}: avg=${avgDelay.toFixed(0)}ms, ` +
      `min=${minDelay.toFixed(0)}ms, max=${maxDelay.toFixed(0)}ms, ` +
      `stddev=${stdDev.toFixed(0)}ms`);
  }
}
```

AWS's analysis of jitter strategies concluded that full jitter provides the best de-synchronization and overall performance for most workloads. While it can produce short delays, the benefits of maximum spread outweigh the occasional fast retry. The AWS SDK uses full jitter by default.
Understanding why jitter works requires examining how retry attempts distribute over time.
Without Jitter: Synchronized Spikes
With deterministic exponential backoff, if N clients fail at time T=0, they all compute identical delay schedules. The load on the server follows a step function, with N requests arriving simultaneously at deterministic intervals. The server sees:
Load at T=100ms: N requests (spike)
Load at T=100-300ms: 0 requests
Load at T=300ms: N requests (spike)
Load at T=300-700ms: 0 requests
...
With Full Jitter: Distributed Load
With full jitter, each client's delay is drawn uniformly at random from the retry window. With N clients and a uniform distribution over a 100 ms first window, the expected arrival rate is approximately N/100 requests per millisecond. The server sees:
Load at T=0-100ms: ~N/100 requests per ms (smooth curve)
Load at T=100-300ms: first retries completing, second retries starting (mixed)
...
Quantifying the Improvement
Let's quantify the peak load difference:
| Metric | No Jitter | Full Jitter | Improvement |
|---|---|---|---|
| Peak requests/ms (N=10,000) | 10,000 (instantaneous) | ~100 | 100x reduction |
| Time to clear first retry | ~1ms (all at once) | ~100ms (spread) | Smooth flow |
| Server headroom needed | Must handle N simultaneous | Handle N/delay | Much lower |
| Recovery opportunity | None (constant spikes) | Between bursts | Continuous |
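The peak-load figures in the table can be reproduced with a short simulation. This is a sketch with illustrative parameters (10,000 clients, a 100 ms first retry window): each client's first retry is dropped into a 1 ms bucket, and we compare the tallest bucket with and without full jitter.

```typescript
// Sketch: peak arrivals per millisecond for N clients whose first retry
// is either exactly at T=100ms (no jitter) or uniform in [0, 100ms) (full jitter).

const N = 10000;
const windowMs = 100;

function peakPerMs(delays: number[]): number {
  const buckets = new Array<number>(windowMs + 1).fill(0);
  for (const d of delays) buckets[Math.min(Math.floor(d), windowMs)]++;
  return Math.max(...buckets);
}

const noJitterArrivals = Array.from({ length: N }, () => windowMs);           // everyone at T=100ms
const fullJitterArrivals = Array.from({ length: N }, () => Math.random() * windowMs);

console.log(peakPerMs(noJitterArrivals));   // 10000: the whole herd lands in one bucket
console.log(peakPerMs(fullJitterArrivals)); // roughly N/window, i.e. on the order of 100
```

The roughly 100x reduction in the table falls straight out: the same number of retries arrives either way, but jitter spreads them over the window instead of concentrating them in a single instant.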
The Key Insight
Jitter transforms retry behavior from discrete, synchronized events into continuous, distributed flows. Instead of spike → idle → spike → idle, the server sees steady (though elevated) traffic that it can handle sustainably.
Standard Deviation as Spread Metric
The effectiveness of jitter can be measured by the standard deviation of retry times:
When many clients use jitter, the aggregate retry load approximates a smooth continuous distribution (by the Central Limit Theorem). Even if individual client behavior is random, the aggregate becomes predictable and manageable. This is why jitter becomes more important as scale increases—more clients means smoother aggregate behavior with jitter, but also larger synchronized spikes without it.
Correct jitter implementation requires attention to randomness quality, integration with backoff logic, and handling of edge cases.
```typescript
// Production-ready jittered backoff implementation
interface JitteredBackoffConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
  maxAttempts: number;
  jitter: 'none' | 'full' | 'equal';
  // Optional: minimum delay regardless of jitter
  minDelayMs?: number;
}

class JitteredExponentialBackoff {
  private attempt: number = 0;
  private totalWaitMs: number = 0;

  constructor(private config: JitteredBackoffConfig) {}

  /**
   * Get the next delay with jitter applied
   * Returns null if max attempts exceeded
   */
  nextDelay(): number | null {
    if (this.attempt >= this.config.maxAttempts) {
      return null;
    }

    // Calculate base exponential delay
    const exponentialDelay = this.config.baseDelayMs *
      Math.pow(this.config.multiplier, this.attempt);

    // Apply cap
    const cappedDelay = Math.min(exponentialDelay, this.config.maxDelayMs);

    // Apply jitter
    let jitteredDelay: number;
    switch (this.config.jitter) {
      case 'full':
        jitteredDelay = Math.random() * cappedDelay;
        break;
      case 'equal':
        jitteredDelay = (cappedDelay / 2) + Math.random() * (cappedDelay / 2);
        break;
      case 'none':
      default:
        jitteredDelay = cappedDelay;
    }

    // Apply minimum if configured
    if (this.config.minDelayMs) {
      jitteredDelay = Math.max(jitteredDelay, this.config.minDelayMs);
    }

    this.attempt++;
    return Math.round(jitteredDelay);
  }

  recordWait(delayMs: number): void {
    this.totalWaitMs += delayMs;
  }

  get currentAttempt(): number {
    return this.attempt;
  }

  get totalWait(): number {
    return this.totalWaitMs;
  }

  reset(): void {
    this.attempt = 0;
    this.totalWaitMs = 0;
  }
}

/**
 * Higher-order function for retrying with jittered backoff
 */
async function withJitteredRetry<T>(
  operation: () => Promise<T>,
  config: JitteredBackoffConfig,
  isRetryable: (error: Error) => boolean = () => true
): Promise<T> {
  const backoff = new JitteredExponentialBackoff(config);
  let lastError: Error | undefined;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (!isRetryable(lastError)) {
        throw lastError;
      }

      const delay = backoff.nextDelay();
      if (delay === null) {
        break;
      }

      console.log(
        `Attempt ${attempt + 1}/${config.maxAttempts} failed. ` +
        `Retrying in ${delay}ms (jitter: ${config.jitter})...`
      );

      await sleep(delay);
      backoff.recordWait(delay);
    }
  }

  throw new Error(
    `Operation failed after ${backoff.currentAttempt} attempts ` +
    `(${backoff.totalWait}ms total wait): ${lastError?.message}`
  );
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage examples
async function examples() {
  // User-facing API call: fast retries with full jitter
  const userFacingConfig: JitteredBackoffConfig = {
    baseDelayMs: 50,
    maxDelayMs: 5000,
    multiplier: 2,
    maxAttempts: 4,
    jitter: 'full',
    minDelayMs: 10, // Never less than 10ms
  };

  // Background job: more retries with equal jitter
  const backgroundConfig: JitteredBackoffConfig = {
    baseDelayMs: 500,
    maxDelayMs: 60000,
    multiplier: 2,
    maxAttempts: 10,
    jitter: 'equal',
  };

  // External API: respect their rate limits more carefully
  const externalApiConfig: JitteredBackoffConfig = {
    baseDelayMs: 1000,
    maxDelayMs: 300000, // 5 minutes
    multiplier: 2,
    maxAttempts: 6,
    jitter: 'equal', // More predictable than full jitter
    minDelayMs: 500, // Never hammer the API
  };
}
```

Some implementations seed their random number generators with predictable values (like process ID or current time at startup). If many instances start simultaneously (e.g., during deployment), their "random" sequences may align. Use the unseeded default random source, or ensure seed sources are truly unpredictable.
When services return Retry-After headers, the interaction with jitter requires careful consideration. The server is providing explicit guidance about when retries are welcome, but we still need to prevent synchronized stampedes at the specified time.
The Problem with Exact Retry-After
If 10,000 clients all receive Retry-After: 30 at the same time, and all retry exactly 30 seconds later, we've simply scheduled a thundering herd for T+30 seconds.
Solution: Add Jitter to Retry-After
Treat the Retry-After value as a floor, then add jitter:
actualDelay = retryAfterSeconds + jitter(0, jitterWindow)
The jitter window depends on the Retry-After duration:
The longer the server asks you to wait, the more likely you can spread the load without missing the recovery window.
```typescript
// Combining Retry-After with jitter
interface RetryAfterConfig {
  // Jitter percentage by Retry-After duration tier
  shortWindowJitterPercent: number;  // <= 60s
  mediumWindowJitterPercent: number; // 60s - 300s
  longWindowJitterPercent: number;   // > 300s
  // Caps
  maxAdditionalJitterMs: number;
}

const defaultRetryAfterConfig: RetryAfterConfig = {
  shortWindowJitterPercent: 20,
  mediumWindowJitterPercent: 30,
  longWindowJitterPercent: 50,
  maxAdditionalJitterMs: 60000, // Never add more than 1 minute
};

function calculateDelayWithRetryAfter(
  exponentialDelayMs: number,
  retryAfterMs: number | undefined,
  config: RetryAfterConfig = defaultRetryAfterConfig
): number {
  // If no Retry-After, use standard jittered exponential backoff
  if (!retryAfterMs) {
    return Math.random() * exponentialDelayMs;
  }

  // Use Retry-After as minimum, then add jitter
  const retryAfterSeconds = retryAfterMs / 1000;

  let jitterPercent: number;
  if (retryAfterSeconds <= 60) {
    jitterPercent = config.shortWindowJitterPercent;
  } else if (retryAfterSeconds <= 300) {
    jitterPercent = config.mediumWindowJitterPercent;
  } else {
    jitterPercent = config.longWindowJitterPercent;
  }

  // Calculate jitter to add
  const maxJitter = Math.min(
    retryAfterMs * (jitterPercent / 100),
    config.maxAdditionalJitterMs
  );
  const additionalJitter = Math.random() * maxJitter;

  return retryAfterMs + additionalJitter;
}

// Example: Parse Retry-After header and apply jitter
function parseAndApplyRetryAfter(
  response: Response,
  fallbackDelayMs: number
): number {
  const retryAfterHeader = response.headers.get('Retry-After');

  if (!retryAfterHeader) {
    // No Retry-After, fall back to jittered exponential
    return Math.random() * fallbackDelayMs;
  }

  let retryAfterMs: number;

  // Retry-After can be seconds (integer) or an HTTP date
  const seconds = parseInt(retryAfterHeader, 10);
  if (!isNaN(seconds)) {
    retryAfterMs = seconds * 1000;
  } else {
    // Try parsing as HTTP date
    const date = new Date(retryAfterHeader);
    if (isNaN(date.getTime())) {
      // Invalid format, use fallback
      return Math.random() * fallbackDelayMs;
    }
    retryAfterMs = Math.max(0, date.getTime() - Date.now());
  }

  return calculateDelayWithRetryAfter(fallbackDelayMs, retryAfterMs);
}

// Production retry wrapper with Retry-After support
async function retryWithRetryAfterSupport<T>(
  operation: () => Promise<T>,
  config: {
    baseDelayMs: number;
    maxDelayMs: number;
    multiplier: number;
    maxAttempts: number;
  }
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      // Calculate fallback exponential delay
      const exponentialDelay = config.baseDelayMs *
        Math.pow(config.multiplier, attempt);
      const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

      // Check for Retry-After in error response
      const retryAfterMs = extractRetryAfterFromError(lastError);

      // Calculate actual delay with jitter
      const actualDelay = calculateDelayWithRetryAfter(
        cappedDelay,
        retryAfterMs
      );

      console.log(
        `Attempt ${attempt + 1} failed. ` +
        `Retry-After: ${retryAfterMs ? retryAfterMs + 'ms' : 'none'}. ` +
        `Actual wait: ${Math.round(actualDelay)}ms`
      );

      await sleep(actualDelay);
    }
  }

  throw lastError;
}

function extractRetryAfterFromError(error: Error): number | undefined {
  const response = (error as any).response;
  const header = response?.headers?.['retry-after'];
  if (typeof header === 'string') {
    const seconds = parseInt(header, 10);
    if (!isNaN(seconds)) return seconds * 1000;
  }
  return undefined;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Never retry before the Retry-After time—the server explicitly told you not to. But always add jitter on top to prevent synchronized retry waves at the exact Retry-After moment. Think of Retry-After as "don't retry before this" rather than "retry exactly at this time."
While jitter is generally recommended, its importance varies by scenario. Understanding when jitter is critical helps prioritize implementation efforts.
Jitter Is Critical When:
- Many clients share the same failure trigger (a common dependency, a deployment, a regional outage)
- Client counts are large, so synchronized retries produce significant load spikes
- Clients run identical SDK or retry logic, making their delay formulas deterministic and aligned
- The target service recovers slowly or has little capacity headroom
Jitter Is Less Critical When:
- Only a handful of clients exist, so even synchronized retries stay small
- Client failures are naturally uncorrelated in time
- The target service has ample headroom to absorb a synchronized burst
| System Type | Jitter Priority |
|---|---|
| Mobile app backend | Critical |
| Web application backend | Critical |
| Internal microservices | High |
| Background job system | High |
| Single-instance admin tool | Optional |
| Internal data pipeline | Medium |
| CLI tooling | Low |
| Client Scale | Jitter Priority |
|---|---|
| 1-10 clients | Optional |
| 10-100 clients | Recommended |
| 100-1,000 clients | High |
| 1,000-10,000 clients | Critical |
| 10,000+ clients | Absolutely Critical |
Jitter has minimal downsides (slightly less predictable timing, marginally more complex code) and significant upsides when problems occur. Even in low-priority scenarios, adding jitter is cheap insurance against future scale increases or unexpected correlated failures.
Beyond HTTP request retries, jitter principles apply to many distributed systems scenarios.
Scheduled Job Execution
If thousands of scheduled jobs are configured to run at exactly "midnight," they'll all start simultaneously. Adding jitter spreads the load:
```typescript
// Instead of running exactly at the configured time
const scheduledTime = parseSchedule(config.cronExpression);

// Add jitter proportional to the schedule interval
const jitterWindow = getScheduleInterval(config.cronExpression) * 0.1; // 10%
const jitteredTime = scheduledTime + Math.random() * jitterWindow;
```
Cache Expiration
When many cache entries share the same TTL and were set at similar times, they expire simultaneously, causing a "cache stampede" where all clients hit the database at once:
```typescript
// Bad: all entries expire at exactly the same time
const ttl = 3600; // 1 hour

// Good: jitter the TTL
const baseTTL = 3600;
const jitteredTTL = baseTTL + Math.random() * 600; // 1 hour plus up to 10 extra minutes
```
```typescript
// Jitter applied to various distributed systems contexts

// 1. Scheduled Job Jitter
class JitteredScheduler {
  /**
   * Add jitter to scheduled execution time
   * Spreads jobs configured for the same time across a window
   */
  getJitteredStartTime(
    scheduledTime: Date,
    scheduleIntervalMs: number,
    jitterPercent: number = 10
  ): Date {
    const jitterWindow = scheduleIntervalMs * (jitterPercent / 100);
    const jitterMs = Math.random() * jitterWindow;
    return new Date(scheduledTime.getTime() + jitterMs);
  }
}

// 2. Cache TTL Jitter
class JitteredCache<T> {
  constructor(
    private cache: Map<string, { value: T; expiresAt: number }> = new Map(),
    private jitterPercent: number = 20
  ) {}

  /**
   * Set with jittered TTL to prevent cache stampede
   */
  set(key: string, value: T, baseTTLMs: number): void {
    const jitter = baseTTLMs * (this.jitterPercent / 100);
    const jitteredTTL = baseTTLMs + Math.random() * jitter;
    this.cache.set(key, {
      value,
      expiresAt: Date.now() + jitteredTTL,
    });
  }

  get(key: string): T | undefined {
    const entry = this.cache.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.cache.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// 3. Health Check Intervals
class JitteredHealthChecker {
  private intervalId?: ReturnType<typeof setTimeout>;

  constructor(
    private checkFn: () => Promise<boolean>,
    private baseIntervalMs: number,
    private jitterPercent: number = 30
  ) {}

  start(): void {
    const scheduleNext = () => {
      // Jitter each interval to desynchronize from other checkers
      const jitter = this.baseIntervalMs * (this.jitterPercent / 100);
      const nextInterval = this.baseIntervalMs + Math.random() * jitter;
      this.intervalId = setTimeout(async () => {
        await this.checkFn();
        scheduleNext();
      }, nextInterval);
    };

    // Also jitter initial startup
    const initialDelay = Math.random() * this.baseIntervalMs;
    setTimeout(scheduleNext, initialDelay);
  }

  stop(): void {
    if (this.intervalId) {
      clearTimeout(this.intervalId);
    }
  }
}

// 4. Connection Pool Reconnection
class JitteredConnectionPool {
  private connections: Connection[] = [];

  /**
   * When a connection fails, reconnect with jitter
   * Prevents all failed connections from reconnecting simultaneously
   */
  async handleConnectionFailure(conn: Connection): Promise<void> {
    // Base delay with exponential backoff
    const baseDelay = 1000 * Math.pow(2, conn.reconnectAttempts);
    const cappedDelay = Math.min(baseDelay, 60000);

    // Add full jitter
    const jitteredDelay = Math.random() * cappedDelay;

    console.log(
      `Connection ${conn.id} failed. ` +
      `Reconnecting in ${Math.round(jitteredDelay)}ms`
    );

    await sleep(jitteredDelay);
    await this.reconnect(conn);
  }

  private async reconnect(conn: Connection): Promise<void> {
    // Reconnection logic
  }
}

// 5. Batch Processing Staggering
class JitteredBatchProcessor<T> {
  /**
   * Process batches with jittered start times
   * Prevents synchronized batch processing across workers
   */
  async processBatchesWithJitter(
    batches: T[][],
    processor: (batch: T[]) => Promise<void>,
    maxJitterMs: number = 5000
  ): Promise<void> {
    const promises = batches.map(async (batch, index) => {
      // Stagger starts with jitter
      const jitter = Math.random() * maxJitterMs;
      await sleep(jitter);
      console.log(`Starting batch ${index} after ${Math.round(jitter)}ms jitter`);
      return processor(batch);
    });

    await Promise.all(promises);
  }
}

interface Connection {
  id: string;
  reconnectAttempts: number;
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Whenever you have multiple independent entities doing the same thing at configured times or intervals, add jitter. This applies to scheduled jobs, cache TTLs, health checks, connection renewals, token refresh, DNS lookups, and any periodic operation. The thundering herd problem is universal; the jitter remedy should be too.
Jitter is the essential companion to exponential backoff. While backoff spaces retries over time, jitter prevents synchronized retry storms that can extend outages from seconds to hours.
What's Next:
We've covered when to retry, how to space retries with exponential backoff, and how to desynchronize with jitter. The next page addresses maximum retry attempts—how to determine when to stop retrying and the critical role this plays in maintaining system stability and resource management.
You now understand the thundering herd problem, different jitter strategies and their trade-offs, how to implement jitter correctly, and how to apply jitter principles beyond just retries. Combined with exponential backoff, you can now design retry systems that recover gracefully from failures without amplifying them.