Retry Strategies in Distributed Systems

Level: Intermediate

Duration: 75 mins

Topic: Retry Strategies

Retry Budgets

The Limits of Individual Retries

We've learned when to retry, how to space retries with exponential backoff, and how to add jitter to prevent thundering herds. These are powerful techniques for individual requests.

But they share a dangerous assumption: that retrying is always beneficial if the failure seems transient.

Consider this scenario: A downstream service is running at 98% capacity—healthy, but near its limit. 2% of requests fail due to load. Each of those 2% is retried. Now we have 102% of normal load. More failures occur. More retries. 104%. 106%. The system cascades to failure.

The problem: Each individual retry decision was correct. The collective impact was catastrophic.

The solution: Retry budgets—system-level limits that constrain total retry volume, preventing well-intentioned retries from becoming the final straw.

What You Will Master

By the end of this page, you will understand retry amplification mathematics, how retry budgets work, different budget strategies (percentage-based, token bucket, circuit-based), implementation patterns, and how major platforms like Google and Netflix use retry budgets to maintain system stability.

Retry Amplification: The Hidden Multiplier

Retry amplification is the phenomenon where retries increase system load during failures, making recovery harder or impossible. Understanding its mathematics is essential for designing robust systems.

Basic amplification formula:

retry-amplification.ts
/**
 * Retry Amplification Mathematics
 * 
 * When failures occur, retries increase total load.
 * This increased load can cause more failures, creating a vicious cycle.
 */
 
interface AmplificationResult {
  initialLoad: number;
  failureRate: number;
  retryCount: number;
  amplifiedLoad: number;
  amplificationFactor: number;
  sustainable: boolean;
}
 
/**
 * Calculate the amplified load when all failed requests are retried.
 * 
 * Without any limiting:
 * amplifiedLoad = initialLoad × (1 + failureRate × retryCount)
 */
function calculateAmplification(
  initialLoad: number,
  failureRate: number,  // 0.0 to 1.0
  retryCount: number,
  maxCapacity: number
): AmplificationResult {
  // First-order amplification (simple model)
  const amplifiedLoad = initialLoad * (1 + failureRate * retryCount);
  const amplificationFactor = amplifiedLoad / initialLoad;
  
  return {
    initialLoad,
    failureRate,
    retryCount,
    amplifiedLoad,
    amplificationFactor,
    sustainable: amplifiedLoad <= maxCapacity,
  };
}
 
// Scenario: Service handles 10,000 req/s, max capacity 12,000 req/s
const capacity = 12000;
 
console.log("Retry Amplification Scenarios");
console.log("System capacity: 12,000 req/s");
console.log("Normal load: 10,000 req/s (83% utilization)");
console.log("=========================================\n");
 
// Scenario 1: Healthy system (1% failure rate)
const healthy = calculateAmplification(10000, 0.01, 3, capacity);
console.log("Healthy (1% failure, 3 retries):");
console.log(`  Amplified load: ${healthy.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${healthy.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${healthy.sustainable}\n`);
// Output: 10,300 req/s (1.03x) - Sustainable ✓
 
// Scenario 2: Under stress (10% failure rate)
const stressed = calculateAmplification(10000, 0.10, 3, capacity);
console.log("Stressed (10% failure, 3 retries):");
console.log(`  Amplified load: ${stressed.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${stressed.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${stressed.sustainable}\n`);
// Output: 13,000 req/s (1.30x) - UNSUSTAINABLE! Exceeds capacity!
 
// Scenario 3: Same stress with retry budget (50% of failures retried)
const withBudget = calculateAmplification(10000, 0.10, 3 * 0.5, capacity);
console.log("Stressed with 50% retry budget:");
console.log(`  Amplified load: ${withBudget.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${withBudget.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${withBudget.sustainable}\n`);
// Output: 11,500 req/s (1.15x) - Sustainable with budget ✓
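
The formula above charges every failed request the full retryCount, which makes it a pessimistic bound. A slightly finer sketch, assuming each attempt fails independently with the same probability and a request stops retrying once it succeeds, gives the expected attempts per request as a geometric sum. The name `expectedAttempts` is illustrative and not part of the listing above.

```typescript
/**
 * Expected attempts per request when each attempt fails independently
 * with probability `failureRate` and at most `maxRetries` retries follow
 * the first attempt.
 * Assumption: the failure rate does not change with load.
 *
 * attempts = 1 + f + f^2 + ... + f^maxRetries
 */
function expectedAttempts(failureRate: number, maxRetries: number): number {
  let attempts = 0;
  let probabilityReached = 1; // chance that attempt k is needed at all
  for (let k = 0; k <= maxRetries; k++) {
    attempts += probabilityReached;
    probabilityReached *= failureRate;
  }
  return attempts;
}

const simpleBound = 1 + 0.10 * 3;            // 1.30x from the model above
const geometric = expectedAttempts(0.10, 3); // about 1.111x
console.log(simpleBound.toFixed(3), geometric.toFixed(3));
```

Under these assumptions the two figures agree at the extremes (no failures, or 100% failures, where both give 1 + retryCount) and differ most in between.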

The cascading feedback loop:

The simple model above assumes failure rate stays constant. In reality, increased load from retries increases failure rate, which triggers more retries:

cascading-failure.ts
/**
 * Cascading Failure Simulation
 * 
 * Models how retries can amplify failures in a feedback loop,
 * turning minor overload into complete system failure.
 */
 
interface SystemState {
  load: number;
  capacity: number;
  failureRate: number;
  retriesInFlight: number;
}
 
function simulateSeconds(
  initialState: SystemState,
  retryPolicy: { maxRetries: number; retryAfterSeconds: number },
  seconds: number
): SystemState[] {
  const history: SystemState[] = [initialState];
  let state = { ...initialState };
  
  // Pending retries by arrival time
  const pendingRetries: number[] = [];
  
  for (let t = 1; t <= seconds; t++) {
    // Retries scheduled in earlier seconds arrive on top of baseline traffic.
    // Load never accumulates: each second starts from the baseline again.
    const arrivingRetries = pendingRetries.shift() || 0;
    const newLoad = initialState.load + arrivingRetries;
    
    // Failure rate depends on the load actually offered this second.
    // Simple model: linear increase above 80% capacity
    const utilizationRatio = newLoad / state.capacity;
    let failureRate = 0;
    if (utilizationRatio > 0.8) {
      failureRate = Math.min(1, (utilizationRatio - 0.8) * 5);
    }
    
    // Calculate failures at current load
    const failures = newLoad * failureRate;
    
    // Schedule retries (with limit)
    const retriesToSchedule = failures * retryPolicy.maxRetries;
    if (pendingRetries.length < retryPolicy.retryAfterSeconds) {
      for (let i = pendingRetries.length; i < retryPolicy.retryAfterSeconds; i++) {
        pendingRetries.push(0);
      }
    }
    
    // Distribute retries over time (simplified)
    for (let r = 0; r < retryPolicy.retryAfterSeconds && r < pendingRetries.length; r++) {
      pendingRetries[r] += retriesToSchedule / retryPolicy.retryAfterSeconds;
    }
    
    // Record a fresh object so earlier history entries are never mutated
    state = {
      load: newLoad,
      capacity: state.capacity,
      failureRate,
      retriesInFlight: pendingRetries.reduce((a, b) => a + b, 0),
    };
    
    history.push(state);
  }
  
  return history;
}
 
// Simulate a system under sudden load spike
const initial: SystemState = {
  load: 8000,  // Normal load: 80% of capacity
  capacity: 10000,
  failureRate: 0,
  retriesInFlight: 0,
};
 
// Spike to 95% load
const spikedInitial = { ...initial, load: 9500 };
 
console.log("Cascading Failure Simulation");
console.log("System capacity: 10,000 req/s");
console.log("Initial spike: 9,500 req/s (95% utilization)");
console.log("=========================================\n");
 
const withRetries = simulateSeconds(
  spikedInitial,
  { maxRetries: 3, retryAfterSeconds: 1 },
  10
);
 
console.log("With unlimited retries:");
withRetries.slice(0, 6).forEach((state, t) => {
  console.log(`  t=${t}s: load=${state.load.toFixed(0)}, failure=${(state.failureRate * 100).toFixed(0)}%, pending=${state.retriesInFlight.toFixed(0)}`);
});
 
// Key insight: The failure rate and pending retries escalate rapidly
// What started at 95% utilization, still under capacity, becomes catastrophic failure

The Amplification Cascade

In the worst case, retry amplification creates a feedback loop: failures cause retries, retries increase load, and increased load causes more failures. Without budgets, this loop continues until the system fails completely or all requests time out.
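
The loop can be sketched as a repeated recalculation of load. The snippet below reuses the illustrative failure model from the simulation (no failures up to 80% utilization, then a linear climb) and assumes every failure is retried `retriesPerFailure` times. The function names and numbers are hypothetical.

```typescript
// Illustrative failure model: 0% failures up to 80% utilization,
// rising linearly to 100% failures at about 100% utilization.
function failureRateAt(load: number, capacity: number): number {
  const utilization = load / capacity;
  return Math.min(1, Math.max(0, (utilization - 0.8) * 5));
}

// Each round, the failure rate produced by the previous round's load
// decides how many retries are added on top of the base load.
function iterateLoad(
  baseLoad: number,
  capacity: number,
  retriesPerFailure: number,
  rounds: number
): number[] {
  const loads: number[] = [baseLoad];
  let current = baseLoad;
  for (let i = 0; i < rounds; i++) {
    const failureRate = failureRateAt(current, capacity);
    current = baseLoad * (1 + failureRate * retriesPerFailure);
    loads.push(current);
  }
  return loads;
}

const calmLoads = iterateLoad(7000, 10000, 3, 5);
const spikedLoads = iterateLoad(9500, 10000, 3, 5);
console.log(calmLoads.map((l) => Math.round(l)));
console.log(spikedLoads.map((l) => Math.round(l)));
```

With these numbers the 70% case never moves, while the 95% case climbs until it hits the ceiling of 1 + retriesPerFailure times the base load.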

What Is a Retry Budget?

A retry budget is a mechanism that limits the total number of retries a client or system can issue over a time window. Instead of allowing unlimited retries per request, the budget constrains retry volume as a fraction of successful requests.

The core principle:

"You may only retry if you have budget remaining. Budget is earned through successful requests and spent on retries."

Retry Budget Properties

•Budget accrual: Successful requests add to the budget (typically a small fraction, like 10%).
•Budget consumption: Each retry consumes budget (typically 1 unit per retry).
•Budget cap: Maximum budget is capped to prevent accumulation during long healthy periods.
•Budget exhaustion: When budget reaches zero, no more retries are allowed—fail fast.
•Auto-recovery: As the system stabilizes and successes resume, budget naturally rebuilds.

Example: 10% retry budget

With a 10% retry budget:

•Every 10 successful requests earn 1 retry credit.
•If 1% of requests fail, the budget earned is roughly 10× what retries need (plenty).
•If 50% of requests fail, the budget drains quickly and most retries are denied.
•If 100% of requests fail, the budget runs out once the initial buffer is spent, preventing amplification.
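
The arithmetic behind these cases can be sketched as a steady-state ratio of credits earned to credits wanted. This is a simplification: it ignores the initial buffer and the outcomes of the retries themselves, and the name `retryableFraction` is illustrative.

```typescript
/**
 * Fraction of failed requests that a percentage-based budget can afford
 * to retry in steady state.
 * Assumptions: first attempts fail with probability `failureRate`, each
 * success earns `budgetRatio` credits, and each retry costs one credit.
 */
function retryableFraction(failureRate: number, budgetRatio: number): number {
  if (failureRate <= 0) return 1; // nothing fails, so nothing is denied
  const creditsEarnedPerRequest = budgetRatio * (1 - failureRate);
  const creditsWantedPerRequest = failureRate;
  return Math.min(1, creditsEarnedPerRequest / creditsWantedPerRequest);
}

for (const failureRate of [0.01, 0.5, 1.0]) {
  const fraction = retryableFraction(failureRate, 0.10);
  console.log(
    `failure ${(failureRate * 100).toFixed(0)}% -> ` +
    `retry ${(fraction * 100).toFixed(0)}% of failures`
  );
}
```

Under these assumptions a 1% failure rate leaves every failure retryable, a 50% failure rate allows about one retry per ten failures, and a 100% failure rate allows none once the buffer is gone.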

Google's SRE Recommendation

Google's Site Reliability Engineering book describes a per-client retry budget: a request is retried only while retries make up less than 10% of that client's requests. In other words, out of every 1,000 requests a client sends, at most about 100 may be retries. This is a reasonable starting point for most systems.
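
One way to sketch a ratio-based check of this kind is to count all attempts and all retries, and allow a retry only if the ratio stays within the limit afterwards. This is a hypothetical illustration, not Google's implementation. The class name and methods are assumptions, and the counters here never expire, whereas a real client would track a recent window.

```typescript
class RatioRetryBudget {
  private requests = 0; // every attempt sent, including retries
  private retries = 0;  // attempts that were retries
  private readonly maxRetryRatio: number;

  constructor(maxRetryRatio: number = 0.10) {
    this.maxRetryRatio = maxRetryRatio;
  }

  /** Call once for each original (non-retry) attempt. */
  recordFirstAttempt(): void {
    this.requests++;
  }

  /** Allow the retry only if retries / requests stays within the limit. */
  tryRetry(): boolean {
    const ratioAfter = (this.retries + 1) / (this.requests + 1);
    if (ratioAfter > this.maxRetryRatio) {
      return false;
    }
    this.retries++;
    this.requests++;
    return true;
  }

  getRatio(): number {
    return this.requests === 0 ? 0 : this.retries / this.requests;
  }
}
```

Because the retry itself counts as a request, the ratio can approach the limit but never exceed it, and a client that has sent no traffic yet is denied retries.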

retry-budget-concept.ts
/**
 * Conceptual Retry Budget
 * 
 * Demonstrates the basic mechanics of a percentage-based retry budget.
 */
 
interface BudgetState {
  available: number;        // Current budget balance
  maxBudget: number;        // Maximum budget cap
  totalSuccesses: number;   // Successes in current window
  totalRetries: number;     // Retries spent in current window
}
 
class ConceptualRetryBudget {
  private available: number;
  private readonly budgetRatio: number;  // e.g., 0.10 for 10%
  private readonly maxBudget: number;    // Upper cap
  
  constructor(budgetRatio: number = 0.10, maxBudget: number = 100) {
    this.budgetRatio = budgetRatio;
    this.maxBudget = maxBudget;
    this.available = maxBudget / 2;  // Start with some initial budget
  }
  
  /**
   * Record a successful request. This earns retry budget.
   */
  recordSuccess(): void {
    // Each success earns a fraction of a retry credit
    this.available = Math.min(
      this.available + this.budgetRatio,
      this.maxBudget
    );
  }
  
  /**
   * Check if we can afford a retry.
   */
  canRetry(): boolean {
    return this.available >= 1.0;
  }
  
  /**
   * Consume budget for a retry. Returns true if retry was allowed.
   */
  consumeForRetry(): boolean {
    if (!this.canRetry()) {
      return false;  // No budget, cannot retry
    }
    
    this.available -= 1.0;
    return true;
  }
  
  /**
   * Get current budget state for monitoring.
   */
  getState(): { available: number; maxBudget: number; percentFull: number } {
    return {
      available: this.available,
      maxBudget: this.maxBudget,
      percentFull: (this.available / this.maxBudget) * 100,
    };
  }
}
 
// Demonstration
const budget = new ConceptualRetryBudget(0.10, 100);
 
console.log("Retry Budget Demonstration");
console.log("Budget ratio: 10% (1 retry credit per 10 successes)");
console.log("========================================\n");
 
// Simulate healthy traffic: 100 requests, 2% failure
console.log("Scenario 1: Healthy traffic (2% failure rate)");
for (let i = 0; i < 100; i++) {
  if (Math.random() < 0.02) {
    // Failure - try to retry
    const couldRetry = budget.consumeForRetry();
    console.log(`  Request ${i}: FAILED - Retry ${couldRetry ? "ALLOWED" : "DENIED"}`);
  } else {
    // Success - earn budget
    budget.recordSuccess();
  }
}
console.log(`  Final budget: ${budget.getState().available.toFixed(1)} / ${budget.getState().maxBudget}\n`);
 
// Reset and simulate unhealthy traffic
const budget2 = new ConceptualRetryBudget(0.10, 100);
console.log("Scenario 2: Unhealthy traffic (50% failure rate)");
let retriesAllowed = 0;
let retriesDenied = 0;
 
for (let i = 0; i < 100; i++) {
  if (Math.random() < 0.50) {
    // High failure rate
    if (budget2.consumeForRetry()) {
      retriesAllowed++;
    } else {
      retriesDenied++;
    }
  } else {
    budget2.recordSuccess();
  }
}
console.log(`  Retries allowed: ${retriesAllowed}`);
console.log(`  Retries denied: ${retriesDenied}`);
console.log(`  Final budget: ${budget2.getState().available.toFixed(1)}\n`);
 
// Key insight: Under high failure, budget exhausts and denies retries,
// preventing amplification even though each individual retry seems reasonable

Retry Budget Strategies

There are several approaches to implementing retry budgets, each with different characteristics. The right choice depends on your system's traffic patterns and reliability requirements.

Retry Budget Strategy Comparison
Strategy	Mechanism	Pros	Cons	Best For
Percentage-Based	Retries limited to X% of successes	Simple, proportional, self-adjusting	Needs success tracking	General purpose
Token Bucket	Fixed token regeneration rate, consumed by retries	Smooth, familiar pattern	Requires tuning regen rate	Steady traffic
Sliding Window	Track retry/request ratio in time window	Accurate recent view	Memory for window	Variable traffic
Circuit-Breaker Hybrid	Budget + circuit breaker integration	Best protection	More complex	Critical paths
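
The token bucket row notes that the regeneration rate needs tuning. A rough starting point, assuming normal traffic is reasonably stable, is to refill at the target retry ratio times the expected request rate. The helper below and its `burstSeconds` input are illustrative assumptions rather than an established rule.

```typescript
/**
 * Suggest token-bucket parameters from a target retry ratio.
 * Assumptions: `expectedRequestsPerSecond` is a stable estimate of normal
 * traffic, and `burstSeconds` is how many seconds of retry allowance the
 * bucket may hold at once.
 */
function sizeTokenBucket(
  expectedRequestsPerSecond: number,
  targetRetryRatio: number,
  burstSeconds: number
): { refillRatePerSecond: number; maxTokens: number } {
  const refillRatePerSecond = expectedRequestsPerSecond * targetRetryRatio;
  // Keep at least one token so a single retry is always possible when full.
  const maxTokens = Math.max(1, Math.ceil(refillRatePerSecond * burstSeconds));
  return { refillRatePerSecond, maxTokens };
}

// 200 req/s with a 10% target and a 5-second burst allowance
const sizing = sizeTokenBucket(200, 0.10, 5);
console.log(sizing); // refill of 20 tokens/s, cap of 100 tokens
```

Unlike a percentage-based budget, these values do not follow traffic automatically, so they need revisiting whenever the request rate changes materially.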

budget-strategies.ts
/**
 * Multiple Retry Budget Strategy Implementations
 */
 
// =========================================
// Strategy 1: Token Bucket Budget
// =========================================
class TokenBucketRetryBudget {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRatePerSecond: number;
  private lastRefillTime: number;
  
  constructor(maxTokens: number = 10, refillRatePerSecond: number = 1) {
    this.tokens = maxTokens;  // Start full
    this.maxTokens = maxTokens;
    this.refillRatePerSecond = refillRatePerSecond;
    this.lastRefillTime = Date.now();
  }
  
  private refill(): void {
    const now = Date.now();
    const secondsElapsed = (now - this.lastRefillTime) / 1000;
    
    const tokensToAdd = secondsElapsed * this.refillRatePerSecond;
    this.tokens = Math.min(this.maxTokens, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }
  
  canRetry(): boolean {
    this.refill();
    return this.tokens >= 1.0;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.tokens -= 1.0;
    return true;
  }
  
  getAvailableTokens(): number {
    this.refill();
    return this.tokens;
  }
}
 
// =========================================
// Strategy 2: Sliding Window Budget
// =========================================
interface WindowEntry {
  timestamp: number;
  type: "success" | "retry";
}
 
class SlidingWindowRetryBudget {
  private readonly windowSizeMs: number;
  private readonly maxRetryRatio: number;
  private entries: WindowEntry[] = [];
  
  constructor(windowSizeMs: number = 60000, maxRetryRatio: number = 0.10) {
    this.windowSizeMs = windowSizeMs;
    this.maxRetryRatio = maxRetryRatio;
  }
  
  private pruneOldEntries(): void {
    const cutoff = Date.now() - this.windowSizeMs;
    this.entries = this.entries.filter(e => e.timestamp >= cutoff);
  }
  
  recordSuccess(): void {
    this.entries.push({ timestamp: Date.now(), type: "success" });
    this.pruneOldEntries();
  }
  
  canRetry(): boolean {
    this.pruneOldEntries();
    
    const successCount = this.entries.filter(e => e.type === "success").length;
    const retryCount = this.entries.filter(e => e.type === "retry").length;
    
    if (successCount === 0) {
      // No successes in window - allow minimal retries based on initial buffer
      return retryCount < 5;  // Allow a few retries to bootstrap
    }
    
    const currentRatio = retryCount / successCount;
    return currentRatio < this.maxRetryRatio;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    
    this.entries.push({ timestamp: Date.now(), type: "retry" });
    return true;
  }
  
  getStats(): { successes: number; retries: number; ratio: number } {
    this.pruneOldEntries();
    
    const successes = this.entries.filter(e => e.type === "success").length;
    const retries = this.entries.filter(e => e.type === "retry").length;
    
    return {
      successes,
      retries,
      ratio: successes > 0 ? retries / successes : 0,
    };
  }
}
 
// =========================================
// Strategy 3: Adaptive Budget (Google-style)
// =========================================
class AdaptiveRetryBudget {
  private budget: number;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly minBudgetForRetry: number;
  
  // Exponential moving averages for monitoring
  private successRate: number = 1.0;
  private readonly alpha: number = 0.1;  // Smoothing factor
  
  constructor(options: {
    maxBudget?: number;
    budgetRatio?: number;
    minBudgetForRetry?: number;
  } = {}) {
    this.maxBudget = options.maxBudget ?? 100;
    this.budgetRatio = options.budgetRatio ?? 0.2;  // 20% default
    this.minBudgetForRetry = options.minBudgetForRetry ?? 1.0;
    this.budget = this.maxBudget;  // Start full
  }
  
  recordSuccess(): void {
    // Add to budget
    this.budget = Math.min(this.maxBudget, this.budget + this.budgetRatio);
    
    // Update success rate EMA
    this.successRate = this.alpha * 1.0 + (1 - this.alpha) * this.successRate;
  }
  
  recordFailure(): void {
    // Update success rate EMA
    this.successRate = this.alpha * 0.0 + (1 - this.alpha) * this.successRate;
  }
  
  canRetry(): boolean {
    // Two conditions must be met:
    // 1. Have enough budget
    // 2. Success rate isn't too low (adaptive response)
    if (this.budget < this.minBudgetForRetry) {
      return false;
    }
    
    // If success rate is very low, be more conservative
    // This provides extra protection during severe failures
    if (this.successRate < 0.1) {
      return this.budget >= this.maxBudget * 0.5;  // Require 50% budget
    }
    
    return true;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.budget -= 1.0;
    return true;
  }
  
  getState(): { budget: number; maxBudget: number; successRate: number } {
    return {
      budget: this.budget,
      maxBudget: this.maxBudget,
      successRate: this.successRate,
    };
  }
}
 
// =========================================
// Usage Comparison
// =========================================
 
console.log("Budget Strategy Comparison\n");
 
// Token bucket: good for rate-based limiting
const tokenBucket = new TokenBucketRetryBudget(10, 2);  // 10 tokens, 2/sec refill
console.log("Token Bucket: Best for steady-state traffic");
console.log(`  Available: ${tokenBucket.getAvailableTokens()} tokens\n`);
 
// Sliding window: good for ratio-based limiting
const slidingWindow = new SlidingWindowRetryBudget(60000, 0.10);  // 60s window, 10% limit
console.log("Sliding Window: Best for accurate ratio tracking");
console.log(`  Stats: ${JSON.stringify(slidingWindow.getStats())}\n`);
 
// Adaptive: good for varying conditions
const adaptive = new AdaptiveRetryBudget({ maxBudget: 50, budgetRatio: 0.20 });
console.log("Adaptive: Best for varying failure conditions");
console.log(`  State: ${JSON.stringify(adaptive.getState())}\n`);
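
The comparison table also lists a circuit-breaker hybrid, which the listing above does not implement. A minimal sketch follows, assuming the breaker opens when the budget runs out, denies all retries during a cooldown, and then lets retries through as probes. The class, its defaults, and the injectable `now` clock are hypothetical choices made for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class BudgetWithBreaker {
  private budget: number;
  private state: BreakerState = "closed";
  private openedAt = 0;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    maxBudget: number = 10,
    budgetRatio: number = 0.1,
    cooldownMs: number = 5000,
    now: () => number = Date.now
  ) {
    this.maxBudget = maxBudget;
    this.budgetRatio = budgetRatio;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.budget = maxBudget; // start full
  }

  recordSuccess(): void {
    this.budget = Math.min(this.maxBudget, this.budget + this.budgetRatio);
    if (this.state === "half-open") this.state = "closed"; // probe succeeded
  }

  recordFailure(): void {
    if (this.state === "half-open") this.trip(); // probe failed, back off again
  }

  tryConsumeForRetry(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = "half-open"; // cooldown over: allow probing retries
    }
    if (this.budget < 1) {
      this.trip(); // budget exhausted: open the breaker
      return false;
    }
    this.budget -= 1;
    return true;
  }

  getState(): BreakerState {
    return this.state;
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
  }
}
```

The design choice here is that an exhausted budget buys the downstream service a guaranteed quiet period: even if a few successes refill the budget, retries stay off until the cooldown ends.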

Production-Ready Implementation

A production retry budget needs to integrate seamlessly with your retry logic, support monitoring, and handle edge cases gracefully.

production-retry-budget.ts
/**
 * Production-Grade Retry Budget System
 * 
 * Features:
 * - Configurable budget strategy
 * - Metrics and monitoring support
 * - Integration with retry functions
 * - Synchronous methods, so safe on a single-threaded event loop (no locking for multi-threaded use)
 */
 
interface RetryBudgetMetrics {
  totalRequests: number;
  successfulRequests: number;
  failedRequests: number;
  retriesAttempted: number;
  retriesAllowed: number;
  retriesDenied: number;
  currentBudget: number;
  budgetUtilization: number;
}
 
interface RetryBudgetConfig {
  /** Maximum budget capacity */
  maxBudget: number;
  
  /** Budget earned per successful request (ratio) */
  budgetPerSuccess: number;
  
  /** Budget consumed per retry */
  budgetPerRetry: number;
  
  /** Minimum budget required to allow retry */
  minBudgetForRetry: number;
  
  /** Initial budget as percentage of max */
  initialBudgetPercent: number;
  
  /** Optional: callback when budget is exhausted */
  onBudgetExhausted?: () => void;
  
  /** Optional: callback when budget recovers */
  onBudgetRecovered?: () => void;
}
 
class ProductionRetryBudget {
  private budget: number;
  private readonly config: RetryBudgetConfig;
  private wasExhausted: boolean = false;
  
  // Metrics
  private metrics: RetryBudgetMetrics = {
    totalRequests: 0,
    successfulRequests: 0,
    failedRequests: 0,
    retriesAttempted: 0,
    retriesAllowed: 0,
    retriesDenied: 0,
    currentBudget: 0,
    budgetUtilization: 0,
  };
  
  constructor(config: Partial<RetryBudgetConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1,    // 10 successes = 1 retry
      budgetPerRetry: 1.0,
      minBudgetForRetry: 1.0,
      initialBudgetPercent: 50,
      ...config,
    };
    
    this.budget = this.config.maxBudget * (this.config.initialBudgetPercent / 100);
    this.updateMetrics();
  }
  
  /**
   * Record a successful request. Adds to budget.
   */
  recordSuccess(): void {
    this.metrics.totalRequests++;
    this.metrics.successfulRequests++;
    
    const previousBudget = this.budget;
    this.budget = Math.min(
      this.config.maxBudget,
      this.budget + this.config.budgetPerSuccess
    );
    
    // Check if we recovered from exhaustion
    if (this.wasExhausted && this.budget >= this.config.minBudgetForRetry) {
      this.wasExhausted = false;
      this.config.onBudgetRecovered?.();
    }
    
    this.updateMetrics();
  }
  
  /**
   * Record a failed request (without retry).
   */
  recordFailure(): void {
    this.metrics.totalRequests++;
    this.metrics.failedRequests++;
    this.updateMetrics();
  }
  
  /**
   * Check if retry is allowed without consuming budget.
   */
  canRetry(): boolean {
    return this.budget >= this.config.minBudgetForRetry;
  }
  
  /**
   * Attempt to consume budget for a retry.
   * Returns true if retry is allowed, false if denied.
   */
  tryConsumeForRetry(): boolean {
    this.metrics.retriesAttempted++;
    
    if (!this.canRetry()) {
      this.metrics.retriesDenied++;
      
      // Track exhaustion state
      if (!this.wasExhausted) {
        this.wasExhausted = true;
        this.config.onBudgetExhausted?.();
      }
      
      this.updateMetrics();
      return false;
    }
    
    this.budget -= this.config.budgetPerRetry;
    this.metrics.retriesAllowed++;
    this.updateMetrics();
    return true;
  }
  
  /**
   * Get current metrics for monitoring.
   */
  getMetrics(): RetryBudgetMetrics {
    return { ...this.metrics };
  }
  
  /**
   * Get current budget level.
   */
  getBudget(): number {
    return this.budget;
  }
  
  /**
   * Get budget as percentage of max.
   */
  getBudgetPercent(): number {
    return (this.budget / this.config.maxBudget) * 100;
  }
  
  /**
   * Reset metrics (for testing or rolling windows).
   */
  resetMetrics(): void {
    this.metrics = {
      totalRequests: 0,
      successfulRequests: 0,
      failedRequests: 0,
      retriesAttempted: 0,
      retriesAllowed: 0,
      retriesDenied: 0,
      currentBudget: this.budget,
      budgetUtilization: 0,
    };
  }
  
  private updateMetrics(): void {
    this.metrics.currentBudget = this.budget;
    this.metrics.budgetUtilization = 
      ((this.config.maxBudget - this.budget) / this.config.maxBudget) * 100;
  }
}
 
/**
 * Retry function with integrated budget management.
 */
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  budget: ProductionRetryBudget,
  options: {
    maxAttempts?: number;
    backoffMs?: (attempt: number) => number;
    shouldRetry?: (error: Error) => boolean;
    onRetryDenied?: (error: Error) => void;
  } = {}
): Promise<T> {
  const {
    maxAttempts = 3,
    backoffMs = (attempt) => 100 * Math.pow(2, attempt - 1),
    shouldRetry = () => true,
    onRetryDenied,
  } = options;
  
  let lastError: Error | null = null;
  
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await operation();
      budget.recordSuccess();
      return result;
    } catch (error) {
      lastError = error as Error;
      budget.recordFailure();
      
      // Check if this error is retryable
      if (!shouldRetry(lastError)) {
        throw lastError;
      }
      
      // Check if we have attempts remaining
      if (attempt >= maxAttempts) {
        throw lastError;
      }
      
      // Check budget before retrying
      if (!budget.tryConsumeForRetry()) {
        onRetryDenied?.(lastError);
        throw new RetryBudgetExhaustedError(
          "Retry budget exhausted",
          lastError
        );
      }
      
      // Wait before retry
      await new Promise(r => setTimeout(r, backoffMs(attempt)));
    }
  }
  
  throw lastError || new Error("Retry failed");
}
 
class RetryBudgetExhaustedError extends Error {
  constructor(message: string, public readonly cause: Error) {
    super(message);
    this.name = "RetryBudgetExhaustedError";
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const budget = new ProductionRetryBudget({
  maxBudget: 50,
  budgetPerSuccess: 0.1,
  onBudgetExhausted: () => console.log("⚠️  Retry budget exhausted!"),
  onBudgetRecovered: () => console.log("✅ Retry budget recovered"),
});
 
async function makeRequest(shouldFail: boolean): Promise<string> {
  return retryWithBudget(
    async () => {
      if (shouldFail) throw new Error("Simulated failure");
      return "success";
    },
    budget,
    {
      maxAttempts: 3,
      onRetryDenied: (err) => console.log(`Retry denied: ${err.message}`),
    }
  );
}

Distributed Retry Budgets

In distributed systems with multiple client instances, local retry budgets may not prevent global amplification. If 100 instances each have their own budget, the aggregate retry volume could still be too high.
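
The scale of the problem is simple arithmetic: local caps add up. The helpers below are illustrative, and the instance counts and caps are made-up inputs rather than figures from a real deployment.

```typescript
/**
 * Worst-case retries a fleet can release at once if every instance
 * starts with a full local budget.
 */
function worstCaseRetryBurst(instances: number, perInstanceCap: number): number {
  return instances * perInstanceCap;
}

/** Local cap that keeps the fleet-wide burst at or below `globalCap`. */
function perInstanceCapFor(globalCap: number, instances: number): number {
  return Math.floor(globalCap / instances);
}

console.log(worstCaseRetryBurst(100, 100)); // 10000
console.log(perInstanceCapFor(1000, 100));  // 10
```

A cap that looks modest on one client (100 retries) becomes a 10,000-retry burst across 100 clients, which is why the cap usually has to shrink as the fleet grows.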

Strategies for distributed coordination:

Distributed Budget Approaches

•Per-instance budgets with lower limits: Each instance gets 1/N of the total budget. Simple but inflexible.
•Shared budget via Redis/centralized store: All instances read/write to shared counter. Accurate but adds latency.
•Probabilistic budgets: Each instance independently decides with probability P whether to retry. No coordination needed.
•Hedged retries with sampling: Only a random fraction of failures are retried. Natural load shedding.
•Leader-based coordination: One instance tracks global state and advises others. Single point of failure.
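
The sampling approach in the list above needs no shared state at all. A minimal sketch, assuming a fixed `sampleRate` chosen by the operator, is shown here. The function name and the injectable `random` source are hypothetical, the latter added only to make the behaviour testable.

```typescript
/**
 * Retry only a random sample of failures.
 * `sampleRate` is the fraction of failures to retry (0.0 to 1.0).
 */
function shouldRetrySampled(
  sampleRate: number,
  random: () => number = Math.random
): boolean {
  return random() < sampleRate;
}

// Count how many of 10,000 simulated failures would be retried at 20%.
let sampledRetries = 0;
for (let i = 0; i < 10000; i++) {
  if (shouldRetrySampled(0.2)) sampledRetries++;
}
console.log(`retried ${sampledRetries} of 10000 failures`);
```

Because every instance applies the same rate independently, the fleet-wide retry volume stays near `sampleRate` times the failure volume without any coordination. The trade-off is that the rate does not adapt when conditions change.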

distributed-budget.ts
/**
 * Distributed Retry Budget Strategies
 */
 
// =========================================
// Strategy 1: Per-Instance Budget Division
// =========================================
class PerInstanceBudget {
  private readonly localBudget: ProductionRetryBudget;
  
  constructor(totalBudget: number, instanceCount: number) {
    // Each instance gets an equal share of the global budget
    const perInstanceBudget = totalBudget / instanceCount;
    
    this.localBudget = new ProductionRetryBudget({
      maxBudget: perInstanceBudget,
      budgetPerSuccess: 0.1,  // Same ratio: each instance only earns from its own ~1/N share of traffic
    });
  }
  
  tryConsumeForRetry(): boolean {
    return this.localBudget.tryConsumeForRetry();
  }
  
  recordSuccess(): void {
    this.localBudget.recordSuccess();
  }
}
 
// =========================================
// Strategy 2: Probabilistic Budget (No Coordination)
// =========================================
class ProbabilisticRetryBudget {
  private successCount: number = 0;
  private failureCount: number = 0;
  private readonly targetRetryRatio: number;
  
  constructor(targetRetryRatio: number = 0.10) {
    this.targetRetryRatio = targetRetryRatio;
  }
  
  recordSuccess(): void {
    this.successCount++;
  }
  
  recordFailure(): void {
    this.failureCount++;
  }
  
  /**
   * Probabilistically decide whether to retry.
   * Each instance independently makes this decision,
   * but the aggregate converges to the target ratio.
   */
  shouldRetry(): boolean {
    const totalRequests = this.successCount + this.failureCount;
    
    if (totalRequests < 10) {
      // Not enough data - use 50% probability
      return Math.random() < 0.5;
    }
    
    const observedFailureRate = this.failureCount / totalRequests;
    
    if (observedFailureRate < this.targetRetryRatio) {
      // Low failure rate - always retry
      return true;
    }
    
    // High failure rate - probabilistic retry
    // P(retry) = targetRatio / observedFailureRate
    // This ensures aggregate retry rate ≈ targetRatio
    const retryProbability = this.targetRetryRatio / observedFailureRate;
    return Math.random() < retryProbability;
  }
}
 
// =========================================
// Strategy 3: Redis-Backed Shared Budget
// =========================================
interface RedisClient {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  get(key: string): Promise<string | null>;
  expire(key: string, seconds: number): Promise<void>;
}
 
class RedisRetryBudget {
  private readonly redis: RedisClient;
  private readonly budgetKey: string;
  private readonly successKey: string;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly windowSeconds: number;
  
  constructor(
    redis: RedisClient,
    serviceName: string,
    options: {
      maxBudget?: number;
      budgetRatio?: number;
      windowSeconds?: number;
    } = {}
  ) {
    this.redis = redis;
    this.budgetKey = `retry_budget:${serviceName}:budget`;
    this.successKey = `retry_budget:${serviceName}:success`;
    this.maxBudget = options.maxBudget ?? 1000;
    this.budgetRatio = options.budgetRatio ?? 0.1;
    this.windowSeconds = options.windowSeconds ?? 60;
  }
  
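  async recordSuccess(): Promise<void> {
    // Increment success counter atomically
    const successes = await this.redis.incr(this.successKey);
    await this.redis.expire(this.successKey, this.windowSeconds);
    
    // Earn one retry credit every (1 / budgetRatio) successes, capped at max
    const successesPerCredit = Math.max(1, Math.round(1 / this.budgetRatio));
    if (successes % successesPerCredit === 0) {
      const newBudget = await this.redis.incr(this.budgetKey);
      if (newBudget > this.maxBudget) {
        // Over the cap - give the credit back
        await this.redis.decr(this.budgetKey);
      }
    }
    // Note: the check-and-restore is not atomic; a server-side script
    // would be needed for a strict cap under heavy concurrency.
  }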
  
  async tryConsumeForRetry(): Promise<boolean> {
    // Decrement budget atomically
    // Returns false if would go negative
    const newBudget = await this.redis.decr(this.budgetKey);
    
    if (newBudget < 0) {
      // Went negative - restore and deny
      await this.redis.incr(this.budgetKey);
      return false;
    }
    
    return true;
  }
  
  async getAvailableBudget(): Promise<number> {
    const budget = await this.redis.get(this.budgetKey);
    return budget ? parseInt(budget, 10) : 0;
  }
}
 
// =========================================
// Strategy 4: Hedged Retry with Sampling
// =========================================
class HedgedRetryBudget {
  private readonly samplingRate: number;
  private readonly baseRetryBudget: ProductionRetryBudget;
  
  constructor(samplingRate: number = 0.25) {
    this.samplingRate = samplingRate;
    this.baseRetryBudget = new ProductionRetryBudget();
  }
  
  /**
   * Only sample a fraction of failures for retry.
   * Combined with local budget for additional protection.
   */
  shouldRetry(): boolean {
    // First: random sampling
    if (Math.random() > this.samplingRate) {
      return false;  // Not sampled for retry
    }
    
    // Second: local budget check
    return this.baseRetryBudget.tryConsumeForRetry();
  }
  
  recordSuccess(): void {
    this.baseRetryBudget.recordSuccess();
  }
}
 
console.log("Distributed Budget Strategies Loaded");

Probabilistic Is Often Sufficient

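Probabilistic retry budgets require no coordination and naturally limit aggregate retry volume. If your target is 10% retries and you have 100 instances, each instance independently retries a failed request with probability P = target ratio / observed failure rate (for example, 0.10 / 0.40 = 0.25 when 40% of requests fail). Because every instance applies the same rule, the fleet as a whole stays near the 10% limit, which is what a coordinated budget would enforce, without the added complexity or latency.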
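A minimal sketch of the arithmetic behind this claim, assuming every instance observes roughly the same failure rate. The helper names `retryProbability` and `expectedRetryRatio` are illustrative and not part of the classes above; the rule itself mirrors `ProbabilisticRetryBudget.shouldRetry()`.

```typescript
/**
 * Probability of retrying one failed request so that retries stay
 * near `targetRatio` of all requests (hypothetical helper).
 */
function retryProbability(targetRatio: number, observedFailureRate: number): number {
  if (observedFailureRate <= targetRatio) {
    return 1; // Few failures: retrying all of them stays within the target
  }
  return targetRatio / observedFailureRate;
}

/**
 * Expected retries per request. The same value holds for one instance
 * and for the whole fleet, because every instance applies the same rule.
 */
function expectedRetryRatio(targetRatio: number, failureRate: number): number {
  return failureRate * retryProbability(targetRatio, failureRate);
}

for (const failureRate of [0.02, 0.1, 0.4, 1.0]) {
  const p = retryProbability(0.1, failureRate);
  const ratio = expectedRetryRatio(0.1, failureRate);
  console.log(
    `failure=${failureRate} P(retry)=${p.toFixed(2)} retries/request=${ratio.toFixed(3)}`
  );
}
```

Below the target, every failure is retried and the retry ratio simply equals the failure rate. Above it, the retry probability shrinks so that the expected ratio stays pinned at the target, regardless of how many instances there are.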

Integration with Circuit Breakers

Retry budgets and circuit breakers are complementary patterns. Circuit breakers respond to consecutive failures by stopping all requests. Retry budgets limit the volume of retries. Used together, they provide defense in depth.

Retry Budget vs Circuit Breaker
Aspect	Retry Budget	Circuit Breaker	Combined
Trigger	Ratio of retries to successes	Consecutive failures	Either condition
Response	Reduce retry rate	Stop all requests	Graceful degradation then stop
Recovery	Auto-recover with successes	Half-open probing	Both mechanisms
Scope	Retry decisions only	All requests	Full protection
Best for	High-volume steady traffic	Sudden complete failures	Production systems

circuit-budget-integration.ts
/**
 * Integrated Circuit Breaker + Retry Budget System
 * 
 * Provides layered protection:
 * 1. Retry budget controls retry amplification
 * 2. Circuit breaker halts traffic during severe failures
 */
 
type CircuitState = "closed" | "open" | "half-open";
 
interface IntegratedPolicyConfig {
  // Retry budget config
  maxBudget: number;
  budgetPerSuccess: number;
  
  // Circuit breaker config
  failureThreshold: number;      // Failures before opening
  successThreshold: number;      // Successes in half-open to close
  openDurationMs: number;        // How long to stay open
  halfOpenMaxConcurrent: number; // Max requests during half-open
}
 
class IntegratedRetryPolicy {
  private readonly config: IntegratedPolicyConfig;
  private readonly budget: ProductionRetryBudget;
  
  // Circuit breaker state
  private circuitState: CircuitState = "closed";
  private consecutiveFailures: number = 0;
  private consecutiveSuccesses: number = 0;
  private openedAt: number = 0;
  private halfOpenInFlight: number = 0;
  
  constructor(config: Partial<IntegratedPolicyConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1,
      failureThreshold: 5,
      successThreshold: 3,
      openDurationMs: 30000,
      halfOpenMaxConcurrent: 3,
      ...config,
    };
    
    this.budget = new ProductionRetryBudget({
      maxBudget: this.config.maxBudget,
      budgetPerSuccess: this.config.budgetPerSuccess,
    });
  }
  
  /**
   * Check if a request is allowed (considering circuit state).
   */
  allowRequest(): boolean {
    switch (this.circuitState) {
      case "closed":
        return true;
        
      case "open":
        // After the open period, move to half-open and let this request
        // through as the first probe (it counts against the probe limit)
        if (Date.now() - this.openedAt >= this.config.openDurationMs) {
          this.circuitState = "half-open";
          this.halfOpenInFlight = 1;
          return true;
        }
        return false;
        
      case "half-open":
        // Allow limited requests during probing
        if (this.halfOpenInFlight < this.config.halfOpenMaxConcurrent) {
          this.halfOpenInFlight++;
          return true;
        }
        return false;
    }
  }
  
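  /**
   * Check if a retry is allowed (budget + circuit state).
   * Does not claim a half-open probe slot: the caller checks
   * allowRequest() again before the retry is actually sent.
   */
  allowRetry(): boolean {
    // Retry only while the circuit is closed; an open or probing
    // circuit should not receive extra retry traffic
    if (this.circuitState !== "closed") {
      return false;
    }
    
    // Budget must allow retry
    return this.budget.tryConsumeForRetry();
  }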
  
  /**
   * Record a successful request.
   */
  recordSuccess(): void {
    this.budget.recordSuccess();
    this.consecutiveFailures = 0;
    this.consecutiveSuccesses++;
    
    if (this.circuitState === "half-open") {
      this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1);
      
      if (this.consecutiveSuccesses >= this.config.successThreshold) {
        this.circuitState = "closed";
        console.log("Circuit CLOSED - service recovered");
      }
    }
  }
  
  /**
   * Record a failed request.
   */
  recordFailure(): void {
    this.budget.recordFailure();
    this.consecutiveSuccesses = 0;
    this.consecutiveFailures++;
    
    if (this.circuitState === "half-open") {
      // Failure during half-open - back to open
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure during probe");
    } else if (this.circuitState === "closed" &&
               this.consecutiveFailures >= this.config.failureThreshold) {
      // Too many failures - open circuit
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure threshold reached");
    }
  }
  
  /**
   * Get current state for monitoring.
   */
  getState(): {
    circuitState: CircuitState;
    consecutiveFailures: number;
    budget: number;
    budgetPercent: number;
  } {
    return {
      circuitState: this.circuitState,
      consecutiveFailures: this.consecutiveFailures,
      budget: this.budget.getBudget(),
      budgetPercent: this.budget.getBudgetPercent(),
    };
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const policy = new IntegratedRetryPolicy({
  maxBudget: 50,
  failureThreshold: 3,
  openDurationMs: 10000,
});
 
async function makeProtectedRequest<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    // Check if circuit allows request
    if (!policy.allowRequest()) {
      throw new Error("Circuit breaker is open");
    }
    
    try {
      const result = await operation();
      policy.recordSuccess();
      return result;
    } catch (error) {
      policy.recordFailure();
      
      // Check if we should retry
      if (attempt <= maxRetries && policy.allowRetry()) {
        console.log(`Retry ${attempt} allowed by policy`);
        await new Promise(r => setTimeout(r, 100 * Math.pow(2, attempt)));
        continue;
      }
      
      throw error;
    }
  }
  
  throw new Error("Exhausted retries");
}

Summary: Retry Budget Mastery

Retry budgets complete our toolkit for safe retries. While backoff and jitter control when to retry, budgets control whether to retry at all—preventing the retry amplification that can turn minor failures into major outages.

Key Takeaways

• Retry amplification is multiplicative: small failure rates with aggressive retries create load spikes that prevent recovery.
• Budgets limit aggregate retry volume: by tying retry permission to successful requests, budgets naturally throttle during failures.
• 10% is a sensible default: Google's SRE guidance describes a per-client budget in which a request is retried only while retries make up less than 10% of that client's requests.
• Multiple strategies exist: percentage-based, token bucket, sliding window, and adaptive budgets each have tradeoffs.
• Distributed systems need coordination, or a probabilistic approach that converges to the target limit without coordination.
• Combine with circuit breakers: budgets handle gradual degradation; circuit breakers handle complete failures.

What's next:

Our final topic in retry strategies addresses a fundamental requirement for safe retries: Idempotency. Without idempotent operations, even correctly implemented retries can cause data corruption, duplicate charges, or inconsistent state. The next page explores how to design and implement idempotent operations.

Page Complete

You now understand retry budgets—the system-level mechanism that prevents retry amplification from causing cascading failures. Combined with exponential backoff and jitter, retry budgets form a complete framework for resilient retry behavior.


Retry Amplification: The Hidden Multiplier

Basic amplification formula:

retry-amplification.ts
/**
 * Retry Amplification Mathematics
 * 
 * When failures occur, retries increase total load.
 * This increased load can cause more failures, creating a vicious cycle.
 */
 
interface AmplificationResult {
  initialLoad: number;
  failureRate: number;
  retryCount: number;
  amplifiedLoad: number;
  amplificationFactor: number;
  sustainable: boolean;
}
 
/**
 * Calculate the amplified load when all failed requests are retried.
 * 
 * Without any limiting:
 * amplifiedLoad = initialLoad × (1 + failureRate × retryCount)
 */
function calculateAmplification(
  initialLoad: number,
  failureRate: number,  // 0.0 to 1.0
  retryCount: number,
  maxCapacity: number
): AmplificationResult {
  // First-order amplification (simple model)
  const amplifiedLoad = initialLoad * (1 + failureRate * retryCount);
  const amplificationFactor = amplifiedLoad / initialLoad;
  
  return {
    initialLoad,
    failureRate,
    retryCount,
    amplifiedLoad,
    amplificationFactor,
    sustainable: amplifiedLoad <= maxCapacity,
  };
}
 
// Scenario: Service handles 10,000 req/s, max capacity 12,000 req/s
const capacity = 12000;
 
console.log("Retry Amplification Scenarios");
console.log("System capacity: 12,000 req/s");
console.log("Normal load: 10,000 req/s (83% utilization)");
console.log("=========================================\n");
 
// Scenario 1: Healthy system (1% failure rate)
const healthy = calculateAmplification(10000, 0.01, 3, capacity);
console.log("Healthy (1% failure, 3 retries):");
console.log(`  Amplified load: ${healthy.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${healthy.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${healthy.sustainable}\n`);
// Output: 10,300 req/s (1.03x) - Sustainable ✓
 
// Scenario 2: Under stress (10% failure rate)
const stressed = calculateAmplification(10000, 0.10, 3, capacity);
console.log("Stressed (10% failure, 3 retries):");
console.log(`  Amplified load: ${stressed.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${stressed.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${stressed.sustainable}\n`);
// Output: 13,000 req/s (1.30x) - UNSUSTAINABLE! Exceeds capacity!
 
// Scenario 3: Same stress with retry budget (50% of failures retried)
const withBudget = calculateAmplification(10000, 0.10, 3 * 0.5, capacity);
console.log("Stressed with 50% retry budget:");
console.log(`  Amplified load: ${withBudget.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${withBudget.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${withBudget.sustainable}\n`);
// Output: 11,500 req/s (1.15x) - Sustainable with budget ✓
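The formula in `calculateAmplification` is a worst-case, first-order model: it charges every failed request the full `retryCount`. As a rough refinement, assume instead that each attempt fails independently with probability $p < 1$ and that a client stops after its first success or after $r$ retries. Both assumptions are simplifications introduced here, not part of the code above. The expected number of attempts per request is then a finite geometric series:

```latex
\mathbb{E}[\text{attempts}]
  = \sum_{k=0}^{r} p^{k}
  = \frac{1 - p^{\,r+1}}{1 - p}
  \;\le\; 1 + p\,r
```

For $p = 0.10$ and $r = 3$ this gives $1 + 0.1 + 0.01 + 0.001 = 1.111$, compared with $1.30$ from the first-order formula. The bound becomes tight as $p \to 1$, which is exactly the regime where amplification matters: when nearly every attempt fails, each request costs close to $r + 1$ attempts.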

The cascading feedback loop:

The simple model above assumes failure rate stays constant. In reality, increased load from retries increases failure rate, which triggers more retries:

cascading-failure.ts
/**
 * Cascading Failure Simulation
 * 
 * Models how retries can amplify failures in a feedback loop,
 * turning minor overload into complete system failure.
 */
 
interface SystemState {
  load: number;
  capacity: number;
  failureRate: number;
  retriesInFlight: number;
}
 
function simulateSeconds(
  initialState: SystemState,
  retryPolicy: { maxRetries: number; retryAfterSeconds: number },
  seconds: number
): SystemState[] {
  const history: SystemState[] = [initialState];
  let state = { ...initialState };
  
  // Pending retries by arrival time
  const pendingRetries: number[] = [];
  
  for (let t = 1; t <= seconds; t++) {
    // Apply retries scheduled in previous seconds on top of the baseline load
    // (baseline requests do not accumulate from one second to the next)
    const arrivingRetries = pendingRetries.shift() || 0;
    const newLoad = initialState.load + arrivingRetries;
    
    // Calculate failure rate from the load actually hitting the service
    // Simple model: linear increase above 80% capacity
    const utilizationRatio = newLoad / state.capacity;
    let failureRate = 0;
    if (utilizationRatio > 0.8) {
      failureRate = Math.min(1, (utilizationRatio - 0.8) * 5);
    }
    
    // Calculate failures at current load
    const failures = newLoad * failureRate;
    
    // Schedule retries: pessimistically, every failure spawns maxRetries retries
    const retriesToSchedule = failures * retryPolicy.maxRetries;
    while (pendingRetries.length < retryPolicy.retryAfterSeconds) {
      pendingRetries.push(0);
    }
    
    // Distribute retries evenly over the next retryAfterSeconds seconds (simplified)
    for (let r = 0; r < retryPolicy.retryAfterSeconds; r++) {
      pendingRetries[r] += retriesToSchedule / retryPolicy.retryAfterSeconds;
    }
    
    // Store a fresh object so later iterations cannot mutate recorded history
    state = {
      load: newLoad,
      capacity: state.capacity,
      failureRate,
      retriesInFlight: pendingRetries.reduce((a, b) => a + b, 0),
    };
    
    history.push(state);
  }
  
  return history;
}
 
// Simulate a system under sudden load spike
const initial: SystemState = {
  load: 8000,  // Normal load: 80% of capacity
  capacity: 10000,
  failureRate: 0,
  retriesInFlight: 0,
};
 
// Spike to 95% load
const spikedInitial = { ...initial, load: 9500 };
 
console.log("Cascading Failure Simulation");
console.log("System capacity: 10,000 req/s");
console.log("Initial spike: 9,500 req/s (95% utilization)");
console.log("=========================================\n");
 
const withRetries = simulateSeconds(
  spikedInitial,
  { maxRetries: 3, retryAfterSeconds: 1 },
  10
);
 
console.log("With unlimited retries:");
withRetries.slice(0, 6).forEach((state, t) => {
  console.log(`  t=${t}s: load=${state.load.toFixed(0)}, failure=${(state.failureRate * 100).toFixed(0)}%, pending=${state.retriesInFlight.toFixed(0)}`);
});
 
// Key insight: The failure rate and pending retries escalate rapidly
// What started as a spike to 95% utilization becomes catastrophic failure
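For contrast, here is a compact, self-contained sketch of the same feedback loop with and without a retry cap. It reuses the linear failure model from the simulation above (failures start at 80% utilization), but the rest is assumed for illustration: each failure is retried at most once in the following second, and the cap allows retries up to `budgetRatio` times the previous second's successes. The names `failureRateAt` and `simulateLoad` are hypothetical.

```typescript
/** Linear failure model: 0% up to 80% utilization, 100% at or above 100%. */
function failureRateAt(load: number, capacity: number): number {
  const utilization = load / capacity;
  return utilization > 0.8 ? Math.min(1, (utilization - 0.8) * 5) : 0;
}

/**
 * Returns the load seen in each second.
 * budgetRatio === null means every failure is retried (no budget).
 */
function simulateLoad(
  baseLoad: number,
  capacity: number,
  seconds: number,
  budgetRatio: number | null
): number[] {
  const loads: number[] = [];
  let arrivingRetries = 0;

  for (let t = 0; t < seconds; t++) {
    const load = baseLoad + arrivingRetries;
    const failures = load * failureRateAt(load, capacity);
    const successes = load - failures;

    // Retries that will arrive in the next second
    arrivingRetries =
      budgetRatio === null
        ? failures
        : Math.min(failures, budgetRatio * successes);

    loads.push(load);
  }
  return loads;
}

const unlimited = simulateLoad(9500, 10000, 10, null);
const budgeted = simulateLoad(9500, 10000, 10, 0.1);

console.log("unlimited:", unlimited.map(l => l.toFixed(0)).join(", "));
console.log("budgeted: ", budgeted.map(l => l.toFixed(0)).join(", "));
```

Without a cap, load keeps climbing because every retry that fails is retried again. With the cap, extra load can never exceed 10% of what the service actually completed, so total load stays within roughly 11% of the baseline even while the service is failing.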

The Amplification Cascade

What Is a Retry Budget?

The core principle:

"You may only retry if you have budget remaining. Budget is earned through successful requests and spent on retries."

Retry Budget Properties

• Budget accrual: Successful requests add to the budget (typically a small fraction, like 10%).
• Budget consumption: Each retry consumes budget (typically 1 unit per retry).
• Budget cap: Maximum budget is capped to prevent accumulation during long healthy periods.
• Budget exhaustion: When budget reaches zero, no more retries are allowed and requests fail fast.
• Auto-recovery: As the system stabilizes and successes resume, budget naturally rebuilds.

Example: 10% retry budget

With a 10% retry budget:

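• Every 10 successful requests earn 1 retry credit.
• If 1% of requests fail, the budget earns roughly 10× the credits needed to retry every failure (plenty).
• If 50% of requests fail, credits are spent faster than they are earned, and most retries are denied once the buffer runs out.
• If 100% fail, no new credits are earned; the budget exhausts after the initial buffer, preventing amplification.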
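The figures in this example can be checked with a short steady-state calculation. This is a sketch under simplifying assumptions that are not in the text above: it ignores the initial buffer and the cap, and it counts only first-attempt outcomes. The function name `retryCoverage` is hypothetical.

```typescript
/**
 * Fraction of failed requests a budget can afford to retry at steady state.
 * Each success earns `budgetRatio` credits; each retry costs one credit.
 */
function retryCoverage(failureRate: number, budgetRatio: number = 0.1): number {
  if (failureRate <= 0) {
    return 1; // Nothing fails, so nothing needs a retry
  }
  const creditsEarnedPerRequest = budgetRatio * (1 - failureRate);
  const creditsNeededPerRequest = failureRate; // One retry per failure
  return Math.min(1, creditsEarnedPerRequest / creditsNeededPerRequest);
}

for (const failureRate of [0.01, 0.5, 1.0]) {
  const coverage = retryCoverage(failureRate);
  console.log(
    `failure=${(failureRate * 100).toFixed(0)}% -> ` +
    `${(coverage * 100).toFixed(0)}% of failures can be retried`
  );
}
```

At a 1% failure rate the budget earns 0.099 credits per request against 0.01 needed, which is the "10×" headroom. At 50% only about one failure in ten can be retried once the buffer is gone, and at 100% the sustainable retry rate is zero.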

Google's SRE Recommendation

retry-budget-concept.ts
/**
 * Conceptual Retry Budget
 * 
 * Demonstrates the basic mechanics of a percentage-based retry budget.
 */
 
interface BudgetState {
  available: number;        // Current budget balance
  maxBudget: number;        // Maximum budget cap
  totalSuccesses: number;   // Successes in current window
  totalRetries: number;     // Retries spent in current window
}
 
class ConceptualRetryBudget {
  private available: number;
  private readonly budgetRatio: number;  // e.g., 0.10 for 10%
  private readonly maxBudget: number;    // Upper cap
  
  constructor(budgetRatio: number = 0.10, maxBudget: number = 100) {
    this.budgetRatio = budgetRatio;
    this.maxBudget = maxBudget;
    this.available = maxBudget / 2;  // Start with some initial budget
  }
  
  /**
   * Record a successful request. This earns retry budget.
   */
  recordSuccess(): void {
    // Each success earns a fraction of a retry credit
    this.available = Math.min(
      this.available + this.budgetRatio,
      this.maxBudget
    );
  }
  
  /**
   * Check if we can afford a retry.
   */
  canRetry(): boolean {
    return this.available >= 1.0;
  }
  
  /**
   * Consume budget for a retry. Returns true if retry was allowed.
   */
  consumeForRetry(): boolean {
    if (!this.canRetry()) {
      return false;  // No budget, cannot retry
    }
    
    this.available -= 1.0;
    return true;
  }
  
  /**
   * Get current budget state for monitoring.
   */
  getState(): { available: number; maxBudget: number; percentFull: number } {
    return {
      available: this.available,
      maxBudget: this.maxBudget,
      percentFull: (this.available / this.maxBudget) * 100,
    };
  }
}
 
// Demonstration
const budget = new ConceptualRetryBudget(0.10, 100);
 
console.log("Retry Budget Demonstration");
console.log("Budget ratio: 10% (1 retry credit per 10 successes)");
console.log("========================================\n");
 
// Simulate healthy traffic: 100 requests, 2% failure
console.log("Scenario 1: Healthy traffic (2% failure rate)");
for (let i = 0; i < 100; i++) {
  if (Math.random() < 0.02) {
    // Failure - try to retry
    const couldRetry = budget.consumeForRetry();
    console.log(`  Request ${i}: FAILED - Retry ${couldRetry ? "ALLOWED" : "DENIED"}`);
  } else {
    // Success - earn budget
    budget.recordSuccess();
  }
}
console.log(`  Final budget: ${budget.getState().available.toFixed(1)} / ${budget.getState().maxBudget}\n`);
 
// Reset and simulate unhealthy traffic.
// Uses 1,000 requests so the initial 50-credit buffer runs out and denials become visible.
const budget2 = new ConceptualRetryBudget(0.10, 100);
console.log("Scenario 2: Unhealthy traffic (50% failure rate)");
let retriesAllowed = 0;
let retriesDenied = 0;
 
for (let i = 0; i < 1000; i++) {
  if (Math.random() < 0.50) {
    // High failure rate
    if (budget2.consumeForRetry()) {
      retriesAllowed++;
    } else {
      retriesDenied++;
    }
  } else {
    budget2.recordSuccess();
  }
}
console.log(`  Retries allowed: ${retriesAllowed}`);
console.log(`  Retries denied: ${retriesDenied}`);
console.log(`  Final budget: ${budget2.getState().available.toFixed(1)}\n`);
 
// Key insight: Under high failure, budget exhausts and denies retries,
// preventing amplification even though each individual retry seems reasonable

Retry Budget Strategies

There are several approaches to implementing retry budgets, each with different characteristics. The right choice depends on your system's traffic patterns and reliability requirements.

Retry Budget Strategy Comparison
Strategy	Mechanism	Pros	Cons	Best For
Percentage-Based	Retries limited to X% of successes	Simple, proportional, self-adjusting	Needs success tracking	General purpose
Token Bucket	Fixed token regeneration rate, consumed by retries	Smooth, familiar pattern	Requires tuning regen rate	Steady traffic
Sliding Window	Track retry/request ratio in time window	Accurate recent view	Memory for window	Variable traffic
Circuit-Breaker Hybrid	Budget + circuit breaker integration	Best protection	More complex	Critical paths

budget-strategies.ts
/**
 * Multiple Retry Budget Strategy Implementations
 */
 
// =========================================
// Strategy 1: Token Bucket Budget
// =========================================
class TokenBucketRetryBudget {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRatePerSecond: number;
  private lastRefillTime: number;
  
  constructor(maxTokens: number = 10, refillRatePerSecond: number = 1) {
    this.tokens = maxTokens;  // Start full
    this.maxTokens = maxTokens;
    this.refillRatePerSecond = refillRatePerSecond;
    this.lastRefillTime = Date.now();
  }
  
  private refill(): void {
    const now = Date.now();
    const secondsElapsed = (now - this.lastRefillTime) / 1000;
    
    const tokensToAdd = secondsElapsed * this.refillRatePerSecond;
    this.tokens = Math.min(this.maxTokens, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }
  
  canRetry(): boolean {
    this.refill();
    return this.tokens >= 1.0;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.tokens -= 1.0;
    return true;
  }
  
  getAvailableTokens(): number {
    this.refill();
    return this.tokens;
  }
}
 
// =========================================
// Strategy 2: Sliding Window Budget
// =========================================
interface WindowEntry {
  timestamp: number;
  type: "success" | "retry";
}
 
class SlidingWindowRetryBudget {
  private readonly windowSizeMs: number;
  private readonly maxRetryRatio: number;
  private entries: WindowEntry[] = [];
  
  constructor(windowSizeMs: number = 60000, maxRetryRatio: number = 0.10) {
    this.windowSizeMs = windowSizeMs;
    this.maxRetryRatio = maxRetryRatio;
  }
  
  private pruneOldEntries(): void {
    const cutoff = Date.now() - this.windowSizeMs;
    this.entries = this.entries.filter(e => e.timestamp >= cutoff);
  }
  
  recordSuccess(): void {
    this.entries.push({ timestamp: Date.now(), type: "success" });
    this.pruneOldEntries();
  }
  
  canRetry(): boolean {
    this.pruneOldEntries();
    
    const successCount = this.entries.filter(e => e.type === "success").length;
    const retryCount = this.entries.filter(e => e.type === "retry").length;
    
    if (successCount === 0) {
      // No successes in window - allow minimal retries based on initial buffer
      return retryCount < 5;  // Allow a few retries to bootstrap
    }
    
    const currentRatio = retryCount / successCount;
    return currentRatio < this.maxRetryRatio;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    
    this.entries.push({ timestamp: Date.now(), type: "retry" });
    return true;
  }
  
  getStats(): { successes: number; retries: number; ratio: number } {
    this.pruneOldEntries();
    
    const successes = this.entries.filter(e => e.type === "success").length;
    const retries = this.entries.filter(e => e.type === "retry").length;
    
    return {
      successes,
      retries,
      ratio: successes > 0 ? retries / successes : 0,
    };
  }
}
 
// =========================================
// Strategy 3: Adaptive Budget (Google-style)
// =========================================
class AdaptiveRetryBudget {
  private budget: number;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly minBudgetForRetry: number;
  
  // Exponential moving averages for monitoring
  private successRate: number = 1.0;
  private readonly alpha: number = 0.1;  // Smoothing factor
  
  constructor(options: {
    maxBudget?: number;
    budgetRatio?: number;
    minBudgetForRetry?: number;
  } = {}) {
    this.maxBudget = options.maxBudget ?? 100;
    this.budgetRatio = options.budgetRatio ?? 0.2;  // 20% default
    this.minBudgetForRetry = options.minBudgetForRetry ?? 1.0;
    this.budget = this.maxBudget;  // Start full
  }
  
  recordSuccess(): void {
    // Add to budget
    this.budget = Math.min(this.maxBudget, this.budget + this.budgetRatio);
    
    // Update success rate EMA
    this.successRate = this.alpha * 1.0 + (1 - this.alpha) * this.successRate;
  }
  
  recordFailure(): void {
    // Update success rate EMA
    this.successRate = this.alpha * 0.0 + (1 - this.alpha) * this.successRate;
  }
  
  canRetry(): boolean {
    // Two conditions must be met:
    // 1. Have enough budget
    // 2. Success rate isn't too low (adaptive response)
    if (this.budget < this.minBudgetForRetry) {
      return false;
    }
    
    // If success rate is very low, be more conservative
    // This provides extra protection during severe failures
    if (this.successRate < 0.1) {
      return this.budget >= this.maxBudget * 0.5;  // Require 50% budget
    }
    
    return true;
  }
  
  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.budget -= 1.0;
    return true;
  }
  
  getState(): { budget: number; maxBudget: number; successRate: number } {
    return {
      budget: this.budget,
      maxBudget: this.maxBudget,
      successRate: this.successRate,
    };
  }
}
 
// =========================================
// Usage Comparison
// =========================================
 
console.log("Budget Strategy Comparison\n");
 
// Token bucket: good for rate-based limiting
const tokenBucket = new TokenBucketRetryBudget(10, 2);  // 10 tokens, 2/sec refill
console.log("Token Bucket: Best for steady-state traffic");
console.log(`  Available: ${tokenBucket.getAvailableTokens()} tokens\n`);
 
// Sliding window: good for ratio-based limiting
const slidingWindow = new SlidingWindowRetryBudget(60000, 0.10);  // 60s window, 10% limit
console.log("Sliding Window: Best for accurate ratio tracking");
console.log(`  Stats: ${JSON.stringify(slidingWindow.getStats())}\n`);
 
// Adaptive: good for varying conditions
const adaptive = new AdaptiveRetryBudget({ maxBudget: 50, budgetRatio: 0.20 });
console.log("Adaptive: Best for varying failure conditions");
console.log(`  State: ${JSON.stringify(adaptive.getState())}\n`);

Production-Ready Implementation

A production retry budget needs to integrate seamlessly with your retry logic, support monitoring, and handle edge cases gracefully.

production-retry-budget.ts
/**
 * Production-Grade Retry Budget System
 * 
 * Features:
 * - Configurable budget strategy
 * - Metrics and monitoring support
 * - Integration with retry functions
 * - Synchronous state updates (safe under a single-threaded event loop)
 */
 
interface RetryBudgetMetrics {
  totalRequests: number;
  successfulRequests: number;
  failedRequests: number;
  retriesAttempted: number;
  retriesAllowed: number;
  retriesDenied: number;
  currentBudget: number;
  budgetUtilization: number;
}
 
interface RetryBudgetConfig {
  /** Maximum budget capacity */
  maxBudget: number;
  
  /** Budget earned per successful request (ratio) */
  budgetPerSuccess: number;
  
  /** Budget consumed per retry */
  budgetPerRetry: number;
  
  /** Minimum budget required to allow retry */
  minBudgetForRetry: number;
  
  /** Initial budget as percentage of max */
  initialBudgetPercent: number;
  
  /** Optional: callback when budget is exhausted */
  onBudgetExhausted?: () => void;
  
  /** Optional: callback when budget recovers */
  onBudgetRecovered?: () => void;
}
 
class ProductionRetryBudget {
  private budget: number;
  private readonly config: RetryBudgetConfig;
  private wasExhausted: boolean = false;
  
  // Metrics
  private metrics: RetryBudgetMetrics = {
    totalRequests: 0,
    successfulRequests: 0,
    failedRequests: 0,
    retriesAttempted: 0,
    retriesAllowed: 0,
    retriesDenied: 0,
    currentBudget: 0,
    budgetUtilization: 0,
  };
  
  constructor(config: Partial<RetryBudgetConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1,    // 10 successes = 1 retry
      budgetPerRetry: 1.0,
      minBudgetForRetry: 1.0,
      initialBudgetPercent: 50,
      ...config,
    };
    
    this.budget = this.config.maxBudget * (this.config.initialBudgetPercent / 100);
    this.updateMetrics();
  }
  
  /**
   * Record a successful request. Adds to budget.
   */
  recordSuccess(): void {
    this.metrics.totalRequests++;
    this.metrics.successfulRequests++;
    
    this.budget = Math.min(
      this.config.maxBudget,
      this.budget + this.config.budgetPerSuccess
    );
    
    // Check if we recovered from exhaustion
    if (this.wasExhausted && this.budget >= this.config.minBudgetForRetry) {
      this.wasExhausted = false;
      this.config.onBudgetRecovered?.();
    }
    
    this.updateMetrics();
  }
  
  /**
   * Record a failed request (without retry).
   */
  recordFailure(): void {
    this.metrics.totalRequests++;
    this.metrics.failedRequests++;
    this.updateMetrics();
  }
  
  /**
   * Check if retry is allowed without consuming budget.
   */
  canRetry(): boolean {
    return this.budget >= this.config.minBudgetForRetry;
  }
  
  /**
   * Attempt to consume budget for a retry.
   * Returns true if retry is allowed, false if denied.
   */
  tryConsumeForRetry(): boolean {
    this.metrics.retriesAttempted++;
    
    if (!this.canRetry()) {
      this.metrics.retriesDenied++;
      
      // Track exhaustion state
      if (!this.wasExhausted) {
        this.wasExhausted = true;
        this.config.onBudgetExhausted?.();
      }
      
      this.updateMetrics();
      return false;
    }
    
    this.budget -= this.config.budgetPerRetry;
    this.metrics.retriesAllowed++;
    this.updateMetrics();
    return true;
  }
  
  /**
   * Get current metrics for monitoring.
   */
  getMetrics(): RetryBudgetMetrics {
    return { ...this.metrics };
  }
  
  /**
   * Get current budget level.
   */
  getBudget(): number {
    return this.budget;
  }
  
  /**
   * Get budget as percentage of max.
   */
  getBudgetPercent(): number {
    return (this.budget / this.config.maxBudget) * 100;
  }
  
  /**
   * Reset metrics (for testing or rolling windows).
   */
  resetMetrics(): void {
    this.metrics = {
      totalRequests: 0,
      successfulRequests: 0,
      failedRequests: 0,
      retriesAttempted: 0,
      retriesAllowed: 0,
      retriesDenied: 0,
      currentBudget: this.budget,
      budgetUtilization: 0,
    };
  }
  
  private updateMetrics(): void {
    this.metrics.currentBudget = this.budget;
    this.metrics.budgetUtilization = 
      ((this.config.maxBudget - this.budget) / this.config.maxBudget) * 100;
  }
}
 
/**
 * Retry function with integrated budget management.
 */
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  budget: ProductionRetryBudget,
  options: {
    maxAttempts?: number;
    backoffMs?: (attempt: number) => number;
    shouldRetry?: (error: Error) => boolean;
    onRetryDenied?: (error: Error) => void;
  } = {}
): Promise<T> {
  const {
    maxAttempts = 3,
    backoffMs = (attempt) => 100 * Math.pow(2, attempt - 1),
    shouldRetry = () => true,
    onRetryDenied,
  } = options;
  
  let lastError: Error | null = null;
  
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await operation();
      budget.recordSuccess();
      return result;
    } catch (error) {
      lastError = error as Error;
      budget.recordFailure();
      
      // Check if this error is retryable
      if (!shouldRetry(lastError)) {
        throw lastError;
      }
      
      // Check if we have attempts remaining
      if (attempt >= maxAttempts) {
        throw lastError;
      }
      
      // Check budget before retrying
      if (!budget.tryConsumeForRetry()) {
        onRetryDenied?.(lastError);
        throw new RetryBudgetExhaustedError(
          "Retry budget exhausted",
          lastError
        );
      }
      
      // Wait before retry
      await new Promise(r => setTimeout(r, backoffMs(attempt)));
    }
  }
  
  throw lastError || new Error("Retry failed");
}
 
class RetryBudgetExhaustedError extends Error {
  constructor(message: string, public readonly cause: Error) {
    super(message);
    this.name = "RetryBudgetExhaustedError";
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const budget = new ProductionRetryBudget({
  maxBudget: 50,
  budgetPerSuccess: 0.1,
  onBudgetExhausted: () => console.log("⚠️  Retry budget exhausted!"),
  onBudgetRecovered: () => console.log("✅ Retry budget recovered"),
});
 
async function makeRequest(shouldFail: boolean): Promise<string> {
  return retryWithBudget(
    async () => {
      if (shouldFail) throw new Error("Simulated failure");
      return "success";
    },
    budget,
    {
      maxAttempts: 3,
      onRetryDenied: (err) => console.log(`Retry denied: ${err.message}`),
    }
  );
}
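
To make the exhaustion and recovery behavior concrete, here is a minimal, self-contained sketch. `MiniBudget` is a hypothetical, stripped-down stand-in for `ProductionRetryBudget`, not the class above. It tracks the budget in integer units, an assumption made here because ten floating-point additions of 0.1 sum to slightly less than 1.0, which would delay the first earned retry by one success.

```typescript
// Hypothetical stand-in for ProductionRetryBudget, for illustration only.
// Budget is held in integer units: 1 unit per success, 10 units per retry,
// which matches the "10 successes = 1 retry" ratio without float drift.
class MiniBudget {
  private units: number;
  private readonly maxUnits: number;
  private readonly unitsPerSuccess: number;
  private readonly unitsPerRetry: number;

  constructor(maxUnits: number, initialUnits: number, unitsPerSuccess = 1, unitsPerRetry = 10) {
    this.maxUnits = maxUnits;
    this.units = Math.min(maxUnits, initialUnits);
    this.unitsPerSuccess = unitsPerSuccess;
    this.unitsPerRetry = unitsPerRetry;
  }

  // Each success earns a small amount of budget, capped at maxUnits.
  recordSuccess(): void {
    this.units = Math.min(this.maxUnits, this.units + this.unitsPerSuccess);
  }

  // A retry is allowed only if a full retry's worth of units is available.
  tryConsumeForRetry(): boolean {
    if (this.units < this.unitsPerRetry) return false;
    this.units -= this.unitsPerRetry;
    return true;
  }
}

// Phase 1: sustained outage. 20 failures each ask for a retry.
const miniBudget = new MiniBudget(100, 50); // starts with 5 retries banked
let outageAllowed = 0;
let outageDenied = 0;
for (let i = 0; i < 20; i++) {
  if (miniBudget.tryConsumeForRetry()) {
    outageAllowed++;
  } else {
    outageDenied++;
  }
}

// Phase 2: recovery. 10 successes earn back exactly one retry.
for (let i = 0; i < 10; i++) {
  miniBudget.recordSuccess();
}
const retryAfterRecovery = miniBudget.tryConsumeForRetry();
const secondRetryAfterRecovery = miniBudget.tryConsumeForRetry();

console.log({ outageAllowed, outageDenied, retryAfterRecovery, secondRetryAfterRecovery });
```

During the outage only the banked retries get through and the rest are denied immediately, so the struggling service sees a bounded amount of extra load. Retries resume only as successes return.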

Distributed Retry Budgets

Strategies for distributed coordination:

Distributed Budget Approaches

• Per-instance budgets with lower limits: Each instance gets 1/N of the total budget. Simple but inflexible.
• Shared budget via Redis/centralized store: All instances read/write to shared counter. Accurate but adds latency.
• Probabilistic budgets: Each instance independently decides with probability P whether to retry. No coordination needed.
• Hedged retries with sampling: Only a random fraction of failures are retried. Natural load shedding.
• Leader-based coordination: One instance tracks global state and advises others. Single point of failure.

distributed-budget.ts
/**
 * Distributed Retry Budget Strategies
 */
 
// =========================================
// Strategy 1: Per-Instance Budget Division
// =========================================
class PerInstanceBudget {
  private readonly localBudget: ProductionRetryBudget;
  
  constructor(totalBudget: number, instanceCount: number) {
    // Each instance gets an equal share of the global budget
    const perInstanceBudget = totalBudget / instanceCount;
    
    this.localBudget = new ProductionRetryBudget({
      maxBudget: perInstanceBudget,
      budgetPerSuccess: 0.1 / instanceCount,  // Slower accrual
    });
  }
  
  tryConsumeForRetry(): boolean {
    return this.localBudget.tryConsumeForRetry();
  }
  
  recordSuccess(): void {
    this.localBudget.recordSuccess();
  }
}
 
// =========================================
// Strategy 2: Probabilistic Budget (No Coordination)
// =========================================
class ProbabilisticRetryBudget {
  private successCount: number = 0;
  private failureCount: number = 0;
  private readonly targetRetryRatio: number;
  
  constructor(targetRetryRatio: number = 0.10) {
    this.targetRetryRatio = targetRetryRatio;
  }
  
  recordSuccess(): void {
    this.successCount++;
  }
  
  recordFailure(): void {
    this.failureCount++;
  }
  
  /**
   * Probabilistically decide whether to retry.
   * Each instance independently makes this decision,
   * but the aggregate converges to the target ratio.
   */
  shouldRetry(): boolean {
    const totalRequests = this.successCount + this.failureCount;
    
    if (totalRequests < 10) {
      // Not enough data - use 50% probability
      return Math.random() < 0.5;
    }
    
    const observedFailureRate = this.failureCount / totalRequests;
    
    if (observedFailureRate < this.targetRetryRatio) {
      // Low failure rate - always retry
      return true;
    }
    
    // High failure rate - probabilistic retry
    // P(retry) = targetRatio / observedFailureRate
    // This ensures aggregate retry rate ≈ targetRatio
    const retryProbability = this.targetRetryRatio / observedFailureRate;
    return Math.random() < retryProbability;
  }
}
 
// =========================================
// Strategy 3: Redis-Backed Shared Budget
// =========================================
interface RedisClient {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  get(key: string): Promise<string | null>;
  expire(key: string, seconds: number): Promise<void>;
}
 
class RedisRetryBudget {
  private readonly redis: RedisClient;
  private readonly budgetKey: string;
  private readonly successKey: string;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly windowSeconds: number;
  
  constructor(
    redis: RedisClient,
    serviceName: string,
    options: {
      maxBudget?: number;
      budgetRatio?: number;
      windowSeconds?: number;
    } = {}
  ) {
    this.redis = redis;
    this.budgetKey = `retry_budget:${serviceName}:budget`;
    this.successKey = `retry_budget:${serviceName}:success`;
    this.maxBudget = options.maxBudget ?? 1000;
    this.budgetRatio = options.budgetRatio ?? 0.1;
    this.windowSeconds = options.windowSeconds ?? 60;
  }
  
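  async recordSuccess(): Promise<void> {
    // Increment success counter atomically
    const successes = await this.redis.incr(this.successKey);
    if (successes === 1) {
      // Start the window on the first success so the TTL is not reset on every call
      await this.redis.expire(this.successKey, this.windowSeconds);
    }
    
    // Earn one unit of budget every 1/budgetRatio successes
    const successesPerToken = Math.max(1, Math.round(1 / this.budgetRatio));
    if (successes % successesPerToken === 0) {
      const newBudget = await this.redis.incr(this.budgetKey);
      if (newBudget > this.maxBudget) {
        // Overshot the cap - give the unit back. Redis has no capped increment;
        // a Lua script (EVAL) would make this check-and-cap atomic.
        await this.redis.decr(this.budgetKey);
      }
    }
  }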
  
  async tryConsumeForRetry(): Promise<boolean> {
    // Decrement budget atomically
    // Returns false if would go negative
    const newBudget = await this.redis.decr(this.budgetKey);
    
    if (newBudget < 0) {
      // Went negative - restore and deny
      await this.redis.incr(this.budgetKey);
      return false;
    }
    
    return true;
  }
  
  async getAvailableBudget(): Promise<number> {
    const budget = await this.redis.get(this.budgetKey);
    return budget ? parseInt(budget, 10) : 0;
  }
}
 
// =========================================
// Strategy 4: Hedged Retry with Sampling
// =========================================
class HedgedRetryBudget {
  private readonly samplingRate: number;
  private readonly baseRetryBudget: ProductionRetryBudget;
  
  constructor(samplingRate: number = 0.25) {
    this.samplingRate = samplingRate;
    this.baseRetryBudget = new ProductionRetryBudget();
  }
  
  /**
   * Only sample a fraction of failures for retry.
   * Combined with local budget for additional protection.
   */
  shouldRetry(): boolean {
    // First: random sampling
    if (Math.random() > this.samplingRate) {
      return false;  // Not sampled for retry
    }
    
    // Second: local budget check
    return this.baseRetryBudget.tryConsumeForRetry();
  }
  
  recordSuccess(): void {
    this.baseRetryBudget.recordSuccess();
  }
}
 
console.log("Distributed Budget Strategies Loaded");

Probabilistic Is Often Sufficient
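
A small simulation can illustrate why uncoordinated instances still land near a shared target. The sketch below is an assumption-laden toy, not a benchmark: `simulateProbabilisticBudget` and the `mulberry32` helper are names introduced here, failures are drawn independently at a fixed rate, and retry outcomes are not fed back into the counters. Each simulated instance applies the same rule as `ProbabilisticRetryBudget.shouldRetry()` using only its own observations.

```typescript
// Deterministic PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Simulates `instances` independent clients that never communicate.
// Each applies P(retry) = targetRetryRatio / observedFailureRate locally.
// Returns retries issued per first-attempt request across all instances.
function simulateProbabilisticBudget(
  instances: number,
  requestsPerInstance: number,
  failureRate: number,
  targetRetryRatio: number,
  seed: number,
): number {
  const random = mulberry32(seed);
  let totalRequests = 0;
  let totalRetries = 0;

  for (let i = 0; i < instances; i++) {
    let successes = 0;
    let failures = 0;
    for (let r = 0; r < requestsPerInstance; r++) {
      totalRequests++;
      if (random() >= failureRate) {
        successes++;
        continue;
      }
      failures++;
      const seen = successes + failures;
      const observed = failures / seen;
      // Warm-up and low-failure branches mirror the class above.
      let pRetry: number;
      if (seen < 10) {
        pRetry = 0.5;
      } else if (observed < targetRetryRatio) {
        pRetry = 1;
      } else {
        pRetry = targetRetryRatio / observed;
      }
      if (random() < pRetry) totalRetries++;
    }
  }
  return totalRetries / totalRequests;
}

// 20 instances, 50% of requests failing, 10% retry target.
const aggregateRetryRatio = simulateProbabilisticBudget(20, 5000, 0.5, 0.1, 42);
console.log(`aggregate retries per request: ${aggregateRetryRatio.toFixed(3)}`);
```

With half of all requests failing, each instance retries roughly one failure in five, so the fleet-wide retry volume stays close to the 10% target even though no instance knows what the others are doing.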

Integration with Circuit Breakers

Retry Budget vs Circuit Breaker
Aspect	Retry Budget	Circuit Breaker	Combined
Trigger	Ratio of retries to successes	Consecutive failures	Either condition
Response	Reduce retry rate	Stop all requests	Graceful degradation then stop
Recovery	Auto-recover with successes	Half-open probing	Both mechanisms
Scope	Retry decisions only	All requests	Full protection
Best for	High-volume steady traffic	Sudden complete failures	Production systems

circuit-budget-integration.ts
/**
 * Integrated Circuit Breaker + Retry Budget System
 * 
 * Provides layered protection:
 * 1. Retry budget controls retry amplification
 * 2. Circuit breaker halts traffic during severe failures
 */
 
type CircuitState = "closed" | "open" | "half-open";
 
interface IntegratedPolicyConfig {
  // Retry budget config
  maxBudget: number;
  budgetPerSuccess: number;
  
  // Circuit breaker config
  failureThreshold: number;      // Failures before opening
  successThreshold: number;      // Successes in half-open to close
  openDurationMs: number;        // How long to stay open
  halfOpenMaxConcurrent: number; // Max requests during half-open
}
 
class IntegratedRetryPolicy {
  private readonly config: IntegratedPolicyConfig;
  private readonly budget: ProductionRetryBudget;
  
  // Circuit breaker state
  private circuitState: CircuitState = "closed";
  private consecutiveFailures: number = 0;
  private consecutiveSuccesses: number = 0;
  private openedAt: number = 0;
  private halfOpenInFlight: number = 0;
  
  constructor(config: Partial<IntegratedPolicyConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1,
      failureThreshold: 5,
      successThreshold: 3,
      openDurationMs: 30000,
      halfOpenMaxConcurrent: 3,
      ...config,
    };
    
    this.budget = new ProductionRetryBudget({
      maxBudget: this.config.maxBudget,
      budgetPerSuccess: this.config.budgetPerSuccess,
    });
  }
  
  /**
   * Check if a request is allowed (considering circuit state).
   */
  allowRequest(): boolean {
    switch (this.circuitState) {
      case "closed":
        return true;
        
      case "open":
        // Check if it's time to try half-open
        if (Date.now() - this.openedAt >= this.config.openDurationMs) {
          this.circuitState = "half-open";
          // This request is the first probe, so count it as in flight
          this.halfOpenInFlight = 1;
          return true;
        }
        return false;
        
      case "half-open":
        // Allow limited requests during probing
        if (this.halfOpenInFlight < this.config.halfOpenMaxConcurrent) {
          this.halfOpenInFlight++;
          return true;
        }
        return false;
    }
  }
  
  /**
   * Check if a retry is allowed (budget + circuit state).
   */
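  allowRetry(): boolean {
    // Only retry while the circuit is closed. Checking the state directly
    // (rather than calling allowRequest()) avoids claiming a half-open probe
    // slot here and again when the retry itself calls allowRequest().
    if (this.circuitState !== "closed") {
      return false;
    }
    
    // Budget must allow retry
    return this.budget.tryConsumeForRetry();
  }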
  
  /**
   * Record a successful request.
   */
  recordSuccess(): void {
    this.budget.recordSuccess();
    this.consecutiveFailures = 0;
    this.consecutiveSuccesses++;
    
    if (this.circuitState === "half-open") {
      this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1);
      
      if (this.consecutiveSuccesses >= this.config.successThreshold) {
        this.circuitState = "closed";
        console.log("Circuit CLOSED - service recovered");
      }
    }
  }
  
  /**
   * Record a failed request.
   */
  recordFailure(): void {
    this.budget.recordFailure();
    this.consecutiveSuccesses = 0;
    this.consecutiveFailures++;
    
    if (this.circuitState === "half-open") {
      // Failure during half-open - back to open
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure during probe");
    } else if (this.circuitState === "closed" &&
               this.consecutiveFailures >= this.config.failureThreshold) {
      // Too many failures - open circuit
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure threshold reached");
    }
  }
  
  /**
   * Get current state for monitoring.
   */
  getState(): {
    circuitState: CircuitState;
    consecutiveFailures: number;
    budget: number;
    budgetPercent: number;
  } {
    return {
      circuitState: this.circuitState,
      consecutiveFailures: this.consecutiveFailures,
      budget: this.budget.getBudget(),
      budgetPercent: this.budget.getBudgetPercent(),
    };
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const policy = new IntegratedRetryPolicy({
  maxBudget: 50,
  failureThreshold: 3,
  openDurationMs: 10000,
});
 
async function makeProtectedRequest<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    // Check if circuit allows request
    if (!policy.allowRequest()) {
      throw new Error("Circuit breaker is open");
    }
    
    try {
      const result = await operation();
      policy.recordSuccess();
      return result;
    } catch (error) {
      policy.recordFailure();
      
      // Check if we should retry
      if (attempt <= maxRetries && policy.allowRetry()) {
        console.log(`Retry ${attempt} allowed by policy`);
        await new Promise(r => setTimeout(r, 100 * Math.pow(2, attempt)));
        continue;
      }
      
      throw error;
    }
  }
  
  throw new Error("Exhausted retries");
}

Summary: Retry Budget Mastery

Key Takeaways

• Retry amplification is multiplicative: Small failure rates with aggressive retries create load spikes that prevent recovery.
• Budgets limit aggregate retry volume: By tying retry permission to successful requests, budgets naturally throttle during failures.
• 10% is a sensible default: Google's recommendation is that retries should not exceed 10% of successful request volume.
• Multiple strategies exist: Percentage-based, token bucket, sliding window, and adaptive budgets each have tradeoffs.
• Distributed systems need coordination: Or use probabilistic approaches that converge to target limits without coordination.
• Combine with circuit breakers: Budgets handle gradual degradation; circuit breakers handle complete failures.
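
The first three takeaways can be checked with a few lines of arithmetic. The sketch below is illustrative only: `attemptsPerRequest` and `attemptsWithBudget` are names introduced here, attempts are assumed to fail independently with the same probability, and the budgeted figure is a steady-state upper bound that ignores any budget banked before the failure began.

```typescript
// Expected attempts per original request when every failure is retried,
// up to maxRetries times, and each attempt fails independently with
// probability failureRate: 1 + p + p^2 + ... + p^maxRetries.
function attemptsPerRequest(failureRate: number, maxRetries: number): number {
  let attempts = 0;
  let probabilityReached = 1; // chance that attempt k is made at all
  for (let k = 0; k <= maxRetries; k++) {
    attempts += probabilityReached;
    probabilityReached *= failureRate;
  }
  return attempts;
}

// With a budget, retries are at most budgetRatio x successes, and successes
// cannot exceed the number of original requests, so load per original
// request is capped at 1 + budgetRatio.
function attemptsWithBudget(
  failureRate: number,
  maxRetries: number,
  budgetRatio: number,
): number {
  return Math.min(attemptsPerRequest(failureRate, maxRetries), 1 + budgetRatio);
}

for (const failureRate of [0.02, 0.5, 1.0]) {
  const unbounded = attemptsPerRequest(failureRate, 3);
  const budgeted = attemptsWithBudget(failureRate, 3, 0.1);
  console.log(
    `failure rate ${failureRate}: ${unbounded.toFixed(2)}x load unbudgeted, ` +
      `at most ${budgeted.toFixed(2)}x with a 10% budget`,
  );
}
```

In a full outage with three retries, unbudgeted clients send four times the normal traffic, while a 10% budget holds the extra load to a tenth of normal volume.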
