We've learned when to retry, how to space retries with exponential backoff, and how to add jitter to prevent thundering herds. These are powerful techniques for individual requests.
But they share a dangerous assumption: that retrying is always beneficial if the failure seems transient.
Consider this scenario: A downstream service is running at 98% capacity—healthy, but near its limit. 2% of requests fail due to load. Each of those 2% is retried. Now we have 102% of normal load. More failures occur. More retries. 104%. 106%. The system cascades to failure.
The problem: Each individual retry decision was correct. The collective impact was catastrophic.
The solution: Retry budgets—system-level limits that constrain total retry volume, preventing well-intentioned retries from becoming the final straw.
By the end of this page, you will understand retry amplification mathematics, how retry budgets work, different budget strategies (percentage-based, token bucket, circuit-based), implementation patterns, and how major platforms like Google and Netflix use retry budgets to maintain system stability.
Retry amplification is the phenomenon where retries increase system load during failures, making recovery harder or impossible. Understanding its mathematics is essential for designing robust systems.
Basic amplification formula:
```typescript
/**
 * Retry Amplification Mathematics
 *
 * When failures occur, retries increase total load.
 * This increased load can cause more failures, creating a vicious cycle.
 */

interface AmplificationResult {
  initialLoad: number;
  failureRate: number;
  retryCount: number;
  amplifiedLoad: number;
  amplificationFactor: number;
  sustainable: boolean;
}

/**
 * Calculate the amplified load when all failed requests are retried.
 *
 * Without any limiting:
 *   amplifiedLoad = initialLoad × (1 + failureRate × retryCount)
 */
function calculateAmplification(
  initialLoad: number,
  failureRate: number, // 0.0 to 1.0
  retryCount: number,
  maxCapacity: number
): AmplificationResult {
  // First-order amplification (simple model)
  const amplifiedLoad = initialLoad * (1 + failureRate * retryCount);
  const amplificationFactor = amplifiedLoad / initialLoad;

  return {
    initialLoad,
    failureRate,
    retryCount,
    amplifiedLoad,
    amplificationFactor,
    sustainable: amplifiedLoad <= maxCapacity,
  };
}

// Scenario: Service handles 10,000 req/s, max capacity 12,000 req/s
const capacity = 12000;

console.log("Retry Amplification Scenarios");
console.log("System capacity: 12,000 req/s");
console.log("Normal load: 10,000 req/s (83% utilization)");
console.log("=========================================");

// Scenario 1: Healthy system (1% failure rate)
const healthy = calculateAmplification(10000, 0.01, 3, capacity);
console.log("Healthy (1% failure, 3 retries):");
console.log(`  Amplified load: ${healthy.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${healthy.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${healthy.sustainable}`);
// Output: 10,300 req/s (1.03x) - Sustainable ✓

// Scenario 2: Under stress (10% failure rate)
const stressed = calculateAmplification(10000, 0.10, 3, capacity);
console.log("Stressed (10% failure, 3 retries):");
console.log(`  Amplified load: ${stressed.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${stressed.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${stressed.sustainable}`);
// Output: 13,000 req/s (1.30x) - UNSUSTAINABLE! Exceeds capacity!

// Scenario 3: Same stress with retry budget (50% of failures retried)
const withBudget = calculateAmplification(10000, 0.10, 3 * 0.5, capacity);
console.log("Stressed with 50% retry budget:");
console.log(`  Amplified load: ${withBudget.amplifiedLoad.toFixed(0)} req/s`);
console.log(`  Amplification: ${withBudget.amplificationFactor.toFixed(2)}x`);
console.log(`  Sustainable: ${withBudget.sustainable}`);
// Output: 11,500 req/s (1.15x) - Sustainable with budget ✓
```

The cascading feedback loop:
The simple model above assumes failure rate stays constant. In reality, increased load from retries increases failure rate, which triggers more retries:
```typescript
/**
 * Cascading Failure Simulation
 *
 * Models how retries can amplify failures in a feedback loop,
 * turning minor overload into complete system failure.
 */

interface SystemState {
  load: number;
  capacity: number;
  failureRate: number;
  retriesInFlight: number;
}

function simulateSeconds(
  initialState: SystemState,
  retryPolicy: { maxRetries: number; retryAfterSeconds: number },
  seconds: number
): SystemState[] {
  // Store copies so later iterations cannot mutate recorded history
  const history: SystemState[] = [{ ...initialState }];

  // Fresh (non-retry) traffic arriving each second; requests don't accumulate
  const baselineLoad = initialState.load;

  // Pending retries by arrival time
  const pendingRetries: number[] = [];

  for (let t = 1; t <= seconds; t++) {
    // Apply retries scheduled in previous seconds
    const arrivingRetries = pendingRetries.shift() || 0;
    const newLoad = baselineLoad + arrivingRetries;

    // Calculate failure rate based on total load (retries included) vs capacity
    // Simple model: linear increase above 80% capacity
    const utilizationRatio = newLoad / initialState.capacity;
    let failureRate = 0;
    if (utilizationRatio > 0.8) {
      failureRate = Math.min(1, (utilizationRatio - 0.8) * 5);
    }

    // Calculate failures at current load
    const failures = newLoad * failureRate;

    // Schedule retries (simplified: every failure spawns maxRetries retries)
    const retriesToSchedule = failures * retryPolicy.maxRetries;
    while (pendingRetries.length < retryPolicy.retryAfterSeconds) {
      pendingRetries.push(0);
    }

    // Distribute retries over time (simplified)
    for (let r = 0; r < retryPolicy.retryAfterSeconds; r++) {
      pendingRetries[r] += retriesToSchedule / retryPolicy.retryAfterSeconds;
    }

    history.push({
      load: newLoad,
      capacity: initialState.capacity,
      failureRate,
      retriesInFlight: pendingRetries.reduce((a, b) => a + b, 0),
    });
  }

  return history;
}

// Simulate a system under sudden load spike
const initial: SystemState = {
  load: 8000, // Normal load: 80% of capacity
  capacity: 10000,
  failureRate: 0,
  retriesInFlight: 0,
};

// Spike to 95% load
const spikedInitial = { ...initial, load: 9500 };

console.log("Cascading Failure Simulation");
console.log("System capacity: 10,000 req/s");
console.log("Initial spike: 9,500 req/s (95% utilization)");
console.log("=========================================");

const withRetries = simulateSeconds(
  spikedInitial,
  { maxRetries: 3, retryAfterSeconds: 1 },
  10
);

console.log("With unlimited retries:");
withRetries.slice(0, 6).forEach((state, t) => {
  console.log(
    `  t=${t}s: load=${state.load.toFixed(0)}, ` +
      `failure=${(state.failureRate * 100).toFixed(0)}%, ` +
      `pending=${state.retriesInFlight.toFixed(0)}`
  );
});

// Key insight: The failure rate and pending retries escalate rapidly
// What started as a spike to 95% utilization becomes catastrophic failure
```

In the worst case, retry amplification creates a feedback loop: failures cause retries, retries increase load, and increased load causes more failures. Without budgets, this loop continues until the system fails completely or all requests time out.
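The loop can also be written as a recurrence. The following is a sketch under simplifying assumptions that go beyond the text: fresh load $L_0$ is constant, every failed attempt (including a failed retry) spawns $r$ retries one second later, and the failure rate $p$ follows the same linear model as the simulation (zero up to 80% utilization, rising to 100% at full capacity $C$).

$$
p(L) = \min\bigl(1,\ \max\bigl(0,\ 5\,(L/C - 0.8)\bigr)\bigr), \qquad L_{t+1} = L_0 + r\,p(L_t)\,L_t
$$

With $L_0 = 9{,}500$, $C = 10{,}000$ and $r = 3$:

$$
L_1 = 9{,}500, \quad p(L_1) = 0.75, \quad L_2 = 9{,}500 + 3 \times 0.75 \times 9{,}500 = 30{,}875, \quad p(L_2) = 1
$$

Once $p = 1$, the recurrence reduces to $L_{t+1} = L_0 + r\,L_t$, so under these assumptions load grows by roughly a factor of $r$ per step and does not return to the stable region on its own.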
A retry budget is a mechanism that limits the total number of retries a client or system can issue over a time window. Instead of allowing unlimited retries per request, the budget constrains retry volume as a fraction of successful requests.
The core principle:
"You may only retry if you have budget remaining. Budget is earned through successful requests and spent on retries."
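Example: with a 10% retry budget, every 10 successful requests earn one retry credit, and each retry spends one credit. When the balance falls below one credit, further retries are denied until new successes replenish it.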
Google's Site Reliability Engineering book describes a per-client retry budget in which a request is retried only while retries make up less than 10% of that client's requests. As a rule of thumb, if you process 1,000 successful requests, you can issue up to roughly 100 retries. Treat 10% as a starting point and tune it against your own traffic.
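To see what a 10% ratio buys, the sketch below compares unbudgeted retry load with the upper bound a budget imposes. The function names `maxRetriesUnderBudget` and `unbudgetedRetries` are hypothetical, and the bound assumes every success (fresh or retried) earns credit; it is an illustration, not a reference implementation.

```typescript
/**
 * Upper bound on retries per second when retries are capped at
 * `budgetRatio` times the number of successes.
 *
 * Derivation (assumption: every success, fresh or retried, earns credit):
 *   successes <= fresh + retries
 *   retries   <= budgetRatio * successes
 *   => retries <= budgetRatio * fresh / (1 - budgetRatio)
 */
function maxRetriesUnderBudget(freshLoad: number, budgetRatio: number): number {
  if (budgetRatio < 0 || budgetRatio >= 1) {
    throw new RangeError("budgetRatio must be in [0, 1)");
  }
  return (budgetRatio * freshLoad) / (1 - budgetRatio);
}

/** Retries per second with no budget, using the first-order model. */
function unbudgetedRetries(
  freshLoad: number,
  failureRate: number,
  retriesPerFailure: number
): number {
  return freshLoad * failureRate * retriesPerFailure;
}

// Same numbers as the stressed scenario: 10,000 req/s, 10% failures, 3 retries
const freshLoad = 10_000;
const withoutBudget = freshLoad + unbudgetedRetries(freshLoad, 0.10, 3);
const withTenPercentBudget = freshLoad + maxRetriesUnderBudget(freshLoad, 0.10);

console.log(`Without budget: ${withoutBudget.toFixed(0)} req/s`);
console.log(`With 10% budget: at most ${withTenPercentBudget.toFixed(0)} req/s`);
```

Under these assumptions the budgeted load stays near 11,111 req/s no matter how high the failure rate climbs, whereas the unbudgeted load scales with the failure rate.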
```typescript
/**
 * Conceptual Retry Budget
 *
 * Demonstrates the basic mechanics of a percentage-based retry budget.
 */

interface BudgetState {
  available: number;      // Current budget balance
  maxBudget: number;      // Maximum budget cap
  totalSuccesses: number; // Successes in current window
  totalRetries: number;   // Retries spent in current window
}

class ConceptualRetryBudget {
  private available: number;
  private readonly budgetRatio: number; // e.g., 0.10 for 10%
  private readonly maxBudget: number;   // Upper cap

  constructor(budgetRatio: number = 0.10, maxBudget: number = 100) {
    this.budgetRatio = budgetRatio;
    this.maxBudget = maxBudget;
    this.available = maxBudget / 2; // Start with some initial budget
  }

  /**
   * Record a successful request. This earns retry budget.
   */
  recordSuccess(): void {
    // Each success earns a fraction of a retry credit
    this.available = Math.min(
      this.available + this.budgetRatio,
      this.maxBudget
    );
  }

  /**
   * Check if we can afford a retry.
   */
  canRetry(): boolean {
    return this.available >= 1.0;
  }

  /**
   * Consume budget for a retry. Returns true if retry was allowed.
   */
  consumeForRetry(): boolean {
    if (!this.canRetry()) {
      return false; // No budget, cannot retry
    }
    this.available -= 1.0;
    return true;
  }

  /**
   * Get current budget state for monitoring.
   */
  getState(): { available: number; maxBudget: number; percentFull: number } {
    return {
      available: this.available,
      maxBudget: this.maxBudget,
      percentFull: (this.available / this.maxBudget) * 100,
    };
  }
}

// Demonstration
const budget = new ConceptualRetryBudget(0.10, 100);

console.log("Retry Budget Demonstration");
console.log("Budget ratio: 10% (1 retry credit per 10 successes)");
console.log("========================================");

// Simulate healthy traffic: 100 requests, 2% failure
console.log("Scenario 1: Healthy traffic (2% failure rate)");
for (let i = 0; i < 100; i++) {
  if (Math.random() < 0.02) {
    // Failure - try to retry
    const couldRetry = budget.consumeForRetry();
    console.log(`  Request ${i}: FAILED - Retry ${couldRetry ? "ALLOWED" : "DENIED"}`);
  } else {
    // Success - earn budget
    budget.recordSuccess();
  }
}
console.log(`  Final budget: ${budget.getState().available.toFixed(1)} / ${budget.getState().maxBudget}`);

// Reset and simulate unhealthy traffic
const budget2 = new ConceptualRetryBudget(0.10, 100);
console.log("Scenario 2: Unhealthy traffic (50% failure rate)");
let retriesAllowed = 0;
let retriesDenied = 0;

for (let i = 0; i < 100; i++) {
  if (Math.random() < 0.50) {
    // High failure rate
    if (budget2.consumeForRetry()) {
      retriesAllowed++;
    } else {
      retriesDenied++;
    }
  } else {
    budget2.recordSuccess();
  }
}
console.log(`  Retries allowed: ${retriesAllowed}`);
console.log(`  Retries denied: ${retriesDenied}`);
console.log(`  Final budget: ${budget2.getState().available.toFixed(1)}`);

// Key insight: Under high failure, budget exhausts and denies retries,
// preventing amplification even though each individual retry seems reasonable
```

There are several approaches to implementing retry budgets, each with different characteristics. The right choice depends on your system's traffic patterns and reliability requirements.
| Strategy | Mechanism | Pros | Cons | Best For |
|---|---|---|---|---|
| Percentage-Based | Retries limited to X% of successes | Simple, proportional, self-adjusting | Needs success tracking | General purpose |
| Token Bucket | Fixed token regeneration rate, consumed by retries | Smooth, familiar pattern | Requires tuning regen rate | Steady traffic |
| Sliding Window | Track retry/request ratio in time window | Accurate recent view | Memory for window | Variable traffic |
| Circuit-Breaker Hybrid | Budget + circuit breaker integration | Best protection | More complex | Critical paths |
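The "memory for window" cost listed for the sliding-window strategy can be kept constant by counting events in fixed time buckets instead of storing one entry per request. The sketch below is one way to do that; the class name `BucketedWindowRetryBudget`, its options, and the injectable `now` clock are all hypothetical, and unlike a per-entry window it denies retries outright when the window holds no successes.

```typescript
interface Bucket {
  epoch: number; // Which bucket-sized time slice this slot currently holds
  successes: number;
  retries: number;
}

class BucketedWindowRetryBudget {
  private readonly buckets: Bucket[];
  private readonly bucketCount: number;
  private readonly bucketMs: number;
  private readonly maxRetryRatio: number;
  private readonly now: () => number;

  constructor(options: {
    bucketCount?: number;
    bucketMs?: number;
    maxRetryRatio?: number;
    now?: () => number; // Injectable clock, useful for tests
  } = {}) {
    this.bucketCount = options.bucketCount ?? 60;
    this.bucketMs = options.bucketMs ?? 1000;
    this.maxRetryRatio = options.maxRetryRatio ?? 0.10;
    this.now = options.now ?? (() => Date.now());
    this.buckets = Array.from({ length: this.bucketCount }, () => ({
      epoch: -1,
      successes: 0,
      retries: 0,
    }));
  }

  /** Return the bucket for the current time slice, resetting it if stale. */
  private current(): Bucket {
    const epoch = Math.floor(this.now() / this.bucketMs);
    const bucket = this.buckets[epoch % this.bucketCount];
    if (bucket.epoch !== epoch) {
      bucket.epoch = epoch;
      bucket.successes = 0;
      bucket.retries = 0;
    }
    return bucket;
  }

  /** Sum the counters of every bucket that still falls inside the window. */
  getStats(): { successes: number; retries: number } {
    const newest = Math.floor(this.now() / this.bucketMs);
    const oldest = newest - this.bucketCount + 1;
    let successes = 0;
    let retries = 0;
    for (const b of this.buckets) {
      if (b.epoch >= oldest && b.epoch <= newest) {
        successes += b.successes;
        retries += b.retries;
      }
    }
    return { successes, retries };
  }

  recordSuccess(): void {
    this.current().successes++;
  }

  /** Allow a retry only if it keeps retries within maxRetryRatio of successes. */
  tryConsumeForRetry(): boolean {
    const { successes, retries } = this.getStats();
    if (retries + 1 > successes * this.maxRetryRatio) return false;
    this.current().retries++;
    return true;
  }
}

// Usage: 60 one-second buckets, 10% ratio, real clock
const bucketed = new BucketedWindowRetryBudget();
bucketed.recordSuccess();
console.log(bucketed.getStats());
```

Memory stays at `bucketCount` counters regardless of traffic volume, at the price of window edges that are only accurate to one bucket width.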
```typescript
/**
 * Multiple Retry Budget Strategy Implementations
 */

// =========================================
// Strategy 1: Token Bucket Budget
// =========================================
class TokenBucketRetryBudget {
  private tokens: number;
  private readonly maxTokens: number;
  private readonly refillRatePerSecond: number;
  private lastRefillTime: number;

  constructor(maxTokens: number = 10, refillRatePerSecond: number = 1) {
    this.tokens = maxTokens; // Start full
    this.maxTokens = maxTokens;
    this.refillRatePerSecond = refillRatePerSecond;
    this.lastRefillTime = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const secondsElapsed = (now - this.lastRefillTime) / 1000;
    const tokensToAdd = secondsElapsed * this.refillRatePerSecond;
    this.tokens = Math.min(this.maxTokens, this.tokens + tokensToAdd);
    this.lastRefillTime = now;
  }

  canRetry(): boolean {
    this.refill();
    return this.tokens >= 1.0;
  }

  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.tokens -= 1.0;
    return true;
  }

  getAvailableTokens(): number {
    this.refill();
    return this.tokens;
  }
}

// =========================================
// Strategy 2: Sliding Window Budget
// =========================================
interface WindowEntry {
  timestamp: number;
  type: "success" | "retry";
}

class SlidingWindowRetryBudget {
  private readonly windowSizeMs: number;
  private readonly maxRetryRatio: number;
  private entries: WindowEntry[] = [];

  constructor(windowSizeMs: number = 60000, maxRetryRatio: number = 0.10) {
    this.windowSizeMs = windowSizeMs;
    this.maxRetryRatio = maxRetryRatio;
  }

  private pruneOldEntries(): void {
    const cutoff = Date.now() - this.windowSizeMs;
    this.entries = this.entries.filter(e => e.timestamp >= cutoff);
  }

  recordSuccess(): void {
    this.entries.push({ timestamp: Date.now(), type: "success" });
    this.pruneOldEntries();
  }

  canRetry(): boolean {
    this.pruneOldEntries();
    const successCount = this.entries.filter(e => e.type === "success").length;
    const retryCount = this.entries.filter(e => e.type === "retry").length;

    if (successCount === 0) {
      // No successes in window - allow minimal retries based on initial buffer
      return retryCount < 5; // Allow a few retries to bootstrap
    }

    const currentRatio = retryCount / successCount;
    return currentRatio < this.maxRetryRatio;
  }

  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.entries.push({ timestamp: Date.now(), type: "retry" });
    return true;
  }

  getStats(): { successes: number; retries: number; ratio: number } {
    this.pruneOldEntries();
    const successes = this.entries.filter(e => e.type === "success").length;
    const retries = this.entries.filter(e => e.type === "retry").length;
    return {
      successes,
      retries,
      ratio: successes > 0 ? retries / successes : 0,
    };
  }
}

// =========================================
// Strategy 3: Adaptive Budget (Google-style)
// =========================================
class AdaptiveRetryBudget {
  private budget: number;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly minBudgetForRetry: number;

  // Exponential moving averages for monitoring
  private successRate: number = 1.0;
  private readonly alpha: number = 0.1; // Smoothing factor

  constructor(options: {
    maxBudget?: number;
    budgetRatio?: number;
    minBudgetForRetry?: number;
  } = {}) {
    this.maxBudget = options.maxBudget ?? 100;
    this.budgetRatio = options.budgetRatio ?? 0.2; // 20% default
    this.minBudgetForRetry = options.minBudgetForRetry ?? 1.0;
    this.budget = this.maxBudget; // Start full
  }

  recordSuccess(): void {
    // Add to budget
    this.budget = Math.min(this.maxBudget, this.budget + this.budgetRatio);
    // Update success rate EMA
    this.successRate = this.alpha * 1.0 + (1 - this.alpha) * this.successRate;
  }

  recordFailure(): void {
    // Update success rate EMA
    this.successRate = this.alpha * 0.0 + (1 - this.alpha) * this.successRate;
  }

  canRetry(): boolean {
    // Two conditions must be met:
    // 1. Have enough budget
    // 2. Success rate isn't too low (adaptive response)
    if (this.budget < this.minBudgetForRetry) {
      return false;
    }

    // If success rate is very low, be more conservative
    // This provides extra protection during severe failures
    if (this.successRate < 0.1) {
      return this.budget >= this.maxBudget * 0.5; // Require 50% budget
    }

    return true;
  }

  consumeForRetry(): boolean {
    if (!this.canRetry()) return false;
    this.budget -= 1.0;
    return true;
  }

  getState(): { budget: number; maxBudget: number; successRate: number } {
    return {
      budget: this.budget,
      maxBudget: this.maxBudget,
      successRate: this.successRate,
    };
  }
}

// =========================================
// Usage Comparison
// =========================================

console.log("Budget Strategy Comparison");

// Token bucket: good for rate-based limiting
const tokenBucket = new TokenBucketRetryBudget(10, 2); // 10 tokens, 2/sec refill
console.log("Token Bucket: Best for steady-state traffic");
console.log(`  Available: ${tokenBucket.getAvailableTokens()} tokens`);

// Sliding window: good for ratio-based limiting
const slidingWindow = new SlidingWindowRetryBudget(60000, 0.10); // 60s window, 10% limit
console.log("Sliding Window: Best for accurate ratio tracking");
console.log(`  Stats: ${JSON.stringify(slidingWindow.getStats())}`);

// Adaptive: good for varying conditions
const adaptive = new AdaptiveRetryBudget({ maxBudget: 50, budgetRatio: 0.20 });
console.log("Adaptive: Best for varying failure conditions");
console.log(`  State: ${JSON.stringify(adaptive.getState())}`);
```

A production retry budget needs to integrate seamlessly with your retry logic, support monitoring, and handle edge cases gracefully.
```typescript
/**
 * Production-Grade Retry Budget System
 *
 * Features:
 * - Configurable budget strategy
 * - Metrics and monitoring support
 * - Integration with retry functions
 * - Synchronous state updates (no awaits inside budget methods), so
 *   concurrent async callers on one event loop cannot interleave mid-update
 */

interface RetryBudgetMetrics {
  totalRequests: number;
  successfulRequests: number;
  failedRequests: number;
  retriesAttempted: number;
  retriesAllowed: number;
  retriesDenied: number;
  currentBudget: number;
  budgetUtilization: number;
}

interface RetryBudgetConfig {
  /** Maximum budget capacity */
  maxBudget: number;
  /** Budget earned per successful request (ratio) */
  budgetPerSuccess: number;
  /** Budget consumed per retry */
  budgetPerRetry: number;
  /** Minimum budget required to allow retry */
  minBudgetForRetry: number;
  /** Initial budget as percentage of max */
  initialBudgetPercent: number;
  /** Optional: callback when budget is exhausted */
  onBudgetExhausted?: () => void;
  /** Optional: callback when budget recovers */
  onBudgetRecovered?: () => void;
}

class ProductionRetryBudget {
  private budget: number;
  private readonly config: RetryBudgetConfig;
  private wasExhausted: boolean = false;

  // Metrics
  private metrics: RetryBudgetMetrics = {
    totalRequests: 0,
    successfulRequests: 0,
    failedRequests: 0,
    retriesAttempted: 0,
    retriesAllowed: 0,
    retriesDenied: 0,
    currentBudget: 0,
    budgetUtilization: 0,
  };

  constructor(config: Partial<RetryBudgetConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1, // 10 successes = 1 retry
      budgetPerRetry: 1.0,
      minBudgetForRetry: 1.0,
      initialBudgetPercent: 50,
      ...config,
    };
    this.budget = this.config.maxBudget * (this.config.initialBudgetPercent / 100);
    this.updateMetrics();
  }

  /**
   * Record a successful request. Adds to budget.
   */
  recordSuccess(): void {
    this.metrics.totalRequests++;
    this.metrics.successfulRequests++;

    this.budget = Math.min(
      this.config.maxBudget,
      this.budget + this.config.budgetPerSuccess
    );

    // Check if we recovered from exhaustion
    if (this.wasExhausted && this.budget >= this.config.minBudgetForRetry) {
      this.wasExhausted = false;
      this.config.onBudgetRecovered?.();
    }

    this.updateMetrics();
  }

  /**
   * Record a failed request (without retry).
   */
  recordFailure(): void {
    this.metrics.totalRequests++;
    this.metrics.failedRequests++;
    this.updateMetrics();
  }

  /**
   * Check if retry is allowed without consuming budget.
   */
  canRetry(): boolean {
    return this.budget >= this.config.minBudgetForRetry;
  }

  /**
   * Attempt to consume budget for a retry.
   * Returns true if retry is allowed, false if denied.
   */
  tryConsumeForRetry(): boolean {
    this.metrics.retriesAttempted++;

    if (!this.canRetry()) {
      this.metrics.retriesDenied++;

      // Track exhaustion state
      if (!this.wasExhausted) {
        this.wasExhausted = true;
        this.config.onBudgetExhausted?.();
      }

      this.updateMetrics();
      return false;
    }

    this.budget -= this.config.budgetPerRetry;
    this.metrics.retriesAllowed++;
    this.updateMetrics();
    return true;
  }

  /**
   * Get current metrics for monitoring.
   */
  getMetrics(): RetryBudgetMetrics {
    return { ...this.metrics };
  }

  /**
   * Get current budget level.
   */
  getBudget(): number {
    return this.budget;
  }

  /**
   * Get budget as percentage of max.
   */
  getBudgetPercent(): number {
    return (this.budget / this.config.maxBudget) * 100;
  }

  /**
   * Reset metrics (for testing or rolling windows).
   */
  resetMetrics(): void {
    this.metrics = {
      totalRequests: 0,
      successfulRequests: 0,
      failedRequests: 0,
      retriesAttempted: 0,
      retriesAllowed: 0,
      retriesDenied: 0,
      currentBudget: this.budget,
      budgetUtilization: 0,
    };
  }

  private updateMetrics(): void {
    this.metrics.currentBudget = this.budget;
    this.metrics.budgetUtilization =
      ((this.config.maxBudget - this.budget) / this.config.maxBudget) * 100;
  }
}

/**
 * Thrown when a retry is denied because the budget is exhausted.
 * Declared before retryWithBudget so it is defined at every call site.
 */
class RetryBudgetExhaustedError extends Error {
  constructor(message: string, public readonly cause: Error) {
    super(message);
    this.name = "RetryBudgetExhaustedError";
  }
}

/**
 * Retry function with integrated budget management.
 */
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  budget: ProductionRetryBudget,
  options: {
    maxAttempts?: number;
    backoffMs?: (attempt: number) => number;
    shouldRetry?: (error: Error) => boolean;
    onRetryDenied?: (error: Error) => void;
  } = {}
): Promise<T> {
  const {
    maxAttempts = 3,
    backoffMs = (attempt) => 100 * Math.pow(2, attempt - 1),
    shouldRetry = () => true,
    onRetryDenied,
  } = options;

  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await operation();
      budget.recordSuccess();
      return result;
    } catch (error) {
      lastError = error as Error;
      budget.recordFailure();

      // Check if this error is retryable
      if (!shouldRetry(lastError)) {
        throw lastError;
      }

      // Check if we have attempts remaining
      if (attempt >= maxAttempts) {
        throw lastError;
      }

      // Check budget before retrying
      if (!budget.tryConsumeForRetry()) {
        onRetryDenied?.(lastError);
        throw new RetryBudgetExhaustedError(
          "Retry budget exhausted",
          lastError
        );
      }

      // Wait before retry
      await new Promise(r => setTimeout(r, backoffMs(attempt)));
    }
  }

  throw lastError || new Error("Retry failed");
}

// =========================================
// Usage Example
// =========================================

const budget = new ProductionRetryBudget({
  maxBudget: 50,
  budgetPerSuccess: 0.1,
  onBudgetExhausted: () => console.log("⚠️ Retry budget exhausted!"),
  onBudgetRecovered: () => console.log("✅ Retry budget recovered"),
});

async function makeRequest(shouldFail: boolean): Promise<string> {
  return retryWithBudget(
    async () => {
      if (shouldFail) throw new Error("Simulated failure");
      return "success";
    },
    budget,
    {
      maxAttempts: 3,
      onRetryDenied: (err) => console.log(`Retry denied: ${err.message}`),
    }
  );
}
```

In distributed systems with multiple client instances, local retry budgets may not prevent global amplification. If 100 instances each have their own budget, the aggregate retry volume could still be too high.
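A quick calculation shows why the cap matters more than the ratio here. The sketch below uses hypothetical helper names (`worstCaseBurst`, `sustainedRetryRate`, `perInstanceCapFor`) and assumes traffic is spread evenly across instances; the numbers are illustrative only.

```typescript
interface FleetBudget {
  instanceCount: number;
  maxBudgetPerInstance: number; // Cap on stored retry credits per instance
  budgetPerSuccess: number;     // Credits earned per success (e.g. 0.1)
}

/** Retries the fleet could fire at once if every instance's budget is full. */
function worstCaseBurst(fleet: FleetBudget): number {
  return fleet.instanceCount * fleet.maxBudgetPerInstance;
}

/**
 * Long-run retry rate. Accrual is proportional to successes, so it depends
 * on fleet-wide success volume and not on how many instances share it.
 */
function sustainedRetryRate(fleet: FleetBudget, fleetSuccessesPerSecond: number): number {
  return fleetSuccessesPerSecond * fleet.budgetPerSuccess;
}

/** Per-instance cap that keeps the fleet-wide burst under a target. */
function perInstanceCapFor(targetFleetBurst: number, instanceCount: number): number {
  return targetFleetBurst / instanceCount;
}

const fleet: FleetBudget = {
  instanceCount: 100,
  maxBudgetPerInstance: 50,
  budgetPerSuccess: 0.1,
};

console.log(`Worst-case burst: ${worstCaseBurst(fleet)} retries`);
console.log(`Sustained rate at 10,000 successes/s: ${sustainedRetryRate(fleet, 10_000)} retries/s`);
console.log(`Cap per instance for a 500-retry fleet burst: ${perInstanceCapFor(500, 100)}`);
```

Under these assumptions the ratio-based accrual scales with traffic on its own, while the stored-credit cap multiplies by the number of instances, so the cap is the parameter to divide.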
Strategies for distributed coordination:
```typescript
/**
 * Distributed Retry Budget Strategies
 */

// =========================================
// Strategy 1: Per-Instance Budget Division
// =========================================
class PerInstanceBudget {
  private readonly localBudget: ProductionRetryBudget;

  constructor(totalBudget: number, instanceCount: number) {
    // Each instance gets an equal share of the global budget cap
    const perInstanceBudget = totalBudget / instanceCount;

    this.localBudget = new ProductionRetryBudget({
      maxBudget: perInstanceBudget,
      // Each instance already sees only ~1/instanceCount of the successes,
      // so the per-success ratio stays at 10%; dividing it again would
      // shrink the aggregate budget to 10% / instanceCount.
      budgetPerSuccess: 0.1,
    });
  }

  tryConsumeForRetry(): boolean {
    return this.localBudget.tryConsumeForRetry();
  }

  recordSuccess(): void {
    this.localBudget.recordSuccess();
  }
}

// =========================================
// Strategy 2: Probabilistic Budget (No Coordination)
// =========================================
class ProbabilisticRetryBudget {
  private successCount: number = 0;
  private failureCount: number = 0;
  private readonly targetRetryRatio: number;

  constructor(targetRetryRatio: number = 0.10) {
    this.targetRetryRatio = targetRetryRatio;
  }

  recordSuccess(): void {
    this.successCount++;
  }

  recordFailure(): void {
    this.failureCount++;
  }

  /**
   * Probabilistically decide whether to retry.
   * Each instance independently makes this decision,
   * but the aggregate converges to the target ratio.
   */
  shouldRetry(): boolean {
    const totalRequests = this.successCount + this.failureCount;
    if (totalRequests < 10) {
      // Not enough data - use 50% probability
      return Math.random() < 0.5;
    }

    const observedFailureRate = this.failureCount / totalRequests;

    if (observedFailureRate < this.targetRetryRatio) {
      // Low failure rate - always retry
      return true;
    }

    // High failure rate - probabilistic retry
    // P(retry) = targetRatio / observedFailureRate
    // This ensures aggregate retry rate ≈ targetRatio
    const retryProbability = this.targetRetryRatio / observedFailureRate;
    return Math.random() < retryProbability;
  }
}

// =========================================
// Strategy 3: Redis-Backed Shared Budget
// =========================================
// Minimal client interface assumed by this sketch
interface RedisClient {
  incr(key: string): Promise<number>;
  decr(key: string): Promise<number>;
  get(key: string): Promise<string | null>;
  expire(key: string, seconds: number): Promise<void>;
}

class RedisRetryBudget {
  private readonly redis: RedisClient;
  private readonly budgetKey: string;
  private readonly successKey: string;
  private readonly maxBudget: number;
  private readonly budgetRatio: number;
  private readonly windowSeconds: number;

  constructor(
    redis: RedisClient,
    serviceName: string,
    options: {
      maxBudget?: number;
      budgetRatio?: number;
      windowSeconds?: number;
    } = {}
  ) {
    this.redis = redis;
    this.budgetKey = `retry_budget:${serviceName}:budget`;
    this.successKey = `retry_budget:${serviceName}:success`;
    this.maxBudget = options.maxBudget ?? 1000;
    this.budgetRatio = options.budgetRatio ?? 0.1;
    this.windowSeconds = options.windowSeconds ?? 60;
  }

  async recordSuccess(): Promise<void> {
    // Increment success counter atomically
    const successes = await this.redis.incr(this.successKey);
    await this.redis.expire(this.successKey, this.windowSeconds);

    // Earn one whole retry credit every 1 / budgetRatio successes
    const successesPerCredit = Math.max(1, Math.round(1 / this.budgetRatio));
    if (successes % successesPerCredit !== 0) return;

    const newBudget = await this.redis.incr(this.budgetKey);
    if (newBudget > this.maxBudget) {
      // Over the cap: give the credit back. This check-then-undo is not
      // atomic; in practice, do the capped increment in a single
      // server-side Lua script.
      await this.redis.decr(this.budgetKey);
    }
  }

  async tryConsumeForRetry(): Promise<boolean> {
    // Decrement budget atomically
    // Returns false if would go negative
    const newBudget = await this.redis.decr(this.budgetKey);
    if (newBudget < 0) {
      // Went negative - restore and deny
      await this.redis.incr(this.budgetKey);
      return false;
    }
    return true;
  }

  async getAvailableBudget(): Promise<number> {
    const budget = await this.redis.get(this.budgetKey);
    return budget ? parseInt(budget, 10) : 0;
  }
}

// =========================================
// Strategy 4: Hedged Retry with Sampling
// =========================================
class HedgedRetryBudget {
  private readonly samplingRate: number;
  private readonly baseRetryBudget: ProductionRetryBudget;

  constructor(samplingRate: number = 0.25) {
    this.samplingRate = samplingRate;
    this.baseRetryBudget = new ProductionRetryBudget();
  }

  /**
   * Only sample a fraction of failures for retry.
   * Combined with local budget for additional protection.
   */
  shouldRetry(): boolean {
    // First: random sampling
    if (Math.random() > this.samplingRate) {
      return false; // Not sampled for retry
    }

    // Second: local budget check
    return this.baseRetryBudget.tryConsumeForRetry();
  }

  recordSuccess(): void {
    this.baseRetryBudget.recordSuccess();
  }
}

console.log("Distributed Budget Strategies Loaded");
```

Probabilistic retry budgets require no coordination and naturally limit aggregate retry volume. If your target is 10% retries and you have 100 instances, each instance independently retrying failures with probability `targetRetryRatio / observedFailureRate` keeps aggregate retries near 10% of requests. That is the same limit a coordinated budget would enforce, without the added complexity or latency.
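The convergence claim can be checked with a small simulation. The sketch below is a simplified model, not a benchmark: `simulateFleet` and its parameters are hypothetical, retries are counted but not re-issued, and a seeded PRNG (mulberry32) replaces `Math.random` so runs are reproducible.

```typescript
/** Small deterministic PRNG (mulberry32) so the run is reproducible. */
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface FleetResult {
  totalRequests: number;
  totalFailures: number;
  totalRetries: number;
  retryRatio: number; // Retries as a fraction of all requests
}

function simulateFleet(
  instanceCount: number,
  requestsPerInstance: number,
  failureRate: number,
  targetRetryRatio: number,
  seed: number
): FleetResult {
  const random = mulberry32(seed);
  let totalFailures = 0;
  let totalRetries = 0;

  for (let i = 0; i < instanceCount; i++) {
    // Per-instance counters: no budget state is shared between instances
    let successes = 0;
    let failures = 0;

    for (let n = 0; n < requestsPerInstance; n++) {
      if (random() >= failureRate) {
        successes++;
        continue;
      }
      failures++;
      totalFailures++;

      // Same decision rule as the probabilistic budget: 50% while warming
      // up, always retry below target, otherwise target / observed rate
      const seen = successes + failures;
      const observed = failures / seen;
      let retryProbability: number;
      if (seen < 10) {
        retryProbability = 0.5;
      } else if (observed < targetRetryRatio) {
        retryProbability = 1;
      } else {
        retryProbability = targetRetryRatio / observed;
      }
      if (random() < retryProbability) totalRetries++;
    }
  }

  const totalRequests = instanceCount * requestsPerInstance;
  return {
    totalRequests,
    totalFailures,
    totalRetries,
    retryRatio: totalRetries / totalRequests,
  };
}

// 100 instances, 1,000 requests each, 40% failures, 10% retry target
const fleetRun = simulateFleet(100, 1000, 0.4, 0.10, 42);
console.log(`Failures: ${fleetRun.totalFailures}, retries: ${fleetRun.totalRetries}`);
console.log(`Aggregate retry ratio: ${(fleetRun.retryRatio * 100).toFixed(1)}%`);
```

In this model the aggregate ratio should land close to the 10% target even though 40% of requests fail; a small overshoot is expected from the warm-up phase and from noise in each instance's early failure-rate estimate.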
Retry budgets and circuit breakers are complementary patterns. Circuit breakers respond to consecutive failures by stopping all requests. Retry budgets limit the volume of retries. Used together, they provide defense in depth.
| Aspect | Retry Budget | Circuit Breaker | Combined |
|---|---|---|---|
| Trigger | Ratio of retries to successes | Consecutive failures | Either condition |
| Response | Reduce retry rate | Stop all requests | Graceful degradation then stop |
| Recovery | Auto-recover with successes | Half-open probing | Both mechanisms |
| Scope | Retry decisions only | All requests | Full protection |
| Best for | High-volume steady traffic | Sudden complete failures | Production systems |
```typescript
/**
 * Integrated Circuit Breaker + Retry Budget System
 *
 * Provides layered protection:
 * 1. Retry budget controls retry amplification
 * 2. Circuit breaker halts traffic during severe failures
 *
 * ProductionRetryBudget is not defined in this listing. It is expected to
 * provide recordSuccess(), recordFailure(), tryConsumeForRetry(),
 * getBudget(), and getBudgetPercent().
 */

type CircuitState = "closed" | "open" | "half-open";

interface IntegratedPolicyConfig {
  // Retry budget config
  maxBudget: number;
  budgetPerSuccess: number;

  // Circuit breaker config
  failureThreshold: number;      // Failures before opening
  successThreshold: number;      // Successes in half-open to close
  openDurationMs: number;        // How long to stay open
  halfOpenMaxConcurrent: number; // Max requests during half-open
}

class IntegratedRetryPolicy {
  private readonly config: IntegratedPolicyConfig;
  private readonly budget: ProductionRetryBudget;

  // Circuit breaker state
  private circuitState: CircuitState = "closed";
  private consecutiveFailures: number = 0;
  private consecutiveSuccesses: number = 0;
  private openedAt: number = 0;
  private halfOpenInFlight: number = 0;

  constructor(config: Partial<IntegratedPolicyConfig> = {}) {
    this.config = {
      maxBudget: 100,
      budgetPerSuccess: 0.1,
      failureThreshold: 5,
      successThreshold: 3,
      openDurationMs: 30000,
      halfOpenMaxConcurrent: 3,
      ...config,
    };

    this.budget = new ProductionRetryBudget({
      maxBudget: this.config.maxBudget,
      budgetPerSuccess: this.config.budgetPerSuccess,
    });
  }

  /**
   * Check if a request is allowed (considering circuit state).
   * In half-open state, a granted request claims one probe slot.
   */
  allowRequest(): boolean {
    switch (this.circuitState) {
      case "closed":
        return true;

      case "open":
        // Check if it's time to try half-open
        if (Date.now() - this.openedAt < this.config.openDurationMs) {
          return false;
        }
        this.circuitState = "half-open";
        this.halfOpenInFlight = 0;
        this.consecutiveSuccesses = 0;
        // The request that triggers the transition counts as the first probe
        return this.tryAcquireProbeSlot();

      case "half-open":
        // Allow limited requests during probing
        return this.tryAcquireProbeSlot();
    }
  }

  /**
   * Claim a half-open probe slot if one is free.
   */
  private tryAcquireProbeSlot(): boolean {
    if (this.halfOpenInFlight < this.config.halfOpenMaxConcurrent) {
      this.halfOpenInFlight++;
      return true;
    }
    return false;
  }

  /**
   * Check if a retry is allowed (budget + circuit state).
   *
   * The circuit check here has no side effects: the retried attempt
   * itself goes through allowRequest(), which is where a probe slot is
   * claimed. Calling allowRequest() here as well would count the same
   * attempt twice.
   */
  allowRetry(): boolean {
    // Circuit must not be open
    if (this.circuitState === "open") {
      return false;
    }

    // Budget must allow retry
    return this.budget.tryConsumeForRetry();
  }

  /**
   * Record a successful request.
   */
  recordSuccess(): void {
    this.budget.recordSuccess();
    this.consecutiveFailures = 0;
    this.consecutiveSuccesses++;

    if (this.circuitState === "half-open") {
      this.halfOpenInFlight = Math.max(0, this.halfOpenInFlight - 1);

      if (this.consecutiveSuccesses >= this.config.successThreshold) {
        this.circuitState = "closed";
        console.log("Circuit CLOSED - service recovered");
      }
    }
  }

  /**
   * Record a failed request.
   */
  recordFailure(): void {
    this.budget.recordFailure();
    this.consecutiveSuccesses = 0;
    this.consecutiveFailures++;

    if (this.circuitState === "half-open") {
      // Failure during half-open - back to open
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure during probe");
    } else if (
      this.circuitState === "closed" &&
      this.consecutiveFailures >= this.config.failureThreshold
    ) {
      // Too many failures - open circuit
      this.circuitState = "open";
      this.openedAt = Date.now();
      console.log("Circuit OPEN - failure threshold reached");
    }
  }

  /**
   * Get current state for monitoring.
   */
  getState(): {
    circuitState: CircuitState;
    consecutiveFailures: number;
    budget: number;
    budgetPercent: number;
  } {
    return {
      circuitState: this.circuitState,
      consecutiveFailures: this.consecutiveFailures,
      budget: this.budget.getBudget(),
      budgetPercent: this.budget.getBudgetPercent(),
    };
  }
}

// =========================================
// Usage Example
// =========================================

const policy = new IntegratedRetryPolicy({
  maxBudget: 50,
  failureThreshold: 3,
  openDurationMs: 10000,
});

async function makeProtectedRequest<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3
): Promise<T> {
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    // Check if circuit allows request
    if (!policy.allowRequest()) {
      throw new Error("Circuit breaker is open");
    }

    try {
      const result = await operation();
      policy.recordSuccess();
      return result;
    } catch (error) {
      policy.recordFailure();

      // Check if we should retry
      if (attempt <= maxRetries && policy.allowRetry()) {
        console.log(`Retry ${attempt} allowed by policy`);
        // Exponential backoff: 200 ms, 400 ms, 800 ms, ...
        await new Promise(r => setTimeout(r, 100 * Math.pow(2, attempt)));
        continue;
      }

      throw error;
    }
  }

  // Unreachable: every loop iteration either returns or throws.
  throw new Error("Exhausted retries");
}
```

Retry budgets complete our toolkit for safe retries. While backoff and jitter control when to retry, budgets control whether to retry at all, preventing the retry amplification that can turn minor failures into major outages.
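The split between when to retry and whether to retry can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the implementation used elsewhere on this page: `SimpleRetryBudget`, `backoffDelayMs`, and `retryWithBudget` are hypothetical names, the token amounts are arbitrary, and the `random` and `sleep` parameters exist only so the behavior can be exercised deterministically.

```typescript
// Hypothetical minimal budget: each success earns a fraction of a token,
// each retry spends one whole token.
class SimpleRetryBudget {
  private readonly maxTokens: number;
  private readonly tokensPerSuccess: number;
  private tokens: number;

  constructor(maxTokens: number = 10, tokensPerSuccess: number = 0.1) {
    this.maxTokens = maxTokens;
    this.tokensPerSuccess = tokensPerSuccess;
    this.tokens = maxTokens; // start with a full budget
  }

  recordSuccess(): void {
    this.tokens = Math.min(this.maxTokens, this.tokens + this.tokensPerSuccess);
  }

  tryConsume(): boolean {
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// "When": exponential backoff with full jitter, capped at capMs.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 5000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// "Whether": a retry happens only if the shared budget grants a token.
async function retryWithBudget<T>(
  operation: () => Promise<T>,
  budget: SimpleRetryBudget,
  maxRetries: number = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      const result = await operation();
      budget.recordSuccess();
      return result;
    } catch (error) {
      // Give up when the per-request limit is hit or the budget is empty.
      if (attempt >= maxRetries || !budget.tryConsume()) {
        throw error;
      }
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Because the budget object is shared across callers, a burst of failures drains it quickly and later failures are surfaced immediately instead of being retried, even though each request still has its own `maxRetries` allowance.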
What's next:
Our final topic in retry strategies addresses a fundamental requirement for safe retries: Idempotency. Without idempotent operations, even correctly implemented retries can cause data corruption, duplicate charges, or inconsistent state. The next page explores how to design and implement idempotent operations.
You now understand retry budgets—the system-level mechanism that prevents retry amplification from causing cascading failures. Combined with exponential backoff and jitter, retry budgets form a complete framework for resilient retry behavior.