Persistence is a virtue—but in distributed systems, knowing when to stop is equally essential. Every retry consumes resources: memory for state tracking, threads or connections waiting, network bandwidth for repeated requests, and server capacity on the receiving end. Unbounded retries create unbounded resource consumption.
More fundamentally, endless retries postpone the inevitable. If a transient failure has become a persistent outage, continuing to retry delays the moment when the system can take alternative action: returning an error to the user, triggering a fallback, or alerting operations teams.
The maximum retry attempts limit is the backstop that prevents retry logic from becoming a resource leak. It answers the critical question: at what point do we accept that this particular operation has failed and move on? Getting this answer right is the difference between a system that gracefully degrades under failure and one that slowly consumes itself.
This page explores how to determine appropriate retry limits, the multiple dimensions of retry budgets, and the relationship between retry limits and overall system health.
By the end of this page, you will understand how to calculate optimal maximum retry attempts, the difference between attempt-based and time-based limits, how retry limits interact with system resources, layered retry considerations in microservices, and strategies for communicating retry exhaustion to callers.
Before exploring how to set retry limits, we must understand why limits matter. Unbounded retries—or poorly chosen limits—create cascading problems.
Resource Accumulation
Every pending retry consumes resources: memory for tracking retry state, a blocked thread or held connection, bandwidth for the repeated request, and capacity on the receiving server.
The Zombie Request Problem
Consider a request that enters an infinite retry loop: it never completes, never surfaces an error to its caller, and never releases the memory, connection, or thread it holds. With thousands of concurrent users, these zombie requests accumulate, consuming capacity needed for active users.
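The scale of that accumulation is easy to estimate: with unbounded retries nothing completes, so in-flight requests grow at the arrival rate for as long as the outage lasts. A minimal sketch, with illustrative traffic numbers (not taken from any real incident):

```typescript
// Back-of-envelope estimate of zombie-request buildup during an outage.
// With unbounded retries, no request completes, so in-flight requests
// grow linearly: arrivalRate * outageDuration.
function zombieRequests(arrivalRatePerSec: number, outageSec: number): number {
  return arrivalRatePerSec * outageSec;
}

// Hypothetical numbers: 1,000 req/s during a 30-second outage leaves
// 30,000 requests still holding memory and connections when the
// dependency recovers.
console.log(zombieRequests(1000, 30)); // 30000
```

Each of those requests holds its resources until it is explicitly abandoned, which is exactly what a retry limit provides.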
A major e-commerce platform experienced a 4-hour outage when their payment service went down for 30 seconds. Without retry limits, checkout operations continued retrying indefinitely. The accumulated retry state consumed all available memory on API servers. By the time the payment service recovered, the API servers themselves were crashing due to OOM conditions, creating a cascading failure that took hours to clear.
Retry limits can be expressed in two fundamental ways: maximum number of attempts, or maximum total time. Understanding the trade-offs helps you choose appropriately.
Attempt-Based Limits
Limit retries to a fixed number of attempts (e.g., "retry up to 5 times"):
```
maxAttempts = 5
for attempt = 1 to maxAttempts:
    try operation
    if success: return
    if attempt < maxAttempts: wait(backoff)
throw RetryExhausted
```
Advantages: simple to implement and reason about, with a predictable, bounded amount of retry state per request.
Disadvantages: total elapsed time is unpredictable, since it depends on how quickly each attempt fails.
Time-Based Limits
Limit retries to a maximum total duration (e.g., "retry for up to 30 seconds"):
```
timeoutAt = now() + maxDuration
while now() < timeoutAt:
    try operation
    if success: return
    if now() + backoff > timeoutAt: break
    wait(backoff)
throw RetryExhausted
```
Advantages: bounds worst-case latency, making it easy to honor SLAs and caller deadlines.
Disadvantages: the number of attempts is unpredictable; fast failures can produce many attempts, while slow timeouts may allow only a few.
With an attempt-based limit (5 attempts), total time depends on how fast each attempt fails:

| Scenario | Attempts | Total Time |
|---|---|---|
| Fast failures (10ms each) | 5 | ~150ms* |
| Timeout failures (5s each) | 5 | ~25s + backoff |
| Mixed failures | 5 | Variable |
With a time-based limit (30 seconds), the number of attempts depends on how fast each attempt fails:

| Scenario | Time Limit | Attempts |
|---|---|---|
| Fast failures (10ms each) | 30s | Many (10+) |
| Timeout failures (5s each) | 30s | Few (3-4) |
| Mixed failures | 30s | Variable |
*Including backoff delays
Best Practice: Combine Both
Production systems typically combine both limits:
```
retry until:
    maxAttempts reached
    OR totalTimeLimit exceeded
whichever comes first
```
This provides the predictability of attempt limits with the latency guarantees of time limits.
```typescript
// Combined attempt-based and time-based retry limits
interface RetryLimits {
  maxAttempts: number;          // Stop after this many attempts
  maxDurationMs: number;        // Stop after this total time
  perAttemptTimeoutMs?: number; // Timeout for each individual attempt
}

interface RetryState {
  attempt: number;
  startTime: number;
  lastError?: Error;
}

function shouldContinueRetrying(
  state: RetryState,
  limits: RetryLimits,
  nextDelayMs: number
): { shouldRetry: boolean; reason?: string } {
  // Check attempt limit
  if (state.attempt >= limits.maxAttempts) {
    return {
      shouldRetry: false,
      reason: `Max attempts (${limits.maxAttempts}) reached`,
    };
  }

  const elapsed = Date.now() - state.startTime;

  // Check if already past time limit
  if (elapsed >= limits.maxDurationMs) {
    return {
      shouldRetry: false,
      reason: `Time limit (${limits.maxDurationMs}ms) exceeded`,
    };
  }

  // Check if next attempt would exceed time limit
  // (delay + minimum expected operation time)
  const minOperationTime = limits.perAttemptTimeoutMs || 1000;
  if (elapsed + nextDelayMs + minOperationTime > limits.maxDurationMs) {
    return {
      shouldRetry: false,
      reason: 'Insufficient time for another attempt',
    };
  }

  return { shouldRetry: true };
}

// Production retry executor with combined limits
async function executeWithCombinedLimits<T>(
  operation: () => Promise<T>,
  limits: RetryLimits,
  backoff: BackoffCalculator,
  isRetryable: (error: Error) => boolean
): Promise<T> {
  const state: RetryState = {
    attempt: 0,
    startTime: Date.now(),
  };

  while (true) {
    state.attempt++;

    try {
      // Apply per-attempt timeout if configured
      if (limits.perAttemptTimeoutMs) {
        return await withTimeout(operation(), limits.perAttemptTimeoutMs);
      }
      return await operation();
    } catch (error) {
      state.lastError = error as Error;

      // Check if error is retryable
      if (!isRetryable(state.lastError)) {
        throw new NonRetryableError(state.lastError, state);
      }

      // Calculate next delay
      const nextDelay = backoff.nextDelay(state.attempt - 1);

      // Check if we should continue
      const decision = shouldContinueRetrying(state, limits, nextDelay);
      if (!decision.shouldRetry) {
        throw new RetryExhaustedError(decision.reason!, state, limits);
      }

      // Wait and retry
      await sleep(nextDelay);
    }
  }
}

class RetryExhaustedError extends Error {
  constructor(
    public reason: string,
    public state: RetryState,
    public limits: RetryLimits
  ) {
    const elapsed = Date.now() - state.startTime;
    super(
      `Retry exhausted: ${reason}. ` +
      `Attempts: ${state.attempt}/${limits.maxAttempts}, ` +
      `Duration: ${elapsed}ms/${limits.maxDurationMs}ms. ` +
      `Last error: ${state.lastError?.message}`
    );
    this.name = 'RetryExhaustedError';
  }
}

class NonRetryableError extends Error {
  constructor(
    public originalError: Error,
    public state: RetryState
  ) {
    super(
      `Non-retryable error on attempt ${state.attempt}: ` +
      `${originalError.message}`
    );
    this.name = 'NonRetryableError';
  }
}

interface BackoffCalculator {
  nextDelay(attemptIndex: number): number;
}

function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error('Timeout')), ms)
    ),
  ]);
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

When time-based limits are used, propagate the remaining deadline to downstream calls. If your overall budget is 30 seconds and you've spent 10 seconds on retries, downstream calls should have at most 20 seconds remaining. This prevents retry chains from exceeding the original caller's expectations.
Determining the right retry limit is part science, part art. The optimal limit depends on multiple factors that must be balanced.
Factors Influencing Retry Limits
User Expectations
How long can users wait for a response? The retry limit must fit within these expectations.
Typical Failure Duration
How long do transient failures typically last? Retry limits should provide a reasonable opportunity for recovery without waiting on improbable recovery.
Backoff Schedule
Your backoff parameters determine how long N retries take. With a 100ms base delay and a 2x multiplier (no delay cap):
| Retries | Delays Applied | Total Wait Time | Notes |
|---|---|---|---|
| 1 | 100ms | 100ms | Minimal recovery opportunity |
| 2 | 100 + 200 | 300ms | Brief transients only |
| 3 | 100 + 200 + 400 | 700ms | Network issues |
| 4 | ...+ 800 | 1.5s | Short service interruptions |
| 5 | ...+ 1600 | 3.1s | Reasonable for most APIs |
| 6 | ...+ 3200 | 6.3s | Extended recovery window |
| 8 | ...+ 12800 | 25.5s | Long recovery window |
| 10 | ...+ 51200 | 102s (~1.7min) | Very patient retry |
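The totals in this table are just a geometric series: each delay doubles, so the sum for n retries is base × (2ⁿ − 1). A small sketch to reproduce the table's wait times (uncapped delays assumed, matching the table):

```typescript
// Total backoff wait for n retries with exponential delays
// base, base*m, base*m^2, ... (no max-delay cap applied).
function totalBackoffMs(retries: number, baseMs: number, multiplier: number): number {
  let total = 0;
  for (let i = 0; i < retries; i++) {
    total += baseMs * Math.pow(multiplier, i);
  }
  return total;
}

// Matches the table rows: 5 retries at 100ms base, 2x multiplier
console.log(totalBackoffMs(5, 100, 2));  // 3100  (~3.1s)
console.log(totalBackoffMs(10, 100, 2)); // 102300 (~1.7min)
```

In practice a max-delay cap flattens the tail of this series, which is why the later rows grow so steeply without one.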
Historical Success Rates
Historical data reveals how often retries succeed by attempt number:
```
// Example from production system:
Attempt 1 (original):  96% success
Attempt 2 (1st retry):  3% success (of remaining 4%)
Attempt 3 (2nd retry):  0.7% success
Attempt 4 (3rd retry):  0.2% success
Attempt 5+:           < 0.1% success
```
In this example, retries beyond attempt 4-5 provide diminishing returns. This data-driven approach is the gold standard for tuning retry limits.
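Picking the cutoff from such data can be made mechanical: keep attempts whose marginal success rate clears a threshold, and stop at the last one that does. The rates below are the example figures above; the 0.1% threshold is an illustrative choice:

```typescript
// Choose a retry limit from observed marginal success rates
// (fraction of ALL requests that are resolved at each attempt).
function chooseMaxAttempts(marginalSuccess: number[], minGain: number): number {
  let last = 1; // Always allow at least the original attempt
  marginalSuccess.forEach((gain, i) => {
    if (gain >= minGain) last = i + 1; // attempts are 1-indexed
  });
  return last;
}

// Example data from above: 96%, 3%, 0.7%, 0.2%, <0.1%
const gains = [0.96, 0.03, 0.007, 0.002, 0.0005];
console.log(chooseMaxAttempts(gains, 0.001)); // 4 (0.1% threshold)
```

With a stricter 0.5% threshold the same data yields 3 attempts; where you draw the line depends on how much residual success justifies the extra load.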
```typescript
// Framework for calculating optimal retry limits
interface BackoffConfig {
  baseDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

interface RetryConstraints {
  maxLatencyMs: number;       // Maximum acceptable total latency
  expectedRecoveryMs: number; // Typical transient failure duration
  operationTimeoutMs: number; // Timeout for each individual operation
}

/**
 * Calculate maximum retry attempts that fit within latency budget
 */
function calculateMaxAttempts(
  backoff: BackoffConfig,
  constraints: RetryConstraints
): { maxAttempts: number; totalExpectedMs: number; reasoning: string } {
  let totalDelayMs = 0;
  let attempts = 0;

  // Calculate how many attempts fit within budget
  while (true) {
    attempts++;

    // Time for this attempt: operation + subsequent delay
    const attemptDelay =
      attempts > 1
        ? Math.min(
            backoff.baseDelayMs * Math.pow(backoff.multiplier, attempts - 2),
            backoff.maxDelayMs
          )
        : 0;

    const attemptTotal =
      totalDelayMs + attemptDelay + constraints.operationTimeoutMs;

    // Check if this attempt would exceed budget
    if (attemptTotal > constraints.maxLatencyMs) {
      break;
    }

    totalDelayMs += attemptDelay;

    // Check if we've provided enough recovery window
    if (totalDelayMs >= constraints.expectedRecoveryMs && attempts >= 3) {
      return {
        maxAttempts: attempts,
        totalExpectedMs: attemptTotal,
        reasoning:
          `${attempts} attempts provide ${totalDelayMs}ms recovery window, ` +
          `exceeding expected ${constraints.expectedRecoveryMs}ms recovery time`,
      };
    }
  }

  return {
    maxAttempts: Math.max(attempts - 1, 1), // At least 1 attempt
    totalExpectedMs: totalDelayMs,
    reasoning:
      `Limited to ${attempts - 1} attempts to fit within ` +
      `${constraints.maxLatencyMs}ms latency budget`,
  };
}

// Example calculations for different scenarios
const scenarios = [
  {
    name: 'User-facing API',
    backoff: { baseDelayMs: 100, multiplier: 2, maxDelayMs: 5000 },
    constraints: { maxLatencyMs: 10000, expectedRecoveryMs: 3000, operationTimeoutMs: 2000 },
  },
  {
    name: 'Background Job',
    backoff: { baseDelayMs: 1000, multiplier: 2, maxDelayMs: 60000 },
    constraints: { maxLatencyMs: 300000, expectedRecoveryMs: 30000, operationTimeoutMs: 10000 },
  },
  {
    name: 'External API (rate limited)',
    backoff: { baseDelayMs: 2000, multiplier: 2, maxDelayMs: 120000 },
    constraints: { maxLatencyMs: 600000, expectedRecoveryMs: 60000, operationTimeoutMs: 30000 },
  },
];

for (const scenario of scenarios) {
  const result = calculateMaxAttempts(scenario.backoff, scenario.constraints);
  console.log(`${scenario.name}:`);
  console.log(`  Max attempts: ${result.maxAttempts}`);
  console.log(`  Expected duration: ${(result.totalExpectedMs / 1000).toFixed(1)}s`);
  console.log(`  Reasoning: ${result.reasoning}`);
}

/**
 * Utility to calculate total delay for a given number of attempts
 */
function calculateTotalDelay(attempts: number, backoff: BackoffConfig): number {
  let total = 0;
  for (let i = 0; i < attempts - 1; i++) {
    total += Math.min(
      backoff.baseDelayMs * Math.pow(backoff.multiplier, i),
      backoff.maxDelayMs
    );
  }
  return total;
}
```

The best retry limits come from production data. Track success rate by attempt number, and set your limit where marginal success rate drops below a meaningful threshold (e.g., 0.5% additional success). This balances resource consumption against recovery probability.
In modern architectures, requests often pass through multiple layers, each potentially implementing its own retry logic. Without coordination, retry counts multiply exponentially.
The Retry Amplification Problem
Consider a typical microservices call chain:
Client → API Gateway → Service A → Service B → Database
If each of the four calling layers (client, gateway, Service A, Service B) makes up to 3 attempts, a single user action can generate up to 3 × 3 × 3 × 3 = 81 database requests. If the database is struggling, this amplification makes recovery nearly impossible.
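This worst case is simply the per-layer attempt count raised to the number of retrying layers; a minimal sketch:

```typescript
// Worst-case request amplification when every layer retries
// independently: attemptsPerLayer ^ retryingLayers.
function worstCaseRequests(attemptsPerLayer: number, retryingLayers: number): number {
  return Math.pow(attemptsPerLayer, retryingLayers);
}

// Client, gateway, Service A, and Service B each making 3 attempts:
console.log(worstCaseRequests(3, 4)); // 81
```

The exponential shape is the key insight: adding one more retrying layer multiplies the worst case by the full per-layer attempt count.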
Strategies for Coordinated Retries
1. Single Retry Point
Designate one layer as responsible for retries. Other layers fail fast.
```typescript
// Strategy 1: Single Retry Point Configuration
const layerRetryConfigs = {
  // API Gateway: Primary retry point
  apiGateway: {
    maxAttempts: 3,
    baseDelayMs: 100,
    multiplier: 2,
    maxDelayMs: 5000,
  },

  // Internal services: No retry (rely on gateway)
  serviceA: {
    maxAttempts: 1, // No retry
    baseDelayMs: 0,
    multiplier: 1,
    maxDelayMs: 0,
  },

  // Database client: Minimal, connection-level retry only
  databaseClient: {
    maxAttempts: 2, // Only for connection establishment
    baseDelayMs: 50,
    multiplier: 2,
    maxDelayMs: 200,
    onlyConnectionErrors: true, // Don't retry query failures
  },
};

// Strategy 2: Tiered Retry Budgets
interface RetryBudget {
  maxAttempts: number;
  maxDurationMs: number;
}

function tierRetryBudgets(
  totalBudget: RetryBudget,
  layerCount: number
): RetryBudget[] {
  // Distribute budget across layers with diminishing allocations
  // Each layer gets progressively smaller budget
  const budgets: RetryBudget[] = [];
  let remainingDuration = totalBudget.maxDurationMs;
  let remainingAttempts = totalBudget.maxAttempts;

  for (let i = 0; i < layerCount; i++) {
    const fraction = 1 / (layerCount - i);
    const layerDuration = Math.floor(remainingDuration * fraction * 0.7);
    const layerAttempts = Math.max(1, Math.floor(remainingAttempts * fraction));

    budgets.push({
      maxAttempts: layerAttempts,
      maxDurationMs: layerDuration,
    });

    remainingDuration = Math.max(0, remainingDuration - layerDuration);
    remainingAttempts = Math.max(1, remainingAttempts - layerAttempts + 1);
  }

  return budgets;
}

// Example: 3 layers with 10s total budget, 6 total attempts
const totalBudget = { maxAttempts: 6, maxDurationMs: 10000 };
const layerBudgets = tierRetryBudgets(totalBudget, 3);
// Results in something like:
// Layer 0 (outer):  { maxAttempts: 2, maxDurationMs: 2333 }
// Layer 1 (middle): { maxAttempts: 2, maxDurationMs: 2683 }
// Layer 2 (inner):  { maxAttempts: 4, maxDurationMs: 3488 }

// Strategy 3: Deadline-Based Coordination
interface RequestContext {
  deadline: number;             // Absolute timestamp when request must complete
  remainingRetryBudget: number; // Shared retry budget across layers
}

function shouldRetryWithContext(
  context: RequestContext,
  attemptNumber: number,
  nextDelayMs: number
): boolean {
  // Check deadline
  if (Date.now() + nextDelayMs > context.deadline) {
    return false;
  }
  // Check shared retry budget
  if (context.remainingRetryBudget <= 0) {
    return false;
  }
  return true;
}

function consumeRetryFromContext(context: RequestContext): void {
  context.remainingRetryBudget--;
}

// Propagate context to downstream calls
function propagateContext(
  parentContext: RequestContext,
  operationTimeMs: number
): RequestContext {
  return {
    deadline: Math.min(parentContext.deadline, Date.now() + operationTimeMs),
    remainingRetryBudget: parentContext.remainingRetryBudget,
  };
}
```

If using a service mesh (Istio, Linkerd, Envoy), decide whether retries happen at the mesh layer or application layer—not both. Service mesh retries are transparent to application code but harder to customize. Application retries offer more control but require explicit coding. Pick one as primary and configure the other layer to pass through failures.
Static retry limits work for many scenarios, but sophisticated systems may benefit from dynamic adjustment based on current conditions.
Signals for Dynamic Adjustment
Error Rates
When a service is experiencing high error rates, continuing to retry at full capacity is counterproductive; reducing the retry limit sheds load while the service is unhealthy.
Latency
Increasing latency suggests overload, and retries add more load; lowering the limit as latency climbs eases pressure.
Circuit Breaker State
Integrate with circuit breakers for a coordinated response: an open circuit should suppress retries entirely, and a half-open circuit should allow only a single probe attempt.
```typescript
// Dynamic retry limit adjustment based on system health
interface ServiceHealth {
  errorRate: number;          // 0-1, current error rate
  p99LatencyMs: number;       // Current p99 latency
  normalP99LatencyMs: number; // Baseline p99 latency
  circuitState: 'closed' | 'half-open' | 'open';
}

interface DynamicRetryConfig {
  baseMaxAttempts: number; // Full retry attempts when healthy
  minMaxAttempts: number;  // Minimum (typically 1)
  errorRateThresholds: {
    moderate: number; // Start reducing at this rate
    high: number;     // Severe reduction at this rate
  };
  latencyThresholds: {
    elevated: number; // Times normal - start reducing
    severe: number;   // Times normal - severe reduction
  };
}

function calculateDynamicRetryLimit(
  health: ServiceHealth,
  config: DynamicRetryConfig
): number {
  // Circuit breaker override
  if (health.circuitState === 'open') {
    return 0; // Don't even try
  }
  if (health.circuitState === 'half-open') {
    return 1; // Single probe attempt
  }

  let limit = config.baseMaxAttempts;

  // Error rate adjustment
  if (health.errorRate >= config.errorRateThresholds.high) {
    limit = Math.max(config.minMaxAttempts, Math.floor(limit * 0.3));
  } else if (health.errorRate >= config.errorRateThresholds.moderate) {
    limit = Math.max(config.minMaxAttempts, Math.floor(limit * 0.6));
  }

  // Latency adjustment
  const latencyMultiplier = health.p99LatencyMs / health.normalP99LatencyMs;
  if (latencyMultiplier >= config.latencyThresholds.severe) {
    limit = Math.max(config.minMaxAttempts, Math.floor(limit * 0.4));
  } else if (latencyMultiplier >= config.latencyThresholds.elevated) {
    limit = Math.max(config.minMaxAttempts, Math.floor(limit * 0.7));
  }

  return limit;
}

// Example: Adaptive retry client
class AdaptiveRetryClient {
  private healthTracker: HealthTracker;
  private config: DynamicRetryConfig;

  constructor(serviceName: string, config: DynamicRetryConfig) {
    this.healthTracker = new HealthTracker(serviceName);
    this.config = config;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    const health = this.healthTracker.getCurrentHealth();
    const maxAttempts = calculateDynamicRetryLimit(health, this.config);

    console.log(
      `Dynamic retry: ${maxAttempts} attempts ` +
      `(error rate: ${(health.errorRate * 100).toFixed(1)}%, ` +
      `p99: ${health.p99LatencyMs}ms)`
    );

    let lastError: Error | undefined;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      const startTime = Date.now();
      try {
        const result = await operation();
        this.healthTracker.recordSuccess(Date.now() - startTime);
        return result;
      } catch (error) {
        lastError = error as Error;
        this.healthTracker.recordFailure(Date.now() - startTime);
        if (attempt < maxAttempts) {
          await this.delay(attempt);
        }
      }
    }

    throw new Error(
      `Operation failed after ${maxAttempts} dynamic attempts: ` +
      `${lastError?.message}`
    );
  }

  private delay(attemptNumber: number): Promise<void> {
    const delay = 100 * Math.pow(2, attemptNumber - 1);
    return new Promise(resolve => setTimeout(resolve, delay));
  }
}

// Health tracking (simplified)
class HealthTracker {
  private recentRequests: { success: boolean; latencyMs: number }[] = [];
  private windowMs = 60000; // 1 minute window

  constructor(private serviceName: string) {}

  recordSuccess(latencyMs: number): void {
    this.recentRequests.push({ success: true, latencyMs });
    this.cleanup();
  }

  recordFailure(latencyMs: number): void {
    this.recentRequests.push({ success: false, latencyMs });
    this.cleanup();
  }

  getCurrentHealth(): ServiceHealth {
    this.cleanup();

    if (this.recentRequests.length === 0) {
      return {
        errorRate: 0,
        p99LatencyMs: 100,
        normalP99LatencyMs: 100,
        circuitState: 'closed',
      };
    }

    const failures = this.recentRequests.filter(r => !r.success).length;
    const errorRate = failures / this.recentRequests.length;

    const latencies = this.recentRequests
      .map(r => r.latencyMs)
      .sort((a, b) => a - b);
    const p99Index = Math.floor(latencies.length * 0.99);
    const p99LatencyMs = latencies[p99Index];

    return {
      errorRate,
      p99LatencyMs,
      normalP99LatencyMs: 100, // Would be calculated from baseline
      circuitState: errorRate > 0.5 ? 'open' : 'closed',
    };
  }

  private cleanup(): void {
    // Remove old entries (in production, maintain with timestamps)
  }
}
```

Dynamic retry limits add significant complexity. For most systems, static limits with circuit breakers provide sufficient adaptiveness. Consider dynamic limits only when you have sophisticated monitoring, clear failure patterns to respond to, and the operational capacity to debug adaptive behavior.
When retries are exhausted, how you communicate back to the caller significantly impacts system behavior and user experience.
Rich Retry Exhaustion Errors
Rather than a generic "request failed" error, provide actionable information:
```typescript
// Rich retry exhaustion error with actionable information
interface RetryExhaustionDetails {
  // What was attempted
  operation: string;
  targetService: string;

  // Retry statistics
  totalAttempts: number;
  totalDurationMs: number;

  // Limit that was hit
  exhaustionReason: 'max_attempts' | 'timeout' | 'circuit_open' | 'cancelled';

  // Error information
  lastError: {
    message: string;
    code?: string;
    statusCode?: number;
  };

  // Per-attempt breakdown (optional, for debugging)
  attempts?: {
    attemptNumber: number;
    durationMs: number;
    error: string;
  }[];

  // Retry-After hint if known
  retryAfterMs?: number;

  // Whether this might succeed if retried fresh
  retryable: boolean;
}

class RetryExhaustionError extends Error {
  constructor(public details: RetryExhaustionDetails) {
    super(
      `Retry exhausted for ${details.operation} to ${details.targetService}: ` +
      `${details.exhaustionReason} after ${details.totalAttempts} attempts ` +
      `(${details.totalDurationMs}ms). Last error: ${details.lastError.message}`
    );
    this.name = 'RetryExhaustionError';
  }

  /**
   * Should the caller retry this operation?
   */
  shouldCallerRetry(): boolean {
    // Non-retryable errors (4xx) shouldn't be retried
    if (!this.details.retryable) return false;

    // If circuit is open, don't retry until cooldown
    if (this.details.exhaustionReason === 'circuit_open') {
      return false;
    }

    // Transient errors may succeed with fresh attempt
    return true;
  }

  /**
   * How long should caller wait before retrying?
   */
  suggestedRetryDelayMs(): number | null {
    if (!this.shouldCallerRetry()) return null;

    // If server provided Retry-After
    if (this.details.retryAfterMs) {
      return this.details.retryAfterMs;
    }

    // Default: exponential based on attempts made
    return Math.min(1000 * Math.pow(2, this.details.totalAttempts), 60000);
  }

  /**
   * Convert to API response format
   */
  toApiResponse(): {
    status: number;
    body: object;
    headers: Record<string, string>;
  } {
    return {
      status: 503,
      body: {
        error: 'service_unavailable',
        message: 'Service temporarily unavailable. Please retry.',
        retryable: this.details.retryable,
        details: {
          attempts: this.details.totalAttempts,
          durationMs: this.details.totalDurationMs,
        },
      },
      headers: {
        'Retry-After': String(
          Math.ceil((this.suggestedRetryDelayMs() || 30000) / 1000)
        ),
      },
    };
  }
}

// Usage in API handler
async function handleApiRequest(request: Request): Promise<Response> {
  try {
    return await processWithRetry(request);
  } catch (error) {
    if (error instanceof RetryExhaustionError) {
      const { status, body, headers } = error.toApiResponse();

      // Log with full details for debugging
      console.error('Retry exhaustion:', {
        ...error.details,
        requestId: request.headers.get('x-request-id'),
      });

      return new Response(JSON.stringify(body), {
        status,
        headers: new Headers({
          'Content-Type': 'application/json',
          ...headers,
        }),
      });
    }
    throw error;
  }
}

declare function processWithRetry(request: Request): Promise<Response>;
```

When retry exhaustion occurs from upstream service failure, return 503 Service Unavailable (not 500 Internal Server Error). 503 specifically indicates temporary unavailability and suggests retry may succeed later. Include a Retry-After header. 500 implies a bug in your service rather than upstream issues.
Retry exhaustion events are valuable operational signals. Properly monitoring them enables proactive incident response.
Key Metrics to Track
| Metric | Purpose | Alert Threshold |
|---|---|---|
| retry_exhaustion_total | Total failed operations after all retries | 0.1% of requests |
| retry_attempts_histogram | Distribution of attempts before success/failure | p99 > maxAttempts - 1 |
| retry_success_rate_by_attempt | Success rate at each attempt number | Sudden drop in attempt 1 |
| retry_total_duration_seconds | Time spent in retry logic | p99 > SLA budget |
| retry_limit_reached_by_service | Exhaustion breakdown by target service | Any single service dominant |
```typescript
// Retry metrics collection for observability
interface RetryMetrics {
  // Counter: total retry exhaustion events
  recordExhaustion(
    service: string,
    operation: string,
    reason: 'max_attempts' | 'timeout' | 'circuit_open',
    attempts: number
  ): void;

  // Counter: successful retries
  recordRetrySuccess(
    service: string,
    operation: string,
    attemptNumber: number,
    totalDurationMs: number
  ): void;

  // Histogram: attempts before resolution
  recordAttemptsBeforeResolution(
    service: string,
    operation: string,
    attempts: number,
    succeeded: boolean
  ): void;

  // Histogram: total retry duration
  recordRetryDuration(
    service: string,
    operation: string,
    durationMs: number,
    succeeded: boolean
  ): void;
}

// Example Prometheus-style implementation
class PrometheusRetryMetrics implements RetryMetrics {
  recordExhaustion(
    service: string,
    operation: string,
    reason: string,
    attempts: number
  ): void {
    // Counter with labels
    // retry_exhaustion_total{service="payment",operation="charge",reason="max_attempts"}
    console.log(
      `COUNTER retry_exhaustion_total{service="${service}",` +
      `operation="${operation}",reason="${reason}"} 1`
    );

    // Also record the attempt count at exhaustion
    console.log(
      `HISTOGRAM retry_attempts_at_exhaustion{service="${service}",` +
      `operation="${operation}"} ${attempts}`
    );
  }

  recordRetrySuccess(
    service: string,
    operation: string,
    attemptNumber: number,
    totalDurationMs: number
  ): void {
    // Track success by attempt number
    console.log(
      `COUNTER retry_success_total{service="${service}",` +
      `operation="${operation}",attempt="${attemptNumber}"} 1`
    );
    console.log(
      `HISTOGRAM retry_success_duration_ms{service="${service}",` +
      `operation="${operation}"} ${totalDurationMs}`
    );
  }

  recordAttemptsBeforeResolution(
    service: string,
    operation: string,
    attempts: number,
    succeeded: boolean
  ): void {
    const outcome = succeeded ? 'success' : 'failure';
    console.log(
      `HISTOGRAM retry_attempts{service="${service}",` +
      `operation="${operation}",outcome="${outcome}"} ${attempts}`
    );
  }

  recordRetryDuration(
    service: string,
    operation: string,
    durationMs: number,
    succeeded: boolean
  ): void {
    const outcome = succeeded ? 'success' : 'failure';
    console.log(
      `HISTOGRAM retry_duration_ms{service="${service}",` +
      `operation="${operation}",outcome="${outcome}"} ${durationMs}`
    );
  }
}

// Alert definitions (example PromQL)
const alertDefinitions = [
  {
    name: 'HighRetryExhaustionRate',
    query: `
      sum(rate(retry_exhaustion_total[5m]))
        / sum(rate(requests_total[5m])) > 0.01
    `,
    severity: 'warning',
    description: 'More than 1% of requests exhausting retries',
  },
  {
    name: 'CriticalRetryExhaustionRate',
    query: `
      sum(rate(retry_exhaustion_total[5m]))
        / sum(rate(requests_total[5m])) > 0.05
    `,
    severity: 'critical',
    description: 'More than 5% of requests exhausting retries',
  },
  {
    name: 'RetryDependencyDegraded',
    query: `
      max by (service) (
        sum(rate(retry_exhaustion_total[5m])) by (service)
          / sum(rate(retry_attempts{attempt="1"}[5m])) by (service)
      ) > 0.1
    `,
    severity: 'warning',
    description: 'A specific service is causing > 10% retry exhaustion',
  },
  {
    name: 'RetryLatencyBudgetExceeded',
    query: `
      histogram_quantile(0.99,
        sum(rate(retry_duration_ms_bucket[5m])) by (le)) > 5000
    `,
    severity: 'warning',
    description: 'p99 retry duration exceeding 5 seconds',
  },
];
```

Create a dedicated "Retry Health" dashboard showing: (1) retry exhaustion rate over time, (2) success rate by attempt number (which attempt usually succeeds?), (3) top services causing exhaustion, (4) retry duration percentiles. This dashboard becomes critical during incidents to understand whether retries are helping or hurting recovery.
Maximum retry limits are the essential backstop that prevents retry logic from becoming a resource leak. Getting limits right balances recovery opportunity against system stability.
What's Next:
We've covered when to retry, how to time retries, how to prevent thundering herds, and when to stop retrying. The final critical piece is idempotency requirements—the essential precondition for safe retries of operations that modify state. Without idempotency, retry logic can cause data corruption, duplicate charges, and inconsistent system state.
You now understand how to determine optimal retry limits, the difference between attempt-based and time-based limits, how to coordinate retries across service layers, and how to properly communicate retry exhaustion. This prepares you for the final and crucial topic: idempotency requirements.