In a perfect world, every remote call would succeed on the first attempt. Reality, however, is far more complex. Networks experience transient congestion, services momentarily overload, and infrastructure components undergo brief interruptions. The question facing every distributed systems engineer isn't whether failures will occur, but how to respond when they do.
Retry logic is one of the most powerful—and most dangerous—tools in your resilience toolkit. Implemented correctly, retries allow systems to automatically recover from transient failures without human intervention, providing seamless resilience that users never notice. Implemented incorrectly, retries can transform minor hiccups into cascading catastrophes, amplifying load during outages and turning recoverable situations into system-wide meltdowns.
This page establishes the fundamental decision framework for retry logic: understanding when to retry, why that distinction matters, and how to classify failures accurately. Before you ever consider backoff algorithms or jitter strategies, you must first master this critical question: Should you retry at all?
By the end of this page, you will understand the distinction between transient and permanent failures, master a systematic framework for retry decisions, recognize the dangers of inappropriate retries, and know how to classify different failure types for optimal retry strategies. This knowledge forms the foundation for all subsequent retry pattern discussions.
Before discussing retry strategies, we must understand the taxonomy of failures in distributed systems. Not all failures are created equal, and treating them identically leads to either missed recovery opportunities or catastrophic retry storms.
The Fundamental Distinction: Transient vs Permanent
Every failure in a distributed system falls into one of two broad categories:
Transient failures are temporary conditions that resolve spontaneously over time. The underlying cause—network congestion, temporary resource exhaustion, brief service restart—is self-correcting. If you simply wait and try again, the request will likely succeed. These are the ideal candidates for retry.
Permanent failures represent conditions that will not change regardless of how long you wait or how many times you retry. A malformed request, an invalid authentication token, a deleted resource—these errors indicate fundamental problems that require different handling: code changes, user intervention, or graceful degradation.
| Characteristic | Transient Failures | Permanent Failures |
|---|---|---|
| Self-correcting | Yes - resolve without intervention | No - require external action |
| Time-dependent | Likely to succeed if retried later | Will fail indefinitely |
| Root cause | Temporary system conditions | Fundamental request/state problems |
| Retry value | High - automatic recovery | None - wastes resources |
| Correct response | Retry with backoff | Fail fast, escalate, or degrade |
| Examples | Network timeout, 503 Service Unavailable | 400 Bad Request, 404 Not Found |
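As a first approximation, the table above can be collapsed into a small predicate over HTTP status codes. This is a deliberately simplified sketch (status codes only, with illustrative names); a fuller classification framework appears later in this page.

```typescript
// Simplified sketch: status-code-only retryability check.
// Real policies should also consider exception types and error codes.
const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504]);

function isLikelyTransient(status: number): boolean {
  if (TRANSIENT_STATUSES.has(status)) return true;  // explicitly temporary conditions
  if (status >= 400 && status < 500) return false;  // remaining client errors are permanent
  // 500 and other server errors are ambiguous; a cautious default is no retry
  return false;
}
```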
Why This Distinction Matters Operationally
Misclassifying failures has severe consequences in both directions:
Treating permanent failures as transient (retry what shouldn't be retried):

- Wastes resources and adds latency retrying requests that can never succeed
- Amplifies load on a dependency that is already struggling
- Delays the error feedback the caller needs to take corrective action
Treating transient failures as permanent (fail fast when you should retry):

- Surfaces user-visible errors for faults that would have resolved in moments
- Reduces availability, since recoverable requests are abandoned
- Forces manual intervention for conditions that would have self-healed
In practice, classifying failures isn't always straightforward. A timeout could indicate transient network congestion (retry) or a permanently deadlocked service (don't retry). An HTTP 500 might be a momentary glitch or a fundamental bug. The art of retry policy design lies in making classification decisions that are correct in the common case while minimizing harm when classification is wrong.
Transient failures are the primary target for retry logic—situations where the same request, made moments later, has a reasonable probability of success. Understanding common transient failure patterns helps you design effective retry policies.
Network-Level Transient Failures
Network infrastructure introduces numerous opportunities for transient failures:
Connection timeouts: The remote server exists and is healthy, but momentary network congestion prevents establishing a connection within the timeout window.
Read/write timeouts: A connection was established but data transfer stalled due to temporary bandwidth constraints or packet loss.
Connection resets: TCP RST packets received due to transient network equipment issues, NAT table expirations, or brief service interruptions.
DNS resolution failures: Temporary DNS server unreachability or propagation delays.
TLS handshake failures: Transient issues during secure connection establishment, often related to clock skew or temporary certificate validation issues.
Identifying Transient Failures Through HTTP Status Codes
HTTP response codes provide valuable signals for retry decisions, though they must be interpreted carefully:
Clearly Retryable (Transient):
- 408 Request Timeout — Server timed out waiting for the request
- 429 Too Many Requests — Rate limiting; often includes a Retry-After header
- 502 Bad Gateway — Upstream server returned an invalid response
- 503 Service Unavailable — Server temporarily unable to handle the request
- 504 Gateway Timeout — Upstream server didn't respond in time

Contextually Retryable:

- 500 Internal Server Error — May be transient (OOM, deadlock) or permanent (bug)
- Connection refused — May be transient (restart) or permanent (misconfiguration)

Generally Not Retryable:

- 400 Bad Request — Request is malformed; retry will produce the same result
- 401 Unauthorized — Authentication failed; retry won't fix credentials
- 403 Forbidden — Authorization denied; retry won't grant permissions
- 404 Not Found — Resource doesn't exist; retry won't create it
- 409 Conflict — State conflict that likely requires resolution
- 422 Unprocessable Entity — Semantic errors in the request payload

When services return 429 or 503 responses, they often include a Retry-After header indicating how long to wait before retrying. Always respect this header when present—it provides server-side intelligence about when retry is appropriate. Ignoring it and retrying immediately often makes the situation worse.
Not all failures deserve retries. Permanent failures—conditions that will persist regardless of time or retry attempts—must be handled differently. Retrying permanent failures is not merely wasteful; it's actively harmful.
Categories of Permanent Failures
Client-Side Errors (4xx responses)
Most 4xx HTTP status codes indicate problems with the request itself. The same request, made identically, will fail identically:
Semantic Errors and Business Logic Failures
Beyond HTTP status codes, many permanent failures are semantic or business-logic level:
The Fail-Fast Imperative
For permanent failures, the correct strategy is fail fast: immediately propagate the error to the caller, allowing them to handle it appropriately. This might mean:

- Returning a clear, actionable error to the end user
- Falling back to a degraded response or cached data
- Escalating to monitoring or alerting so the underlying problem gets fixed
Fail-fast preserves resources, provides quicker feedback, and allows the system to focus on requests that can actually succeed.
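A minimal sketch of the fail-fast guard, assuming a hypothetical `PermanentFailureError` application error type: permanent statuses are propagated immediately rather than entering any retry loop.

```typescript
// Assumed application-level error type for non-retryable failures.
class PermanentFailureError extends Error {
  constructor(message: string, public readonly status: number) {
    super(message);
    this.name = 'PermanentFailureError';
  }
}

const PERMANENT_STATUSES = new Set([400, 401, 403, 404, 409, 422]);

// Throw immediately on permanent failures; the caller decides whether to
// surface the error, degrade gracefully, or escalate. Retrying would only
// waste work and delay feedback.
function failFastIfPermanent(status: number, body: string): void {
  if (PERMANENT_STATUSES.has(status)) {
    throw new PermanentFailureError(
      `Permanent failure (HTTP ${status}): ${body}`, status
    );
  }
}
```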
During a major cloud provider outage, a single misconfigured retry policy multiplied error traffic by 100x. A service returning 503 errors was bombarded with retries from thousands of clients, each configured to retry 10 times with minimal delay. The resulting traffic prevented the service from recovering even after the underlying issue was resolved. The outage extended from minutes to hours solely due to inappropriate retry behavior.
With the understanding of transient vs permanent failures, we can establish a systematic decision framework for retry behavior. This framework should be applied consistently across your distributed system.
Step 1: Classify the Failure
Before deciding to retry, classify the failure using all available information:
```typescript
// Retry decision framework implementation
interface RetryDecision {
  shouldRetry: boolean;
  reason: string;
  suggestedDelayMs?: number;
}

type FailureCategory = 'transient' | 'permanent' | 'unknown';

interface FailureClassification {
  category: FailureCategory;
  confidence: 'high' | 'medium' | 'low';
  evidence: string;
}

function classifyFailure(
  statusCode: number | undefined,
  errorCode: string | undefined,
  errorMessage: string | undefined,
  exception: Error | undefined
): FailureClassification {
  // HTTP status code classification
  if (statusCode !== undefined) {
    // Definitely transient
    if ([408, 429, 502, 503, 504].includes(statusCode)) {
      return {
        category: 'transient',
        confidence: 'high',
        evidence: `HTTP ${statusCode} is explicitly transient`
      };
    }

    // Definitely permanent (client errors)
    if (statusCode >= 400 && statusCode < 500 &&
        statusCode !== 408 && statusCode !== 429) {
      return {
        category: 'permanent',
        confidence: 'high',
        evidence: `HTTP ${statusCode} indicates client error`
      };
    }

    // Ambiguous server error
    if (statusCode === 500) {
      return {
        category: 'unknown',
        confidence: 'low',
        evidence: 'HTTP 500 may be transient or permanent'
      };
    }
  }

  // Exception type classification
  if (exception) {
    const exceptionName = exception.name || exception.constructor.name;

    const transientExceptions = [
      'TimeoutError', 'ConnectionError', 'NetworkError',
      'ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED'
    ];
    if (transientExceptions.some(te => exceptionName.includes(te))) {
      return {
        category: 'transient',
        confidence: 'high',
        evidence: `Exception ${exceptionName} is typically transient`
      };
    }

    const permanentExceptions = [
      'ValidationError', 'AuthenticationError',
      'AuthorizationError', 'NotFoundError'
    ];
    if (permanentExceptions.some(pe => exceptionName.includes(pe))) {
      return {
        category: 'permanent',
        confidence: 'high',
        evidence: `Exception ${exceptionName} is typically permanent`
      };
    }
  }

  // Error code classification (service-specific)
  if (errorCode) {
    const transientCodes = ['RESOURCE_EXHAUSTED', 'UNAVAILABLE', 'DEADLINE_EXCEEDED'];
    const permanentCodes = ['INVALID_ARGUMENT', 'NOT_FOUND', 'PERMISSION_DENIED'];

    if (transientCodes.includes(errorCode)) {
      return { category: 'transient', confidence: 'high', evidence: `Error code ${errorCode}` };
    }
    if (permanentCodes.includes(errorCode)) {
      return { category: 'permanent', confidence: 'high', evidence: `Error code ${errorCode}` };
    }
  }

  // Default to unknown with low confidence
  return {
    category: 'unknown',
    confidence: 'low',
    evidence: 'Unable to classify failure from available signals'
  };
}
```

Step 2: Apply the Retry Decision Matrix
Once classified, apply the decision matrix based on failure category and operation characteristics:
| Failure Category | Operation Type | Decision | Rationale |
|---|---|---|---|
| Transient (high confidence) | Read | Retry | Safe, likely to succeed |
| Transient (high confidence) | Write (idempotent) | Retry | Safe with idempotency key |
| Transient (high confidence) | Write (non-idempotent) | Conditional retry | Only if timeout/connection failure |
| Permanent (high confidence) | Any | Fail fast | Retry will not help |
| Unknown (low confidence) | Read | Limited retry | 1-2 retries to probe |
| Unknown (low confidence) | Write | No retry | Risk of duplication too high |
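The matrix rows above can be expressed directly in code. This is a sketch: the `OperationType` names, the retry counts, and the `connectionLevelFailure` flag are illustrative assumptions, not a prescribed API.

```typescript
type Category = 'transient' | 'permanent' | 'unknown';
type OperationType = 'read' | 'idempotent-write' | 'non-idempotent-write';

interface Decision { maxRetries: number; rationale: string; }

// Sketch of the decision matrix; thresholds are illustrative.
function decideRetry(
  category: Category,
  op: OperationType,
  connectionLevelFailure = false // true if the request provably never reached the server
): Decision {
  if (category === 'permanent') {
    return { maxRetries: 0, rationale: 'Permanent failure: fail fast' };
  }
  if (category === 'transient') {
    if (op === 'non-idempotent-write') {
      // Conditional retry: only when the request never left the client
      return connectionLevelFailure
        ? { maxRetries: 2, rationale: 'Request never sent; safe to retry' }
        : { maxRetries: 0, rationale: 'Outcome unknown; duplication risk' };
    }
    return { maxRetries: 3, rationale: 'Transient failure on safe operation' };
  }
  // Unknown: probe cautiously on reads, fail fast on writes
  return op === 'read'
    ? { maxRetries: 2, rationale: 'Unknown failure: limited probe' }
    : { maxRetries: 0, rationale: 'Unknown failure on write: too risky' };
}
```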
Step 3: Check Retry Preconditions
Even when failure classification suggests retry, additional preconditions must be met:

- The attempt count is still below the policy's maximum
- Enough of the request's deadline budget remains for another attempt
- Any server-provided Retry-After hint has been honored
- For write operations, idempotency is guaranteed
When failure classification is uncertain, adopt a conservative posture. For unknown failures on read operations, allow 1-2 retries with backoff—this provides opportunity to recover from transient issues while limiting damage if permanent. For unknown failures on write operations, fail fast unless idempotency is guaranteed. The cost of duplicate writes typically exceeds the cost of a false negative on retry.
Beyond failure classification, the nature of the operation itself profoundly influences retry safety. The same transient failure may warrant retry for one operation but not another.
Read vs Write Operations
Read operations (GET, HEAD, OPTIONS) are inherently safer to retry because they don't modify state. Even if a read is executed multiple times due to retry, the result is merely redundant work, not data corruption. This is why HTTP considers these methods safe and idempotent.
Write operations (POST, PUT, DELETE, PATCH) carry the risk of duplicate execution. If a request succeeds but the response is lost (timeout, connection reset), the client doesn't know whether to retry. Retrying could create duplicate records, double-charge payments, or send multiple notifications.
The Timeout Dilemma
Timeout failures present the most challenging retry decision because the request's fate is unknown. Consider this scenario:

1. The client sends a create-order request.
2. The server receives the request and successfully creates the order.
3. The response is lost in transit, and the client's request times out.
From the client's perspective, it's impossible to distinguish this case (order created) from one where the server never received the request (order not created). Retrying could create a duplicate order.
Resolution strategies:

- Idempotency keys, so the server can deduplicate a retried request
- Querying the resource's state before retrying (did the order actually get created?)
- Server-side deduplication or asynchronous reconciliation
As a general rule: never implement aggressive retry policies for write operations unless you've also implemented idempotency. The two must be designed together. We'll explore idempotency requirements in depth in a later page of this module.
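One common way to pair the two is an idempotency key generated once per logical operation and reused across every retry attempt, so the server can deduplicate. The sketch below is illustrative: the `Idempotency-Key` header is a widely used convention, not a universal standard, and `buildCreateOrderRequest` is a hypothetical helper.

```typescript
import { randomUUID } from 'crypto';

interface OrderRequest { headers: Record<string, string>; body: string; }

// Hypothetical helper: attaches the caller-supplied idempotency key.
function buildCreateOrderRequest(
  orderPayload: object,
  idempotencyKey: string
): OrderRequest {
  return {
    headers: {
      'Content-Type': 'application/json',
      // Identical on the original attempt and on every retry
      'Idempotency-Key': idempotencyKey,
    },
    body: JSON.stringify(orderPayload),
  };
}

// Usage: the key is fixed BEFORE the first attempt, not generated per attempt.
const key = randomUUID();
const attempt1 = buildCreateOrderRequest({ sku: 'ABC', qty: 1 }, key);
const attempt2 = buildCreateOrderRequest({ sku: 'ABC', qty: 1 }, key); // retry reuses key
```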
Effective retry policies aren't one-size-fits-all. Different operations, services, and contexts warrant different retry behaviors. This section explores how to design context-aware policies.
Service-Level Differentiation
Not all dependencies are equally important or equally likely to experience transient failures:
```typescript
// Context-aware retry policy configuration
interface RetryPolicyConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
  retryableExceptions: string[];
  respectRetryAfter: boolean;
  nonRetryableStatuses: number[];
}

// Different policies for different contexts
const retryPolicies: Record<string, RetryPolicyConfig> = {
  // Critical payment service: more retries, careful classification
  paymentService: {
    maxAttempts: 5,
    baseDelayMs: 100,
    maxDelayMs: 5000,
    retryableStatuses: [408, 429, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError'],
    respectRetryAfter: true,
    nonRetryableStatuses: [400, 401, 402, 403, 404, 409, 422],
  },

  // Best-effort analytics: fast fail, minimal retry
  analyticsService: {
    maxAttempts: 2,
    baseDelayMs: 50,
    maxDelayMs: 500,
    retryableStatuses: [502, 503, 504],
    retryableExceptions: ['TimeoutError'],
    respectRetryAfter: false,
    nonRetryableStatuses: [400, 401, 403, 404, 429], // Don't retry rate limiting
  },

  // External third-party API: respect their limits
  externalApi: {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    retryableStatuses: [429, 500, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError', 'NetworkError'],
    respectRetryAfter: true, // Critical for external APIs
    nonRetryableStatuses: [400, 401, 403, 404, 405],
  },

  // Internal microservice: more tolerant
  internalService: {
    maxAttempts: 4,
    baseDelayMs: 100,
    maxDelayMs: 2000,
    retryableStatuses: [408, 429, 500, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError', 'ECONNRESET'],
    respectRetryAfter: true,
    nonRetryableStatuses: [400, 401, 403, 404],
  },
};

// Policy selection based on context
function selectRetryPolicy(
  serviceName: string,
  operationType: 'read' | 'write',
  isCriticalPath: boolean
): RetryPolicyConfig {
  const basePolicy = retryPolicies[serviceName] || retryPolicies.internalService;

  // Reduce retry attempts for writes unless idempotent
  if (operationType === 'write') {
    return {
      ...basePolicy,
      maxAttempts: Math.min(basePolicy.maxAttempts, 3),
    };
  }

  // Increase attempts for critical path operations
  if (isCriticalPath) {
    return {
      ...basePolicy,
      maxAttempts: Math.min(basePolicy.maxAttempts + 2, 7),
    };
  }

  return basePolicy;
}
```

Request-Level Context
Beyond service-level differentiation, individual requests may warrant different treatment:

- Interactive, user-facing requests tolerate less added latency than background jobs
- Critical-path operations may justify more attempts than best-effort ones
- The request's remaining deadline budget constrains how many retries are feasible
Always consider the remaining time budget before initiating a retry. If a request has a 5-second deadline and 4.5 seconds have elapsed, starting a retry (with its own potential timeout) is wasteful. The request should fail fast and let the caller decide whether to abandon or start fresh. We'll explore deadline propagation patterns in the Timeout and Deadline module.
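The budget check described above can be sketched as a simple guard evaluated before each retry. All names here are illustrative; `expectedAttemptMs` stands in for whatever latency estimate your system maintains.

```typescript
// Skip the retry when the remaining deadline budget cannot cover the
// planned backoff wait plus a typical attempt duration.
function shouldAttemptRetry(
  deadlineMs: number,        // absolute deadline (epoch ms)
  nowMs: number,             // current time (epoch ms)
  backoffDelayMs: number,    // planned wait before the next attempt
  expectedAttemptMs: number  // typical duration of one attempt
): boolean {
  const remaining = deadlineMs - nowMs;
  return remaining > backoffDelayMs + expectedAttemptMs;
}

// With 500 ms left, a 200 ms backoff plus a 400 ms attempt won't fit:
shouldAttemptRetry(10_500, 10_000, 200, 400); // false — fail fast instead
```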
Accurate failure classification requires proper infrastructure. Without consistent error handling and propagation, retry policies can't make informed decisions.
Standardized Error Responses
Services should return structured error responses that enable accurate classification:
```json
{
  "error": {
    "code": "RESOURCE_EXHAUSTED",
    "message": "Rate limit exceeded for API calls",
    "details": {
      "retryable": true,
      "retryAfterSeconds": 30,
      "quotaLimit": 1000,
      "quotaRemaining": 0,
      "quotaResetsAt": "2024-01-15T10:00:00Z"
    }
  }
}
```
Key elements for retry classification:
- A machine-readable error code (e.g., RESOURCE_EXHAUSTED, INVALID_ARGUMENT, DEADLINE_EXCEEDED) rather than relying on message strings
- An explicit retryable flag where the service can state retryability directly
- A retryAfterSeconds hint for transient failures such as rate limiting

Exception Mapping and Wrapping
In code, different infrastructure layers throw different exception types. Establishing a mapping to retryability domains simplifies policy implementation:
```typescript
// Exception classification system
enum ErrorCategory {
  TRANSIENT = 'transient',
  PERMANENT = 'permanent',
  UNKNOWN = 'unknown',
}

interface ClassifiedError extends Error {
  category: ErrorCategory;
  originalError?: Error;
  httpStatus?: number;
  errorCode?: string;
  retryAfterMs?: number;
}

// Map infrastructure exceptions to categories
function classifyException(error: Error): ClassifiedError {
  const classified: ClassifiedError = Object.assign(
    new Error(error.message),
    {
      name: error.name,
      stack: error.stack,
      category: ErrorCategory.UNKNOWN,
      originalError: error,
    }
  );

  // Node.js system errors
  const errorCode = (error as NodeJS.ErrnoException).code;
  if (errorCode === 'ECONNRESET' || errorCode === 'ETIMEDOUT' ||
      errorCode === 'ECONNREFUSED' || errorCode === 'EPIPE' ||
      errorCode === 'EHOSTUNREACH') {
    classified.category = ErrorCategory.TRANSIENT;
    return classified;
  }

  // HTTP client errors (from fetch, axios, etc.)
  const httpStatus = (error as any).response?.status;
  if (httpStatus) {
    classified.httpStatus = httpStatus;

    if ([408, 429, 502, 503, 504].includes(httpStatus)) {
      classified.category = ErrorCategory.TRANSIENT;

      // Extract Retry-After if present
      const retryAfter = (error as any).response?.headers?.['retry-after'];
      if (retryAfter) {
        classified.retryAfterMs = parseRetryAfter(retryAfter);
      }
    } else if (httpStatus >= 400 && httpStatus < 500) {
      classified.category = ErrorCategory.PERMANENT;
    } else if (httpStatus === 500) {
      classified.category = ErrorCategory.UNKNOWN; // 500 is ambiguous
    }

    return classified;
  }

  // gRPC status codes
  const grpcCode = (error as any).code;
  if (grpcCode !== undefined) {
    const transientGrpcCodes = [
      'DEADLINE_EXCEEDED', 'RESOURCE_EXHAUSTED', 'UNAVAILABLE', 'ABORTED'
    ];
    const permanentGrpcCodes = [
      'INVALID_ARGUMENT', 'NOT_FOUND', 'ALREADY_EXISTS',
      'PERMISSION_DENIED', 'UNAUTHENTICATED'
    ];

    if (transientGrpcCodes.includes(grpcCode)) {
      classified.category = ErrorCategory.TRANSIENT;
    } else if (permanentGrpcCodes.includes(grpcCode)) {
      classified.category = ErrorCategory.PERMANENT;
    }
  }

  return classified;
}

function parseRetryAfter(value: string): number {
  // Retry-After can be seconds or an HTTP date
  const seconds = parseInt(value, 10);
  if (!isNaN(seconds)) {
    return seconds * 1000;
  }

  const date = new Date(value);
  if (!isNaN(date.getTime())) {
    return Math.max(0, date.getTime() - Date.now());
  }

  return 0;
}
```

Retry logic is a powerful tool for building resilient distributed systems, but it must be wielded with precision. The decision of when to retry is as important as how to retry.
What's Next:
Now that we understand when to retry, the next page explores how to retry effectively. Exponential backoff is the foundational technique for spacing retries to avoid overwhelming recovering services while still providing timely recovery from transient failures.
You now understand the critical distinction between transient and permanent failures, the systematic framework for retry decisions, and the importance of operation characteristics in retry safety. This foundation prepares you to implement exponential backoff and other advanced retry strategies covered in subsequent pages.