In a perfect world, every remote call would succeed on the first attempt. Reality, however, is far more complex. Networks experience transient congestion, services momentarily overload, and infrastructure components undergo brief interruptions. The question facing every distributed systems engineer isn't whether failures will occur, but how to respond when they do.
Retry logic is one of the most powerful—and most dangerous—tools in your resilience toolkit. Implemented correctly, retries allow systems to automatically recover from transient failures without human intervention, providing seamless resilience that users never notice. Implemented incorrectly, retries can transform minor hiccups into cascading catastrophes, amplifying load during outages and turning recoverable situations into system-wide meltdowns.
This page establishes the fundamental decision framework for retry logic: understanding when to retry, why that distinction matters, and how to classify failures accurately. Before you ever consider backoff algorithms or jitter strategies, you must first master this critical question: Should you retry at all?
By the end of this page, you will understand the distinction between transient and permanent failures, master a systematic framework for retry decisions, recognize the dangers of inappropriate retries, and know how to classify different failure types for optimal retry strategies. This knowledge forms the foundation for all subsequent retry pattern discussions.
Before discussing retry strategies, we must understand the taxonomy of failures in distributed systems. Not all failures are created equal, and treating them identically leads to either missed recovery opportunities or catastrophic retry storms.
The Fundamental Distinction: Transient vs Permanent
Every failure in a distributed system falls into one of two broad categories:
Transient failures are temporary conditions that resolve spontaneously over time. The underlying cause—network congestion, temporary resource exhaustion, brief service restart—is self-correcting. If you simply wait and try again, the request will likely succeed. These are the ideal candidates for retry.
Permanent failures represent conditions that will not change regardless of how long you wait or how many times you retry. A malformed request, an invalid authentication token, a deleted resource—these errors indicate fundamental problems that require different handling: code changes, user intervention, or graceful degradation.
| Characteristic | Transient Failures | Permanent Failures |
|---|---|---|
| Self-correcting | Yes - resolve without intervention | No - require external action |
| Time-dependent | Likely to succeed if retried later | Will fail indefinitely |
| Root cause | Temporary system conditions | Fundamental request/state problems |
| Retry value | High - automatic recovery | None - wastes resources |
| Correct response | Retry with backoff | Fail fast, escalate, or degrade |
| Examples | Network timeout, 503 Service Unavailable | 400 Bad Request, 404 Not Found |
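As a first approximation, the table above can be collapsed into a small predicate over HTTP status codes. This is a deliberately simplified sketch (status codes only, with illustrative names); a fuller classification framework appears later in this page.

```typescript
// Simplified sketch: status-code-only retryability check.
// Real policies should also consider exception types and error codes.
const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504]);

function isLikelyTransient(status: number): boolean {
  if (TRANSIENT_STATUSES.has(status)) return true;  // explicitly temporary conditions
  if (status >= 400 && status < 500) return false;  // remaining client errors are permanent
  // 500 and other server errors are ambiguous; a cautious default is no retry
  return false;
}
```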
Why This Distinction Matters Operationally
Misclassifying failures has severe consequences in both directions:
Treating permanent failures as transient (retry what shouldn't be retried):

- Wastes resources and adds latency retrying requests that can never succeed
- Amplifies load on a dependency that is already struggling
- Delays the error feedback the caller needs to take corrective action
Treating transient failures as permanent (fail fast when you should retry):

- Surfaces user-visible errors for faults that would have resolved in moments
- Reduces availability, since recoverable requests are abandoned
- Forces manual intervention for conditions that would have self-healed
In practice, classifying failures isn't always straightforward. A timeout could indicate transient network congestion (retry) or a permanently deadlocked service (don't retry). An HTTP 500 might be a momentary glitch or a fundamental bug. The art of retry policy design lies in making classification decisions that are correct in the common case while minimizing harm when classification is wrong.
Transient failures are the primary target for retry logic—situations where the same request, made moments later, has a reasonable probability of success. Understanding common transient failure patterns helps you design effective retry policies.
Network-Level Transient Failures
Network infrastructure introduces numerous opportunities for transient failures:
Connection timeouts: The remote server exists and is healthy, but momentary network congestion prevents establishing a connection within the timeout window.
Read/write timeouts: A connection was established but data transfer stalled due to temporary bandwidth constraints or packet loss.
Connection resets: TCP RST packets received due to transient network equipment issues, NAT table expirations, or brief service interruptions.
DNS resolution failures: Temporary DNS server unreachability or propagation delays.
TLS handshake failures: Transient issues during secure connection establishment, often related to clock skew or temporary certificate validation issues.
Identifying Transient Failures Through HTTP Status Codes
HTTP response codes provide valuable signals for retry decisions, though they must be interpreted carefully:
Clearly Retryable (Transient):
- 408 Request Timeout — Server timed out waiting for the request
- 429 Too Many Requests — Rate limiting; often includes a Retry-After header
- 502 Bad Gateway — Upstream server returned an invalid response
- 503 Service Unavailable — Server temporarily unable to handle the request
- 504 Gateway Timeout — Upstream server didn't respond in time

Contextually Retryable:

- 500 Internal Server Error — May be transient (OOM, deadlock) or permanent (bug)
- Connection refused — May be transient (restart) or permanent (misconfiguration)

Generally Not Retryable:

- 400 Bad Request — Request is malformed; retry will produce the same result
- 401 Unauthorized — Authentication failed; retry won't fix credentials
- 403 Forbidden — Authorization denied; retry won't grant permissions
- 404 Not Found — Resource doesn't exist; retry won't create it
- 409 Conflict — State conflict that likely requires resolution
- 422 Unprocessable Entity — Semantic errors in the request payload

When services return 429 or 503 responses, they often include a Retry-After header indicating how long to wait before retrying. Always respect this header when present—it provides server-side intelligence about when retry is appropriate. Ignoring it and retrying immediately often makes the situation worse.
Not all failures deserve retries. Permanent failures—conditions that will persist regardless of time or retry attempts—must be handled differently. Retrying permanent failures is not merely wasteful; it's actively harmful.
Categories of Permanent Failures
Client-Side Errors (4xx responses)
Most 4xx HTTP status codes indicate problems with the request itself. The same request, made identically, will fail identically:
Semantic Errors and Business Logic Failures
Beyond HTTP status codes, many permanent failures are semantic or business-logic level:
The Fail-Fast Imperative
For permanent failures, the correct strategy is fail fast: immediately propagate the error to the caller, allowing them to handle it appropriately. This might mean:

- Returning a clear, actionable error to the end user
- Falling back to a degraded response or cached data
- Escalating to monitoring or alerting so the underlying problem gets fixed
Fail-fast preserves resources, provides quicker feedback, and allows the system to focus on requests that can actually succeed.
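A minimal sketch of the fail-fast guard, assuming a hypothetical `PermanentFailureError` application error type: permanent statuses are propagated immediately rather than entering any retry loop.

```typescript
// Assumed application-level error type for non-retryable failures.
class PermanentFailureError extends Error {
  constructor(message: string, public readonly status: number) {
    super(message);
    this.name = 'PermanentFailureError';
  }
}

const PERMANENT_STATUSES = new Set([400, 401, 403, 404, 409, 422]);

// Throw immediately on permanent failures; the caller decides whether to
// surface the error, degrade gracefully, or escalate. Retrying would only
// waste work and delay feedback.
function failFastIfPermanent(status: number, body: string): void {
  if (PERMANENT_STATUSES.has(status)) {
    throw new PermanentFailureError(
      `Permanent failure (HTTP ${status}): ${body}`, status
    );
  }
}
```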
During a major cloud provider outage, a single misconfigured retry policy multiplied error traffic by 100x. A service returning 503 errors was bombarded with retries from thousands of clients, each configured to retry 10 times with minimal delay. The resulting traffic prevented the service from recovering even after the underlying issue was resolved. The outage extended from minutes to hours solely due to inappropriate retry behavior.
With the understanding of transient vs permanent failures, we can establish a systematic decision framework for retry behavior. This framework should be applied consistently across your distributed system.
Step 1: Classify the Failure
Before deciding to retry, classify the failure using all available information:
```typescript
// Retry decision framework implementation
interface RetryDecision {
  shouldRetry: boolean;
  reason: string;
  suggestedDelayMs?: number;
}

type FailureCategory = 'transient' | 'permanent' | 'unknown';

interface FailureClassification {
  category: FailureCategory;
  confidence: 'high' | 'medium' | 'low';
  evidence: string;
}

function classifyFailure(
  statusCode: number | undefined,
  errorCode: string | undefined,
  errorMessage: string | undefined,
  exception: Error | undefined
): FailureClassification {
  // HTTP status code classification
  if (statusCode !== undefined) {
    // Definitely transient
    if ([408, 429, 502, 503, 504].includes(statusCode)) {
      return {
        category: 'transient',
        confidence: 'high',
        evidence: `HTTP ${statusCode} is explicitly transient`
      };
    }

    // Definitely permanent (client errors)
    if (statusCode >= 400 && statusCode < 500 &&
        statusCode !== 408 && statusCode !== 429) {
      return {
        category: 'permanent',
        confidence: 'high',
        evidence: `HTTP ${statusCode} indicates client error`
      };
    }

    // Ambiguous server error
    if (statusCode === 500) {
      return {
        category: 'unknown',
        confidence: 'low',
        evidence: 'HTTP 500 may be transient or permanent'
      };
    }
  }

  // Exception type classification
  if (exception) {
    const exceptionName = exception.name || exception.constructor.name;

    const transientExceptions = [
      'TimeoutError', 'ConnectionError', 'NetworkError',
      'ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED'
    ];
    if (transientExceptions.some(te => exceptionName.includes(te))) {
      return {
        category: 'transient',
        confidence: 'high',
        evidence: `Exception ${exceptionName} is typically transient`
      };
    }

    const permanentExceptions = [
      'ValidationError', 'AuthenticationError',
      'AuthorizationError', 'NotFoundError'
    ];
    if (permanentExceptions.some(pe => exceptionName.includes(pe))) {
      return {
        category: 'permanent',
        confidence: 'high',
        evidence: `Exception ${exceptionName} is typically permanent`
      };
    }
  }

  // Error code classification (service-specific)
  if (errorCode) {
    const transientCodes = ['RESOURCE_EXHAUSTED', 'UNAVAILABLE', 'DEADLINE_EXCEEDED'];
    const permanentCodes = ['INVALID_ARGUMENT', 'NOT_FOUND', 'PERMISSION_DENIED'];

    if (transientCodes.includes(errorCode)) {
      return { category: 'transient', confidence: 'high', evidence: `Error code ${errorCode}` };
    }
    if (permanentCodes.includes(errorCode)) {
      return { category: 'permanent', confidence: 'high', evidence: `Error code ${errorCode}` };
    }
  }

  // Default to unknown with low confidence
  return {
    category: 'unknown',
    confidence: 'low',
    evidence: 'Unable to classify failure from available signals'
  };
}
```

Step 2: Apply the Retry Decision Matrix
Once classified, apply the decision matrix based on failure category and operation characteristics:
| Failure Category | Operation Type | Decision | Rationale |
|---|---|---|---|
| Transient (high confidence) | Read | Retry | Safe, likely to succeed |
| Transient (high confidence) | Write (idempotent) | Retry | Safe with idempotency key |
| Transient (high confidence) | Write (non-idempotent) | Conditional retry | Only if timeout/connection failure |
| Permanent (high confidence) | Any | Fail fast | Retry will not help |
| Unknown (low confidence) | Read | Limited retry | 1-2 retries to probe |
| Unknown (low confidence) | Write | No retry | Risk of duplication too high |
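The matrix rows above can be expressed directly in code. This is a sketch: the `OperationType` names, the retry counts, and the `connectionLevelFailure` flag are illustrative assumptions, not a prescribed API.

```typescript
type Category = 'transient' | 'permanent' | 'unknown';
type OperationType = 'read' | 'idempotent-write' | 'non-idempotent-write';

interface Decision { maxRetries: number; rationale: string; }

// Sketch of the decision matrix; thresholds are illustrative.
function decideRetry(
  category: Category,
  op: OperationType,
  connectionLevelFailure = false // true if the request provably never reached the server
): Decision {
  if (category === 'permanent') {
    return { maxRetries: 0, rationale: 'Permanent failure: fail fast' };
  }
  if (category === 'transient') {
    if (op === 'non-idempotent-write') {
      // Conditional retry: only when the request never left the client
      return connectionLevelFailure
        ? { maxRetries: 2, rationale: 'Request never sent; safe to retry' }
        : { maxRetries: 0, rationale: 'Outcome unknown; duplication risk' };
    }
    return { maxRetries: 3, rationale: 'Transient failure on safe operation' };
  }
  // Unknown: probe cautiously on reads, fail fast on writes
  return op === 'read'
    ? { maxRetries: 2, rationale: 'Unknown failure: limited probe' }
    : { maxRetries: 0, rationale: 'Unknown failure on write: too risky' };
}
```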
Step 3: Check Retry Preconditions
Even when failure classification suggests retry, additional preconditions must be met:

- The attempt count is still below the policy's maximum
- Enough of the request's deadline budget remains for another attempt
- Any server-provided Retry-After hint has been honored
- For write operations, idempotency is guaranteed
When failure classification is uncertain, adopt a conservative posture. For unknown failures on read operations, allow 1-2 retries with backoff—this provides opportunity to recover from transient issues while limiting damage if permanent. For unknown failures on write operations, fail fast unless idempotency is guaranteed. The cost of duplicate writes typically exceeds the cost of a false negative on retry.
Beyond failure classification, the nature of the operation itself profoundly influences retry safety. The same transient failure may warrant retry for one operation but not another.
Read vs Write Operations
Read operations (GET, HEAD, OPTIONS) are inherently safer to retry because they don't modify state. Even if a read is executed multiple times due to retry, the result is merely redundant work, not data corruption. This is why HTTP considers these methods safe and idempotent.
Write operations (POST, PUT, DELETE, PATCH) carry the risk of duplicate execution. If a request succeeds but the response is lost (timeout, connection reset), the client doesn't know whether to retry. Retrying could create duplicate records, double-charge payments, or send multiple notifications.
The Timeout Dilemma
Timeout failures present the most challenging retry decision because the request's fate is unknown. Consider this scenario:

1. The client sends a create-order request.
2. The server receives the request and successfully creates the order.
3. The response is lost in transit, and the client's request times out.
From the client's perspective, it's impossible to distinguish this case (order created) from one where the server never received the request (order not created). Retrying could create a duplicate order.
Resolution strategies:

- Idempotency keys, so the server can deduplicate a retried request
- Querying the resource's state before retrying (did the order actually get created?)
- Server-side deduplication or asynchronous reconciliation
As a general rule: never implement aggressive retry policies for write operations unless you've also implemented idempotency. The two must be designed together. We'll explore idempotency requirements in depth in a later page of this module.
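One common way to pair the two is an idempotency key generated once per logical operation and reused across every retry attempt, so the server can deduplicate. The sketch below is illustrative: the `Idempotency-Key` header is a widely used convention, not a universal standard, and `buildCreateOrderRequest` is a hypothetical helper.

```typescript
import { randomUUID } from 'crypto';

interface OrderRequest { headers: Record<string, string>; body: string; }

// Hypothetical helper: attaches the caller-supplied idempotency key.
function buildCreateOrderRequest(
  orderPayload: object,
  idempotencyKey: string
): OrderRequest {
  return {
    headers: {
      'Content-Type': 'application/json',
      // Identical on the original attempt and on every retry
      'Idempotency-Key': idempotencyKey,
    },
    body: JSON.stringify(orderPayload),
  };
}

// Usage: the key is fixed BEFORE the first attempt, not generated per attempt.
const key = randomUUID();
const attempt1 = buildCreateOrderRequest({ sku: 'ABC', qty: 1 }, key);
const attempt2 = buildCreateOrderRequest({ sku: 'ABC', qty: 1 }, key); // retry reuses key
```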
Effective retry policies aren't one-size-fits-all. Different operations, services, and contexts warrant different retry behaviors. This section explores how to design context-aware policies.
Service-Level Differentiation
Not all dependencies are equally important or equally likely to experience transient failures:
```typescript
// Context-aware retry policy configuration
interface RetryPolicyConfig {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatuses: number[];
  retryableExceptions: string[];
  respectRetryAfter: boolean;
  nonRetryableStatuses: number[];
}

// Different policies for different contexts
const retryPolicies: Record<string, RetryPolicyConfig> = {
  // Critical payment service: more retries, careful classification
  paymentService: {
    maxAttempts: 5,
    baseDelayMs: 100,
    maxDelayMs: 5000,
    retryableStatuses: [408, 429, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError'],
    respectRetryAfter: true,
    nonRetryableStatuses: [400, 401, 402, 403, 404, 409, 422],
  },

  // Best-effort analytics: fast fail, minimal retry
  analyticsService: {
    maxAttempts: 2,
    baseDelayMs: 50,
    maxDelayMs: 500,
    retryableStatuses: [502, 503, 504],
    retryableExceptions: ['TimeoutError'],
    respectRetryAfter: false,
    nonRetryableStatuses: [400, 401, 403, 404, 429], // Don't retry rate limiting
  },

  // External third-party API: respect their limits
  externalApi: {
    maxAttempts: 3,
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    retryableStatuses: [429, 500, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError', 'NetworkError'],
    respectRetryAfter: true, // Critical for external APIs
    nonRetryableStatuses: [400, 401, 403, 404, 405],
  },

  // Internal microservice: more tolerant
  internalService: {
    maxAttempts: 4,
    baseDelayMs: 100,
    maxDelayMs: 2000,
    retryableStatuses: [408, 429, 500, 502, 503, 504],
    retryableExceptions: ['TimeoutError', 'ConnectionError', 'ECONNRESET'],
    respectRetryAfter: true,
    nonRetryableStatuses: [400, 401, 403, 404],
  },
};

// Policy selection based on context
function selectRetryPolicy(
  serviceName: string,
  operationType: 'read' | 'write',
  isCriticalPath: boolean
): RetryPolicyConfig {
  const basePolicy = retryPolicies[serviceName] || retryPolicies.internalService;

  // Reduce retry attempts for writes unless idempotent
  if (operationType === 'write') {
    return {
      ...basePolicy,
      maxAttempts: Math.min(basePolicy.maxAttempts, 3),
    };
  }

  // Increase attempts for critical path operations
  if (isCriticalPath) {
    return {
      ...basePolicy,
      maxAttempts: Math.min(basePolicy.maxAttempts + 2, 7),
    };
  }

  return basePolicy;
}
```

Request-Level Context
Beyond service-level differentiation, individual requests may warrant different treatment:

- Interactive, user-facing requests tolerate less added latency than background jobs
- Critical-path operations may justify more attempts than best-effort ones
- The request's remaining deadline budget constrains how many retries are feasible
Always consider the remaining time budget before initiating a retry. If a request has a 5-second deadline and 4.5 seconds have elapsed, starting a retry (with its own potential timeout) is wasteful. The request should fail fast and let the caller decide whether to abandon or start fresh. We'll explore deadline propagation patterns in the Timeout and Deadline module.
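The budget check described above can be sketched as a simple guard evaluated before each retry. All names here are illustrative; `expectedAttemptMs` stands in for whatever latency estimate your system maintains.

```typescript
// Skip the retry when the remaining deadline budget cannot cover the
// planned backoff wait plus a typical attempt duration.
function shouldAttemptRetry(
  deadlineMs: number,        // absolute deadline (epoch ms)
  nowMs: number,             // current time (epoch ms)
  backoffDelayMs: number,    // planned wait before the next attempt
  expectedAttemptMs: number  // typical duration of one attempt
): boolean {
  const remaining = deadlineMs - nowMs;
  return remaining > backoffDelayMs + expectedAttemptMs;
}

// With 500 ms left, a 200 ms backoff plus a 400 ms attempt won't fit:
shouldAttemptRetry(10_500, 10_000, 200, 400); // false — fail fast instead
```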
Accurate failure classification requires proper infrastructure. Without consistent error handling and propagation, retry policies can't make informed decisions.
Standardized Error Responses
Services should return structured error responses that enable accurate classification:
```json
{
  "error": {
    "code": "RESOURCE_EXHAUSTED",
    "message": "Rate limit exceeded for API calls",
    "details": {
      "retryable": true,
      "retryAfterSeconds": 30,
      "quotaLimit": 1000,
      "quotaRemaining": 0,
      "quotaResetsAt": "2024-01-15T10:00:00Z"
    }
  }
}
```
Key elements for retry classification:
- A machine-readable error code (e.g., RESOURCE_EXHAUSTED, INVALID_ARGUMENT, DEADLINE_EXCEEDED) rather than relying on message strings
- An explicit retryable flag where the service can state retryability directly
- A retryAfterSeconds hint for transient failures such as rate limiting

Exception Mapping and Wrapping
In code, different infrastructure layers throw different exception types. Establishing a mapping to retryability domains simplifies policy implementation:
```typescript
// Exception classification system
enum ErrorCategory {
  TRANSIENT = 'transient',
  PERMANENT = 'permanent',
  UNKNOWN = 'unknown',
}

interface ClassifiedError extends Error {
  category: ErrorCategory;
  originalError?: Error;
  httpStatus?: number;
  errorCode?: string;
  retryAfterMs?: number;
}

// Map infrastructure exceptions to categories
function classifyException(error: Error): ClassifiedError {
  const classified: ClassifiedError = Object.assign(
    new Error(error.message),
    {
      name: error.name,
      stack: error.stack,
      category: ErrorCategory.UNKNOWN,
      originalError: error,
    }
  );

  // Node.js system errors
  const errorCode = (error as NodeJS.ErrnoException).code;
  if (errorCode === 'ECONNRESET' || errorCode === 'ETIMEDOUT' ||
      errorCode === 'ECONNREFUSED' || errorCode === 'EPIPE' ||
      errorCode === 'EHOSTUNREACH') {
    classified.category = ErrorCategory.TRANSIENT;
    return classified;
  }

  // HTTP client errors (from fetch, axios, etc.)
  const httpStatus = (error as any).response?.status;
  if (httpStatus) {
    classified.httpStatus = httpStatus;

    if ([408, 429, 502, 503, 504].includes(httpStatus)) {
      classified.category = ErrorCategory.TRANSIENT;

      // Extract Retry-After if present
      const retryAfter = (error as any).response?.headers?.['retry-after'];
      if (retryAfter) {
        classified.retryAfterMs = parseRetryAfter(retryAfter);
      }
    } else if (httpStatus >= 400 && httpStatus < 500) {
      classified.category = ErrorCategory.PERMANENT;
    } else if (httpStatus === 500) {
      classified.category = ErrorCategory.UNKNOWN; // 500 is ambiguous
    }

    return classified;
  }

  // gRPC status codes
  const grpcCode = (error as any).code;
  if (grpcCode !== undefined) {
    const transientGrpcCodes = [
      'DEADLINE_EXCEEDED', 'RESOURCE_EXHAUSTED', 'UNAVAILABLE', 'ABORTED'
    ];
    const permanentGrpcCodes = [
      'INVALID_ARGUMENT', 'NOT_FOUND', 'ALREADY_EXISTS',
      'PERMISSION_DENIED', 'UNAUTHENTICATED'
    ];

    if (transientGrpcCodes.includes(grpcCode)) {
      classified.category = ErrorCategory.TRANSIENT;
    } else if (permanentGrpcCodes.includes(grpcCode)) {
      classified.category = ErrorCategory.PERMANENT;
    }
  }

  return classified;
}

function parseRetryAfter(value: string): number {
  // Retry-After can be seconds or an HTTP date
  const seconds = parseInt(value, 10);
  if (!isNaN(seconds)) {
    return seconds * 1000;
  }

  const date = new Date(value);
  if (!isNaN(date.getTime())) {
    return Math.max(0, date.getTime() - Date.now());
  }

  return 0;
}
```

Retry logic is a powerful tool for building resilient distributed systems, but it must be wielded with precision. The decision of when to retry is as important as how to retry.
What's Next:
Now that we understand when to retry, the next page explores how to retry effectively. Exponential backoff is the foundational technique for spacing retries to avoid overwhelming recovering services while still providing timely recovery from transient failures.
You now understand the critical distinction between transient and permanent failures, the systematic framework for retry decisions, and the importance of operation characteristics in retry safety. This foundation prepares you to implement exponential backoff and other advanced retry strategies covered in subsequent pages.