Retry Strategies in Distributed Systems

Level: Intermediate

Duration: 75 mins

Topic: Retry Strategies

When to Retry

The Retry Paradox

In distributed systems, failure is not an exception; it's the default state. Networks partition, services crash, databases time out, and cloud resources become temporarily unavailable. The natural response is to retry failed operations: if at first you don't succeed, try again.

But here's the paradox: retries can heal systems, and retries can destroy them.

A well-designed retry strategy transforms transient failures into invisible hiccups, maintaining the illusion of reliability for end users. A poorly designed retry strategy amplifies failures exponentially, turning a momentary glitch into a cascading outage that brings down entire platforms.

Understanding when to retry—and equally important, when not to retry—is one of the most critical skills in distributed systems engineering.

What You Will Master

By the end of this page, you will understand the fundamental principles governing retry decisions: classifying failures as transient or permanent, evaluating operation safety for retries, recognizing the dangers of naive retry implementations, and building the mental framework that underpins all sophisticated retry strategies.

The Failure Taxonomy: Not All Failures Are Equal

Before deciding whether to retry, you must first understand what failed and why. Distributed systems exhibit a rich taxonomy of failure modes, each with different implications for retry behavior.

The fundamental classification divides failures into two categories, with a third "ambiguous" bucket for failures that cannot be placed in either one up front:

Transient Failures — Temporary conditions that will likely resolve on their own
Permanent Failures — Persistent conditions that require intervention to resolve

Transient vs Permanent Failures
Failure Type	Characteristics	Examples	Retry Appropriate?
Transient	Self-resolving, time-bounded, infrastructure-related	Network timeout, connection reset, 503 Service Unavailable, resource contention	Yes — with proper backoff
Permanent	Persistent until external action, logic/data-related	404 Not Found, 401 Unauthorized, invalid request format, business rule violation	No — will fail indefinitely
Ambiguous	Unknown whether transient or permanent	500 Internal Server Error, connection refused, DNS resolution failure	Maybe — requires context and limits

The critical insight: Retrying permanent failures wastes resources and delays error propagation to users. Retrying transient failures is essential for reliability. The challenge is accurately classifying failures in real-time with incomplete information.

The Ambiguity Problem

Many failures are genuinely ambiguous. A 500 Internal Server Error could indicate a transient server overload (retry!) or a permanent code bug triggered by your specific request (don't retry!). Sophisticated retry strategies must handle this ambiguity gracefully.

HTTP status codes as retry signals:

HTTP provides some guidance through status codes, but interpretation requires nuance:

http-status-retry-classification.ts
// HTTP Status Code Retry Classification Framework
 
interface RetryDecision {
  shouldRetry: boolean;
  reason: string;
  maxRetries?: number;
  useBackoff?: boolean;
}
 
function classifyHttpStatusForRetry(status: number): RetryDecision {
  // ============================================
  // DEFINITE NO-RETRY: Client errors (4xx)
  // ============================================
  
  // 400 Bad Request - Client sent malformed request
  // Retrying will always produce the same result
  if (status === 400) {
    return {
      shouldRetry: false,
      reason: "Malformed request - fix client code",
    };
  }
  
  // 401 Unauthorized - Authentication required/failed
  // Retrying without new credentials is pointless
  if (status === 401) {
    return {
      shouldRetry: false,
      reason: "Authentication required - obtain valid credentials",
    };
  }
  
  // 403 Forbidden - Authenticated but not permitted
  // No amount of retrying grants permission
  if (status === 403) {
    return {
      shouldRetry: false,
      reason: "Permission denied - requires authorization change",
    };
  }
  
  // 404 Not Found - Resource doesn't exist
  // Unless expecting eventual consistency, don't retry
  if (status === 404) {
    return {
      shouldRetry: false,
      reason: "Resource not found - verify resource exists",
    };
  }
  
  // 409 Conflict - Request conflicts with current state
  // Often requires reading current state before retry
  if (status === 409) {
    return {
      shouldRetry: false,  // Usually - but read-modify-write patterns may retry
      reason: "Conflict - resolve state conflict first",
    };
  }
  
  // 422 Unprocessable Entity - Semantic error
  // Request is syntactically valid but semantically wrong
  if (status === 422) {
    return {
      shouldRetry: false,
      reason: "Validation failed - fix request payload",
    };
  }
  
  // ============================================
  // SPECIAL CASE: Rate limiting (429)
  // ============================================
  
  // 429 Too Many Requests - Rate limited
  // SHOULD retry, but with significant delay
  if (status === 429) {
    return {
      shouldRetry: true,
      reason: "Rate limited - respect Retry-After header",
      maxRetries: 3,
      useBackoff: true,  // Or use Retry-After if provided
    };
  }
  
  // ============================================
  // DEFINITE RETRY: Server errors indicating transient issues
  // ============================================
  
  // 502 Bad Gateway - Upstream server error
  // Usually transient - gateway couldn't reach backend
  if (status === 502) {
    return {
      shouldRetry: true,
      reason: "Upstream unavailable - likely transient",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // 503 Service Unavailable - Server temporarily overloaded
  // Explicitly designed for temporary conditions
  if (status === 503) {
    return {
      shouldRetry: true,
      reason: "Service unavailable - temporary condition",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // 504 Gateway Timeout - Upstream timed out
  // Network timing issue - retry with backoff
  if (status === 504) {
    return {
      shouldRetry: true,
      reason: "Gateway timeout - transient network issue",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // ============================================
  // AMBIGUOUS: 500 Internal Server Error
  // ============================================
  
  // 500 is the tricky one - could be anything
  if (status === 500) {
    return {
      shouldRetry: true,  // Default to retry with limits
      reason: "Server error - may be transient or permanent",
      maxRetries: 2,  // Lower limit due to ambiguity
      useBackoff: true,
    };
  }
  
  // Default for unhandled 4xx - don't retry
  if (status >= 400 && status < 500) {
    return {
      shouldRetry: false,
      reason: "Client error - fix request",
    };
  }
  
  // Default for unhandled 5xx - cautious retry
  if (status >= 500) {
    return {
      shouldRetry: true,
      reason: "Server error - cautious retry",
      maxRetries: 2,
      useBackoff: true,
    };
  }
  
  // 2xx and 3xx not really "failures"
  return {
    shouldRetry: false,
    reason: "Not a failure status",
  };
}
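
The 429 branch above says to respect the Retry-After header, but the classifier itself never reads it. As a minimal sketch of that step: HTTP allows the header to carry either a number of seconds or an HTTP-date. The helper names `parseRetryAfterMs` and `chooseDelayMs` are hypothetical and not part of the listing above, and the current time is passed in explicitly so the logic stays easy to test.

```typescript
// Hypothetical helper: converts a Retry-After header value into a delay in
// milliseconds. Returns undefined when the header is missing or unparseable.
function parseRetryAfterMs(
  headerValue: string | undefined,
  nowMs: number
): number | undefined {
  if (headerValue === undefined) return undefined;
  const trimmed = headerValue.trim();

  // Form 1: delay-seconds, a non-negative integer such as "120".
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) return undefined;

  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, dateMs - nowMs);
}

// Hypothetical helper: the server's hint wins whenever it asks for a longer
// wait than the client's own backoff delay.
function chooseDelayMs(
  backoffMs: number,
  retryAfterMs: number | undefined
): number {
  return retryAfterMs === undefined
    ? backoffMs
    : Math.max(backoffMs, retryAfterMs);
}
```

Taking the larger of the two delays is a design choice: the client never retries sooner than the server asked, and it still backs off on its own when the server gives no hint.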

Beyond HTTP Status Codes: Network-Level Failures

HTTP status codes only tell part of the story. Many failures occur before receiving any HTTP response at all. These network-level failures require their own classification and handling strategies.

Common Network-Level Failures

•Connection Timeout — Unable to establish TCP connection within timeout. Usually transient (server overloaded or network congested). Retry: Yes
•Connection Refused — Server actively rejected connection (port closed). May be transient (service restarting) or permanent (wrong port). Retry: Yes, with limits
•Connection Reset — Established connection forcibly closed by peer. Often transient (server crashed mid-request). Retry: Yes
•DNS Resolution Failure — Cannot resolve hostname. Usually transient (DNS server overloaded) but could be permanent (typo in hostname). Retry: Yes, with verification
•Read Timeout — Connection established but response not received in time. Server may still be processing. Retry: Caution required
•SSL/TLS Handshake Failure — Encryption negotiation failed. Usually permanent (certificate issues) but can be transient (clock skew). Retry: Usually no

The Read Timeout Trap

Read timeouts are especially dangerous. The server may have successfully processed your request—you just didn't receive the response. Retrying might execute the operation twice. This is the idempotency problem we'll explore in depth later.

Comprehensive failure classification in practice:

network-failure-classifier.ts
// Network-Level Failure Classification
 
type FailureType = 
  | "transient"      // Definitely retry
  | "permanent"      // Never retry
  | "ambiguous"      // Retry cautiously with limits
  | "dangerous";     // Retry only if idempotent
 
interface FailureClassification {
  type: FailureType;
  retryable: boolean;
  description: string;
  recommendedAction: string;
}
 
function classifyNetworkError(error: Error): FailureClassification {
  const errorCode = (error as NodeJS.ErrnoException).code;
  const errorMessage = error.message.toLowerCase();
  
  // ============================================
  // Connection Establishment Failures
  // ============================================
  
  if (errorCode === "ECONNREFUSED") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "Connection refused - server not accepting connections",
      recommendedAction: "Retry with backoff; may indicate service down",
    };
  }
  
  if (errorCode === "ETIMEDOUT" || errorCode === "ECONNABORTED") {
    return {
      type: "transient",
      retryable: true,
      description: "Connection timeout - network congestion or overload",
      recommendedAction: "Retry with exponential backoff",
    };
  }
  
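  if (errorCode === "ECONNRESET") {
    // Node reports mid-transfer resets ("socket hang up", "read ECONNRESET")
    // with this same code. The request may already have been processed, so
    // check the message here; otherwise this branch would mask those cases.
    if (errorMessage.includes("socket hang up") ||
        errorMessage.includes("read econnreset")) {
      return {
        type: "dangerous",
        retryable: true, // Only if operation is idempotent!
        description: "Connection closed during data transfer",
        recommendedAction: "Retry ONLY if operation is idempotent",
      };
    }
    return {
      type: "transient",
      retryable: true,
      description: "Connection reset - peer unexpectedly closed connection",
      recommendedAction: "Retry immediately or with minimal delay",
    };
  }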
  
  // ============================================
  // DNS Failures
  // ============================================
  
  if (errorCode === "ENOTFOUND" || errorCode === "EAI_AGAIN") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "DNS resolution failed",
      recommendedAction: "Retry with backoff; verify hostname if persists",
    };
  }
  
  // ============================================
  // Read/Write Failures (DANGEROUS)
  // ============================================
  
  if (errorMessage.includes("socket hang up") ||
      errorMessage.includes("read econnreset")) {
    return {
      type: "dangerous",
      retryable: true, // Only if operation is idempotent!
      description: "Connection closed during data transfer",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }
  
  if (errorCode === "EPIPE" || errorCode === "ENOTCONN") {
    return {
      type: "dangerous",
      retryable: true,
      description: "Connection lost while writing",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }
  
  // ============================================
  // SSL/TLS Failures
  // ============================================
  
  if (errorMessage.includes("certificate") ||
      errorMessage.includes("ssl") ||
      errorCode === "CERT_HAS_EXPIRED") {
    return {
      type: "permanent",
      retryable: false,
      description: "SSL/TLS certificate error",
      recommendedAction: "Fix certificate configuration",
    };
  }
  
  if (errorMessage.includes("handshake")) {
    return {
      type: "ambiguous",
      retryable: true, // Sometimes transient (protocol negotiation)
      description: "TLS handshake failed",
      recommendedAction: "Retry once; investigate if persists",
    };
  }
  
  // ============================================
  // Resource Exhaustion
  // ============================================
  
  if (errorCode === "EMFILE" || errorCode === "ENFILE") {
    return {
      type: "transient",
      retryable: true,
      description: "Too many open files - local resource exhaustion",
      recommendedAction: "Retry with delay; may indicate leak",
    };
  }
  
  // ============================================
  // Unknown/Default
  // ============================================
  
  return {
    type: "ambiguous",
    retryable: true,
    description: "Unknown network error",
    recommendedAction: "Retry with limits and monitoring",
  };
}

The Safety Dimension: When Retries Are Dangerous

Even when failures are clearly transient, retrying may not be safe. The safety of a retry depends critically on the nature of the operation being retried.

This leads to one of the most important concepts in distributed systems: idempotency.

Definition: Idempotent Operation

An operation is idempotent if executing it multiple times produces the same result as executing it once. Mathematically: f(f(x)) = f(x). In distributed systems: retrying the same request N times has the same effect as sending it once.
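
To make the f(f(x)) = f(x) definition concrete, here is a small sketch that applies an operation twice and compares the results. The `Account` shape and the functions `setAddress`, `recordLogin`, and `looksIdempotent` are made up for illustration. Checking one sample input is evidence of idempotency, not proof.

```typescript
interface Account {
  address: string;
  loginCount: number;
}

// Idempotent: sets a field to an absolute value.
function setAddress(account: Account, address: string): Account {
  return { ...account, address };
}

// Not idempotent: the result depends on how many times it runs.
function recordLogin(account: Account): Account {
  return { ...account, loginCount: account.loginCount + 1 };
}

// Checks f(f(x)) = f(x) for a single sample input.
function looksIdempotent(
  op: (a: Account) => Account,
  sample: Account
): boolean {
  const once = op(sample);
  const twice = op(once);
  return JSON.stringify(once) === JSON.stringify(twice);
}

const sampleAccount: Account = { address: "1 Old Rd", loginCount: 0 };

// Setting an absolute value: applying it again changes nothing.
const setResult = looksIdempotent(
  (a) => setAddress(a, "123 Main St"),
  sampleAccount
); // true

// Incrementing: every extra application changes the state.
const loginResult = looksIdempotent(recordLogin, sampleAccount); // false
```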

Why idempotency matters for retries:

When a request fails with ambiguous outcome (like a timeout), you don't know if:

(A) The server never received the request, or
(B) The server received and processed it, but the response was lost

If you retry:

If (A) was true: Correct behavior — operation now executes once
If (B) was true: Operation executed twice — potentially disastrous

HTTP Methods and Idempotency
HTTP Method	Idempotent?	Safe to Retry?	Example
GET	✅ Yes	Always safe	Fetch user profile
HEAD	✅ Yes	Always safe	Check if resource exists
OPTIONS	✅ Yes	Always safe	CORS preflight
PUT	✅ Yes (by design)	Safe if properly implemented	Update user address to '123 Main St'
DELETE	✅ Yes (by design)	Safe if properly implemented	Delete user with ID 42
POST	❌ No (typically)	UNSAFE without idempotency key	Create new order, charge payment
PATCH	❌ No (typically)	UNSAFE for incremental changes	Increment counter, append to list

The critical distinction:

GET, HEAD, OPTIONS: Read-only operations. Always safe to retry because they don't modify state.
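PUT, DELETE: Should be idempotent by design, but implementation matters. PUT /users/42 with the same body should always produce the same state. DELETE /users/42 should leave the user absent whether or not it existed beforehand; a repeated call may return 404 instead of a success code, but the server state is the same.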
POST: Generally not idempotent. POST /orders creates a new order each time. POST /payments charges the card each time. Retrying these without safeguards causes duplicate orders and charges.
PATCH: Depends entirely on implementation. PATCH /counter {increment: 1} is non-idempotent. PATCH /user {email: 'new@email.com'} could be idempotent.
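
One common way to make a POST safe to retry is an idempotency key: the client attaches a unique key to the logical operation, and the server replays the stored result when it sees the same key again. The sketch below is a simplified in-memory illustration. `PaymentService`, `charge`, and the key format are hypothetical, and a real service would persist keys durably and handle concurrent requests that carry the same key.

```typescript
interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

// Hypothetical in-memory payment service, for illustration only.
class PaymentService {
  private results = new Map<string, ChargeResult>();
  private nextId = 1;
  public chargesExecuted = 0;

  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    // A key we have seen before: replay the stored response, charge nothing.
    const previous = this.results.get(idempotencyKey);
    if (previous !== undefined) {
      return previous;
    }

    // First time we see this key: perform the side effect exactly once.
    this.chargesExecuted += 1;
    const result: ChargeResult = {
      chargeId: `ch_${this.nextId++}`,
      amountCents,
    };
    this.results.set(idempotencyKey, result);
    return result;
  }
}

const payments = new PaymentService();

// The first attempt succeeds on the server, but suppose the response is lost.
const firstAttempt = payments.charge("order-42-payment", 1999);

// The client retries with the SAME key and gets the original result back.
const retryAttempt = payments.charge("order-42-payment", 1999);
```

After both calls, `payments.chargesExecuted` is 1 and both attempts carry the same `chargeId`, so the retry did not produce a second charge.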

Real-World Horror Story

A major e-commerce platform once charged customers multiple times because their payment service retried timed-out requests without idempotency keys. A single 30-second database slowdown resulted in thousands of duplicate charges, millions in refunds, and lasting reputation damage. This is why understanding retry safety is non-negotiable.

retry-safety-evaluator.ts
// Comprehensive Retry Safety Evaluation
 
type OperationType = 
  | "read"           // Pure read, no side effects
  | "idempotent"     // Safe to retry any number of times
  | "non-idempotent" // Executing twice has different effects
  | "unknown";       // Cannot determine safety
 
interface RetrySafetyResult {
  safeToRetry: boolean;
  reason: string;
  recommendation: string;
  requiresIdempotencyKey: boolean;
}
 
class RetrySafetyEvaluator {
  /**
   * Evaluates whether an operation is safe to retry.
   * This is the CRITICAL decision point before any retry.
   */
  evaluate(
    method: string,
    hasIdempotencyKey: boolean,
    operationType: OperationType,
    errorType: "connection" | "timeout" | "response"
  ): RetrySafetyResult {
    
    // =========================================
    // Case 1: Connection failures (request never sent)
    // =========================================
    if (errorType === "connection") {
      // Server never received request — always safe to retry
      return {
        safeToRetry: true,
        reason: "Connection failed before request transmission",
        recommendation: "Retry with exponential backoff",
        requiresIdempotencyKey: false,
      };
    }
    
    // =========================================
    // Case 2: Response-based failures (got response)
    // =========================================
    if (errorType === "response") {
      // Server responded with error — request was processed
      // Safe to retry if operation is idempotent
      const isMethodIdempotent = ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]
        .includes(method.toUpperCase());
      
      if (isMethodIdempotent || hasIdempotencyKey) {
        return {
          safeToRetry: true,
          reason: "Operation is idempotent or has idempotency key",
          recommendation: "Retry with backoff based on status code",
          requiresIdempotencyKey: false,
        };
      }
      
      return {
        safeToRetry: false,
        reason: "Non-idempotent operation already reached server",
        recommendation: "Do not retry; return error to caller",
        requiresIdempotencyKey: true,
      };
    }
    
    // =========================================
    // Case 3: Timeout failures (UNKNOWN outcome)
    // =========================================
    if (errorType === "timeout") {
      // This is the dangerous case: we don't know if server
      // received and processed the request or not
      
      // Read operations are always safe
      if (operationType === "read") {
        return {
          safeToRetry: true,
          reason: "Read-only operation",
          recommendation: "Retry safely",
          requiresIdempotencyKey: false,
        };
      }
      
      // Idempotent operations are safe
      if (operationType === "idempotent" || hasIdempotencyKey) {
        return {
          safeToRetry: true,
          reason: "Operation is idempotent",
          recommendation: "Retry with backoff",
          requiresIdempotencyKey: false,
        };
      }
      
      // Non-idempotent without key: DANGER
      return {
        safeToRetry: false,
        reason: "Timeout on non-idempotent operation — duplicate risk",
        recommendation: "Log for investigation; alert user of unknown state",
        requiresIdempotencyKey: true,
      };
    }
    
    // Unknown error type — be conservative
    return {
      safeToRetry: false,
      reason: "Cannot determine safety",
      recommendation: "Manual investigation required",
      requiresIdempotencyKey: true,
    };
  }
}
 
// Usage Example
const evaluator = new RetrySafetyEvaluator();
 
// Safe to retry: GET request that timed out
const getRetry = evaluator.evaluate("GET", false, "read", "timeout");
console.log(getRetry);
// { safeToRetry: true, reason: "Read-only operation", ... }
 
// UNSAFE to retry: POST payment that timed out
const paymentRetry = evaluator.evaluate("POST", false, "non-idempotent", "timeout");
console.log(paymentRetry);
// { safeToRetry: false, reason: "Timeout on non-idempotent operation — duplicate risk", ... }
 
// Safe with idempotency key: POST payment with key
const paymentWithKey = evaluator.evaluate("POST", true, "non-idempotent", "timeout");
console.log(paymentWithKey);
// { safeToRetry: true, reason: "Operation is idempotent", ... }

The Retry Decision Framework

Combining failure classification and safety evaluation, we can construct a comprehensive retry decision framework. This framework should be applied consistently across all service-to-service communication.

The three-question framework:

Before Any Retry, Ask:

•Is the failure transient? If no (permanent failure like 404 or 401), don't retry. If ambiguous, proceed cautiously with retry limits.
•Is the operation safe to retry? If no (non-idempotent POST without idempotency key), don't retry after timeout. For connection failures, usually safe.
•Do we have retry budget remaining? If no (already retried too many times or too many concurrent failures), fail fast rather than amplify the problem.
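
The three questions above can be sketched as a single gate that is evaluated in order. This is a deliberately simplified illustration: the `RetryQuestion` fields and the `answerThreeQuestions` name are assumptions made for this example, and the "budget" here is only a per-request attempt limit.

```typescript
type FailureKind = "transient" | "permanent" | "ambiguous";

interface RetryQuestion {
  failureKind: FailureKind;
  // False only when the request provably never left the client
  // (for example, the connection could not be established).
  requestMayHaveReachedServer: boolean;
  // True for idempotent methods or when an idempotency key is attached.
  operationIsIdempotent: boolean;
  attemptsMade: number;
  maxAttempts: number;
}

interface Verdict {
  retry: boolean;
  why: string;
}

function answerThreeQuestions(q: RetryQuestion): Verdict {
  // Question 1: is the failure transient (or at least not clearly permanent)?
  if (q.failureKind === "permanent") {
    return { retry: false, why: "permanent failure" };
  }

  // Question 2: is the operation safe to retry?
  if (q.requestMayHaveReachedServer && !q.operationIsIdempotent) {
    return { retry: false, why: "unsafe: operation may execute twice" };
  }

  // Question 3: is there retry budget left?
  if (q.attemptsMade >= q.maxAttempts) {
    return { retry: false, why: "retry budget exhausted" };
  }

  return {
    retry: true,
    why:
      q.failureKind === "ambiguous"
        ? "ambiguous failure, retry cautiously"
        : "transient failure",
  };
}
```

The order matters: the cheap, definitive checks come first, so a permanent failure is rejected before any budget is consulted.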

Visual decision tree:

retry-decision-tree.txt
                    ┌─────────────────────┐
                    │   REQUEST FAILED   │
                    └─────────┬───────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │  Is failure clearly PERMANENT? │
              │  (404, 401, 400, 422, etc.)   │
              └───────────────┬───────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           │ YES              │ NO / UNCLEAR     │
           ▼                  ▼                  │
   ┌───────────────┐  ┌─────────────────────┐   │
   │ DON'T RETRY   │  │ Is operation SAFE   │   │
   │ Return error  │  │ to retry?           │   │
   └───────────────┘  │ (idempotent/has key)│   │
                      └─────────┬───────────┘   │
                                │               │
              ┌─────────────────┼───────────────┤
              │ YES             │ NO            │
              ▼                 ▼               │
   ┌───────────────────┐  ┌─────────────────┐  │
   │ Was this a        │  │ Was it CONN     │  │
   │ TIMEOUT failure?  │  │ failure (never  │  │
   │                   │  │ sent)?          │  │
   └─────────┬─────────┘  └───────┬─────────┘  │
             │                    │             │
    ┌────────┼────────┐    ┌──────┼──────┐     │
    │ YES    │ NO     │    │ YES  │ NO   │     │
    ▼        ▼        │    ▼      ▼      │     │
┌────────┐ ┌────────┐ │ ┌────────┐ ┌────────┐ │
│ RETRY  │ │ RETRY  │ │ │ RETRY  │ │ DON'T  │ │
│with    │ │with    │ │ │(safe)  │ │RETRY   │ │
│backoff │ │backoff │ │ └────────┘ │(risky) │ │
└────────┘ └────────┘ │            └────────┘ │
                      │                       │
                      ▼                       ▼
              ┌───────────────────────────────┐
              │    BEFORE RETRYING:           │
              │    ✓ Check retry budget       │
              │    ✓ Apply backoff delay      │
              │    ✓ Log retry attempt        │
              │    ✓ Increment retry counter  │
              └───────────────────────────────┘
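
The "check retry budget" step in the tree can be implemented in several ways. One possible sketch, under stated assumptions, caps retries at a fraction of the requests seen in the current window plus a small fixed allowance. `RetryBudget` and its method names are hypothetical, and the caller is assumed to invoke `reset()` on a timer so the budget reflects recent traffic only.

```typescript
// Hypothetical retry budget: retries may not exceed `ratio` of the requests
// recorded in the current window, plus a fixed minimum allowance.
class RetryBudget {
  private requests = 0;
  private retries = 0;
  private ratio: number;
  private minRetries: number;

  constructor(ratio: number, minRetries: number) {
    this.ratio = ratio;
    this.minRetries = minRetries;
  }

  // Call once for every first-attempt request.
  recordRequest(): void {
    this.requests += 1;
  }

  // Returns true and spends one unit of budget if a retry is allowed now.
  tryAcquire(): boolean {
    const allowed = this.minRetries + Math.floor(this.requests * this.ratio);
    if (this.retries >= allowed) return false;
    this.retries += 1;
    return true;
  }

  // Call periodically so old traffic stops counting toward the budget.
  reset(): void {
    this.requests = 0;
    this.retries = 0;
  }
}

// With 20 requests and a 10% ratio plus 1 guaranteed retry, 3 retries fit.
const budget = new RetryBudget(0.1, 1);
for (let i = 0; i < 20; i++) budget.recordRequest();

const grants: boolean[] = [];
for (let i = 0; i < 4; i++) grants.push(budget.tryAcquire());
// grants is [true, true, true, false]
```

Unlike a per-request retry limit, a shared budget like this bounds the total extra load a client adds when many requests fail at once.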

When NOT to Retry: The Anti-Patterns

Understanding when not to retry is as important as knowing when to retry. Inappropriate retries cause more outages than they prevent.

The Deadly Sins of Retry Strategies:

Retry Anti-Patterns

•Immediate Retry Storm — Retrying immediately without delay turns transient overload into sustained overload. One slow database query triggers millions of immediate retries, guaranteeing the database never recovers.
•Infinite Retry Loops — Retrying forever without limits consumes resources indefinitely. A misconfigured service could retry for hours or days, burning compute and preventing other work.
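•Retry Amplification — Service A retries service B, which retries service C. Each layer multiplies the load. Three layers making 3 attempts each (the original call plus 2 retries) produce up to 3 × 3 × 3 = 27x request amplification during failures; with 3 retries each (4 attempts per layer) the worst case is 4 × 4 × 4 = 64x.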
•Retrying Non-Retryable Errors — Retrying 401 Unauthorized repeatedly locks accounts, triggers security alerts, and wastes resources on requests that will never succeed without intervention.
•Ignoring Retry-After Headers — Many rate limiters and overload responses include Retry-After headers. Ignoring these and retrying sooner is antisocial and extends the outage.
•Retrying During Circuit Breaker Open — If a circuit breaker has opened to protect a failing service, retrying defeats its purpose. Respect the circuit state.
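
The arithmetic behind retry amplification is easy to check. In the worst case, where every layer exhausts its attempts, the load on the deepest service is the product of the attempts made at each layer. The function name `worstCaseAmplification` is made up for this sketch, and each entry counts attempts, meaning the original call plus its retries.

```typescript
// Worst-case number of calls reaching the deepest service when every layer
// uses all of its attempts. attemptsPerLayer[i] = 1 original call + retries.
function worstCaseAmplification(attemptsPerLayer: number[]): number {
  return attemptsPerLayer.reduce((total, attempts) => total * attempts, 1);
}

// Three layers, 3 attempts each (original call + 2 retries).
const threeAttempts = worstCaseAmplification([3, 3, 3]); // 27

// Three layers, 3 retries each (4 attempts per layer).
const threeRetries = worstCaseAmplification([4, 4, 4]); // 64

// Retrying only at the outermost layer keeps the factor linear.
const retryOnlyAtEdge = worstCaseAmplification([3, 1, 1]); // 3
```

The last case shows why many systems retry at a single layer only: the amplification factor then grows with the attempt count rather than with its power.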

The Retry Storm Cascade

In 2017, a major cloud provider experienced a multi-hour outage triggered by a control plane becoming briefly overloaded. Clients without proper backoff immediately retried millions of requests, preventing recovery. The system couldn't stabilize because each recovery attempt was immediately overwhelmed by backed-up retries. This is preventable with proper retry design.
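
A common defence against this kind of retry storm is exponential backoff with jitter. The sketch below uses the "full jitter" variant, in which each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The function name and parameters are assumptions made for this example, and the random source is injectable only to make the behaviour testable.

```typescript
// Exponential backoff with full jitter.
// attempt is zero-based: 0 for the first retry, 1 for the second, and so on.
// The delay is a whole number of milliseconds in [0, ceiling), where
// ceiling = min(capMs, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value of 0.5 the growth and the cap are easy to see.
const half = () => 0.5;
const firstDelay = backoffDelayMs(0, 100, 30000, half); // 50
const fourthDelay = backoffDelayMs(3, 100, 30000, half); // 400
const cappedDelay = backoffDelayMs(20, 100, 30000, half); // 15000
```

The exponential ceiling spaces retries further apart as failures persist, and the randomness keeps clients that failed at the same moment from retrying at the same moment.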

Situations where retrying is almost never appropriate:

Never-Retry Scenarios
Scenario	Why Retry Is Wrong	Correct Action
Authentication failed (401)	Credentials are wrong; retrying won't make them right	Prompt for new credentials
Authorization denied (403)	User lacks permission; retrying won't grant it	Escalate or deny operation
Resource not found (404)	Resource doesn't exist; retrying won't create it	Handle as not found
Business rule violation (422)	Request violates domain rules	Fix request or inform user
Request too large (413)	Payload exceeds limits; retrying same payload fails	Chunk or compress data
Circuit breaker open	Service is known-broken; retries will be rejected locally	Wait for circuit recovery
Global outage/maintenance	Provider reports intentional downtime	Wait or failover to backup
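
For the circuit breaker row, the retry loop needs a way to ask whether a call is currently allowed. The following is a minimal sketch of that gate, not a full implementation: `SimpleCircuitBreaker`, its thresholds, and its method names are hypothetical, the current time is passed in explicitly, and the half-open state does not limit the number of concurrent probes.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class SimpleCircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private failureThreshold: number;
  private openDurationMs: number;

  constructor(failureThreshold: number, openDurationMs: number) {
    this.failureThreshold = failureThreshold;
    this.openDurationMs = openDurationMs;
  }

  // Ask this before every call, including retries.
  allowRequest(nowMs: number): boolean {
    if (this.state === "open") {
      // Still cooling down: fail fast locally instead of calling the service.
      if (nowMs - this.openedAtMs < this.openDurationMs) return false;
      // Cool-down elapsed: let a probe request through.
      this.state = "half-open";
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    // A failed probe, or too many failures in a row, opens the circuit.
    if (
      this.state === "half-open" ||
      this.consecutiveFailures >= this.failureThreshold
    ) {
      this.state = "open";
      this.openedAtMs = nowMs;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

A retry loop would call `allowRequest` before each attempt and give up immediately when it returns false, which is the "respect the circuit state" rule from the anti-pattern list.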

Practical Implementation: A Complete Retry-Decision Engine

Bringing together the principles we've covered, here's a production-quality implementation of a retry decision engine that encapsulates all the considerations discussed.

retry-decision-engine.ts
/**
 * Production-grade Retry Decision Engine
 * 
 * This engine encapsulates the complete decision logic for whether
 * to retry a failed request, unifying failure classification,
 * safety evaluation, and policy enforcement.
 */
 
interface RetryableError {
  type: "http" | "network" | "timeout";
  statusCode?: number;
  errorCode?: string;
  message: string;
}
 
interface RequestContext {
  method: string;
  path: string;
  hasIdempotencyKey: boolean;
  isExplicitlyIdempotent: boolean;  // Marked by developer
  retryCount: number;
  maxRetries: number;
  retryAfterHeader?: number;  // Seconds, from response
}
 
interface RetryDecision {
  shouldRetry: boolean;
  delayMs: number;
  reason: string;
  final: boolean;  // If true, this is definitive; don't ask again
}
 
class RetryDecisionEngine {
  private maxRetriesHard = 5;  // Never exceed this regardless of config
  private minDelayMs = 100;
  private maxDelayMs = 30000;
  
  /**
   * Main entry point: Should we retry this failed request?
   */
  decide(error: RetryableError, context: RequestContext): RetryDecision {
    // =========================================
    // Gate 1: Hard limits
    // =========================================
    if (context.retryCount >= Math.min(context.maxRetries, this.maxRetriesHard)) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: `Retry limit reached (${context.retryCount}/${context.maxRetries})`,
        final: true,
      };
    }
    
    // =========================================
    // Gate 2: Failure classification
    // =========================================
    const failureClass = this.classifyFailure(error);
    
    if (failureClass.type === "permanent") {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: failureClass.reason,
        final: true,
      };
    }
    
    // =========================================
    // Gate 3: Safety check
    // =========================================
    const safetyResult = this.checkSafety(error, context);
    
    if (!safetyResult.safe) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: safetyResult.reason,
        final: true,
      };
    }
    
    // =========================================
    // Approved for retry: Calculate delay
    // =========================================
    let delayMs = this.calculateDelay(error, context);
    
    // Respect Retry-After if provided
    if (context.retryAfterHeader) {
      delayMs = Math.max(delayMs, context.retryAfterHeader * 1000);
    }
    
    // Clamp to limits
    delayMs = Math.max(this.minDelayMs, Math.min(this.maxDelayMs, delayMs));
    
    return {
      shouldRetry: true,
      delayMs,
      reason: `Transient failure, retry ${context.retryCount + 1}/${context.maxRetries}`,
      final: false,
    };
  }
  
  private classifyFailure(error: RetryableError): { type: "transient" | "permanent" | "ambiguous"; reason: string } {
    // HTTP errors
    if (error.type === "http" && error.statusCode) {
      // 4xx client errors are usually permanent
      if (error.statusCode >= 400 && error.statusCode < 500) {
        // Exception: 429 is transient (rate limiting)
        if (error.statusCode === 429) {
          return { type: "transient", reason: "Rate limited" };
        }
        // Exception: 408 is transient (request timeout)
        if (error.statusCode === 408) {
          return { type: "transient", reason: "Request timeout (server side)" };
        }
        return { type: "permanent", reason: `Client error: ${error.statusCode}` };
      }
      
      // 5xx server errors are usually transient
      if (error.statusCode >= 500) {
        // 501 Not Implemented is permanent
        if (error.statusCode === 501) {
          return { type: "permanent", reason: "Not implemented" };
        }
        // 505 HTTP Version Not Supported is permanent
        if (error.statusCode === 505) {
          return { type: "permanent", reason: "HTTP version not supported" };
        }
        return { type: "transient", reason: `Server error: ${error.statusCode}` };
      }
    }
    
    // Network errors
    if (error.type === "network") {
      // Most network errors are transient
      if (error.errorCode === "CERT_HAS_EXPIRED" ||
          error.message.includes("certificate")) {
        return { type: "permanent", reason: "Certificate error" };
      }
      return { type: "transient", reason: "Network error" };
    }
    
    // Timeouts are transient by nature
    if (error.type === "timeout") {
      return { type: "transient", reason: "Request timed out" };
    }
    
    return { type: "ambiguous", reason: "Unknown failure type" };
  }
  
  private checkSafety(
    error: RetryableError,
    context: RequestContext
  ): { safe: boolean; reason: string } {
    const method = context.method.toUpperCase();
    
    // GET, HEAD, OPTIONS are always safe (read-only)
    if (["GET", "HEAD", "OPTIONS"].includes(method)) {
      return { safe: true, reason: "Read-only method" };
    }
    
    // PUT and DELETE are idempotent by HTTP spec
    if (["PUT", "DELETE"].includes(method)) {
      return { safe: true, reason: "Idempotent method by spec" };
    }
    
    // POST and PATCH require explicit idempotency guarantee
    if (["POST", "PATCH"].includes(method)) {
      // If developer marked as idempotent
      if (context.isExplicitlyIdempotent) {
        return { safe: true, reason: "Explicitly marked idempotent" };
      }
      
      // If idempotency key is present
      if (context.hasIdempotencyKey) {
        return { safe: true, reason: "Has idempotency key" };
      }
      
      // Connection-level errors where the request was never sent are safe to retry.
      // Assumes ETIMEDOUT was raised while connecting, not on an established socket.
      if (error.type === "network" && 
          (error.errorCode === "ECONNREFUSED" ||
           error.errorCode === "ETIMEDOUT")) {
        return { safe: true, reason: "Request never transmitted" };
      }
      
      // Timeout on POST/PATCH without idempotency key: UNSAFE
      if (error.type === "timeout") {
        return {
          safe: false,
          reason: "Timeout on non-idempotent POST/PATCH without idempotency key - risk of duplicate execution",
        };
      }
      
      // HTTP error means server received request; need idempotency
      if (error.type === "http") {
        return {
          safe: false,
          reason: "Server error on non-idempotent operation - requires idempotency key for safe retry",
        };
      }
      
      // Any other network error (e.g. ECONNRESET) may have occurred after the
      // request was sent, so the outcome is unknown
      return {
        safe: false,
        reason: "Network error after possible transmission - requires idempotency key for safe retry",
      };
    }
    
    return { safe: false, reason: "Unknown method safety" };
  }
  
  private calculateDelay(error: RetryableError, context: RequestContext): number {
    // Base: exponential backoff
    const base = 100;  // Start at 100ms
    const exponentialDelay = base * Math.pow(2, context.retryCount);
    
    // Add jitter (±25%) to prevent thundering herd
    const jitter = exponentialDelay * 0.25 * (Math.random() * 2 - 1);
    
    return Math.round(exponentialDelay + jitter);
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const engine = new RetryDecisionEngine();
 
// Example 1: GET request that timed out
const decision1 = engine.decide(
  { type: "timeout", message: "Request timed out after 5000ms" },
  {
    method: "GET",
    path: "/api/users/123",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("GET timeout:", decision1);
// { shouldRetry: true, delayMs: ~100, reason: "Transient failure, retry 1/3" }
 
// Example 2: POST payment without idempotency key that timed out
const decision2 = engine.decide(
  { type: "timeout", message: "Request timed out" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("POST payment timeout (no key):", decision2);
// { shouldRetry: false, reason: "Timeout on non-idempotent POST/PATCH..." }
 
// Example 3: POST payment WITH idempotency key that got 503
const decision3 = engine.decide(
  { type: "http", statusCode: 503, message: "Service Unavailable" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: true,
    isExplicitlyIdempotent: false,
    retryCount: 1,
    maxRetries: 3,
    retryAfterHeader: 5,
  }
);
console.log("POST payment 503 (with key):", decision3);
// { shouldRetry: true, delayMs: 5000, reason: "Transient failure, retry 2/3" }
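
A decision object only matters once something acts on it. The following is a minimal sketch of how a caller might consume such decisions. It is synchronous for brevity, and `runWithRetries`, `AttemptResult`, and the injected `wait` callback are hypothetical names that are not part of the engine above; a real client would await a timer and pass the full error and request context.

```typescript
// Hypothetical driver that acts on retry decisions. Every name here is an
// illustrative assumption; `decide` stands in for an engine like the one above.
interface Decision {
  shouldRetry: boolean;
  delayMs: number;
  reason: string;
}

type AttemptResult<T> =
  | { kind: "success"; value: T }
  | { kind: "failure"; error: string };

function runWithRetries<T>(
  attempt: () => AttemptResult<T>,
  decide: (error: string, retryCount: number) => Decision,
  wait: (ms: number) => void
): { result: AttemptResult<T>; attempts: number } {
  let retryCount = 0;
  for (;;) {
    const result = attempt();
    if (result.kind === "success") {
      return { result: result, attempts: retryCount + 1 };
    }
    const decision = decide(result.error, retryCount);
    if (!decision.shouldRetry) {
      // Permanent, unsafe, or out of retries: surface the failure
      return { result: result, attempts: retryCount + 1 };
    }
    wait(decision.delayMs); // Back off before the next attempt
    retryCount += 1;
  }
}

// Usage: an operation that fails twice, then succeeds
let calls = 0;
const recordedWaits: number[] = [];
const outcome = runWithRetries<string>(
  function (): AttemptResult<string> {
    calls += 1;
    if (calls < 3) {
      return { kind: "failure", error: "timeout" };
    }
    return { kind: "success", value: "done" };
  },
  function (_error: string, retryCount: number): Decision {
    return {
      shouldRetry: retryCount < 3,
      delayMs: 100 * Math.pow(2, retryCount),
      reason: "transient",
    };
  },
  function (ms: number): void {
    recordedWaits.push(ms); // Record delays instead of sleeping
  }
);
// outcome.attempts is 3 and recordedWaits is [100, 200]
```

Injecting `wait` keeps the loop testable without real delays; the decision logic stays in one place and the loop only obeys it.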

Summary: The When-to-Retry Manifesto

We've established the foundational principles for retry decisions. Before implementing any retry logic, internalize these principles:

Key Principles

• Classify first, retry second: Determine if the failure is transient, permanent, or ambiguous before any retry decision.
• Safety is non-negotiable: Never retry non-idempotent operations without idempotency guarantees when the outcome is unknown.
• Timeouts are dangerous: They represent unknown states. The server may or may not have processed the request.
• HTTP status codes guide but don't dictate: Use them as signals within a broader decision framework.
• Network failures need context: Whether a retry is safe depends on when in the request lifecycle the failure occurred.
• Retries can kill: Inappropriate retries cause outages. They amplify load, extend incidents, and prevent recovery.

What's next:

Now that we understand when to retry, we need to learn how to retry effectively. The next page covers Exponential Backoff — the mathematical foundation for spacing retries to maximize success probability while minimizing system impact.

Page Complete

You now understand the critical decision framework for retry strategies: classifying failures, evaluating safety, and making informed retry decisions. This foundation is essential before implementing the timing strategies we'll cover next.


The Failure Taxonomy: Not All Failures Are Equal

Before deciding whether to retry, you must first understand what failed and why. Distributed systems exhibit a rich taxonomy of failure modes, each with different implications for retry behavior.

The fundamental classification divides failures into two categories:

Transient Failures — Temporary conditions that will likely resolve on their own
Permanent Failures — Persistent conditions that require intervention to resolve

Transient vs Permanent Failures
Failure Type	Characteristics	Examples	Retry Appropriate?
Transient	Self-resolving, time-bounded, infrastructure-related	Network timeout, connection reset, 503 Service Unavailable, resource contention	Yes — with proper backoff
Permanent	Persistent until external action, logic/data-related	404 Not Found, 401 Unauthorized, invalid request format, business rule violation	No — will fail indefinitely
Ambiguous	Unknown whether transient or permanent	500 Internal Server Error, connection refused, DNS resolution failure	Maybe — requires context and limits
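
One way to make the table actionable is to map each category to a default policy. The sketch below is an illustration only: the category names follow the table, while `DEFAULT_POLICIES`, `mayRetry`, and the numeric limits are assumptions chosen for the example.

```typescript
// Illustrative mapping from failure category to a default retry policy.
// The numeric limits are placeholder assumptions, not recommendations.
type FailureCategory = "transient" | "permanent" | "ambiguous";

interface RetryPolicy {
  retry: boolean;
  maxAttempts: number; // Total attempts, including the first one
  useBackoff: boolean;
}

const DEFAULT_POLICIES: Record<FailureCategory, RetryPolicy> = {
  transient: { retry: true, maxAttempts: 4, useBackoff: true },
  permanent: { retry: false, maxAttempts: 1, useBackoff: false },
  ambiguous: { retry: true, maxAttempts: 2, useBackoff: true },
};

// True when another attempt is allowed for this category
function mayRetry(category: FailureCategory, attemptsSoFar: number): boolean {
  const policy = DEFAULT_POLICIES[category];
  return policy.retry && attemptsSoFar < policy.maxAttempts;
}

const transientAfterOne = mayRetry("transient", 1); // true
const permanentAfterOne = mayRetry("permanent", 1); // false
const ambiguousAfterTwo = mayRetry("ambiguous", 2); // false: limit reached
```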

The Ambiguity Problem

HTTP status codes as retry signals:

HTTP provides some guidance through status codes, but interpretation requires nuance:

http-status-retry-classification.ts
// HTTP Status Code Retry Classification Framework
 
interface RetryDecision {
  shouldRetry: boolean;
  reason: string;
  maxRetries?: number;
  useBackoff?: boolean;
}
 
function classifyHttpStatusForRetry(status: number): RetryDecision {
  // ============================================
  // DEFINITE NO-RETRY: Client errors (4xx)
  // ============================================
  
  // 400 Bad Request - Client sent malformed request
  // Retrying will always produce the same result
  if (status === 400) {
    return {
      shouldRetry: false,
      reason: "Malformed request - fix client code",
    };
  }
  
  // 401 Unauthorized - Authentication required/failed
  // Retrying without new credentials is pointless
  if (status === 401) {
    return {
      shouldRetry: false,
      reason: "Authentication required - obtain valid credentials",
    };
  }
  
  // 403 Forbidden - Server understood the request but refuses to authorize it
  // No amount of retrying grants permission
  if (status === 403) {
    return {
      shouldRetry: false,
      reason: "Permission denied - requires authorization change",
    };
  }
  
  // 404 Not Found - Resource doesn't exist
  // Unless expecting eventual consistency, don't retry
  if (status === 404) {
    return {
      shouldRetry: false,
      reason: "Resource not found - verify resource exists",
    };
  }
  
  // 409 Conflict - Request conflicts with current state
  // Often requires reading current state before retry
  if (status === 409) {
    return {
      shouldRetry: false,  // Usually - but read-modify-write patterns may retry
      reason: "Conflict - resolve state conflict first",
    };
  }
  
  // 422 Unprocessable Entity - Semantic error
  // Request is syntactically valid but semantically wrong
  if (status === 422) {
    return {
      shouldRetry: false,
      reason: "Validation failed - fix request payload",
    };
  }
  
  // ============================================
  // SPECIAL CASE: Rate limiting (429)
  // ============================================
  
  // 429 Too Many Requests - Rate limited
  // SHOULD retry, but with significant delay
  if (status === 429) {
    return {
      shouldRetry: true,
      reason: "Rate limited - respect Retry-After header",
      maxRetries: 3,
      useBackoff: true,  // Or use Retry-After if provided
    };
  }
  
  // ============================================
  // DEFINITE RETRY: Server errors indicating transient issues
  // ============================================
  
  // 502 Bad Gateway - Upstream server error
  // Usually transient - gateway couldn't reach backend
  if (status === 502) {
    return {
      shouldRetry: true,
      reason: "Upstream unavailable - likely transient",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // 503 Service Unavailable - Server temporarily overloaded
  // Explicitly designed for temporary conditions
  if (status === 503) {
    return {
      shouldRetry: true,
      reason: "Service unavailable - temporary condition",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // 504 Gateway Timeout - Upstream timed out
  // Network timing issue - retry with backoff
  if (status === 504) {
    return {
      shouldRetry: true,
      reason: "Gateway timeout - transient network issue",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // ============================================
  // AMBIGUOUS: 500 Internal Server Error
  // ============================================
  
  // 500 is the tricky one - could be anything
  if (status === 500) {
    return {
      shouldRetry: true,  // Default to retry with limits
      reason: "Server error - may be transient or permanent",
      maxRetries: 2,  // Lower limit due to ambiguity
      useBackoff: true,
    };
  }
  
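  // 408 Request Timeout - server gave up waiting for the full request
  // Transient, unlike the other 4xx codes
  if (status === 408) {
    return {
      shouldRetry: true,
      reason: "Request timeout - likely transient",
      maxRetries: 3,
      useBackoff: true,
    };
  }
  
  // Default for unhandled 4xx - don't retry
  if (status >= 400 && status < 500) {
    return {
      shouldRetry: false,
      reason: "Client error - fix request",
    };
  }
  
  // 501 Not Implemented and 505 HTTP Version Not Supported are permanent
  if (status === 501 || status === 505) {
    return {
      shouldRetry: false,
      reason: "Server does not support this request - retrying will not help",
    };
  }
  
  // Default for unhandled 5xx - cautious retry
  if (status >= 500) {
    return {
      shouldRetry: true,
      reason: "Server error - cautious retry",
      maxRetries: 2,
      useBackoff: true,
    };
  }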
  
  // 2xx and 3xx not really "failures"
  return {
    shouldRetry: false,
    reason: "Not a failure status",
  };
}
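
The 429 branch says to respect the Retry-After header, which HTTP allows in two forms: a number of seconds or an HTTP-date. A minimal parsing sketch follows; `parseRetryAfterMs` is a hypothetical helper, and the current time is passed in as `nowMs` so the function stays deterministic.

```typescript
// Parses a Retry-After header value into a delay in milliseconds.
// Returns undefined when the value is missing or cannot be parsed.
// Hypothetical helper for illustration; not part of the classifier above.
function parseRetryAfterMs(
  headerValue: string | undefined,
  nowMs: number
): number | undefined {
  if (headerValue === undefined) {
    return undefined;
  }
  const trimmed = headerValue.trim();
  if (trimmed === "") {
    return undefined;
  }

  // Form 1: delay-seconds, a non-negative integer such as "120"
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10) * 1000;
  }

  // Form 2: HTTP-date, such as "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (isNaN(dateMs)) {
    return undefined;
  }
  // A date in the past means the wait is already over
  return Math.max(0, dateMs - nowMs);
}

// Usage with a fixed "now" so the results are reproducible
const fixedNowMs = Date.parse("Wed, 21 Oct 2015 07:28:00 GMT");
const fromSeconds = parseRetryAfterMs("120", fixedNowMs); // 120000
const fromDate = parseRetryAfterMs("Wed, 21 Oct 2015 07:28:30 GMT", fixedNowMs); // 30000
```

The caller would typically take the larger of this value and its own backoff delay, so the server's request is treated as a minimum wait.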

Beyond HTTP Status Codes: Network-Level Failures

HTTP status codes only tell part of the story. Many failures occur before receiving any HTTP response at all. These network-level failures require their own classification and handling strategies.

Common Network-Level Failures

• Connection Timeout: Unable to establish TCP connection within timeout. Usually transient (server overloaded or network congested). Retry: Yes
• Connection Refused: Server actively rejected connection (port closed). May be transient (service restarting) or permanent (wrong port). Retry: Yes, with limits
• Connection Reset: Established connection forcibly closed by peer. Often transient (server crashed mid-request). Retry: Yes
• DNS Resolution Failure: Cannot resolve hostname. Usually transient (DNS server overloaded) but could be permanent (typo in hostname). Retry: Yes, with verification
• Read Timeout: Connection established but response not received in time. Server may still be processing. Retry: Caution required
• SSL/TLS Handshake Failure: Encryption negotiation failed. Usually permanent (certificate issues) but can be transient (clock skew). Retry: Usually no

The Read Timeout Trap

Comprehensive failure classification in practice:

network-failure-classifier.ts
// Network-Level Failure Classification
 
type FailureType = 
  | "transient"      // Definitely retry
  | "permanent"      // Never retry
  | "ambiguous"      // Retry cautiously with limits
  | "dangerous";     // Retry only if idempotent
 
interface FailureClassification {
  type: FailureType;
  retryable: boolean;
  description: string;
  recommendedAction: string;
}
 
function classifyNetworkError(error: Error): FailureClassification {
  const errorCode = (error as NodeJS.ErrnoException).code;
  const errorMessage = error.message.toLowerCase();
  
  // ============================================
  // Connection Establishment Failures
  // ============================================
  
  if (errorCode === "ECONNREFUSED") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "Connection refused - server not accepting connections",
      recommendedAction: "Retry with backoff; may indicate service down",
    };
  }
  
  if (errorCode === "ETIMEDOUT" || errorCode === "ECONNABORTED") {
    return {
      type: "transient",
      retryable: true,
      description: "Connection timeout - network congestion or overload",
      recommendedAction: "Retry with exponential backoff",
    };
  }
  
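  if (errorCode === "ECONNRESET") {
    // A reset while reading the response ("read ECONNRESET") means the request
    // may already have been processed, so treat it as dangerous. Checking here
    // keeps this case from being swallowed by the generic ECONNRESET branch.
    if (errorMessage.includes("read econnreset")) {
      return {
        type: "dangerous",
        retryable: true, // Only if operation is idempotent!
        description: "Connection reset while reading response",
        recommendedAction: "Retry ONLY if operation is idempotent",
      };
    }
    return {
      type: "transient",
      retryable: true,
      description: "Connection reset - peer unexpectedly closed connection",
      recommendedAction: "Retry immediately or with minimal delay",
    };
  }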
  
  // ============================================
  // DNS Failures
  // ============================================
  
  if (errorCode === "ENOTFOUND" || errorCode === "EAI_AGAIN") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "DNS resolution failed",
      recommendedAction: "Retry with backoff; verify hostname if persists",
    };
  }
  
  // ============================================
  // Read/Write Failures (DANGEROUS)
  // ============================================
  
  if (errorMessage.includes("socket hang up") ||
      errorMessage.includes("read econnreset")) {
    return {
      type: "dangerous",
      retryable: true, // Only if operation is idempotent!
      description: "Connection closed during data transfer",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }
  
  if (errorCode === "EPIPE" || errorCode === "ENOTCONN") {
    return {
      type: "dangerous",
      retryable: true,
      description: "Connection lost while writing",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }
  
  // ============================================
  // SSL/TLS Failures
  // ============================================
  
  if (errorMessage.includes("certificate") ||
      errorMessage.includes("ssl") ||
      errorCode === "CERT_HAS_EXPIRED") {
    return {
      type: "permanent",
      retryable: false,
      description: "SSL/TLS certificate error",
      recommendedAction: "Fix certificate configuration",
    };
  }
  
  if (errorMessage.includes("handshake")) {
    return {
      type: "ambiguous",
      retryable: true, // Sometimes transient (protocol negotiation)
      description: "TLS handshake failed",
      recommendedAction: "Retry once; investigate if persists",
    };
  }
  
  // ============================================
  // Resource Exhaustion
  // ============================================
  
  if (errorCode === "EMFILE" || errorCode === "ENFILE") {
    return {
      type: "transient",
      retryable: true,
      description: "Too many open files - local resource exhaustion",
      recommendedAction: "Retry with delay; may indicate leak",
    };
  }
  
  // ============================================
  // Unknown/Default
  // ============================================
  
  return {
    type: "ambiguous",
    retryable: true,
    description: "Unknown network error",
    recommendedAction: "Retry with limits and monitoring",
  };
}
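
The `retryable` flag alone is not enough for the "dangerous" category, since the answer depends on the operation. A minimal sketch of combining the classification with idempotency is shown below; `gateNetworkRetry`, `RetryGate`, and the retry limits are illustrative assumptions, and `FailureType` is repeated only so the block is self-contained.

```typescript
// Repeated from the classifier so this sketch stands alone
type FailureType = "transient" | "permanent" | "ambiguous" | "dangerous";

interface RetryGate {
  retry: boolean;
  maxRetries: number; // Placeholder limits, chosen for illustration
}

// Hypothetical gate: turns a failure type plus idempotency into a decision
function gateNetworkRetry(type: FailureType, isIdempotent: boolean): RetryGate {
  if (type === "permanent") {
    return { retry: false, maxRetries: 0 };
  }
  if (type === "dangerous") {
    // The request may have been processed: only idempotent work is retried
    return isIdempotent
      ? { retry: true, maxRetries: 2 }
      : { retry: false, maxRetries: 0 };
  }
  if (type === "ambiguous") {
    return { retry: true, maxRetries: 2 };
  }
  return { retry: true, maxRetries: 3 }; // transient
}

const resetOnPayment = gateNetworkRetry("dangerous", false); // retry: false
const resetOnRead = gateNetworkRetry("dangerous", true); // retry: true
```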

The Safety Dimension: When Retries Are Dangerous

Even when failures are clearly transient, retrying may not be safe. The safety of a retry depends critically on the nature of the operation being retried.

This leads to one of the most important concepts in distributed systems: idempotency.

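Definition: Idempotent Operation

An operation is idempotent if performing it multiple times leaves the system in the same state as performing it once.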

Why idempotency matters for retries:

When a request fails with ambiguous outcome (like a timeout), you don't know if:

(A) The server never received the request, or
(B) The server received and processed it, but the response was lost

If you retry:

If (A) was true: Correct behavior — operation now executes once
If (B) was true: Operation executed twice — potentially disastrous
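
Case (B) is what an idempotency key is meant to neutralize: the server remembers the result for each key and replays it instead of executing again. The sketch below is an in-memory illustration under stated assumptions. `PaymentService`, `charge`, and the key format are hypothetical, and a real service would need durable storage, key expiry, concurrency control, and a check that a replayed request carries the same payload.

```typescript
// In-memory sketch of server-side deduplication by idempotency key.
// All names are illustrative assumptions.
interface ChargeResult {
  chargeId: number;
  amountCents: number;
}

class PaymentService {
  private nextChargeId = 1;
  private chargesMade = 0;
  // Object.create(null) avoids collisions with inherited property names
  private resultsByKey: { [key: string]: ChargeResult } = Object.create(null);

  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    const previous = this.resultsByKey[idempotencyKey];
    if (previous !== undefined) {
      return previous; // Replay: return the stored result, do not charge again
    }
    const result: ChargeResult = {
      chargeId: this.nextChargeId,
      amountCents: amountCents,
    };
    this.nextChargeId += 1;
    this.chargesMade += 1;
    this.resultsByKey[idempotencyKey] = result;
    return result;
  }

  totalCharges(): number {
    return this.chargesMade;
  }
}

// Usage: the client reuses one key across the original call and its retry
const payments = new PaymentService();
const firstCharge = payments.charge("order-42", 1999);
const retriedCharge = payments.charge("order-42", 1999); // Response to the first call was lost
// firstCharge.chargeId equals retriedCharge.chargeId and payments.totalCharges() is 1
```

The key must be generated once per logical operation and reused on every retry; a fresh key per attempt would defeat the deduplication.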

HTTP Methods and Idempotency
HTTP Method	Idempotent?	Safe to Retry?	Example
GET	✅ Yes	Always safe	Fetch user profile
HEAD	✅ Yes	Always safe	Check if resource exists
OPTIONS	✅ Yes	Always safe	CORS preflight
PUT	✅ Yes (by design)	Safe if properly implemented	Update user address to '123 Main St'
DELETE	✅ Yes (by design)	Safe if properly implemented	Delete user with ID 42
POST	❌ No (typically)	UNSAFE without idempotency key	Create new order, charge payment
PATCH	❌ No (typically)	UNSAFE for incremental changes	Increment counter, append to list

The critical distinction:

GET, HEAD, OPTIONS: Read-only operations. Always safe to retry because they don't modify state.
PUT, DELETE: Should be idempotent by design, but implementation matters. PUT /users/42 with the same body should always produce the same state. DELETE /users/42 should leave the user absent whether or not it existed beforehand; a repeated call may return 404, but the resulting state is the same.
POST: Generally not idempotent. POST /orders creates a new order each time. POST /payments charges the card each time. Retrying these without safeguards causes duplicate orders and charges.
PATCH: Depends entirely on implementation. PATCH /counter {increment: 1} is non-idempotent. PATCH /user {email: 'new@email.com'} could be idempotent.
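
The PATCH distinction can be made concrete by applying each kind of change twice, as a retry after a lost response would. This is a small sketch with hypothetical names (`UserRecord`, `applySetEmail`, `applyIncrement`) that models the two PATCH bodies as pure functions on a record.

```typescript
interface UserRecord {
  email: string;
  loginCount: number;
}

// Models PATCH /user {email: ...}: sets a field to an absolute value
function applySetEmail(user: UserRecord, email: string): UserRecord {
  return { email: email, loginCount: user.loginCount };
}

// Models PATCH /counter {increment: ...}: changes a field relative to its value
function applyIncrement(user: UserRecord, by: number): UserRecord {
  return { email: user.email, loginCount: user.loginCount + by };
}

const originalUser: UserRecord = { email: "old@example.com", loginCount: 5 };

// Applying the absolute update twice gives the same state as applying it once
const emailOnce = applySetEmail(originalUser, "new@example.com");
const emailTwice = applySetEmail(emailOnce, "new@example.com");

// Applying the relative update twice does not
const countOnce = applyIncrement(originalUser, 1); // loginCount: 6
const countTwice = applyIncrement(countOnce, 1); // loginCount: 7
```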

Real-World Horror Story

retry-safety-evaluator.ts
// Comprehensive Retry Safety Evaluation
 
type OperationType = 
  | "read"           // Pure read, no side effects
  | "idempotent"     // Safe to retry any number of times
  | "non-idempotent" // Executing twice has different effects
  | "unknown";       // Cannot determine safety
 
interface RetrySafetyResult {
  safeToRetry: boolean;
  reason: string;
  recommendation: string;
  requiresIdempotencyKey: boolean;
}
 
class RetrySafetyEvaluator {
  /**
   * Evaluates whether an operation is safe to retry.
   * This is the CRITICAL decision point before any retry.
   */
  evaluate(
    method: string,
    hasIdempotencyKey: boolean,
    operationType: OperationType,
    errorType: "connection" | "timeout" | "response"
  ): RetrySafetyResult {
    
    // =========================================
    // Case 1: Connection failures (request never sent)
    // =========================================
    if (errorType === "connection") {
      // Server never received request — always safe to retry
      return {
        safeToRetry: true,
        reason: "Connection failed before request transmission",
        recommendation: "Retry with exponential backoff",
        requiresIdempotencyKey: false,
      };
    }
    
    // =========================================
    // Case 2: Response-based failures (got response)
    // =========================================
    if (errorType === "response") {
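      // Server responded with error - request reached the server
      // Safe to retry if operation is idempotent
      const isMethodIdempotent = ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]
        .includes(method.toUpperCase());
      // Also honor the caller's declared operation type, as the timeout case does
      const isOperationIdempotent =
        operationType === "read" || operationType === "idempotent";
      
      if (isMethodIdempotent || isOperationIdempotent || hasIdempotencyKey) {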
        return {
          safeToRetry: true,
          reason: "Operation is idempotent or has idempotency key",
          recommendation: "Retry with backoff based on status code",
          requiresIdempotencyKey: false,
        };
      }
      
      return {
        safeToRetry: false,
        reason: "Non-idempotent operation already reached server",
        recommendation: "Do not retry; return error to caller",
        requiresIdempotencyKey: true,
      };
    }
    
    // =========================================
    // Case 3: Timeout failures (UNKNOWN outcome)
    // =========================================
    if (errorType === "timeout") {
      // This is the dangerous case: we don't know if server
      // received and processed the request or not
      
      // Read operations are always safe
      if (operationType === "read") {
        return {
          safeToRetry: true,
          reason: "Read-only operation",
          recommendation: "Retry safely",
          requiresIdempotencyKey: false,
        };
      }
      
      // Idempotent operations are safe
      if (operationType === "idempotent" || hasIdempotencyKey) {
        return {
          safeToRetry: true,
          reason: "Operation is idempotent",
          recommendation: "Retry with backoff",
          requiresIdempotencyKey: false,
        };
      }
      
      // Non-idempotent without key: DANGER
      return {
        safeToRetry: false,
        reason: "Timeout on non-idempotent operation — duplicate risk",
        recommendation: "Log for investigation; alert user of unknown state",
        requiresIdempotencyKey: true,
      };
    }
    
    // Unknown error type — be conservative
    return {
      safeToRetry: false,
      reason: "Cannot determine safety",
      recommendation: "Manual investigation required",
      requiresIdempotencyKey: true,
    };
  }
}
 
// Usage Example
const evaluator = new RetrySafetyEvaluator();
 
// Safe to retry: GET request that timed out
const getRetry = evaluator.evaluate("GET", false, "read", "timeout");
console.log(getRetry);
// { safeToRetry: true, reason: "Read-only operation", ... }
 
// UNSAFE to retry: POST payment that timed out
const paymentRetry = evaluator.evaluate("POST", false, "non-idempotent", "timeout");
console.log(paymentRetry);
// { safeToRetry: false, reason: "Timeout on non-idempotent operation — duplicate risk", ... }
 
// Safe with idempotency key: POST payment with key
const paymentWithKey = evaluator.evaluate("POST", true, "non-idempotent", "timeout");
console.log(paymentWithKey);
// { safeToRetry: true, reason: "Operation is idempotent", ... }

The Retry Decision Framework

The three-question framework:

Before Any Retry, Ask:

• Is the failure transient? If no (permanent failure like 404 or 401), don't retry. If ambiguous, proceed cautiously with retry limits.
• Is the operation safe to retry? If no (non-idempotent POST without idempotency key), don't retry after timeout. For connection failures, usually safe.
• Do we have retry budget remaining? If no (already retried too many times or too many concurrent failures), fail fast rather than amplify the problem.
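
The third question covers more than a per-request counter: a shared budget can cap retries across all requests so that a widespread failure does not multiply traffic. The sketch below shows one simple form of this idea, assuming a ratio-based budget. `RetryBudget` and its parameters are hypothetical, and a real budget would usually track a sliding time window rather than lifetime totals.

```typescript
// Sketch of a ratio-based retry budget: total retries may not exceed a fixed
// percentage of the requests seen so far. Names and limits are assumptions.
class RetryBudget {
  private requests = 0;
  private retries = 0;
  private maxRetryPercent: number;
  private minRetries: number;

  constructor(maxRetryPercent: number, minRetries: number) {
    this.maxRetryPercent = maxRetryPercent;
    this.minRetries = minRetries;
  }

  // Call once for every original (non-retry) request
  recordRequest(): void {
    this.requests += 1;
  }

  // Returns true and consumes budget if a retry is allowed right now
  tryAcquireRetry(): boolean {
    const allowed = Math.max(
      this.minRetries,
      Math.floor((this.requests * this.maxRetryPercent) / 100)
    );
    if (this.retries >= allowed) {
      return false; // Budget exhausted: fail fast instead of adding load
    }
    this.retries += 1;
    return true;
  }
}

// Usage: 20 requests with a 10% budget allow 2 retries in total
const budget = new RetryBudget(10, 1);
for (let i = 0; i < 20; i++) {
  budget.recordRequest();
}
const grants = [
  budget.tryAcquireRetry(),
  budget.tryAcquireRetry(),
  budget.tryAcquireRetry(),
];
// grants is [true, true, false]
```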

Visual decision tree:

retry-decision-tree.txt
                    ┌─────────────────────┐
                    │   REQUEST FAILED    │
                    └─────────┬───────────┘
                              │
                              ▼
              ┌───────────────────────────────┐
              │ Is failure clearly PERMANENT? │
              │  (404, 401, 400, 422, etc.)   │
              └───────────────┬───────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           │ YES              │ NO / UNCLEAR     │
           ▼                  ▼                  │
   ┌───────────────┐  ┌─────────────────────┐   │
   │ DON'T RETRY   │  │ Is operation SAFE   │   │
   │ Return error  │  │ to retry?           │   │
   └───────────────┘  │ (idempotent/has key)│   │
                      └─────────┬───────────┘   │
                                │               │
              ┌─────────────────┼───────────────┤
              │ YES             │ NO            │
              ▼                 ▼               │
   ┌───────────────────┐  ┌─────────────────┐  │
   │ Was this a        │  │ Was it CONN     │  │
   │ TIMEOUT failure?  │  │ failure (never  │  │
   │                   │  │ sent)?          │  │
   └─────────┬─────────┘  └───────┬─────────┘  │
             │                    │             │
    ┌────────┼────────┐    ┌──────┼──────┐     │
    │ YES    │ NO     │    │ YES  │ NO   │     │
    ▼        ▼        │    ▼      ▼      │     │
┌────────┐ ┌────────┐ │ ┌────────┐ ┌────────┐ │
│ RETRY  │ │ RETRY  │ │ │ RETRY  │ │ DON'T  │ │
│with    │ │with    │ │ │(safe)  │ │RETRY   │ │
│backoff │ │backoff │ │ └────────┘ │(risky) │ │
└────────┘ └────────┘ │            └────────┘ │
                      │                       │
                      ▼                       ▼
              ┌───────────────────────────────┐
              │    BEFORE RETRYING:           │
              │    ✓ Check retry budget       │
              │    ✓ Apply backoff delay      │
              │    ✓ Log retry attempt        │
              │    ✓ Increment retry counter  │
              └───────────────────────────────┘

When NOT to Retry: The Anti-Patterns

Understanding when not to retry is as important as knowing when to retry. Inappropriate retries cause more outages than they prevent.

The Deadly Sins of Retry Strategies:

Retry Anti-Patterns

• Immediate Retry Storm: Retrying immediately without delay turns transient overload into sustained overload. One slow database query can trigger a flood of immediate retries that keeps the database from ever recovering.
• Infinite Retry Loops: Retrying forever without limits consumes resources indefinitely. A misconfigured service could retry for hours or days, burning compute and preventing other work.
• Retry Amplification: Service A retries service B, which retries service C. Each layer multiplies the load. Three layers making 3 attempts each (the original call plus 2 retries) can send 3 × 3 × 3 = 27 requests to the bottom layer for a single user request.
• Retrying Non-Retryable Errors: Retrying 401 Unauthorized repeatedly can lock accounts, trigger security alerts, and waste resources on requests that will never succeed without intervention.
• Ignoring Retry-After Headers: Many rate limiters and overload responses include Retry-After headers. Ignoring these and retrying sooner is antisocial and extends the outage.
• Retrying During Circuit Breaker Open: If a circuit breaker has opened to protect a failing service, retrying defeats its purpose. Respect the circuit state.
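
The multiplication behind retry amplification is easy to check. The sketch below counts attempts per layer, meaning the first call plus its retries, and assumes every layer retries independently and every attempt fails; `worstCaseAmplification` is a hypothetical helper name.

```typescript
// Worst-case number of requests reaching the deepest dependency when each
// layer in a call chain makes `attemptsPerLayer` attempts and all of them fail.
function worstCaseAmplification(layers: number, attemptsPerLayer: number): number {
  let total = 1;
  for (let layer = 0; layer < layers; layer++) {
    total *= attemptsPerLayer; // Each layer repeats everything below it
  }
  return total;
}

const threeAttempts = worstCaseAmplification(3, 3); // 27
const threeRetries = worstCaseAmplification(3, 4); // 64: first call plus 3 retries per layer
```

Retrying at only one layer of the chain, or sharing a retry budget across layers, keeps this product from growing with depth.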

The Retry Storm Cascade

Situations where retrying is almost never appropriate:

Never-Retry Scenarios
Scenario	Why Retry Is Wrong	Correct Action
Authentication failed (401)	Credentials are wrong; retrying won't make them right	Prompt for new credentials
Authorization denied (403)	User lacks permission; retrying won't grant it	Escalate or deny operation
Resource not found (404)	Resource doesn't exist; retrying won't create it	Handle as not found
Business rule violation (422)	Request violates domain rules	Fix request or inform user
Request too large (413)	Payload exceeds limits; retrying same payload fails	Chunk or compress data
Circuit breaker open	Service is known-broken; retries will be rejected locally	Wait for circuit recovery
Global outage/maintenance	Provider reports intentional downtime	Wait or failover to backup
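
For the circuit breaker row, the retry path needs a way to ask whether the circuit currently allows traffic. A minimal sketch follows, with an injected clock so it can be exercised without waiting. `CircuitBreaker`, its thresholds, and its method names are hypothetical, and the half-open handling is simplified: after the cooldown it admits any request rather than a single trial.

```typescript
// Minimal circuit breaker sketch. Names and thresholds are assumptions.
class CircuitBreaker {
  private consecutiveFailures = 0;
  private openedAtMs = -1; // -1 means the circuit is closed
  private failureThreshold: number;
  private cooldownMs: number;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True when calls, including retries, may be sent
  allowsRequest(nowMs: number): boolean {
    if (this.openedAtMs < 0) {
      return true;
    }
    // After the cooldown, let traffic through again to probe for recovery
    return nowMs - this.openedAtMs >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAtMs = -1;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAtMs = nowMs; // Open, or re-open after a failed probe
    }
  }
}

// Usage: three consecutive failures open the circuit for 10 seconds
const breaker = new CircuitBreaker(3, 10000);
breaker.recordFailure(0);
breaker.recordFailure(100);
breaker.recordFailure(200);
const duringCooldown = breaker.allowsRequest(300); // false: do not retry
const afterCooldown = breaker.allowsRequest(10200); // true: probe allowed
```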

Practical Implementation: A Complete Retry-Decision Engine

Bringing together the principles we've covered, here's a production-quality implementation of a retry decision engine that encapsulates all the considerations discussed.

retry-decision-engine.ts
/**
 * Production-grade Retry Decision Engine
 * 
 * This engine encapsulates the complete decision logic for whether
 * to retry a failed request, unifying failure classification,
 * safety evaluation, and policy enforcement.
 */
 
interface RetryableError {
  type: "http" | "network" | "timeout";
  statusCode?: number;
  errorCode?: string;
  message: string;
}
 
interface RequestContext {
  method: string;
  path: string;
  hasIdempotencyKey: boolean;
  isExplicitlyIdempotent: boolean;  // Marked by developer
  retryCount: number;
  maxRetries: number;
  retryAfterHeader?: number;  // Seconds, from response
}
 
interface RetryDecision {
  shouldRetry: boolean;
  delayMs: number;
  reason: string;
  final: boolean;  // If true, this is definitive; don't ask again
}
 
class RetryDecisionEngine {
  private maxRetriesHard = 5;  // Never exceed this regardless of config
  private minDelayMs = 100;
  private maxDelayMs = 30000;
  
  /**
   * Main entry point: Should we retry this failed request?
   */
  decide(error: RetryableError, context: RequestContext): RetryDecision {
    // =========================================
    // Gate 1: Hard limits
    // =========================================
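    // The effective limit is the smaller of the caller's limit and the hard cap
    const effectiveMaxRetries = Math.min(context.maxRetries, this.maxRetriesHard);
    if (context.retryCount >= effectiveMaxRetries) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: `Retry limit reached (${context.retryCount}/${effectiveMaxRetries})`,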
        final: true,
      };
    }
    
    // =========================================
    // Gate 2: Failure classification
    // =========================================
    const failureClass = this.classifyFailure(error);
    
    if (failureClass.type === "permanent") {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: failureClass.reason,
        final: true,
      };
    }
    
    // =========================================
    // Gate 3: Safety check
    // =========================================
    const safetyResult = this.checkSafety(error, context);
    
    if (!safetyResult.safe) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: safetyResult.reason,
        final: true,
      };
    }
    
    // =========================================
    // Approved for retry: Calculate delay
    // =========================================
    let delayMs = this.calculateDelay(error, context);
    
    // Respect Retry-After if provided
    if (context.retryAfterHeader) {
      delayMs = Math.max(delayMs, context.retryAfterHeader * 1000);
    }
    
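    // Clamp to limits. Note: a Retry-After longer than maxDelayMs is
    // truncated to maxDelayMs, so the retry may arrive earlier than requested.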
    delayMs = Math.max(this.minDelayMs, Math.min(this.maxDelayMs, delayMs));
    
    return {
      shouldRetry: true,
      delayMs,
      reason: `Transient failure, retry ${context.retryCount + 1}/${context.maxRetries}`,
      final: false,
    };
  }
  
  private classifyFailure(error: RetryableError): { type: "transient" | "permanent" | "ambiguous"; reason: string } {
    // HTTP errors
    if (error.type === "http" && error.statusCode) {
      // 4xx client errors are usually permanent
      if (error.statusCode >= 400 && error.statusCode < 500) {
        // Exception: 429 is transient (rate limiting)
        if (error.statusCode === 429) {
          return { type: "transient", reason: "Rate limited" };
        }
        // Exception: 408 is transient (request timeout)
        if (error.statusCode === 408) {
          return { type: "transient", reason: "Request timeout (server side)" };
        }
        return { type: "permanent", reason: `Client error: ${error.statusCode}` };
      }
      
      // 5xx server errors are usually transient
      if (error.statusCode >= 500) {
        // 501 Not Implemented is permanent
        if (error.statusCode === 501) {
          return { type: "permanent", reason: "Not implemented" };
        }
        // 505 HTTP Version Not Supported is permanent
        if (error.statusCode === 505) {
          return { type: "permanent", reason: "HTTP version not supported" };
        }
        return { type: "transient", reason: `Server error: ${error.statusCode}` };
      }
    }
    
    // Network errors
    if (error.type === "network") {
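      // Most network errors are transient; certificate failures are not,
      // because they persist until the certificate is fixed
      if (error.errorCode === "CERT_HAS_EXPIRED" ||
          error.message.toLowerCase().includes("certificate")) {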
        return { type: "permanent", reason: "Certificate error" };
      }
      return { type: "transient", reason: "Network error" };
    }
    
    // Timeouts are transient by nature
    if (error.type === "timeout") {
      return { type: "transient", reason: "Request timed out" };
    }
    
    return { type: "ambiguous", reason: "Unknown failure type" };
  }
  
  private checkSafety(
    error: RetryableError,
    context: RequestContext
  ): { safe: boolean; reason: string } {
    const method = context.method.toUpperCase();
    
    // GET, HEAD, OPTIONS are always safe (read-only)
    if (["GET", "HEAD", "OPTIONS"].includes(method)) {
      return { safe: true, reason: "Read-only method" };
    }
    
    // PUT and DELETE are idempotent by HTTP spec
    if (["PUT", "DELETE"].includes(method)) {
      return { safe: true, reason: "Idempotent method by spec" };
    }
    
    // POST and PATCH require explicit idempotency guarantee
    if (["POST", "PATCH"].includes(method)) {
      // If developer marked as idempotent
      if (context.isExplicitlyIdempotent) {
        return { safe: true, reason: "Explicitly marked idempotent" };
      }
      
      // If idempotency key is present
      if (context.hasIdempotencyKey) {
        return { safe: true, reason: "Has idempotency key" };
      }
      
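      // Connection refused: no connection was established, so the request
      // was never sent and a retry is safe. ETIMEDOUT is deliberately not
      // treated this way: it can also fire on an established connection,
      // after the request may already have reached the server.
      if (error.type === "network" && error.errorCode === "ECONNREFUSED") {
        return { safe: true, reason: "Request never transmitted" };
      }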
      
      // Timeout on POST without idempotency key: UNSAFE
      if (error.type === "timeout") {
        return {
          safe: false,
          reason: "Timeout on non-idempotent POST/PATCH without idempotency key - risk of duplicate execution",
        };
      }
      
      // HTTP error means server received request; need idempotency
      if (error.type === "http") {
        return {
          safe: false,
          reason: "Server error on non-idempotent operation - requires idempotency key for safe retry",
        };
      }
    }
    
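    // Reached for unrecognized methods and for POST/PATCH failures that
    // matched none of the cases above; default to not retrying
    return { safe: false, reason: "No retry-safety guarantee for this method and error" };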
  }
  
  private calculateDelay(error: RetryableError, context: RequestContext): number {
    // Base: exponential backoff
    const base = 100;  // Start at 100ms
    const exponentialDelay = base * Math.pow(2, context.retryCount);
    
    // Add jitter (±25%) to prevent thundering herd
    const jitter = exponentialDelay * 0.25 * (Math.random() * 2 - 1);
    
    return Math.round(exponentialDelay + jitter);
  }
}
 
// =========================================
// Usage Example
// =========================================
 
const engine = new RetryDecisionEngine();
 
// Example 1: GET request that timed out
const decision1 = engine.decide(
  { type: "timeout", message: "Request timed out after 5000ms" },
  {
    method: "GET",
    path: "/api/users/123",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("GET timeout:", decision1);
// { shouldRetry: true, delayMs: 100-125 (jittered, then clamped to minDelayMs), reason: "Transient failure, retry 1/3", final: false }
 
// Example 2: POST payment without idempotency key that timed out
const decision2 = engine.decide(
  { type: "timeout", message: "Request timed out" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("POST payment timeout (no key):", decision2);
// { shouldRetry: false, delayMs: 0, reason: "Timeout on non-idempotent POST/PATCH...", final: true }
 
// Example 3: POST payment WITH idempotency key that got 503
const decision3 = engine.decide(
  { type: "http", statusCode: 503, message: "Service Unavailable" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: true,
    isExplicitlyIdempotent: false,
    retryCount: 1,
    maxRetries: 3,
    retryAfterHeader: 5,
  }
);
console.log("POST payment 503 (with key):", decision3);
// { shouldRetry: true, delayMs: 5000 (Retry-After wins over the ~200ms backoff), reason: "Transient failure, retry 2/3", final: false }
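
The engine's `retryAfterHeader` field expects a number of seconds, but the raw `Retry-After` header can carry either a delay in seconds or an HTTP date. The following is a minimal sketch of one way to normalize it before building the request context. The function name `parseRetryAfterSeconds` and the injectable `nowMs` parameter are assumptions made for this illustration and are not part of the engine above. `Date.parse` is lenient in some runtimes, so a stricter date parser may be preferable in real code.

```typescript
/**
 * Convert a raw Retry-After header value into whole seconds.
 * Returns undefined when the header is missing or cannot be parsed.
 * `nowMs` is injectable (hypothetical parameter) to keep the function deterministic.
 */
function parseRetryAfterSeconds(
  headerValue: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (headerValue == null) return undefined;
  const value = headerValue.trim();
  if (value === "") return undefined;

  // Form 1: delay in seconds, a non-negative decimal integer (e.g. "120")
  if (/^\d+$/.test(value)) {
    return parseInt(value, 10);
  }

  // Form 2: HTTP date (e.g. "Wed, 21 Oct 2015 07:28:00 GMT")
  const dateMs = Date.parse(value);
  if (isNaN(dateMs)) return undefined;

  // A date in the past means "retry now"; never return a negative delay
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}

// One minute before the date in the header
const oneMinuteBefore = Date.UTC(2015, 9, 21, 7, 27, 0);

console.log(parseRetryAfterSeconds("120")); // 120
console.log(parseRetryAfterSeconds("Wed, 21 Oct 2015 07:28:00 GMT", oneMinuteBefore)); // 60
console.log(parseRetryAfterSeconds(null)); // undefined
```

The result can be passed straight through as `retryAfterHeader` in the `RequestContext`; an `undefined` result simply leaves the engine on its own backoff schedule.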

Summary: The When-to-Retry Manifesto

We've established the foundational principles for retry decisions. Before implementing any retry logic, internalize these principles:

Key Principles

• Classify first, retry second: Determine whether the failure is transient, permanent, or ambiguous before making any retry decision.
• Safety is non-negotiable: Never retry a non-idempotent operation without an idempotency guarantee when the outcome is unknown.
• Timeouts are dangerous: They represent unknown states. The server may or may not have processed the request.
• HTTP status codes guide but don't dictate: Use them as signals within a broader decision framework.
• Network failures need context: Whether a retry is safe depends on when in the request lifecycle the failure occurred.
• Retries can kill: Inappropriate retries cause outages. They amplify load, extend incidents, and prevent recovery.
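
To make the load amplification behind the last principle concrete, here is a small worked calculation. It assumes a chain of services in which every layer independently retries each failed call the same way; the function name `worstCaseAttempts` is hypothetical and exists only for this illustration.

```typescript
/**
 * Worst-case number of attempts that reach the deepest service when every
 * layer in a call chain retries independently.
 * Each array entry is the maximum retry count configured at one layer.
 */
function worstCaseAttempts(retriesPerLayer: number[]): number {
  let attempts = 1;
  for (const retries of retriesPerLayer) {
    attempts *= 1 + retries; // the first try plus its retries
  }
  return attempts;
}

// A single client with 3 retries sends at most 4 requests
console.log(worstCaseAttempts([3])); // 4

// Three stacked layers, each with 3 retries, can turn one user request
// into 64 requests against the failing dependency
console.log(worstCaseAttempts([3, 3, 3])); // 64
```

Amplification multiplies across layers, which is why the engine caps retries with a hard limit.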
