In distributed systems, failure is not an exception; it is the default state. Networks partition, services crash, databases time out, and cloud resources become temporarily unavailable. The natural response is to retry failed operations: if at first you don't succeed, try again.
But here's the paradox: retries can heal systems, and retries can destroy them.
A well-designed retry strategy transforms transient failures into invisible hiccups, maintaining the illusion of reliability for end users. A poorly designed retry strategy amplifies failures exponentially, turning a momentary glitch into a cascading outage that brings down entire platforms.
Understanding when to retry—and equally important, when not to retry—is one of the most critical skills in distributed systems engineering.
By the end of this page, you will understand the fundamental principles governing retry decisions: classifying failures as transient or permanent, evaluating operation safety for retries, recognizing the dangers of naive retry implementations, and building the mental framework that underpins all sophisticated retry strategies.
Before deciding whether to retry, you must first understand what failed and why. Distributed systems exhibit a rich taxonomy of failure modes, each with different implications for retry behavior.
The fundamental classification divides failures into two categories:
| Failure Type | Characteristics | Examples | Retry Appropriate? |
|---|---|---|---|
| Transient | Self-resolving, time-bounded, infrastructure-related | Network timeout, connection reset, 503 Service Unavailable, resource contention | Yes — with proper backoff |
| Permanent | Persistent until external action, logic/data-related | 404 Not Found, 401 Unauthorized, invalid request format, business rule violation | No — will fail indefinitely |
| Ambiguous | Unknown whether transient or permanent | 500 Internal Server Error, connection refused, DNS resolution failure | Maybe — requires context and limits |
The critical insight: Retrying permanent failures wastes resources and delays error propagation to users. Retrying transient failures is essential for reliability. The challenge is classifying failures accurately in real time, with incomplete information.
Many failures are genuinely ambiguous. A 500 Internal Server Error could indicate a transient server overload (retry!) or a permanent code bug triggered by your specific request (don't retry!). Sophisticated retry strategies must handle this ambiguity gracefully.
HTTP status codes as retry signals:
HTTP provides some guidance through status codes, but interpretation requires nuance:
```typescript
// HTTP Status Code Retry Classification Framework

interface RetryDecision {
  shouldRetry: boolean;
  reason: string;
  maxRetries?: number;
  useBackoff?: boolean;
}

function classifyHttpStatusForRetry(status: number): RetryDecision {
  // ============================================
  // DEFINITE NO-RETRY: Client errors (4xx)
  // ============================================

  // 400 Bad Request - Client sent malformed request
  // Retrying will always produce the same result
  if (status === 400) {
    return {
      shouldRetry: false,
      reason: "Malformed request - fix client code",
    };
  }

  // 401 Unauthorized - Authentication required/failed
  // Retrying without new credentials is pointless
  if (status === 401) {
    return {
      shouldRetry: false,
      reason: "Authentication required - obtain valid credentials",
    };
  }

  // 403 Forbidden - Authorized but not permitted
  // No amount of retrying grants permission
  if (status === 403) {
    return {
      shouldRetry: false,
      reason: "Permission denied - requires authorization change",
    };
  }

  // 404 Not Found - Resource doesn't exist
  // Unless expecting eventual consistency, don't retry
  if (status === 404) {
    return {
      shouldRetry: false,
      reason: "Resource not found - verify resource exists",
    };
  }

  // 409 Conflict - Request conflicts with current state
  // Often requires reading current state before retry
  if (status === 409) {
    return {
      shouldRetry: false, // Usually - but read-modify-write patterns may retry
      reason: "Conflict - resolve state conflict first",
    };
  }

  // 422 Unprocessable Entity - Semantic error
  // Request is syntactically valid but semantically wrong
  if (status === 422) {
    return {
      shouldRetry: false,
      reason: "Validation failed - fix request payload",
    };
  }

  // ============================================
  // SPECIAL CASE: Rate limiting (429)
  // ============================================

  // 429 Too Many Requests - Rate limited
  // SHOULD retry, but with significant delay
  if (status === 429) {
    return {
      shouldRetry: true,
      reason: "Rate limited - respect Retry-After header",
      maxRetries: 3,
      useBackoff: true, // Or use Retry-After if provided
    };
  }

  // ============================================
  // DEFINITE RETRY: Server errors indicating transient issues
  // ============================================

  // 502 Bad Gateway - Upstream server error
  // Usually transient - gateway couldn't reach backend
  if (status === 502) {
    return {
      shouldRetry: true,
      reason: "Upstream unavailable - likely transient",
      maxRetries: 3,
      useBackoff: true,
    };
  }

  // 503 Service Unavailable - Server temporarily overloaded
  // Explicitly designed for temporary conditions
  if (status === 503) {
    return {
      shouldRetry: true,
      reason: "Service unavailable - temporary condition",
      maxRetries: 3,
      useBackoff: true,
    };
  }

  // 504 Gateway Timeout - Upstream timed out
  // Network timing issue - retry with backoff
  if (status === 504) {
    return {
      shouldRetry: true,
      reason: "Gateway timeout - transient network issue",
      maxRetries: 3,
      useBackoff: true,
    };
  }

  // ============================================
  // AMBIGUOUS: 500 Internal Server Error
  // ============================================

  // 500 is the tricky one - could be anything
  if (status === 500) {
    return {
      shouldRetry: true, // Default to retry with limits
      reason: "Server error - may be transient or permanent",
      maxRetries: 2, // Lower limit due to ambiguity
      useBackoff: true,
    };
  }

  // Default for unhandled 4xx - don't retry
  if (status >= 400 && status < 500) {
    return {
      shouldRetry: false,
      reason: "Client error - fix request",
    };
  }

  // Default for unhandled 5xx - cautious retry
  if (status >= 500) {
    return {
      shouldRetry: true,
      reason: "Server error - cautious retry",
      maxRetries: 2,
      useBackoff: true,
    };
  }

  // 2xx and 3xx not really "failures"
  return {
    shouldRetry: false,
    reason: "Not a failure status",
  };
}
```

HTTP status codes only tell part of the story. Many failures occur before receiving any HTTP response at all. These network-level failures require their own classification and handling strategies.
Read timeouts are especially dangerous. The server may have successfully processed your request—you just didn't receive the response. Retrying might execute the operation twice. This is the idempotency problem we'll explore in depth later.
Comprehensive failure classification in practice:
```typescript
// Network-Level Failure Classification

type FailureType =
  | "transient"  // Definitely retry
  | "permanent"  // Never retry
  | "ambiguous"  // Retry cautiously with limits
  | "dangerous"; // Retry only if idempotent

interface FailureClassification {
  type: FailureType;
  retryable: boolean;
  description: string;
  recommendedAction: string;
}

function classifyNetworkError(error: Error): FailureClassification {
  const errorCode = (error as NodeJS.ErrnoException).code;
  const errorMessage = error.message.toLowerCase();

  // ============================================
  // Connection Establishment Failures
  // ============================================

  if (errorCode === "ECONNREFUSED") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "Connection refused - server not accepting connections",
      recommendedAction: "Retry with backoff; may indicate service down",
    };
  }

  if (errorCode === "ETIMEDOUT" || errorCode === "ECONNABORTED") {
    return {
      type: "transient",
      retryable: true,
      description: "Connection timeout - network congestion or overload",
      recommendedAction: "Retry with exponential backoff",
    };
  }

  if (errorCode === "ECONNRESET") {
    return {
      type: "transient",
      retryable: true,
      description: "Connection reset - peer unexpectedly closed connection",
      recommendedAction: "Retry immediately or with minimal delay",
    };
  }

  // ============================================
  // DNS Failures
  // ============================================

  if (errorCode === "ENOTFOUND" || errorCode === "EAI_AGAIN") {
    return {
      type: "ambiguous",
      retryable: true,
      description: "DNS resolution failed",
      recommendedAction: "Retry with backoff; verify hostname if persists",
    };
  }

  // ============================================
  // Read/Write Failures (DANGEROUS)
  // ============================================

  if (errorMessage.includes("socket hang up") ||
      errorMessage.includes("read econnreset")) {
    return {
      type: "dangerous",
      retryable: true, // Only if operation is idempotent!
      description: "Connection closed during data transfer",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }

  if (errorCode === "EPIPE" || errorCode === "ENOTCONN") {
    return {
      type: "dangerous",
      retryable: true,
      description: "Connection lost while writing",
      recommendedAction: "Retry ONLY if operation is idempotent",
    };
  }

  // ============================================
  // SSL/TLS Failures
  // ============================================

  if (errorMessage.includes("certificate") ||
      errorMessage.includes("ssl") ||
      errorCode === "CERT_HAS_EXPIRED") {
    return {
      type: "permanent",
      retryable: false,
      description: "SSL/TLS certificate error",
      recommendedAction: "Fix certificate configuration",
    };
  }

  if (errorMessage.includes("handshake")) {
    return {
      type: "ambiguous",
      retryable: true, // Sometimes transient (protocol negotiation)
      description: "TLS handshake failed",
      recommendedAction: "Retry once; investigate if persists",
    };
  }

  // ============================================
  // Resource Exhaustion
  // ============================================

  if (errorCode === "EMFILE" || errorCode === "ENFILE") {
    return {
      type: "transient",
      retryable: true,
      description: "Too many open files - local resource exhaustion",
      recommendedAction: "Retry with delay; may indicate leak",
    };
  }

  // ============================================
  // Unknown/Default
  // ============================================

  return {
    type: "ambiguous",
    retryable: true,
    description: "Unknown network error",
    recommendedAction: "Retry with limits and monitoring",
  };
}
```

Even when failures are clearly transient, retrying may not be safe. The safety of a retry depends critically on the nature of the operation being retried.
This leads to one of the most important concepts in distributed systems: idempotency.
An operation is idempotent if executing it multiple times produces the same result as executing it once. Mathematically: f(f(x)) = f(x). In distributed systems: retrying the same request N times has the same effect as sending it once.
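To make the definition concrete, here is a minimal sketch contrasting an idempotent operation with a non-idempotent one. The `Account` type and the `setBalance` / `addToBalance` helpers are hypothetical names invented for this illustration; they are not part of any API discussed on this page.

```typescript
// Hypothetical in-memory account, used only for illustration.
interface Account {
  balance: number;
}

// Idempotent: sets an absolute value, so repeating it changes nothing further.
function setBalance(account: Account, value: number): Account {
  return { ...account, balance: value };
}

// Non-idempotent: each call applies a relative change on top of the last one.
function addToBalance(account: Account, delta: number): Account {
  return { ...account, balance: account.balance + delta };
}

const start: Account = { balance: 100 };

// Applying the idempotent operation once or three times yields the same state.
const setOnce = setBalance(start, 150);
const setThrice = setBalance(setBalance(setBalance(start, 150), 150), 150);

// Applying the non-idempotent operation three times compounds the effect.
const addOnce = addToBalance(start, 50);
const addThrice = addToBalance(addToBalance(addToBalance(start, 50), 50), 50);

console.log(setOnce.balance, setThrice.balance); // 150 150
console.log(addOnce.balance, addThrice.balance); // 150 250
```

A retried "set" converges on the same state, while a retried "add" drifts further with every duplicate delivery.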
Why idempotency matters for retries:
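When a request fails with an ambiguous outcome (like a timeout), you don't know if the request never reached the server, or if the server processed it and only the response was lost.
If you retry an idempotent operation, both cases end in the same state. If you retry a non-idempotent operation and the first attempt actually succeeded, the operation executes twice.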
| HTTP Method | Idempotent? | Safe to Retry? | Example |
|---|---|---|---|
| GET | ✅ Yes | Always safe | Fetch user profile |
| HEAD | ✅ Yes | Always safe | Check if resource exists |
| OPTIONS | ✅ Yes | Always safe | CORS preflight |
| PUT | ✅ Yes (by design) | Safe if properly implemented | Update user address to '123 Main St' |
| DELETE | ✅ Yes (by design) | Safe if properly implemented | Delete user with ID 42 |
| POST | ❌ No (typically) | UNSAFE without idempotency key | Create new order, charge payment |
| PATCH | ❌ No (typically) | UNSAFE for incremental changes | Increment counter, append to list |
The critical distinction:
GET, HEAD, OPTIONS: Read-only operations. Always safe to retry because they don't modify state.
PUT, DELETE: Should be idempotent by design, but implementation matters. PUT /users/42 with the same body should always produce the same state. DELETE /users/42 should leave the system in the same state whether or not the user still exists; a repeated call may return 404, but nothing further changes.
POST: Generally not idempotent. POST /orders creates a new order each time. POST /payments charges the card each time. Retrying these without safeguards causes duplicate orders and charges.
PATCH: Depends entirely on implementation. PATCH /counter {increment: 1} is non-idempotent. PATCH /user {email: 'new@email.com'} could be idempotent.
A major e-commerce platform once charged customers multiple times because their payment service retried timed-out requests without idempotency keys. A single 30-second database slowdown resulted in thousands of duplicate charges, millions in refunds, and lasting reputation damage. This is why understanding retry safety is non-negotiable.
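The idempotency keys mentioned above can be sketched roughly as follows. This is a simplified, in-memory illustration under stated assumptions: `IdempotentChargeHandler`, `ChargeResult`, and the key format are hypothetical, and a real service would likely need durable storage, an atomic check-and-set, key expiry, and handling for concurrent in-flight requests that share a key.

```typescript
// Hypothetical payment result shape; field names are assumptions.
interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

class IdempotentChargeHandler {
  // Maps idempotency key -> result of the first execution for that key.
  private readonly completed = new Map<string, ChargeResult>();
  private chargesExecuted = 0;

  charge(idempotencyKey: string, amountCents: number): ChargeResult {
    const previous = this.completed.get(idempotencyKey);
    if (previous !== undefined) {
      // Replay: return the stored result instead of charging again.
      return previous;
    }
    this.chargesExecuted += 1;
    const result: ChargeResult = {
      chargeId: `ch_${this.chargesExecuted}`,
      amountCents,
    };
    this.completed.set(idempotencyKey, result);
    return result;
  }

  // Number of charges that actually ran (replays are not counted).
  get executedCount(): number {
    return this.chargesExecuted;
  }
}

const handler = new IdempotentChargeHandler();
const first = handler.charge("order-1001", 2500);
const retried = handler.charge("order-1001", 2500); // client retry after a timeout
const other = handler.charge("order-1002", 900);

console.log(first.chargeId === retried.chargeId); // true
console.log(handler.executedCount); // 2
```

The client generates the key once per logical operation and reuses it on every retry, so the server can tell a retry apart from a genuinely new request.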
```typescript
// Comprehensive Retry Safety Evaluation

type OperationType =
  | "read"           // Pure read, no side effects
  | "idempotent"     // Safe to retry any number of times
  | "non-idempotent" // Executing twice has different effects
  | "unknown";       // Cannot determine safety

interface RetrySafetyResult {
  safeToRetry: boolean;
  reason: string;
  recommendation: string;
  requiresIdempotencyKey: boolean;
}

class RetrySafetyEvaluator {
  /**
   * Evaluates whether an operation is safe to retry.
   * This is the CRITICAL decision point before any retry.
   */
  evaluate(
    method: string,
    hasIdempotencyKey: boolean,
    operationType: OperationType,
    errorType: "connection" | "timeout" | "response"
  ): RetrySafetyResult {
    // =========================================
    // Case 1: Connection failures (request never sent)
    // =========================================
    if (errorType === "connection") {
      // Server never received request — always safe to retry
      return {
        safeToRetry: true,
        reason: "Connection failed before request transmission",
        recommendation: "Retry with exponential backoff",
        requiresIdempotencyKey: false,
      };
    }

    // =========================================
    // Case 2: Response-based failures (got response)
    // =========================================
    if (errorType === "response") {
      // Server responded with error — request was processed
      // Safe to retry if operation is idempotent
      const isMethodIdempotent = ["GET", "HEAD", "OPTIONS", "PUT", "DELETE"]
        .includes(method.toUpperCase());

      if (isMethodIdempotent || hasIdempotencyKey) {
        return {
          safeToRetry: true,
          reason: "Operation is idempotent or has idempotency key",
          recommendation: "Retry with backoff based on status code",
          requiresIdempotencyKey: false,
        };
      }

      return {
        safeToRetry: false,
        reason: "Non-idempotent operation already reached server",
        recommendation: "Do not retry; return error to caller",
        requiresIdempotencyKey: true,
      };
    }

    // =========================================
    // Case 3: Timeout failures (UNKNOWN outcome)
    // =========================================
    if (errorType === "timeout") {
      // This is the dangerous case: we don't know if server
      // received and processed the request or not

      // Read operations are always safe
      if (operationType === "read") {
        return {
          safeToRetry: true,
          reason: "Read-only operation",
          recommendation: "Retry safely",
          requiresIdempotencyKey: false,
        };
      }

      // Idempotent operations are safe
      if (operationType === "idempotent" || hasIdempotencyKey) {
        return {
          safeToRetry: true,
          reason: "Operation is idempotent",
          recommendation: "Retry with backoff",
          requiresIdempotencyKey: false,
        };
      }

      // Non-idempotent without key: DANGER
      return {
        safeToRetry: false,
        reason: "Timeout on non-idempotent operation — duplicate risk",
        recommendation: "Log for investigation; alert user of unknown state",
        requiresIdempotencyKey: true,
      };
    }

    // Unknown error type — be conservative
    return {
      safeToRetry: false,
      reason: "Cannot determine safety",
      recommendation: "Manual investigation required",
      requiresIdempotencyKey: true,
    };
  }
}

// Usage Example
const evaluator = new RetrySafetyEvaluator();

// Safe to retry: GET request that timed out
const getRetry = evaluator.evaluate("GET", false, "read", "timeout");
console.log(getRetry);
// { safeToRetry: true, reason: "Read-only operation", ... }

// UNSAFE to retry: POST payment that timed out
const paymentRetry = evaluator.evaluate("POST", false, "non-idempotent", "timeout");
console.log(paymentRetry);
// { safeToRetry: false, reason: "Timeout on non-idempotent operation — duplicate risk", ... }

// Safe with idempotency key: POST payment with key
const paymentWithKey = evaluator.evaluate("POST", true, "non-idempotent", "timeout");
console.log(paymentWithKey);
// { safeToRetry: true, reason: "Operation is idempotent", ... }
```

Combining failure classification and safety evaluation, we can construct a comprehensive retry decision framework. This framework should be applied consistently across all service-to-service communication.
The three-question framework:
Visual decision tree:
```
REQUEST FAILED
│
▼
Is failure clearly PERMANENT? (404, 401, 400, 422, etc.)
├── YES ──► DON'T RETRY: return error
└── NO / UNCLEAR
    │
    ▼
    Is operation SAFE to retry? (idempotent / has key)
    ├── YES ──► Was this a TIMEOUT failure?
    │           ├── YES ──► RETRY with backoff
    │           └── NO  ──► RETRY with backoff
    └── NO  ──► Was it a CONN failure (never sent)?
                ├── YES ──► RETRY (safe)
                └── NO  ──► DON'T RETRY (risky)

BEFORE RETRYING:
  ✓ Check retry budget
  ✓ Apply backoff delay
  ✓ Log retry attempt
  ✓ Increment retry counter
```

Understanding when not to retry is as important as knowing when to retry. Inappropriate retries cause more outages than they prevent.
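One guard against inappropriate retries is the "check retry budget" step from the decision tree. The sketch below shows one possible shape for it, assuming a budget expressed as a ratio of retries to original requests. `RetryBudget`, `maxRetryRatio`, `recordRequest`, and `tryAcquireRetry` are hypothetical names, and a production budget would typically track a sliding time window rather than lifetime counters.

```typescript
// Hypothetical retry budget: retries may not exceed a fixed fraction of
// the original requests seen so far.
class RetryBudget {
  private requests = 0;
  private retries = 0;

  // maxRetryRatio is an assumed tuning knob, e.g. 0.25 means retries may
  // add at most 25% extra load on top of original requests.
  constructor(private readonly maxRetryRatio: number) {}

  // Call once for every original (non-retry) request.
  recordRequest(): void {
    this.requests += 1;
  }

  // Returns true and consumes budget if a retry is allowed right now.
  tryAcquireRetry(): boolean {
    if (this.retries + 1 > this.requests * this.maxRetryRatio) {
      return false;
    }
    this.retries += 1;
    return true;
  }
}

const budget = new RetryBudget(0.25);
for (let i = 0; i < 8; i++) {
  budget.recordRequest();
}

// With 8 requests and a 25% ratio, only 2 retries fit in the budget.
const granted: boolean[] = [];
for (let i = 0; i < 4; i++) {
  granted.push(budget.tryAcquireRetry());
}
console.log(granted); // [ true, true, false, false ]
```

Because the budget is shared across calls, a widespread failure exhausts it quickly and the client stops adding retry load, regardless of what each individual request's retry counter says.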
The Deadly Sins of Retry Strategies:
In 2017, a major cloud provider experienced a multi-hour outage triggered by a control plane becoming briefly overloaded. Clients without proper backoff immediately retried millions of requests, preventing recovery. The system couldn't stabilize because each recovery attempt was immediately overwhelmed by backed-up retries. This is preventable with proper retry design.
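A rough way to see how retries amplify load is to count worst-case attempts when every layer in a call chain retries independently. The function below is a back-of-the-envelope sketch, not a model of any specific incident; `worstCaseAttempts` and its parameters are names chosen for this illustration.

```typescript
// Worst-case number of attempts that reach the deepest service when each
// of `layers` services in a call chain retries a failing downstream call.
// `retriesPerLayer` counts retries only, so each layer makes
// (retriesPerLayer + 1) attempts in total.
function worstCaseAttempts(layers: number, retriesPerLayer: number): number {
  return Math.pow(retriesPerLayer + 1, layers);
}

console.log(worstCaseAttempts(1, 3)); // 4
console.log(worstCaseAttempts(2, 3)); // 16
console.log(worstCaseAttempts(3, 3)); // 64
```

Under these assumptions, three layers with three retries each can turn one user request into 64 attempts against the already struggling service at the bottom of the chain, which is why retries are usually confined to a single layer or capped by a shared budget.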
Situations where retrying is almost never appropriate:
| Scenario | Why Retry Is Wrong | Correct Action |
|---|---|---|
| Authentication failed (401) | Credentials are wrong; retrying won't make them right | Prompt for new credentials |
| Authorization denied (403) | User lacks permission; retrying won't grant it | Escalate or deny operation |
| Resource not found (404) | Resource doesn't exist; retrying won't create it | Handle as not found |
| Business rule violation (422) | Request violates domain rules | Fix request or inform user |
| Request too large (413) | Payload exceeds limits; retrying same payload fails | Chunk or compress data |
| Circuit breaker open | Service is known-broken; retries will be rejected locally | Wait for circuit recovery |
| Global outage/maintenance | Provider reports intentional downtime | Wait or failover to backup |
Bringing together the principles we've covered, here's a production-quality implementation of a retry decision engine that encapsulates all the considerations discussed.
```typescript
/**
 * Production-grade Retry Decision Engine
 *
 * This engine encapsulates the complete decision logic for whether
 * to retry a failed request, unifying failure classification,
 * safety evaluation, and policy enforcement.
 */

interface RetryableError {
  type: "http" | "network" | "timeout";
  statusCode?: number;
  errorCode?: string;
  message: string;
}

interface RequestContext {
  method: string;
  path: string;
  hasIdempotencyKey: boolean;
  isExplicitlyIdempotent: boolean; // Marked by developer
  retryCount: number;
  maxRetries: number;
  retryAfterHeader?: number; // Seconds, from response
}

interface RetryDecision {
  shouldRetry: boolean;
  delayMs: number;
  reason: string;
  final: boolean; // If true, this is definitive; don't ask again
}

class RetryDecisionEngine {
  private maxRetriesHard = 5; // Never exceed this regardless of config
  private minDelayMs = 100;
  private maxDelayMs = 30000;

  /**
   * Main entry point: Should we retry this failed request?
   */
  decide(error: RetryableError, context: RequestContext): RetryDecision {
    // =========================================
    // Gate 1: Hard limits
    // =========================================
    if (context.retryCount >= Math.min(context.maxRetries, this.maxRetriesHard)) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: `Retry limit reached (${context.retryCount}/${context.maxRetries})`,
        final: true,
      };
    }

    // =========================================
    // Gate 2: Failure classification
    // =========================================
    const failureClass = this.classifyFailure(error);

    if (failureClass.type === "permanent") {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: failureClass.reason,
        final: true,
      };
    }

    // =========================================
    // Gate 3: Safety check
    // =========================================
    const safetyResult = this.checkSafety(error, context);

    if (!safetyResult.safe) {
      return {
        shouldRetry: false,
        delayMs: 0,
        reason: safetyResult.reason,
        final: true,
      };
    }

    // =========================================
    // Approved for retry: Calculate delay
    // =========================================
    let delayMs = this.calculateDelay(error, context);

    // Respect Retry-After if provided
    if (context.retryAfterHeader) {
      delayMs = Math.max(delayMs, context.retryAfterHeader * 1000);
    }

    // Clamp to limits
    delayMs = Math.max(this.minDelayMs, Math.min(this.maxDelayMs, delayMs));

    return {
      shouldRetry: true,
      delayMs,
      reason: `Transient failure, retry ${context.retryCount + 1}/${context.maxRetries}`,
      final: false,
    };
  }

  private classifyFailure(error: RetryableError): {
    type: "transient" | "permanent" | "ambiguous";
    reason: string
  } {
    // HTTP errors
    if (error.type === "http" && error.statusCode) {
      // 4xx client errors are usually permanent
      if (error.statusCode >= 400 && error.statusCode < 500) {
        // Exception: 429 is transient (rate limiting)
        if (error.statusCode === 429) {
          return { type: "transient", reason: "Rate limited" };
        }
        // Exception: 408 is transient (request timeout)
        if (error.statusCode === 408) {
          return { type: "transient", reason: "Request timeout (server side)" };
        }
        return { type: "permanent", reason: `Client error: ${error.statusCode}` };
      }

      // 5xx server errors are usually transient
      if (error.statusCode >= 500) {
        // 501 Not Implemented is permanent
        if (error.statusCode === 501) {
          return { type: "permanent", reason: "Not implemented" };
        }
        // 505 HTTP Version Not Supported is permanent
        if (error.statusCode === 505) {
          return { type: "permanent", reason: "HTTP version not supported" };
        }
        return { type: "transient", reason: `Server error: ${error.statusCode}` };
      }
    }

    // Network errors
    if (error.type === "network") {
      // Most network errors are transient
      if (error.errorCode === "CERT_HAS_EXPIRED" ||
          error.message.includes("certificate")) {
        return { type: "permanent", reason: "Certificate error" };
      }
      return { type: "transient", reason: "Network error" };
    }

    // Timeouts are transient by nature
    if (error.type === "timeout") {
      return { type: "transient", reason: "Request timed out" };
    }

    return { type: "ambiguous", reason: "Unknown failure type" };
  }

  private checkSafety(
    error: RetryableError,
    context: RequestContext
  ): { safe: boolean; reason: string } {
    const method = context.method.toUpperCase();

    // GET, HEAD, OPTIONS are always safe (read-only)
    if (["GET", "HEAD", "OPTIONS"].includes(method)) {
      return { safe: true, reason: "Read-only method" };
    }

    // PUT and DELETE are idempotent by HTTP spec
    if (["PUT", "DELETE"].includes(method)) {
      return { safe: true, reason: "Idempotent method by spec" };
    }

    // POST and PATCH require explicit idempotency guarantee
    if (["POST", "PATCH"].includes(method)) {
      // If developer marked as idempotent
      if (context.isExplicitlyIdempotent) {
        return { safe: true, reason: "Explicitly marked idempotent" };
      }

      // If idempotency key is present
      if (context.hasIdempotencyKey) {
        return { safe: true, reason: "Has idempotency key" };
      }

      // For network/connection errors (request never sent), safe to retry
      if (error.type === "network" &&
          (error.errorCode === "ECONNREFUSED" || error.errorCode === "ETIMEDOUT")) {
        return { safe: true, reason: "Request never transmitted" };
      }

      // Timeout on POST without idempotency key: UNSAFE
      if (error.type === "timeout") {
        return {
          safe: false,
          reason: "Timeout on non-idempotent POST/PATCH without idempotency key - risk of duplicate execution",
        };
      }

      // HTTP error means server received request; need idempotency
      if (error.type === "http") {
        return {
          safe: false,
          reason: "Server error on non-idempotent operation - requires idempotency key for safe retry",
        };
      }
    }

    return { safe: false, reason: "Unknown method safety" };
  }

  private calculateDelay(error: RetryableError, context: RequestContext): number {
    // Base: exponential backoff
    const base = 100; // Start at 100ms
    const exponentialDelay = base * Math.pow(2, context.retryCount);

    // Add jitter (±25%) to prevent thundering herd
    const jitter = exponentialDelay * 0.25 * (Math.random() * 2 - 1);

    return Math.round(exponentialDelay + jitter);
  }
}

// =========================================
// Usage Example
// =========================================

const engine = new RetryDecisionEngine();

// Example 1: GET request that timed out
const decision1 = engine.decide(
  { type: "timeout", message: "Request timed out after 5000ms" },
  {
    method: "GET",
    path: "/api/users/123",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("GET timeout:", decision1);
// { shouldRetry: true, delayMs: ~100, reason: "Transient failure, retry 1/3" }

// Example 2: POST payment without idempotency key that timed out
const decision2 = engine.decide(
  { type: "timeout", message: "Request timed out" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: false,
    isExplicitlyIdempotent: false,
    retryCount: 0,
    maxRetries: 3,
  }
);
console.log("POST payment timeout (no key):", decision2);
// { shouldRetry: false, reason: "Timeout on non-idempotent POST/PATCH..." }

// Example 3: POST payment WITH idempotency key that got 503
const decision3 = engine.decide(
  { type: "http", statusCode: 503, message: "Service Unavailable" },
  {
    method: "POST",
    path: "/api/payments",
    hasIdempotencyKey: true,
    isExplicitlyIdempotent: false,
    retryCount: 1,
    maxRetries: 3,
    retryAfterHeader: 5,
  }
);
console.log("POST payment 503 (with key):", decision3);
// { shouldRetry: true, delayMs: 5000, reason: "Transient failure, retry 2/3" }
```

We've established the foundational principles for retry decisions. Before implementing any retry logic, internalize these principles:
What's next:
Now that we understand when to retry, we need to learn how to retry effectively. The next page covers Exponential Backoff — the mathematical foundation for spacing retries to maximize success probability while minimizing system impact.
You now understand the critical decision framework for retry strategies: classifying failures, evaluating safety, and making informed retry decisions. This foundation is essential before implementing the timing strategies we'll cover next.