In a monolithic application, a method call either succeeds, throws an exception, or hangs indefinitely (a bug). In distributed systems, an entirely new category of failures emerges: partial failures. Service A might successfully process a request, but Service B's response is lost in transit. Or Service B processes the request but takes so long that Service A times out and assumes failure.
This is not a pathological scenario—it's the normal operating mode of distributed systems. Networks fail, services restart, databases have maintenance windows, and cloud providers experience outages. Building reliable systems means accepting that failure is not exceptional; it's expected.
The question isn't "how do we prevent failures?" but rather "how do we design systems that behave predictably when failures inevitably occur?"
By the end of this page, you will understand the unique failure modes of distributed systems, master patterns like circuit breakers, retries, and bulkheads, learn to propagate errors meaningfully across service boundaries, and design systems that degrade gracefully rather than failing catastrophically.
Before designing error handling strategies, we must understand the unique failure modes of distributed systems. These failures don't exist in monolithic applications and require fundamentally different handling approaches.
Connection Failure: Service cannot establish a TCP connection
Request Timeout: Connection established but no response received
Partial Response: Response truncated or corrupted
Server Errors (5xx): Service understood request but couldn't fulfill it
Client Errors (4xx): Request was problematic
The most insidious failure mode is timeout with unknown outcome. When Service A calls Service B and times out, there are three possibilities:
Service B never received the request (retrying is safe).
Service B received and processed the request, but the response was lost (retrying duplicates the work).
Service B is still processing the request and will finish after Service A has already given up.
You cannot distinguish these cases from Service A's perspective. This fundamental uncertainty drives the need for idempotency and compensating transactions.
| Failure Type | Retryable? | Safe to Retry Without Idempotency? | Typical Action |
|---|---|---|---|
| Connection refused | Yes | Yes | Immediate retry (different instance) |
| DNS failure | Usually | Yes | Retry after delay (DNS may recover) |
| Connection timeout | Yes | Yes | Retry (operation never reached server) |
| Read timeout | Maybe | NO — Uncertain state | Retry only if idempotent |
| 503 Service Unavailable | Yes | Usually Yes | Retry with backoff |
| 500 Internal Server Error | Maybe | Maybe | Depends on error type |
| 400 Bad Request | No | N/A | Fix request or fail permanently |
| 401 Unauthorized | Maybe | Yes | Refresh credentials, retry once |
| 429 Too Many Requests | Yes | Yes | Back off per Retry-After header |
Read timeouts are the most dangerous failure mode. If you time out while waiting for a response, you have zero knowledge about whether the operation completed. Naively retrying can cause duplicate orders, duplicate payments, or double inventory decrements. Every mutating operation in a distributed system should be designed for idempotent retry.
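One common way to make a mutating call safe to retry is an idempotency key: the client generates a unique key per logical operation and sends it with every attempt, and the server stores the result keyed by it so repeated attempts return the original outcome instead of re-executing. A minimal sketch, assuming hypothetical `ChargeRequest`/`ChargeResult` types and an in-memory store (a real system would persist keys in a database or cache):

```typescript
// Minimal idempotency sketch. The types and in-memory Map are illustrative
// assumptions, not a specific payment API.
import { randomUUID } from 'crypto';

interface ChargeRequest { customerId: string; amountCents: number; }
interface ChargeResult { chargeId: string; status: 'succeeded' | 'failed'; }

// Server side: remember the outcome of each idempotency key
const processedCharges = new Map<string, ChargeResult>();

async function chargeOnce(key: string, req: ChargeRequest): Promise<ChargeResult> {
  const existing = processedCharges.get(key);
  if (existing) return existing; // Replay: return the stored result, don't charge again

  const result: ChargeResult = { // Execute the real charge exactly once
    chargeId: randomUUID(),
    status: 'succeeded',
  };
  processedCharges.set(key, result);
  return result;
}

// Client side: reuse the SAME key across retries of the same logical operation
async function chargeWithRetry(req: ChargeRequest): Promise<ChargeResult> {
  const idempotencyKey = randomUUID(); // Generated once per order, not per attempt
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await chargeOnce(idempotencyKey, req);
    } catch {
      // A read timeout here is now safe: replaying the key cannot double-charge
    }
  }
  throw new Error('Charge failed after retries');
}
```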
Retries are the first line of defense against transient failures. But naive retries can amplify problems instead of solving them. Effective retry strategies require careful consideration of timing, backoff, and limits.
Fixed Delay: Wait the same amount between retries
Exponential Backoff: Double the delay each retry (1s, 2s, 4s, 8s...)
Exponential Backoff with Jitter: Add randomness to break synchronization
Decorrelated Jitter: Even more aggressive randomization
sleep = min(cap, random_between(base, sleep * 3))
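The comprehensive implementation below uses plain exponential backoff with jitter; for completeness, here is a small sketch of the decorrelated jitter formula above (the base and cap values are illustrative):

```typescript
// Decorrelated jitter: each delay is drawn from [base, previousDelay * 3], capped
function nextDecorrelatedDelay(previousMs: number, baseMs = 100, capMs = 30_000): number {
  const raw = baseMs + Math.random() * (previousMs * 3 - baseMs);
  return Math.min(capMs, Math.max(baseMs, raw));
}

// Example: generate the first five delays
let delay = 100;
for (let i = 0; i < 5; i++) {
  delay = nextDecorrelatedDelay(delay);
  console.log(`attempt ${i + 1}: wait ${delay.toFixed(0)}ms`);
}
```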
```typescript
// Comprehensive retry implementation with exponential backoff and jitter
interface RetryConfig {
  maxRetries: number;           // Maximum retry attempts
  baseDelayMs: number;          // Base delay for exponential backoff
  maxDelayMs: number;           // Cap on delay
  jitterFactor: number;         // 0-1, how much randomness to add
  retryableErrors: Set<string>; // Error types that should trigger retry
}

const defaultConfig: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 100,
  maxDelayMs: 30000,
  jitterFactor: 0.5,
  retryableErrors: new Set([
    'ECONNRESET',
    'ETIMEDOUT',
    'ECONNREFUSED',
    'SERVICE_UNAVAILABLE',
    'TOO_MANY_REQUESTS',
  ]),
};

async function withRetry<T>(
  operation: () => Promise<T>,
  config: Partial<RetryConfig> = {}
): Promise<T> {
  const cfg = { ...defaultConfig, ...config };
  let lastError: Error;

  for (let attempt = 0; attempt <= cfg.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      // Check if error is retryable
      if (!isRetryable(error, cfg.retryableErrors)) {
        throw error; // Non-retryable, fail immediately
      }

      // Check if we have retries left
      if (attempt === cfg.maxRetries) {
        throw new Error(
          `All ${cfg.maxRetries} retries exhausted: ${lastError.message}`
        );
      }

      // Calculate delay with exponential backoff + jitter
      const exponentialDelay = cfg.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, cfg.maxDelayMs);
      const jitter = cappedDelay * cfg.jitterFactor * Math.random();
      const finalDelay = cappedDelay + jitter;

      console.log(
        `Retry ${attempt + 1}/${cfg.maxRetries} after ${finalDelay.toFixed(0)}ms`
      );
      await sleep(finalDelay);
    }
  }

  throw lastError!;
}

function isRetryable(error: unknown, retryableErrors: Set<string>): boolean {
  if (error instanceof HttpError) {
    // Rate limit: always retry (after backoff)
    if (error.status === 429) return true;
    // Server errors: retry (service might recover)
    if (error.status >= 500 && error.status <= 599) return true;
    // Client errors: don't retry (our request is wrong)
    if (error.status >= 400 && error.status <= 499) return false;
  }

  if (error instanceof Error) {
    // Check error code (Node.js network errors)
    const code = (error as NodeJS.ErrnoException).code;
    if (code && retryableErrors.has(code)) return true;
  }

  return false;
}

// Usage
const result = await withRetry(
  () => orderService.createOrder(orderData),
  { maxRetries: 3, baseDelayMs: 100 }
);
```

Instead of per-request retry limits, use retry budgets: 'no more than 20% of requests should be retries.' This prevents retry amplification where retries cause more load than original requests. Google SRE practices emphasize budget-based approaches over per-call configurations.
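A minimal sketch of a retry budget: track original calls versus retries in a sliding window and refuse to retry once retries exceed the budgeted fraction. The 20% ratio and window size below are illustrative assumptions, not prescribed values.

```typescript
// Minimal retry budget sketch: allow retries only while they stay under
// a fixed fraction of recent traffic. Window size and ratio are assumptions.
class RetryBudget {
  private calls: number[] = [];   // Timestamps of original calls
  private retries: number[] = []; // Timestamps of retries

  constructor(
    private readonly maxRetryRatio = 0.2, // Retries may be at most 20% of calls
    private readonly windowMs = 10_000
  ) {}

  recordCall(): void {
    this.calls.push(Date.now());
  }

  canRetry(): boolean {
    this.prune();
    const total = this.calls.length;
    if (total === 0) return false;
    return this.retries.length / total < this.maxRetryRatio;
  }

  recordRetry(): void {
    this.retries.push(Date.now());
  }

  private prune(): void {
    const cutoff = Date.now() - this.windowMs;
    this.calls = this.calls.filter(t => t > cutoff);
    this.retries = this.retries.filter(t => t > cutoff);
  }
}

// Usage inside a retry loop: skip the retry when the budget is exhausted
const budget = new RetryBudget();
budget.recordCall();
if (budget.canRetry()) {
  budget.recordRetry();
  // ...re-issue the request here
}
```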
Circuit breakers prevent cascade failures by stopping calls to failing services. Named after electrical circuit breakers that prevent overload, they provide fast failure and automatic recovery.
CLOSED (Normal Operation): Requests flow through normally while failures are counted; crossing the failure threshold trips the circuit to OPEN.
OPEN (Blocking Requests): Calls fail immediately without reaching the downstream service; after the open timeout, the circuit moves to HALF-OPEN.
HALF-OPEN (Testing Recovery): A limited number of trial requests are allowed through; enough successes close the circuit, and any failure reopens it.
This state machine prevents a failing service from being overwhelmed while simultaneously failing fast for callers.
| Parameter | Typical Value | Purpose |
|---|---|---|
| Failure Threshold | 5-10 failures | Errors before opening |
| Failure Rate Threshold | 50% failure rate | Error percentage before opening |
| Measurement Window | 10 seconds | Time window for counting failures |
| Open Timeout | 30-60 seconds | Time before trying HALF-OPEN |
| Half-Open Trials | 3 requests | Successful calls needed to close |
| Minimum Throughput | 10 requests | Min calls before rate calculation |
```typescript
// Production-grade circuit breaker implementation
enum CircuitState {
  CLOSED = 'CLOSED',
  OPEN = 'OPEN',
  HALF_OPEN = 'HALF_OPEN',
}

interface CircuitBreakerConfig {
  failureThreshold: number;     // Failures before opening
  failureRateThreshold: number; // Failure rate (0-1) before opening
  successThreshold: number;     // Successes in half-open before closing
  openTimeout: number;          // Ms before trying half-open
  windowSize: number;           // Sliding window size in ms
  minimumThroughput: number;    // Min calls before rate calculation
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number[] = [];  // Failure timestamps
  private successes: number[] = []; // Success timestamps
  private lastFailure: number = 0;
  private halfOpenSuccesses: number = 0;

  constructor(
    private readonly name: string,
    private readonly config: CircuitBreakerConfig
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (!this.canExecute()) {
      throw new CircuitOpenError(
        `Circuit breaker '${this.name}' is OPEN. ` +
        `Retry after ${this.getRetryAfter()}ms`
      );
    }

    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private canExecute(): boolean {
    this.pruneOldCalls();

    switch (this.state) {
      case CircuitState.CLOSED:
        return true;

      case CircuitState.OPEN:
        if (Date.now() - this.lastFailure >= this.config.openTimeout) {
          this.state = CircuitState.HALF_OPEN;
          this.halfOpenSuccesses = 0;
          console.log(`Circuit '${this.name}' → HALF_OPEN`);
          return true;
        }
        return false;

      case CircuitState.HALF_OPEN:
        return true;
    }
  }

  private recordSuccess(): void {
    const now = Date.now();
    this.successes.push(now);

    if (this.state === CircuitState.HALF_OPEN) {
      this.halfOpenSuccesses++;
      if (this.halfOpenSuccesses >= this.config.successThreshold) {
        this.state = CircuitState.CLOSED;
        this.failures = [];
        console.log(`Circuit '${this.name}' → CLOSED`);
      }
    }
  }

  private recordFailure(): void {
    const now = Date.now();
    this.failures.push(now);
    this.lastFailure = now;

    if (this.state === CircuitState.HALF_OPEN) {
      this.state = CircuitState.OPEN;
      console.log(`Circuit '${this.name}' → OPEN (half-open failure)`);
      return;
    }

    if (this.shouldOpen()) {
      this.state = CircuitState.OPEN;
      console.log(`Circuit '${this.name}' → OPEN`);
    }
  }

  private shouldOpen(): boolean {
    if (this.state !== CircuitState.CLOSED) return false;
    this.pruneOldCalls();

    const totalCalls = this.failures.length + this.successes.length;
    if (totalCalls < this.config.minimumThroughput) return false;

    // Check absolute threshold
    if (this.failures.length >= this.config.failureThreshold) return true;

    // Check rate threshold
    const failureRate = this.failures.length / totalCalls;
    return failureRate >= this.config.failureRateThreshold;
  }

  private pruneOldCalls(): void {
    const cutoff = Date.now() - this.config.windowSize;
    this.failures = this.failures.filter(t => t > cutoff);
    this.successes = this.successes.filter(t => t > cutoff);
  }

  private getRetryAfter(): number {
    const elapsed = Date.now() - this.lastFailure;
    return Math.max(0, this.config.openTimeout - elapsed);
  }

  getState(): CircuitState {
    return this.state;
  }
}

// Usage with circuit breaker per downstream service
const paymentCircuit = new CircuitBreaker('payment-service', {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  successThreshold: 3,
  openTimeout: 30000,
  windowSize: 60000,
  minimumThroughput: 10,
});

async function processPayment(order: Order): Promise<PaymentResult> {
  return paymentCircuit.execute(() =>
    paymentService.charge(order.customerId, order.total)
  );
}
```

Create separate circuit breakers for each downstream service. If the payment service is failing, you don't want the inventory service circuit to open. Bulkhead isolation combined with per-service circuits prevents localized failures from becoming system-wide outages.
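One convenient way to enforce "one circuit per downstream service" is a small registry that lazily creates and caches a breaker per service name. A minimal sketch building on the CircuitBreaker class above; the registry itself and its default thresholds are assumptions, not part of any specific library:

```typescript
// Hypothetical per-service circuit breaker registry. Reuses the CircuitBreaker
// class from the previous example; default thresholds are illustrative.
const defaultCircuitConfig: CircuitBreakerConfig = {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  successThreshold: 3,
  openTimeout: 30000,
  windowSize: 60000,
  minimumThroughput: 10,
};

class CircuitBreakerRegistry {
  private breakers = new Map<string, CircuitBreaker>();

  forService(name: string, overrides: Partial<CircuitBreakerConfig> = {}): CircuitBreaker {
    let breaker = this.breakers.get(name);
    if (!breaker) {
      breaker = new CircuitBreaker(name, { ...defaultCircuitConfig, ...overrides });
      this.breakers.set(name, breaker);
    }
    return breaker;
  }
}

// Usage: each downstream service trips independently
const circuits = new CircuitBreakerRegistry();
await circuits.forService('inventory-service').execute(() =>
  inventoryService.reserve(order.items)
);
await circuits.forService('payment-service', { openTimeout: 60000 }).execute(() =>
  paymentService.charge(order.customerId, order.total)
);
```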
Bulkheads isolate failures to prevent them from consuming all system resources. Named after ship compartments that contain flooding, bulkheads in software partition resources so a failure in one area doesn't sink the entire ship.
Thread Pool Isolation: Separate thread pools for each downstream dependency. If the payment service is slow, only its dedicated threads block; every other dependency keeps its capacity.
Connection Pool Isolation: Separate connection pools per downstream service. One service's connection leak doesn't starve others (see the agent sketch after this list).
Process/Container Isolation: Different functions run in separate processes/containers. A memory leak in one doesn't crash the others.
Queue Isolation: Separate queues for different priority traffic. Bulk operations don't block interactive requests.
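As a sketch of connection pool isolation in Node.js, each downstream service can get its own keep-alive `http.Agent` with its own socket cap, so a connection leak or slowdown in one service cannot starve the others. The socket limits and service hostnames below are illustrative assumptions:

```typescript
// Connection pool isolation sketch using Node's http.Agent
import http from 'http';

const agents = {
  payment: new http.Agent({ keepAlive: true, maxSockets: 20 }),
  inventory: new http.Agent({ keepAlive: true, maxSockets: 50 }),
  shipping: new http.Agent({ keepAlive: true, maxSockets: 30 }),
};

// Pass the per-service agent to the HTTP client for that dependency
function getJson<T>(host: string, path: string, agent: http.Agent): Promise<T> {
  return new Promise((resolve, reject) => {
    const req = http.get({ host, path, agent }, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body) as T));
    });
    req.on('error', reject);
  });
}

// Usage: inventory calls draw from the inventory pool only
const availability = await getJson('inventory-service', '/availability/42', agents.inventory);
```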
```typescript
// Bulkhead using semaphore pattern for concurrency limiting
class Bulkhead {
  private readonly permits: number;
  private available: number;
  private readonly queue: Array<{
    resolve: () => void;
    reject: (error: Error) => void;
    timeout: NodeJS.Timeout;
  }> = [];

  constructor(
    private readonly name: string,
    permits: number,
    private readonly queueLimit: number = 100,
    private readonly queueTimeout: number = 5000
  ) {
    this.permits = permits;
    this.available = permits;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await operation();
    } finally {
      this.release();
    }
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }

    if (this.queue.length >= this.queueLimit) {
      throw new BulkheadFullError(
        `Bulkhead '${this.name}' is full. ` +
        `Permits: ${this.permits}, Queue: ${this.queue.length}`
      );
    }

    return new Promise<void>((resolve, reject) => {
      const timeout = setTimeout(() => {
        const index = this.queue.findIndex(w => w.resolve === resolve);
        if (index !== -1) {
          this.queue.splice(index, 1);
          reject(new BulkheadTimeoutError(
            `Bulkhead '${this.name}' queue timeout after ${this.queueTimeout}ms`
          ));
        }
      }, this.queueTimeout);

      this.queue.push({ resolve, reject, timeout });
    });
  }

  private release(): void {
    if (this.queue.length > 0) {
      const waiter = this.queue.shift()!;
      clearTimeout(waiter.timeout);
      waiter.resolve();
    } else {
      this.available++;
    }
  }

  getMetrics(): BulkheadMetrics {
    return {
      name: this.name,
      permits: this.permits,
      available: this.available,
      queueSize: this.queue.length,
      queueLimit: this.queueLimit,
    };
  }
}

// Create bulkheads per downstream service
const bulkheads = {
  payment: new Bulkhead('payment', 20),     // Max 20 concurrent payment calls
  inventory: new Bulkhead('inventory', 50), // Max 50 concurrent inventory calls
  shipping: new Bulkhead('shipping', 30),   // Max 30 concurrent shipping calls
};

async function processOrder(order: Order): Promise<void> {
  // Each call respects its bulkhead limit
  // Payment service slowdown won't block inventory checks
  const [inventoryResult, _] = await Promise.all([
    bulkheads.inventory.execute(() =>
      inventoryService.reserve(order.items)
    ),
    // Payment might be slow, but won't consume inventory bulkhead
  ]);

  // Sequential payment (after inventory confirmed)
  const paymentResult = await bulkheads.payment.execute(() =>
    paymentService.charge(order.customerId, order.total)
  );

  // Shipping can proceed independently
  await bulkheads.shipping.execute(() =>
    shippingService.createShipment(order)
  );
}
```

When errors occur deep in a service chain (A → B → C → D), how should they propagate back to the original caller? Poor error propagation leads to debugging nightmares.
1. Preserve Original Error Information: Wrap errors rather than replacing them. The root cause should be discoverable.
2. Add Context at Each Layer: Each service should add its perspective, such as what operation failed and what inputs were provided.
3. Translate to Appropriate Level: Internal errors shouldn't leak implementation details. Map to domain-appropriate error types.
4. Include Correlation IDs: Every request should carry a correlation ID for distributed tracing (see the sketch after this list).
5. Distinguish Client vs Server Errors: Clearly indicate whether the caller did something wrong (4xx) or the system failed (5xx).
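To make principle 4 concrete, here is a minimal sketch of correlation ID propagation using Express-style middleware and Node's built-in `fetch`. The `x-correlation-id` header name is a common convention, and the middleware, route, and downstream URL are illustrative assumptions:

```typescript
// Illustrative correlation ID propagation (Express-style middleware).
import express from 'express';
import { randomUUID } from 'crypto';

const CORRELATION_HEADER = 'x-correlation-id';
const app = express();

// Inbound: reuse the caller's correlation ID or create one at the edge
app.use((req, res, next) => {
  const correlationId = req.header(CORRELATION_HEADER) ?? randomUUID();
  res.locals.correlationId = correlationId;         // Available to handlers and loggers
  res.setHeader(CORRELATION_HEADER, correlationId); // Echo back to the caller
  next();
});

// Outbound: forward the same ID on every downstream call
async function callDownstream(url: string, correlationId: string): Promise<unknown> {
  const response = await fetch(url, {
    headers: { [CORRELATION_HEADER]: correlationId },
  });
  if (!response.ok) {
    // Log with the correlation ID so the failure can be traced across services
    console.error(`Downstream call failed (${response.status})`, { correlationId });
    throw new Error(`Downstream call failed: ${response.status}`);
  }
  return response.json();
}

app.get('/orders/:id', async (req, res) => {
  const correlationId = res.locals.correlationId as string;
  const availability = await callDownstream(
    `http://inventory-service/availability/${req.params.id}`, // hypothetical URL
    correlationId
  );
  res.json(availability);
});
```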
```typescript
// Structured error response format
interface ServiceError {
  // Machine-readable code for programmatic handling
  code: string;
  // Human-readable message (can be shown to users for 4xx)
  message: string;
  // Detailed description (for developers/logs)
  details?: string;
  // Field-level errors for validation failures
  fieldErrors?: Array<{
    field: string;
    code: string;
    message: string;
  }>;
  // For debugging (not exposed to external clients)
  internal?: {
    correlationId: string;
    service: string;
    timestamp: string;
    cause?: ServiceError; // Chain of errors
    stackTrace?: string;  // Only in development
  };
}

// Error class with context chaining
class ServiceException extends Error {
  constructor(
    public readonly code: string,
    message: string,
    public readonly status: number,
    public readonly cause?: Error,
    public readonly fieldErrors?: Array<{ field: string; code: string; message: string }>
  ) {
    super(message);
    this.name = 'ServiceException';
  }

  toResponse(correlationId: string, includeInternal: boolean): ServiceError {
    const response: ServiceError = {
      code: this.code,
      message: this.message,
      fieldErrors: this.fieldErrors,
    };

    if (includeInternal) {
      response.internal = {
        correlationId,
        service: process.env.SERVICE_NAME || 'unknown',
        timestamp: new Date().toISOString(),
        stackTrace: this.stack,
      };
      if (this.cause instanceof ServiceException) {
        response.internal.cause = this.cause.toResponse(correlationId, true);
      }
    }

    return response;
  }
}

// Error translation between services
function translateDownstreamError(
  error: unknown,
  operation: string
): ServiceException {
  if (error instanceof ServiceException) {
    // Wrap downstream error with context
    return new ServiceException(
      'DOWNSTREAM_ERROR',
      `${operation} failed: ${error.message}`,
      error.status >= 500 ? 502 : error.status, // Map 5xx to 502 (Bad Gateway)
      error
    );
  }

  if (error instanceof HttpError) {
    // HTTP error from downstream service
    const isRetryable = error.status >= 500 || error.status === 429;
    return new ServiceException(
      isRetryable ? 'DOWNSTREAM_UNAVAILABLE' : 'DOWNSTREAM_REJECTED',
      `${operation} failed with status ${error.status}`,
      isRetryable ? 503 : 502,
      error
    );
  }

  if (error instanceof TimeoutError) {
    return new ServiceException(
      'DOWNSTREAM_TIMEOUT',
      `${operation} timed out`,
      504, // Gateway Timeout
      error
    );
  }

  // Unknown error type
  return new ServiceException(
    'INTERNAL_ERROR',
    `${operation} failed unexpectedly`,
    500,
    error instanceof Error ? error : new Error(String(error))
  );
}

// Usage in service layer
async function createOrder(request: CreateOrderRequest): Promise<Order> {
  // Validate input
  const validationErrors = validateOrderRequest(request);
  if (validationErrors.length > 0) {
    throw new ServiceException(
      'VALIDATION_ERROR',
      'Order validation failed',
      400,
      undefined,
      validationErrors
    );
  }

  try {
    // Check inventory (downstream call)
    const inventory = await inventoryService.check(request.items);
    if (!inventory.allAvailable) {
      throw new ServiceException(
        'INSUFFICIENT_INVENTORY',
        'Some items are not available',
        409, // Conflict
        undefined,
        inventory.unavailableItems.map(item => ({
          field: `items[${item.productId}]`,
          code: 'INSUFFICIENT_STOCK',
          message: `Only ${item.available} available`,
        }))
      );
    }
  } catch (error) {
    if (error instanceof ServiceException) throw error;
    throw translateDownstreamError(error, 'Inventory check');
  }

  try {
    // Process payment (downstream call)
    await paymentService.charge(request.customerId, calculateTotal(request));
  } catch (error) {
    if (error instanceof ServiceException) throw error;
    throw translateDownstreamError(error, 'Payment processing');
  }

  // Create order (local operation)
  return orderRepository.create(request);
}
```

Internal error details (stack traces, database errors, infrastructure info) must never reach external clients. Use error translation layers to map internal errors to safe external responses. Log the full details server-side with correlation IDs for debugging.
When downstream services fail, complete failure isn't always the best option. Graceful degradation provides reduced functionality rather than no functionality.
Fallback to Cache: Return cached data when live data is unavailable. Stale data is often better than no data.
Fallback to Default: Return sensible defaults when specific data is unavailable.
Fallback to Degraded Functionality: Skip optional features when their services are unavailable.
Fallback to Queue: Queue operations for later processing when services are unavailable (see the sketch after this list).
Static Fallback: Return pre-computed static responses.
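As a sketch of the queue fallback, a write operation can be captured and replayed when the downstream service recovers. The `notificationService` client, in-memory queue, and drain interval below are illustrative; production systems would use a durable queue such as a message broker or an outbox table:

```typescript
// Illustrative queue fallback: if the notification service is down, enqueue the
// work for later instead of failing the request.
interface PendingNotification { orderId: string; email: string; enqueuedAt: number; }

const pendingNotifications: PendingNotification[] = [];

async function sendOrQueueNotification(orderId: string, email: string): Promise<void> {
  try {
    await notificationService.sendOrderConfirmation(orderId, email); // hypothetical client
  } catch {
    // Degrade: accept the order anyway and retry the notification later
    pendingNotifications.push({ orderId, email, enqueuedAt: Date.now() });
    console.warn('Notification queued for later delivery', { orderId });
  }
}

// Background drain loop: replay queued work when the service recovers
async function drainNotificationQueue(): Promise<void> {
  while (pendingNotifications.length > 0) {
    const next = pendingNotifications[0];
    try {
      await notificationService.sendOrderConfirmation(next.orderId, next.email);
      pendingNotifications.shift(); // Only remove after successful delivery
    } catch {
      break; // Still failing; try again on the next drain cycle
    }
  }
}

setInterval(() => void drainNotificationQueue(), 30_000); // hypothetical drain interval
```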
```typescript
// Graceful degradation with fallback strategies
interface DegradationPolicy<T> {
  strategy: 'cache' | 'default' | 'skip' | 'queue';
  fallback?: T | (() => T | Promise<T>);
  shouldDegrade: (error: Error) => boolean;
}

async function withDegradation<T>(
  operation: () => Promise<T>,
  policy: DegradationPolicy<T>,
  context: { operationName: string; correlationId: string }
): Promise<{ result: T; degraded: boolean }> {
  try {
    const result = await operation();
    return { result, degraded: false };
  } catch (error) {
    if (!policy.shouldDegrade(error as Error)) {
      throw error; // Not a degradable error
    }

    console.warn(
      `Operation '${context.operationName}' degrading: ${(error as Error).message}`,
      { correlationId: context.correlationId }
    );

    const fallbackResult = await resolveFallback(policy);
    return { result: fallbackResult, degraded: true };
  }
}

async function resolveFallback<T>(policy: DegradationPolicy<T>): Promise<T> {
  if (policy.fallback === undefined) {
    throw new Error(`No fallback defined for strategy: ${policy.strategy}`);
  }

  return typeof policy.fallback === 'function'
    ? await (policy.fallback as () => T | Promise<T>)()
    : policy.fallback;
}

// Product page with multiple degradation strategies
async function getProductPage(productId: string): Promise<ProductPage> {
  const correlationId = generateCorrelationId();

  // Core product data - no degradation, must succeed
  const product = await productService.getProduct(productId);

  // Reviews - degrade to empty if service unavailable
  const { result: reviews, degraded: reviewsDegraded } = await withDegradation(
    () => reviewService.getProductReviews(productId),
    {
      strategy: 'default',
      fallback: { reviews: [], averageRating: null, totalCount: 0 },
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getProductReviews', correlationId }
  );

  // Recommendations - degrade to popular items
  const { result: recommendations, degraded: recsDegraded } = await withDegradation(
    () => recommendationService.getForProduct(productId),
    {
      strategy: 'cache',
      fallback: () => getCachedPopularProducts(),
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getRecommendations', correlationId }
  );

  // Inventory - degrade to showing "check availability"
  const { result: inventory, degraded: inventoryDegraded } = await withDegradation(
    () => inventoryService.getAvailability(productId),
    {
      strategy: 'default',
      fallback: { available: null, message: 'Check availability in store' },
      shouldDegrade: (err) => err instanceof ServiceUnavailableError,
    },
    { operationName: 'getInventory', correlationId }
  );

  return {
    product,
    reviews,
    recommendations,
    inventory,
    _degraded: {
      reviews: reviewsDegraded,
      recommendations: recsDegraded,
      inventory: inventoryDegraded,
    },
  };
}
```

When serving degraded responses, include metadata indicating what's degraded. This helps UIs display appropriate warnings ("Prices may not be current") and helps monitoring systems track degradation frequency. The response should make degradation explicit, not invisible.
Error handling in distributed systems is fundamentally different from monolithic applications. Partial failures, network uncertainty, and cascade effects require deliberate architectural patterns rather than simple exception handling.
Let's consolidate the key insights:
Partial failure is the normal operating mode of distributed systems; design for it rather than around it.
A read timeout leaves the outcome unknown, so every mutating operation must be safe to retry idempotently.
Retry with exponential backoff and jitter, and cap total retry load with budgets to avoid amplification.
Circuit breakers fail fast and give struggling services room to recover; use one per downstream dependency.
Bulkheads partition resources so one slow dependency can't exhaust the whole system.
Propagate errors with context and correlation IDs, translating internal details into safe external responses.
Prefer graceful degradation, with explicit metadata, over complete failure.
What's next:
As services evolve, their APIs must change. But changing APIs in a distributed system risks breaking all dependent services. The next page explores service versioning strategies that enable evolution without breaking changes—keeping the promise of independent deployability.
You now understand how to handle errors gracefully in distributed systems. You can implement retry strategies, circuit breakers, and bulkheads, propagate errors meaningfully across service boundaries, and design systems that degrade gracefully rather than failing completely. Next, we'll tackle service versioning—evolving APIs without breaking consumers.