Retries and circuit breakers are both fundamental resilience patterns—but combining them incorrectly is one of the most common causes of production incidents in distributed systems. Retries without circuit breakers can create thundering herds that amplify failures. Circuit breakers without retries may fail requests that could have succeeded on a second attempt. And retries inside circuit breakers can prevent circuits from opening when they should.
This page addresses the subtle but critical interplay between these patterns. We'll explore why ordering matters, how to configure retry budgets, and the mathematical considerations behind intelligent retry strategies. By the end, you'll understand how to combine these patterns for maximum resilience without creating new failure modes.
By the end of this page, you will understand why retries must wrap circuit breakers (not the reverse), how to implement exponential backoff with jitter, the concept of retry budgets for system-wide protection, idempotency requirements for safe retries, and production-ready configurations for common scenarios.
The most common mistake when combining retries with circuit breakers is incorrect ordering. The difference between "retry wrapping circuit breaker" and "circuit breaker wrapping retry" has profound implications for system behavior.
The Wrong Way: Retry Inside Circuit Breaker
In this (incorrect) arrangement, the retry logic is inside the circuit breaker. This causes several problems:
Problem 1: One failure counts as multiple failures
When a call fails and is retried 3 times, all 3 failures are counted by the circuit breaker. A single transient error becomes 3 failures in the circuit's statistics, making it trip prematurely.
Problem 2: Increased load on failing service
When the downstream service is degraded, the retry logic amplifies load precisely when the service can least handle it. The circuit sees this amplified failure rate and trips—but not before the retries have made things worse.
Problem 3: Longer blocking when circuit should be open
If you make a request when the circuit is about to trip, the retry logic executes all retries before the circuit opens. The user experiences the full retry timeout before getting an error.
The Right Way: Retry Outside Circuit Breaker
In this (correct) arrangement, the retry logic wraps the circuit breaker:
Benefit 1: Failures are counted correctly
Each request through the circuit breaker is a single attempt. Retries are handled at the outer layer, so the circuit sees accurate failure rates.
Benefit 2: Circuit open exception stops retries
When the circuit opens, the retry logic receives the circuit-open exception (CallNotPermittedException in Resilience4j). It can choose not to retry (retrying is pointless while the circuit is open) and immediately invoke fallback behavior.
Benefit 3: Retries don't amplify load on failing service
Once the circuit opens, retries are handled by returning circuit-open exceptions—no additional load is sent to the failing service.
```java
// CORRECT: Retry wraps CircuitBreaker
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("service");
Retry retry = Retry.of("service", RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    // Don't retry when circuit is open!
    .retryOnException(e -> !(e instanceof CallNotPermittedException))
    .build());

Supplier<Product> decoratedSupplier = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withCircuitBreaker(circuitBreaker)  // Inner: circuit breaker
    .withRetry(retry)                    // Outer: retry
    .decorate();

// The call flows:
// 1. Retry (outer) receives request
// 2. Retry calls CircuitBreaker (inner)
// 3. If circuit closed, call proceeds to service
// 4. If service fails, exception propagates to Retry
// 5. Retry checks if exception is retryable
// 6. If CallNotPermittedException (circuit open), don't retry
// 7. Otherwise, retry with backoff

// WRONG (DON'T DO THIS): CircuitBreaker wraps Retry
Supplier<Product> wrongOrder = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withRetry(retry)                    // Inner: retry (WRONG)
    .withCircuitBreaker(circuitBreaker)  // Outer: circuit breaker (WRONG)
    .decorate();
// This counts every retry as a separate failure!
```

When configuring retry logic, ALWAYS exclude circuit-open exceptions from retry. Retrying when the circuit is open is pure waste—the circuit will still be open, and you're just consuming resources. The whole point of the circuit opening is to fail fast.
When retrying failed requests, the timing between retries significantly impacts both success probability and system load. Naive retry strategies (immediate retry, fixed delay) can cause problems. Exponential backoff with jitter is the gold standard.
The Thundering Herd Problem
Consider a scenario where 1000 clients are waiting for a service. The service briefly goes down, then comes back up. If all clients retry immediately and simultaneously, the recovering service absorbs 1000 requests in the same instant, a spike it often cannot handle before it has finished warming up, so it falls over again.
This is the "thundering herd"—synchronized retries create load spikes that prevent recovery.
Exponential Backoff
Exponential backoff increases wait time between retries exponentially:
wait_time = base_delay × 2^(attempt - 1)
Attempt 1: 500ms × 2^0 = 500ms
Attempt 2: 500ms × 2^1 = 1000ms
Attempt 3: 500ms × 2^2 = 2000ms
Attempt 4: 500ms × 2^3 = 4000ms
This spreads retries over time, reducing peak load. But there's still a problem: if all clients started at the same time, they all retry at the same times (all at 500ms, all at 1500ms cumulative, etc.).
Adding Jitter
Jitter adds randomness to the backoff calculation, desynchronizing retry attempts:
wait_time = base_delay × 2^(attempt - 1) × random(0.5, 1.5)
Client A Attempt 2: 1000ms × 0.7 = 700ms
Client B Attempt 2: 1000ms × 1.2 = 1200ms
Client C Attempt 2: 1000ms × 0.95 = 950ms
Now clients retry at different times, smoothing the load.
```java
// Different jitter strategies in Resilience4j
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.core.IntervalFunction;

import java.time.Duration;

// 1. EXPONENTIAL BACKOFF (no jitter)
RetryConfig exponentialOnly = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialBackoff(
        Duration.ofMillis(500),  // Initial interval
        2.0                      // Multiplier
    ))
    .build();
// Produces: 500ms, 1000ms, 2000ms, 4000ms

// 2. EXPONENTIAL BACKOFF WITH RANDOM JITTER
RetryConfig exponentialWithJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),  // Initial interval
        2.0,                     // Multiplier
        0.5                      // Randomization factor (±50%)
    ))
    .build();
// Produces: 250-750ms, 500-1500ms, 1000-3000ms, 2000-6000ms

// 3. DECORRELATED JITTER (AWS recommendation)
// Each retry's delay is random between base delay and 3× previous delay
RetryConfig decorrelatedJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> {
        long baseMs = 500;
        long maxMs = 60_000;
        long previousDelay = attempt == 1
            ? baseMs
            : (long) (baseMs * Math.pow(3, attempt - 2));
        long delay = (long) (baseMs + Math.random() * (previousDelay * 3 - baseMs));
        return Math.min(delay, maxMs);  // IntervalFunction returns the wait in milliseconds
    })
    .build();

// 4. EQUAL JITTER (hybrid approach)
// Half exponential, half random
RetryConfig equalJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> {
        long base = 500;
        long exponential = base * (1L << (attempt - 1));  // 2^(attempt-1)
        long halfExponential = exponential / 2;
        long delay = halfExponential + (long) (Math.random() * halfExponential);
        return Math.min(delay, 60_000L);  // IntervalFunction returns the wait in milliseconds
    })
    .build();
// Provides minimum guaranteed backoff with random component
```

| Strategy | Delay Formula | Characteristics |
|---|---|---|
| No Jitter | base × 2^(attempt-1) | Synchronized retries; thundering herd risk |
| Full Jitter | random(0, base × 2^(attempt-1)) | Maximum spread; minimum guaranteed delay is 0 |
| Equal Jitter | half_exp + random(0, half_exp) | Guaranteed minimum with randomization |
| Decorrelated Jitter | random(base, prev_delay × 3) | AWS recommended; good balance |
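To make the table's formulas concrete, here is a minimal sketch of full jitter and equal jitter as plain Java helpers. The class and method names are illustrative only, not part of any library, and the values assume the 500ms base delay used throughout this page.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative backoff helpers for the jitter formulas above (not library code). */
public final class BackoffDelays {

    /** Full jitter: random(0, base × 2^(attempt-1)), capped at maxDelay. */
    static Duration fullJitter(int attempt, Duration base, Duration maxDelay) {
        long exp = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(exp + 1));
    }

    /** Equal jitter: half the exponential delay guaranteed, the other half randomized. */
    static Duration equalJitter(int attempt, Duration base, Duration maxDelay) {
        long exp = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        long half = exp / 2;
        return Duration.ofMillis(half + ThreadLocalRandom.current().nextLong(half + 1));
    }

    public static void main(String[] args) {
        Duration base = Duration.ofMillis(500);
        Duration cap = Duration.ofSeconds(30);
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.printf("attempt %d: full=%dms equal=%dms%n",
                attempt,
                fullJitter(attempt, base, cap).toMillis(),
                equalJitter(attempt, base, cap).toMillis());
        }
    }
}
```

Running the main method a few times shows the trade-off: full jitter spreads delays across the entire range (including very short waits), while equal jitter never drops below half the exponential delay.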
Always cap the maximum delay. Without a cap, exponential backoff can produce extremely long waits (500ms × 2^10 ≈ 8.5 minutes). Typical caps are 30-60 seconds. Beyond that, it's better to fail and let higher-level retry mechanisms take over.
Individual retry configuration is important, but it's not enough. When many clients retry independently, the aggregate effect can still overwhelm services. Retry budgets provide system-level retry governance.
The Problem: Cumulative Retry Load
Consider a service with 100 clients, each configured to retry 3 times. At a baseline of 1000 req/s, an incident that fails half of those requests triggers retries for every failure: 500 failed requests × 3 retries adds roughly 1500 req/s of retry traffic on top of the original load.
The service that was struggling at 1000 req/s is now receiving roughly 2500 req/s. The retries intended to recover from failure are causing complete collapse.
Retry Budgets: The Solution
A retry budget limits the proportion of requests that can be retries. Instead of fixed retry counts, you constrain:
retry_ratio = retries / (original_requests + retries)
With a 20% retry budget, retries may make up at most 20% of total traffic. At 1000 original requests per second, that allows roughly 250 retries per second (250 / (1000 + 250) = 20%); once the budget is exhausted, failed requests are not retried and fail immediately.
```java
/**
 * A token-bucket based retry budget.
 * Limits retries to a percentage of total traffic.
 */
public class RetryBudget {

    private final double retryRatio;    // Target retry percentage
    private final long windowMs;        // Time window for calculation
    private final AtomicLong attempts;  // Total attempts in window
    private final AtomicLong retries;   // Retries in window
    private final AtomicLong windowStart;

    public RetryBudget(double retryRatio, Duration window) {
        this.retryRatio = retryRatio;
        this.windowMs = window.toMillis();
        this.attempts = new AtomicLong(0);
        this.retries = new AtomicLong(0);
        this.windowStart = new AtomicLong(System.currentTimeMillis());
    }

    /**
     * Record an original (non-retry) attempt.
     */
    public void recordAttempt() {
        maybeResetWindow();
        attempts.incrementAndGet();
    }

    /**
     * Check if a retry is allowed and record it if so.
     */
    public boolean tryAcquireRetry() {
        maybeResetWindow();
        long currentAttempts = attempts.get();
        long currentRetries = retries.get();
        long total = currentAttempts + currentRetries;

        if (total == 0) {
            // No attempts yet, allow retry
            retries.incrementAndGet();
            return true;
        }

        // Check if adding a retry would exceed budget
        double potentialRetryRatio = (currentRetries + 1.0) / (total + 1.0);
        if (potentialRetryRatio <= retryRatio) {
            retries.incrementAndGet();
            return true;
        }

        return false; // Budget exhausted
    }

    private void maybeResetWindow() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start > windowMs) {
            // Reset window
            if (windowStart.compareAndSet(start, now)) {
                attempts.set(0);
                retries.set(0);
            }
        }
    }

    public double getCurrentRetryRatio() {
        long total = attempts.get() + retries.get();
        return total == 0 ? 0.0 : (double) retries.get() / total;
    }
}

// Usage
RetryBudget budget = new RetryBudget(0.20, Duration.ofSeconds(10));

public CompletableFuture<Response> callWithBudgetedRetry(Request request) {
    budget.recordAttempt(); // Original attempt

    return circuitBreaker.executeCompletableFuture(() ->
        httpClient.send(request)
    ).exceptionallyCompose(error -> {
        if (error instanceof CallNotPermittedException) {
            // Circuit open, no retry
            return fallback(request);
        }

        if (budget.tryAcquireRetry()) {
            // Budget allows retry
            return callWithBudgetedRetry(request); // Recursive retry
        } else {
            // Budget exhausted
            log.warn("Retry budget exhausted, failing request");
            return fallback(request);
        }
    });
}
```

Service Mesh Retry Budgets
Service meshes like Linkerd implement retry budgets at the infrastructure layer:
```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: product-service-route
spec:
  parentRefs:
    - name: product-service
      kind: Service
  rules:
    - backendRefs:
        - name: product-service
          port: 80
      timeouts:
        request: 10s
      retry:
        limit:
          # Retry budget: max 20% of traffic can be retries
          retryRatio: 0.2
          # Allow at least this many retries per second regardless of the ratio
          minRetriesPerSecond: 10
        conditions:
          - statusCodes:
              - 502
              - 503
              - 504
          - timeouts: true
        backoff:
          baseInterval: 25ms
          maxInterval: 250ms
```

Retry budgets are strictly superior to fixed retry counts for distributed systems. Fixed counts ("retry 3 times") multiply load during failures, and the multiplier compounds across every client and every tier of the call chain. Budgets ("retries should be ≤20% of traffic") keep retry load proportional to traffic regardless of client count.
Retries are only safe if the operation being retried is idempotent—meaning executing it multiple times has the same effect as executing it once. Without idempotency guarantees, retries can cause duplicate processing, double charges, or data corruption.
The Retry Safety Problem
Consider this scenario: a client submits a payment, the service processes the charge successfully, but the response is lost to a network timeout. The client sees only a failure, retries, and the service charges the customer a second time.
The retry caused a correctness bug because the operation wasn't idempotent.
Idempotency Strategies
| Strategy | How It Works | Trade-offs |
|---|---|---|
| Idempotency Key | Client provides unique key; server deduplicates | Client must generate keys; server stores keys |
| Natural Idempotency | Design operations to be naturally idempotent | Not always possible; requires careful API design |
| Conditional Requests | Use ETags/versions for conditional updates | Adds complexity; requires version tracking |
| At-Least-Once + Dedup | Accept duplicates; deduplicate downstream | Processing overhead; eventual consistency |
```java
@RestController
@RequiredArgsConstructor  // Lombok generates the constructor for the final fields
@Slf4j                    // Lombok provides the 'log' field
public class PaymentController {

    private final PaymentService paymentService;
    private final IdempotencyStore idempotencyStore;

    @PostMapping("/payments")
    public ResponseEntity<PaymentResult> createPayment(
            @RequestHeader("Idempotency-Key") String idempotencyKey,
            @RequestBody PaymentRequest request) {

        // Check if we've seen this idempotency key before
        Optional<PaymentResult> cachedResult = idempotencyStore.get(idempotencyKey);
        if (cachedResult.isPresent()) {
            // Return cached result - this is a retry
            log.info("Returning cached result for idempotency key: {}", idempotencyKey);
            return ResponseEntity.ok(cachedResult.get());
        }

        // Try to acquire lock on idempotency key
        if (!idempotencyStore.tryLock(idempotencyKey, Duration.ofMinutes(5))) {
            // Another request with same key is in progress
            return ResponseEntity.status(HttpStatus.CONFLICT)
                .body(PaymentResult.inProgress());
        }

        try {
            // Process the payment
            PaymentResult result = paymentService.processPayment(request);

            // Store result for future retries
            idempotencyStore.store(idempotencyKey, result, Duration.ofHours(24));

            return ResponseEntity.ok(result);
        } finally {
            idempotencyStore.unlock(idempotencyKey);
        }
    }
}

@Component
@RequiredArgsConstructor
public class RedisIdempotencyStore implements IdempotencyStore {

    private final RedisTemplate<String, String> redis;
    private final ObjectMapper objectMapper;

    @Override
    public Optional<PaymentResult> get(String key) {
        String value = redis.opsForValue().get("idempotency:" + key);
        if (value == null) return Optional.empty();
        try {
            return Optional.of(objectMapper.readValue(value, PaymentResult.class));
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Corrupt idempotency record for key " + key, e);
        }
    }

    @Override
    public boolean tryLock(String key, Duration timeout) {
        return Boolean.TRUE.equals(
            redis.opsForValue().setIfAbsent("lock:" + key, "locked", timeout));
    }

    @Override
    public void store(String key, PaymentResult result, Duration ttl) {
        try {
            String value = objectMapper.writeValueAsString(result);
            redis.opsForValue().set("idempotency:" + key, value, ttl);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Could not serialize payment result", e);
        }
    }

    @Override
    public void unlock(String key) {
        redis.delete("lock:" + key);
    }
}
```

Naturally Idempotent Operations
Some operations are naturally idempotent and safe to retry without additional mechanisms:
| Idempotent | Not Idempotent |
|---|---|
| GET requests | POST requests (typically) |
| PUT (replace entire resource) | PATCH (partial updates) |
| DELETE (by ID) | Counter increments |
| Set value to X | Add X to value |
| "Create if not exists" | "Create" (may create duplicates) |
Design for Idempotency
When possible, design your APIs to be naturally idempotent:
```
// NON-IDEMPOTENT: Add $100 to balance
POST /accounts/{id}/balance
{ "amount": 100 }

// IDEMPOTENT: Set balance to $500 (requires knowing current state)
PUT /accounts/{id}/balance
{ "balance": 500, "version": 42 }

// IDEMPOTENT: Use idempotency key for operations that can't be naturally idempotent
POST /accounts/{id}/transactions
Idempotency-Key: tx-12345
{ "amount": 100, "type": "credit" }
```
If you're retrying operations that modify state (POST, PUT, PATCH, DELETE), ALWAYS ensure idempotency protection is in place. The consequences of duplicate execution can range from minor (duplicate notifications) to severe (double charges, data corruption).
Let's bring together everything we've learned into a complete, production-ready integration of circuit breakers and retries.
The Full Stack
```
┌─────────────────────────────────────────────────────────────┐
│                        Request Flow                          │
├─────────────────────────────────────────────────────────────┤
│  1. Rate Limiter (optional)  - Prevent overload              │
│  2. Retry (with budget)      - Outer layer, handles retryable│
│  3. Circuit Breaker          - Fast fail if unhealthy        │
│  4. Bulkhead                 - Isolate resources             │
│  5. Timeout                  - Bound execution time          │
│  6. Actual Call              - The real HTTP/RPC call        │
└─────────────────────────────────────────────────────────────┘
```
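Before turning to configuration, it helps to see how this layering maps onto Resilience4j's Decorators API for a synchronous call. The sketch below is illustrative: each with* call wraps the previous layer, so the innermost decorator is added first; the timeout layer requires async composition and is omitted here, and productService, productId, and Product are placeholders carried over from earlier examples.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

// Sketch: compose the layers from the diagram for a synchronous call.
Bulkhead bulkhead = Bulkhead.ofDefaults("productService");
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("productService");
Retry retry = Retry.ofDefaults("productService");

Supplier<Product> resilientCall = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withBulkhead(bulkhead)             // innermost: isolate resources
    .withCircuitBreaker(circuitBreaker) // next: fail fast when unhealthy
    .withRetry(retry)                   // outermost: retry transient failures
    .decorate();

Product product = resilientCall.get();
```

Note the inversion: the decorator added last (Retry) ends up outermost at call time, which is exactly the ordering the diagram prescribes.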
Production Configuration
```yaml
resilience4j:
  # Circuit Breaker Configuration
  circuitbreaker:
    configs:
      default:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 100
        minimumNumberOfCalls: 10
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - org.springframework.web.client.HttpClientErrorException
    instances:
      productService:
        baseConfig: default
      paymentService:
        baseConfig: default
        failureRateThreshold: 30       # More conservative for payments
        waitDurationInOpenState: 60s

  # Retry Configuration
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        exponentialMaxWaitDuration: 5s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          # CRITICAL: Don't retry when circuit is open
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
          - org.springframework.web.client.HttpClientErrorException
    instances:
      productService:
        baseConfig: default
      paymentService:
        baseConfig: default
        maxAttempts: 2                 # Fewer retries for sensitive operations

  # Bulkhead Configuration
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      productService:
        baseConfig: default
      paymentService:
        maxConcurrentCalls: 10         # More restrictive for payments

  # Time Limiter Configuration
  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      productService:
        baseConfig: default
      paymentService:
        timeoutDuration: 10s           # Longer for payment processing
```
```java
@Service
@Slf4j
public class ProductServiceClient {

    private final WebClient webClient;
    // Collaborators such as productCache and productQueue are assumed to be
    // injected elsewhere; they are omitted here for brevity.

    public ProductServiceClient(WebClient.Builder webClientBuilder) {
        this.webClient = webClientBuilder
            .baseUrl("http://product-service")
            .build();
    }

    /**
     * Get product with full resilience stack.
     * Decorator order: Retry → CircuitBreaker → Bulkhead → TimeLimiter → Call
     */
    @Retry(name = "productService")
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @Bulkhead(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> getProduct(String productId) {
        return webClient.get()
            .uri("/products/{id}", productId)
            .retrieve()
            .bodyToMono(Product.class)
            .toFuture();
    }

    /**
     * Fallback when circuit is open or all retries exhausted.
     */
    private CompletableFuture<Product> getProductFallback(
            String productId, Throwable throwable) {

        if (throwable instanceof CallNotPermittedException) {
            log.warn("Circuit open for product service, using fallback");
            // Circuit is open - return cached data
            return CompletableFuture.completedFuture(
                productCache.get(productId).orElse(Product.placeholder(productId))
            );
        }

        if (throwable instanceof BulkheadFullException) {
            log.warn("Bulkhead full for product service, using fallback");
            // Too many concurrent requests
            return CompletableFuture.completedFuture(
                Product.placeholder(productId)
            );
        }

        if (throwable instanceof TimeoutException) {
            log.warn("Request timed out for product {}", productId);
        }

        // Default fallback
        log.error("Product service call failed for {}", productId, throwable);
        return CompletableFuture.completedFuture(
            Product.unavailable(productId)
        );
    }

    /**
     * Create product with idempotency key.
     * Uses fewer retries to avoid duplicate creation issues.
     */
    @Retry(name = "productService", fallbackMethod = "createProductFallback")
    @CircuitBreaker(name = "productService")
    @Bulkhead(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> createProduct(
            String idempotencyKey, ProductCreateRequest request) {

        return webClient.post()
            .uri("/products")
            .header("Idempotency-Key", idempotencyKey)
            .bodyValue(request)
            .retrieve()
            .bodyToMono(Product.class)
            .toFuture();
    }

    private CompletableFuture<Product> createProductFallback(
            String idempotencyKey, ProductCreateRequest request, Throwable throwable) {

        // For creates, queue the operation for later processing
        log.error("Product creation failed, queuing for retry", throwable);
        productQueue.enqueue(idempotencyKey, request);

        return CompletableFuture.failedFuture(
            new ServiceTemporarilyUnavailableException(
                "Product creation queued for processing"
            )
        );
    }
}
```

With Spring annotations, the ordering is determined by the order values of the Resilience4j aspects. The default is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead (outside to inside), which is also the recommended order. You can customize it via properties such as resilience4j.circuitbreaker.circuitBreakerAspectOrder.
When retry and circuit breaker behaviors interact, debugging requires comprehensive visibility into both.
Essential Metrics
| Metric | What It Tells You | Alert Threshold Guidance |
|---|---|---|
| circuit.state | Current circuit state (0/1/2) | Alert on transitions to OPEN |
| circuit.failure_rate | Current failure rate percentage | Alert if significantly above baseline |
| circuit.not_permitted_calls | Calls rejected by open circuit | Alert if sustained; indicates outage |
| retry.attempts | Total retry attempts | Alert if retry ratio exceeds budget |
| retry.max_retries_exceeded | Retries that exhausted all attempts | Alert if climbing; indicates chronic issues |
| retry.wait_duration | Time spent in retry backoff | Monitor for latency impact |
| bulkhead.available_permits | Available concurrency slots | Alert if consistently low |
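Beyond dashboards, Resilience4j's event publishers can feed structured logs so that circuit transitions and retry attempts are easy to correlate during an incident. The snippet below is a minimal sketch using standard SLF4J logging; wire it up wherever your circuit breakers and retries are created, and adapt the messages to your logging conventions.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: log circuit breaker state transitions, rejected calls, and retry attempts.
public class ResilienceEventLogging {

    private static final Logger log = LoggerFactory.getLogger(ResilienceEventLogging.class);

    public static void register(CircuitBreaker circuitBreaker, Retry retry) {
        var cbEvents = circuitBreaker.getEventPublisher();
        cbEvents.onStateTransition(event ->
            log.warn("Circuit '{}' transitioned: {}",
                event.getCircuitBreakerName(), event.getStateTransition()));
        cbEvents.onCallNotPermitted(event ->
            log.debug("Circuit '{}' rejected a call (open)", event.getCircuitBreakerName()));

        var retryEvents = retry.getEventPublisher();
        retryEvents.onRetry(event ->
            log.info("Retry '{}' attempt #{} after {}",
                event.getName(), event.getNumberOfRetryAttempts(), event.getWaitInterval()));
        retryEvents.onError(event ->
            log.warn("Retry '{}' exhausted after {} attempts",
                event.getName(), event.getNumberOfRetryAttempts()));
    }
}
```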
{ "title": "Resilience Dashboard", "widgets": [ { "title": "Circuit Breaker State", "query": "resilience4j_circuitbreaker_state{service:product}", "visualization": "timeseries" }, { "title": "Calls Rejected by Open Circuit", "query": "sum:resilience4j_circuitbreaker_not_permitted_calls{*}.as_rate()", "visualization": "timeseries" }, { "title": "Retry Ratio", "query": "sum:resilience4j_retry_calls{kind:retry} / sum:resilience4j_retry_calls{*}", "visualization": "query_value", "alert_threshold": 0.2 }, { "title": "Failure Rate by Service", "query": "avg:resilience4j_circuitbreaker_failure_rate{*} by {name}", "visualization": "timeseries" }, { "title": "Retry Exhaustion Rate", "query": "sum:resilience4j_retry_calls{kind:failed_with_retry}.as_rate()", "visualization": "timeseries" } ]}Debugging Common Issues
Issue: Circuit opens unexpectedly
Symptoms: Circuit opens when dependency appears healthy.
Debugging steps:
Issue: Retries never succeed
Symptoms: All retries fail; high retry exhaustion rate.
Debugging steps:
Issue: Cascading circuit opens
Symptoms: One circuit opens, then others follow.
Debugging steps:
Let's consolidate the key best practices for combining circuit breakers with retries.
Quick Reference: Recommended Defaults
| Parameter | Recommended Value | Notes |
|---|---|---|
| Max retry attempts | 3 | Including initial attempt; adjust for operation cost |
| Initial backoff | 500ms | Adjust based on typical recovery time |
| Max backoff | 30s | Don't wait forever |
| Backoff multiplier | 2 | Standard exponential |
| Jitter | Full or decorrelated | Avoids thundering herd |
| Retry budget | 20% | System-wide protection |
| Circuit failure threshold | 50% | Trip when half are failing |
| Circuit minimum volume | 10-20 | Statistical significance |
| Circuit recovery timeout | 30s | Give service time to recover |
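If you configure Resilience4j programmatically rather than through YAML, the table's defaults translate roughly into the builders below. This is a sketch under the assumptions above: the 30s backoff cap and the 20% retry budget are not shown, since the cap is typically applied via exponentialMaxWaitDuration (as in the YAML earlier) and the budget is enforced separately, for example by the RetryBudget class or the service mesh.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;

// Sketch: recommended defaults from the table expressed as Resilience4j config objects.
CircuitBreakerConfig circuitDefaults = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // trip when half the calls fail
    .minimumNumberOfCalls(20)                        // require statistical significance
    .waitDurationInOpenState(Duration.ofSeconds(30)) // recovery timeout
    .build();

RetryConfig retryDefaults = RetryConfig.custom()
    .maxAttempts(3)                                  // including the initial attempt
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),                      // initial backoff
        2.0,                                         // backoff multiplier
        0.5))                                        // jitter (randomization factor)
    .retryOnException(e -> !(e instanceof CallNotPermittedException)) // never retry an open circuit
    .build();
```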
We've completed our comprehensive exploration of the circuit breaker pattern and its integration with retry strategies. Let's consolidate the key insights from this final page:
Module Summary: Circuit Breaker Pattern
Over the course of this module, we've developed a comprehensive understanding of the circuit breaker pattern:
Why circuit breakers exist: To prevent cascade failures in distributed systems by failing fast when dependencies are unhealthy.
How the state machine works: Three states (Closed, Open, Half-Open) with transitions based on failure thresholds and recovery probes.
How to configure thresholds: The mathematics of failure rate, window sizing, minimum volume, and slow call detection.
Production implementations: Hystrix, Resilience4j, and implementations in other ecosystems.
Integration with retries: The critical ordering, backoff strategies, retry budgets, and idempotency requirements.
You now have the knowledge to design, implement, and operate circuit breakers in production systems. This pattern, combined with the other resilience patterns in this chapter, forms the foundation for building systems that remain reliable despite the inherent unreliability of distributed computing.
Congratulations! You've mastered the circuit breaker pattern—one of the most important resilience patterns in distributed systems engineering. You understand the theory, the implementations, and the practical integration with other patterns. Apply this knowledge to build systems that fail gracefully and recover automatically.