Retries and circuit breakers are both fundamental resilience patterns—but combining them incorrectly is one of the most common causes of production incidents in distributed systems. Retries without circuit breakers can create thundering herds that amplify failures. Circuit breakers without retries may fail requests that could have succeeded on a second attempt. And retries inside circuit breakers can prevent circuits from opening when they should.
This page addresses the subtle but critical interplay between these patterns. We'll explore why ordering matters, how to configure retry budgets, and the mathematical considerations behind intelligent retry strategies. By the end, you'll understand how to combine these patterns for maximum resilience without creating new failure modes.
By the end of this page, you will understand why retries must wrap circuit breakers (not the reverse), how to implement exponential backoff with jitter, the concept of retry budgets for system-wide protection, idempotency requirements for safe retries, and production-ready configurations for common scenarios.
The most common mistake when combining retries with circuit breakers is incorrect ordering. The difference between "retry wrapping circuit breaker" and "circuit breaker wrapping retry" has profound implications for system behavior.
The Wrong Way: Retry Inside Circuit Breaker
In this (incorrect) arrangement, the retry logic is inside the circuit breaker. This causes several problems:
Problem 1: One failure counts as multiple failures
When a call fails and is retried 3 times, all 3 failures are counted by the circuit breaker. A single transient error becomes 3 failures in the circuit's statistics, making it trip prematurely.
Problem 2: Increased load on failing service
When the downstream service is degraded, the retry logic amplifies load precisely when the service can least handle it. The circuit sees this amplified failure rate and trips—but not before the retries have made things worse.
Problem 3: Longer blocking when circuit should be open
If you make a request when the circuit is about to trip, the retry logic executes all retries before the circuit opens. The user experiences the full retry timeout before getting an error.
The Right Way: Retry Outside Circuit Breaker
In this (correct) arrangement, the retry logic wraps the circuit breaker:
Benefit 1: Failures are counted correctly
Each request through the circuit breaker is a single attempt. Retries are handled at the outer layer, so the circuit sees accurate failure rates.
Benefit 2: Circuit open exception stops retries
When the circuit opens, the retry logic receives the circuit-open exception (CallNotPermittedException in Resilience4j). It can choose not to retry (retrying is pointless while the circuit is open) and immediately invoke fallback behavior.
Benefit 3: Retries don't amplify load on failing service
Once the circuit opens, retries are handled by returning circuit-open exceptions—no additional load is sent to the failing service.
```java
// CORRECT: Retry wraps CircuitBreaker
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("service");
Retry retry = Retry.of("service", RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(500))
    // Don't retry when circuit is open!
    .retryOnException(e -> !(e instanceof CallNotPermittedException))
    .build());

Supplier<Product> decoratedSupplier = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withCircuitBreaker(circuitBreaker)  // Inner: circuit breaker
    .withRetry(retry)                    // Outer: retry
    .decorate();

// The call flows:
// 1. Retry (outer) receives request
// 2. Retry calls CircuitBreaker (inner)
// 3. If circuit closed, call proceeds to service
// 4. If service fails, exception propagates to Retry
// 5. Retry checks if exception is retryable
// 6. If CallNotPermittedException (circuit open), don't retry
// 7. Otherwise, retry with backoff

// WRONG (DON'T DO THIS): CircuitBreaker wraps Retry
Supplier<Product> wrongOrder = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withRetry(retry)                    // Inner: retry (WRONG)
    .withCircuitBreaker(circuitBreaker)  // Outer: circuit breaker (WRONG)
    .decorate();
// This counts every retry as a separate failure!
```

When configuring retry logic, ALWAYS exclude circuit-open exceptions from retry. Retrying when the circuit is open is pure waste—the circuit will still be open, and you're just consuming resources. The whole point of the circuit opening is to fail fast.
When retrying failed requests, the timing between retries significantly impacts both success probability and system load. Naive retry strategies (immediate retry, fixed delay) can cause problems. Exponential backoff with jitter is the gold standard.
The Thundering Herd Problem
Consider a scenario where 1000 clients are waiting for a service. The service briefly goes down, then comes back up. If all clients retry immediately and simultaneously, the recovering service absorbs 1000 requests in the same instant, a spike it often cannot handle before it has finished warming up, so it falls over again.
This is the "thundering herd"—synchronized retries create load spikes that prevent recovery.
Exponential Backoff
Exponential backoff increases wait time between retries exponentially:
wait_time = base_delay × 2^(attempt - 1)
Attempt 1: 500ms × 2^0 = 500ms
Attempt 2: 500ms × 2^1 = 1000ms
Attempt 3: 500ms × 2^2 = 2000ms
Attempt 4: 500ms × 2^3 = 4000ms
This spreads retries over time, reducing peak load. But there's still a problem: if all clients started at the same time, they all retry at the same times (all at 500ms, all at 1500ms cumulative, etc.).
Adding Jitter
Jitter adds randomness to the backoff calculation, desynchronizing retry attempts:
wait_time = base_delay × 2^(attempt - 1) × random(0.5, 1.5)
Client A Attempt 2: 1000ms × 0.7 = 700ms
Client B Attempt 2: 1000ms × 1.2 = 1200ms
Client C Attempt 2: 1000ms × 0.95 = 950ms
Now clients retry at different times, smoothing the load.
```java
// Different jitter strategies in Resilience4j
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.core.IntervalFunction;

import java.time.Duration;

// 1. EXPONENTIAL BACKOFF (no jitter)
RetryConfig exponentialOnly = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialBackoff(
        Duration.ofMillis(500),  // Initial interval
        2.0                      // Multiplier
    ))
    .build();
// Produces: 500ms, 1000ms, 2000ms, 4000ms

// 2. EXPONENTIAL BACKOFF WITH RANDOM JITTER
RetryConfig exponentialWithJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),  // Initial interval
        2.0,                     // Multiplier
        0.5                      // Randomization factor (±50%)
    ))
    .build();
// Produces: 250-750ms, 500-1500ms, 1000-3000ms, 2000-6000ms

// 3. DECORRELATED JITTER (AWS recommendation)
// Each retry's delay is random between base delay and 3× previous delay
RetryConfig decorrelatedJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> {
        long baseMs = 500;
        long maxMs = 60_000;
        long previousDelay = attempt == 1
            ? baseMs
            : (long) (baseMs * Math.pow(3, attempt - 2));
        long delay = (long) (baseMs + Math.random() * (previousDelay * 3 - baseMs));
        return Math.min(delay, maxMs);  // IntervalFunction returns the wait in milliseconds
    })
    .build();

// 4. EQUAL JITTER (hybrid approach)
// Half exponential, half random
RetryConfig equalJitter = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(attempt -> {
        long base = 500;
        long exponential = base * (1L << (attempt - 1));  // 2^(attempt-1)
        long halfExponential = exponential / 2;
        long delay = halfExponential + (long) (Math.random() * halfExponential);
        return Math.min(delay, 60_000L);  // IntervalFunction returns the wait in milliseconds
    })
    .build();
// Provides minimum guaranteed backoff with random component
```

| Strategy | Delay Formula | Characteristics |
|---|---|---|
| No Jitter | base × 2^(attempt-1) | Synchronized retries; thundering herd risk |
| Full Jitter | random(0, base × 2^(attempt-1)) | Maximum spread; minimum guaranteed delay is 0 |
| Equal Jitter | half_exp + random(0, half_exp) | Guaranteed minimum with randomization |
| Decorrelated Jitter | random(base, prev_delay × 3) | AWS recommended; good balance |
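To make the table's formulas concrete, here is a minimal sketch of full jitter and equal jitter as plain Java helpers. The class and method names are illustrative only, not part of any library, and the values assume the 500ms base delay used throughout this page.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative backoff helpers for the jitter formulas above (not library code). */
public final class BackoffDelays {

    /** Full jitter: random(0, base × 2^(attempt-1)), capped at maxDelay. */
    static Duration fullJitter(int attempt, Duration base, Duration maxDelay) {
        long exp = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(exp + 1));
    }

    /** Equal jitter: half the exponential delay guaranteed, the other half randomized. */
    static Duration equalJitter(int attempt, Duration base, Duration maxDelay) {
        long exp = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        long half = exp / 2;
        return Duration.ofMillis(half + ThreadLocalRandom.current().nextLong(half + 1));
    }

    public static void main(String[] args) {
        Duration base = Duration.ofMillis(500);
        Duration cap = Duration.ofSeconds(30);
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.printf("attempt %d: full=%dms equal=%dms%n",
                attempt,
                fullJitter(attempt, base, cap).toMillis(),
                equalJitter(attempt, base, cap).toMillis());
        }
    }
}
```

Running the main method a few times shows the trade-off: full jitter spreads delays across the entire range (including very short waits), while equal jitter never drops below half the exponential delay.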
Always cap the maximum delay. Without a cap, exponential backoff can produce extremely long waits (500ms × 2^10 ≈ 8.5 minutes). Typical caps are 30-60 seconds. Beyond that, it's better to fail and let higher-level retry mechanisms take over.
Individual retry configuration is important, but it's not enough. When many clients retry independently, the aggregate effect can still overwhelm services. Retry budgets provide system-level retry governance.
The Problem: Cumulative Retry Load
Consider a service with 100 clients, each configured to retry 3 times. At a baseline of 1000 req/s, an incident that fails half of those requests triggers retries for every failure: 500 failed requests × 3 retries adds roughly 1500 req/s of retry traffic on top of the original load.
The service that was struggling at 1000 req/s is now receiving roughly 2500 req/s. The retries intended to recover from failure are causing complete collapse.
Retry Budgets: The Solution
A retry budget limits the proportion of requests that can be retries. Instead of fixed retry counts, you constrain:
retry_ratio = retries / (original_requests + retries)
With a 20% retry budget, retries may make up at most 20% of total traffic. At 1000 original requests per second, that allows roughly 250 retries per second (250 / (1000 + 250) = 20%); once the budget is exhausted, failed requests are not retried and fail immediately.
```java
/**
 * A token-bucket based retry budget.
 * Limits retries to a percentage of total traffic.
 */
public class RetryBudget {

    private final double retryRatio;    // Target retry percentage
    private final long windowMs;        // Time window for calculation
    private final AtomicLong attempts;  // Total attempts in window
    private final AtomicLong retries;   // Retries in window
    private final AtomicLong windowStart;

    public RetryBudget(double retryRatio, Duration window) {
        this.retryRatio = retryRatio;
        this.windowMs = window.toMillis();
        this.attempts = new AtomicLong(0);
        this.retries = new AtomicLong(0);
        this.windowStart = new AtomicLong(System.currentTimeMillis());
    }

    /**
     * Record an original (non-retry) attempt.
     */
    public void recordAttempt() {
        maybeResetWindow();
        attempts.incrementAndGet();
    }

    /**
     * Check if a retry is allowed and record it if so.
     */
    public boolean tryAcquireRetry() {
        maybeResetWindow();
        long currentAttempts = attempts.get();
        long currentRetries = retries.get();
        long total = currentAttempts + currentRetries;

        if (total == 0) {
            // No attempts yet, allow retry
            retries.incrementAndGet();
            return true;
        }

        // Check if adding a retry would exceed budget
        double potentialRetryRatio = (currentRetries + 1.0) / (total + 1.0);
        if (potentialRetryRatio <= retryRatio) {
            retries.incrementAndGet();
            return true;
        }

        return false; // Budget exhausted
    }

    private void maybeResetWindow() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start > windowMs) {
            // Reset window
            if (windowStart.compareAndSet(start, now)) {
                attempts.set(0);
                retries.set(0);
            }
        }
    }

    public double getCurrentRetryRatio() {
        long total = attempts.get() + retries.get();
        return total == 0 ? 0.0 : (double) retries.get() / total;
    }
}

// Usage
RetryBudget budget = new RetryBudget(0.20, Duration.ofSeconds(10));

public CompletableFuture<Response> callWithBudgetedRetry(Request request) {
    budget.recordAttempt(); // Original attempt

    return circuitBreaker.executeCompletableFuture(() ->
        httpClient.send(request)
    ).exceptionallyCompose(error -> {
        if (error instanceof CallNotPermittedException) {
            // Circuit open, no retry
            return fallback(request);
        }

        if (budget.tryAcquireRetry()) {
            // Budget allows retry
            return callWithBudgetedRetry(request); // Recursive retry
        } else {
            // Budget exhausted
            log.warn("Retry budget exhausted, failing request");
            return fallback(request);
        }
    });
}
```

Service Mesh Retry Budgets
Service meshes like Linkerd implement retry budgets at the infrastructure layer:
```yaml
apiVersion: policy.linkerd.io/v1beta1
kind: HTTPRoute
metadata:
  name: product-service-route
spec:
  parentRefs:
    - name: product-service
      kind: Service
  rules:
    - backendRefs:
        - name: product-service
          port: 80
      timeouts:
        request: 10s
      retry:
        limit:
          # Retry budget: max 20% of traffic can be retries
          retryRatio: 0.2
          # Allow at least this many retries per second regardless of the ratio
          minRetriesPerSecond: 10
        conditions:
          - statusCodes:
              - 502
              - 503
              - 504
          - timeouts: true
        backoff:
          baseInterval: 25ms
          maxInterval: 250ms
```

Retry budgets are strictly superior to fixed retry counts for distributed systems. Fixed counts ("retry 3 times") multiply load during failures, and the multiplier compounds across every client and every tier of the call chain. Budgets ("retries should be ≤20% of traffic") keep retry load proportional to traffic regardless of client count.
Retries are only safe if the operation being retried is idempotent—meaning executing it multiple times has the same effect as executing it once. Without idempotency guarantees, retries can cause duplicate processing, double charges, or data corruption.
The Retry Safety Problem
Consider this scenario: a client submits a payment, the service processes the charge successfully, but the response is lost to a network timeout. The client sees only a failure, retries, and the service charges the customer a second time.
The retry caused a correctness bug because the operation wasn't idempotent.
Idempotency Strategies
| Strategy | How It Works | Trade-offs |
|---|---|---|
| Idempotency Key | Client provides unique key; server deduplicates | Client must generate keys; server stores keys |
| Natural Idempotency | Design operations to be naturally idempotent | Not always possible; requires careful API design |
| Conditional Requests | Use ETags/versions for conditional updates | Adds complexity; requires version tracking |
| At-Least-Once + Dedup | Accept duplicates; deduplicate downstream | Processing overhead; eventual consistency |
```java
@RestController
@RequiredArgsConstructor  // Lombok generates the constructor for the final fields
@Slf4j                    // Lombok provides the 'log' field
public class PaymentController {

    private final PaymentService paymentService;
    private final IdempotencyStore idempotencyStore;

    @PostMapping("/payments")
    public ResponseEntity<PaymentResult> createPayment(
            @RequestHeader("Idempotency-Key") String idempotencyKey,
            @RequestBody PaymentRequest request) {

        // Check if we've seen this idempotency key before
        Optional<PaymentResult> cachedResult = idempotencyStore.get(idempotencyKey);
        if (cachedResult.isPresent()) {
            // Return cached result - this is a retry
            log.info("Returning cached result for idempotency key: {}", idempotencyKey);
            return ResponseEntity.ok(cachedResult.get());
        }

        // Try to acquire lock on idempotency key
        if (!idempotencyStore.tryLock(idempotencyKey, Duration.ofMinutes(5))) {
            // Another request with same key is in progress
            return ResponseEntity.status(HttpStatus.CONFLICT)
                .body(PaymentResult.inProgress());
        }

        try {
            // Process the payment
            PaymentResult result = paymentService.processPayment(request);

            // Store result for future retries
            idempotencyStore.store(idempotencyKey, result, Duration.ofHours(24));

            return ResponseEntity.ok(result);
        } finally {
            idempotencyStore.unlock(idempotencyKey);
        }
    }
}

@Component
@RequiredArgsConstructor
public class RedisIdempotencyStore implements IdempotencyStore {

    private final RedisTemplate<String, String> redis;
    private final ObjectMapper objectMapper;

    @Override
    public Optional<PaymentResult> get(String key) {
        String value = redis.opsForValue().get("idempotency:" + key);
        if (value == null) return Optional.empty();
        try {
            return Optional.of(objectMapper.readValue(value, PaymentResult.class));
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Corrupt idempotency record for key " + key, e);
        }
    }

    @Override
    public boolean tryLock(String key, Duration timeout) {
        return Boolean.TRUE.equals(
            redis.opsForValue().setIfAbsent("lock:" + key, "locked", timeout));
    }

    @Override
    public void store(String key, PaymentResult result, Duration ttl) {
        try {
            String value = objectMapper.writeValueAsString(result);
            redis.opsForValue().set("idempotency:" + key, value, ttl);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("Could not serialize payment result", e);
        }
    }

    @Override
    public void unlock(String key) {
        redis.delete("lock:" + key);
    }
}
```

Naturally Idempotent Operations
Some operations are naturally idempotent and safe to retry without additional mechanisms:
| Idempotent | Not Idempotent |
|---|---|
| GET requests | POST requests (typically) |
| PUT (replace entire resource) | PATCH (partial updates) |
| DELETE (by ID) | Counter increments |
| Set value to X | Add X to value |
| "Create if not exists" | "Create" (may create duplicates) |
Design for Idempotency
When possible, design your APIs to be naturally idempotent:
```
// NON-IDEMPOTENT: Add $100 to balance
POST /accounts/{id}/balance
{ "amount": 100 }

// IDEMPOTENT: Set balance to $500 (requires knowing current state)
PUT /accounts/{id}/balance
{ "balance": 500, "version": 42 }

// IDEMPOTENT: Use idempotency key for operations that can't be naturally idempotent
POST /accounts/{id}/transactions
Idempotency-Key: tx-12345
{ "amount": 100, "type": "credit" }
```
If you're retrying operations that modify state (POST, PUT, PATCH, DELETE), ALWAYS ensure idempotency protection is in place. The consequences of duplicate execution can range from minor (duplicate notifications) to severe (double charges, data corruption).
Let's bring together everything we've learned into a complete, production-ready integration of circuit breakers and retries.
The Full Stack
```
┌─────────────────────────────────────────────────────────────┐
│                        Request Flow                          │
├─────────────────────────────────────────────────────────────┤
│  1. Rate Limiter (optional)  - Prevent overload              │
│  2. Retry (with budget)      - Outer layer, handles retryable│
│  3. Circuit Breaker          - Fast fail if unhealthy        │
│  4. Bulkhead                 - Isolate resources             │
│  5. Timeout                  - Bound execution time          │
│  6. Actual Call              - The real HTTP/RPC call        │
└─────────────────────────────────────────────────────────────┘
```
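Before turning to configuration, it helps to see how this layering maps onto Resilience4j's Decorators API for a synchronous call. The sketch below is illustrative: each with* call wraps the previous layer, so the innermost decorator is added first; the timeout layer requires async composition and is omitted here, and productService, productId, and Product are placeholders carried over from earlier examples.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.retry.Retry;

import java.util.function.Supplier;

// Sketch: compose the layers from the diagram for a synchronous call.
Bulkhead bulkhead = Bulkhead.ofDefaults("productService");
CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("productService");
Retry retry = Retry.ofDefaults("productService");

Supplier<Product> resilientCall = Decorators
    .ofSupplier(() -> productService.getProduct(productId))
    .withBulkhead(bulkhead)             // innermost: isolate resources
    .withCircuitBreaker(circuitBreaker) // next: fail fast when unhealthy
    .withRetry(retry)                   // outermost: retry transient failures
    .decorate();

Product product = resilientCall.get();
```

Note the inversion: the decorator added last (Retry) ends up outermost at call time, which is exactly the ordering the diagram prescribes.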
Production Configuration
```yaml
resilience4j:
  # Circuit Breaker Configuration
  circuitbreaker:
    configs:
      default:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 100
        minimumNumberOfCalls: 10
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        ignoreExceptions:
          - org.springframework.web.client.HttpClientErrorException
    instances:
      productService:
        baseConfig: default
      paymentService:
        baseConfig: default
        failureRateThreshold: 30       # More conservative for payments
        waitDurationInOpenState: 60s

  # Retry Configuration
  retry:
    configs:
      default:
        maxAttempts: 3
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        exponentialMaxWaitDuration: 5s
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          # CRITICAL: Don't retry when circuit is open
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
          - org.springframework.web.client.HttpClientErrorException
    instances:
      productService:
        baseConfig: default
      paymentService:
        baseConfig: default
        maxAttempts: 2                 # Fewer retries for sensitive operations

  # Bulkhead Configuration
  bulkhead:
    configs:
      default:
        maxConcurrentCalls: 25
        maxWaitDuration: 500ms
    instances:
      productService:
        baseConfig: default
      paymentService:
        maxConcurrentCalls: 10         # More restrictive for payments

  # Time Limiter Configuration
  timelimiter:
    configs:
      default:
        timeoutDuration: 5s
        cancelRunningFuture: true
    instances:
      productService:
        baseConfig: default
      paymentService:
        timeoutDuration: 10s           # Longer for payment processing
```
```java
@Service
@Slf4j
public class ProductServiceClient {

    private final WebClient webClient;
    // Collaborators such as productCache and productQueue are assumed to be
    // injected elsewhere; they are omitted here for brevity.

    public ProductServiceClient(WebClient.Builder webClientBuilder) {
        this.webClient = webClientBuilder
            .baseUrl("http://product-service")
            .build();
    }

    /**
     * Get product with full resilience stack.
     * Decorator order: Retry → CircuitBreaker → Bulkhead → TimeLimiter → Call
     */
    @Retry(name = "productService")
    @CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
    @Bulkhead(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> getProduct(String productId) {
        return webClient.get()
            .uri("/products/{id}", productId)
            .retrieve()
            .bodyToMono(Product.class)
            .toFuture();
    }

    /**
     * Fallback when circuit is open or all retries exhausted.
     */
    private CompletableFuture<Product> getProductFallback(
            String productId, Throwable throwable) {

        if (throwable instanceof CallNotPermittedException) {
            log.warn("Circuit open for product service, using fallback");
            // Circuit is open - return cached data
            return CompletableFuture.completedFuture(
                productCache.get(productId).orElse(Product.placeholder(productId))
            );
        }

        if (throwable instanceof BulkheadFullException) {
            log.warn("Bulkhead full for product service, using fallback");
            // Too many concurrent requests
            return CompletableFuture.completedFuture(
                Product.placeholder(productId)
            );
        }

        if (throwable instanceof TimeoutException) {
            log.warn("Request timed out for product {}", productId);
        }

        // Default fallback
        log.error("Product service call failed for {}", productId, throwable);
        return CompletableFuture.completedFuture(
            Product.unavailable(productId)
        );
    }

    /**
     * Create product with idempotency key.
     * Uses fewer retries to avoid duplicate creation issues.
     */
    @Retry(name = "productService", fallbackMethod = "createProductFallback")
    @CircuitBreaker(name = "productService")
    @Bulkhead(name = "productService")
    @TimeLimiter(name = "productService")
    public CompletableFuture<Product> createProduct(
            String idempotencyKey, ProductCreateRequest request) {

        return webClient.post()
            .uri("/products")
            .header("Idempotency-Key", idempotencyKey)
            .bodyValue(request)
            .retrieve()
            .bodyToMono(Product.class)
            .toFuture();
    }

    private CompletableFuture<Product> createProductFallback(
            String idempotencyKey, ProductCreateRequest request, Throwable throwable) {

        // For creates, queue the operation for later processing
        log.error("Product creation failed, queuing for retry", throwable);
        productQueue.enqueue(idempotencyKey, request);

        return CompletableFuture.failedFuture(
            new ServiceTemporarilyUnavailableException(
                "Product creation queued for processing"
            )
        );
    }
}
```

With Spring annotations, the ordering is determined by the order values of the Resilience4j aspects. The default is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead (outside to inside), which is also the recommended order. You can customize it via properties such as resilience4j.circuitbreaker.circuitBreakerAspectOrder.
When retry and circuit breaker behaviors interact, debugging requires comprehensive visibility into both.
Essential Metrics
| Metric | What It Tells You | Alert Threshold Guidance |
|---|---|---|
| circuit.state | Current circuit state (0/1/2) | Alert on transitions to OPEN |
| circuit.failure_rate | Current failure rate percentage | Alert if significantly above baseline |
| circuit.not_permitted_calls | Calls rejected by open circuit | Alert if sustained; indicates outage |
| retry.attempts | Total retry attempts | Alert if retry ratio exceeds budget |
| retry.max_retries_exceeded | Retries that exhausted all attempts | Alert if climbing; indicates chronic issues |
| retry.wait_duration | Time spent in retry backoff | Monitor for latency impact |
| bulkhead.available_permits | Available concurrency slots | Alert if consistently low |
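Beyond dashboards, Resilience4j's event publishers can feed structured logs so that circuit transitions and retry attempts are easy to correlate during an incident. The snippet below is a minimal sketch using standard SLF4J logging; wire it up wherever your circuit breakers and retries are created, and adapt the messages to your logging conventions.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: log circuit breaker state transitions, rejected calls, and retry attempts.
public class ResilienceEventLogging {

    private static final Logger log = LoggerFactory.getLogger(ResilienceEventLogging.class);

    public static void register(CircuitBreaker circuitBreaker, Retry retry) {
        var cbEvents = circuitBreaker.getEventPublisher();
        cbEvents.onStateTransition(event ->
            log.warn("Circuit '{}' transitioned: {}",
                event.getCircuitBreakerName(), event.getStateTransition()));
        cbEvents.onCallNotPermitted(event ->
            log.debug("Circuit '{}' rejected a call (open)", event.getCircuitBreakerName()));

        var retryEvents = retry.getEventPublisher();
        retryEvents.onRetry(event ->
            log.info("Retry '{}' attempt #{} after {}",
                event.getName(), event.getNumberOfRetryAttempts(), event.getWaitInterval()));
        retryEvents.onError(event ->
            log.warn("Retry '{}' exhausted after {} attempts",
                event.getName(), event.getNumberOfRetryAttempts()));
    }
}
```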
{ "title": "Resilience Dashboard", "widgets": [ { "title": "Circuit Breaker State", "query": "resilience4j_circuitbreaker_state{service:product}", "visualization": "timeseries" }, { "title": "Calls Rejected by Open Circuit", "query": "sum:resilience4j_circuitbreaker_not_permitted_calls{*}.as_rate()", "visualization": "timeseries" }, { "title": "Retry Ratio", "query": "sum:resilience4j_retry_calls{kind:retry} / sum:resilience4j_retry_calls{*}", "visualization": "query_value", "alert_threshold": 0.2 }, { "title": "Failure Rate by Service", "query": "avg:resilience4j_circuitbreaker_failure_rate{*} by {name}", "visualization": "timeseries" }, { "title": "Retry Exhaustion Rate", "query": "sum:resilience4j_retry_calls{kind:failed_with_retry}.as_rate()", "visualization": "timeseries" } ]}Debugging Common Issues
Issue: Circuit opens unexpectedly
Symptoms: Circuit opens when dependency appears healthy.
Debugging steps:
Issue: Retries never succeed
Symptoms: All retries fail; high retry exhaustion rate.
Debugging steps:
Issue: Cascading circuit opens
Symptoms: One circuit opens, then others follow.
Debugging steps:
Let's consolidate the key best practices for combining circuit breakers with retries.
Quick Reference: Recommended Defaults
| Parameter | Recommended Value | Notes |
|---|---|---|
| Max retry attempts | 3 | Including initial attempt; adjust for operation cost |
| Initial backoff | 500ms | Adjust based on typical recovery time |
| Max backoff | 30s | Don't wait forever |
| Backoff multiplier | 2 | Standard exponential |
| Jitter | Full or decorrelated | Avoids thundering herd |
| Retry budget | 20% | System-wide protection |
| Circuit failure threshold | 50% | Trip when half are failing |
| Circuit minimum volume | 10-20 | Statistical significance |
| Circuit recovery timeout | 30s | Give service time to recover |
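If you configure Resilience4j programmatically rather than through YAML, the table's defaults translate roughly into the builders below. This is a sketch under the assumptions above: the 30s backoff cap and the 20% retry budget are not shown, since the cap is typically applied via exponentialMaxWaitDuration (as in the YAML earlier) and the budget is enforced separately, for example by the RetryBudget class or the service mesh.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;

// Sketch: recommended defaults from the table expressed as Resilience4j config objects.
CircuitBreakerConfig circuitDefaults = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // trip when half the calls fail
    .minimumNumberOfCalls(20)                        // require statistical significance
    .waitDurationInOpenState(Duration.ofSeconds(30)) // recovery timeout
    .build();

RetryConfig retryDefaults = RetryConfig.custom()
    .maxAttempts(3)                                  // including the initial attempt
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(500),                      // initial backoff
        2.0,                                         // backoff multiplier
        0.5))                                        // jitter (randomization factor)
    .retryOnException(e -> !(e instanceof CallNotPermittedException)) // never retry an open circuit
    .build();
```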
We've completed our comprehensive exploration of the circuit breaker pattern and its integration with retry strategies. Let's consolidate the key insights from this final page:
Module Summary: Circuit Breaker Pattern
Over the course of this module, we've developed a comprehensive understanding of the circuit breaker pattern:
Why circuit breakers exist: To prevent cascade failures in distributed systems by failing fast when dependencies are unhealthy.
How the state machine works: Three states (Closed, Open, Half-Open) with transitions based on failure thresholds and recovery probes.
How to configure thresholds: The mathematics of failure rate, window sizing, minimum volume, and slow call detection.
Production implementations: Hystrix, Resilience4j, and implementations in other ecosystems.
Integration with retries: The critical ordering, backoff strategies, retry budgets, and idempotency requirements.
You now have the knowledge to design, implement, and operate circuit breakers in production systems. This pattern, combined with the other resilience patterns in this chapter, forms the foundation for building systems that remain reliable despite the inherent unreliability of distributed computing.
Congratulations! You've mastered the circuit breaker pattern—one of the most important resilience patterns in distributed systems engineering. You understand the theory, the implementations, and the practical integration with other patterns. Apply this knowledge to build systems that fail gracefully and recover automatically.