At its core, a circuit breaker is a finite state machine—a computational model that exists in exactly one of a finite number of states at any given time, transitioning between states based on external events. This simple abstraction enables sophisticated behavior: detecting failures, protecting resources, and automatically recovering when conditions improve.
Understanding the circuit breaker's state machine is essential for several reasons: it determines how your system behaves during partial failures, it drives every configuration decision (thresholds, timeouts, probe counts), and it gives you the vocabulary to debug a breaker that trips too eagerly or never trips at all.
This page dissects the circuit breaker state machine with the rigor it deserves, covering not just the what, but the why behind each design decision.
By the end of this page, you will thoroughly understand the three circuit breaker states (Closed, Open, Half-Open), the precise conditions that trigger transitions between them, the timing and counter mechanics, and the design rationale. You'll be able to trace the lifecycle of a circuit through failure, protection, and recovery.
The classical circuit breaker pattern, as popularized by Michael Nygard in Release It! and implemented in libraries like Netflix Hystrix, uses three states. Each state represents a different operational mode and determines how the circuit breaker handles incoming requests.
The Complete State Transition Diagram
State Summary
| State | Meaning | Request Behavior | Transition Trigger |
|---|---|---|---|
| CLOSED | Normal operation; dependency assumed healthy | All requests pass through to dependency | Opens when failure threshold exceeded |
| OPEN | Failure detected; dependency assumed unhealthy | Requests fail immediately without calling dependency | Transitions to HALF-OPEN after recovery timeout |
| HALF-OPEN | Testing recovery; dependency status unknown | Limited probe requests pass through | Closes on success; re-opens on failure |
The names 'Closed' and 'Open' come from electrical circuit breakers. A CLOSED circuit allows current to flow (analogous to allowing requests through). An OPEN circuit interrupts the flow (blocking requests). This is initially counterintuitive for software engineers—remember that 'closed' means 'allowing requests', not 'shut down'.
The Lifecycle Narrative
To build intuition, let's trace through a complete circuit breaker lifecycle:
Birth (CLOSED): The circuit breaker starts in the CLOSED state. All requests pass through to the downstream dependency. The breaker monitors success and failure rates, maintaining running counters or windows.
Degradation Detected: The dependency begins failing. Perhaps it returns errors, or requests time out. Each failure increments the failure counter or contributes to the failure rate calculation.
Threshold Exceeded (→ OPEN): When failures exceed the configured threshold (e.g., 50% failure rate over 10 requests), the circuit "trips" and transitions to OPEN. This is the protective mechanism engaging.
Fast Failure (OPEN): While OPEN, all requests immediately fail with a circuit-open exception. No requests reach the dependency. Resources are preserved. The dependency has time to recover without being hammered by requests.
Recovery Window: The circuit remains OPEN for a configured duration (the "recovery timeout" or "sleep window"). This gives the dependency time to recover.
Probe Initiation (→ HALF-OPEN): After the recovery timeout expires, the circuit transitions to HALF-OPEN. This is the testing phase.
Recovery Test (HALF-OPEN): A limited number of "probe" requests are allowed through. These test whether the dependency has recovered.
Recovery Confirmed (→ CLOSED): If probes succeed, the circuit transitions back to CLOSED. Normal operation resumes. Counters are reset.
Recovery Failed (→ OPEN): If probes fail, the circuit re-opens. The recovery timeout restarts. We wait again before the next recovery test.
The CLOSED state is the normal operating mode. All requests pass through to the dependency, but the circuit breaker is actively monitoring for signs of trouble. Think of it as a vigilant gatekeeper: allowing everyone through while scanning for threats.
Request Flow in CLOSED State
Failure Counting Mechanisms
The circuit breaker must track failures to know when to trip. There are several approaches, each with trade-offs:
1. Simple Counter
The most basic approach: count consecutive failures. Trip when the count exceeds a threshold.
Pros: Simple to implement and understand. Cons: A single success resets the counter, potentially masking systemic issues.
```java
if (consecutiveFailures >= threshold) {
    trip();
}
```
2. Rolling Window Counter
Track failures within a time window (e.g., last 60 seconds). Calculate failure rate as: failures / total requests in window.
Pros: More representative of actual health; resists counter-reset gaming. Cons: Requires more memory to track individual request outcomes.
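A minimal sketch of the rolling-window approach, assuming you can afford to keep one timestamped entry per request; the class and method names are illustrative, not from any library:

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative rolling-window failure tracker: one entry per request. */
class RollingWindowCounter {
    private record Outcome(long timestampMs, boolean failure) {}

    private final Deque<Outcome> outcomes = new ArrayDeque<>();
    private final long windowMs;

    RollingWindowCounter(Duration window) {
        this.windowMs = window.toMillis();
    }

    synchronized void record(boolean failure) {
        outcomes.addLast(new Outcome(System.currentTimeMillis(), failure));
        evictExpired();
    }

    synchronized double failureRate() {
        evictExpired();
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long failures = outcomes.stream().filter(Outcome::failure).count();
        return (double) failures / outcomes.size();
    }

    // Drop outcomes older than the window; memory grows with request volume.
    private void evictExpired() {
        long cutoff = System.currentTimeMillis() - windowMs;
        while (!outcomes.isEmpty() && outcomes.peekFirst().timestampMs() < cutoff) {
            outcomes.removeFirst();
        }
    }
}
```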
3. Sliding Window Counter
Similar to rolling window but with buckets. Divide time into buckets (e.g., 10 buckets of 6 seconds each for a 60-second window). Each bucket aggregates counts.
Pros: Memory-efficient; provides time decay. Cons: Bucket granularity affects precision.
```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * A sliding window implementation for tracking failure rates.
 * Divides time into buckets for memory efficiency.
 */
public class SlidingWindowCounter {
    private final int bucketCount;
    private final long bucketDurationMs;
    private final AtomicLong[] successCounts;
    private final AtomicLong[] failureCounts;
    private final AtomicLong[] bucketStartTimes;

    public SlidingWindowCounter(int bucketCount, long windowDurationMs) {
        this.bucketCount = bucketCount;
        this.bucketDurationMs = windowDurationMs / bucketCount;
        this.successCounts = new AtomicLong[bucketCount];
        this.failureCounts = new AtomicLong[bucketCount];
        this.bucketStartTimes = new AtomicLong[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            successCounts[i] = new AtomicLong(0);
            failureCounts[i] = new AtomicLong(0);
            bucketStartTimes[i] = new AtomicLong(0);
        }
    }

    private int getCurrentBucket() {
        long now = System.currentTimeMillis();
        int bucket = (int) ((now / bucketDurationMs) % bucketCount);
        // Reset bucket if it's stale (from a previous window)
        long bucketStart = (now / bucketDurationMs) * bucketDurationMs;
        if (bucketStartTimes[bucket].getAndSet(bucketStart) != bucketStart) {
            successCounts[bucket].set(0);
            failureCounts[bucket].set(0);
        }
        return bucket;
    }

    public void recordSuccess() {
        successCounts[getCurrentBucket()].incrementAndGet();
    }

    public void recordFailure() {
        failureCounts[getCurrentBucket()].incrementAndGet();
    }

    public double getFailureRate() {
        long totalSuccess = 0;
        long totalFailure = 0;
        long now = System.currentTimeMillis();
        long windowStart = now - (bucketCount * bucketDurationMs);
        for (int i = 0; i < bucketCount; i++) {
            if (bucketStartTimes[i].get() >= windowStart) {
                totalSuccess += successCounts[i].get();
                totalFailure += failureCounts[i].get();
            }
        }
        long total = totalSuccess + totalFailure;
        return total == 0 ? 0.0 : (double) totalFailure / total;
    }

    public long getTotalRequests() {
        long total = 0;
        long now = System.currentTimeMillis();
        long windowStart = now - (bucketCount * bucketDurationMs);
        for (int i = 0; i < bucketCount; i++) {
            if (bucketStartTimes[i].get() >= windowStart) {
                total += successCounts[i].get() + failureCounts[i].get();
            }
        }
        return total;
    }
}
```

What Counts as a Failure?
An often-overlooked aspect of circuit breaker design is defining precisely what constitutes a "failure." This is not always obvious:
| Scenario | Should Count as Failure? | Rationale |
|---|---|---|
| HTTP 500 Internal Server Error | Yes | Server-side error; dependency is unhealthy |
| HTTP 503 Service Unavailable | Yes | Service explicitly signaling overload |
| HTTP 429 Too Many Requests | Maybe | Could be caller's fault (rate limit), not dependency health |
| HTTP 400 Bad Request | No | Client error; dependency is healthy |
| HTTP 404 Not Found | Typically No | Resource doesn't exist; not a health indicator |
| Connection Timeout | Yes | Cannot reach dependency |
| Read Timeout | Yes | Dependency too slow to respond |
| Connection Refused | Yes | Dependency is down |
| SSL/TLS Handshake Failure | Depends | Could be temporary or permanent; needs investigation |
Most circuit breaker implementations allow configuring which exceptions or status codes count as failures. Getting this configuration right is crucial—too broad catches normal errors; too narrow misses real failures.
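As a sketch of that configuration surface, a failure-classification policy for HTTP calls might look like the following. The status-code choices mirror the table above (429 is treated here as a caller-side issue), and the class and method names are illustrative:

```java
import java.io.IOException;

/** Illustrative policy: decide whether an outcome should count against the circuit. */
final class FailureClassifier {

    /** Server-side status codes indicate dependency health problems; 4xx do not. */
    static boolean isFailureStatus(int httpStatus) {
        return httpStatus == 500 || httpStatus == 502
            || httpStatus == 503 || httpStatus == 504;
    }

    /**
     * Transport-level exceptions (connection refused, connect/read timeouts) count
     * as failures; business or validation exceptions do not.
     */
    static boolean isFailureException(Throwable t) {
        return t instanceof IOException;
    }
}
```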
A circuit should not trip based on a small number of requests. If you've only made 2 requests and 1 failed, your failure rate is 50%—but that's not statistically significant. Most implementations require a minimum request volume (e.g., 10 requests in the window) before evaluating failure rate. Otherwise, a single failure could trip the circuit inappropriately.
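A sketch of that guard, combining the failure-rate threshold with a minimum request volume; the constants are illustrative and the check reuses the `SlidingWindowCounter` shown earlier:

```java
// Illustrative trip check: only trust the failure rate once there is enough data.
private static final int MINIMUM_CALLS = 10;               // don't evaluate below this volume
private static final double FAILURE_RATE_THRESHOLD = 0.50; // 50%

private boolean shouldTrip(SlidingWindowCounter window) {
    long totalCalls = window.getTotalRequests();
    if (totalCalls < MINIMUM_CALLS) {
        return false; // 1 failure out of 2 calls is not statistically meaningful
    }
    return window.getFailureRate() >= FAILURE_RATE_THRESHOLD;
}
```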
The OPEN state is the circuit breaker's protective mode. When the breaker is open, it's actively preventing requests from reaching the troubled dependency. This is where the cascade-prevention logic is in full effect.
Request Flow in OPEN State
Why Fast Failure Matters
The power of the OPEN state lies in its speed. Rejecting a request in OPEN state typically takes microseconds—far faster than waiting for a timeout:
| Scenario | Time Consumed | Resources Held |
|---|---|---|
| CLOSED: Successful call | 50ms (typical) | Thread/connection for 50ms |
| CLOSED: Failed call with timeout | 30,000ms (30s timeout) | Thread/connection for 30s |
| OPEN: Immediate rejection | 0.1ms | Nearly zero |
The difference is dramatic. Consider a service handling 1000 requests/second:
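Using the illustrative timings from the table above, a rough Little's-law estimate of the concurrency each mode demands:

concurrencyNeeded ≈ requestRate × timeHeldPerRequest
CLOSED, calls failing at the 30 s timeout: 1000 req/s × 30 s = 30,000 threads/connections tied up
OPEN, ~0.1 ms rejections: 1000 req/s × 0.0001 s ≈ 0.1 (effectively zero)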
This is the circuit breaker's fundamental value proposition: transforming slow failures into fast failures.
The Recovery Timeout (Sleep Window)
The OPEN state isn't permanent—that would be useless. The circuit remains open for a configurable duration called the "recovery timeout" or "sleep window" (Hystrix terminology). During this time, no requests reach the dependency; every call is rejected immediately and, ideally, served by a fallback.
Choosing the Recovery Timeout
The recovery timeout is a critical tuning parameter:
| Timeout Length | Pros | Cons |
|---|---|---|
| Too Short (1-5 seconds) | Quick recovery testing; minimal downtime if dependency recovers fast | May hammer dependency before it recovers; may cause oscillation |
| Moderate (10-60 seconds) | Balanced approach; gives dependency reasonable recovery time | Users experience degraded service during this window |
| Too Long (minutes+) | Near certainty dependency is recovered before testing | Extended service degradation even if dependency recovers quickly |
Typical recommendations fall in the moderate range (roughly 10-60 seconds), tuned to how long the dependency usually takes to recover and revisited as you observe real incidents.
What the Application Sees
When the circuit is open, the circuit breaker throws an exception (or returns an error value) immediately. The application must handle this:
```java
public ProductDetails getProductDetails(String productId) {
    try {
        return circuitBreaker.run(
            () -> productService.getDetails(productId),
            // Fallback when circuit is open or call fails
            throwable -> {
                if (throwable instanceof CircuitOpenException) {
                    // Circuit is open - use cached data or default
                    logger.warn("Circuit open for product service, using cached data");
                    return cachedProductDetails.get(productId)
                        .orElse(ProductDetails.placeholder(productId));
                }
                // Other failure - still use fallback
                logger.error("Product service call failed", throwable);
                return ProductDetails.placeholder(productId);
            }
        );
    } catch (Exception e) {
        // Unexpected error
        return ProductDetails.unavailable(productId);
    }
}
```

The quality of your fallback determines user experience during outages. A well-designed fallback (cached data, default values, simpler computation) provides partial functionality. A poor fallback (generic error message) provides no value. Invest in fallback design—it's the difference between 'slightly degraded' and 'completely broken'.
The HALF-OPEN state is perhaps the most interesting from an engineering perspective. It's the circuit breaker's mechanism for automatically detecting that a dependency has recovered, without risking a cascade if it hasn't.
The Problem HALF-OPEN Solves
Without an intermediate state, the circuit breaker faces a dilemma: stay OPEN forever and never recover automatically, or snap straight back to CLOSED after the timeout and risk re-triggering the cascade if the dependency is still down.
HALF-OPEN solves this by allowing controlled probing: a small number of requests are allowed through to test the dependency's health.
Request Flow in HALF-OPEN State
Probe Mechanics
Different circuit breaker implementations handle probing differently:
1. Single Probe Model (Classic)
Allow exactly one request through as a probe. If it succeeds, close the circuit. If it fails, re-open.
Pros: Simple; minimal risk if dependency is still failing. Cons: A single transient failure keeps the circuit open; one unlucky request decides fate.
2. Multi-Probe Model (More Robust)
Allow N requests through. Require M successes (where M ≤ N) to close the circuit. Any failure immediately re-opens.
Pros: More robust against transient failures. Cons: Slightly more complex; more requests reach a potentially-failing dependency.
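A sketch of the multi-probe bookkeeping, assuming N permitted probes and M required successes; the class and field names are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative HALF-OPEN bookkeeping: N permitted probes, M required successes (M <= N).
class HalfOpenProbes {
    private final int permittedProbes;                              // N
    private final int requiredSuccesses;                            // M
    private final AtomicInteger issuedPermits = new AtomicInteger();
    private final AtomicInteger successes = new AtomicInteger();

    HalfOpenProbes(int permittedProbes, int requiredSuccesses) {
        this.permittedProbes = permittedProbes;
        this.requiredSuccesses = requiredSuccesses;
    }

    /** A request may proceed only if it wins one of the N probe slots. */
    boolean tryAcquireProbePermit() {
        return issuedPermits.getAndIncrement() < permittedProbes;
    }

    /** Returns true once enough probes have succeeded to close the circuit. */
    boolean recordSuccessAndShouldClose() {
        return successes.incrementAndGet() >= requiredSuccesses;
    }
    // Any probe failure is reported to the breaker, which re-opens immediately.
}
```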
3. Percentage-Based Model
Allow a percentage of traffic through (e.g., 10%). If failure rate in this traffic is acceptable, gradually increase percentage until circuit is fully closed.
Pros: Gradual warm-up; good for dependencies that need to warm caches. Cons: More complex; longer recovery time.
Example Configuration: Resilience4j
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    // CLOSED state configuration
    .failureRateThreshold(50)                  // Open if 50% of requests fail
    .minimumNumberOfCalls(10)                  // Need at least 10 calls to evaluate
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                     // Evaluate last 20 calls

    // OPEN state configuration
    .waitDurationInOpenState(Duration.ofSeconds(30))   // Recovery timeout

    // HALF-OPEN state configuration
    .permittedNumberOfCallsInHalfOpenState(5)  // Allow 5 probe calls
    .automaticTransitionFromOpenToHalfOpenEnabled(true)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("productService", config);
```

The Waiting Room Problem
When the circuit transitions from OPEN to HALF-OPEN, there might be many requests waiting. What happens to them?
Option 1: Let them wait for probe result The first probe request is sent. Other requests wait (with timeout) for the result. If probe succeeds, waiting requests can proceed. If probe fails, waiting requests are rejected.
Issue: Could create a thundering herd if many requests are waiting.
Option 2: Reject non-probe requests immediately Only probe requests are allowed. All other requests get immediate rejection (same as OPEN state).
Issue: Continued service degradation during probing.
Option 3: Queue with capacity limit Maintain a small queue. If queue is full, reject. Otherwise queue requests until probe resolves.
Issue: Complexity; needs careful timeout management.
Most production implementations use Option 2 for simplicity and predictability.
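A minimal sketch of Option 2, assuming a probe-permit counter like the `HalfOpenProbes` helper above and the `CircuitOpenException` used earlier; non-probe callers are rejected exactly as in the OPEN state:

```java
// Illustrative request gate during HALF-OPEN (Option 2): only probe requests pass.
public <T> T execute(java.util.function.Supplier<T> call) {
    if (state.get() == State.HALF_OPEN && !halfOpenProbes.tryAcquireProbePermit()) {
        throw new CircuitOpenException("Half-open: probe slots exhausted, rejecting call");
    }
    return invokeAndRecord(call); // run the call and feed its outcome back into the breaker
}
```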
HALF-OPEN is designed to be a brief transitional state. A circuit shouldn't stay half-open for extended periods. If you see a circuit stuck in half-open, investigate: Are probes timing out? Is the success threshold misconfigured? Is there a race condition in the implementation?
Understanding the precise conditions for each state transition is critical for debugging and configuration. Let's examine each transition with formal precision.
Transition 1: CLOSED → OPEN (Trip)
This transition occurs when the failure detection mechanism determines the dependency is unhealthy.
| Approach | Transition Condition | Configuration Parameters |
|---|---|---|
| Failure Rate | failureRate ≥ threshold AND totalCalls ≥ minimum | failureRateThreshold, minimumNumberOfCalls |
| Slow Call Rate | slowCallRate ≥ threshold AND totalCalls ≥ minimum | slowCallRateThreshold, slowCallDurationThreshold, minimumNumberOfCalls |
| Consecutive Failures | consecutiveFailures ≥ threshold | consecutiveFailureThreshold |
Transition 2: OPEN → HALF-OPEN (Recovery Test Initiation)
This transition is time-based: after the recovery timeout expires, the circuit transitions to HALF-OPEN.
IF currentTime - lastOpenTime ≥ recoveryTimeout THEN
transition(HALF-OPEN)
Implementation Note: This transition can be timer-driven (a background scheduler flips the state when the recovery timeout expires) or request-triggered (the next incoming call notices that the timeout has elapsed and performs the transition lazily).
Request-triggered is more common as it avoids background threads and ensures the transition only happens when there's demand.
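A sketch of the request-triggered variant: the check happens lazily when a call arrives, and a compare-and-swap guarantees only one caller performs the transition. The field names follow the thread-safety example later on this page, and `recoveryTimeoutMs` is an assumed configuration field:

```java
// Lazy, request-triggered OPEN -> HALF_OPEN transition: no background timer needed.
private boolean allowRequest() {
    if (state.get() == State.OPEN) {
        long elapsed = System.currentTimeMillis() - stateTimestamp.get();
        if (elapsed >= recoveryTimeoutMs
                && state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
            stateTimestamp.set(System.currentTimeMillis());
            return true;  // this caller becomes the first probe
        }
        return false;     // still open, or another thread just won the transition
    }
    return true;          // CLOSED or HALF_OPEN: other admission checks apply
}
```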
Transition 3: HALF-OPEN → CLOSED (Recovery Confirmed)
This transition occurs when probe requests succeed.
| Approach | Transition Condition | Configuration Parameters |
|---|---|---|
| Single Probe | First probe succeeds | N/A (single probe) |
| Multiple Probes | successfulProbes ≥ requiredSuccesses | permittedCallsInHalfOpen, requiredSuccessfulCalls |
| Percentage | probeFailureRate < threshold | permittedCallsInHalfOpen, failureRateThreshold |
Transition 4: HALF-OPEN → OPEN (Recovery Failed)
This transition occurs when probe requests fail, indicating the dependency hasn't recovered.
IF probeOutcome == FAILURE THEN
transition(OPEN)
resetRecoveryTimer()
Important: The recovery timer resets when re-opening. This ensures another full recovery timeout before the next probe attempt.
State transitions must be thread-safe. Consider: Two threads simultaneously evaluate failure thresholds. Both determine the circuit should trip. Without proper synchronization, both might attempt the transition, potentially corrupting state. Use atomic compare-and-swap operations or synchronized blocks for state transitions.
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class CircuitBreaker {
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicLong stateTimestamp = new AtomicLong(System.currentTimeMillis());

    /**
     * Atomically transitions from expected state to new state.
     * Returns true if transition occurred, false if current state
     * didn't match expected (another thread already transitioned).
     */
    private boolean tryTransition(State expected, State newState) {
        if (state.compareAndSet(expected, newState)) {
            stateTimestamp.set(System.currentTimeMillis());
            notifyStateChangeListeners(expected, newState);
            return true;
        }
        return false;
    }

    private void handleFailure() {
        State currentState = state.get();
        if (currentState == State.CLOSED) {
            recordFailure();
            if (shouldTrip()) {
                // Atomically trip - only one thread succeeds
                tryTransition(State.CLOSED, State.OPEN);
            }
        } else if (currentState == State.HALF_OPEN) {
            // Any failure in half-open re-opens immediately
            tryTransition(State.HALF_OPEN, State.OPEN);
        }
        // OPEN state: nothing to do on failure (already open)
    }

    private void handleSuccess() {
        State currentState = state.get();
        if (currentState == State.CLOSED) {
            recordSuccess();
        } else if (currentState == State.HALF_OPEN) {
            if (recordProbeSuccessAndCheckThreshold()) {
                // Atomically close - only one thread succeeds
                tryTransition(State.HALF_OPEN, State.CLOSED);
            }
        }
        // OPEN state: success shouldn't happen (calls are rejected)
    }
}
```

Transition Diagram with Conditions
The four transitions, summarized with their formal conditions:

| Transition | Condition |
|---|---|
| CLOSED → OPEN | Failure rate (or slow-call rate) ≥ threshold AND total calls ≥ minimum volume |
| OPEN → HALF-OPEN | Time since opening ≥ recovery timeout |
| HALF-OPEN → CLOSED | Required number of probe successes reached |
| HALF-OPEN → OPEN | Any probe failure (recovery timer resets) |
The three-state model is the classic form, but production implementations often extend it with additional states or mechanisms to handle edge cases.
Variation 1: DISABLED State
Some systems include a DISABLED state that allows bypassing the circuit breaker entirely. This is useful for local development and testing, load tests where you want to observe raw dependency behavior, and temporarily bypassing a breaker whose configuration is misbehaving.
Variation 2: FORCED_OPEN State
The inverse of DISABLED: a state that forces the circuit open regardless of dependency health. Useful for planned maintenance windows, deliberately shedding load from a struggling dependency during an incident, or isolating a dependency you already know is unhealthy before the metrics catch up.
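Resilience4j, for example, exposes these modes as manual transitions on the breaker instance; a brief sketch (check the method names against the version you are using):

```java
// Force the circuit open, e.g. during a planned maintenance window.
circuitBreaker.transitionToForcedOpenState();

// Bypass the breaker entirely: calls pass through and metrics no longer trip it.
circuitBreaker.transitionToDisabledState();

// Return to normal, metrics-driven evaluation.
circuitBreaker.transitionToClosedState();
```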
Variation 3: Slow Call Detection
Modern implementations like Resilience4j can trip not just on failures, but on slow calls. A call is considered "slow" if it exceeds a duration threshold.
slowCallRate = slowCalls / totalCalls
IF slowCallRate >= slowCallRateThreshold THEN trip()
This catches degradation before it becomes outright failure—a slow dependency consumes resources just like a failing one.
```java
CircuitBreakerConfig configWithSlowCallDetection = CircuitBreakerConfig.custom()
    // Standard failure rate configuration
    .failureRateThreshold(50)
    .minimumNumberOfCalls(10)

    // Slow call detection
    .slowCallRateThreshold(80)                           // Open if 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))    // "Slow" = >2 seconds

    // Sliding window
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)
    .build();

// Now the circuit will open if EITHER:
// 1. Failure rate >= 50%, OR
// 2. Slow call rate >= 80% (where slow = >2 seconds)
```

Variation 4: Gradual Recovery (Ramp-Up)
Instead of a binary HALF-OPEN → CLOSED transition, some implementations gradually increase traffic: for example, admitting 10% of requests, then 25%, then 50%, and finally all traffic, advancing a step only while the observed failure rate stays acceptable (see the sketch below).
This prevents the "thundering herd" problem when a circuit closes and all backed-up requests flood the recovered dependency.
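One way to sketch such a ramp-up: admit a growing fraction of traffic while the rest keeps using the fallback. The schedule and names below are illustrative, not taken from any particular library:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative ramp-up: admit 10% -> 25% -> 50% -> 100% of traffic,
// advancing only while the observed failure rate stays acceptable.
class RampUpGate {
    private static final double[] STEPS = {0.10, 0.25, 0.50, 1.00};
    private volatile int currentStep = 0;

    boolean allowRequest() {
        return ThreadLocalRandom.current().nextDouble() < STEPS[currentStep];
    }

    void onHealthyEvaluation() {                 // called at each evaluation interval
        if (currentStep < STEPS.length - 1) {
            currentStep++;                       // admit more traffic
        }
    }

    void onUnhealthyEvaluation() {
        currentStep = 0;                         // drop back to the smallest fraction (or re-open)
    }
}
```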
Variation 5: Ignore Exceptions
Some exceptions shouldn't affect circuit state: business and validation errors, "not found" results, and other caller-side problems say nothing about the dependency's health.
Circuit breakers should allow configuring which exception types to ignore:
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .minimumNumberOfCalls(10)

    // These exceptions are recorded as failures
    .recordExceptions(
        IOException.class,
        TimeoutException.class,
        ServiceUnavailableException.class
    )

    // These exceptions are NOT recorded as failures
    // They pass through but don't affect circuit state
    .ignoreExceptions(
        BusinessException.class,
        ValidationException.class,
        NotFoundException.class
    )
    .build();
```

Carefully classifying which exceptions indicate dependency health issues versus application/business logic issues prevents false positives. A circuit tripping because users kept searching for non-existent items is counterproductive.
A circuit breaker's state must be observable. Without visibility, you can't debug issues, tune thresholds, or alert on problems.
```java
@Configuration
public class CircuitBreakerMetricsConfig {

    @Bean
    public CircuitBreakerRegistry circuitBreakerRegistry(MeterRegistry meterRegistry) {
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.ofDefaults();

        // Register metrics for all circuit breakers
        TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
            .bindTo(meterRegistry);

        return registry;
    }

    // Access metrics for a specific circuit breaker
    public void logCircuitMetrics(CircuitBreaker breaker) {
        CircuitBreaker.Metrics metrics = breaker.getMetrics();

        log.info("Circuit Breaker: {} State: {}", breaker.getName(), breaker.getState());
        log.info("  Failure Rate: {}%", metrics.getFailureRate());
        log.info("  Slow Call Rate: {}%", metrics.getSlowCallRate());
        log.info("  Total Calls: {} (Success: {}, Failed: {}, Not Permitted: {})",
            metrics.getNumberOfBufferedCalls(),
            metrics.getNumberOfSuccessfulCalls(),
            metrics.getNumberOfFailedCalls(),
            metrics.getNumberOfNotPermittedCalls());
    }
}
```

Alerting on State Changes
State changes should trigger alerts (at appropriate severity levels):
| State Change | Alert Severity | Action |
|---|---|---|
| CLOSED → OPEN | Warning / High | Investigate dependency; expect degraded service |
| OPEN → HALF-OPEN | Info | Recovery test in progress |
| HALF-OPEN → CLOSED | Info | Dependency recovered; full service restored |
| HALF-OPEN → OPEN | Warning | Recovery failed; dependency still unhealthy |
| Frequent oscillation | High | Unstable dependency; may need intervention |
Event Logging
Every state transition should be logged with context:
```java
circuitBreaker.getEventPublisher()
    .onStateTransition(event -> {
        log.warn(
            "Circuit breaker '{}' transitioned: {} → {}. " +
            "Metrics: failureRate={}%, slowCallRate={}%, " +
            "bufferedCalls={}, failedCalls={}",
            event.getCircuitBreakerName(),
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getSlowCallRate(),
            circuitBreaker.getMetrics().getNumberOfBufferedCalls(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls()
        );
    })
    .onCallNotPermitted(event -> {
        log.debug(
            "Circuit breaker '{}' rejected call (circuit open)",
            event.getCircuitBreakerName()
        );
    })
    .onError(event -> {
        log.debug(
            "Circuit breaker '{}' recorded error: {} (duration: {}ms)",
            event.getCircuitBreakerName(),
            event.getThrowable().getClass().getSimpleName(),
            event.getElapsedDuration().toMillis()
        );
    });
```

We've thoroughly examined the circuit breaker state machine: three states with distinct roles, precise transition conditions, the mechanics of failure counting, and the observability needed to operate it in production.
What's Next
Now that we understand the state machine mechanics, we need to configure it correctly. The next page dives deep into failure thresholds—how to determine the right threshold values, why minimum volume requirements matter, and the mathematical considerations behind window sizing. Misconfigured thresholds lead to circuits that either never trip or trip too aggressively.
You now understand the circuit breaker state machine in detail. You can describe each state's purpose, the conditions for transitions, the mechanics of failure counting, and the importance of observability. Next, we'll explore how to configure failure thresholds effectively.