At its core, a circuit breaker is a finite state machine—a computational model that exists in exactly one of a finite number of states at any given time, transitioning between states based on external events. This simple abstraction enables sophisticated behavior: detecting failures, protecting resources, and automatically recovering when conditions improve.
Understanding the circuit breaker's state machine is essential for several reasons: it determines how your system behaves during partial failures, it drives every configuration decision (thresholds, timeouts, probe counts), and it gives you the vocabulary to debug a breaker that trips too eagerly or never trips at all.
This page dissects the circuit breaker state machine with the rigor it deserves, covering not just the what, but the why behind each design decision.
By the end of this page, you will thoroughly understand the three circuit breaker states (Closed, Open, Half-Open), the precise conditions that trigger transitions between them, the timing and counter mechanics, and the design rationale. You'll be able to trace the lifecycle of a circuit through failure, protection, and recovery.
The classical circuit breaker pattern, as popularized by Michael Nygard in Release It! and implemented in libraries like Netflix Hystrix, uses three states. Each state represents a different operational mode and determines how the circuit breaker handles incoming requests.
The Complete State Transition Diagram
State Summary
| State | Meaning | Request Behavior | Transition Trigger |
|---|---|---|---|
| CLOSED | Normal operation; dependency assumed healthy | All requests pass through to dependency | Opens when failure threshold exceeded |
| OPEN | Failure detected; dependency assumed unhealthy | Requests fail immediately without calling dependency | Transitions to HALF-OPEN after recovery timeout |
| HALF-OPEN | Testing recovery; dependency status unknown | Limited probe requests pass through | Closes on success; re-opens on failure |
The names 'Closed' and 'Open' come from electrical circuit breakers. A CLOSED circuit allows current to flow (analogous to allowing requests through). An OPEN circuit interrupts the flow (blocking requests). This is initially counterintuitive for software engineers—remember that 'closed' means 'allowing requests', not 'shut down'.
The Lifecycle Narrative
To build intuition, let's trace through a complete circuit breaker lifecycle:
Birth (CLOSED): The circuit breaker starts in the CLOSED state. All requests pass through to the downstream dependency. The breaker monitors success and failure rates, maintaining running counters or windows.
Degradation Detected: The dependency begins failing. Perhaps it returns errors, or requests time out. Each failure increments the failure counter or contributes to the failure rate calculation.
Threshold Exceeded (→ OPEN): When failures exceed the configured threshold (e.g., 50% failure rate over 10 requests), the circuit "trips" and transitions to OPEN. This is the protective mechanism engaging.
Fast Failure (OPEN): While OPEN, all requests immediately fail with a circuit-open exception. No requests reach the dependency. Resources are preserved. The dependency has time to recover without being hammered by requests.
Recovery Window: The circuit remains OPEN for a configured duration (the "recovery timeout" or "sleep window"). This gives the dependency time to recover.
Probe Initiation (→ HALF-OPEN): After the recovery timeout expires, the circuit transitions to HALF-OPEN. This is the testing phase.
Recovery Test (HALF-OPEN): A limited number of "probe" requests are allowed through. These test whether the dependency has recovered.
Recovery Confirmed (→ CLOSED): If probes succeed, the circuit transitions back to CLOSED. Normal operation resumes. Counters are reset.
Recovery Failed (→ OPEN): If probes fail, the circuit re-opens. The recovery timeout restarts. We wait again before the next recovery test.
The CLOSED state is the normal operating mode. All requests pass through to the dependency, but the circuit breaker is actively monitoring for signs of trouble. Think of it as a vigilant gatekeeper: allowing everyone through while scanning for threats.
Request Flow in CLOSED State
Failure Counting Mechanisms
The circuit breaker must track failures to know when to trip. There are several approaches, each with trade-offs:
1. Simple Counter
The most basic approach: count consecutive failures. Trip when the count exceeds a threshold.
Pros: Simple to implement and understand. Cons: A single success resets the counter, potentially masking systemic issues.
```java
if (consecutiveFailures >= threshold) {
    trip();
}
```
2. Rolling Window Counter
Track failures within a time window (e.g., last 60 seconds). Calculate failure rate as: failures / total requests in window.
Pros: More representative of actual health; resists counter-reset gaming. Cons: Requires more memory to track individual request outcomes.
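A minimal sketch of the rolling-window approach, assuming you can afford to keep one timestamped entry per request; the class and method names are illustrative, not from any library:

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative rolling-window failure tracker: one entry per request. */
class RollingWindowCounter {
    private record Outcome(long timestampMs, boolean failure) {}

    private final Deque<Outcome> outcomes = new ArrayDeque<>();
    private final long windowMs;

    RollingWindowCounter(Duration window) {
        this.windowMs = window.toMillis();
    }

    synchronized void record(boolean failure) {
        outcomes.addLast(new Outcome(System.currentTimeMillis(), failure));
        evictExpired();
    }

    synchronized double failureRate() {
        evictExpired();
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long failures = outcomes.stream().filter(Outcome::failure).count();
        return (double) failures / outcomes.size();
    }

    // Drop outcomes older than the window; memory grows with request volume.
    private void evictExpired() {
        long cutoff = System.currentTimeMillis() - windowMs;
        while (!outcomes.isEmpty() && outcomes.peekFirst().timestampMs() < cutoff) {
            outcomes.removeFirst();
        }
    }
}
```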
3. Sliding Window Counter
Similar to rolling window but with buckets. Divide time into buckets (e.g., 10 buckets of 6 seconds each for a 60-second window). Each bucket aggregates counts.
Pros: Memory-efficient; provides time decay. Cons: Bucket granularity affects precision.
```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * A sliding window implementation for tracking failure rates.
 * Divides time into buckets for memory efficiency.
 */
public class SlidingWindowCounter {
    private final int bucketCount;
    private final long bucketDurationMs;
    private final AtomicLong[] successCounts;
    private final AtomicLong[] failureCounts;
    private final AtomicLong[] bucketStartTimes;

    public SlidingWindowCounter(int bucketCount, long windowDurationMs) {
        this.bucketCount = bucketCount;
        this.bucketDurationMs = windowDurationMs / bucketCount;
        this.successCounts = new AtomicLong[bucketCount];
        this.failureCounts = new AtomicLong[bucketCount];
        this.bucketStartTimes = new AtomicLong[bucketCount];
        for (int i = 0; i < bucketCount; i++) {
            successCounts[i] = new AtomicLong(0);
            failureCounts[i] = new AtomicLong(0);
            bucketStartTimes[i] = new AtomicLong(0);
        }
    }

    private int getCurrentBucket() {
        long now = System.currentTimeMillis();
        int bucket = (int) ((now / bucketDurationMs) % bucketCount);
        // Reset bucket if it's stale (from a previous window)
        long bucketStart = (now / bucketDurationMs) * bucketDurationMs;
        if (bucketStartTimes[bucket].getAndSet(bucketStart) != bucketStart) {
            successCounts[bucket].set(0);
            failureCounts[bucket].set(0);
        }
        return bucket;
    }

    public void recordSuccess() {
        successCounts[getCurrentBucket()].incrementAndGet();
    }

    public void recordFailure() {
        failureCounts[getCurrentBucket()].incrementAndGet();
    }

    public double getFailureRate() {
        long totalSuccess = 0;
        long totalFailure = 0;
        long now = System.currentTimeMillis();
        long windowStart = now - (bucketCount * bucketDurationMs);
        for (int i = 0; i < bucketCount; i++) {
            if (bucketStartTimes[i].get() >= windowStart) {
                totalSuccess += successCounts[i].get();
                totalFailure += failureCounts[i].get();
            }
        }
        long total = totalSuccess + totalFailure;
        return total == 0 ? 0.0 : (double) totalFailure / total;
    }

    public long getTotalRequests() {
        long total = 0;
        long now = System.currentTimeMillis();
        long windowStart = now - (bucketCount * bucketDurationMs);
        for (int i = 0; i < bucketCount; i++) {
            if (bucketStartTimes[i].get() >= windowStart) {
                total += successCounts[i].get() + failureCounts[i].get();
            }
        }
        return total;
    }
}
```

What Counts as a Failure?
An often-overlooked aspect of circuit breaker design is defining precisely what constitutes a "failure." This is not always obvious:
| Scenario | Should Count as Failure? | Rationale |
|---|---|---|
| HTTP 500 Internal Server Error | Yes | Server-side error; dependency is unhealthy |
| HTTP 503 Service Unavailable | Yes | Service explicitly signaling overload |
| HTTP 429 Too Many Requests | Maybe | Could be caller's fault (rate limit), not dependency health |
| HTTP 400 Bad Request | No | Client error; dependency is healthy |
| HTTP 404 Not Found | Typically No | Resource doesn't exist; not a health indicator |
| Connection Timeout | Yes | Cannot reach dependency |
| Read Timeout | Yes | Dependency too slow to respond |
| Connection Refused | Yes | Dependency is down |
| SSL/TLS Handshake Failure | Depends | Could be temporary or permanent; needs investigation |
Most circuit breaker implementations allow configuring which exceptions or status codes count as failures. Getting this configuration right is crucial—too broad catches normal errors; too narrow misses real failures.
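As a sketch of that configuration surface, a failure-classification policy for HTTP calls might look like the following. The status-code choices mirror the table above (429 is treated here as a caller-side issue), and the class and method names are illustrative:

```java
import java.io.IOException;

/** Illustrative policy: decide whether an outcome should count against the circuit. */
final class FailureClassifier {

    /** Server-side status codes indicate dependency health problems; 4xx do not. */
    static boolean isFailureStatus(int httpStatus) {
        return httpStatus == 500 || httpStatus == 502
            || httpStatus == 503 || httpStatus == 504;
    }

    /**
     * Transport-level exceptions (connection refused, connect/read timeouts) count
     * as failures; business or validation exceptions do not.
     */
    static boolean isFailureException(Throwable t) {
        return t instanceof IOException;
    }
}
```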
A circuit should not trip based on a small number of requests. If you've only made 2 requests and 1 failed, your failure rate is 50%—but that's not statistically significant. Most implementations require a minimum request volume (e.g., 10 requests in the window) before evaluating failure rate. Otherwise, a single failure could trip the circuit inappropriately.
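A sketch of that guard, combining the failure-rate threshold with a minimum request volume; the constants are illustrative and the check reuses the `SlidingWindowCounter` shown earlier:

```java
// Illustrative trip check: only trust the failure rate once there is enough data.
private static final int MINIMUM_CALLS = 10;               // don't evaluate below this volume
private static final double FAILURE_RATE_THRESHOLD = 0.50; // 50%

private boolean shouldTrip(SlidingWindowCounter window) {
    long totalCalls = window.getTotalRequests();
    if (totalCalls < MINIMUM_CALLS) {
        return false; // 1 failure out of 2 calls is not statistically meaningful
    }
    return window.getFailureRate() >= FAILURE_RATE_THRESHOLD;
}
```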
The OPEN state is the circuit breaker's protective mode. When the breaker is open, it's actively preventing requests from reaching the troubled dependency. This is where the cascade-prevention logic is in full effect.
Request Flow in OPEN State
Why Fast Failure Matters
The power of the OPEN state lies in its speed. Rejecting a request in OPEN state typically takes microseconds—far faster than waiting for a timeout:
| Scenario | Time Consumed | Resources Held |
|---|---|---|
| CLOSED: Successful call | 50ms (typical) | Thread/connection for 50ms |
| CLOSED: Failed call with timeout | 30,000ms (30s timeout) | Thread/connection for 30s |
| OPEN: Immediate rejection | 0.1ms | Nearly zero |
The difference is dramatic. Consider a service handling 1000 requests/second:
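Using the illustrative timings from the table above, a rough Little's-law estimate of the concurrency each mode demands:

concurrencyNeeded ≈ requestRate × timeHeldPerRequest
CLOSED, calls failing at the 30 s timeout: 1000 req/s × 30 s = 30,000 threads/connections tied up
OPEN, ~0.1 ms rejections: 1000 req/s × 0.0001 s ≈ 0.1 (effectively zero)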
This is the circuit breaker's fundamental value proposition: transforming slow failures into fast failures.
The Recovery Timeout (Sleep Window)
The OPEN state isn't permanent—that would be useless. The circuit remains open for a configurable duration called the "recovery timeout" or "sleep window" (Hystrix terminology). During this time, no requests reach the dependency; every call is rejected immediately and, ideally, served by a fallback.
Choosing the Recovery Timeout
The recovery timeout is a critical tuning parameter:
| Timeout Length | Pros | Cons |
|---|---|---|
| Too Short (1-5 seconds) | Quick recovery testing; minimal downtime if dependency recovers fast | May hammer dependency before it recovers; may cause oscillation |
| Moderate (10-60 seconds) | Balanced approach; gives dependency reasonable recovery time | Users experience degraded service during this window |
| Too Long (minutes+) | Near certainty dependency is recovered before testing | Extended service degradation even if dependency recovers quickly |
Typical recommendations fall in the moderate range (roughly 10-60 seconds), tuned to how long the dependency usually takes to recover and revisited as you observe real incidents.
What the Application Sees
When the circuit is open, the circuit breaker throws an exception (or returns an error value) immediately. The application must handle this:
```java
public ProductDetails getProductDetails(String productId) {
    try {
        return circuitBreaker.run(
            () -> productService.getDetails(productId),
            // Fallback when circuit is open or call fails
            throwable -> {
                if (throwable instanceof CircuitOpenException) {
                    // Circuit is open - use cached data or default
                    logger.warn("Circuit open for product service, using cached data");
                    return cachedProductDetails.get(productId)
                        .orElse(ProductDetails.placeholder(productId));
                }
                // Other failure - still use fallback
                logger.error("Product service call failed", throwable);
                return ProductDetails.placeholder(productId);
            }
        );
    } catch (Exception e) {
        // Unexpected error
        return ProductDetails.unavailable(productId);
    }
}
```

The quality of your fallback determines user experience during outages. A well-designed fallback (cached data, default values, simpler computation) provides partial functionality. A poor fallback (generic error message) provides no value. Invest in fallback design—it's the difference between 'slightly degraded' and 'completely broken'.
The HALF-OPEN state is perhaps the most interesting from an engineering perspective. It's the circuit breaker's mechanism for automatically detecting that a dependency has recovered, without risking a cascade if it hasn't.
The Problem HALF-OPEN Solves
Without an intermediate state, the circuit breaker faces a dilemma: stay OPEN forever and never recover automatically, or snap straight back to CLOSED after the timeout and risk re-triggering the cascade if the dependency is still down.
HALF-OPEN solves this by allowing controlled probing: a small number of requests are allowed through to test the dependency's health.
Request Flow in HALF-OPEN State
Probe Mechanics
Different circuit breaker implementations handle probing differently:
1. Single Probe Model (Classic)
Allow exactly one request through as a probe. If it succeeds, close the circuit. If it fails, re-open.
Pros: Simple; minimal risk if dependency is still failing. Cons: A single transient failure keeps the circuit open; one unlucky request decides fate.
2. Multi-Probe Model (More Robust)
Allow N requests through. Require M successes (where M ≤ N) to close the circuit. Any failure immediately re-opens.
Pros: More robust against transient failures. Cons: Slightly more complex; more requests reach a potentially-failing dependency.
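A sketch of the multi-probe bookkeeping, assuming N permitted probes and M required successes; the class and field names are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative HALF-OPEN bookkeeping: N permitted probes, M required successes (M <= N).
class HalfOpenProbes {
    private final int permittedProbes;                              // N
    private final int requiredSuccesses;                            // M
    private final AtomicInteger issuedPermits = new AtomicInteger();
    private final AtomicInteger successes = new AtomicInteger();

    HalfOpenProbes(int permittedProbes, int requiredSuccesses) {
        this.permittedProbes = permittedProbes;
        this.requiredSuccesses = requiredSuccesses;
    }

    /** A request may proceed only if it wins one of the N probe slots. */
    boolean tryAcquireProbePermit() {
        return issuedPermits.getAndIncrement() < permittedProbes;
    }

    /** Returns true once enough probes have succeeded to close the circuit. */
    boolean recordSuccessAndShouldClose() {
        return successes.incrementAndGet() >= requiredSuccesses;
    }
    // Any probe failure is reported to the breaker, which re-opens immediately.
}
```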
3. Percentage-Based Model
Allow a percentage of traffic through (e.g., 10%). If failure rate in this traffic is acceptable, gradually increase percentage until circuit is fully closed.
Pros: Gradual warm-up; good for dependencies that need to warm caches. Cons: More complex; longer recovery time.
Example Configuration: Resilience4j
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    // CLOSED state configuration
    .failureRateThreshold(50)                  // Open if 50% of requests fail
    .minimumNumberOfCalls(10)                  // Need at least 10 calls to evaluate
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                     // Evaluate last 20 calls

    // OPEN state configuration
    .waitDurationInOpenState(Duration.ofSeconds(30))   // Recovery timeout

    // HALF-OPEN state configuration
    .permittedNumberOfCallsInHalfOpenState(5)  // Allow 5 probe calls
    .automaticTransitionFromOpenToHalfOpenEnabled(true)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("productService", config);
```

The Waiting Room Problem
When the circuit transitions from OPEN to HALF-OPEN, there might be many requests waiting. What happens to them?
Option 1: Let them wait for probe result The first probe request is sent. Other requests wait (with timeout) for the result. If probe succeeds, waiting requests can proceed. If probe fails, waiting requests are rejected.
Issue: Could create a thundering herd if many requests are waiting.
Option 2: Reject non-probe requests immediately Only probe requests are allowed. All other requests get immediate rejection (same as OPEN state).
Issue: Continued service degradation during probing.
Option 3: Queue with capacity limit Maintain a small queue. If queue is full, reject. Otherwise queue requests until probe resolves.
Issue: Complexity; needs careful timeout management.
Most production implementations use Option 2 for simplicity and predictability.
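A minimal sketch of Option 2, assuming a probe-permit counter like the `HalfOpenProbes` helper above and the `CircuitOpenException` used earlier; non-probe callers are rejected exactly as in the OPEN state:

```java
// Illustrative request gate during HALF-OPEN (Option 2): only probe requests pass.
public <T> T execute(java.util.function.Supplier<T> call) {
    if (state.get() == State.HALF_OPEN && !halfOpenProbes.tryAcquireProbePermit()) {
        throw new CircuitOpenException("Half-open: probe slots exhausted, rejecting call");
    }
    return invokeAndRecord(call); // run the call and feed its outcome back into the breaker
}
```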
HALF-OPEN is designed to be a brief transitional state. A circuit shouldn't stay half-open for extended periods. If you see a circuit stuck in half-open, investigate: Are probes timing out? Is the success threshold misconfigured? Is there a race condition in the implementation?
Understanding the precise conditions for each state transition is critical for debugging and configuration. Let's examine each transition with formal precision.
Transition 1: CLOSED → OPEN (Trip)
This transition occurs when the failure detection mechanism determines the dependency is unhealthy.
| Approach | Transition Condition | Configuration Parameters |
|---|---|---|
| Failure Rate | failureRate ≥ threshold AND totalCalls ≥ minimum | failureRateThreshold, minimumNumberOfCalls |
| Slow Call Rate | slowCallRate ≥ threshold AND totalCalls ≥ minimum | slowCallRateThreshold, slowCallDurationThreshold, minimumNumberOfCalls |
| Consecutive Failures | consecutiveFailures ≥ threshold | consecutiveFailureThreshold |
Transition 2: OPEN → HALF-OPEN (Recovery Test Initiation)
This transition is time-based: after the recovery timeout expires, the circuit transitions to HALF-OPEN.
IF currentTime - lastOpenTime ≥ recoveryTimeout THEN
transition(HALF-OPEN)
Implementation Note: This transition can be timer-driven (a background scheduler flips the state when the recovery timeout expires) or request-triggered (the next incoming call notices that the timeout has elapsed and performs the transition lazily).
Request-triggered is more common as it avoids background threads and ensures the transition only happens when there's demand.
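A sketch of the request-triggered variant: the check happens lazily when a call arrives, and a compare-and-swap guarantees only one caller performs the transition. The field names follow the thread-safety example later on this page, and `recoveryTimeoutMs` is an assumed configuration field:

```java
// Lazy, request-triggered OPEN -> HALF_OPEN transition: no background timer needed.
private boolean allowRequest() {
    if (state.get() == State.OPEN) {
        long elapsed = System.currentTimeMillis() - stateTimestamp.get();
        if (elapsed >= recoveryTimeoutMs
                && state.compareAndSet(State.OPEN, State.HALF_OPEN)) {
            stateTimestamp.set(System.currentTimeMillis());
            return true;  // this caller becomes the first probe
        }
        return false;     // still open, or another thread just won the transition
    }
    return true;          // CLOSED or HALF_OPEN: other admission checks apply
}
```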
Transition 3: HALF-OPEN → CLOSED (Recovery Confirmed)
This transition occurs when probe requests succeed.
| Approach | Transition Condition | Configuration Parameters |
|---|---|---|
| Single Probe | First probe succeeds | N/A (single probe) |
| Multiple Probes | successfulProbes ≥ requiredSuccesses | permittedCallsInHalfOpen, requiredSuccessfulCalls |
| Percentage | probeFailureRate < threshold | permittedCallsInHalfOpen, failureRateThreshold |
Transition 4: HALF-OPEN → OPEN (Recovery Failed)
This transition occurs when probe requests fail, indicating the dependency hasn't recovered.
IF probeOutcome == FAILURE THEN
transition(OPEN)
resetRecoveryTimer()
Important: The recovery timer resets when re-opening. This ensures another full recovery timeout before the next probe attempt.
State transitions must be thread-safe. Consider: Two threads simultaneously evaluate failure thresholds. Both determine the circuit should trip. Without proper synchronization, both might attempt the transition, potentially corrupting state. Use atomic compare-and-swap operations or synchronized blocks for state transitions.
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

public class CircuitBreaker {
    private final AtomicReference<State> state = new AtomicReference<>(State.CLOSED);
    private final AtomicLong stateTimestamp = new AtomicLong(System.currentTimeMillis());

    /**
     * Atomically transitions from expected state to new state.
     * Returns true if transition occurred, false if current state
     * didn't match expected (another thread already transitioned).
     */
    private boolean tryTransition(State expected, State newState) {
        if (state.compareAndSet(expected, newState)) {
            stateTimestamp.set(System.currentTimeMillis());
            notifyStateChangeListeners(expected, newState);
            return true;
        }
        return false;
    }

    private void handleFailure() {
        State currentState = state.get();
        if (currentState == State.CLOSED) {
            recordFailure();
            if (shouldTrip()) {
                // Atomically trip - only one thread succeeds
                tryTransition(State.CLOSED, State.OPEN);
            }
        } else if (currentState == State.HALF_OPEN) {
            // Any failure in half-open re-opens immediately
            tryTransition(State.HALF_OPEN, State.OPEN);
        }
        // OPEN state: nothing to do on failure (already open)
    }

    private void handleSuccess() {
        State currentState = state.get();
        if (currentState == State.CLOSED) {
            recordSuccess();
        } else if (currentState == State.HALF_OPEN) {
            if (recordProbeSuccessAndCheckThreshold()) {
                // Atomically close - only one thread succeeds
                tryTransition(State.HALF_OPEN, State.CLOSED);
            }
        }
        // OPEN state: success shouldn't happen (calls are rejected)
    }
}
```

Transition Diagram with Conditions
The four transitions, summarized with their formal conditions:

| Transition | Condition |
|---|---|
| CLOSED → OPEN | Failure rate (or slow-call rate) ≥ threshold AND total calls ≥ minimum volume |
| OPEN → HALF-OPEN | Time since opening ≥ recovery timeout |
| HALF-OPEN → CLOSED | Required number of probe successes reached |
| HALF-OPEN → OPEN | Any probe failure (recovery timer resets) |
The three-state model is the classic form, but production implementations often extend it with additional states or mechanisms to handle edge cases.
Variation 1: DISABLED State
Some systems include a DISABLED state that allows bypassing the circuit breaker entirely. This is useful for local development and testing, load tests where you want to observe raw dependency behavior, and temporarily bypassing a breaker whose configuration is misbehaving.
Variation 2: FORCED_OPEN State
The inverse of DISABLED: a state that forces the circuit open regardless of dependency health. Useful for planned maintenance windows, deliberately shedding load from a struggling dependency during an incident, or isolating a dependency you already know is unhealthy before the metrics catch up.
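Resilience4j, for example, exposes these modes as manual transitions on the breaker instance; a brief sketch (check the method names against the version you are using):

```java
// Force the circuit open, e.g. during a planned maintenance window.
circuitBreaker.transitionToForcedOpenState();

// Bypass the breaker entirely: calls pass through and metrics no longer trip it.
circuitBreaker.transitionToDisabledState();

// Return to normal, metrics-driven evaluation.
circuitBreaker.transitionToClosedState();
```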
Variation 3: Slow Call Detection
Modern implementations like Resilience4j can trip not just on failures, but on slow calls. A call is considered "slow" if it exceeds a duration threshold.
slowCallRate = slowCalls / totalCalls
IF slowCallRate >= slowCallRateThreshold THEN trip()
This catches degradation before it becomes outright failure—a slow dependency consumes resources just like a failing one.
```java
CircuitBreakerConfig configWithSlowCallDetection = CircuitBreakerConfig.custom()
    // Standard failure rate configuration
    .failureRateThreshold(50)
    .minimumNumberOfCalls(10)

    // Slow call detection
    .slowCallRateThreshold(80)                           // Open if 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))    // "Slow" = >2 seconds

    // Sliding window
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)
    .build();

// Now the circuit will open if EITHER:
// 1. Failure rate >= 50%, OR
// 2. Slow call rate >= 80% (where slow = >2 seconds)
```

Variation 4: Gradual Recovery (Ramp-Up)
Instead of a binary HALF-OPEN → CLOSED transition, some implementations gradually increase traffic: for example, admitting 10% of requests, then 25%, then 50%, and finally all traffic, advancing a step only while the observed failure rate stays acceptable (see the sketch below).
This prevents the "thundering herd" problem when a circuit closes and all backed-up requests flood the recovered dependency.
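One way to sketch such a ramp-up: admit a growing fraction of traffic while the rest keeps using the fallback. The schedule and names below are illustrative, not taken from any particular library:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative ramp-up: admit 10% -> 25% -> 50% -> 100% of traffic,
// advancing only while the observed failure rate stays acceptable.
class RampUpGate {
    private static final double[] STEPS = {0.10, 0.25, 0.50, 1.00};
    private volatile int currentStep = 0;

    boolean allowRequest() {
        return ThreadLocalRandom.current().nextDouble() < STEPS[currentStep];
    }

    void onHealthyEvaluation() {                 // called at each evaluation interval
        if (currentStep < STEPS.length - 1) {
            currentStep++;                       // admit more traffic
        }
    }

    void onUnhealthyEvaluation() {
        currentStep = 0;                         // drop back to the smallest fraction (or re-open)
    }
}
```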
Variation 5: Ignore Exceptions
Some exceptions shouldn't affect circuit state: business and validation errors, "not found" results, and other caller-side problems say nothing about the dependency's health.
Circuit breakers should allow configuring which exception types to ignore:
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .minimumNumberOfCalls(10)

    // These exceptions are recorded as failures
    .recordExceptions(
        IOException.class,
        TimeoutException.class,
        ServiceUnavailableException.class
    )

    // These exceptions are NOT recorded as failures
    // They pass through but don't affect circuit state
    .ignoreExceptions(
        BusinessException.class,
        ValidationException.class,
        NotFoundException.class
    )
    .build();
```

Carefully classifying which exceptions indicate dependency health issues versus application/business logic issues prevents false positives. A circuit tripping because users kept searching for non-existent items is counterproductive.
A circuit breaker's state must be observable. Without visibility, you can't debug issues, tune thresholds, or alert on problems.
```java
@Configuration
public class CircuitBreakerMetricsConfig {

    @Bean
    public CircuitBreakerRegistry circuitBreakerRegistry(MeterRegistry meterRegistry) {
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.ofDefaults();

        // Register metrics for all circuit breakers
        TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(registry)
            .bindTo(meterRegistry);

        return registry;
    }

    // Access metrics for a specific circuit breaker
    public void logCircuitMetrics(CircuitBreaker breaker) {
        CircuitBreaker.Metrics metrics = breaker.getMetrics();

        log.info("Circuit Breaker: {} State: {}", breaker.getName(), breaker.getState());
        log.info("  Failure Rate: {}%", metrics.getFailureRate());
        log.info("  Slow Call Rate: {}%", metrics.getSlowCallRate());
        log.info("  Total Calls: {} (Success: {}, Failed: {}, Not Permitted: {})",
            metrics.getNumberOfBufferedCalls(),
            metrics.getNumberOfSuccessfulCalls(),
            metrics.getNumberOfFailedCalls(),
            metrics.getNumberOfNotPermittedCalls());
    }
}
```

Alerting on State Changes
State changes should trigger alerts (at appropriate severity levels):
| State Change | Alert Severity | Action |
|---|---|---|
| CLOSED → OPEN | Warning / High | Investigate dependency; expect degraded service |
| OPEN → HALF-OPEN | Info | Recovery test in progress |
| HALF-OPEN → CLOSED | Info | Dependency recovered; full service restored |
| HALF-OPEN → OPEN | Warning | Recovery failed; dependency still unhealthy |
| Frequent oscillation | High | Unstable dependency; may need intervention |
Event Logging
Every state transition should be logged with context:
```java
circuitBreaker.getEventPublisher()
    .onStateTransition(event -> {
        log.warn(
            "Circuit breaker '{}' transitioned: {} → {}. " +
            "Metrics: failureRate={}%, slowCallRate={}%, " +
            "bufferedCalls={}, failedCalls={}",
            event.getCircuitBreakerName(),
            event.getStateTransition().getFromState(),
            event.getStateTransition().getToState(),
            circuitBreaker.getMetrics().getFailureRate(),
            circuitBreaker.getMetrics().getSlowCallRate(),
            circuitBreaker.getMetrics().getNumberOfBufferedCalls(),
            circuitBreaker.getMetrics().getNumberOfFailedCalls()
        );
    })
    .onCallNotPermitted(event -> {
        log.debug(
            "Circuit breaker '{}' rejected call (circuit open)",
            event.getCircuitBreakerName()
        );
    })
    .onError(event -> {
        log.debug(
            "Circuit breaker '{}' recorded error: {} (duration: {}ms)",
            event.getCircuitBreakerName(),
            event.getThrowable().getClass().getSimpleName(),
            event.getElapsedDuration().toMillis()
        );
    });
```

We've thoroughly examined the circuit breaker state machine: three states with distinct roles, precise transition conditions, the mechanics of failure counting, and the observability needed to operate it in production.
What's Next
Now that we understand the state machine mechanics, we need to configure it correctly. The next page dives deep into failure thresholds—how to determine the right threshold values, why minimum volume requirements matter, and the mathematical considerations behind window sizing. Misconfigured thresholds lead to circuits that either never trip or trip too aggressively.
You now understand the circuit breaker state machine in detail. You can describe each state's purpose, the conditions for transitions, the mechanics of failure counting, and the importance of observability. Next, we'll explore how to configure failure thresholds effectively.