The circuit breaker pattern is fundamentally a finite state machine—a computational model that responds to events by transitioning between a defined set of states, each with its own behavior. Understanding these states and their transitions is essential for both implementing and tuning circuit breakers effectively.
Unlike many software patterns where implementation details vary widely, circuit breaker state machines follow a well-established structure refined over decades of production use. Netflix's Hystrix library, the progenitor of modern circuit breakers, established conventions that have been adopted by virtually every subsequent implementation.
This page provides an exhaustive examination of circuit breaker states, the conditions that trigger transitions between them, and the nuanced timing mechanisms that control recovery behavior.
By the end of this page, you will understand each circuit breaker state in detail, master the transition logic that moves between states, comprehend the timing mechanisms that control recovery probes, and recognize the edge cases that can cause unexpected behavior.
A circuit breaker operates in exactly three states. Each state defines distinct behavior for incoming requests, distinct metrics tracking, and distinct transition eligibility.
State 1: CLOSED (Normal Operation)
The CLOSED state represents normal, healthy operation. The term 'closed' comes from electrical circuits—a closed circuit allows current to flow, just as a closed software circuit allows requests to flow.
In the CLOSED state:
- All requests pass through to the downstream service.
- Every outcome (success, failure, slow call) is recorded in the sliding window.
- Failure and slow-call rates are continually evaluated against configured thresholds.
- If a threshold is exceeded, the circuit transitions to OPEN.
The CLOSED state is the 'steady state' for healthy systems. A well-designed system should spend 99%+ of its time in CLOSED across all circuit breakers.
State 2: OPEN (Protecting the System)
The OPEN state activates when failures exceed tolerance. An open circuit stops current flow—all requests fail immediately without attempting the downstream call.
In the OPEN state:
- All requests fail immediately without calling the downstream service (fail fast).
- Callers receive an exception or a configured fallback response.
- No new downstream outcomes are recorded, since no downstream calls are made.
- A timer counts down the wait duration before the transition to HALF-OPEN.
The OPEN state is the protection mechanism. Its purpose is to preserve resources by not waiting for inevitable failures.
State 3: HALF-OPEN (Testing Recovery)
The HALF-OPEN state is a transitional state for testing whether the downstream service has recovered. It allows a limited number of probe requests through while deciding whether to fully close or re-open the circuit.
In the HALF-OPEN state:
- A limited number of probe requests are allowed through to the downstream service.
- All other requests fail fast, as in the OPEN state.
- Probe outcomes determine the next transition: success closes the circuit, failure re-opens it.
HALF-OPEN embodies the principle of 'trust but verify.' Rather than immediately resuming full traffic (which could overwhelm a recovering service) or waiting indefinitely (which unnecessarily prolongs degradation), it takes a measured, cautious approach to recovery.
Understanding precisely what triggers each state transition is critical for both configuration and debugging. Let's examine each transition in detail.
Transition: CLOSED → OPEN (Tripping the Circuit)
This is the most consequential transition—it activates protection. The transition occurs when failure metrics exceed configured thresholds.
Failure rate threshold approach: The circuit trips when the percentage of failed requests exceeds a threshold. For example:
Failure count threshold approach: Alternatively, the circuit trips after a fixed number of failures:
Slow call rate threshold approach: Some implementations also consider latency:
```typescript
// Resilience4j-style configuration for CLOSED → OPEN transition
const circuitBreakerConfig = {
  // Failure rate threshold (percentage)
  failureRateThreshold: 50, // Trip if 50%+ requests fail

  // Minimum number of calls before calculating failure rate
  minimumNumberOfCalls: 10, // Wait for at least 10 calls before evaluating

  // Slow call configuration
  slowCallDurationThreshold: 3000, // 3 seconds = "slow"
  slowCallRateThreshold: 80, // Trip if 80%+ calls are slow

  // Sliding window configuration
  slidingWindowType: 'COUNT_BASED', // or 'TIME_BASED'
  slidingWindowSize: 100, // Last 100 calls (or 100 seconds if time-based)
};

// Transition logic pseudocode:
function evaluateTransition(metrics: CircuitMetrics): State {
  if (metrics.totalCalls < circuitBreakerConfig.minimumNumberOfCalls) {
    return State.CLOSED; // Not enough data to evaluate
  }

  const failureRate = (metrics.failures / metrics.totalCalls) * 100;
  const slowCallRate = (metrics.slowCalls / metrics.totalCalls) * 100;

  if (failureRate >= circuitBreakerConfig.failureRateThreshold) {
    return State.OPEN; // Too many failures
  }
  if (slowCallRate >= circuitBreakerConfig.slowCallRateThreshold) {
    return State.OPEN; // Too many slow calls
  }
  return State.CLOSED; // All thresholds within limits
}
```

Transition: OPEN → HALF-OPEN (Testing Recovery)
This transition is entirely time-based. After the circuit opens, a timer begins. When the timer expires, the circuit automatically moves to HALF-OPEN to test recovery.
The wait duration is a critical tuning parameter:
- Too short, and probes hit a service that has not yet recovered, causing the circuit to flap between OPEN and HALF-OPEN.
- Too long, and traffic stays degraded well after the downstream service is healthy again.
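As a concrete illustration, the OPEN → HALF-OPEN check reduces to comparing elapsed time against the wait duration. This is a minimal sketch with hypothetical names (`OpenStateTimer`, `markOpened`), not any particular library's API; the optional jitter keeps a fleet of instances from all probing at the same instant.

```typescript
// Minimal sketch: time-based OPEN → HALF_OPEN transition (hypothetical names)
class OpenStateTimer {
  private openedAt: number = 0;

  constructor(
    private waitDurationMs: number,  // base time to remain OPEN
    private jitterMs: number = 0,    // optional random jitter to de-synchronize probes
  ) {}

  // Call when the circuit trips to OPEN
  markOpened(now: number = Date.now()): void {
    this.openedAt = now;
  }

  // Call on each request while OPEN: true once the wait has elapsed
  shouldTransitionToHalfOpen(now: number = Date.now()): boolean {
    const jitter = this.jitterMs > 0 ? Math.random() * this.jitterMs : 0;
    return now - this.openedAt >= this.waitDurationMs + jitter;
  }
}

// Usage: with a 30s wait, requests fail fast for 30s, then probing begins
const timer = new OpenStateTimer(30_000);
timer.markOpened(0);                                   // circuit opened at t=0
console.log(timer.shouldTransitionToHalfOpen(10_000)); // false - still waiting
console.log(timer.shouldTransitionToHalfOpen(31_000)); // true - time to probe
```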
Transition: HALF-OPEN → CLOSED (Recovery Confirmed)
This transition occurs when probe requests succeed, indicating the downstream service has recovered.
Approaches to evaluating probes:
Single probe success: a single successful probe closes the circuit. Recovery is fast, but one lucky request can close the circuit prematurely.
Minimum successful probes: the circuit closes only after N probes succeed. This demands stronger evidence of recovery at the cost of a slower return to normal.
Sliding window on probes: a failure-rate threshold is applied to the probe results, tolerating occasional probe failures instead of re-opening on the first one.
```typescript
// Half-open configuration
const halfOpenConfig = {
  permittedNumberOfCallsInHalfOpenState: 5, // Allow 5 probe requests

  // Option 1: Single failure re-opens
  failImmediatelyOnProbeFailure: true,

  // Option 2: Apply threshold to probes
  failureThresholdForProbes: 50, // If 50%+ probes fail, re-open
};

function evaluateHalfOpenProbes(probeResults: boolean[]): State {
  const totalProbes = probeResults.length;
  const failures = probeResults.filter(r => !r).length;

  if (halfOpenConfig.failImmediatelyOnProbeFailure && failures > 0) {
    return State.OPEN; // Any failure → re-open immediately
  }

  if (totalProbes >= halfOpenConfig.permittedNumberOfCallsInHalfOpenState) {
    const failureRate = (failures / totalProbes) * 100;
    if (failureRate >= halfOpenConfig.failureThresholdForProbes) {
      return State.OPEN; // Too many probe failures
    }
    return State.CLOSED; // Probes successful, close circuit
  }

  return State.HALF_OPEN; // Still collecting probe results
}
```

Transition: HALF-OPEN → OPEN (Recovery Failed)
If probe requests fail, the circuit returns to the OPEN state to continue protecting the system.
The accuracy of circuit breaker decisions depends heavily on how failures are counted. Most implementations use sliding window algorithms to track recent outcomes.
Count-Based Sliding Window
A count-based window tracks the last N requests, regardless of when they occurred.
Time-Based Sliding Window
A time-based window tracks all requests within a time period (e.g., the last 60 seconds).
Implementation Detail: Ring Buffer
Most efficient implementations use a ring buffer (circular buffer) to track outcomes:
```typescript
// Simplified count-based sliding window using ring buffer
class CountBasedSlidingWindow {
  private buffer: Array<'success' | 'failure' | null>;
  private position: number = 0;
  private successCount: number = 0;
  private failureCount: number = 0;
  private filled: boolean = false;

  constructor(private size: number) {
    this.buffer = new Array(size).fill(null);
  }

  record(outcome: 'success' | 'failure'): void {
    const previous = this.buffer[this.position];

    // Evict the outcome being overwritten from the running counts
    if (previous === 'success') this.successCount--;
    if (previous === 'failure') this.failureCount--;

    this.buffer[this.position] = outcome;
    if (outcome === 'success') this.successCount++;
    if (outcome === 'failure') this.failureCount++;

    this.position = (this.position + 1) % this.size;
    if (this.position === 0) this.filled = true;
  }

  getFailureRate(): number {
    const total = this.filled ? this.size : this.position;
    if (total === 0) return 0;
    return (this.failureCount / total) * 100;
  }

  getTotalCount(): number {
    return this.filled ? this.size : this.position;
  }
}

// Usage:
const slidingWindow = new CountBasedSlidingWindow(100);
slidingWindow.record('success');
slidingWindow.record('failure');
slidingWindow.record('success');
console.log(slidingWindow.getFailureRate()); // ~33.33%
```

Aggregated Time-Based Windows
For time-based windows with high traffic, implementations often use bucketed aggregation: the window is divided into fixed-duration buckets (for example, sixty 1-second buckets for a 60-second window), and each bucket stores aggregate success and failure counts rather than individual call records.
This approach uses O(window_duration / bucket_size) memory regardless of traffic volume.
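A bucketed window can be sketched as follows. This is an assumed design for illustration (`BucketedTimeWindow` and its method names are hypothetical), similar in spirit to Hystrix's rolling-number buckets; a production version would also expire buckets older than the window before aggregating.

```typescript
// Sketch: time-based sliding window using fixed-duration buckets
class BucketedTimeWindow {
  private buckets: { failures: number; total: number }[];

  constructor(
    private windowMs: number, // e.g. 60_000 for a 60-second window
    private bucketMs: number, // e.g. 1_000 for 1-second buckets
  ) {
    const count = Math.ceil(windowMs / bucketMs);
    this.buckets = Array.from({ length: count }, () => ({ failures: 0, total: 0 }));
  }

  // Map a timestamp to its bucket; buckets are reused circularly
  private bucketIndex(now: number): number {
    return Math.floor(now / this.bucketMs) % this.buckets.length;
  }

  record(success: boolean, now: number = Date.now()): void {
    const b = this.buckets[this.bucketIndex(now)];
    b.total++;
    if (!success) b.failures++;
  }

  // Aggregate failure rate across all buckets
  getFailureRate(): number {
    let failures = 0, total = 0;
    for (const b of this.buckets) { failures += b.failures; total += b.total; }
    return total === 0 ? 0 : (failures / total) * 100;
  }
}

// Usage: 60 one-second buckets, fixed memory regardless of request volume
const w = new BucketedTimeWindow(60_000, 1_000);
w.record(true, 0);
w.record(false, 500);  // lands in the same bucket as t=0
w.record(true, 1_500); // lands in the next bucket
console.log(w.getFailureRate().toFixed(2)); // "33.33"
```

Here the memory footprint is exactly 60 bucket structs, illustrating the O(window_duration / bucket_size) bound above.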
| Aspect | Count-Based | Time-Based |
|---|---|---|
| Window definition | Last N requests | Last T seconds |
| Data freshness | Variable (depends on traffic) | Consistent (bounded by time) |
| Memory usage | Fixed (N entries) | Variable (depends on traffic) |
| Low traffic behavior | Stale data persists | Old data naturally expires |
| High traffic behavior | Always fresh | May require bucketing |
| Best for | Consistent traffic patterns | Variable traffic patterns |
For most production systems, time-based windows are preferable. They ensure that circuit breaker decisions are based on recent behavior, which is especially important for services with variable traffic patterns (batch jobs, daily cycles, etc.).
One of the most commonly overlooked configuration parameters is the minimum number of calls threshold. This parameter prevents the circuit from tripping based on statistically insignificant data.
The problem it solves:
Imagine a circuit with a 50% failure threshold. If only 2 requests have been made and 1 failed, the failure rate is 50%—technically exceeding the threshold. Should the circuit open?
Almost certainly not. A single failure out of two requests might be:
- a transient network blip,
- one unlucky timeout on an otherwise healthy service, or
- ordinary statistical noise rather than a real outage.
The minimum calls threshold prevents premature tripping:
```typescript
// The minimum calls threshold in action
const config = {
  failureRateThreshold: 50, // 50% failure rate triggers open
  minimumNumberOfCalls: 10, // But only after at least 10 calls
  slidingWindowSize: 100,   // Evaluate over last 100 calls
};

function shouldOpenCircuit(metrics: Metrics): boolean {
  // Check 1: Do we have enough data?
  if (metrics.totalCalls < config.minimumNumberOfCalls) {
    return false; // NOT ENOUGH DATA - don't evaluate yet
  }

  // Check 2: Does failure rate exceed threshold?
  const failureRate = (metrics.failures / metrics.totalCalls) * 100;
  return failureRate >= config.failureRateThreshold;
}

// Examples:
// 2 calls, 1 failure (50% rate)   → Don't open (only 2 calls < 10 minimum)
// 5 calls, 3 failures (60% rate)  → Don't open (only 5 calls < 10 minimum)
// 10 calls, 4 failures (40% rate) → Don't open (40% < 50% threshold)
// 10 calls, 5 failures (50% rate) → OPEN (threshold met, enough data)
```

Tuning the minimum calls threshold:
The right value depends on your traffic patterns and failure tolerance:
| Traffic Level | Recommended Minimum | Reasoning |
|---|---|---|
| Low (< 100/min) | 5-10 calls | Small sample size; want quick detection |
| Medium (100-1000/min) | 10-20 calls | Balance between speed and accuracy |
| High (> 1000/min) | 20-50 calls | Plenty of data; prioritize accuracy |
| Critical path | Lower values | Faster protection, accept more false positives |
| Non-critical path | Higher values | Avoid unnecessary open circuits |
Setting minimum calls too high for low-traffic services means the circuit might never evaluate failure rates. If minimum calls is 100 but the service only handles 50 requests per hour, the circuit will never consider opening—even during a real outage.
Interaction with sliding window:
The minimum calls threshold works in conjunction with the sliding window:
If your sliding window size is 100 (count-based) and minimum calls is 20, evaluation begins after 20 calls and continues until the window is full. Once full, the oldest call is evicted as new calls arrive, maintaining exactly 100 calls in the evaluation set.
The timing of state transitions introduces subtle behaviors that affect system dynamics. Understanding these nuances is essential for advanced circuit breaker tuning.
Wait Duration in Open State
The wait duration (also called 'open state duration' or 'sleep window') controls how long the circuit stays open before attempting recovery.
Key considerations:
- Shorter waits detect recovery sooner but risk probing a service that is still failing.
- Longer waits give the downstream service room to recover but prolong degradation for callers.
- Adding random jitter to the wait prevents many instances from probing at the same instant.
Typical wait duration recommendations:
| Failure Type | Recommended Wait | Rationale |
|---|---|---|
| Network blip | 10-30 seconds | Transient issues resolve quickly |
| Service restart | 30-60 seconds | Container/process restart time |
| Database failover | 60-180 seconds | Primary/replica promotion time |
| Deployment failure | 120-300 seconds | Rollback or fix deployment time |
| External provider outage | 300-600 seconds | Third-party recovery time varies |
Half-Open Probe Limiting
In the HALF-OPEN state, not all requests become probes. The circuit limits how many probe requests are allowed to prevent overwhelming a recovering service.
Probe limiting strategies:
First-N approach: The first N requests after entering half-open become probes; subsequent requests fail fast until transition occurs
Single probe approach: Only one request is allowed through; all others fail fast until that probe completes
Rate-limited probes: Probes are allowed at a fixed rate (e.g., 1 per second) regardless of incoming traffic
Most implementations use the First-N approach with N between 3 and 10.
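The rate-limited strategy described above can be sketched with a simple timestamp check. This is a hedged illustration; `RateLimitedProber` and its method names are hypothetical, not a library API.

```typescript
// Sketch: rate-limited probing in HALF_OPEN - at most one probe per interval
class RateLimitedProber {
  private lastProbeAt: number = -Infinity;

  constructor(private minIntervalMs: number) {} // e.g. 1_000 = 1 probe/second

  // Returns true if this request may be used as a probe; callers that
  // get false should fail fast, as if the circuit were still OPEN
  tryAcquireProbe(now: number = Date.now()): boolean {
    if (now - this.lastProbeAt >= this.minIntervalMs) {
      this.lastProbeAt = now;
      return true;
    }
    return false;
  }
}

// Usage: with 1s spacing, a burst of requests collapses to one probe per second
const prober = new RateLimitedProber(1_000);
console.log(prober.tryAcquireProbe(0));     // true  - first probe allowed
console.log(prober.tryAcquireProbe(200));   // false - too soon, fail fast
console.log(prober.tryAcquireProbe(1_200)); // true  - interval elapsed
```

Unlike the First-N approach, this bounds probe load on the recovering service per unit time rather than per half-open episode.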
```typescript
// Half-open state with probe limiting
class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private halfOpenProbesAllowed: number = 5;
  private halfOpenProbeCount: number = 0;
  private halfOpenProbeResults: boolean[] = [];

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    switch (this.state) {
      case 'CLOSED':
        return this.executeWithTracking(fn);

      case 'OPEN':
        if (this.shouldTransitionToHalfOpen()) {
          this.state = 'HALF_OPEN';
          this.halfOpenProbeCount = 0;
          this.halfOpenProbeResults = [];
          return this.executeAsProbe(fn);
        }
        throw new CircuitBreakerOpenException('Circuit is OPEN');

      case 'HALF_OPEN':
        if (this.halfOpenProbeCount < this.halfOpenProbesAllowed) {
          return this.executeAsProbe(fn);
        }
        // No more probes allowed - fail fast
        throw new CircuitBreakerOpenException('Circuit is HALF_OPEN, max probes reached');
    }
  }

  private async executeAsProbe<T>(fn: () => Promise<T>): Promise<T> {
    this.halfOpenProbeCount++;
    try {
      const result = await fn();
      this.halfOpenProbeResults.push(true); // Success
      this.evaluateHalfOpenState();
      return result;
    } catch (error) {
      this.halfOpenProbeResults.push(false); // Failure
      this.evaluateHalfOpenState();
      throw error;
    }
  }

  private evaluateHalfOpenState(): void {
    const failures = this.halfOpenProbeResults.filter(r => !r).length;

    if (failures > 0) {
      // Any failure → back to OPEN
      this.state = 'OPEN';
      this.resetOpenStateTimer();
      return;
    }

    if (this.halfOpenProbeResults.length >= this.halfOpenProbesAllowed) {
      // All probes succeeded → CLOSED
      this.state = 'CLOSED';
      this.resetMetrics();
    }
  }
}
```

Circuit breakers operate in concurrent, high-throughput environments where edge cases and race conditions can cause unexpected behavior. Understanding these edge cases is essential for reliable implementations.
Edge Case 1: Concurrent Probes in Half-Open
When the circuit transitions to HALF-OPEN, multiple threads might simultaneously see the new state and attempt to execute probes.
Problem: If 100 threads enter half-open simultaneously and all become probes, you've just sent 100 requests to a potentially recovering service—defeating the purpose of limited probing.
Solution: Use atomic counters or locks to limit probe count:
```typescript
// Thread-safe probe limiting using atomic operations
class ThreadSafeCircuitBreaker {
  private halfOpenProbeCount: AtomicInteger = new AtomicInteger(0);
  private halfOpenMaxProbes: number = 5;

  async executeInHalfOpen<T>(fn: () => Promise<T>): Promise<T> {
    // Atomically try to acquire a probe slot
    const myProbeNumber = this.halfOpenProbeCount.incrementAndGet();

    if (myProbeNumber > this.halfOpenMaxProbes) {
      // We didn't get a probe slot - fail fast
      throw new CircuitBreakerOpenException('Probe limit reached');
    }

    // We got a probe slot - execute the request
    return fn();
  }
}

// Alternative: Use a semaphore
class SemaphoreCircuitBreaker {
  private probeSemaphore = new Semaphore(5); // Max 5 probes

  async executeInHalfOpen<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.probeSemaphore.tryAcquire()) {
      throw new CircuitBreakerOpenException('Probe limit reached');
    }
    try {
      return await fn();
    } finally {
      this.probeSemaphore.release();
    }
  }
}
```

Edge Case 2: Transition Race Conditions
Two threads might evaluate transition conditions simultaneously, both seeing the same metrics and both attempting to trigger a transition.
Problem: Multiple transition attempts could cause:
- duplicate open-state timers running concurrently,
- double-counted transition metrics and duplicate alerts, or
- inconsistent internal state if transition side effects run twice.
Solution: Use compare-and-swap (CAS) for state transitions:
```typescript
// Compare-and-swap state transitions
class CASCircuitBreaker {
  private state: AtomicReference<State> = new AtomicReference(State.CLOSED);

  private tryTransition(from: State, to: State): boolean {
    // Atomically transition only if current state matches expected
    return this.state.compareAndSet(from, to);
  }

  private evaluateAndMaybeOpen(): void {
    if (this.shouldTrip() && this.state.get() === State.CLOSED) {
      // Only one thread will succeed in this transition
      if (this.tryTransition(State.CLOSED, State.OPEN)) {
        this.startOpenTimer(); // Only the winning thread starts timer
        this.recordStateChange('CLOSED', 'OPEN');
      }
      // Losing threads silently continue - state already transitioned
    }
  }
}
```

Edge Case 3: In-Flight Requests During Transition
When a circuit transitions from CLOSED to OPEN, there may be requests already in-flight to the downstream service.
Problem: These in-flight requests might:
- complete (successfully or not) after the circuit has already opened, leaving it unclear whether their outcomes should still be recorded, or
- continue holding connections, threads, and other resources that the open circuit was meant to free.
Solution: A common approach is to let in-flight requests run to completion and record their outcomes in the sliding window, while new requests fail fast under the OPEN state; stricter designs cancel in-flight calls or bound them with a timeout so they cannot linger.
Most production circuit breaker libraries (Resilience4j, Polly, Hystrix) handle these edge cases internally. When implementing custom circuit breakers or debugging issues, awareness of these concurrency challenges helps diagnose unexpected behavior.
Understanding circuit state is crucial for debugging production issues. Effective visualization and logging practices make state transitions observable.
Essential Metrics to Expose:
```typescript
// Comprehensive circuit breaker metrics
interface CircuitBreakerMetrics {
  // State information
  state: 'CLOSED' | 'OPEN' | 'HALF_OPEN';
  stateChangedAt: Date;
  timeInState: number; // milliseconds

  // Rate metrics (from sliding window)
  failureRate: number;  // percentage
  slowCallRate: number; // percentage

  // Counts
  bufferedCalls: number;
  successfulCalls: number;
  failedCalls: number;
  slowCalls: number;
  notPermittedCalls: number; // rejected by open circuit

  // Transition counters (since startup)
  stateTransitions: {
    closedToOpen: number;
    openToHalfOpen: number;
    halfOpenToClosed: number;
    halfOpenToOpen: number;
  };
}

// Prometheus metrics example
const circuitBreakerState = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state (0=closed, 1=open, 2=half_open)',
  labelNames: ['circuit_name', 'downstream_service'],
});

const circuitBreakerFailureRate = new Gauge({
  name: 'circuit_breaker_failure_rate',
  help: 'Current failure rate in the sliding window',
  labelNames: ['circuit_name'],
});

const circuitBreakerTransitions = new Counter({
  name: 'circuit_breaker_state_transitions_total',
  help: 'Total number of state transitions',
  labelNames: ['circuit_name', 'from_state', 'to_state'],
});
```

Logging State Transitions:
State transitions are significant events that should be logged at a high level (WARN or INFO). The log should include:
```typescript
// Structured logging for state transitions
class LoggingCircuitBreaker {
  private logger = getLogger('circuit-breaker');

  private onStateTransition(from: State, to: State, metrics: Metrics): void {
    this.logger.warn({
      event: 'circuit_breaker_transition',
      circuit_name: this.name,
      downstream: this.downstreamService,
      from_state: from,
      to_state: to,
      trigger: this.getTransitionTrigger(from, to, metrics),
      metrics: {
        failure_rate: metrics.failureRate,
        slow_call_rate: metrics.slowCallRate,
        buffered_calls: metrics.bufferedCalls,
        failures_in_window: metrics.failures,
        time_in_previous_state_ms: Date.now() - this.stateChangedAt.getTime(),
      },
    });

    // Alert on OPEN transitions
    if (to === State.OPEN) {
      this.alerting.fire({
        severity: 'warning',
        title: `Circuit breaker OPEN: ${this.name}`,
        message: `Circuit to ${this.downstreamService} opened due to ${metrics.failureRate}% failure rate`,
      });
    }
  }
}
```

Create a dedicated circuit breaker dashboard showing all circuits in your system, their current states, and recent transition history. During incidents, this dashboard provides immediate visibility into which fault boundaries are active and how the system is self-protecting.
We've comprehensively examined the state machine that powers circuit breakers, understanding not just what each state does, but why and how transitions occur.
What's next:
With a solid understanding of states and transitions, the next page explores configuration parameters in depth—the specific thresholds, timeouts, and limits that tune circuit breaker behavior for different operational needs.
You now have a deep understanding of the circuit breaker state machine. This knowledge is foundational for effective configuration and debugging. You can explain why a circuit is in any given state, predict when transitions will occur, and diagnose unexpected behavior.