The circuit breaker pattern is fundamentally a finite state machine—a computational model that responds to events by transitioning between a defined set of states, each with its own behavior. Understanding these states and their transitions is essential for both implementing and tuning circuit breakers effectively.
Unlike many software patterns where implementation details vary widely, circuit breaker state machines follow a well-established structure refined over decades of production use. Netflix's Hystrix library, the progenitor of modern circuit breakers, established conventions that have been adopted by virtually every subsequent implementation.
This page provides an exhaustive examination of circuit breaker states, the conditions that trigger transitions between them, and the nuanced timing mechanisms that control recovery behavior.
By the end of this page, you will understand each circuit breaker state in detail, master the transition logic that moves between states, comprehend the timing mechanisms that control recovery probes, and recognize the edge cases that can cause unexpected behavior.
A circuit breaker operates in exactly three states. Each state defines distinct behavior for incoming requests, distinct metrics tracking, and distinct transition eligibility.
State 1: CLOSED (Normal Operation)
The CLOSED state represents normal, healthy operation. The term 'closed' comes from electrical circuits—a closed circuit allows current to flow, just as a closed software circuit allows requests to flow.
In the CLOSED state:
- All requests pass through to the downstream service.
- Every outcome (success, failure, slow call) is recorded in the sliding window.
- Failure and slow-call rates are continually evaluated against configured thresholds.
- If a threshold is exceeded, the circuit transitions to OPEN.
The CLOSED state is the 'steady state' for healthy systems. A well-designed system should spend 99%+ of its time in CLOSED across all circuit breakers.
State 2: OPEN (Protecting the System)
The OPEN state activates when failures exceed tolerance. An open circuit stops current flow—all requests fail immediately without attempting the downstream call.
In the OPEN state:
- All requests fail immediately without calling the downstream service (fail fast).
- Callers receive an exception or a configured fallback response.
- No new downstream outcomes are recorded, since no downstream calls are made.
- A timer counts down the wait duration before the transition to HALF-OPEN.
The OPEN state is the protection mechanism. Its purpose is to preserve resources by not waiting for inevitable failures.
State 3: HALF-OPEN (Testing Recovery)
The HALF-OPEN state is a transitional state for testing whether the downstream service has recovered. It allows a limited number of probe requests through while deciding whether to fully close or re-open the circuit.
In the HALF-OPEN state:
- A limited number of probe requests are allowed through to the downstream service.
- All other requests fail fast, as in the OPEN state.
- Probe outcomes determine the next transition: success closes the circuit, failure re-opens it.
HALF-OPEN embodies the principle of 'trust but verify.' Rather than immediately resuming full traffic (which could overwhelm a recovering service) or waiting indefinitely (which unnecessarily prolongs degradation), it takes a measured, cautious approach to recovery.
Understanding precisely what triggers each state transition is critical for both configuration and debugging. Let's examine each transition in detail.
Transition: CLOSED → OPEN (Tripping the Circuit)
This is the most consequential transition—it activates protection. The transition occurs when failure metrics exceed configured thresholds.
Failure rate threshold approach: The circuit trips when the percentage of failed requests exceeds a threshold. For example:
Failure count threshold approach: Alternatively, the circuit trips after a fixed number of failures:
Slow call rate threshold approach: Some implementations also consider latency:
```typescript
// Resilience4j-style configuration for CLOSED → OPEN transition
const circuitBreakerConfig = {
  // Failure rate threshold (percentage)
  failureRateThreshold: 50, // Trip if 50%+ requests fail

  // Minimum number of calls before calculating failure rate
  minimumNumberOfCalls: 10, // Wait for at least 10 calls before evaluating

  // Slow call configuration
  slowCallDurationThreshold: 3000, // 3 seconds = "slow"
  slowCallRateThreshold: 80, // Trip if 80%+ calls are slow

  // Sliding window configuration
  slidingWindowType: 'COUNT_BASED', // or 'TIME_BASED'
  slidingWindowSize: 100, // Last 100 calls (or 100 seconds if time-based)
};

// Transition logic pseudocode:
function evaluateTransition(metrics: CircuitMetrics): State {
  if (metrics.totalCalls < circuitBreakerConfig.minimumNumberOfCalls) {
    return State.CLOSED; // Not enough data to evaluate
  }

  const failureRate = (metrics.failures / metrics.totalCalls) * 100;
  const slowCallRate = (metrics.slowCalls / metrics.totalCalls) * 100;

  if (failureRate >= circuitBreakerConfig.failureRateThreshold) {
    return State.OPEN; // Too many failures
  }
  if (slowCallRate >= circuitBreakerConfig.slowCallRateThreshold) {
    return State.OPEN; // Too many slow calls
  }
  return State.CLOSED; // All thresholds within limits
}
```

Transition: OPEN → HALF-OPEN (Testing Recovery)
This transition is entirely time-based. After the circuit opens, a timer begins. When the timer expires, the circuit automatically moves to HALF-OPEN to test recovery.
The wait duration is a critical tuning parameter:
- Too short, and probes hit a service that has not yet recovered, causing the circuit to flap between OPEN and HALF-OPEN.
- Too long, and traffic stays degraded well after the downstream service is healthy again.
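As a concrete illustration, the OPEN → HALF-OPEN check reduces to comparing elapsed time against the wait duration. This is a minimal sketch with hypothetical names (`OpenStateTimer`, `markOpened`), not any particular library's API; the optional jitter keeps a fleet of instances from all probing at the same instant.

```typescript
// Minimal sketch: time-based OPEN → HALF_OPEN transition (hypothetical names)
class OpenStateTimer {
  private openedAt: number = 0;

  constructor(
    private waitDurationMs: number,  // base time to remain OPEN
    private jitterMs: number = 0,    // optional random jitter to de-synchronize probes
  ) {}

  // Call when the circuit trips to OPEN
  markOpened(now: number = Date.now()): void {
    this.openedAt = now;
  }

  // Call on each request while OPEN: true once the wait has elapsed
  shouldTransitionToHalfOpen(now: number = Date.now()): boolean {
    const jitter = this.jitterMs > 0 ? Math.random() * this.jitterMs : 0;
    return now - this.openedAt >= this.waitDurationMs + jitter;
  }
}

// Usage: with a 30s wait, requests fail fast for 30s, then probing begins
const timer = new OpenStateTimer(30_000);
timer.markOpened(0);                                   // circuit opened at t=0
console.log(timer.shouldTransitionToHalfOpen(10_000)); // false - still waiting
console.log(timer.shouldTransitionToHalfOpen(31_000)); // true - time to probe
```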
Transition: HALF-OPEN → CLOSED (Recovery Confirmed)
This transition occurs when probe requests succeed, indicating the downstream service has recovered.
Approaches to evaluating probes:
Single probe success: a single successful probe closes the circuit. Recovery is fast, but one lucky request can close the circuit prematurely.
Minimum successful probes: the circuit closes only after N probes succeed. This demands stronger evidence of recovery at the cost of a slower return to normal.
Sliding window on probes: a failure-rate threshold is applied to the probe results, tolerating occasional probe failures instead of re-opening on the first one.
```typescript
// Half-open configuration
const halfOpenConfig = {
  permittedNumberOfCallsInHalfOpenState: 5, // Allow 5 probe requests

  // Option 1: Single failure re-opens
  failImmediatelyOnProbeFailure: true,

  // Option 2: Apply threshold to probes
  failureThresholdForProbes: 50, // If 50%+ probes fail, re-open
};

function evaluateHalfOpenProbes(probeResults: boolean[]): State {
  const totalProbes = probeResults.length;
  const failures = probeResults.filter(r => !r).length;

  if (halfOpenConfig.failImmediatelyOnProbeFailure && failures > 0) {
    return State.OPEN; // Any failure → re-open immediately
  }

  if (totalProbes >= halfOpenConfig.permittedNumberOfCallsInHalfOpenState) {
    const failureRate = (failures / totalProbes) * 100;
    if (failureRate >= halfOpenConfig.failureThresholdForProbes) {
      return State.OPEN; // Too many probe failures
    }
    return State.CLOSED; // Probes successful, close circuit
  }

  return State.HALF_OPEN; // Still collecting probe results
}
```

Transition: HALF-OPEN → OPEN (Recovery Failed)
If probe requests fail, the circuit returns to the OPEN state to continue protecting the system.
The accuracy of circuit breaker decisions depends heavily on how failures are counted. Most implementations use sliding window algorithms to track recent outcomes.
Count-Based Sliding Window
A count-based window tracks the last N requests, regardless of when they occurred.
Time-Based Sliding Window
A time-based window tracks all requests within a time period (e.g., the last 60 seconds).
Implementation Detail: Ring Buffer
Most efficient implementations use a ring buffer (circular buffer) to track outcomes:
```typescript
// Simplified count-based sliding window using ring buffer
class CountBasedSlidingWindow {
  private buffer: Array<'success' | 'failure' | null>;
  private position: number = 0;
  private successCount: number = 0;
  private failureCount: number = 0;
  private filled: boolean = false;

  constructor(private size: number) {
    this.buffer = new Array(size).fill(null);
  }

  record(outcome: 'success' | 'failure'): void {
    const previous = this.buffer[this.position];

    // Evict the outcome being overwritten from the running counts
    if (previous === 'success') this.successCount--;
    if (previous === 'failure') this.failureCount--;

    this.buffer[this.position] = outcome;
    if (outcome === 'success') this.successCount++;
    if (outcome === 'failure') this.failureCount++;

    this.position = (this.position + 1) % this.size;
    if (this.position === 0) this.filled = true;
  }

  getFailureRate(): number {
    const total = this.filled ? this.size : this.position;
    if (total === 0) return 0;
    return (this.failureCount / total) * 100;
  }

  getTotalCount(): number {
    return this.filled ? this.size : this.position;
  }
}

// Usage:
const slidingWindow = new CountBasedSlidingWindow(100);
slidingWindow.record('success');
slidingWindow.record('failure');
slidingWindow.record('success');
console.log(slidingWindow.getFailureRate()); // ~33.33%
```

Aggregated Time-Based Windows
For time-based windows with high traffic, implementations often use bucketed aggregation: the window is divided into fixed-duration buckets (for example, sixty 1-second buckets for a 60-second window), and each bucket stores aggregate success and failure counts rather than individual call records.
This approach uses O(window_duration / bucket_size) memory regardless of traffic volume.
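A bucketed window can be sketched as follows. This is an assumed design for illustration (`BucketedTimeWindow` and its method names are hypothetical), similar in spirit to Hystrix's rolling-number buckets; a production version would also expire buckets older than the window before aggregating.

```typescript
// Sketch: time-based sliding window using fixed-duration buckets
class BucketedTimeWindow {
  private buckets: { failures: number; total: number }[];

  constructor(
    private windowMs: number, // e.g. 60_000 for a 60-second window
    private bucketMs: number, // e.g. 1_000 for 1-second buckets
  ) {
    const count = Math.ceil(windowMs / bucketMs);
    this.buckets = Array.from({ length: count }, () => ({ failures: 0, total: 0 }));
  }

  // Map a timestamp to its bucket; buckets are reused circularly
  private bucketIndex(now: number): number {
    return Math.floor(now / this.bucketMs) % this.buckets.length;
  }

  record(success: boolean, now: number = Date.now()): void {
    const b = this.buckets[this.bucketIndex(now)];
    b.total++;
    if (!success) b.failures++;
  }

  // Aggregate failure rate across all buckets
  getFailureRate(): number {
    let failures = 0, total = 0;
    for (const b of this.buckets) { failures += b.failures; total += b.total; }
    return total === 0 ? 0 : (failures / total) * 100;
  }
}

// Usage: 60 one-second buckets, fixed memory regardless of request volume
const w = new BucketedTimeWindow(60_000, 1_000);
w.record(true, 0);
w.record(false, 500);  // lands in the same bucket as t=0
w.record(true, 1_500); // lands in the next bucket
console.log(w.getFailureRate().toFixed(2)); // "33.33"
```

Here the memory footprint is exactly 60 bucket structs, illustrating the O(window_duration / bucket_size) bound above.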
| Aspect | Count-Based | Time-Based |
|---|---|---|
| Window definition | Last N requests | Last T seconds |
| Data freshness | Variable (depends on traffic) | Consistent (bounded by time) |
| Memory usage | Fixed (N entries) | Variable (depends on traffic) |
| Low traffic behavior | Stale data persists | Old data naturally expires |
| High traffic behavior | Always fresh | May require bucketing |
| Best for | Consistent traffic patterns | Variable traffic patterns |
For most production systems, time-based windows are preferable. They ensure that circuit breaker decisions are based on recent behavior, which is especially important for services with variable traffic patterns (batch jobs, daily cycles, etc.).
One of the most commonly overlooked configuration parameters is the minimum number of calls threshold. This parameter prevents the circuit from tripping based on statistically insignificant data.
The problem it solves:
Imagine a circuit with a 50% failure threshold. If only 2 requests have been made and 1 failed, the failure rate is 50%—technically exceeding the threshold. Should the circuit open?
Almost certainly not. A single failure out of two requests might be:
- a transient network blip,
- one unlucky timeout on an otherwise healthy service, or
- ordinary statistical noise rather than a real outage.
The minimum calls threshold prevents premature tripping:
```typescript
// The minimum calls threshold in action
const config = {
  failureRateThreshold: 50, // 50% failure rate triggers open
  minimumNumberOfCalls: 10, // But only after at least 10 calls
  slidingWindowSize: 100,   // Evaluate over last 100 calls
};

function shouldOpenCircuit(metrics: Metrics): boolean {
  // Check 1: Do we have enough data?
  if (metrics.totalCalls < config.minimumNumberOfCalls) {
    return false; // NOT ENOUGH DATA - don't evaluate yet
  }

  // Check 2: Does failure rate exceed threshold?
  const failureRate = (metrics.failures / metrics.totalCalls) * 100;
  return failureRate >= config.failureRateThreshold;
}

// Examples:
// 2 calls, 1 failure (50% rate)   → Don't open (only 2 calls < 10 minimum)
// 5 calls, 3 failures (60% rate)  → Don't open (only 5 calls < 10 minimum)
// 10 calls, 4 failures (40% rate) → Don't open (40% < 50% threshold)
// 10 calls, 5 failures (50% rate) → OPEN (threshold met, enough data)
```

Tuning the minimum calls threshold:
The right value depends on your traffic patterns and failure tolerance:
| Traffic Level | Recommended Minimum | Reasoning |
|---|---|---|
| Low (< 100/min) | 5-10 calls | Small sample size; want quick detection |
| Medium (100-1000/min) | 10-20 calls | Balance between speed and accuracy |
| High (> 1000/min) | 20-50 calls | Plenty of data; prioritize accuracy |
| Critical path | Lower values | Faster protection, accept more false positives |
| Non-critical path | Higher values | Avoid unnecessary open circuits |
Setting minimum calls too high for low-traffic services means the circuit might never evaluate failure rates. If minimum calls is 100 but the service only handles 50 requests per hour, the circuit will never consider opening—even during a real outage.
Interaction with sliding window:
The minimum calls threshold works in conjunction with the sliding window:
If your sliding window size is 100 (count-based) and minimum calls is 20, evaluation begins after 20 calls and continues until the window is full. Once full, the oldest call is evicted as new calls arrive, maintaining exactly 100 calls in the evaluation set.
The timing of state transitions introduces subtle behaviors that affect system dynamics. Understanding these nuances is essential for advanced circuit breaker tuning.
Wait Duration in Open State
The wait duration (also called 'open state duration' or 'sleep window') controls how long the circuit stays open before attempting recovery.
Key considerations:
- Shorter waits detect recovery sooner but risk probing a service that is still failing.
- Longer waits give the downstream service room to recover but prolong degradation for callers.
- Adding random jitter to the wait prevents many instances from probing at the same instant.
Typical wait duration recommendations:
| Failure Type | Recommended Wait | Rationale |
|---|---|---|
| Network blip | 10-30 seconds | Transient issues resolve quickly |
| Service restart | 30-60 seconds | Container/process restart time |
| Database failover | 60-180 seconds | Primary/replica promotion time |
| Deployment failure | 120-300 seconds | Rollback or fix deployment time |
| External provider outage | 300-600 seconds | Third-party recovery time varies |
Half-Open Probe Limiting
In the HALF-OPEN state, not all requests become probes. The circuit limits how many probe requests are allowed to prevent overwhelming a recovering service.
Probe limiting strategies:
First-N approach: The first N requests after entering half-open become probes; subsequent requests fail fast until transition occurs
Single probe approach: Only one request is allowed through; all others fail fast until that probe completes
Rate-limited probes: Probes are allowed at a fixed rate (e.g., 1 per second) regardless of incoming traffic
Most implementations use the First-N approach with N between 3 and 10.
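The rate-limited strategy described above can be sketched with a simple timestamp check. This is a hedged illustration; `RateLimitedProber` and its method names are hypothetical, not a library API.

```typescript
// Sketch: rate-limited probing in HALF_OPEN - at most one probe per interval
class RateLimitedProber {
  private lastProbeAt: number = -Infinity;

  constructor(private minIntervalMs: number) {} // e.g. 1_000 = 1 probe/second

  // Returns true if this request may be used as a probe; callers that
  // get false should fail fast, as if the circuit were still OPEN
  tryAcquireProbe(now: number = Date.now()): boolean {
    if (now - this.lastProbeAt >= this.minIntervalMs) {
      this.lastProbeAt = now;
      return true;
    }
    return false;
  }
}

// Usage: with 1s spacing, a burst of requests collapses to one probe per second
const prober = new RateLimitedProber(1_000);
console.log(prober.tryAcquireProbe(0));     // true  - first probe allowed
console.log(prober.tryAcquireProbe(200));   // false - too soon, fail fast
console.log(prober.tryAcquireProbe(1_200)); // true  - interval elapsed
```

Unlike the First-N approach, this bounds probe load on the recovering service per unit time rather than per half-open episode.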
```typescript
// Half-open state with probe limiting
class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private halfOpenProbesAllowed: number = 5;
  private halfOpenProbeCount: number = 0;
  private halfOpenProbeResults: boolean[] = [];

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    switch (this.state) {
      case 'CLOSED':
        return this.executeWithTracking(fn);

      case 'OPEN':
        if (this.shouldTransitionToHalfOpen()) {
          this.state = 'HALF_OPEN';
          this.halfOpenProbeCount = 0;
          this.halfOpenProbeResults = [];
          return this.executeAsProbe(fn);
        }
        throw new CircuitBreakerOpenException('Circuit is OPEN');

      case 'HALF_OPEN':
        if (this.halfOpenProbeCount < this.halfOpenProbesAllowed) {
          return this.executeAsProbe(fn);
        }
        // No more probes allowed - fail fast
        throw new CircuitBreakerOpenException('Circuit is HALF_OPEN, max probes reached');
    }
  }

  private async executeAsProbe<T>(fn: () => Promise<T>): Promise<T> {
    this.halfOpenProbeCount++;
    try {
      const result = await fn();
      this.halfOpenProbeResults.push(true); // Success
      this.evaluateHalfOpenState();
      return result;
    } catch (error) {
      this.halfOpenProbeResults.push(false); // Failure
      this.evaluateHalfOpenState();
      throw error;
    }
  }

  private evaluateHalfOpenState(): void {
    const failures = this.halfOpenProbeResults.filter(r => !r).length;

    if (failures > 0) {
      // Any failure → back to OPEN
      this.state = 'OPEN';
      this.resetOpenStateTimer();
      return;
    }

    if (this.halfOpenProbeResults.length >= this.halfOpenProbesAllowed) {
      // All probes succeeded → CLOSED
      this.state = 'CLOSED';
      this.resetMetrics();
    }
  }
}
```

Circuit breakers operate in concurrent, high-throughput environments where edge cases and race conditions can cause unexpected behavior. Understanding these edge cases is essential for reliable implementations.
Edge Case 1: Concurrent Probes in Half-Open
When the circuit transitions to HALF-OPEN, multiple threads might simultaneously see the new state and attempt to execute probes.
Problem: If 100 threads enter half-open simultaneously and all become probes, you've just sent 100 requests to a potentially recovering service—defeating the purpose of limited probing.
Solution: Use atomic counters or locks to limit probe count:
```typescript
// Thread-safe probe limiting using atomic operations
class ThreadSafeCircuitBreaker {
  private halfOpenProbeCount: AtomicInteger = new AtomicInteger(0);
  private halfOpenMaxProbes: number = 5;

  async executeInHalfOpen<T>(fn: () => Promise<T>): Promise<T> {
    // Atomically try to acquire a probe slot
    const myProbeNumber = this.halfOpenProbeCount.incrementAndGet();

    if (myProbeNumber > this.halfOpenMaxProbes) {
      // We didn't get a probe slot - fail fast
      throw new CircuitBreakerOpenException('Probe limit reached');
    }

    // We got a probe slot - execute the request
    return fn();
  }
}

// Alternative: Use a semaphore
class SemaphoreCircuitBreaker {
  private probeSemaphore = new Semaphore(5); // Max 5 probes

  async executeInHalfOpen<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.probeSemaphore.tryAcquire()) {
      throw new CircuitBreakerOpenException('Probe limit reached');
    }
    try {
      return await fn();
    } finally {
      this.probeSemaphore.release();
    }
  }
}
```

Edge Case 2: Transition Race Conditions
Two threads might evaluate transition conditions simultaneously, both seeing the same metrics and both attempting to trigger a transition.
Problem: Multiple transition attempts could cause:
- duplicate open-state timers running concurrently,
- double-counted transition metrics and duplicate alerts, or
- inconsistent internal state if transition side effects run twice.
Solution: Use compare-and-swap (CAS) for state transitions:
```typescript
// Compare-and-swap state transitions
class CASCircuitBreaker {
  private state: AtomicReference<State> = new AtomicReference(State.CLOSED);

  private tryTransition(from: State, to: State): boolean {
    // Atomically transition only if current state matches expected
    return this.state.compareAndSet(from, to);
  }

  private evaluateAndMaybeOpen(): void {
    if (this.shouldTrip() && this.state.get() === State.CLOSED) {
      // Only one thread will succeed in this transition
      if (this.tryTransition(State.CLOSED, State.OPEN)) {
        this.startOpenTimer(); // Only the winning thread starts timer
        this.recordStateChange('CLOSED', 'OPEN');
      }
      // Losing threads silently continue - state already transitioned
    }
  }
}
```

Edge Case 3: In-Flight Requests During Transition
When a circuit transitions from CLOSED to OPEN, there may be requests already in-flight to the downstream service.
Problem: These in-flight requests might:
- complete (successfully or not) after the circuit has already opened, leaving it unclear whether their outcomes should still be recorded, or
- continue holding connections, threads, and other resources that the open circuit was meant to free.
Solution: A common approach is to let in-flight requests run to completion and record their outcomes in the sliding window, while new requests fail fast under the OPEN state; stricter designs cancel in-flight calls or bound them with a timeout so they cannot linger.
Most production circuit breaker libraries (Resilience4j, Polly, Hystrix) handle these edge cases internally. When implementing custom circuit breakers or debugging issues, awareness of these concurrency challenges helps diagnose unexpected behavior.
Understanding circuit state is crucial for debugging production issues. Effective visualization and logging practices make state transitions observable.
Essential Metrics to Expose:
```typescript
// Comprehensive circuit breaker metrics
interface CircuitBreakerMetrics {
  // State information
  state: 'CLOSED' | 'OPEN' | 'HALF_OPEN';
  stateChangedAt: Date;
  timeInState: number; // milliseconds

  // Rate metrics (from sliding window)
  failureRate: number;  // percentage
  slowCallRate: number; // percentage

  // Counts
  bufferedCalls: number;
  successfulCalls: number;
  failedCalls: number;
  slowCalls: number;
  notPermittedCalls: number; // rejected by open circuit

  // Transition counters (since startup)
  stateTransitions: {
    closedToOpen: number;
    openToHalfOpen: number;
    halfOpenToClosed: number;
    halfOpenToOpen: number;
  };
}

// Prometheus metrics example
const circuitBreakerState = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state (0=closed, 1=open, 2=half_open)',
  labelNames: ['circuit_name', 'downstream_service'],
});

const circuitBreakerFailureRate = new Gauge({
  name: 'circuit_breaker_failure_rate',
  help: 'Current failure rate in the sliding window',
  labelNames: ['circuit_name'],
});

const circuitBreakerTransitions = new Counter({
  name: 'circuit_breaker_state_transitions_total',
  help: 'Total number of state transitions',
  labelNames: ['circuit_name', 'from_state', 'to_state'],
});
```

Logging State Transitions:
State transitions are significant events that should be logged at a high level (WARN or INFO). The log should include:
```typescript
// Structured logging for state transitions
class LoggingCircuitBreaker {
  private logger = getLogger('circuit-breaker');

  private onStateTransition(from: State, to: State, metrics: Metrics): void {
    this.logger.warn({
      event: 'circuit_breaker_transition',
      circuit_name: this.name,
      downstream: this.downstreamService,
      from_state: from,
      to_state: to,
      trigger: this.getTransitionTrigger(from, to, metrics),
      metrics: {
        failure_rate: metrics.failureRate,
        slow_call_rate: metrics.slowCallRate,
        buffered_calls: metrics.bufferedCalls,
        failures_in_window: metrics.failures,
        time_in_previous_state_ms: Date.now() - this.stateChangedAt.getTime(),
      },
    });

    // Alert on OPEN transitions
    if (to === State.OPEN) {
      this.alerting.fire({
        severity: 'warning',
        title: `Circuit breaker OPEN: ${this.name}`,
        message: `Circuit to ${this.downstreamService} opened due to ${metrics.failureRate}% failure rate`,
      });
    }
  }
}
```

Create a dedicated circuit breaker dashboard showing all circuits in your system, their current states, and recent transition history. During incidents, this dashboard provides immediate visibility into which fault boundaries are active and how the system is self-protecting.
We've comprehensively examined the state machine that powers circuit breakers, understanding not just what each state does, but why and how transitions occur.
What's next:
With a solid understanding of states and transitions, the next page explores configuration parameters in depth—the specific thresholds, timeouts, and limits that tune circuit breaker behavior for different operational needs.
You now have a deep understanding of the circuit breaker state machine. This knowledge is foundational for effective configuration and debugging. You can explain why a circuit is in any given state, predict when transitions will occur, and diagnose unexpected behavior.