Circuit Breaker Pattern

Level: Advanced

Duration: 75 mins

Topic: Circuit Breaker Pattern

Part 4 of 5

Monitoring Circuit State

Making the Invisible Visible

A circuit breaker that protects your system invisibly is doing its job. But invisibility is a double-edged sword—if operators can't see circuit state, they can't:

•Diagnose why certain requests are failing fast
•Understand if protection is appropriately engaged
•Detect misconfigured circuits that trip too aggressively or not at all
•Correlate circuit transitions with downstream incidents
•Make informed decisions about tuning

Observability transforms circuit breakers from black boxes into transparent, manageable components. This page covers everything you need to make circuit breaker state visible, understandable, and actionable.

What You Will Learn

By the end of this page, you will understand which metrics to collect from circuit breakers, how to design effective dashboards, when and how to alert on circuit state, and operational practices for managing circuits in production.

Essential Circuit Breaker Metrics

Effective monitoring starts with collecting the right metrics. Circuit breakers produce several categories of metrics that serve different purposes.

State Metrics (Point-in-Time):

These metrics capture the current state of the circuit at any moment:

Circuit Breaker State Metrics
Metric Name	Type	Description	Example Value
circuit_breaker_state	Gauge	Current state as numeric value	0=CLOSED, 1=OPEN, 2=HALF_OPEN
circuit_breaker_failure_rate	Gauge	Current failure rate in sliding window (%)	42.5
circuit_breaker_slow_call_rate	Gauge	Current slow call rate in sliding window (%)	14.9
circuit_breaker_buffered_calls	Gauge	Number of calls in sliding window	87
circuit_breaker_successful_calls	Gauge	Successful calls in window	50
circuit_breaker_failed_calls	Gauge	Failed calls in window	37
circuit_breaker_slow_calls	Gauge	Slow calls in window	13
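
The gauges in this table are all derived from the same sliding window, so they can be computed together. The following is a minimal sketch, assuming a hypothetical `CallRecord` shape and a `snapshotWindow` helper (neither comes from a specific library); a real breaker implementation will maintain its window differently.

```typescript
// Hypothetical record of one call held in the sliding window.
interface CallRecord {
  ok: boolean;        // true if the call succeeded
  durationMs: number; // observed latency of the call
}

// One field per gauge in the state metrics table.
interface WindowSnapshot {
  bufferedCalls: number;
  successfulCalls: number;
  failedCalls: number;
  slowCalls: number;
  failureRate: number;  // percent, 0-100
  slowCallRate: number; // percent, 0-100
}

// Derives every state gauge from the calls currently in the window.
function snapshotWindow(calls: CallRecord[], slowThresholdMs: number): WindowSnapshot {
  const bufferedCalls = calls.length;
  const failedCalls = calls.filter((c) => !c.ok).length;
  const slowCalls = calls.filter((c) => c.durationMs > slowThresholdMs).length;
  // An empty window reports 0% rather than dividing by zero.
  const pct = (n: number): number => (bufferedCalls === 0 ? 0 : (n / bufferedCalls) * 100);
  return {
    bufferedCalls,
    successfulCalls: bufferedCalls - failedCalls,
    failedCalls,
    slowCalls,
    failureRate: pct(failedCalls),
    slowCallRate: pct(slowCalls),
  };
}
```

Note that a slow call can also be a successful call, so `slowCalls` is counted independently of `failedCalls`.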

Event Metrics (Counters):

These metrics count events over time, enabling rate calculations:

Circuit Breaker Event Counters
Metric Name	Type	Description	Use Case
circuit_breaker_calls_total	Counter	Total calls attempted	Success rate calculation
circuit_breaker_successful_calls_total	Counter	Total successful calls	Throughput analysis
circuit_breaker_failed_calls_total	Counter	Total failed calls	Error trending
circuit_breaker_not_permitted_calls_total	Counter	Calls rejected by open circuit	Protection activity
circuit_breaker_state_transitions_total	Counter	State transition count	Circuit stability

prometheus-metrics.ts
// Prometheus metrics setup for circuit breakers
import { Registry, Gauge, Counter, Histogram } from 'prom-client';
 
const registry = new Registry();
 
// State gauge (current state as numeric value)
const circuitState = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state (0=closed, 1=open, 2=half_open)',
  labelNames: ['circuit_name', 'downstream_service'],
  registers: [registry],
});
 
// Failure rate gauge
const failureRate = new Gauge({
  name: 'circuit_breaker_failure_rate',
  help: 'Current failure rate percentage in sliding window',
  labelNames: ['circuit_name'],
  registers: [registry],
});
 
// Call counters
const callsTotal = new Counter({
  name: 'circuit_breaker_calls_total',
  help: 'Total number of calls through circuit breaker',
  labelNames: ['circuit_name', 'outcome'],  // outcome: success, failure, not_permitted
  registers: [registry],
});
 
// State transition counter
const transitions = new Counter({
  name: 'circuit_breaker_state_transitions_total',
  help: 'Number of state transitions',
  labelNames: ['circuit_name', 'from_state', 'to_state'],
  registers: [registry],
});
 
// Response time histogram
const responseTime = new Histogram({
  name: 'circuit_breaker_call_duration_seconds',
  help: 'Response time of calls through circuit breaker',
  labelNames: ['circuit_name', 'outcome'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],  // 10ms to 10s
  registers: [registry],
});
 
// Usage in circuit breaker
function recordCallOutcome(
  circuitName: string,
  downstream: string,
  outcome: 'success' | 'failure' | 'not_permitted',
  durationMs: number
): void {
  callsTotal.inc({ circuit_name: circuitName, outcome });
  
  if (outcome !== 'not_permitted') {
    responseTime.observe(
      { circuit_name: circuitName, outcome },
      durationMs / 1000
    );
  }
}

Label Cardinality

Be thoughtful about labels. Adding a 'request_id' label would create unique time series per request—quickly overwhelming your metrics storage. Limit labels to dimensions you'll actually query by: circuit name, downstream service, state, outcome.
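
To make the cardinality concern concrete, the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses a hypothetical `estimateSeriesCount` helper and made-up label counts; it gives an upper bound, since not every label combination necessarily occurs.

```typescript
// Upper bound on time series for a single metric: the product of the number
// of distinct values each label can take.
function estimateSeriesCount(labelCardinalities: Record<string, number>): number {
  return Object.keys(labelCardinalities).reduce(
    (product, label) => product * labelCardinalities[label],
    1
  );
}

// Illustrative numbers: 20 circuits and 3 outcomes stay small.
const boundedSeries = estimateSeriesCount({ circuit_name: 20, outcome: 3 });

// Adding a per-request label multiplies the total by the number of requests seen.
const unboundedSeries = estimateSeriesCount({
  circuit_name: 20,
  outcome: 3,
  request_id: 1000000,
});

console.log(boundedSeries, unboundedSeries); // 60 60000000
```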

Metric Collection Patterns

There are two primary patterns for collecting circuit breaker metrics: push-based and pull-based.

Pull-Based Collection (Prometheus Pattern):

The monitoring system periodically scrapes metrics endpoints exposed by your services.

Pull-Based Advantages

•Simple service architecture: Services only expose an endpoint; no outbound connections needed
•Automatic discovery: Service discovery integrates with scraping
•Consistent intervals: Scrape interval controlled centrally
•Tolerant of monitoring outages: Counters are cumulative, so a missed scrape loses resolution but not totals; gauge values between scrapes are not retained

metrics-endpoint.ts
// Express endpoint for Prometheus scraping
import express from 'express';
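import { circuitBreakerRegistry } from './circuit-breaker-metrics';
// Assumed module path; the original snippet uses circuitBreakerManager without importing it
import { circuitBreakerManager } from './circuit-breaker-manager';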
 
const app = express();
 
// Metrics endpoint
app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', circuitBreakerRegistry.contentType);
    res.end(await circuitBreakerRegistry.metrics());
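  } catch (error) {
    // res.end() accepts only a string or Buffer, so send the message rather than the Error object
    res.status(500).end(error instanceof Error ? error.message : String(error));
  }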
});
 
// Health check that includes circuit state summary
app.get('/health', async (req, res) => {
  const circuitSummary = circuitBreakerManager.getAllCircuitStates();
  
  // Include open circuits in health response
  const openCircuits = Object.entries(circuitSummary)
    .filter(([_, state]) => state === 'OPEN')
    .map(([name, _]) => name);
  
  res.json({
    status: openCircuits.length === 0 ? 'healthy' : 'degraded',
    openCircuits,
    circuitStates: circuitSummary,
  });
});
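
Pull-based collection pairs naturally with cumulative counters: the monitoring system derives rates from the difference between two scrapes, so a skipped scrape widens the interval without losing counts. The sketch below illustrates that idea with a hypothetical `ratePerSecond` helper; it is a simplification and not how any particular monitoring system implements its rate function.

```typescript
// One scraped value of a cumulative counter.
interface CounterSample {
  value: number;       // counter value at scrape time
  timestampMs: number; // when the scrape happened
}

// Per-second rate between two scrapes of the same counter.
function ratePerSecond(prev: CounterSample, curr: CounterSample): number {
  const elapsedSec = (curr.timestampMs - prev.timestampMs) / 1000;
  if (elapsedSec <= 0) return 0;
  // A drop in value means the process restarted and the counter reset to zero,
  // so everything in `curr` was counted since the reset.
  const delta = curr.value >= prev.value ? curr.value - prev.value : curr.value;
  return delta / elapsedSec;
}

// Normal 30s interval: 60 calls in 30s.
const normalRate = ratePerSecond({ value: 100, timestampMs: 0 }, { value: 160, timestampMs: 30000 });
// One scrape missed: the interval doubles, but the 120 calls are still accounted for.
const afterMissedScrape = ratePerSecond({ value: 100, timestampMs: 0 }, { value: 220, timestampMs: 60000 });

console.log(normalRate, afterMissedScrape); // 2 2
```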

Push-Based Collection (StatsD/Datadog Pattern):

Services actively push metrics to a collection endpoint.

Push-Based Advantages

•Real-time events: State transitions pushed immediately
•Event-driven granularity: Every state change captured precisely
•Works through firewalls: Outbound connections usually easier than inbound
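•No scrape interval delay: Events are sent as they happen, though an aggregating server such as StatsD still forwards metrics on its own flush interval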

statsd-metrics.ts
// StatsD-based metrics pushing
import StatsD from 'hot-shots';
 
const statsd = new StatsD({
  host: 'statsd.monitoring.local',
  port: 8125,
  prefix: 'circuit_breaker.',
});
 
// Real-time event pushing
class StatsDBreakerMetrics {
  private circuitName: string;
  
  constructor(circuitName: string) {
    this.circuitName = circuitName;
  }
  
  recordSuccess(durationMs: number): void {
    statsd.increment(`${this.circuitName}.success`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }
  
  recordFailure(durationMs: number): void {
    statsd.increment(`${this.circuitName}.failure`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }
  
  recordNotPermitted(): void {
    statsd.increment(`${this.circuitName}.not_permitted`);
  }
  
  recordStateTransition(from: string, to: string): void {
    // Immediate visibility into state changes (events are a DogStatsD extension, not plain StatsD)
    statsd.event(
      `Circuit ${this.circuitName} transition`,
      `State changed from ${from} to ${to}`,
      { alert_type: to === 'OPEN' ? 'warning' : 'info' }
    );
    statsd.increment(`${this.circuitName}.transition.${from}_to_${to}`);
  }
  
  updateGauges(metrics: CircuitMetrics): void {
    statsd.gauge(`${this.circuitName}.failure_rate`, metrics.failureRate);
    statsd.gauge(`${this.circuitName}.slow_call_rate`, metrics.slowCallRate);
    statsd.gauge(`${this.circuitName}.buffered_calls`, metrics.bufferedCalls);
  }
}

Hybrid Approach

Many organizations use both patterns: pull-based for regular metrics scraping and push-based (or event-based) for critical events like state transitions. This provides both comprehensive monitoring and real-time alerting.

Dashboard Design

Effective dashboards present circuit breaker information in a way that enables quick understanding and action. Different audiences need different views.

Dashboard 1: Circuit Breaker Overview (Primary)

This dashboard provides a system-wide view of all circuit breakers, designed for on-call engineers and NOC teams.

Overview Dashboard Components

•State map: Visual grid showing all circuits with color-coded states (green=closed, red=open, yellow=half-open)
•Open circuit count: Large number showing how many circuits are currently open
•Recent transitions: Timeline showing state transitions in the last hour
•Not-permitted requests: Graph showing requests rejected by open circuits
•Failure rate heatmap: Service-by-circuit matrix showing current failure rates

grafana-queries.promql
# PromQL queries for circuit breaker dashboard
 
# Count of open circuits (single stat panel)
# The state is the gauge's value, not a label, so filter on the value
count(circuit_breaker_state == 1) or vector(0)
 
# State map by circuit (table panel)
# circuit_breaker_state already carries the downstream_service label, so no join is needed
circuit_breaker_state
 
# Failure rate over time (graph panel)
circuit_breaker_failure_rate{circuit_name=~"$circuit"}
 
# Not permitted requests per second (graph panel)
rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m])
 
# State transitions per hour (counter panel)
increase(circuit_breaker_state_transitions_total{}[1h])
 
# Success rate through circuit (graph panel)
sum(rate(circuit_breaker_calls_total{outcome="success"}[5m])) by (circuit_name) /
sum(rate(circuit_breaker_calls_total{outcome=~"success|failure"}[5m])) by (circuit_name)
 
# Time in current state (gauge panel)
# Assumes the breaker exports the Unix time of its last transition under this name
time() - circuit_breaker_state_changed_timestamp_seconds

Dashboard 2: Circuit Breaker Deep Dive (Per-Circuit)

This dashboard provides detailed analysis of a single circuit, useful for debugging and tuning.

Deep Dive Dashboard Components

•Current state and duration: How long has this circuit been in its current state?
•Failure rate timeline: Failure rate over the last 24 hours with threshold line
•Slow call rate timeline: Slow call percentage over time
•Call latency distribution: Histogram showing response time distribution
•State transition history: Annotated timeline of all state changes
•Configuration display: Current threshold settings for reference
•Correlation panel: Show downstream service latency alongside circuit state

Correlation View

The most powerful debugging view correlates circuit breaker state with downstream metrics. When you can see that the payment service P99 latency spiked to 20 seconds at the exact moment the circuit opened, the likely cause is immediately apparent.

Alerting Strategies

Circuit breaker events can trigger alerts, but not all events warrant human attention. Effective alerting distinguishes between informational events and actionable incidents.

Alert: Circuit Opened (Warning)

Fires as soon as a rule evaluation sees the circuit in the OPEN state.

alerting-rules.yaml
# Prometheus alerting rules for circuit breakers (evaluated by Prometheus, routed by Alertmanager)
 
groups:
- name: circuit_breaker_alerts
  rules:
    
  # Alert when any circuit opens
  - alert: CircuitBreakerOpened
    expr: circuit_breaker_state == 1
    for: 0s  # Alert immediately
    labels:
      severity: warning
    annotations:
      summary: "Circuit breaker {{ $labels.circuit_name }} is OPEN"
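      # $value would be the state (1) here, not the failure rate, so it is not interpolated
      description: |
        Circuit {{ $labels.circuit_name }} to {{ $labels.downstream_service }}
        has opened because its failure or slow-call threshold was exceeded.
        See circuit_breaker_failure_rate for the current failure rate.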
        
  # Alert when circuit remains open for extended period
  - alert: CircuitBreakerOpenExtended
    expr: |
      circuit_breaker_state == 1
      and on(circuit_name)
      (time() - circuit_breaker_state_changed_timestamp_seconds) > 300
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} open for >5 minutes"
      description: |
        Circuit has been open for more than 5 minutes.
        Recovery probes may be failing, or downstream service
        is not recovering. Manual investigation required.
 
  # Alert on circuit flapping (multiple transitions in short period)
  - alert: CircuitBreakerFlapping
    expr: |
      sum by (circuit_name) (
        increase(circuit_breaker_state_transitions_total[15m])
      ) > 4
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} is flapping"
      description: |
        Circuit has transitioned more than 4 times in 15 minutes.
        This indicates either threshold misconfiguration or
        an unstable downstream service. Review circuit config
        and downstream health.
 
  # Alert on high rejection rate
  - alert: HighCircuitRejectionRate
    expr: |
      sum by (circuit_name) (rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m]))
      /
      sum by (circuit_name) (rate(circuit_breaker_calls_total[5m])) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} rejecting >10% of traffic"
      description: |
        More than 10% of requests to this circuit are being rejected
        due to open circuit state. User impact likely.

Alert Severity Guidelines:

Circuit Breaker Alert Severity Matrix
Condition	Severity	Action Required
Circuit opens (non-critical service)	Info	Monitor, no immediate action
Circuit opens (critical service)	Warning	Investigate, monitor recovery
Circuit open > 5 minutes	Warning → Critical	Active investigation required
Circuit flapping (repeated transitions)	Warning	Review configuration
Multiple circuits open simultaneously	Critical	Potential systemic issue
Circuit never opens despite known issues	Warning	Configuration audit needed
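
The matrix above can be expressed as a small classification function, which is useful when routing circuit events in application code rather than in alert rules. This is a sketch under stated assumptions: the `CircuitCondition` shape is hypothetical, "multiple circuits" is taken to mean two or more, the flapping threshold reuses the "more than 4 transitions in 15 minutes" rule, and an open duration over 5 minutes is treated as already escalated to critical.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Hypothetical summary of one circuit's situation at evaluation time.
interface CircuitCondition {
  criticalService: boolean;   // is the downstream on a critical path?
  openForSeconds: number;     // 0 when the circuit is not open
  transitionsLast15m: number; // state transitions in the last 15 minutes
  totalOpenCircuits: number;  // open circuits across the system, including this one
}

// Returns null when nothing in the matrix applies (no alert needed).
function classify(c: CircuitCondition): Severity | null {
  const isOpen = c.openForSeconds > 0;
  if (c.totalOpenCircuits >= 2) return 'critical';         // potential systemic issue
  if (isOpen && c.openForSeconds > 300) return 'critical'; // open for more than 5 minutes
  if (c.transitionsLast15m > 4) return 'warning';          // flapping
  if (isOpen) return c.criticalService ? 'warning' : 'info';
  return null;
}
```

The last matrix row (a circuit that never opens despite known issues) is not covered, because detecting it requires knowledge of downstream incidents that the circuit itself does not have.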

Alert Fatigue

Circuits opening and closing is normal, healthy behavior. Alerting on every OPEN transition will create noise. Instead, alert on patterns that require investigation: extended open duration, flapping, or aggregate impact. A circuit that opens for 30 seconds during a transient issue and auto-recovers might not need human attention.

Logging Best Practices

While metrics provide quantitative visibility, logs provide qualitative context. Structured logging makes circuit breaker events searchable and analyzable.

What to Log:

Essential Circuit Breaker Log Events

•State transitions: Log at WARN level with before/after state and metrics snapshot
•Not-permitted calls: Log at INFO or DEBUG level with request context
•Probe results in half-open: Log probe success/failure for debugging
•Configuration changes: Log when circuit configuration is updated
•Threshold breaches: Log when failure rate first crosses threshold

structured-logging.ts
// Structured logging for circuit breaker events
import { Logger } from './logger';
 
class CircuitBreakerLogger {
  private logger: Logger;
  private circuitName: string;
  
  constructor(circuitName: string) {
    this.circuitName = circuitName;
    this.logger = Logger.getLogger('circuit-breaker');
  }
  
  logStateTransition(
    from: State,
    to: State,
    metrics: CircuitMetrics,
    trigger: string
  ): void {
    // State transitions are significant - WARN level
    this.logger.warn({
      event: 'circuit_breaker_state_transition',
      circuit_name: this.circuitName,
      downstream: metrics.downstreamService,
      from_state: from,
      to_state: to,
      trigger: trigger,  // 'failure_threshold_exceeded', 'timeout_expired', etc.
      metrics: {
        failure_rate: metrics.failureRate,
        slow_call_rate: metrics.slowCallRate,
        buffered_calls: metrics.bufferedCalls,
        failed_calls: metrics.failedCalls,
        slow_calls: metrics.slowCalls,
      },
      time_in_previous_state_ms: metrics.timeInPreviousState,
    });
  }
  
  logNotPermitted(requestContext: RequestContext): void {
    // Not-permitted calls might be noisy - INFO or DEBUG level
    this.logger.info({
      event: 'circuit_breaker_call_not_permitted',
      circuit_name: this.circuitName,
      request_id: requestContext.requestId,
      path: requestContext.path,
      method: requestContext.method,
      current_state: 'OPEN',
      reason: 'Circuit is open, request rejected immediately',
    });
  }
  
  logProbeResult(
    probeNumber: number,
    totalProbes: number,
    success: boolean,
    durationMs: number
  ): void {
    this.logger.info({
      event: 'circuit_breaker_probe_result',
      circuit_name: this.circuitName,
      probe_number: probeNumber,
      total_probes: totalProbes,
      success: success,
      duration_ms: durationMs,
      state: 'HALF_OPEN',
    });
  }
  
  logThresholdBreach(
    metricType: 'failure_rate' | 'slow_call_rate',
    currentValue: number,
    threshold: number
  ): void {
    this.logger.warn({
      event: 'circuit_breaker_threshold_breach',
      circuit_name: this.circuitName,
      metric_type: metricType,
      current_value: currentValue,
      threshold: threshold,
      message: `${metricType} (${currentValue}%) exceeded threshold (${threshold}%)`,
    });
  }
}

Log Aggregation Queries:

With structured logs, you can perform powerful queries:

log-queries.txt
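# Illustrative log queries for circuit breaker analysis.
# Filters use Lucene-style syntax as found in Elasticsearch/OpenSearch; the "| stats" and
# "| where" pipelines are pseudo-syntax and must be adapted to your log tool's query language.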
 
# Find all circuits that opened in the last hour
event:"circuit_breaker_state_transition" AND to_state:"OPEN" AND @timestamp:[now-1h TO now]
 
# Find circuits that are flapping
event:"circuit_breaker_state_transition" | stats count by circuit_name | where count > 5
 
# Find requests rejected by circuits for a specific service
event:"circuit_breaker_call_not_permitted" AND path:"/api/checkout/*"
 
# Analyze probe success rate during recovery
event:"circuit_breaker_probe_result" AND circuit_name:"payment-service"
  | stats count by success
 
# Correlate circuit opens with error logs from downstream
(event:"circuit_breaker_state_transition" AND to_state:"OPEN") 
  OR 
(service:"payment-service" AND level:"ERROR")

Log Sampling

For high-traffic circuits, logging every not-permitted call can be overwhelming. Consider sampling: log every Nth rejection, or log only the first rejection per minute. State transitions should always be logged without sampling—they're infrequent and critical.
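
One way to implement the "first rejection per minute" policy is a small per-circuit sampler that also reports how many rejections it suppressed, so the log still conveys volume. The `RejectionLogSampler` class below is a hypothetical sketch; the clock is passed in explicitly to keep it easy to test.

```typescript
// Allows at most one not-permitted log line per circuit per interval.
class RejectionLogSampler {
  private lastLoggedAt = new Map<string, number>();
  private suppressed = new Map<string, number>();

  constructor(private readonly intervalMs: number = 60000) {}

  // Returns null when this rejection should not be logged. Otherwise returns
  // how many rejections were suppressed since the previous logged one.
  shouldLog(circuitName: string, nowMs: number): { suppressedSinceLast: number } | null {
    const last = this.lastLoggedAt.get(circuitName);
    if (last !== undefined && nowMs - last < this.intervalMs) {
      this.suppressed.set(circuitName, (this.suppressed.get(circuitName) ?? 0) + 1);
      return null;
    }
    const suppressedSinceLast = this.suppressed.get(circuitName) ?? 0;
    this.lastLoggedAt.set(circuitName, nowMs);
    this.suppressed.set(circuitName, 0);
    return { suppressedSinceLast };
  }
}
```

A caller would check `shouldLog(name, Date.now())` before logging a rejection and include `suppressedSinceLast` in the log entry. State transitions bypass the sampler entirely.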

Distributed Tracing Integration

Distributed tracing provides request-level visibility that complements metrics and logs. Integrating circuit breaker state into traces enables powerful debugging capabilities.

Adding Circuit State to Spans:

When a request passes through a circuit breaker, annotate the trace span with circuit information:

tracing-integration.ts
// OpenTelemetry integration for circuit breakers
import { trace, SpanStatusCode } from '@opentelemetry/api';
// Assumed module path; the original snippet uses these types without importing them
import { CircuitBreaker, CircuitBreakerOpenException } from './circuit-breaker';
 
class TracedCircuitBreaker {
  private tracer = trace.getTracer('circuit-breaker');
  
  constructor(private breaker: CircuitBreaker) {}
  
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    const span = this.tracer.startSpan('circuit-breaker.execute', {
      attributes: {
        'circuit.name': this.breaker.name,
        'circuit.downstream': this.breaker.downstream,
        'circuit.state': this.breaker.state,
      },
    });
    
    try {
      if (this.breaker.state === 'OPEN') {
        // Request rejected by open circuit
        span.setAttributes({
          'circuit.result': 'not_permitted',
          'circuit.rejection_reason': 'circuit_open',
        });
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Circuit open' });
        throw new CircuitBreakerOpenException();
      }
      
      const startTime = Date.now();
      const result = await operation();
      const duration = Date.now() - startTime;
      
      span.setAttributes({
        'circuit.result': 'success',
        'circuit.duration_ms': duration,
        'circuit.was_slow': duration > this.breaker.slowCallThreshold,
      });
      
      return result;
      
    } catch (error) {
      // A rejection was already annotated above; do not relabel it as a downstream failure
      if (!(error instanceof CircuitBreakerOpenException)) {
        // Caught values are not guaranteed to be Error instances
        const err = error instanceof Error ? error : new Error(String(error));
        span.setAttributes({
          'circuit.result': 'failure',
          'circuit.error_type': err.constructor.name,
        });
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      }
      throw error;
      
    } finally {
      // Always include final circuit state
      span.setAttributes({
        'circuit.final_state': this.breaker.state,
        'circuit.failure_rate': this.breaker.metrics.failureRate,
      });
      span.end();
    }
  }
}

Benefits of Trace-Level Circuit Visibility:

Tracing Integration Benefits

•Request path analysis: See exactly which circuit rejected a specific request
•Latency contribution: Understand how circuit breaker overhead contributes to total latency
•Failure correlation: Trace from circuit rejection back to the originating user request
•Probe visibility: Track probe requests through the downstream service
•Fallback tracing: Follow execution path when circuit triggers fallback logic

Trace Sampling Considerations

If your tracing system uses sampling, ensure that circuit breaker events can force trace capture. A request rejected by an open circuit might otherwise be ignored by sampling, losing valuable debugging information.
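
A sampling decision that honors circuit breaker events has to run after the spans are finished, because a decision made at the start of a request cannot know that a circuit will reject it. The sketch below assumes such a late (tail-style) decision point exists in your pipeline; `SpanSummary` and `shouldKeepTrace` are hypothetical names, and the `circuit.result` attribute matches the one set in the tracing example on this page.

```typescript
// Minimal view of a finished span: only its attributes matter here.
interface SpanSummary {
  attributes: Record<string, string | number | boolean>;
}

// Keeps every trace that contains a circuit rejection or failure, and samples
// the rest at baseSampleRate (0 to 1). `random` is injectable for testing.
function shouldKeepTrace(
  spans: SpanSummary[],
  baseSampleRate: number,
  random: () => number = Math.random
): boolean {
  const hasCircuitEvent = spans.some((s) => {
    const result = s.attributes['circuit.result'];
    return result === 'not_permitted' || result === 'failure';
  });
  if (hasCircuitEvent) return true;
  return random() < baseSampleRate;
}
```

With this rule, a trace for a rejected request is retained even at a base rate of 0, while ordinary successful traces are still thinned out.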

Operational Runbooks

Monitoring is only valuable if operators know how to respond. Runbooks codify the response procedures for circuit breaker events.

Runbook: Circuit Opens Unexpectedly

runbook-circuit-opened.md
# Runbook: Circuit Breaker Opened
 
## 1. Initial Assessment (30 seconds)
- Which circuit opened? Check alert details.
- Is this a critical path circuit? Check criticality labeling.
- Are other circuits to the same downstream also open?
 
## 2. Verify Downstream Health (1-2 minutes)
- Check downstream service dashboard
- Review downstream service logs for errors
- Check downstream service pod/container status
- Verify downstream database/dependencies health
 
## 3. Assess Impact (1 minute)
- What user-facing functionality is affected?
- Is fallback behavior working?
- How many users/requests are affected?
 
## 4. Decide Response
IF downstream is genuinely unhealthy:
  → Let circuit protect the system
  → Focus on downstream recovery
  → Monitor circuit state for auto-recovery
 
IF downstream appears healthy (possible false positive):
  → Review circuit configuration
  → Check for threshold misconfiguration
  → Consider temporarily adjusting thresholds
 
IF circuit is flapping:
  → Investigate intermittent downstream issues
  → Consider increasing wait duration
  → Review minimum calls threshold
 
## 5. Recovery Verification
- Monitor half-open probes in dashboard
- Verify circuit closes after downstream recovery
- Confirm user-facing functionality restored
 
## 6. Post-Incident
- Document root cause if not auto-recovered
- Review if circuit configuration was appropriate
- Update runbook with lessons learned

Runbook: Manually Overriding Circuit State

In rare cases, operators may need to manually force a circuit state:

runbook-manual-override.md
# Runbook: Manual Circuit Override
 
## WARNING
Manual overrides bypass automatic protection. Use only when:
- Circuit is misconfigured and needs emergency correction
- Downstream is confirmed healthy but circuit won't recover
- Testing circuit behavior in staging environment
 
## Force Circuit CLOSED
Use when: Circuit is stuck open despite healthy downstream
 
```bash
# Via admin API (if exposed)
curl -X POST http://localhost:8080/admin/circuits/payment-service/close
 
# Via configuration (requires restart or dynamic config)
# Shell variable names cannot contain hyphens, hence the underscores
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=false
```
 
## Force Circuit OPEN
Use when: Need to stop traffic to a service immediately
 
```bash
# Via admin API
curl -X POST http://localhost:8080/admin/circuits/payment-service/open
 
# Via configuration
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=true
```
 
## Reset Circuit Metrics
Use when: Historical failures skewing current evaluation
 
```bash
curl -X POST http://localhost:8080/admin/circuits/payment-service/reset
```
 
## IMPORTANT: Revert Overrides
After emergency override, always:
1. Document the override in incident channel
2. Monitor circuit behavior closely
3. Remove override once issue is resolved
4. Let circuit return to automatic operation

Override Audit Trail

All manual overrides should be logged and auditable. Override without audit trail makes post-incident analysis difficult and can mask configuration issues that would otherwise be caught.
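
As an illustration of what an audit trail might capture, the sketch below records each override with who made it and why, and can list circuits whose override has not yet been reverted. `OverrideAuditLog` and its field names are hypothetical, and the in-memory array stands in for whatever durable store your organization uses.

```typescript
type OverrideAction = 'FORCED_OPEN' | 'FORCED_CLOSED' | 'AUTOMATIC';

interface OverrideAuditEntry {
  circuitName: string;
  action: OverrideAction; // 'AUTOMATIC' records that an override was reverted
  operator: string;
  reason: string;
  timestamp: string;      // ISO 8601
}

class OverrideAuditLog {
  private entries: OverrideAuditEntry[] = [];

  // Appends an entry; refuses overrides that carry no justification.
  record(entry: Omit<OverrideAuditEntry, 'timestamp'>, now: Date = new Date()): OverrideAuditEntry {
    if (entry.reason.trim() === '') {
      throw new Error('An override reason is required');
    }
    const stored: OverrideAuditEntry = { ...entry, timestamp: now.toISOString() };
    this.entries.push(stored);
    return stored;
  }

  // Circuits whose most recent entry is still a forced state.
  activeOverrides(): string[] {
    const latest: Record<string, OverrideAction> = {};
    for (const e of this.entries) latest[e.circuitName] = e.action;
    return Object.keys(latest).filter((name) => latest[name] !== 'AUTOMATIC');
  }
}
```

Listing `activeOverrides()` during incident wrap-up is one way to catch overrides that were never returned to automatic operation.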

Summary: Monitoring Circuit Breakers

This page covered how to make circuit breaker behavior visible and manageable through monitoring, alerting, and operational processes.

Key Takeaways

•Essential metrics include state, failure rate, slow call rate, buffered calls, and transition counts.
•Collection patterns (push vs. pull) each have advantages; hybrid approaches combine benefits.
•Dashboard design serves different audiences: overview for NOC teams, deep-dive for debugging.
•Alerting strategies distinguish between noise (normal transitions) and actionable events (extended open, flapping).
•Structured logging enables powerful queries and correlates circuit events with downstream issues.
•Distributed tracing provides request-level visibility into circuit decisions and fallback paths.
•Operational runbooks codify response procedures for consistent, effective incident handling.

What's next:

With monitoring in place, the final page covers implementation considerations—practical guidance for choosing libraries, handling edge cases, integrating with existing systems, and deploying circuit breakers effectively.

Page Complete

You now understand how to make circuit breakers observable and manageable. You can design dashboards that surface critical information, configure alerts that reduce noise while catching real issues, and create runbooks that enable effective operational response.

4 / 5

Loading learning content...

System Design (HLD)Circuit Breaker Pattern

Circuit Breaker Pattern

LevelAdvanced

Duration75 mins

TopicCircuit Breaker Pattern

4 / 5

Monitoring Circuit State

Making the Invisible Visible

A circuit breaker that protects your system invisibly is doing its job. But invisibility is a double-edged sword—if operators can't see circuit state, they can't:

Diagnose why certain requests are failing fast
Understand if protection is appropriately engaged
Detect misconfigured circuits that trip too aggressively or not at all
Correlate circuit transitions with downstream incidents
Make informed decisions about tuning

What You Will Learn

Essential Circuit Breaker Metrics

Effective monitoring starts with collecting the right metrics. Circuit breakers produce several categories of metrics that serve different purposes.

State Metrics (Point-in-Time):

These metrics capture the current state of the circuit at any moment:

Circuit Breaker State Metrics
Metric Name	Type	Description	Example Value
circuit_breaker_state	Gauge	Current state as numeric value	0=CLOSED, 1=OPEN, 2=HALF_OPEN
circuit_breaker_failure_rate	Gauge	Current failure rate in sliding window (%)	42.5
circuit_breaker_slow_call_rate	Gauge	Current slow call rate in sliding window (%)	15.2
circuit_breaker_buffered_calls	Gauge	Number of calls in sliding window	87
circuit_breaker_successful_calls	Gauge	Successful calls in window	50
circuit_breaker_failed_calls	Gauge	Failed calls in window	37
circuit_breaker_slow_calls	Gauge	Slow calls in window	13

Event Metrics (Counters):

These metrics count events over time, enabling rate calculations:

Circuit Breaker Event Counters
Metric Name	Type	Description	Use Case
circuit_breaker_calls_total	Counter	Total calls attempted	Success rate calculation
circuit_breaker_successful_calls_total	Counter	Total successful calls	Throughput analysis
circuit_breaker_failed_calls_total	Counter	Total failed calls	Error trending
circuit_breaker_not_permitted_calls_total	Counter	Calls rejected by open circuit	Protection activity
circuit_breaker_state_transitions_total	Counter	State transition count	Circuit stability

prometheus-metrics.ts
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
// Prometheus metrics setup for circuit breakers
import { Registry, Gauge, Counter, Histogram } from 'prom-client';
 
const registry = new Registry();
 
// State gauge (current state as numeric value)
const circuitState = new Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state (0=closed, 1=open, 2=half_open)',
  labelNames: ['circuit_name', 'downstream_service'],
  registers: [registry],
});
 
// Failure rate gauge
const failureRate = new Gauge({
  name: 'circuit_breaker_failure_rate',
  help: 'Current failure rate percentage in sliding window',
  labelNames: ['circuit_name'],
  registers: [registry],
});
 
// Call counters
const callsTotal = new Counter({
  name: 'circuit_breaker_calls_total',
  help: 'Total number of calls through circuit breaker',
  labelNames: ['circuit_name', 'outcome'],  // outcome: success, failure, not_permitted
  registers: [registry],
});
 
// State transition counter
const transitions = new Counter({
  name: 'circuit_breaker_state_transitions_total',
  help: 'Number of state transitions',
  labelNames: ['circuit_name', 'from_state', 'to_state'],
  registers: [registry],
});
 
// Response time histogram
const responseTime = new Histogram({
  name: 'circuit_breaker_call_duration_seconds',
  help: 'Response time of calls through circuit breaker',
  labelNames: ['circuit_name', 'outcome'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],  // 10ms to 10s
  registers: [registry],
});
 
// Usage in circuit breaker
function recordCallOutcome(
  circuitName: string,
  downstream: string,
  outcome: 'success' | 'failure' | 'not_permitted',
  durationMs: number
): void {
  callsTotal.inc({ circuit_name: circuitName, outcome });
  
  if (outcome !== 'not_permitted') {
    responseTime.observe(
      { circuit_name: circuitName, outcome },
      durationMs / 1000
    );
  }
}

Label Cardinality

Metric Collection Patterns

There are two primary patterns for collecting circuit breaker metrics: push-based and pull-based.

Pull-Based Collection (Prometheus Pattern):

The monitoring system periodically scrapes metrics endpoints exposed by your services.

Pull-Based Advantages

•Simple service architecture: Services only expose an endpoint; no outbound connections needed
•Automatic discovery: Service discovery integrates with scraping
•Consistent intervals: Scrape interval controlled centrally
•No counter loss on brief monitoring outages: Counters stay cumulative in the service, so a missed scrape costs resolution, not totals

metrics-endpoint.ts
// Express endpoint for Prometheus scraping
import express from 'express';
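import { circuitBreakerRegistry } from './circuit-breaker-metrics';
// Assumed module path; circuitBreakerManager is not defined on this page.
import { circuitBreakerManager } from './circuit-breaker-manager';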
 
const app = express();
 
// Metrics endpoint
app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', circuitBreakerRegistry.contentType);
    res.end(await circuitBreakerRegistry.metrics());
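  } catch (error) {
    // res.end() needs a string or Buffer, not an arbitrary thrown value
    res.status(500).end(error instanceof Error ? error.message : String(error));
  }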
});
 
// Health check that includes circuit state summary
app.get('/health', async (req, res) => {
  const circuitSummary = circuitBreakerManager.getAllCircuitStates();
  
  // Include open circuits in health response
  const openCircuits = Object.entries(circuitSummary)
    .filter(([_, state]) => state === 'OPEN')
    .map(([name, _]) => name);
  
  res.json({
    status: openCircuits.length === 0 ? 'healthy' : 'degraded',
    openCircuits,
    circuitStates: circuitSummary,
  });
});

Push-Based Collection (StatsD/DataDog Pattern):

Services actively push metrics to a collection endpoint.

Push-Based Advantages

•Real-time events: State transitions pushed immediately
•Event-driven granularity: Every state change captured precisely
•Works through firewalls: Outbound connections usually easier than inbound
•No scrape interval delay: Events visible instantly

statsd-metrics.ts
// StatsD-based metrics pushing
import StatsD from 'hot-shots';
 
const statsd = new StatsD({
  host: 'statsd.monitoring.local',
  port: 8125,
  prefix: 'circuit_breaker.',
});
 
// Real-time event pushing
class StatsDBreakerMetrics {
  private circuitName: string;
  
  constructor(circuitName: string) {
    this.circuitName = circuitName;
  }
  
  recordSuccess(durationMs: number): void {
    statsd.increment(`${this.circuitName}.success`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }
  
  recordFailure(durationMs: number): void {
    statsd.increment(`${this.circuitName}.failure`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }
  
  recordNotPermitted(): void {
    statsd.increment(`${this.circuitName}.not_permitted`);
  }
  
  recordStateTransition(from: string, to: string): void {
    // Immediate visibility into state changes
    // (events are a DogStatsD extension, not part of plain StatsD)
    statsd.event(
      `Circuit ${this.circuitName} transition`,
      `State changed from ${from} to ${to}`,
      { alert_type: to === 'OPEN' ? 'warning' : 'info' }
    );
    statsd.increment(`${this.circuitName}.transition.${from}_to_${to}`);
  }
  
  updateGauges(metrics: CircuitMetrics): void {
    statsd.gauge(`${this.circuitName}.failure_rate`, metrics.failureRate);
    statsd.gauge(`${this.circuitName}.slow_call_rate`, metrics.slowCallRate);
    statsd.gauge(`${this.circuitName}.buffered_calls`, metrics.bufferedCalls);
  }
}

Hybrid Approach
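
A hybrid setup typically keeps counters and gauges on the scraped endpoint while pushing state transitions as events, so steady-state data stays cheap and transitions are visible without waiting for the next scrape. A minimal sketch under those assumptions; the EventSink interface and HybridBreakerMetrics class are hypothetical stand-ins for a real push client and metrics registry:

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
type CallOutcome = 'success' | 'failure' | 'not_permitted';

// Hypothetical push target; a real one would wrap a StatsD or vendor client.
interface EventSink {
  send(title: string, text: string): void;
}

class HybridBreakerMetrics {
  // Pull side: cumulative counters that the scrape handler serializes.
  private counters: Record<string, number> = {};

  constructor(
    private readonly circuitName: string,
    private readonly sink: EventSink
  ) {}

  recordCall(outcome: CallOutcome): void {
    this.bump(`calls_total{outcome="${outcome}"}`);
  }

  recordTransition(from: CircuitState, to: CircuitState): void {
    // Counted for scraping AND pushed immediately as an event.
    this.bump(`state_transitions_total{from="${from}",to="${to}"}`);
    this.sink.send(`Circuit ${this.circuitName} transition`, `${from} -> ${to}`);
  }

  // Copy of the current counters, as a scrape handler would read them.
  snapshot(): Record<string, number> {
    return { ...this.counters };
  }

  private bump(key: string): void {
    this.counters[key] = (this.counters[key] || 0) + 1;
  }
}
```

High-volume call outcomes only ever touch in-process counters, while the rare but important transitions take the push path.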

Dashboard Design

Effective dashboards present circuit breaker information in a way that enables quick understanding and action. Different audiences need different views.

Dashboard 1: Circuit Breaker Overview (Primary)

This dashboard provides a system-wide view of all circuit breakers, designed for on-call engineers and NOC teams.

Overview Dashboard Components

•State map: Visual grid showing all circuits with color-coded states (green=closed, red=open, yellow=half-open)
•Open circuit count: Large number showing how many circuits are currently open
•Recent transitions: Timeline showing state transitions in the last hour
•Not-permitted requests: Graph showing requests rejected by open circuits
•Failure rate heatmap: Service-by-circuit matrix showing current failure rates

grafana-queries.promql
# PromQL queries for circuit breaker dashboard
 
# Count of open circuits (single stat panel)
# The state is the gauge's value, not a label, so filter on the value
count(circuit_breaker_state == 1) or vector(0)
 
# State map by circuit (table panel)
circuit_breaker_state{} * on(circuit_name) group_left(downstream_service) 
  circuit_breaker_info{}
 
# Failure rate over time (graph panel)
circuit_breaker_failure_rate{circuit_name=~"$circuit"}
 
# Not permitted requests per second (graph panel)
rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m])
 
# State transitions per hour (counter panel)
increase(circuit_breaker_state_transitions_total{}[1h])
 
# Success rate through circuit (graph panel)
sum(rate(circuit_breaker_calls_total{outcome="success"}[5m])) by (circuit_name) /
sum(rate(circuit_breaker_calls_total{outcome=~"success|failure"}[5m])) by (circuit_name)
 
# Time in current state (gauge panel)
time() - circuit_breaker_state_changed_timestamp_seconds

Dashboard 2: Circuit Breaker Deep Dive (Per-Circuit)

This dashboard provides detailed analysis of a single circuit, useful for debugging and tuning.

Deep Dive Dashboard Components

•Current state and duration: How long has this circuit been in its current state?
•Failure rate timeline: Failure rate over the last 24 hours with threshold line
•Slow call rate timeline: Slow call percentage over time
•Call latency distribution: Histogram showing response time distribution
•State transition history: Annotated timeline of all state changes
•Configuration display: Current threshold settings for reference
•Correlation panel: Show downstream service latency alongside circuit state

Correlation View

Alerting Strategies

Circuit breaker events can trigger alerts, but not all events warrant human attention. Effective alerting distinguishes between informational events and actionable incidents.

Alert: Circuit Opened (Warning)

Triggered immediately when a circuit transitions to OPEN state.

alerting-rules.yaml
# Prometheus alerting rules for circuit breakers (notifications are routed by Alertmanager)
 
groups:
- name: circuit_breaker_alerts
  rules:
    
  # Alert when any circuit opens
  - alert: CircuitBreakerOpened
    expr: circuit_breaker_state == 1
    for: 0s  # Alert immediately
    labels:
      severity: warning
    annotations:
      summary: "Circuit breaker {{ $labels.circuit_name }} is OPEN"
      description: |
        Circuit {{ $labels.circuit_name }} to {{ $labels.downstream_service }}
        has opened due to failure rate exceeding threshold.
        See circuit_breaker_failure_rate for the current failure rate.
        
  # Alert when circuit remains open for extended period
  - alert: CircuitBreakerOpenExtended
    expr: |
      circuit_breaker_state == 1
      and on (circuit_name)
      (time() - circuit_breaker_state_changed_timestamp_seconds) > 300
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} open for >5 minutes"
      description: |
        Circuit has been open for more than 5 minutes.
        Recovery probes may be failing, or downstream service
        is not recovering. Manual investigation required.
 
  # Alert on circuit flapping (multiple transitions in short period)
  - alert: CircuitBreakerFlapping
    expr: |
      sum by (circuit_name) (
        increase(circuit_breaker_state_transitions_total[15m])
      ) > 4
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} is flapping"
      description: |
        Circuit has transitioned more than 4 times in 15 minutes.
        This indicates either threshold misconfiguration or
        an unstable downstream service. Review circuit config
        and downstream health.
 
  # Alert on high rejection rate
  - alert: HighCircuitRejectionRate
    expr: |
      sum by (circuit_name) (rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m]))
      /
      sum by (circuit_name) (rate(circuit_breaker_calls_total[5m])) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Circuit {{ $labels.circuit_name }} rejecting >10% of traffic"
      description: |
        More than 10% of requests to this circuit are being rejected
        due to open circuit state. User impact likely.

Alert Severity Guidelines:

Circuit Breaker Alert Severity Matrix
Condition	Severity	Action Required
Circuit opens (non-critical service)	Info	Monitor, no immediate action
Circuit opens (critical service)	Warning	Investigate, monitor recovery
Circuit open > 5 minutes	Warning → Critical	Active investigation required
Circuit flapping (repeated transitions)	Warning	Review configuration
Multiple circuits open simultaneously	Critical	Potential systemic issue
Circuit never opens despite known issues	Warning	Configuration audit needed
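
The matrix above can be read as a small decision function. A minimal sketch, assuming a hypothetical per-circuit criticality flag and reusing the thresholds from the alert rules (5 minutes open, more than 4 transitions in 15 minutes); the names CircuitAlertInput and classifySeverity are illustrative, and the "Warning → Critical" row is simplified to critical once the 5-minute mark is passed:

```typescript
type Severity = 'info' | 'warning' | 'critical';

interface CircuitAlertInput {
  isCriticalService: boolean;  // hypothetical criticality flag for the circuit
  openForSeconds: number;      // 0 if the circuit is not open
  transitionsLast15m: number;  // state transitions in the last 15 minutes
  openCircuitCount: number;    // circuits currently open across the system
}

function classifySeverity(input: CircuitAlertInput): Severity {
  // Several circuits open at once suggests a systemic issue.
  if (input.openCircuitCount > 1) return 'critical';
  // Open for more than 5 minutes: active investigation required.
  if (input.openForSeconds > 300) return 'critical';
  // Flapping: review configuration.
  if (input.transitionsLast15m > 4) return 'warning';
  // Freshly opened circuit: severity depends on service criticality.
  if (input.openForSeconds > 0) {
    return input.isCriticalService ? 'warning' : 'info';
  }
  return 'info';
}
```

The last matrix row (a circuit that never opens despite known issues) cannot be derived from circuit state alone and is left to a periodic configuration audit.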

Alert Fatigue

Logging Best Practices

While metrics provide quantitative visibility, logs provide qualitative context. Structured logging makes circuit breaker events searchable and analyzable.

What to Log:

Essential Circuit Breaker Log Events

•State transitions: Log at WARN level with before/after state and metrics snapshot
•Not-permitted calls: Log at INFO or DEBUG level with request context
•Probe results in half-open: Log probe success/failure for debugging
•Configuration changes: Log when circuit configuration is updated
•Threshold breaches: Log when failure rate first crosses threshold

structured-logging.ts
// Structured logging for circuit breaker events
import { Logger } from './logger';
 
class CircuitBreakerLogger {
  private logger: Logger;
  private circuitName: string;
  
  constructor(circuitName: string) {
    this.circuitName = circuitName;
    this.logger = Logger.getLogger('circuit-breaker');
  }
  
  logStateTransition(
    from: State,
    to: State,
    metrics: CircuitMetrics,
    trigger: string
  ): void {
    // State transitions are significant - WARN level
    this.logger.warn({
      event: 'circuit_breaker_state_transition',
      circuit_name: this.circuitName,
      downstream: metrics.downstreamService,
      from_state: from,
      to_state: to,
      trigger: trigger,  // 'failure_threshold_exceeded', 'timeout_expired', etc.
      metrics: {
        failure_rate: metrics.failureRate,
        slow_call_rate: metrics.slowCallRate,
        buffered_calls: metrics.bufferedCalls,
        failed_calls: metrics.failedCalls,
        slow_calls: metrics.slowCalls,
      },
      time_in_previous_state_ms: metrics.timeInPreviousState,
    });
  }
  
  logNotPermitted(requestContext: RequestContext): void {
    // Not-permitted calls might be noisy - INFO or DEBUG level
    this.logger.info({
      event: 'circuit_breaker_call_not_permitted',
      circuit_name: this.circuitName,
      request_id: requestContext.requestId,
      path: requestContext.path,
      method: requestContext.method,
      current_state: 'OPEN',
      reason: 'Circuit is open, request rejected immediately',
    });
  }
  
  logProbeResult(
    probeNumber: number,
    totalProbes: number,
    success: boolean,
    durationMs: number
  ): void {
    this.logger.info({
      event: 'circuit_breaker_probe_result',
      circuit_name: this.circuitName,
      probe_number: probeNumber,
      total_probes: totalProbes,
      success: success,
      duration_ms: durationMs,
      state: 'HALF_OPEN',
    });
  }
  
  logThresholdBreach(
    metricType: 'failure_rate' | 'slow_call_rate',
    currentValue: number,
    threshold: number
  ): void {
    this.logger.warn({
      event: 'circuit_breaker_threshold_breach',
      circuit_name: this.circuitName,
      metric_type: metricType,
      current_value: currentValue,
      threshold: threshold,
      message: `${metricType} (${currentValue}%) exceeded threshold (${threshold}%)`,
    });
  }
}

Log Aggregation Queries:

With structured logs, you can perform powerful queries:

log-queries.txt
# Illustrative log queries for circuit breaker analysis
# (Lucene-style filters; the piped stats/where stages are pseudo-syntax to adapt to your log platform)
 
# Find all circuits that opened in the last hour
event:"circuit_breaker_state_transition" AND to_state:"OPEN" AND @timestamp:[now-1h TO now]
 
# Find circuits that are flapping
event:"circuit_breaker_state_transition" | stats count by circuit_name | where count > 5
 
# Find requests rejected by circuits for a specific service
event:"circuit_breaker_call_not_permitted" AND path:"/api/checkout/*"
 
# Analyze probe success rate during recovery
event:"circuit_breaker_probe_result" AND circuit_name:"payment-service"
  | stats count by success
 
# Correlate circuit opens with error logs from downstream
(event:"circuit_breaker_state_transition" AND to_state:"OPEN") 
  OR 
(service:"payment-service" AND level:"ERROR")

Log Sampling
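
During an outage an open circuit can reject a very large number of calls, and logging each one can flood the log pipeline. One common mitigation is to log the first few rejections in each time window and then only every Nth one, relying on the not_permitted counter for exact totals. A minimal sketch; the RejectionLogSampler class and its parameters are hypothetical:

```typescript
class RejectionLogSampler {
  private windowStart = 0;
  private seenInWindow = 0;

  constructor(
    private readonly windowMs: number,        // length of each sampling window
    private readonly alwaysLogFirst: number,  // rejections always logged per window
    private readonly thenEveryNth: number     // after that, log 1 in N
  ) {}

  // Returns true if this rejection should be written to the log.
  shouldLog(nowMs: number): boolean {
    if (nowMs - this.windowStart >= this.windowMs) {
      // Start a new window and reset the per-window count.
      this.windowStart = nowMs;
      this.seenInWindow = 0;
    }
    this.seenInWindow += 1;
    if (this.seenInWindow <= this.alwaysLogFirst) return true;
    return (this.seenInWindow - this.alwaysLogFirst) % this.thenEveryNth === 0;
  }
}

// Example: first 2 rejections per second are logged, then 1 in 10.
const sampler = new RejectionLogSampler(1000, 2, 10);
let logged = 0;
for (let i = 0; i < 25; i++) {
  if (sampler.shouldLog(0)) logged += 1;
}
console.log(logged); // 4
```

A call site would wrap the logger call, for example: if (sampler.shouldLog(Date.now())) logger.logNotPermitted(ctx).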

Distributed Tracing Integration

Distributed tracing provides request-level visibility that complements metrics and logs. Integrating circuit breaker state into traces enables powerful debugging capabilities.

Adding Circuit State to Spans:

When a request passes through a circuit breaker, annotate the trace span with circuit information:

tracing-integration.ts
// OpenTelemetry integration for circuit breakers
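import { trace, SpanStatusCode } from '@opentelemetry/api';
// Assumed module path; these types are not defined on this page.
import { CircuitBreaker, CircuitBreakerOpenException } from './circuit-breaker';
 
class TracedCircuitBreaker {
  private tracer = trace.getTracer('circuit-breaker');

  constructor(private breaker: CircuitBreaker) {}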
  
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    const span = this.tracer.startSpan('circuit-breaker.execute', {
      attributes: {
        'circuit.name': this.breaker.name,
        'circuit.downstream': this.breaker.downstream,
        'circuit.state': this.breaker.state,
      },
    });
    
    try {
      if (this.breaker.state === 'OPEN') {
        // Request rejected by open circuit
        span.setAttributes({
          'circuit.result': 'not_permitted',
          'circuit.rejection_reason': 'circuit_open',
        });
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Circuit open' });
        throw new CircuitBreakerOpenException();
      }
      
      const startTime = Date.now();
      const result = await operation();
      const duration = Date.now() - startTime;
      
      span.setAttributes({
        'circuit.result': 'success',
        'circuit.duration_ms': duration,
        'circuit.was_slow': duration > this.breaker.slowCallThreshold,
      });
      
      return result;
      
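    } catch (error) {
      // Rejections were already annotated as 'not_permitted' above;
      // only downstream errors are recorded as failures.
      if (!(error instanceof CircuitBreakerOpenException)) {
        const err = error instanceof Error ? error : new Error(String(error));
        span.setAttributes({
          'circuit.result': 'failure',
          'circuit.error_type': err.constructor.name,
        });
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      }
      throw error;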
      
    } finally {
      // Always include final circuit state
      span.setAttributes({
        'circuit.final_state': this.breaker.state,
        'circuit.failure_rate': this.breaker.metrics.failureRate,
      });
      span.end();
    }
  }
}

Benefits of Trace-Level Circuit Visibility:

Tracing Integration Benefits

•Request path analysis: See exactly which circuit rejected a specific request
•Latency contribution: Understand how circuit breaker overhead contributes to total latency
•Failure correlation: Trace from circuit rejection back to the originating user request
•Probe visibility: Track probe requests through the downstream service
•Fallback tracing: Follow execution path when circuit triggers fallback logic

Trace Sampling Considerations
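
With probabilistic head sampling, the few traces that contain a circuit rejection may be dropped before anyone can inspect them. Where the tracing pipeline supports tail-based sampling, one option is to always keep traces that touched a non-closed circuit and sample the rest. A minimal sketch of such a policy as a pure function; the SpanSummary shape and keepTrace name are hypothetical, with attribute keys matching the span attributes set above:

```typescript
interface SpanSummary {
  attributes: Record<string, string | number | boolean>;
}

function keepTrace(
  spans: SpanSummary[],
  baseSampleRate: number,              // e.g. 0.01 keeps about 1% of ordinary traces
  random: () => number = Math.random   // injectable for deterministic tests
): boolean {
  for (const span of spans) {
    const result = span.attributes['circuit.result'];
    const state = span.attributes['circuit.state'];
    // Always keep rejected or failed calls.
    if (result === 'not_permitted' || result === 'failure') return true;
    // Always keep calls made while the circuit was not CLOSED (includes probes).
    if (state === 'OPEN' || state === 'HALF_OPEN') return true;
  }
  // Ordinary traces are kept with the base probability.
  return random() < baseSampleRate;
}
```

This keeps the traces operators are most likely to need during an incident while leaving the volume of healthy-path traces under control.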

Operational Runbooks

Monitoring is only valuable if operators know how to respond. Runbooks codify the response procedures for circuit breaker events.

Runbook: Circuit Opens Unexpectedly

runbook-circuit-opened.md
# Runbook: Circuit Breaker Opened
 
## 1. Initial Assessment (30 seconds)
- Which circuit opened? Check alert details.
- Is this a critical path circuit? Check criticality labeling.
- Are other circuits to the same downstream also open?
 
## 2. Verify Downstream Health (1-2 minutes)
- Check downstream service dashboard
- Review downstream service logs for errors
- Check downstream service pod/container status
- Verify downstream database/dependencies health
 
## 3. Assess Impact (1 minute)
- What user-facing functionality is affected?
- Is fallback behavior working?
- How many users/requests are affected?
 
## 4. Decide Response
IF downstream is genuinely unhealthy:
  → Let circuit protect the system
  → Focus on downstream recovery
  → Monitor circuit state for auto-recovery
 
IF downstream appears healthy (possible false positive):
  → Review circuit configuration
  → Check for threshold misconfiguration
  → Consider temporarily adjusting thresholds
 
IF circuit is flapping:
  → Investigate intermittent downstream issues
  → Consider increasing wait duration
  → Review minimum calls threshold
 
## 5. Recovery Verification
- Monitor half-open probes in dashboard
- Verify circuit closes after downstream recovery
- Confirm user-facing functionality restored
 
## 6. Post-Incident
- Document root cause if not auto-recovered
- Review if circuit configuration was appropriate
- Update runbook with lessons learned
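
Step 4 of the runbook is a small decision tree, which can be encoded so that tooling or chat-ops bots suggest the same response every time. A minimal sketch; the Assessment shape and decideResponse function are hypothetical, the flapping threshold reuses the alert rule's value (more than 4 transitions in 15 minutes), and checking for flapping first is an assumption, since the runbook does not state an order:

```typescript
type ResponsePath = 'downstream_unhealthy' | 'possible_false_positive' | 'flapping';

interface Assessment {
  downstreamHealthy: boolean;   // outcome of runbook step 2
  transitionsLast15m: number;   // from the state transition counter
}

// Steps copied from the runbook's "Decide Response" section.
const RUNBOOK_STEPS: Record<ResponsePath, string[]> = {
  downstream_unhealthy: [
    'Let circuit protect the system',
    'Focus on downstream recovery',
    'Monitor circuit state for auto-recovery',
  ],
  possible_false_positive: [
    'Review circuit configuration',
    'Check for threshold misconfiguration',
    'Consider temporarily adjusting thresholds',
  ],
  flapping: [
    'Investigate intermittent downstream issues',
    'Consider increasing wait duration',
    'Review minimum calls threshold',
  ],
};

function decideResponse(a: Assessment): { path: ResponsePath; steps: string[] } {
  let path: ResponsePath;
  if (a.transitionsLast15m > 4) {
    // Assumption: flapping is checked first, because a flapping circuit can
    // look healthy or unhealthy depending on when it is sampled.
    path = 'flapping';
  } else if (!a.downstreamHealthy) {
    path = 'downstream_unhealthy';
  } else {
    path = 'possible_false_positive';
  }
  return { path, steps: RUNBOOK_STEPS[path] };
}
```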

Runbook: Manually Overriding Circuit State

In rare cases, operators may need to manually force a circuit state:

runbook-manual-override.md
# Runbook: Manual Circuit Override
 
## WARNING
Manual overrides bypass automatic protection. Use only when:
- Circuit is misconfigured and needs emergency correction
- Downstream is confirmed healthy but circuit won't recover
- Testing circuit behavior in staging environment
 
## Force Circuit CLOSED
Use when: Circuit is stuck open despite healthy downstream
 
```bash
# Via admin API (if exposed)
curl -X POST http://localhost:8080/admin/circuits/payment-service/close
 
# Via configuration (requires restart or dynamic config)
# (shell variable names cannot contain hyphens)
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=false
```
 
## Force Circuit OPEN
Use when: Need to stop traffic to a service immediately
 
```bash
# Via admin API
curl -X POST http://localhost:8080/admin/circuits/payment-service/open
 
# Via configuration
# (shell variable names cannot contain hyphens)
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=true
```
 
## Reset Circuit Metrics
Use when: Historical failures skewing current evaluation
 
```bash
curl -X POST http://localhost:8080/admin/circuits/payment-service/reset
```
 
## IMPORTANT: Revert Overrides
After emergency override, always:
1. Document the override in incident channel
2. Monitor circuit behavior closely
3. Remove override once issue is resolved
4. Let circuit return to automatic operation

Override Audit Trail

All manual overrides should be logged and auditable. An override without an audit trail makes post-incident analysis difficult and can mask configuration issues that would otherwise be caught.
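
One way to make overrides auditable is to route every admin action through a single recorder that refuses entries without a reason and emits a structured log line. A minimal in-memory sketch; the OverrideAuditLog class, its field names, and the event name circuit_breaker_manual_override are hypothetical, and a real system would persist records durably:

```typescript
type OverrideAction = 'force_open' | 'force_closed' | 'reset_metrics' | 'remove_override';

interface OverrideAuditRecord {
  timestamp: string;      // ISO 8601
  circuitName: string;
  action: OverrideAction;
  operator: string;
  reason: string;
  incidentRef?: string;   // optional link to the incident channel or ticket
}

class OverrideAuditLog {
  private readonly records: OverrideAuditRecord[] = [];

  record(
    entry: Omit<OverrideAuditRecord, 'timestamp'>,
    now: Date = new Date()
  ): OverrideAuditRecord {
    if (entry.reason.trim().length === 0) {
      throw new Error('An override must include a reason');
    }
    const full: OverrideAuditRecord = { timestamp: now.toISOString(), ...entry };
    this.records.push(full);
    // Structured line so overrides are searchable next to other circuit events.
    console.log(JSON.stringify({ event: 'circuit_breaker_manual_override', ...full }));
    return full;
  }

  // Circuits whose most recent forcing action has not been removed yet.
  activeOverrides(): string[] {
    const latest: Record<string, OverrideAction> = {};
    for (const r of this.records) {
      // Resetting metrics does not change whether a forced state is in effect.
      if (r.action !== 'reset_metrics') latest[r.circuitName] = r.action;
    }
    return Object.keys(latest).filter(
      (name) => latest[name] === 'force_open' || latest[name] === 'force_closed'
    );
  }
}
```

A periodic check on activeOverrides() gives a simple reminder for the "Revert Overrides" step: any circuit still listed after an incident is closed has not returned to automatic operation.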

Summary: Monitoring Circuit Breakers

This page covered how to make circuit breaker behavior visible and manageable through monitoring, alerting, and operational processes.

Key Takeaways

•Essential metrics include state, failure rate, slow call rate, buffered calls, and transition counts.
•Collection patterns (push vs. pull) each have advantages; hybrid approaches combine benefits.
•Dashboard design serves different audiences: overview for NOC teams, deep-dive for debugging.
•Alerting strategies distinguish between noise (normal transitions) and actionable events (extended open, flapping).
•Structured logging enables powerful queries and correlates circuit events with downstream issues.
•Distributed tracing provides request-level visibility into circuit decisions and fallback paths.
•Operational runbooks codify response procedures for consistent, effective incident handling.
