A circuit breaker that protects your system invisibly is doing its job. But invisibility is a double-edged sword: operators who can't see circuit state can't understand it or act on it.
Observability transforms circuit breakers from black boxes into transparent, manageable components. This page covers everything you need to make circuit breaker state visible, understandable, and actionable.
By the end of this page, you will understand which metrics to collect from circuit breakers, how to design effective dashboards, when and how to alert on circuit state, and operational practices for managing circuits in production.
Effective monitoring starts with collecting the right metrics. Circuit breakers produce several categories of metrics that serve different purposes.
State Metrics (Point-in-Time):
These metrics capture the current state of the circuit at any moment:
| Metric Name | Type | Description | Example Value |
|---|---|---|---|
| circuit_breaker_state | Gauge | Current state as numeric value | 0=CLOSED, 1=OPEN, 2=HALF_OPEN |
| circuit_breaker_failure_rate | Gauge | Current failure rate in sliding window (%) | 42.5 |
| circuit_breaker_slow_call_rate | Gauge | Current slow call rate in sliding window (%) | 15.2 |
| circuit_breaker_buffered_calls | Gauge | Number of calls in sliding window | 87 |
| circuit_breaker_successful_calls | Gauge | Successful calls in window | 50 |
| circuit_breaker_failed_calls | Gauge | Failed calls in window | 37 |
| circuit_breaker_slow_calls | Gauge | Slow calls in window | 13 |
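To show how the gauges in this table relate to one another, here is a minimal sketch that derives them from the calls currently held in a sliding window. The `CallRecord` shape, the `snapshotWindow` function, and the slow-call threshold parameter are assumptions made for illustration; real circuit breaker libraries track these values internally and may define them differently.

```typescript
// Hypothetical record of one call in the sliding window (illustrative only).
interface CallRecord {
  success: boolean;
  durationMs: number;
}

// Mirrors the state gauges listed in the table above.
interface WindowSnapshot {
  bufferedCalls: number;
  successfulCalls: number;
  failedCalls: number;
  slowCalls: number;
  failureRate: number; // percent, 0-100
  slowCallRate: number; // percent, 0-100
}

function snapshotWindow(calls: CallRecord[], slowThresholdMs: number): WindowSnapshot {
  const bufferedCalls = calls.length;
  const failedCalls = calls.filter((c) => !c.success).length;
  const slowCalls = calls.filter((c) => c.durationMs > slowThresholdMs).length;
  // An empty window reports 0% rather than dividing by zero.
  const pct = (n: number): number => (bufferedCalls === 0 ? 0 : (n / bufferedCalls) * 100);
  return {
    bufferedCalls,
    successfulCalls: bufferedCalls - failedCalls,
    failedCalls,
    slowCalls,
    failureRate: pct(failedCalls),
    slowCallRate: pct(slowCalls),
  };
}
```

A snapshot like this could be computed whenever the window changes and written to the corresponding gauges.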
Event Metrics (Counters):
These metrics count events over time, enabling rate calculations:
| Metric Name | Type | Description | Use Case |
|---|---|---|---|
| circuit_breaker_calls_total | Counter | Total calls attempted | Success rate calculation |
| circuit_breaker_successful_calls_total | Counter | Total successful calls | Throughput analysis |
| circuit_breaker_failed_calls_total | Counter | Total failed calls | Error trending |
| circuit_breaker_not_permitted_calls_total | Counter | Calls rejected by open circuit | Protection activity |
| circuit_breaker_state_transitions_total | Counter | State transition count | Circuit stability |
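Counters only ever increase, so they become useful once you compare two samples taken at different times. The sketch below illustrates that idea with a simplified per-second rate; the `CounterSample` type and `perSecondRate` function are hypothetical names, and the reset handling is a deliberate simplification of what a monitoring backend does.

```typescript
// Hypothetical counter reading: cumulative value plus the time it was sampled.
interface CounterSample {
  value: number;
  atSeconds: number;
}

// Simplified per-second rate between two samples of the same counter.
function perSecondRate(earlier: CounterSample, later: CounterSample): number {
  const elapsed = later.atSeconds - earlier.atSeconds;
  if (elapsed <= 0) return 0;
  // If the counter went down, assume the process restarted and the counter
  // began again from zero, so the later value is the whole increase.
  const increase = later.value >= earlier.value ? later.value - earlier.value : later.value;
  return increase / elapsed;
}

// 300 additional calls over 300 seconds is one call per second.
console.log(perSecondRate({ value: 100, atSeconds: 0 }, { value: 400, atSeconds: 300 })); // 1
```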
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162
// Prometheus metrics setup for circuit breakersimport { Registry, Gauge, Counter, Histogram } from 'prom-client'; const registry = new Registry(); // State gauge (current state as numeric value)const circuitState = new Gauge({ name: 'circuit_breaker_state', help: 'Current circuit breaker state (0=closed, 1=open, 2=half_open)', labelNames: ['circuit_name', 'downstream_service'], registers: [registry],}); // Failure rate gaugeconst failureRate = new Gauge({ name: 'circuit_breaker_failure_rate', help: 'Current failure rate percentage in sliding window', labelNames: ['circuit_name'], registers: [registry],}); // Call countersconst callsTotal = new Counter({ name: 'circuit_breaker_calls_total', help: 'Total number of calls through circuit breaker', labelNames: ['circuit_name', 'outcome'], // outcome: success, failure, not_permitted registers: [registry],}); // State transition counterconst transitions = new Counter({ name: 'circuit_breaker_state_transitions_total', help: 'Number of state transitions', labelNames: ['circuit_name', 'from_state', 'to_state'], registers: [registry],}); // Response time histogramconst responseTime = new Histogram({ name: 'circuit_breaker_call_duration_seconds', help: 'Response time of calls through circuit breaker', labelNames: ['circuit_name', 'outcome'], buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10], // 10ms to 10s registers: [registry],}); // Usage in circuit breakerfunction recordCallOutcome( circuitName: string, downstream: string, outcome: 'success' | 'failure' | 'not_permitted', durationMs: number): void { callsTotal.inc({ circuit_name: circuitName, outcome }); if (outcome !== 'not_permitted') { responseTime.observe( { circuit_name: circuitName, outcome }, durationMs / 1000 ); }}Be thoughtful about labels. Adding a 'request_id' label would create unique time series per request—quickly overwhelming your metrics storage. Limit labels to dimensions you'll actually query by: circuit name, downstream service, state, outcome.
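The cardinality warning can be made concrete with a rough upper bound: the number of time series for one metric is at most the product of the distinct values each label can take. The sketch below is a back-of-the-envelope estimate only; `estimateSeriesCount` is a hypothetical helper, and the counts (20 circuits, one million request IDs) are made-up inputs for illustration.

```typescript
// Upper bound on time series for a single metric:
// the product of the number of distinct values of each label.
function estimateSeriesCount(labelValueCounts: Record<string, number>): number {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1);
}

// Bounded labels: 20 circuits x 3 outcomes.
const bounded = estimateSeriesCount({ circuit_name: 20, outcome: 3 });

// Same metric with a per-request label added.
const unbounded = estimateSeriesCount({ circuit_name: 20, outcome: 3, request_id: 1000000 });

console.log(bounded); // 60
console.log(unbounded); // 60000000
```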
There are two primary patterns for collecting circuit breaker metrics: push-based and pull-based.
Pull-Based Collection (Prometheus Pattern):
The monitoring system periodically scrapes metrics endpoints exposed by your services.
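To give a sense of what a scrape actually returns, the following sketch renders the state gauge by hand in the Prometheus text exposition format. In practice a client library produces this output for you; `renderStateGauge` and `escapeLabelValue` are hypothetical helpers written only to show the shape of the response.

```typescript
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Numeric encoding used by the circuit_breaker_state gauge.
const STATE_VALUES: Record<CircuitState, number> = { CLOSED: 0, OPEN: 1, HALF_OPEN: 2 };

// Label values must escape backslashes, double quotes, and newlines.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Renders one sample line per circuit, preceded by HELP and TYPE metadata.
function renderStateGauge(states: Record<string, CircuitState>): string {
  const lines: string[] = [
    '# HELP circuit_breaker_state Current circuit breaker state (0=closed, 1=open, 2=half_open)',
    '# TYPE circuit_breaker_state gauge',
  ];
  for (const [name, state] of Object.entries(states)) {
    lines.push(`circuit_breaker_state{circuit_name="${escapeLabelValue(name)}"} ${STATE_VALUES[state]}`);
  }
  return lines.join('\n') + '\n';
}

console.log(renderStateGauge({ 'payment-service': 'OPEN', 'inventory-service': 'CLOSED' }));
```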
```typescript
// Express endpoint for Prometheus scraping
import express from 'express';
import { circuitBreakerRegistry } from './circuit-breaker-metrics';
// Assumed local module that exposes the current state of every circuit
import { circuitBreakerManager } from './circuit-breaker-manager';

const app = express();

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', circuitBreakerRegistry.contentType);
    res.end(await circuitBreakerRegistry.metrics());
  } catch (error) {
    // end() expects a string or Buffer, so convert the error first
    res.status(500).end(String(error));
  }
});

// Health check that includes circuit state summary
app.get('/health', async (req, res) => {
  const circuitSummary = circuitBreakerManager.getAllCircuitStates();

  // Include open circuits in health response
  const openCircuits = Object.entries(circuitSummary)
    .filter(([_, state]) => state === 'OPEN')
    .map(([name, _]) => name);

  res.json({
    status: openCircuits.length === 0 ? 'healthy' : 'degraded',
    openCircuits,
    circuitStates: circuitSummary,
  });
});
```

Push-Based Collection (StatsD/Datadog Pattern):
Services actively push metrics to a collection endpoint.
```typescript
// StatsD-based metrics pushing
import StatsD from 'hot-shots';

const statsd = new StatsD({
  host: 'statsd.monitoring.local',
  port: 8125,
  prefix: 'circuit_breaker.',
});

// Assumed shape of the metrics snapshot passed to updateGauges
interface CircuitMetrics {
  failureRate: number;
  slowCallRate: number;
  bufferedCalls: number;
}

// Real-time event pushing
class StatsDBreakerMetrics {
  private circuitName: string;

  constructor(circuitName: string) {
    this.circuitName = circuitName;
  }

  recordSuccess(durationMs: number): void {
    statsd.increment(`${this.circuitName}.success`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }

  recordFailure(durationMs: number): void {
    statsd.increment(`${this.circuitName}.failure`);
    statsd.timing(`${this.circuitName}.duration`, durationMs);
  }

  recordNotPermitted(): void {
    statsd.increment(`${this.circuitName}.not_permitted`);
  }

  recordStateTransition(from: string, to: string): void {
    // Immediate visibility into state changes
    // (events are a Datadog extension to the StatsD protocol)
    statsd.event(
      `Circuit ${this.circuitName} transition`,
      `State changed from ${from} to ${to}`,
      { alert_type: to === 'OPEN' ? 'warning' : 'info' }
    );
    statsd.increment(`${this.circuitName}.transition.${from}_to_${to}`);
  }

  updateGauges(metrics: CircuitMetrics): void {
    statsd.gauge(`${this.circuitName}.failure_rate`, metrics.failureRate);
    statsd.gauge(`${this.circuitName}.slow_call_rate`, metrics.slowCallRate);
    statsd.gauge(`${this.circuitName}.buffered_calls`, metrics.bufferedCalls);
  }
}
```

Many organizations use both patterns: pull-based for regular metrics scraping and push-based (or event-based) for critical events like state transitions. This provides both comprehensive monitoring and real-time alerting.
Effective dashboards present circuit breaker information in a way that enables quick understanding and action. Different audiences need different views.
Dashboard 1: Circuit Breaker Overview (Primary)
This dashboard provides a system-wide view of all circuit breakers, designed for on-call engineers and NOC teams.
```promql
# PromQL queries for circuit breaker dashboard

# Count of open circuits (single stat panel)
# The state is the gauge's value, not a label, so filter on the value
count(circuit_breaker_state == 1) or vector(0)

# State map by circuit (table panel)
# Assumes a circuit_breaker_info metric that carries downstream_service as a label
circuit_breaker_state * on(circuit_name) group_left(downstream_service) circuit_breaker_info

# Failure rate over time (graph panel)
circuit_breaker_failure_rate{circuit_name=~"$circuit"}

# Not permitted requests per second (graph panel)
rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m])

# State transitions per hour (counter panel)
increase(circuit_breaker_state_transitions_total[1h])

# Success rate through circuit (graph panel)
sum(rate(circuit_breaker_calls_total{outcome="success"}[5m])) by (circuit_name)
/
sum(rate(circuit_breaker_calls_total{outcome=~"success|failure"}[5m])) by (circuit_name)

# Time in current state (gauge panel)
# Assumes a gauge holding the Unix timestamp of the last state change
time() - circuit_breaker_state_changed_timestamp_seconds
```

Dashboard 2: Circuit Breaker Deep Dive (Per-Circuit)
This dashboard provides detailed analysis of a single circuit, useful for debugging and tuning.
The most powerful debugging view correlates circuit breaker state with downstream metrics. When you can see that the payment service P99 latency spiked to 20 seconds at the exact moment the circuit opened, causation is immediately clear.
Circuit breaker events can trigger alerts, but not all events warrant human attention. Effective alerting distinguishes between informational events and actionable incidents.
Alert: Circuit Opened (Warning)
Triggered immediately when a circuit transitions to OPEN state.
```yaml
# Prometheus alerting rules for circuit breakers
groups:
  - name: circuit_breaker_alerts
    rules:
      # Alert when any circuit opens
      - alert: CircuitBreakerOpened
        expr: circuit_breaker_state == 1
        for: 0s  # Alert immediately
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.circuit_name }} is OPEN"
          # $value here is the state value (1), not the failure rate,
          # so the failure rate is not interpolated into the description
          description: |
            Circuit {{ $labels.circuit_name }} to {{ $labels.downstream_service }}
            has opened due to failure rate exceeding threshold.

      # Alert when circuit remains open for extended period
      - alert: CircuitBreakerOpenExtended
        expr: |
          circuit_breaker_state == 1
          and
          (time() - circuit_breaker_state_changed_timestamp_seconds) > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit {{ $labels.circuit_name }} open for >5 minutes"
          description: |
            Circuit has been open for more than 5 minutes.
            Recovery probes may be failing, or downstream service
            is not recovering. Manual investigation required.

      # Alert on circuit flapping (multiple transitions in short period)
      # Summed per circuit because the counter has one series per from/to pair
      - alert: CircuitBreakerFlapping
        expr: |
          sum by (circuit_name) (increase(circuit_breaker_state_transitions_total[15m])) > 4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Circuit {{ $labels.circuit_name }} is flapping"
          description: |
            Circuit has transitioned more than 4 times in 15 minutes.
            This indicates either threshold misconfiguration or an
            unstable downstream service. Review circuit config and
            downstream health.

      # Alert on high rejection rate
      # Both sides are aggregated per circuit so the outcome label does not
      # restrict the denominator to rejected calls only
      - alert: HighCircuitRejectionRate
        expr: |
          sum by (circuit_name) (rate(circuit_breaker_calls_total{outcome="not_permitted"}[5m]))
          /
          sum by (circuit_name) (rate(circuit_breaker_calls_total[5m]))
          > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Circuit {{ $labels.circuit_name }} rejecting >10% of traffic"
          description: |
            More than 10% of requests to this circuit are being
            rejected due to open circuit state. User impact likely.
```

Alert Severity Guidelines:
| Condition | Severity | Action Required |
|---|---|---|
| Circuit opens (non-critical service) | Info | Monitor, no immediate action |
| Circuit opens (critical service) | Warning | Investigate, monitor recovery |
| Circuit open > 5 minutes | Warning → Critical | Active investigation required |
| Circuit flapping (repeated transitions) | Warning | Review configuration |
| Multiple circuits open simultaneously | Critical | Potential systemic issue |
| Circuit never opens despite known issues | Warning | Configuration audit needed |
Circuits opening and closing is normal, healthy behavior. Alerting on every OPEN transition will create noise. Instead, alert on patterns that require investigation: extended open duration, flapping, or aggregate impact. A circuit that opens for 30 seconds during a transient issue and auto-recovers might not need human attention.
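The two patterns named here, flapping and extended open duration, are easy to express over a circuit's transition history. The sketch below is one possible in-process formulation, assuming transitions are stored oldest first; the `Transition` type, function names, and default thresholds (more than 4 transitions in 15 minutes, open for more than 5 minutes) are illustrative choices, not part of any particular library.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical record of a state change; transitions are assumed sorted oldest first.
interface Transition {
  atMs: number;
  to: State;
}

// Flapping: more than maxTransitions state changes inside the look-back window.
function isFlapping(
  transitions: Transition[],
  nowMs: number,
  windowMs: number = 15 * 60000,
  maxTransitions: number = 4
): boolean {
  const recent = transitions.filter((t) => nowMs - t.atMs <= windowMs);
  return recent.length > maxTransitions;
}

// Extended open: the most recent transition was to OPEN and it happened too long ago.
function isOpenTooLong(
  transitions: Transition[],
  nowMs: number,
  maxOpenMs: number = 5 * 60000
): boolean {
  const last = transitions[transitions.length - 1];
  return last !== undefined && last.to === 'OPEN' && nowMs - last.atMs > maxOpenMs;
}
```

A circuit that opens once and recovers within 30 seconds trips neither check, which is the behavior you want from a low-noise alert.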
While metrics provide quantitative visibility, logs provide qualitative context. Structured logging makes circuit breaker events searchable and analyzable.
What to Log:
```typescript
// Structured logging for circuit breaker events
import { Logger } from './logger';

class CircuitBreakerLogger {
  private logger: Logger;
  private circuitName: string;

  constructor(circuitName: string) {
    this.circuitName = circuitName;
    this.logger = Logger.getLogger('circuit-breaker');
  }

  logStateTransition(
    from: State,
    to: State,
    metrics: CircuitMetrics,
    trigger: string
  ): void {
    // State transitions are significant - WARN level
    this.logger.warn({
      event: 'circuit_breaker_state_transition',
      circuit_name: this.circuitName,
      downstream: metrics.downstreamService,
      from_state: from,
      to_state: to,
      trigger: trigger, // 'failure_threshold_exceeded', 'timeout_expired', etc.
      metrics: {
        failure_rate: metrics.failureRate,
        slow_call_rate: metrics.slowCallRate,
        buffered_calls: metrics.bufferedCalls,
        failed_calls: metrics.failedCalls,
        slow_calls: metrics.slowCalls,
      },
      time_in_previous_state_ms: metrics.timeInPreviousState,
    });
  }

  logNotPermitted(requestContext: RequestContext): void {
    // Not-permitted calls might be noisy - INFO or DEBUG level
    this.logger.info({
      event: 'circuit_breaker_call_not_permitted',
      circuit_name: this.circuitName,
      request_id: requestContext.requestId,
      path: requestContext.path,
      method: requestContext.method,
      current_state: 'OPEN',
      reason: 'Circuit is open, request rejected immediately',
    });
  }

  logProbeResult(
    probeNumber: number,
    totalProbes: number,
    success: boolean,
    durationMs: number
  ): void {
    this.logger.info({
      event: 'circuit_breaker_probe_result',
      circuit_name: this.circuitName,
      probe_number: probeNumber,
      total_probes: totalProbes,
      success: success,
      duration_ms: durationMs,
      state: 'HALF_OPEN',
    });
  }

  logThresholdBreach(
    metricType: 'failure_rate' | 'slow_call_rate',
    currentValue: number,
    threshold: number
  ): void {
    this.logger.warn({
      event: 'circuit_breaker_threshold_breach',
      circuit_name: this.circuitName,
      metric_type: metricType,
      current_value: currentValue,
      threshold: threshold,
      message: `${metricType} (${currentValue}%) exceeded threshold (${threshold}%)`,
    });
  }
}
```

Log Aggregation Queries:
With structured logs, you can perform powerful queries:
```text
# Elasticsearch/OpenSearch-style queries for circuit breaker analysis
# (the "| stats" and "| where" stages are illustrative pseudo-syntax;
#  adapt them to your log platform's aggregation language)

# Find all circuits that opened in the last hour
event:"circuit_breaker_state_transition" AND to_state:"OPEN" AND @timestamp:[now-1h TO now]

# Find circuits that are flapping
event:"circuit_breaker_state_transition" | stats count by circuit_name | where count > 5

# Find requests rejected by circuits for a specific service
event:"circuit_breaker_call_not_permitted" AND path:"/api/checkout/*"

# Analyze probe success rate during recovery
event:"circuit_breaker_probe_result" AND circuit_name:"payment-service" | stats count by success

# Correlate circuit opens with error logs from downstream
(event:"circuit_breaker_state_transition" AND to_state:"OPEN") OR (service:"payment-service" AND level:"ERROR")
```

For high-traffic circuits, logging every not-permitted call can be overwhelming. Consider sampling: log every Nth rejection, or log only the first rejection per minute. State transitions should always be logged without sampling because they're infrequent and critical.
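The two sampling strategies mentioned here can be combined in a few lines. The following is a minimal sketch, assuming a single-threaded caller that supplies the current time; `RejectionLogSampler` and its parameters are hypothetical names, and the class only decides whether to log, leaving the actual logging to the caller.

```typescript
// Decides whether a given not-permitted call should be logged.
// Logs the first rejection in each time window, plus every Nth rejection overall.
class RejectionLogSampler {
  private readonly everyNth: number;
  private readonly windowMs: number;
  private count = 0;
  private lastWindowStart = -Infinity;

  constructor(everyNth: number, windowMs: number) {
    this.everyNth = everyNth;
    this.windowMs = windowMs;
  }

  shouldLog(nowMs: number): boolean {
    this.count += 1;
    // First rejection after a quiet period (or ever) starts a new window and is logged.
    if (nowMs - this.lastWindowStart >= this.windowMs) {
      this.lastWindowStart = nowMs;
      return true;
    }
    // Otherwise log only every Nth rejection.
    return this.count % this.everyNth === 0;
  }
}

// Example: first rejection per minute, plus every 100th rejection.
const sampler = new RejectionLogSampler(100, 60000);
if (sampler.shouldLog(Date.now())) {
  console.log('circuit_breaker_call_not_permitted (sampled)');
}
```

Counting every rejection in a metric while sampling the logs keeps totals accurate without flooding log storage.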
Distributed tracing provides request-level visibility that complements metrics and logs. Integrating circuit breaker state into traces enables powerful debugging capabilities.
Adding Circuit State to Spans:
When a request passes through a circuit breaker, annotate the trace span with circuit information:
```typescript
// OpenTelemetry integration for circuit breakers
import { trace, SpanStatusCode } from '@opentelemetry/api';

class TracedCircuitBreaker {
  private breaker: CircuitBreaker;
  private tracer = trace.getTracer('circuit-breaker');

  constructor(breaker: CircuitBreaker) {
    this.breaker = breaker;
  }

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    const span = this.tracer.startSpan('circuit-breaker.execute', {
      attributes: {
        'circuit.name': this.breaker.name,
        'circuit.downstream': this.breaker.downstream,
        'circuit.state': this.breaker.state,
      },
    });

    try {
      if (this.breaker.state === 'OPEN') {
        // Request rejected by open circuit
        span.setAttributes({
          'circuit.result': 'not_permitted',
          'circuit.rejection_reason': 'circuit_open',
        });
        span.setStatus({ code: SpanStatusCode.ERROR, message: 'Circuit open' });
        throw new CircuitBreakerOpenException();
      }

      const startTime = Date.now();
      const result = await operation();
      const duration = Date.now() - startTime;

      span.setAttributes({
        'circuit.result': 'success',
        'circuit.duration_ms': duration,
        'circuit.was_slow': duration > this.breaker.slowCallThreshold,
      });

      return result;
    } catch (error) {
      // A rejection was already annotated above; don't relabel it as a downstream failure
      if (!(error instanceof CircuitBreakerOpenException)) {
        span.setAttributes({
          'circuit.result': 'failure',
          'circuit.error_type': error instanceof Error ? error.constructor.name : typeof error,
        });
        span.recordException(error instanceof Error ? error : String(error));
        span.setStatus({ code: SpanStatusCode.ERROR });
      }
      throw error;
    } finally {
      // Always include final circuit state
      span.setAttributes({
        'circuit.final_state': this.breaker.state,
        'circuit.failure_rate': this.breaker.metrics.failureRate,
      });
      span.end();
    }
  }
}
```

Benefits of Trace-Level Circuit Visibility:
If your tracing system uses sampling, ensure that circuit breaker events can force trace capture. A request rejected by an open circuit might otherwise be ignored by sampling, losing valuable debugging information.
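One way to picture this is a keep-or-drop decision made after the request outcome is known. The sketch below is not an OpenTelemetry API; `shouldKeepTrace`, `RequestOutcome`, and the injectable random source are hypothetical, and whether you can make this decision at all depends on your tracing backend supporting outcome-aware (tail-based) sampling.

```typescript
// Hypothetical summary of how the circuit breaker handled a request.
interface RequestOutcome {
  circuitResult: 'success' | 'failure' | 'not_permitted';
}

// Keeps every trace for a request rejected by an open circuit;
// all other traces are kept with probability baseSampleRate (0 to 1).
function shouldKeepTrace(
  outcome: RequestOutcome,
  baseSampleRate: number,
  random: () => number = Math.random
): boolean {
  if (outcome.circuitResult === 'not_permitted') {
    return true; // never lose a rejection to sampling
  }
  return random() < baseSampleRate;
}

// A rejected request is kept even with a 1% base sample rate.
console.log(shouldKeepTrace({ circuitResult: 'not_permitted' }, 0.01)); // true
```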
Monitoring is only valuable if operators know how to respond. Runbooks codify the response procedures for circuit breaker events.
Runbook: Circuit Opens Unexpectedly
```markdown
# Runbook: Circuit Breaker Opened

## 1. Initial Assessment (30 seconds)
- Which circuit opened? Check alert details.
- Is this a critical path circuit? Check criticality labeling.
- Are other circuits to the same downstream also open?

## 2. Verify Downstream Health (1-2 minutes)
- Check downstream service dashboard
- Review downstream service logs for errors
- Check downstream service pod/container status
- Verify downstream database/dependencies health

## 3. Assess Impact (1 minute)
- What user-facing functionality is affected?
- Is fallback behavior working?
- How many users/requests are affected?

## 4. Decide Response
IF downstream is genuinely unhealthy:
  → Let circuit protect the system
  → Focus on downstream recovery
  → Monitor circuit state for auto-recovery

IF downstream appears healthy (possible false positive):
  → Review circuit configuration
  → Check for threshold misconfiguration
  → Consider temporarily adjusting thresholds

IF circuit is flapping:
  → Investigate intermittent downstream issues
  → Consider increasing wait duration
  → Review minimum calls threshold

## 5. Recovery Verification
- Monitor half-open probes in dashboard
- Verify circuit closes after downstream recovery
- Confirm user-facing functionality restored

## 6. Post-Incident
- Document root cause if not auto-recovered
- Review if circuit configuration was appropriate
- Update runbook with lessons learned
```

Runbook: Manually Overriding Circuit State
In rare cases, operators may need to manually force a circuit state:
````markdown
# Runbook: Manual Circuit Override

## WARNING
Manual overrides bypass automatic protection. Use only when:
- Circuit is misconfigured and needs emergency correction
- Downstream is confirmed healthy but circuit won't recover
- Testing circuit behavior in staging environment

## Force Circuit CLOSED
Use when: Circuit is stuck open despite healthy downstream

```bash
# Via admin API (if exposed)
curl -X POST http://localhost:8080/admin/circuits/payment-service/close

# Via configuration (requires restart or dynamic config)
# Shell variable names cannot contain hyphens, so the circuit name is upper-snake-cased
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=false
```

## Force Circuit OPEN
Use when: Need to stop traffic to a service immediately

```bash
# Via admin API
curl -X POST http://localhost:8080/admin/circuits/payment-service/open

# Via configuration
export CIRCUIT_PAYMENT_SERVICE_FORCE_OPEN=true
```

## Reset Circuit Metrics
Use when: Historical failures skewing current evaluation

```bash
curl -X POST http://localhost:8080/admin/circuits/payment-service/reset
```

## IMPORTANT: Revert Overrides
After emergency override, always:
1. Document the override in incident channel
2. Monitor circuit behavior closely
3. Remove override once issue is resolved
4. Let circuit return to automatic operation
````

All manual overrides should be logged and auditable. An override without an audit trail makes post-incident analysis difficult and can mask configuration issues that would otherwise be caught.
We've comprehensively explored how to make circuit breaker behavior visible and manageable through effective monitoring, alerting, and operational processes.
What's next:
With monitoring in place, the final page covers implementation considerations—practical guidance for choosing libraries, handling edge cases, integrating with existing systems, and deploying circuit breakers effectively.
You now understand how to make circuit breakers observable and manageable. You can design dashboards that surface critical information, configure alerts that reduce noise while catching real issues, and create runbooks that enable effective operational response.