A circuit breaker is only as good as its configuration. Set the failure threshold too high, and the circuit never trips—you'll experience full cascade failures before protection engages. Set it too low, and the circuit trips on transient noise—users experience unnecessary degradation.
Threshold configuration is not guesswork. It requires understanding your system's baseline behavior, the statistical properties of failure detection, and the trade-offs between sensitivity and stability. This page equips you with the analytical framework to configure circuit breaker thresholds correctly.
Incorrect threshold configuration is the most common reason circuit breakers fail to provide protection in production. Engineers often copy default values without understanding whether those defaults fit their specific use case. By the end of this page, you'll know how to reason about and tune each configuration parameter for your specific context.
By the end of this page, you will understand the mathematics behind failure rate calculation, how to select appropriate thresholds based on your service's characteristics, why minimum volume requirements are essential, the trade-offs in sliding window sizing, and a systematic approach to threshold tuning in production.
The failure rate is the fundamental metric that circuit breakers use to assess dependency health. Before configuring thresholds, you must understand exactly how failure rate is calculated and what it represents.
The Basic Formula
Failure Rate = Failed Calls / Total Calls × 100%
Where:

- Failed Calls = calls in the window that returned an error, timed out, or were otherwise recorded as failures
- Total Calls = all calls recorded in the same window
Example Calculation
Over the last 100 calls:

- 80 calls succeeded
- 20 calls failed
Failure Rate = 20 / 100 × 100% = 20%
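To make the formula concrete, here is a minimal sketch of a count-based window that records outcomes and computes the rate. The class name and structure are illustrative only (no thread safety, no slow-call tracking), not taken from any library.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal illustration of the failure-rate formula over a count-based window.
class FailureRateWindow {
    private final int capacity;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    FailureRateWindow(int capacity) {
        this.capacity = capacity;
    }

    void record(boolean failed) {
        if (outcomes.size() == capacity) {
            outcomes.removeFirst();              // oldest call drops off
        }
        outcomes.addLast(failed);
    }

    double failureRatePercent() {
        if (outcomes.isEmpty()) return 0.0;
        long failures = outcomes.stream().filter(f -> f).count();
        return 100.0 * failures / outcomes.size();   // Failed Calls / Total Calls × 100%
    }
}
```

Recording 80 successes and 20 failures into a 100-call instance yields `failureRatePercent() == 20.0`, matching the example above.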
The Measurement Window
Failure rate is not calculated over all time—it's calculated over a sliding window. This window can be:
1. Count-Based Window: Track the last N calls. Oldest calls drop off as new ones arrive.
Example: Window size = 100 calls. Failure rate = failures in last 100 calls / 100.
Pros: Consistent sample size; predictable statistical properties. Cons: Time to fill window varies with traffic volume.
2. Time-Based Window: Track calls within the last T seconds. All calls older than T are excluded.
Example: Window = 60 seconds. Failure rate = failures in last 60 seconds / total calls in last 60 seconds.
Pros: Consistent time horizon; failure rate ages out at predictable rate. Cons: Sample size varies with traffic; low traffic = high variance.
3. Hybrid (Time-Bucketed) Window: Divide time into buckets (e.g., 10 × 6-second buckets for a 60-second window). Count successes/failures per bucket. Calculate the aggregate rate.
Pros: Memory efficient; provides time decay. Cons: Bucket granularity affects precision.
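The bucketed approach can be sketched as follows. The bucket count, duration, and class name are illustrative assumptions, not any particular library's internals.

```java
import java.time.Clock;

// Illustrative time-bucketed window: a 60-second horizon split into 10 buckets
// of 6 seconds each. Buckets that have aged out are cleared as time advances.
class BucketedWindow {
    private static final int BUCKETS = 10;
    private static final long BUCKET_MILLIS = 6_000;

    private final long[] failures = new long[BUCKETS];
    private final long[] totals = new long[BUCKETS];
    private final Clock clock;
    private long lastBucketEpoch = -1;

    BucketedWindow(Clock clock) { this.clock = clock; }

    void record(boolean failed) {
        int idx = advanceToCurrentBucket();
        totals[idx]++;
        if (failed) failures[idx]++;
    }

    double failureRatePercent() {
        advanceToCurrentBucket();
        long f = 0, t = 0;
        for (int i = 0; i < BUCKETS; i++) { f += failures[i]; t += totals[i]; }
        return t == 0 ? 0.0 : 100.0 * f / t;
    }

    // Clears buckets that have aged out since the last call, then returns the
    // index of the bucket covering "now".
    private int advanceToCurrentBucket() {
        long epoch = clock.millis() / BUCKET_MILLIS;
        if (lastBucketEpoch >= 0) {
            long steps = Math.min(epoch - lastBucketEpoch, BUCKETS);
            for (long s = 1; s <= steps; s++) {
                int idx = (int) ((lastBucketEpoch + s) % BUCKETS);
                failures[idx] = 0;
                totals[idx] = 0;
            }
        }
        lastBucketEpoch = epoch;
        return (int) (epoch % BUCKETS);
    }
}
```

The Resilience4j configuration below shows how to select between count-based and time-based windows.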
```java
// Count-based sliding window
CircuitBreakerConfig countBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)        // Evaluate last 100 calls
    .failureRateThreshold(50)
    .build();

// Time-based sliding window
CircuitBreakerConfig timeBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.TIME_BASED)
    .slidingWindowSize(60)         // Evaluate last 60 seconds
    .failureRateThreshold(50)
    .build();

/*
 * Comparison for a service receiving 10 requests/second:
 *
 * Count-based (100 calls):
 * - Window represents last 10 seconds of traffic
 * - Sample size is always 100 (after warmup)
 * - Predictable statistical precision
 *
 * Time-based (60 seconds):
 * - Window always represents last 60 seconds
 * - Sample size is ~600 calls (at 10 req/s)
 * - Better smoothing, but slower to detect change
 */
```

Count-based windows are generally preferred because they provide consistent sample sizes regardless of traffic volume. Time-based windows can have high variance during low-traffic periods or be over-smoothed during high-traffic periods.
Setting a failure rate threshold requires understanding basic statistics. What failure rate indicates a genuinely unhealthy service versus normal variance?
Baseline Failure Rate
Every service has a baseline failure rate under normal conditions. This is not zero—transient failures, client errors, and edge cases always produce some failures:
| Service Type | Typical Baseline Failure Rate |
|---|---|
| Healthy internal microservice | 0.1% - 1% |
| External API dependency | 1% - 5% |
| Database with high contention | 0.5% - 2% |
| Network-intensive service | 0.5% - 3% |
The Signal-to-Noise Challenge
If your baseline failure rate is 2% and you set your threshold at 5%, you need to reliably distinguish between:

- Normal variance: the observed rate bouncing around the 2% baseline from sample to sample
- Genuine degradation: a sustained shift to 5% or more
With small sample sizes, normal variance can easily swing from 2% to 5% or higher by random chance.
Statistical Confidence
Consider the following scenario:

- Baseline failure rate: 2%
- Window: the last 50 calls
- Observed: 3 failures out of 50 (6%)
Is this 6% observed rate indicative of a problem, or just random variance in a small sample?
Confidence Interval Calculation
Using the binomial proportion confidence interval (Wilson score):
For a sample of 50 calls with 3 failures (6% observed), the 95% Wilson interval is approximately [2.1%, 16.2%].
This means the true failure rate could plausibly be anywhere from 2% to 16%. A 50-request sample is insufficient to confidently detect a change from 2% to 6%.
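The interval above can be reproduced with the standard Wilson score formula. The sketch below is a self-contained calculation (class and method names are my own, not a library API):

```java
// Wilson score interval for a binomial proportion (z = 1.96 for ~95% confidence).
final class WilsonInterval {
    static double[] compute(int failures, int total, double z) {
        double p = (double) failures / total;
        double z2 = z * z;
        double denom = 1 + z2 / total;
        double center = (p + z2 / (2.0 * total)) / denom;
        double margin = (z / denom)
            * Math.sqrt(p * (1 - p) / total + z2 / (4.0 * total * total));
        return new double[] { center - margin, center + margin };
    }

    public static void main(String[] args) {
        double[] ci = compute(3, 50, 1.96);
        // Prints roughly [0.021, 0.162]: the true rate plausibly lies between ~2% and ~16%
        System.out.printf("95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}
```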
The Minimum Sample Size
To reliably detect a doubling of the failure rate (e.g., 2% → 4%) with 95% confidence and 80% power, standard two-proportion sample size calculations suggest you need on the order of 1,000 samples.
This has critical implications: your window size determines what magnitude of change you can reliably detect.
| Window Size | Smallest Detectable Change | Detection Time (@100 req/s) |
|---|---|---|
| 20 calls | 2% → 30%+ | 0.2 seconds |
| 50 calls | 2% → 15%+ | 0.5 seconds |
| 100 calls | 2% → 10%+ | 1 second |
| 200 calls | 2% → 7%+ | 2 seconds |
| 500 calls | 2% → 5%+ | 5 seconds |
| 1000 calls | 2% → 4%+ | 10 seconds |
Larger windows provide better statistical confidence but slower detection. Smaller windows detect quickly but with more false positives. There's no free lunch—you must choose the trade-off appropriate for your use case.
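For reference, the detectable-change figures above are of the same order as the textbook two-proportion sample size approximation produces. The sketch below is that standard formula, assuming a normal approximation, α = 0.05, and 80% power; the class and method names are my own.

```java
// Approximate sample size needed per window to distinguish failure rate p1 from p2,
// using the normal-approximation two-proportion formula.
final class SampleSize {
    static long required(double p1, double p2, double zAlpha, double zBeta) {
        double pBar = (p1 + p2) / 2.0;
        double a = zAlpha * Math.sqrt(2 * pBar * (1 - pBar));
        double b = zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
        return (long) Math.ceil(Math.pow(a + b, 2) / Math.pow(p1 - p2, 2));
    }

    public static void main(String[] args) {
        // Detecting 2% → 4% with 95% confidence (zAlpha ≈ 1.96) and 80% power (zBeta ≈ 0.84)
        System.out.println(required(0.02, 0.04, 1.96, 0.84)); // on the order of 1,000+ samples
    }
}
```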
Practical Guidance
For most services, the following guidance applies:
1. Measure your baseline failure rate during normal operation. Track it over days, not hours.
2. Set the threshold significantly above baseline: at least 3-5x the baseline rate, or a minimum of 30-50% failure rate for critical services. A 50% failure rate means half your requests are failing, which is unambiguously a problem.
3. Use window sizes of at least 50-100 for count-based windows. Smaller windows have too much variance.
4. Accept that you're detecting major degradation, not subtle changes. Circuit breakers are for preventing cascades, not alerting on slight SLA degradation.
The minimum volume requirement is one of the most important—and most misunderstood—circuit breaker configuration parameters. Without it, circuits can trip inappropriately during low-traffic periods.
The Problem Without Minimum Volume
Consider a circuit with a 50% failure threshold and no minimum volume requirement:

- The window is freshly empty (after startup or a reset)
- Request 1 succeeds; request 2 fails
- Failure rate: 1 / 2 = 50%, so the circuit opens
This is clearly wrong. One failure out of two requests is not statistically significant evidence of service degradation. Yet without minimum volume, the circuit breaker treats it as such.
The Solution: Minimum Calls Threshold
The minimum volume requirement specifies that failure rate evaluation should only occur when enough samples have been collected:
IF totalCallsInWindow >= minimumNumberOfCalls THEN
evaluateFailureRate()
ELSE
remainClosed() // Not enough data to evaluate
END
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)        // Evaluate last 100 calls
    .failureRateThreshold(50)      // Trip at 50% failure rate
    .minimumNumberOfCalls(20)      // But only if at least 20 calls in window
    .build();

/*
 * Behavior:
 *
 * Scenario A: Window has 10 calls, 5 failed (50% failure rate)
 * → Circuit remains CLOSED (only 10 calls, need 20 minimum)
 *
 * Scenario B: Window has 30 calls, 10 failed (33% failure rate)
 * → Circuit remains CLOSED (33% < 50% threshold)
 *
 * Scenario C: Window has 30 calls, 20 failed (67% failure rate)
 * → Circuit OPENS (enough calls, threshold exceeded)
 */
```

Choosing the Minimum Volume
How high should the minimum volume be? Consider these factors:
1. Statistical Significance
As discussed earlier, you need enough samples for your failure rate calculation to be meaningful. Rule of thumb: minimum volume should be at least 20-50 requests.
2. Traffic Patterns
Minimum volume should be reachable during your lowest-traffic period:
| Traffic Pattern | Minimum Volume Guidance |
|---|---|
| High and consistent (>100 req/s) | 50-100 calls |
| Moderate (10-100 req/s) | 20-50 calls |
| Low or bursty (<10 req/s) | 10-20 calls |
| Very low (<1 req/s) | Consider disabling circuit breaker |
3. Detection Latency
Higher minimum volume means longer time before the circuit can trip:
| Traffic Rate | Minimum Volume | Time to Reach Minimum |
|---|---|---|
| 100 req/s | 20 | 0.2 seconds |
| 100 req/s | 50 | 0.5 seconds |
| 10 req/s | 20 | 2 seconds |
| 10 req/s | 50 | 5 seconds |
| 1 req/s | 20 | 20 seconds |
For low-traffic services, high minimum volumes can delay protection significantly.
When a service starts up, the sliding window is empty. With high minimum volume requirements, the circuit breaker won't engage until the window fills. This is usually desirable—you don't want circuits tripping on startup transients. But be aware that protection isn't active until minimum volume is reached.
The Relationship Between Window Size and Minimum Volume
These parameters interact:

- The minimum volume must not exceed the window size, or the circuit can never evaluate
- A minimum close to the window size delays the first evaluation until the window is nearly full
- A minimum that is too small reintroduces the low-sample false-positive problem
Recommended Configuration
For most use cases:
minimumNumberOfCalls = 0.2 × slidingWindowSize
With a window size of 100, this gives a minimum of 20 calls.
For critical services where false positives are costly:
minimumNumberOfCalls = 0.5 × slidingWindowSize (e.g., 50 calls for a 100-call window)
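If you want to encode these ratios as a convention, a small helper can derive the minimum volume from the window size. The helper and its name are hypothetical; the builder methods are the same Resilience4j calls used earlier on this page.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

// Hypothetical helper applying the 20% (default) or 50% (conservative) ratio.
final class WindowDefaults {
    static CircuitBreakerConfig countBased(int windowSize, boolean conservative) {
        int minimumCalls = (int) Math.ceil(windowSize * (conservative ? 0.5 : 0.2));
        return CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(windowSize)
            .minimumNumberOfCalls(minimumCalls)   // 20 for a 100-call window, 50 if conservative
            .failureRateThreshold(50)
            .build();
    }
}
```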
The sliding window size determines how much history the circuit breaker considers when evaluating health. It's perhaps the most impactful configuration parameter.
The Trade-off Matrix
| Aspect | Small Window (10-50) | Medium Window (50-200) | Large Window (200-1000) |
|---|---|---|---|
| Detection Speed | Very fast (seconds) | Fast (seconds to minutes) | Slow (minutes) |
| Statistical Confidence | Low (high variance) | Moderate | High (stable) |
| Sensitivity to Spikes | High (may overreact) | Moderate | Low (smooths over spikes) |
| Recovery Detection | Fast | Moderate | Slow (old failures linger) |
| Memory Usage | Low | Moderate | Higher |
| Best For | Critical paths, high-traffic | Most services | Stable services, low priority |
The Detection Latency Issue
Window size directly impacts how quickly the circuit breaker detects degradation:
Scenario: Service starts failing 100% of requests
| Window Size | Traffic Rate | Time to Trip (50% threshold) |
|---|---|---|
| 50 calls | 100 req/s | ~0.25 seconds (25 failures to reach 50%) |
| 50 calls | 10 req/s | ~2.5 seconds |
| 200 calls | 100 req/s | ~1 second (100 failures to reach 50%) |
| 200 calls | 10 req/s | ~10 seconds |
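The trip times in the table follow from simple arithmetic: starting from a window full of successes, the circuit needs threshold × windowSize consecutive failures before the rate crosses the threshold. A throwaway calculation, with names of my own choosing:

```java
// Time for a circuit to trip when a dependency starts failing 100% of requests,
// assuming the window was previously full of successes.
final class TripTime {
    static double secondsToTrip(int windowSize, double failureRateThreshold, double requestsPerSecond) {
        double failuresNeeded = windowSize * failureRateThreshold;   // e.g. 200 × 0.5 = 100 failures
        return failuresNeeded / requestsPerSecond;                   // e.g. 100 / 100 req/s = 1 second
    }

    public static void main(String[] args) {
        System.out.println(secondsToTrip(50, 0.5, 100));   // 0.25
        System.out.println(secondsToTrip(200, 0.5, 10));   // 10.0
    }
}
```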
The Stale Failure Problem
Larger windows have a counterintuitive issue: old failures can keep the circuit tripped even after the dependency recovers.
Scenario:

- Window size: 200 calls
- The dependency fails hard for a short period, putting roughly 100 failures into the window
- The dependency then recovers and new calls succeed, but the recorded failures remain
The window must completely cycle before old failures age out.
| Time | Window Contents (200-call window) | Failure Rate | Circuit State |
|---|---|---|---|
| T+0 | 200 successes | 0% | CLOSED |
| T+10s | 100 successes + 100 failures | 50% | → OPEN (just tripped) |
| T+10s | (Recovery timeout: 30s) | - | OPEN |
| T+40s | Recovery test | - | → HALF-OPEN |
| T+40s | Probe succeeds (dependency fixed) | - | → CLOSED, counters reset |
| T+41s | Window: 1 success | 0% | CLOSED (fresh start) |

Note: Counters are typically reset when the circuit closes, solving the stale failure problem. But during the OPEN state, you must wait for the recovery timeout regardless of window contents.

Time-Based Window Considerations
For time-based windows, size is specified in seconds rather than call count:
| Time Window | Effective Call Count at Various Traffic Levels |
|---|---|
| 10 seconds | 100 calls (10 req/s), 1000 calls (100 req/s) |
| 60 seconds | 600 calls (10 req/s), 6000 calls (100 req/s) |
| 120 seconds | 1200 calls (10 req/s), 12000 calls (100 req/s) |
Time-based windows have variable effective sample size based on traffic volume. During traffic spikes, you have more data. During lulls, you have less.
Practical Recommendations

Based on the trade-offs above:

- Default to a count-based window of around 100 calls; it balances detection speed and statistical confidence for most services.
- Use smaller windows (around 50) for high-traffic critical paths where detection speed is paramount.
- Use larger windows (200+) for stable, lower-priority dependencies where avoiding false trips matters more than fast detection.
- Avoid windows under 50 calls unless traffic is very high; the variance makes spurious trips likely.
Modern circuit breaker implementations like Resilience4j can trip based on slow calls, not just failures. This catches a critical failure mode: latency degradation that hasn't yet manifested as errors.
Why Slow Calls Matter
A service returning slowly is often worse than one returning errors:
| Failure Mode | Resource Consumption | User Experience | Detection Speed |
|---|---|---|---|
| Fast failure (error response) | Low | See error quickly | Immediate |
| Slow failure (timeout) | High | Wait, then see error | Slow |
| Slow success | High | Wait, then proceed | May not be detected |
A "successful" response that takes 10 seconds instead of 100ms still blocks the calling thread for 10 seconds. The cascade failure mechanics from our earlier discussion apply equally to slow successful calls.
Configuring Slow Call Detection
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    // Standard failure rate threshold
    .failureRateThreshold(50)

    // Slow call detection
    .slowCallRateThreshold(80)                            // Trip if 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))     // Define "slow" as >2 seconds

    // Common settings
    .minimumNumberOfCalls(20)
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .build();

/*
 * Now the circuit will open if EITHER condition is met:
 *
 * Condition 1: Failure Rate >= 50%
 *   (Standard failure detection)
 *
 * Condition 2: Slow Call Rate >= 80%
 *   Where "slow" = response time > 2 seconds
 *   (Catches latency degradation before timeouts)
 *
 * Example triggering Condition 2:
 * - 100 calls in window
 * - 10 failed (10% failure rate - below threshold)
 * - 85 took >2 seconds (85% slow rate - ABOVE threshold)
 * - Circuit opens despite "only" 10% failure rate
 */
```

Determining the Slow Call Duration Threshold
The "slow" threshold should be significantly above your normal response time but below your timeout:
P50 Latency << Slow Threshold << Timeout
Example:

- P50 latency: 100ms
- Slow call threshold: 2 seconds
- Timeout: 5 seconds
Guidance for setting slow threshold:
| Baseline Latency (P99) | Recommended Slow Threshold |
|---|---|
| 50ms | 500ms - 1s |
| 100ms | 500ms - 2s |
| 500ms | 2s - 5s |
| 1s | 3s - 10s |
The threshold should catch genuine degradation while ignoring occasional slow calls that are within acceptable variance.
Slow call rate thresholds are typically set higher than failure rate thresholds (e.g., 80% slow vs. 50% failure). Some slow calls are expected—network variance, GC pauses, etc. But when 80% of calls are slow, something is genuinely wrong even if they all eventually succeed.
The Interaction with Timeouts
Slow call detection and timeouts work together:
Timeouts bound the maximum wait time. Slow call detection triggers protection earlier. Together, they provide comprehensive latency-based protection.
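As an illustration of keeping the three values ordered (P50 << slow threshold << timeout), the sketch below pairs the circuit breaker configuration from earlier with a Resilience4j TimeLimiter (from the resilience4j-timelimiter module). Treat the exact numbers as placeholders for your own baselines rather than recommended values.

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

// Latency protection in two layers: slow-call detection trips the circuit early,
// while the time limiter bounds the worst-case wait for any single call.
class LatencyProtectionConfig {

    // "Slow" is defined well above P50 (~100ms in this example) but below the hard timeout.
    static final CircuitBreakerConfig CIRCUIT = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slowCallRateThreshold(80)
        .slowCallDurationThreshold(Duration.ofSeconds(2))   // trips on sustained slowness
        .slidingWindowSize(100)
        .minimumNumberOfCalls(20)
        .build();

    // The hard ceiling for any individual call.
    static final TimeLimiterConfig TIMEOUT = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(5))             // timeout > slow-call threshold
        .build();
}
```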
Armed with theory, let's develop a practical approach to threshold tuning for real services.
Step 1: Establish Baselines
Before configuring thresholds, collect baseline metrics:
SELECT
percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms) as p50,
percentile_cont(0.90) WITHIN GROUP (ORDER BY duration_ms) as p90,
percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99,
COUNT(*) FILTER (WHERE status = 'error') * 100.0 / COUNT(*) as error_rate,
COUNT(*) as total_calls
FROM service_call_logs
WHERE
service = 'inventory-service'
AND timestamp > NOW() - INTERVAL '7 days'
Collect this data over a representative period (at least a week) covering:

- Weekday and weekend traffic patterns
- Peak and off-peak hours
- Any scheduled batch jobs or known traffic spikes
Step 2: Initial Configuration
Start with conservative settings:
```java
// Given baseline:
// - P50: 50ms, P99: 200ms
// - Error rate: 0.5%
// - Traffic: 50 req/s

CircuitBreakerConfig initialConfig = CircuitBreakerConfig.custom()
    // Conservative failure rate: well above baseline
    .failureRateThreshold(50)                             // 0.5% baseline → start at 50%

    // Conservative slow call: 10x P99
    .slowCallRateThreshold(80)
    .slowCallDurationThreshold(Duration.ofSeconds(2))     // 200ms P99 → 2000ms

    // Moderate window size
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)

    // Moderate minimum volume (traffic allows)
    .minimumNumberOfCalls(20)

    // Standard recovery
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .build();

// This configuration:
// - Won't trip under normal operation (0.5% << 50%)
// - Will trip on major degradation (50%+ failures)
// - Will trip on latency degradation (80%+ calls > 2s)
// - Has reasonable detection latency (~2s at 50 req/s)
```

Step 3: Deploy with Monitoring
Deploy the initial configuration with comprehensive monitoring:

- Circuit state transitions (CLOSED → OPEN → HALF-OPEN)
- Failure rate and slow call rate observed in the window
- Calls rejected while the circuit is open
- Latency percentiles of the protected calls
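One low-friction way to surface these signals is the circuit breaker's event publisher. The sketch below logs state transitions and failures; it assumes Resilience4j's event publisher API and whatever logging facade you already use (SLF4J here).

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Attach listeners so every state change and failure shows up in logs/metrics.
final class CircuitBreakerMonitoring {
    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerMonitoring.class);

    static void instrument(CircuitBreaker circuitBreaker) {
        circuitBreaker.getEventPublisher()
            .onStateTransition(event ->
                log.warn("Circuit '{}' transitioned: {}",
                    circuitBreaker.getName(), event.getStateTransition()))
            .onError(event ->
                log.info("Circuit '{}' recorded failure after {} ms",
                    circuitBreaker.getName(), event.getElapsedDuration().toMillis()));
    }
}
```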
Step 4: Observe and Adjust
Analyze circuit behavior over weeks of operation:
If the circuit never trips: verify whether it should have during known incidents. A circuit that stays closed through a real outage has thresholds that are too conservative.
If the circuit trips too often: check whether each trip corresponds to genuine degradation. Trips during normal operation mean the thresholds are too sensitive or the sample size is too small.
If trips are appropriate but recovery is too slow or too fast: tune waitDurationInOpenState rather than the detection thresholds.
| Observation | Diagnosis | Adjustment |
|---|---|---|
| False positives (trips during normal operation) | Thresholds too sensitive | Increase failureRateThreshold or minimumNumberOfCalls |
| False negatives (no trip during known outages) | Thresholds too conservative | Lower failureRateThreshold or slowCallRateThreshold |
| Frequent oscillation (open→closed→open rapidly) | Recovery timeout too short | Increase waitDurationInOpenState |
| Slow recovery after dependency fixed | Recovery timeout too long | Decrease waitDurationInOpenState |
| Trips on brief spikes that resolve quickly | Window too small | Increase slidingWindowSize |
| Slow to detect sustained degradation | Window too large | Decrease slidingWindowSize |
When tuning, adjust one parameter per deployment cycle. This allows you to attribute changes in behavior to specific adjustments. Changing multiple parameters simultaneously makes it impossible to understand cause and effect.
A common design question: should you have one circuit breaker per dependency, or multiple circuits for different operations on the same dependency?
The Global Circuit Approach
One circuit breaker for all calls to a dependency:
PaymentService → [Circuit Breaker] → Payment Gateway
- authorize()
- capture()
- refund()
- getTransaction()
Pros:

- Simple to configure and monitor: one circuit to reason about
- All operations contribute calls, so the window fills and the minimum volume is reached quickly

Cons:

- A failure spike in one operation opens the circuit for every operation, including healthy ones
- A single set of thresholds cannot reflect differences in latency or criticality across operations
The Per-Operation Approach
Separate circuit breakers for different operations:
PaymentService → [CB: authorize] → Payment Gateway /authorize
→ [CB: capture] → Payment Gateway /capture
→ [CB: refund] → Payment Gateway /refund
→ [CB: query] → Payment Gateway /transactions
Pros:

- Failures in one operation do not block the others
- Thresholds can be tuned to each operation's latency profile and criticality

Cons:

- More configuration to create, maintain, and monitor
- Each circuit sees less traffic, so windows fill more slowly and minimum volume takes longer to reach
```java
@Configuration
public class PaymentCircuitBreakerConfig {

    // High-value, critical operation - conservative thresholds
    @Bean
    CircuitBreaker authorizeCircuitBreaker() {
        return CircuitBreaker.of("payment-authorize", CircuitBreakerConfig.custom()
            .failureRateThreshold(30)                           // Lower threshold for critical path
            .slowCallRateThreshold(60)
            .slowCallDurationThreshold(Duration.ofSeconds(5))
            .slidingWindowSize(50)
            .minimumNumberOfCalls(10)
            .waitDurationInOpenState(Duration.ofSeconds(60))
            .build()
        );
    }

    // Less-critical read operation - higher thresholds
    @Bean
    CircuitBreaker queryCircuitBreaker() {
        return CircuitBreaker.of("payment-query", CircuitBreakerConfig.custom()
            .failureRateThreshold(60)                           // Higher threshold for reads
            .slowCallRateThreshold(90)
            .slowCallDurationThreshold(Duration.ofSeconds(3))
            .slidingWindowSize(100)
            .minimumNumberOfCalls(20)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build()
        );
    }

    // Background operation - even higher thresholds
    @Bean
    CircuitBreaker refundCircuitBreaker() {
        return CircuitBreaker.of("payment-refund", CircuitBreakerConfig.custom()
            .failureRateThreshold(70)                           // Can tolerate more failures
            .slowCallRateThreshold(95)
            .slowCallDurationThreshold(Duration.ofSeconds(10))
            .slidingWindowSize(100)
            .minimumNumberOfCalls(10)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build()
        );
    }
}
```

Hybrid Approach: Operation Groups
A middle ground is to group operations by characteristics:
PaymentService → [CB: critical-ops] → authorize, capture
→ [CB: read-ops] → getTransaction, listTransactions
→ [CB: background-ops] → refund, void
This reduces configuration overhead while still providing isolation between fundamentally different operation types.
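One way to express operation groups is a registry with named configurations, so each group shares a single tuned config. The group and circuit names below are illustrative; the registry factory and lookup methods are Resilience4j's, though treat the exact signatures as an assumption to verify against your library version.

```java
import java.time.Duration;
import java.util.Map;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// One tuned configuration per operation group; individual circuits are created
// by name and pick up their group's configuration.
class PaymentCircuitGroups {
    static final CircuitBreakerRegistry REGISTRY = CircuitBreakerRegistry.of(Map.of(
        "critical-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(30)
            .slowCallDurationThreshold(Duration.ofSeconds(5))
            .build(),
        "read-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(60)
            .slowCallDurationThreshold(Duration.ofSeconds(3))
            .build(),
        "background-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(70)
            .slowCallDurationThreshold(Duration.ofSeconds(10))
            .build()
    ));

    // e.g. authorize and capture share the "critical-ops" thresholds
    static final CircuitBreaker AUTHORIZE = REGISTRY.circuitBreaker("payment-authorize", "critical-ops");
    static final CircuitBreaker CAPTURE   = REGISTRY.circuitBreaker("payment-capture", "critical-ops");
    static final CircuitBreaker QUERY     = REGISTRY.circuitBreaker("payment-query", "read-ops");
    static final CircuitBreaker REFUND    = REGISTRY.circuitBreaker("payment-refund", "background-ops");
}
```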
Guidance for Choosing
| Scenario | Recommended Approach |
|---|---|
| All operations hit the same backend infrastructure | Global circuit breaker |
| Operations have significantly different latency profiles | Per-operation or per-group |
| Some operations are critical, others are optional | Per-operation or per-group |
| Service is simple with uniform operations | Global circuit breaker |
| Service is complex with diverse operation types | Per-group circuit breakers |
| Operations can fail independently | Per-operation |
We've covered the science and practice of failure threshold configuration. Let's consolidate the key insights:

- Failure rate is always measured over a sliding window; the window type and size determine the statistical quality of the signal.
- Small samples are noisy. Without a minimum volume requirement, circuits trip on statistically meaningless data.
- Window sizing is a trade-off between detection speed and statistical confidence; no single setting gives you both.
- Slow call detection catches latency degradation before it manifests as errors, which is often the more dangerous failure mode.
- Threshold tuning is iterative: establish baselines, start conservative, monitor in production, and adjust one parameter at a time.
- Circuit granularity (global, per-operation, or per-group) should follow how independently operations fail and how much their criticality differs.
What's Next
With threshold configuration understood, we'll explore the practical implementations of circuit breakers in the next page. We'll examine Netflix Hystrix (the pioneering implementation) and Resilience4j (the modern standard), understanding their architectures, APIs, and operational characteristics. This will ground our conceptual knowledge in real, production-ready code.
You now understand the mathematics and practice of threshold configuration. You can reason about window sizes, minimum volumes, failure rates, and slow call detection. You have a framework for initial configuration and iterative tuning. Next, we'll explore production-ready circuit breaker libraries.