A circuit breaker is only as good as its configuration. Set the failure threshold too high, and the circuit never trips—you'll experience full cascade failures before protection engages. Set it too low, and the circuit trips on transient noise—users experience unnecessary degradation.
Threshold configuration is not guesswork. It requires understanding your system's baseline behavior, the statistical properties of failure detection, and the trade-offs between sensitivity and stability. This page equips you with the analytical framework to configure circuit breaker thresholds correctly.
Incorrect threshold configuration is the most common reason circuit breakers fail to provide protection in production. Engineers often copy default values without understanding whether those defaults fit their specific use case. By the end of this page, you'll know how to reason about and tune each configuration parameter for your specific context.
By the end of this page, you will understand the mathematics behind failure rate calculation, how to select appropriate thresholds based on your service's characteristics, why minimum volume requirements are essential, the trade-offs in sliding window sizing, and a systematic approach to threshold tuning in production.
The failure rate is the fundamental metric that circuit breakers use to assess dependency health. Before configuring thresholds, you must understand exactly how failure rate is calculated and what it represents.
The Basic Formula
Failure Rate = Failed Calls / Total Calls × 100%
Where:

- Failed Calls = calls in the window that returned an error, timed out, or were otherwise recorded as failures
- Total Calls = all calls recorded in the same window
Example Calculation
Over the last 100 calls:

- 80 calls succeeded
- 20 calls failed
Failure Rate = 20 / 100 × 100% = 20%
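To make the formula concrete, here is a minimal sketch of a count-based window that records outcomes and computes the rate. The class name and structure are illustrative only (no thread safety, no slow-call tracking), not taken from any library.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal illustration of the failure-rate formula over a count-based window.
class FailureRateWindow {
    private final int capacity;
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure

    FailureRateWindow(int capacity) {
        this.capacity = capacity;
    }

    void record(boolean failed) {
        if (outcomes.size() == capacity) {
            outcomes.removeFirst();              // oldest call drops off
        }
        outcomes.addLast(failed);
    }

    double failureRatePercent() {
        if (outcomes.isEmpty()) return 0.0;
        long failures = outcomes.stream().filter(f -> f).count();
        return 100.0 * failures / outcomes.size();   // Failed Calls / Total Calls × 100%
    }
}
```

Recording 80 successes and 20 failures into a 100-call instance yields `failureRatePercent() == 20.0`, matching the example above.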
The Measurement Window
Failure rate is not calculated over all time—it's calculated over a sliding window. This window can be:
1. Count-Based Window: Track the last N calls. Oldest calls drop off as new ones arrive.
Example: Window size = 100 calls. Failure rate = failures in last 100 calls / 100.
Pros: Consistent sample size; predictable statistical properties. Cons: Time to fill window varies with traffic volume.
2. Time-Based Window: Track calls within the last T seconds. All calls older than T are excluded.
Example: Window = 60 seconds. Failure rate = failures in last 60 seconds / total calls in last 60 seconds.
Pros: Consistent time horizon; failure rate ages out at predictable rate. Cons: Sample size varies with traffic; low traffic = high variance.
3. Hybrid (Time-Bucketed) Window: Divide time into buckets (e.g., 10 × 6-second buckets for a 60-second window). Count successes/failures per bucket. Calculate the aggregate rate.
Pros: Memory efficient; provides time decay. Cons: Bucket granularity affects precision.
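The bucketed approach can be sketched as follows. The bucket count, duration, and class name are illustrative assumptions, not any particular library's internals.

```java
import java.time.Clock;

// Illustrative time-bucketed window: a 60-second horizon split into 10 buckets
// of 6 seconds each. Buckets that have aged out are cleared as time advances.
class BucketedWindow {
    private static final int BUCKETS = 10;
    private static final long BUCKET_MILLIS = 6_000;

    private final long[] failures = new long[BUCKETS];
    private final long[] totals = new long[BUCKETS];
    private final Clock clock;
    private long lastBucketEpoch = -1;

    BucketedWindow(Clock clock) { this.clock = clock; }

    void record(boolean failed) {
        int idx = advanceToCurrentBucket();
        totals[idx]++;
        if (failed) failures[idx]++;
    }

    double failureRatePercent() {
        advanceToCurrentBucket();
        long f = 0, t = 0;
        for (int i = 0; i < BUCKETS; i++) { f += failures[i]; t += totals[i]; }
        return t == 0 ? 0.0 : 100.0 * f / t;
    }

    // Clears buckets that have aged out since the last call, then returns the
    // index of the bucket covering "now".
    private int advanceToCurrentBucket() {
        long epoch = clock.millis() / BUCKET_MILLIS;
        if (lastBucketEpoch >= 0) {
            long steps = Math.min(epoch - lastBucketEpoch, BUCKETS);
            for (long s = 1; s <= steps; s++) {
                int idx = (int) ((lastBucketEpoch + s) % BUCKETS);
                failures[idx] = 0;
                totals[idx] = 0;
            }
        }
        lastBucketEpoch = epoch;
        return (int) (epoch % BUCKETS);
    }
}
```

The Resilience4j configuration below shows how to select between count-based and time-based windows.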
```java
// Count-based sliding window
CircuitBreakerConfig countBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)        // Evaluate last 100 calls
    .failureRateThreshold(50)
    .build();

// Time-based sliding window
CircuitBreakerConfig timeBased = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.TIME_BASED)
    .slidingWindowSize(60)         // Evaluate last 60 seconds
    .failureRateThreshold(50)
    .build();

/*
 * Comparison for a service receiving 10 requests/second:
 *
 * Count-based (100 calls):
 * - Window represents last 10 seconds of traffic
 * - Sample size is always 100 (after warmup)
 * - Predictable statistical precision
 *
 * Time-based (60 seconds):
 * - Window always represents last 60 seconds
 * - Sample size is ~600 calls (at 10 req/s)
 * - Better smoothing, but slower to detect change
 */
```

Count-based windows are generally preferred because they provide consistent sample sizes regardless of traffic volume. Time-based windows can have high variance during low-traffic periods or be over-smoothed during high-traffic periods.
Setting a failure rate threshold requires understanding basic statistics. What failure rate indicates a genuinely unhealthy service versus normal variance?
Baseline Failure Rate
Every service has a baseline failure rate under normal conditions. This is not zero—transient failures, client errors, and edge cases always produce some failures:
| Service Type | Typical Baseline Failure Rate |
|---|---|
| Healthy internal microservice | 0.1% - 1% |
| External API dependency | 1% - 5% |
| Database with high contention | 0.5% - 2% |
| Network-intensive service | 0.5% - 3% |
The Signal-to-Noise Challenge
If your baseline failure rate is 2% and you set your threshold at 5%, you need to reliably distinguish between:

- Normal variance: the observed rate bouncing around the 2% baseline from sample to sample
- Genuine degradation: a sustained shift to 5% or more
With small sample sizes, normal variance can easily swing from 2% to 5% or higher by random chance.
Statistical Confidence
Consider the following scenario:

- Baseline failure rate: 2%
- Window: the last 50 calls
- Observed: 3 failures out of 50 (6%)
Is this 6% observed rate indicative of a problem, or just random variance in a small sample?
Confidence Interval Calculation
Using the binomial proportion confidence interval (Wilson score):
For a sample of 50 calls with 3 failures (6% observed), the 95% Wilson interval is approximately [2.1%, 16.2%].
This means the true failure rate could plausibly be anywhere from 2% to 16%. A 50-request sample is insufficient to confidently detect a change from 2% to 6%.
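The interval above can be reproduced with the standard Wilson score formula. The sketch below is a self-contained calculation (class and method names are my own, not a library API):

```java
// Wilson score interval for a binomial proportion (z = 1.96 for ~95% confidence).
final class WilsonInterval {
    static double[] compute(int failures, int total, double z) {
        double p = (double) failures / total;
        double z2 = z * z;
        double denom = 1 + z2 / total;
        double center = (p + z2 / (2.0 * total)) / denom;
        double margin = (z / denom)
            * Math.sqrt(p * (1 - p) / total + z2 / (4.0 * total * total));
        return new double[] { center - margin, center + margin };
    }

    public static void main(String[] args) {
        double[] ci = compute(3, 50, 1.96);
        // Prints roughly [0.021, 0.162]: the true rate plausibly lies between ~2% and ~16%
        System.out.printf("95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}
```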
The Minimum Sample Size
To reliably detect a doubling of the failure rate (e.g., 2% → 4%) with 95% confidence and 80% power, standard two-proportion sample size calculations suggest you need on the order of 1,000 samples.
This has critical implications: your window size determines what magnitude of change you can reliably detect.
| Window Size | Smallest Detectable Change | Detection Time (@100 req/s) |
|---|---|---|
| 20 calls | 2% → 30%+ | 0.2 seconds |
| 50 calls | 2% → 15%+ | 0.5 seconds |
| 100 calls | 2% → 10%+ | 1 second |
| 200 calls | 2% → 7%+ | 2 seconds |
| 500 calls | 2% → 5%+ | 5 seconds |
| 1000 calls | 2% → 4%+ | 10 seconds |
Larger windows provide better statistical confidence but slower detection. Smaller windows detect quickly but with more false positives. There's no free lunch—you must choose the trade-off appropriate for your use case.
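For reference, the detectable-change figures above are of the same order as the textbook two-proportion sample size approximation produces. The sketch below is that standard formula, assuming a normal approximation, α = 0.05, and 80% power; the class and method names are my own.

```java
// Approximate sample size needed per window to distinguish failure rate p1 from p2,
// using the normal-approximation two-proportion formula.
final class SampleSize {
    static long required(double p1, double p2, double zAlpha, double zBeta) {
        double pBar = (p1 + p2) / 2.0;
        double a = zAlpha * Math.sqrt(2 * pBar * (1 - pBar));
        double b = zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
        return (long) Math.ceil(Math.pow(a + b, 2) / Math.pow(p1 - p2, 2));
    }

    public static void main(String[] args) {
        // Detecting 2% → 4% with 95% confidence (zAlpha ≈ 1.96) and 80% power (zBeta ≈ 0.84)
        System.out.println(required(0.02, 0.04, 1.96, 0.84)); // on the order of 1,000+ samples
    }
}
```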
Practical Guidance
For most services, the following guidance applies:
1. Measure your baseline failure rate during normal operation. Track it over days, not hours.
2. Set the threshold significantly above baseline: at least 3-5x the baseline rate, or a minimum of 30-50% failure rate for critical services. A 50% failure rate means half your requests are failing, which is unambiguously a problem.
3. Use window sizes of at least 50-100 for count-based windows. Smaller windows have too much variance.
4. Accept that you're detecting major degradation, not subtle changes. Circuit breakers are for preventing cascades, not alerting on slight SLA degradation.
The minimum volume requirement is one of the most important—and most misunderstood—circuit breaker configuration parameters. Without it, circuits can trip inappropriately during low-traffic periods.
The Problem Without Minimum Volume
Consider a circuit with a 50% failure threshold and no minimum volume requirement:

- The window is freshly empty (after startup or a reset)
- Request 1 succeeds; request 2 fails
- Failure rate: 1 / 2 = 50%, so the circuit opens
This is clearly wrong. One failure out of two requests is not statistically significant evidence of service degradation. Yet without minimum volume, the circuit breaker treats it as such.
The Solution: Minimum Calls Threshold
The minimum volume requirement specifies that failure rate evaluation should only occur when enough samples have been collected:
IF totalCallsInWindow >= minimumNumberOfCalls THEN
evaluateFailureRate()
ELSE
remainClosed() // Not enough data to evaluate
END
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)        // Evaluate last 100 calls
    .failureRateThreshold(50)      // Trip at 50% failure rate
    .minimumNumberOfCalls(20)      // But only if at least 20 calls in window
    .build();

/*
 * Behavior:
 *
 * Scenario A: Window has 10 calls, 5 failed (50% failure rate)
 * → Circuit remains CLOSED (only 10 calls, need 20 minimum)
 *
 * Scenario B: Window has 30 calls, 10 failed (33% failure rate)
 * → Circuit remains CLOSED (33% < 50% threshold)
 *
 * Scenario C: Window has 30 calls, 20 failed (67% failure rate)
 * → Circuit OPENS (enough calls, threshold exceeded)
 */
```

Choosing the Minimum Volume
How high should the minimum volume be? Consider these factors:
1. Statistical Significance
As discussed earlier, you need enough samples for your failure rate calculation to be meaningful. Rule of thumb: minimum volume should be at least 20-50 requests.
2. Traffic Patterns
Minimum volume should be reachable during your lowest-traffic period:
| Traffic Pattern | Minimum Volume Guidance |
|---|---|
| High and consistent (>100 req/s) | 50-100 calls |
| Moderate (10-100 req/s) | 20-50 calls |
| Low or bursty (<10 req/s) | 10-20 calls |
| Very low (<1 req/s) | Consider disabling circuit breaker |
3. Detection Latency
Higher minimum volume means longer time before the circuit can trip:
| Traffic Rate | Minimum Volume | Time to Reach Minimum |
|---|---|---|
| 100 req/s | 20 | 0.2 seconds |
| 100 req/s | 50 | 0.5 seconds |
| 10 req/s | 20 | 2 seconds |
| 10 req/s | 50 | 5 seconds |
| 1 req/s | 20 | 20 seconds |
For low-traffic services, high minimum volumes can delay protection significantly.
When a service starts up, the sliding window is empty. With high minimum volume requirements, the circuit breaker won't engage until the window fills. This is usually desirable—you don't want circuits tripping on startup transients. But be aware that protection isn't active until minimum volume is reached.
The Relationship Between Window Size and Minimum Volume
These parameters interact:

- The minimum volume must not exceed the window size, or the circuit can never evaluate
- A minimum close to the window size delays the first evaluation until the window is nearly full
- A minimum that is too small reintroduces the low-sample false-positive problem
Recommended Configuration
For most use cases:
minimumNumberOfCalls = 0.2 × slidingWindowSize
With a window size of 100, this gives a minimum of 20 calls.
For critical services where false positives are costly:
minimumNumberOfCalls = 0.5 × slidingWindowSize (e.g., 50 calls for a 100-call window)
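If you want to encode these ratios as a convention, a small helper can derive the minimum volume from the window size. The helper and its name are hypothetical; the builder methods are the same Resilience4j calls used earlier on this page.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

// Hypothetical helper applying the 20% (default) or 50% (conservative) ratio.
final class WindowDefaults {
    static CircuitBreakerConfig countBased(int windowSize, boolean conservative) {
        int minimumCalls = (int) Math.ceil(windowSize * (conservative ? 0.5 : 0.2));
        return CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(windowSize)
            .minimumNumberOfCalls(minimumCalls)   // 20 for a 100-call window, 50 if conservative
            .failureRateThreshold(50)
            .build();
    }
}
```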
The sliding window size determines how much history the circuit breaker considers when evaluating health. It's perhaps the most impactful configuration parameter.
The Trade-off Matrix
| Aspect | Small Window (10-50) | Medium Window (50-200) | Large Window (200-1000) |
|---|---|---|---|
| Detection Speed | Very fast (seconds) | Fast (seconds to minutes) | Slow (minutes) |
| Statistical Confidence | Low (high variance) | Moderate | High (stable) |
| Sensitivity to Spikes | High (may overreact) | Moderate | Low (smooths over spikes) |
| Recovery Detection | Fast | Moderate | Slow (old failures linger) |
| Memory Usage | Low | Moderate | Higher |
| Best For | Critical paths, high-traffic | Most services | Stable services, low priority |
The Detection Latency Issue
Window size directly impacts how quickly the circuit breaker detects degradation:
Scenario: Service starts failing 100% of requests
| Window Size | Traffic Rate | Time to Trip (50% threshold) |
|---|---|---|
| 50 calls | 100 req/s | ~0.25 seconds (25 failures to reach 50%) |
| 50 calls | 10 req/s | ~2.5 seconds |
| 200 calls | 100 req/s | ~1 second (100 failures to reach 50%) |
| 200 calls | 10 req/s | ~10 seconds |
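The trip times in the table follow from simple arithmetic: starting from a window full of successes, the circuit needs threshold × windowSize consecutive failures before the rate crosses the threshold. A throwaway calculation, with names of my own choosing:

```java
// Time for a circuit to trip when a dependency starts failing 100% of requests,
// assuming the window was previously full of successes.
final class TripTime {
    static double secondsToTrip(int windowSize, double failureRateThreshold, double requestsPerSecond) {
        double failuresNeeded = windowSize * failureRateThreshold;   // e.g. 200 × 0.5 = 100 failures
        return failuresNeeded / requestsPerSecond;                   // e.g. 100 / 100 req/s = 1 second
    }

    public static void main(String[] args) {
        System.out.println(secondsToTrip(50, 0.5, 100));   // 0.25
        System.out.println(secondsToTrip(200, 0.5, 10));   // 10.0
    }
}
```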
The Stale Failure Problem
Larger windows have a counterintuitive issue: old failures can keep the circuit tripped even after the dependency recovers.
Scenario:

- Window size: 200 calls
- The dependency fails hard for a short period, putting roughly 100 failures into the window
- The dependency then recovers and new calls succeed, but the recorded failures remain
The window must completely cycle before old failures age out.
| Time | Window Contents (200-call window) | Failure Rate | Circuit State |
|---|---|---|---|
| T+0 | 200 successes | 0% | CLOSED |
| T+10s | 100 successes + 100 failures | 50% | → OPEN (just tripped) |
| T+10s | (Recovery timeout: 30s) | - | OPEN |
| T+40s | Recovery test | - | → HALF-OPEN |
| T+40s | Probe succeeds (dependency fixed) | - | → CLOSED, counters reset |
| T+41s | Window: 1 success | 0% | CLOSED (fresh start) |

Note: Counters are typically reset when the circuit closes, solving the stale failure problem. But during the OPEN state, you must wait for the recovery timeout regardless of window contents.

Time-Based Window Considerations
For time-based windows, size is specified in seconds rather than call count:
| Time Window | Effective Call Count at Various Traffic Levels |
|---|---|
| 10 seconds | 100 calls (10 req/s), 1000 calls (100 req/s) |
| 60 seconds | 600 calls (10 req/s), 6000 calls (100 req/s) |
| 120 seconds | 1200 calls (10 req/s), 12000 calls (100 req/s) |
Time-based windows have variable effective sample size based on traffic volume. During traffic spikes, you have more data. During lulls, you have less.
Practical Recommendations

Based on the trade-offs above:

- Default to a count-based window of around 100 calls; it balances detection speed and statistical confidence for most services.
- Use smaller windows (around 50) for high-traffic critical paths where detection speed is paramount.
- Use larger windows (200+) for stable, lower-priority dependencies where avoiding false trips matters more than fast detection.
- Avoid windows under 50 calls unless traffic is very high; the variance makes spurious trips likely.
Modern circuit breaker implementations like Resilience4j can trip based on slow calls, not just failures. This catches a critical failure mode: latency degradation that hasn't yet manifested as errors.
Why Slow Calls Matter
A service returning slowly is often worse than one returning errors:
| Failure Mode | Resource Consumption | User Experience | Detection Speed |
|---|---|---|---|
| Fast failure (error response) | Low | See error quickly | Immediate |
| Slow failure (timeout) | High | Wait, then see error | Slow |
| Slow success | High | Wait, then proceed | May not be detected |
A "successful" response that takes 10 seconds instead of 100ms still blocks the calling thread for 10 seconds. The cascade failure mechanics from our earlier discussion apply equally to slow successful calls.
Configuring Slow Call Detection
```java
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    // Standard failure rate threshold
    .failureRateThreshold(50)

    // Slow call detection
    .slowCallRateThreshold(80)                            // Trip if 80% of calls are slow
    .slowCallDurationThreshold(Duration.ofSeconds(2))     // Define "slow" as >2 seconds

    // Common settings
    .minimumNumberOfCalls(20)
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .build();

/*
 * Now the circuit will open if EITHER condition is met:
 *
 * Condition 1: Failure Rate >= 50%
 *   (Standard failure detection)
 *
 * Condition 2: Slow Call Rate >= 80%
 *   Where "slow" = response time > 2 seconds
 *   (Catches latency degradation before timeouts)
 *
 * Example triggering Condition 2:
 * - 100 calls in window
 * - 10 failed (10% failure rate - below threshold)
 * - 85 took >2 seconds (85% slow rate - ABOVE threshold)
 * - Circuit opens despite "only" 10% failure rate
 */
```

Determining the Slow Call Duration Threshold
The "slow" threshold should be significantly above your normal response time but below your timeout:
P50 Latency << Slow Threshold << Timeout
Example:

- P50 latency: 100ms
- Slow call threshold: 2 seconds
- Timeout: 5 seconds
Guidance for setting slow threshold:
| Baseline Latency (P99) | Recommended Slow Threshold |
|---|---|
| 50ms | 500ms - 1s |
| 100ms | 500ms - 2s |
| 500ms | 2s - 5s |
| 1s | 3s - 10s |
The threshold should catch genuine degradation while ignoring occasional slow calls that are within acceptable variance.
Slow call rate thresholds are typically set higher than failure rate thresholds (e.g., 80% slow vs. 50% failure). Some slow calls are expected—network variance, GC pauses, etc. But when 80% of calls are slow, something is genuinely wrong even if they all eventually succeed.
The Interaction with Timeouts
Slow call detection and timeouts work together:
Timeouts bound the maximum wait time. Slow call detection triggers protection earlier. Together, they provide comprehensive latency-based protection.
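As an illustration of keeping the three values ordered (P50 << slow threshold << timeout), the sketch below pairs the circuit breaker configuration from earlier with a Resilience4j TimeLimiter (from the resilience4j-timelimiter module). Treat the exact numbers as placeholders for your own baselines rather than recommended values.

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

// Latency protection in two layers: slow-call detection trips the circuit early,
// while the time limiter bounds the worst-case wait for any single call.
class LatencyProtectionConfig {

    // "Slow" is defined well above P50 (~100ms in this example) but below the hard timeout.
    static final CircuitBreakerConfig CIRCUIT = CircuitBreakerConfig.custom()
        .failureRateThreshold(50)
        .slowCallRateThreshold(80)
        .slowCallDurationThreshold(Duration.ofSeconds(2))   // trips on sustained slowness
        .slidingWindowSize(100)
        .minimumNumberOfCalls(20)
        .build();

    // The hard ceiling for any individual call.
    static final TimeLimiterConfig TIMEOUT = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(5))             // timeout > slow-call threshold
        .build();
}
```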
Armed with theory, let's develop a practical approach to threshold tuning for real services.
Step 1: Establish Baselines
Before configuring thresholds, collect baseline metrics:
SELECT
percentile_cont(0.50) WITHIN GROUP (ORDER BY duration_ms) as p50,
percentile_cont(0.90) WITHIN GROUP (ORDER BY duration_ms) as p90,
percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99,
COUNT(*) FILTER (WHERE status = 'error') * 100.0 / COUNT(*) as error_rate,
COUNT(*) as total_calls
FROM service_call_logs
WHERE
service = 'inventory-service'
AND timestamp > NOW() - INTERVAL '7 days'
Collect this data over a representative period (at least a week) covering:

- Weekday and weekend traffic patterns
- Peak and off-peak hours
- Any scheduled batch jobs or known traffic spikes
Step 2: Initial Configuration
Start with conservative settings:
```java
// Given baseline:
// - P50: 50ms, P99: 200ms
// - Error rate: 0.5%
// - Traffic: 50 req/s

CircuitBreakerConfig initialConfig = CircuitBreakerConfig.custom()
    // Conservative failure rate: well above baseline
    .failureRateThreshold(50)                             // 0.5% baseline → start at 50%

    // Conservative slow call: 10x P99
    .slowCallRateThreshold(80)
    .slowCallDurationThreshold(Duration.ofSeconds(2))     // 200ms P99 → 2000ms

    // Moderate window size
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)

    // Moderate minimum volume (traffic allows)
    .minimumNumberOfCalls(20)

    // Standard recovery
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .permittedNumberOfCallsInHalfOpenState(5)
    .build();

// This configuration:
// - Won't trip under normal operation (0.5% << 50%)
// - Will trip on major degradation (50%+ failures)
// - Will trip on latency degradation (80%+ calls > 2s)
// - Has reasonable detection latency (~2s at 50 req/s)
```

Step 3: Deploy with Monitoring
Deploy the initial configuration with comprehensive monitoring:

- Circuit state transitions (CLOSED → OPEN → HALF-OPEN)
- Failure rate and slow call rate observed in the window
- Calls rejected while the circuit is open
- Latency percentiles of the protected calls
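One low-friction way to surface these signals is the circuit breaker's event publisher. The sketch below logs state transitions and failures; it assumes Resilience4j's event publisher API and whatever logging facade you already use (SLF4J here).

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Attach listeners so every state change and failure shows up in logs/metrics.
final class CircuitBreakerMonitoring {
    private static final Logger log = LoggerFactory.getLogger(CircuitBreakerMonitoring.class);

    static void instrument(CircuitBreaker circuitBreaker) {
        circuitBreaker.getEventPublisher()
            .onStateTransition(event ->
                log.warn("Circuit '{}' transitioned: {}",
                    circuitBreaker.getName(), event.getStateTransition()))
            .onError(event ->
                log.info("Circuit '{}' recorded failure after {} ms",
                    circuitBreaker.getName(), event.getElapsedDuration().toMillis()));
    }
}
```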
Step 4: Observe and Adjust
Analyze circuit behavior over weeks of operation:
If the circuit never trips: verify whether it should have during known incidents. A circuit that stays closed through a real outage has thresholds that are too conservative.
If the circuit trips too often: check whether each trip corresponds to genuine degradation. Trips during normal operation mean the thresholds are too sensitive or the sample size is too small.
If trips are appropriate but recovery is too slow or too fast: tune waitDurationInOpenState rather than the detection thresholds.
| Observation | Diagnosis | Adjustment |
|---|---|---|
| False positives (trips during normal operation) | Thresholds too sensitive | Increase failureRateThreshold or minimumNumberOfCalls |
| False negatives (no trip during known outages) | Thresholds too conservative | Lower failureRateThreshold or slowCallRateThreshold |
| Frequent oscillation (open→closed→open rapidly) | Recovery timeout too short | Increase waitDurationInOpenState |
| Slow recovery after dependency fixed | Recovery timeout too long | Decrease waitDurationInOpenState |
| Trips on brief spikes that resolve quickly | Window too small | Increase slidingWindowSize |
| Slow to detect sustained degradation | Window too large | Decrease slidingWindowSize |
When tuning, adjust one parameter per deployment cycle. This allows you to attribute changes in behavior to specific adjustments. Changing multiple parameters simultaneously makes it impossible to understand cause and effect.
A common design question: should you have one circuit breaker per dependency, or multiple circuits for different operations on the same dependency?
The Global Circuit Approach
One circuit breaker for all calls to a dependency:
PaymentService → [Circuit Breaker] → Payment Gateway
- authorize()
- capture()
- refund()
- getTransaction()
Pros:

- Simple to configure and monitor: one circuit to reason about
- All operations contribute calls, so the window fills and the minimum volume is reached quickly

Cons:

- A failure spike in one operation opens the circuit for every operation, including healthy ones
- A single set of thresholds cannot reflect differences in latency or criticality across operations
The Per-Operation Approach
Separate circuit breakers for different operations:
PaymentService → [CB: authorize] → Payment Gateway /authorize
→ [CB: capture] → Payment Gateway /capture
→ [CB: refund] → Payment Gateway /refund
→ [CB: query] → Payment Gateway /transactions
Pros:

- Failures in one operation do not block the others
- Thresholds can be tuned to each operation's latency profile and criticality

Cons:

- More configuration to create, maintain, and monitor
- Each circuit sees less traffic, so windows fill more slowly and minimum volume takes longer to reach
```java
@Configuration
public class PaymentCircuitBreakerConfig {

    // High-value, critical operation - conservative thresholds
    @Bean
    CircuitBreaker authorizeCircuitBreaker() {
        return CircuitBreaker.of("payment-authorize", CircuitBreakerConfig.custom()
            .failureRateThreshold(30)                           // Lower threshold for critical path
            .slowCallRateThreshold(60)
            .slowCallDurationThreshold(Duration.ofSeconds(5))
            .slidingWindowSize(50)
            .minimumNumberOfCalls(10)
            .waitDurationInOpenState(Duration.ofSeconds(60))
            .build()
        );
    }

    // Less-critical read operation - higher thresholds
    @Bean
    CircuitBreaker queryCircuitBreaker() {
        return CircuitBreaker.of("payment-query", CircuitBreakerConfig.custom()
            .failureRateThreshold(60)                           // Higher threshold for reads
            .slowCallRateThreshold(90)
            .slowCallDurationThreshold(Duration.ofSeconds(3))
            .slidingWindowSize(100)
            .minimumNumberOfCalls(20)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build()
        );
    }

    // Background operation - even higher thresholds
    @Bean
    CircuitBreaker refundCircuitBreaker() {
        return CircuitBreaker.of("payment-refund", CircuitBreakerConfig.custom()
            .failureRateThreshold(70)                           // Can tolerate more failures
            .slowCallRateThreshold(95)
            .slowCallDurationThreshold(Duration.ofSeconds(10))
            .slidingWindowSize(100)
            .minimumNumberOfCalls(10)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build()
        );
    }
}
```

Hybrid Approach: Operation Groups
A middle ground is to group operations by characteristics:
PaymentService → [CB: critical-ops] → authorize, capture
→ [CB: read-ops] → getTransaction, listTransactions
→ [CB: background-ops] → refund, void
This reduces configuration overhead while still providing isolation between fundamentally different operation types.
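One way to express operation groups is a registry with named configurations, so each group shares a single tuned config. The group and circuit names below are illustrative; the registry factory and lookup methods are Resilience4j's, though treat the exact signatures as an assumption to verify against your library version.

```java
import java.time.Duration;
import java.util.Map;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// One tuned configuration per operation group; individual circuits are created
// by name and pick up their group's configuration.
class PaymentCircuitGroups {
    static final CircuitBreakerRegistry REGISTRY = CircuitBreakerRegistry.of(Map.of(
        "critical-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(30)
            .slowCallDurationThreshold(Duration.ofSeconds(5))
            .build(),
        "read-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(60)
            .slowCallDurationThreshold(Duration.ofSeconds(3))
            .build(),
        "background-ops", CircuitBreakerConfig.custom()
            .failureRateThreshold(70)
            .slowCallDurationThreshold(Duration.ofSeconds(10))
            .build()
    ));

    // e.g. authorize and capture share the "critical-ops" thresholds
    static final CircuitBreaker AUTHORIZE = REGISTRY.circuitBreaker("payment-authorize", "critical-ops");
    static final CircuitBreaker CAPTURE   = REGISTRY.circuitBreaker("payment-capture", "critical-ops");
    static final CircuitBreaker QUERY     = REGISTRY.circuitBreaker("payment-query", "read-ops");
    static final CircuitBreaker REFUND    = REGISTRY.circuitBreaker("payment-refund", "background-ops");
}
```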
Guidance for Choosing
| Scenario | Recommended Approach |
|---|---|
| All operations hit the same backend infrastructure | Global circuit breaker |
| Operations have significantly different latency profiles | Per-operation or per-group |
| Some operations are critical, others are optional | Per-operation or per-group |
| Service is simple with uniform operations | Global circuit breaker |
| Service is complex with diverse operation types | Per-group circuit breakers |
| Operations can fail independently | Per-operation |
We've covered the science and practice of failure threshold configuration. Let's consolidate the key insights:

- Failure rate is always measured over a sliding window; the window type and size determine the statistical quality of the signal.
- Small samples are noisy. Without a minimum volume requirement, circuits trip on statistically meaningless data.
- Window sizing is a trade-off between detection speed and statistical confidence; no single setting gives you both.
- Slow call detection catches latency degradation before it manifests as errors, which is often the more dangerous failure mode.
- Threshold tuning is iterative: establish baselines, start conservative, monitor in production, and adjust one parameter at a time.
- Circuit granularity (global, per-operation, or per-group) should follow how independently operations fail and how much their criticality differs.
What's Next
With threshold configuration understood, we'll explore the practical implementations of circuit breakers in the next page. We'll examine Netflix Hystrix (the pioneering implementation) and Resilience4j (the modern standard), understanding their architectures, APIs, and operational characteristics. This will ground our conceptual knowledge in real, production-ready code.
You now understand the mathematics and practice of threshold configuration. You can reason about window sizes, minimum volumes, failure rates, and slow call detection. You have a framework for initial configuration and iterative tuning. Next, we'll explore production-ready circuit breaker libraries.