A circuit breaker with default settings is better than no circuit breaker—but default settings are optimized for no particular system. Real-world services have unique characteristics: traffic patterns, failure modes, recovery times, and criticality levels that demand tailored configuration.
The difference between a well-tuned circuit breaker and a poorly configured one can be dramatic.
This page provides a comprehensive guide to every configuration parameter, with concrete guidance for tuning them based on your system's characteristics.
By the end of this page, you will understand every major circuit breaker configuration parameter, know how to derive appropriate values from your system characteristics, and have practical tuning strategies for common scenarios.
The failure rate threshold is the most important circuit breaker parameter. It defines the percentage of failed requests that triggers the circuit to open.
Default values: Most libraries default to 50%, meaning the circuit opens when more than half of recent requests fail.
Configuration principle: The threshold should reflect the point at which continuing to call the downstream service causes more harm than failing fast.
Factors influencing threshold selection:
| Factor | Lower Threshold (30-40%) | Default (50%) | Higher Threshold (60-80%) |
|---|---|---|---|
| Service criticality | Critical path - protect aggressively | Standard services | Non-critical, best-effort services |
| Failure cost | High cost per failure (payments) | Normal error handling | Low cost, easy retry |
| Normal error rate | Baseline < 1% | Baseline 1-5% | Baseline 5-15% (some errors expected) |
| Fallback availability | Good fallbacks exist | Partial fallbacks | No fallbacks, failure visible to users |
| Recovery speed | Service recovers slowly | Normal recovery | Quick recovery, brief failures |
Calculating threshold from baseline:
A service might have a normal 'baseline' error rate even when healthy. Your threshold should be significantly above this baseline to avoid false positives.
Formula for threshold:
Threshold = BaselineErrorRate + SafetyMargin
Example: a service with a 5% baseline error rate and a 20-point safety margin would trip at 25%.
Better approach: derive the threshold as a multiple of the baseline, so it scales with the service's normal behavior, and combine it with a criticality-based target, taking whichever is higher:
```typescript
// Programmatic threshold calculation
function calculateFailureThreshold(
  baselineErrorRate: number, // Normal error rate when healthy (0-100)
  serviceCriticality: 'critical' | 'standard' | 'best-effort'
): number {
  // Minimum threshold to avoid noise
  const minimumThreshold = 20;

  // Criticality-based multipliers
  const multipliers = {
    'critical': 3,     // Trip at 3x baseline
    'standard': 5,     // Trip at 5x baseline
    'best-effort': 10, // Trip at 10x baseline
  };

  // Target thresholds by criticality
  const targetThresholds = {
    'critical': 30,    // Protect aggressively
    'standard': 50,    // Balanced
    'best-effort': 70, // Tolerate more failures
  };

  const calculatedThreshold = baselineErrorRate * multipliers[serviceCriticality];
  const targetThreshold = targetThresholds[serviceCriticality];

  // Use whichever is higher: baseline-derived or target
  return Math.max(minimumThreshold, calculatedThreshold, targetThreshold);
}

// Examples:
// Critical service, 1% baseline: max(20, 3, 30) = 30%
// Standard service, 2% baseline: max(20, 10, 50) = 50%
// Best-effort, 5% baseline: max(20, 50, 70) = 70%
```

Setting thresholds too low (e.g., 10%) often leads to 'flapping'—circuits that open and close repeatedly due to normal variance. If your circuit opens multiple times per hour during healthy operation, your threshold is too low.
Beyond outright failures, slow calls can be equally damaging. A service returning responses in 30 seconds instead of 300ms might not technically 'fail,' but it's consuming threads, degrading user experience, and potentially causing upstream timeouts.
Slow call configuration has two parts: a duration threshold that defines what counts as 'slow', and a rate threshold that defines what percentage of slow calls trips the circuit.
Determining the slow call duration threshold:
```typescript
// Slow call configuration
const circuitBreakerConfig = {
  // Failure rate configuration (as before)
  failureRateThreshold: 50,

  // Slow call configuration
  slowCallDurationThreshold: 3000, // 3 seconds = "slow"
  slowCallRateThreshold: 80,       // If 80%+ of calls are slow, open circuit

  // Combined evaluation: circuit opens if EITHER
  // - failure rate >= 50% OR
  // - slow call rate >= 80%
};

// Determining slow call duration from measured latencies
function calculateSlowThreshold(
  p99Latency: number,             // Normal P99 in milliseconds
  timeoutMs: number,              // Request timeout
  minSlowThreshold: number = 1000 // Never below 1 second
): number {
  // Slow = 3x P99, but less than 50% of timeout
  const p99Based = p99Latency * 3;
  const timeoutBased = timeoutMs * 0.5;

  return Math.max(
    minSlowThreshold,
    Math.min(p99Based, timeoutBased)
  );
}

// Examples:
// P99: 200ms, Timeout: 30s → max(1000, min(600, 15000)) = 1000ms
// P99: 500ms, Timeout: 5s  → max(1000, min(1500, 2500)) = 1500ms
// P99: 2s, Timeout: 60s    → max(1000, min(6000, 30000)) = 6000ms
```

Determining the slow call rate threshold:
The slow call rate threshold determines how many slow calls are tolerable before tripping.
| Slow Call Rate Threshold | Use Case |
|---|---|
| 50% | Very latency-sensitive; even half slow is too many |
| 80% (recommended default) | Service is mostly slow; probably degraded |
| 90%+ | Only trip if nearly everything is slow |
An 80% slow call rate threshold is a common recommendation: at that point, the service is clearly degraded enough to warrant circuit protection.
In many failure modes, slow calls precede failures. A database under heavy load first becomes slow, then times out. Monitoring slow call rate can provide earlier warning than failure rate alone.
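To make this concrete, here is a minimal sketch (the `CallRecord` type is illustrative, not from any library) showing how a degrading window looks: most calls have already turned slow while only one has actually failed, so the slow-call rate crosses its threshold well before the failure rate does.

```typescript
// Compute slow-call rate and failure rate over a window of recorded calls
interface CallRecord {
  durationMs: number;
  failed: boolean;
}

function rates(window: CallRecord[], slowThresholdMs: number) {
  const slow = window.filter((c) => c.durationMs >= slowThresholdMs).length;
  const failed = window.filter((c) => c.failed).length;
  return {
    slowRate: (100 * slow) / window.length,
    failureRate: (100 * failed) / window.length,
  };
}

// A degrading database dependency: every call is slow, but only one has
// timed out so far
const recent: CallRecord[] = [
  { durationMs: 4500, failed: false },
  { durationMs: 5200, failed: false },
  { durationMs: 4800, failed: false },
  { durationMs: 6100, failed: false },
  { durationMs: 30_000, failed: true }, // the first timeout
];

const { slowRate, failureRate } = rates(recent, 3000);
// slowRate is 100%, failureRate only 20%: the slow-call signal fires first
```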
The sliding window determines which calls are included when calculating failure and slow call rates. Two parameters control it: the window type and the window size.
Sliding Window Type: count-based windows evaluate the outcomes of the last N calls, while time-based windows evaluate all calls from the last N seconds. Count-based windows are predictable under steady traffic; time-based windows behave better when traffic volume fluctuates.
Sliding Window Size: the number of calls (count-based) or seconds (time-based) the window covers.
Window size directly affects sensitivity and reaction time:
| Window Size | Sensitivity | Reaction Time | Risk |
|---|---|---|---|
| Small (10-20) | High | Fast (seconds) | False positives from variance |
| Medium (50-100) | Balanced | Moderate (tens of seconds) | Good balance |
| Large (200-500) | Low | Slow (minutes) | Slow detection, more accurate |
```typescript
// Window size selection based on traffic volume and latency requirements
function selectWindowConfiguration(
  requestsPerSecond: number,
  targetDetectionTimeSeconds: number,
  windowType: 'COUNT_BASED' | 'TIME_BASED'
): { type: string; size: number } {
  if (windowType === 'TIME_BASED') {
    // Time-based: window size in seconds
    // Detection time ≈ window size (roughly)
    return {
      type: 'TIME_BASED',
      size: Math.max(10, Math.min(120, targetDetectionTimeSeconds)),
    };
  }

  // Count-based: window size in number of calls
  // Calls in detection time = requests/second × detection time
  const callsInDetectionTime = requestsPerSecond * targetDetectionTimeSeconds;

  // Ensure minimum of 20 for statistical validity
  // Maximum of 500 to limit memory usage
  return {
    type: 'COUNT_BASED',
    size: Math.max(20, Math.min(500, Math.ceil(callsInDetectionTime))),
  };
}

// Examples:
// 100 req/s, 30s detection → COUNT_BASED, size: 500 (capped)
// 10 req/s, 30s detection  → COUNT_BASED, size: 300
// 1 req/s, 60s detection   → COUNT_BASED, size: 60
// Variable traffic, 60s detection → TIME_BASED, size: 60
```

For services handling fewer than 1 request per second, count-based windows become problematic. A 100-call window might span hours of data. In such cases, use time-based windows with appropriate minimum call thresholds.
The minimum number of calls parameter prevents the circuit from evaluating failure rates until sufficient data exists. This is critical for avoiding premature tripping based on statistically insignificant samples.
The statistical problem:
With only 5 requests, a 50% failure threshold could trip from just 3 failures. But 3/5 failing might be normal variance, not a real problem.
With 100 requests, 50/100 failing is almost certainly a real issue.
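The variance argument can be quantified. Here is a small sketch (assuming each call fails independently at a fixed healthy baseline rate, which is a simplification) computing the probability that a healthy service trips a 50% threshold purely by chance:

```typescript
// Binomial coefficient C(n, k), computed iteratively to stay in float range
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that a window of n calls shows a failure rate at or above the
// threshold, when each call independently fails at baseline probability p
function probFalseTrip(n: number, p: number, thresholdRate: number): number {
  const k = Math.ceil(n * thresholdRate); // failures needed to trip
  let prob = 0;
  for (let i = k; i <= n; i++) {
    prob += binomial(n, i) * Math.pow(p, i) * Math.pow(1 - p, n - i);
  }
  return prob;
}

// With a 15% healthy baseline and a 50% threshold:
// probFalseTrip(5, 0.15, 0.5)   → a few percent per window (flapping risk)
// probFalseTrip(100, 0.15, 0.5) → effectively zero
```

Under these assumptions, a 5-call window falsely trips a few percent of the time, while a 100-call window essentially never does, which is exactly why a minimum call count matters.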
Guidance for setting minimum calls:
| Traffic Level | Recommended Minimum | Reasoning |
|---|---|---|
| Very low (<1 req/min) | 5-10 | Accept more variance; any signal is valuable |
| Low (1-10 req/min) | 10-20 | Balance between speed and accuracy |
| Medium (10-100 req/min) | 20-50 | Good statistical significance |
| High (>100 req/min) | 50-100 | High accuracy, quick fill time |
Relationship with sliding window:
Minimum calls must be less than or equal to the sliding window size. If minimum calls exceeds the window size, the window can never accumulate enough calls, so the circuit would never evaluate the failure rate. This is a configuration error; most libraries will reject it or log a warning.
```typescript
// Configuration validation
function validateCircuitBreakerConfig(config: CircuitBreakerConfig): void {
  // Minimum calls must not exceed window size
  if (config.minimumNumberOfCalls > config.slidingWindowSize) {
    throw new Error(
      `Invalid configuration: minimumNumberOfCalls (${config.minimumNumberOfCalls}) ` +
      `exceeds slidingWindowSize (${config.slidingWindowSize}). ` +
      `Circuit would never evaluate failure rate.`
    );
  }

  // Minimum calls should be at least 10 for statistical validity
  if (config.minimumNumberOfCalls < 10) {
    console.warn(
      `Warning: minimumNumberOfCalls (${config.minimumNumberOfCalls}) is very low. ` +
      `Consider increasing to at least 10 to reduce false positives.`
    );
  }

  // Minimum calls should be at most 50% of window for reasonable sensitivity
  if (config.minimumNumberOfCalls > config.slidingWindowSize * 0.5) {
    console.warn(
      `Warning: minimumNumberOfCalls is more than 50% of slidingWindowSize. ` +
      `This may delay failure detection significantly.`
    );
  }
}
```

After a service restart or deployment, the sliding window is empty. Until minimum calls are reached, the circuit cannot trip. This provides a natural 'warmup period' where early failures don't immediately trigger protection—which is usually desirable since startup failures might be transient.
The wait duration in open state (also called 'open timeout' or 'sleep window') controls how long the circuit remains open before transitioning to half-open to test recovery.
Tradeoffs in wait duration:
| Too Short | Too Long |
|---|---|
| Probes struggling service frequently | Unnecessary prolonged degradation |
| Prevents service recovery | Poor user experience |
| Consumes resources on probes | Slow to restore functionality |
| May cause recovery oscillation | Operators might intervene manually |
Determining appropriate wait duration:
The wait duration should reflect the expected recovery time of the downstream service, which varies considerably by dependency type:
```typescript
// Wait duration by dependency type
const waitDurationsByType = {
  // Internal microservices with container orchestration
  internalService: {
    waitDurationMs: 30_000, // 30 seconds
    reason: 'Kubernetes pod restart/replacement time',
  },

  // Database connections
  database: {
    waitDurationMs: 60_000, // 60 seconds
    reason: 'Connection pool reset, potential failover',
  },

  // Cache (Redis, Memcached)
  cache: {
    waitDurationMs: 15_000, // 15 seconds
    reason: 'Fast restart, can fallback to origin',
  },

  // Third-party APIs (payment gateways, etc.)
  externalApi: {
    waitDurationMs: 120_000, // 2 minutes
    reason: 'External recovery timeline unknown, be conservative',
  },

  // Message queues
  messageQueue: {
    waitDurationMs: 45_000, // 45 seconds
    reason: 'Broker reconnection, consumer rebalancing',
  },
};

// Exponential backoff for repeated failures
function getWaitDuration(
  baseWaitMs: number,
  consecutiveOpenings: number,
  maxWaitMs: number = 300_000 // 5 minute cap
): number {
  // Double wait time for each consecutive opening
  // Cap at maximum to prevent excessive waits
  const backoffWait = baseWaitMs * Math.pow(2, consecutiveOpenings - 1);
  return Math.min(backoffWait, maxWaitMs);
}

// Example: 30s base, 3rd consecutive opening
// getWaitDuration(30000, 3) = min(30000 * 4, 300000) = 120000 (2 minutes)
```

Some advanced implementations dynamically adjust wait duration based on failure patterns. If the circuit opens repeatedly in quick succession, it suggests the service isn't fully recovered—extending the wait prevents overly aggressive probing.
The half-open state has its own configuration parameters that control how recovery is tested.
Permitted Number of Calls in Half-Open:
This determines how many probe requests are allowed through during the half-open state.
| Probe Count | Behavior | Recommendation |
|---|---|---|
| 1 | Single request determines state | Fast decision, higher false positive risk |
| 3-5 | Small sample for validation | Good balance (recommended) |
| 10+ | Larger sample, more accurate | Slower recovery, potentially heavy on recovering service |
```typescript
// Half-open state configuration
const halfOpenConfig = {
  // Number of requests allowed through for probing
  permittedNumberOfCallsInHalfOpenState: 5,

  // How to evaluate probe results:
  // Option 1: Immediate failure mode (any probe failure → reopen)
  // Option 2: Apply failure threshold to probes

  // If using immediate failure mode:
  // - First probe failure reopens the circuit
  // - Requires all N probes to succeed for closure
  immediateFailureModeEnabled: false,

  // If NOT using immediate failure mode:
  // - Apply the same failure threshold to probe results
  // - More tolerant of transient probe failures
  probeFailureRateThreshold: 50, // Reopen if 50% or more of probes fail
};

// Strategies for probe distribution
const probeStrategies = {
  // Strategy 1: First-N (default)
  // First N requests after half-open become probes
  // Remaining requests fail fast until probes complete
  firstN: {
    pros: 'Simple, fast initial probing',
    cons: 'Bursty probe traffic if many concurrent requests',
  },

  // Strategy 2: Rate-limited probes
  // Probes allowed at fixed rate (e.g., 1 per second)
  // Smoother load on recovering service
  rateLimited: {
    pros: 'Gentle on recovering service',
    cons: 'Slower to accumulate probe results',
  },

  // Strategy 3: Random sampling
  // Each request has a probability of becoming a probe
  // Spreads probes over time naturally
  randomSampling: {
    pros: 'Natural distribution, no burst',
    cons: 'Unpredictable probe timing',
  },
};
```

Non-Probe Request Handling:
During half-open, requests that aren't selected as probes can be handled in two ways:
Fail fast (default): Non-probe requests throw CircuitBreakerOpenException immediately. Simple and consistent.
Queue and wait: Non-probe requests wait for probes to complete, then proceed if the circuit closes. Provides better user experience but adds complexity and potential memory pressure.
Most implementations use fail-fast for simplicity and predictable resource usage.
Consider which requests become probes. If possible, use low-risk requests (read operations, health-check-like calls) for probing rather than high-value operations (payments, critical writes). Some implementations allow tagging certain request types as probe-eligible.
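A sketch of that idea (the tag names and policy shape are hypothetical, not from any particular library): during half-open, only requests tagged as low-risk are admitted as probes, and everything else fails fast.

```typescript
// Hypothetical probe-eligibility policy for the half-open state
type RequestTag = 'read' | 'write' | 'payment';
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface ProbePolicy {
  probeEligible: Set<RequestTag>;
}

function routeRequest(
  tag: RequestTag,
  state: CircuitState,
  policy: ProbePolicy
): 'execute' | 'probe' | 'reject' {
  if (state === 'CLOSED') return 'execute'; // normal operation
  if (state === 'OPEN') return 'reject';    // fail fast
  // HALF_OPEN: only low-risk request types become probes
  return policy.probeEligible.has(tag) ? 'probe' : 'reject';
}

const policy: ProbePolicy = { probeEligible: new Set<RequestTag>(['read']) };
// routeRequest('read', 'HALF_OPEN', policy)    → 'probe'
// routeRequest('payment', 'HALF_OPEN', policy) → 'reject'
```

The design choice here is that a failed read probe costs little, while risking a payment as the recovery test would put a high-value operation on the least reliable path.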
Not all errors are created equal. The circuit breaker must be configured to recognize which exceptions constitute 'failures' that should count toward the threshold.
Default behavior:
Most implementations count any exception as a failure. However, this is often too broad.
Customizing failure recognition:
```typescript
// Resilience4j-style failure predicate configuration
const failureConfig = {
  // Record exceptions as failures (these count toward threshold)
  recordExceptions: [
    IOException,
    TimeoutException,
    ServiceUnavailableException,
    CircuitBreakerDownstreamException,
  ],

  // Explicitly ignore these exceptions (don't count as failures)
  ignoreExceptions: [
    BadRequestException,    // Client error, not service failure
    UnauthorizedException,  // Auth issue, not service failure
    NotFoundException,      // Normal "not found" response
    ValidationException,    // Input validation, client's fault
    BusinessLogicException, // Expected business errors
  ],

  // Custom predicate for complex logic
  recordFailurePredicate: (error: Error): boolean => {
    // Don't count 4xx responses as failures
    if (error instanceof HttpException && error.status >= 400 && error.status < 500) {
      return false;
    }
    // Count 5xx responses as failures
    if (error instanceof HttpException && error.status >= 500) {
      return true;
    }
    // Count timeouts as failures
    if (error instanceof TimeoutException) {
      return true;
    }
    // Default: count as failure
    return true;
  },
};

// HTTP status code filtering example
function isCircuitBreakerFailure(response: HttpResponse): boolean {
  // 2xx and 3xx: Success
  if (response.status < 400) return false;

  // 4xx Client Errors: Usually NOT counted
  // Exception: 429 Too Many Requests should count (rate limiting suggests pressure)
  if (response.status >= 400 && response.status < 500) {
    return response.status === 429; // Only rate limiting counts
  }

  // 5xx Server Errors: Count as failure
  // Exception: 501 Not Implemented is not a transient failure
  if (response.status === 501) return false;

  // 500, 502, 503, 504, 507, etc: Definitely failures
  return true;
}
```

Common mistakes in failure definition:
Some 'successful' responses are semantic failures. A payment service returning 200 OK with body {"status": "FAILED", "reason": "Gateway unavailable"} is technically a success but semantically a failure. Advanced configurations may need custom predicates to detect these.
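A sketch of such a predicate (the response shape and reason strings are hypothetical): it counts infrastructure-flavored failures reported inside a 200 OK body as circuit failures, while leaving business outcomes like a declined card alone.

```typescript
// Hypothetical payment response shape
interface PaymentResponse {
  httpStatus: number;
  body: { status: 'SUCCESS' | 'FAILED' | 'DECLINED'; reason?: string };
}

// Reasons that indicate the downstream dependency is unhealthy,
// as opposed to a legitimate business outcome
const infraReasons = new Set(['Gateway unavailable', 'Gateway timeout']);

function isSemanticFailure(res: PaymentResponse): boolean {
  if (res.httpStatus >= 500) return true;           // transport-level failure
  if (res.body.status === 'DECLINED') return false; // business outcome, not an outage
  return res.body.status === 'FAILED' && infraReasons.has(res.body.reason ?? '');
}

// isSemanticFailure({ httpStatus: 200, body: { status: 'FAILED', reason: 'Gateway unavailable' } }) → true
// isSemanticFailure({ httpStatus: 200, body: { status: 'DECLINED', reason: 'Insufficient funds' } }) → false
```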
Circuit breaker configuration often needs to vary by environment. Development, staging, and production have different requirements.
Environment-specific settings:
```typescript
// Environment-specific circuit breaker configurations

const developmentConfig = {
  // More sensitive for faster feedback during testing
  failureRateThreshold: 25,
  minimumNumberOfCalls: 3,
  slidingWindowSize: 10,
  waitDurationInOpenState: 10_000, // 10 seconds - fast iteration

  // Aggressive tripping helps catch issues early
  rationale: 'Fast feedback for developers, catch issues quickly',
};

const stagingConfig = {
  // Similar to production but slightly more aggressive
  failureRateThreshold: 40,
  minimumNumberOfCalls: 15,
  slidingWindowSize: 50,
  waitDurationInOpenState: 30_000, // 30 seconds

  // Test circuit behavior under realistic conditions
  rationale: 'Validate circuit tuning before production',
};

const productionConfig = {
  // Conservative to avoid false positives
  failureRateThreshold: 50,
  minimumNumberOfCalls: 20,
  slidingWindowSize: 100,
  waitDurationInOpenState: 60_000, // 60 seconds

  // Balance protection with availability
  rationale: 'Minimize false positives, protect user experience',
};

// Configuration factory
function getCircuitBreakerConfig(env: string): CircuitBreakerConfig {
  switch (env) {
    case 'development': return developmentConfig;
    case 'staging': return stagingConfig;
    case 'production': return productionConfig;
    default: return productionConfig; // Safe default
  }
}
```

Per-Service Configuration:
Different downstream services often need different circuit configurations:
| Service Type | Failure Threshold | Wait Duration | Reasoning |
|---|---|---|---|
| Payment gateway | 30% | 120s | Critical path, protect aggressively, allow recovery time |
| User service | 50% | 30s | Core dependency, balanced approach |
| Recommendation engine | 70% | 30s | Non-critical, tolerate more failures |
| Analytics service | 80% | 60s | Best effort, can fail silently |
| External rate-limited API | 40% | 300s | Rate limits need longer recovery |
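The table above translates directly into a configuration map; a sketch (the service keys are illustrative, the values are taken from the table):

```typescript
interface ServiceCircuitConfig {
  failureRateThreshold: number; // percent
  waitDurationMs: number;
}

const perServiceConfig: Record<string, ServiceCircuitConfig> = {
  paymentGateway:         { failureRateThreshold: 30, waitDurationMs: 120_000 },
  userService:            { failureRateThreshold: 50, waitDurationMs: 30_000 },
  recommendationEngine:   { failureRateThreshold: 70, waitDurationMs: 30_000 },
  analyticsService:       { failureRateThreshold: 80, waitDurationMs: 60_000 },
  externalRateLimitedApi: { failureRateThreshold: 40, waitDurationMs: 300_000 },
};

// Unknown services fall back to balanced, userService-style defaults
function configFor(service: string): ServiceCircuitConfig {
  return perServiceConfig[service] ?? { failureRateThreshold: 50, waitDurationMs: 30_000 };
}
```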
Consider using a configuration service (Consul, etcd, AWS AppConfig) to manage circuit breaker settings. This allows tuning in production without redeployment—invaluable during incidents when you need to quickly adjust thresholds.
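One way to structure this (the settings shape and merge helper are illustrative, not a real client API for any of those systems): keep compiled-in defaults and overlay whatever the config service currently returns, so an unreachable config service safely leaves the defaults in effect.

```typescript
interface CircuitSettings {
  failureRateThreshold: number;
  waitDurationInOpenStateMs: number;
}

// Safe defaults baked into the deployment
const compiledDefaults: CircuitSettings = {
  failureRateThreshold: 50,
  waitDurationInOpenStateMs: 60_000,
};

// Overlay remote overrides on defaults; a null result (config service down
// or key missing) leaves the compiled-in defaults in effect
function mergeRemoteSettings(remote: Partial<CircuitSettings> | null): CircuitSettings {
  return { ...compiledDefaults, ...(remote ?? {}) };
}

// During an incident, an operator pushes { failureRateThreshold: 30 } and the
// next refresh tightens the circuit without a redeploy
```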
We've comprehensively examined every major circuit breaker configuration parameter, providing guidance for deriving appropriate values from your system's characteristics.
What's next:
With configuration mastered, the next page explores monitoring circuit state—how to observe circuit behavior, build effective dashboards, and set up alerting that makes circuit breaker activity visible to operators.
You now have a comprehensive understanding of circuit breaker configuration parameters. You can derive appropriate values from system characteristics, avoid common misconfiguration pitfalls, and tune circuits for different environments and service types.