A circuit breaker with default settings is better than no circuit breaker—but default settings are optimized for no particular system. Real-world services have unique characteristics: traffic patterns, failure modes, recovery times, and criticality levels that demand tailored configuration.
The difference between a well-tuned circuit breaker and a poorly configured one can be dramatic.
This page provides a comprehensive guide to every configuration parameter, with concrete guidance for tuning them based on your system's characteristics.
By the end of this page, you will understand every major circuit breaker configuration parameter, know how to derive appropriate values from your system characteristics, and have practical tuning strategies for common scenarios.
The failure rate threshold is the most important circuit breaker parameter. It defines the percentage of failed requests that triggers the circuit to open.
Default values: Most libraries default to 50%, meaning the circuit opens when more than half of recent requests fail.
Configuration principle: The threshold should reflect the point at which continuing to call the downstream service causes more harm than failing fast.
Factors influencing threshold selection:
| Factor | Lower Threshold (30-40%) | Default (50%) | Higher Threshold (60-80%) |
|---|---|---|---|
| Service criticality | Critical path - protect aggressively | Standard services | Non-critical, best-effort services |
| Failure cost | High cost per failure (payments) | Normal error handling | Low cost, easy retry |
| Normal error rate | Baseline < 1% | Baseline 1-5% | Baseline 5-15% (some errors expected) |
| Fallback availability | Good fallbacks exist | Partial fallbacks | No fallbacks, failure visible to users |
| Recovery speed | Service recovers slowly | Normal recovery | Quick recovery, brief failures |
Calculating threshold from baseline:
A service might have a normal 'baseline' error rate even when healthy. Your threshold should be significantly above this baseline to avoid false positives.
Formula for threshold:
Threshold = BaselineErrorRate + SafetyMargin
Example: a service with a 5% baseline error rate and a 20-point safety margin would trip at 25%.
Better approach: derive the threshold as a multiple of the baseline, so it scales with the service's normal behavior, and combine it with a criticality-based target, taking whichever is higher:
```typescript
// Programmatic threshold calculation
function calculateFailureThreshold(
  baselineErrorRate: number, // Normal error rate when healthy (0-100)
  serviceCriticality: 'critical' | 'standard' | 'best-effort'
): number {
  // Minimum threshold to avoid noise
  const minimumThreshold = 20;

  // Criticality-based multipliers
  const multipliers = {
    'critical': 3,     // Trip at 3x baseline
    'standard': 5,     // Trip at 5x baseline
    'best-effort': 10, // Trip at 10x baseline
  };

  // Target thresholds by criticality
  const targetThresholds = {
    'critical': 30,    // Protect aggressively
    'standard': 50,    // Balanced
    'best-effort': 70, // Tolerate more failures
  };

  const calculatedThreshold = baselineErrorRate * multipliers[serviceCriticality];
  const targetThreshold = targetThresholds[serviceCriticality];

  // Use whichever is higher: baseline-derived or target
  return Math.max(minimumThreshold, calculatedThreshold, targetThreshold);
}

// Examples:
// Critical service, 1% baseline: max(20, 3, 30) = 30%
// Standard service, 2% baseline: max(20, 10, 50) = 50%
// Best-effort, 5% baseline: max(20, 50, 70) = 70%
```

Setting thresholds too low (e.g., 10%) often leads to 'flapping'—circuits that open and close repeatedly due to normal variance. If your circuit opens multiple times per hour during healthy operation, your threshold is too low.
Beyond outright failures, slow calls can be equally damaging. A service returning responses in 30 seconds instead of 300ms might not technically 'fail,' but it's consuming threads, degrading user experience, and potentially causing upstream timeouts.
Slow call configuration has two parts: a duration threshold that defines what counts as 'slow', and a rate threshold that defines what percentage of slow calls trips the circuit.
Determining the slow call duration threshold:
```typescript
// Slow call configuration
const circuitBreakerConfig = {
  // Failure rate configuration (as before)
  failureRateThreshold: 50,

  // Slow call configuration
  slowCallDurationThreshold: 3000, // 3 seconds = "slow"
  slowCallRateThreshold: 80,       // If 80%+ of calls are slow, open circuit

  // Combined evaluation: circuit opens if EITHER
  // - failure rate >= 50% OR
  // - slow call rate >= 80%
};

// Determining slow call duration from measured latencies
function calculateSlowThreshold(
  p99Latency: number,             // Normal P99 in milliseconds
  timeoutMs: number,              // Request timeout
  minSlowThreshold: number = 1000 // Never below 1 second
): number {
  // Slow = 3x P99, but less than 50% of timeout
  const p99Based = p99Latency * 3;
  const timeoutBased = timeoutMs * 0.5;

  return Math.max(
    minSlowThreshold,
    Math.min(p99Based, timeoutBased)
  );
}

// Examples:
// P99: 200ms, Timeout: 30s → max(1000, min(600, 15000)) = 1000ms
// P99: 500ms, Timeout: 5s  → max(1000, min(1500, 2500)) = 1500ms
// P99: 2s, Timeout: 60s    → max(1000, min(6000, 30000)) = 6000ms
```

Determining the slow call rate threshold:
The slow call rate threshold determines how many slow calls are tolerable before tripping.
| Slow Call Rate Threshold | Use Case |
|---|---|
| 50% | Very latency-sensitive; even half slow is too many |
| 80% (recommended default) | Service is mostly slow; probably degraded |
| 90%+ | Only trip if nearly everything is slow |
An 80% slow call rate threshold is a common recommendation: at that point, the service is clearly degraded enough to warrant circuit protection.
In many failure modes, slow calls precede failures. A database under heavy load first becomes slow, then times out. Monitoring slow call rate can provide earlier warning than failure rate alone.
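To make this concrete, here is a minimal sketch (the `CallRecord` type is illustrative, not from any library) showing how a degrading window looks: most calls have already turned slow while only one has actually failed, so the slow-call rate crosses its threshold well before the failure rate does.

```typescript
// Compute slow-call rate and failure rate over a window of recorded calls
interface CallRecord {
  durationMs: number;
  failed: boolean;
}

function rates(window: CallRecord[], slowThresholdMs: number) {
  const slow = window.filter((c) => c.durationMs >= slowThresholdMs).length;
  const failed = window.filter((c) => c.failed).length;
  return {
    slowRate: (100 * slow) / window.length,
    failureRate: (100 * failed) / window.length,
  };
}

// A degrading database dependency: every call is slow, but only one has
// timed out so far
const recent: CallRecord[] = [
  { durationMs: 4500, failed: false },
  { durationMs: 5200, failed: false },
  { durationMs: 4800, failed: false },
  { durationMs: 6100, failed: false },
  { durationMs: 30_000, failed: true }, // the first timeout
];

const { slowRate, failureRate } = rates(recent, 3000);
// slowRate is 100%, failureRate only 20%: the slow-call signal fires first
```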
The sliding window determines which calls are included when calculating failure and slow call rates. Two parameters control it: the window type and the window size.
Sliding Window Type: count-based windows evaluate the outcomes of the last N calls, while time-based windows evaluate all calls from the last N seconds. Count-based windows are predictable under steady traffic; time-based windows behave better when traffic volume fluctuates.
Sliding Window Size: the number of calls (count-based) or seconds (time-based) the window covers.
Window size directly affects sensitivity and reaction time:
| Window Size | Sensitivity | Reaction Time | Risk |
|---|---|---|---|
| Small (10-20) | High | Fast (seconds) | False positives from variance |
| Medium (50-100) | Balanced | Moderate (tens of seconds) | Good balance |
| Large (200-500) | Low | Slow (minutes) | Slow detection, more accurate |
```typescript
// Window size selection based on traffic volume and latency requirements
function selectWindowConfiguration(
  requestsPerSecond: number,
  targetDetectionTimeSeconds: number,
  windowType: 'COUNT_BASED' | 'TIME_BASED'
): { type: string; size: number } {
  if (windowType === 'TIME_BASED') {
    // Time-based: window size in seconds
    // Detection time ≈ window size (roughly)
    return {
      type: 'TIME_BASED',
      size: Math.max(10, Math.min(120, targetDetectionTimeSeconds)),
    };
  }

  // Count-based: window size in number of calls
  // Calls in detection time = requests/second × detection time
  const callsInDetectionTime = requestsPerSecond * targetDetectionTimeSeconds;

  // Ensure minimum of 20 for statistical validity
  // Maximum of 500 to limit memory usage
  return {
    type: 'COUNT_BASED',
    size: Math.max(20, Math.min(500, Math.ceil(callsInDetectionTime))),
  };
}

// Examples:
// 100 req/s, 30s detection → COUNT_BASED, size: 500 (capped)
// 10 req/s, 30s detection  → COUNT_BASED, size: 300
// 1 req/s, 60s detection   → COUNT_BASED, size: 60
// Variable traffic, 60s detection → TIME_BASED, size: 60
```

For services handling fewer than 1 request per second, count-based windows become problematic. A 100-call window might span hours of data. In such cases, use time-based windows with appropriate minimum call thresholds.
The minimum number of calls parameter prevents the circuit from evaluating failure rates until sufficient data exists. This is critical for avoiding premature tripping based on statistically insignificant samples.
The statistical problem:
With only 5 requests, a 50% failure threshold could trip from just 3 failures. But 3/5 failing might be normal variance, not a real problem.
With 100 requests, 50/100 failing is almost certainly a real issue.
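The variance argument can be quantified. Here is a small sketch (assuming each call fails independently at a fixed healthy baseline rate, which is a simplification) computing the probability that a healthy service trips a 50% threshold purely by chance:

```typescript
// Binomial coefficient C(n, k), computed iteratively to stay in float range
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that a window of n calls shows a failure rate at or above the
// threshold, when each call independently fails at baseline probability p
function probFalseTrip(n: number, p: number, thresholdRate: number): number {
  const k = Math.ceil(n * thresholdRate); // failures needed to trip
  let prob = 0;
  for (let i = k; i <= n; i++) {
    prob += binomial(n, i) * Math.pow(p, i) * Math.pow(1 - p, n - i);
  }
  return prob;
}

// With a 15% healthy baseline and a 50% threshold:
// probFalseTrip(5, 0.15, 0.5)   → a few percent per window (flapping risk)
// probFalseTrip(100, 0.15, 0.5) → effectively zero
```

Under these assumptions, a 5-call window falsely trips a few percent of the time, while a 100-call window essentially never does, which is exactly why a minimum call count matters.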
Guidance for setting minimum calls:
| Traffic Level | Recommended Minimum | Reasoning |
|---|---|---|
| Very low (<1 req/min) | 5-10 | Accept more variance; any signal is valuable |
| Low (1-10 req/min) | 10-20 | Balance between speed and accuracy |
| Medium (10-100 req/min) | 20-50 | Good statistical significance |
| High (>100 req/min) | 50-100 | High accuracy, quick fill time |
Relationship with sliding window:
Minimum calls must be less than or equal to the sliding window size. If minimum calls exceeds the window size, the window can never accumulate enough calls, so the circuit would never evaluate the failure rate. This is a configuration error; most libraries will reject it or log a warning.
```typescript
// Configuration validation
function validateCircuitBreakerConfig(config: CircuitBreakerConfig): void {
  // Minimum calls must not exceed window size
  if (config.minimumNumberOfCalls > config.slidingWindowSize) {
    throw new Error(
      `Invalid configuration: minimumNumberOfCalls (${config.minimumNumberOfCalls}) ` +
      `exceeds slidingWindowSize (${config.slidingWindowSize}). ` +
      `Circuit would never evaluate failure rate.`
    );
  }

  // Minimum calls should be at least 10 for statistical validity
  if (config.minimumNumberOfCalls < 10) {
    console.warn(
      `Warning: minimumNumberOfCalls (${config.minimumNumberOfCalls}) is very low. ` +
      `Consider increasing to at least 10 to reduce false positives.`
    );
  }

  // Minimum calls should be at most 50% of window for reasonable sensitivity
  if (config.minimumNumberOfCalls > config.slidingWindowSize * 0.5) {
    console.warn(
      `Warning: minimumNumberOfCalls is more than 50% of slidingWindowSize. ` +
      `This may delay failure detection significantly.`
    );
  }
}
```

After a service restart or deployment, the sliding window is empty. Until minimum calls are reached, the circuit cannot trip. This provides a natural 'warmup period' where early failures don't immediately trigger protection—which is usually desirable since startup failures might be transient.
The wait duration in open state (also called 'open timeout' or 'sleep window') controls how long the circuit remains open before transitioning to half-open to test recovery.
Tradeoffs in wait duration:
| Too Short | Too Long |
|---|---|
| Probes struggling service frequently | Unnecessary prolonged degradation |
| Prevents service recovery | Poor user experience |
| Consumes resources on probes | Slow to restore functionality |
| May cause recovery oscillation | Operators might intervene manually |
Determining appropriate wait duration:
The wait duration should reflect the expected recovery time of the downstream service, which varies considerably by dependency type:
```typescript
// Wait duration by dependency type
const waitDurationsByType = {
  // Internal microservices with container orchestration
  internalService: {
    waitDurationMs: 30_000, // 30 seconds
    reason: 'Kubernetes pod restart/replacement time',
  },

  // Database connections
  database: {
    waitDurationMs: 60_000, // 60 seconds
    reason: 'Connection pool reset, potential failover',
  },

  // Cache (Redis, Memcached)
  cache: {
    waitDurationMs: 15_000, // 15 seconds
    reason: 'Fast restart, can fallback to origin',
  },

  // Third-party APIs (payment gateways, etc.)
  externalApi: {
    waitDurationMs: 120_000, // 2 minutes
    reason: 'External recovery timeline unknown, be conservative',
  },

  // Message queues
  messageQueue: {
    waitDurationMs: 45_000, // 45 seconds
    reason: 'Broker reconnection, consumer rebalancing',
  },
};

// Exponential backoff for repeated failures
function getWaitDuration(
  baseWaitMs: number,
  consecutiveOpenings: number,
  maxWaitMs: number = 300_000 // 5 minute cap
): number {
  // Double wait time for each consecutive opening
  // Cap at maximum to prevent excessive waits
  const backoffWait = baseWaitMs * Math.pow(2, consecutiveOpenings - 1);
  return Math.min(backoffWait, maxWaitMs);
}

// Example: 30s base, 3rd consecutive opening
// getWaitDuration(30000, 3) = min(30000 * 4, 300000) = 120000 (2 minutes)
```

Some advanced implementations dynamically adjust wait duration based on failure patterns. If the circuit opens repeatedly in quick succession, it suggests the service isn't fully recovered—extending the wait prevents overly aggressive probing.
The half-open state has its own configuration parameters that control how recovery is tested.
Permitted Number of Calls in Half-Open:
This determines how many probe requests are allowed through during the half-open state.
| Probe Count | Behavior | Recommendation |
|---|---|---|
| 1 | Single request determines state | Fast decision, higher false positive risk |
| 3-5 | Small sample for validation | Good balance (recommended) |
| 10+ | Larger sample, more accurate | Slower recovery, potentially heavy on recovering service |
```typescript
// Half-open state configuration
const halfOpenConfig = {
  // Number of requests allowed through for probing
  permittedNumberOfCallsInHalfOpenState: 5,

  // How to evaluate probe results:
  // Option 1: Immediate failure mode (any probe failure → reopen)
  // Option 2: Apply failure threshold to probes

  // If using immediate failure mode:
  // - First probe failure reopens the circuit
  // - Requires all N probes to succeed for closure
  immediateFailureModeEnabled: false,

  // If NOT using immediate failure mode:
  // - Apply the same failure threshold to probe results
  // - More tolerant of transient probe failures
  probeFailureRateThreshold: 50, // Reopen if 50% or more of probes fail
};

// Strategies for probe distribution
const probeStrategies = {
  // Strategy 1: First-N (default)
  // First N requests after half-open become probes
  // Remaining requests fail fast until probes complete
  firstN: {
    pros: 'Simple, fast initial probing',
    cons: 'Bursty probe traffic if many concurrent requests',
  },

  // Strategy 2: Rate-limited probes
  // Probes allowed at fixed rate (e.g., 1 per second)
  // Smoother load on recovering service
  rateLimited: {
    pros: 'Gentle on recovering service',
    cons: 'Slower to accumulate probe results',
  },

  // Strategy 3: Random sampling
  // Each request has a probability of becoming a probe
  // Spreads probes over time naturally
  randomSampling: {
    pros: 'Natural distribution, no burst',
    cons: 'Unpredictable probe timing',
  },
};
```

Non-Probe Request Handling:
During half-open, requests that aren't selected as probes can be handled in two ways:
Fail fast (default): Non-probe requests throw CircuitBreakerOpenException immediately. Simple and consistent.
Queue and wait: Non-probe requests wait for probes to complete, then proceed if the circuit closes. Provides better user experience but adds complexity and potential memory pressure.
Most implementations use fail-fast for simplicity and predictable resource usage.
Consider which requests become probes. If possible, use low-risk requests (read operations, health-check-like calls) for probing rather than high-value operations (payments, critical writes). Some implementations allow tagging certain request types as probe-eligible.
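A sketch of that idea (the tag names and policy shape are hypothetical, not from any particular library): during half-open, only requests tagged as low-risk are admitted as probes, and everything else fails fast.

```typescript
// Hypothetical probe-eligibility policy for the half-open state
type RequestTag = 'read' | 'write' | 'payment';
type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface ProbePolicy {
  probeEligible: Set<RequestTag>;
}

function routeRequest(
  tag: RequestTag,
  state: CircuitState,
  policy: ProbePolicy
): 'execute' | 'probe' | 'reject' {
  if (state === 'CLOSED') return 'execute'; // normal operation
  if (state === 'OPEN') return 'reject';    // fail fast
  // HALF_OPEN: only low-risk request types become probes
  return policy.probeEligible.has(tag) ? 'probe' : 'reject';
}

const policy: ProbePolicy = { probeEligible: new Set<RequestTag>(['read']) };
// routeRequest('read', 'HALF_OPEN', policy)    → 'probe'
// routeRequest('payment', 'HALF_OPEN', policy) → 'reject'
```

The design choice here is that a failed read probe costs little, while risking a payment as the recovery test would put a high-value operation on the least reliable path.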
Not all errors are created equal. The circuit breaker must be configured to recognize which exceptions constitute 'failures' that should count toward the threshold.
Default behavior:
Most implementations count any exception as a failure. However, this is often too broad.
Customizing failure recognition:
```typescript
// Resilience4j-style failure predicate configuration
const failureConfig = {
  // Record exceptions as failures (these count toward threshold)
  recordExceptions: [
    IOException,
    TimeoutException,
    ServiceUnavailableException,
    CircuitBreakerDownstreamException,
  ],

  // Explicitly ignore these exceptions (don't count as failures)
  ignoreExceptions: [
    BadRequestException,    // Client error, not service failure
    UnauthorizedException,  // Auth issue, not service failure
    NotFoundException,      // Normal "not found" response
    ValidationException,    // Input validation, client's fault
    BusinessLogicException, // Expected business errors
  ],

  // Custom predicate for complex logic
  recordFailurePredicate: (error: Error): boolean => {
    // Don't count 4xx responses as failures
    if (error instanceof HttpException && error.status >= 400 && error.status < 500) {
      return false;
    }
    // Count 5xx responses as failures
    if (error instanceof HttpException && error.status >= 500) {
      return true;
    }
    // Count timeouts as failures
    if (error instanceof TimeoutException) {
      return true;
    }
    // Default: count as failure
    return true;
  },
};

// HTTP status code filtering example
function isCircuitBreakerFailure(response: HttpResponse): boolean {
  // 2xx and 3xx: Success
  if (response.status < 400) return false;

  // 4xx Client Errors: Usually NOT counted
  // Exception: 429 Too Many Requests should count (rate limiting suggests pressure)
  if (response.status >= 400 && response.status < 500) {
    return response.status === 429; // Only rate limiting counts
  }

  // 5xx Server Errors: Count as failure
  // Exception: 501 Not Implemented is not a transient failure
  if (response.status === 501) return false;

  // 500, 502, 503, 504, 507, etc: Definitely failures
  return true;
}
```

Common mistakes in failure definition:
Some 'successful' responses are semantic failures. A payment service returning 200 OK with body {"status": "FAILED", "reason": "Gateway unavailable"} is technically a success but semantically a failure. Advanced configurations may need custom predicates to detect these.
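A sketch of such a predicate (the response shape and reason strings are hypothetical): it counts infrastructure-flavored failures reported inside a 200 OK body as circuit failures, while leaving business outcomes like a declined card alone.

```typescript
// Hypothetical payment response shape
interface PaymentResponse {
  httpStatus: number;
  body: { status: 'SUCCESS' | 'FAILED' | 'DECLINED'; reason?: string };
}

// Reasons that indicate the downstream dependency is unhealthy,
// as opposed to a legitimate business outcome
const infraReasons = new Set(['Gateway unavailable', 'Gateway timeout']);

function isSemanticFailure(res: PaymentResponse): boolean {
  if (res.httpStatus >= 500) return true;           // transport-level failure
  if (res.body.status === 'DECLINED') return false; // business outcome, not an outage
  return res.body.status === 'FAILED' && infraReasons.has(res.body.reason ?? '');
}

// isSemanticFailure({ httpStatus: 200, body: { status: 'FAILED', reason: 'Gateway unavailable' } }) → true
// isSemanticFailure({ httpStatus: 200, body: { status: 'DECLINED', reason: 'Insufficient funds' } }) → false
```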
Circuit breaker configuration often needs to vary by environment. Development, staging, and production have different requirements.
Environment-specific settings:
```typescript
// Environment-specific circuit breaker configurations

const developmentConfig = {
  // More sensitive for faster feedback during testing
  failureRateThreshold: 25,
  minimumNumberOfCalls: 3,
  slidingWindowSize: 10,
  waitDurationInOpenState: 10_000, // 10 seconds - fast iteration

  // Aggressive tripping helps catch issues early
  rationale: 'Fast feedback for developers, catch issues quickly',
};

const stagingConfig = {
  // Similar to production but slightly more aggressive
  failureRateThreshold: 40,
  minimumNumberOfCalls: 15,
  slidingWindowSize: 50,
  waitDurationInOpenState: 30_000, // 30 seconds

  // Test circuit behavior under realistic conditions
  rationale: 'Validate circuit tuning before production',
};

const productionConfig = {
  // Conservative to avoid false positives
  failureRateThreshold: 50,
  minimumNumberOfCalls: 20,
  slidingWindowSize: 100,
  waitDurationInOpenState: 60_000, // 60 seconds

  // Balance protection with availability
  rationale: 'Minimize false positives, protect user experience',
};

// Configuration factory
function getCircuitBreakerConfig(env: string): CircuitBreakerConfig {
  switch (env) {
    case 'development': return developmentConfig;
    case 'staging': return stagingConfig;
    case 'production': return productionConfig;
    default: return productionConfig; // Safe default
  }
}
```

Per-Service Configuration:
Different downstream services often need different circuit configurations:
| Service Type | Failure Threshold | Wait Duration | Reasoning |
|---|---|---|---|
| Payment gateway | 30% | 120s | Critical path, protect aggressively, allow recovery time |
| User service | 50% | 30s | Core dependency, balanced approach |
| Recommendation engine | 70% | 30s | Non-critical, tolerate more failures |
| Analytics service | 80% | 60s | Best effort, can fail silently |
| External rate-limited API | 40% | 300s | Rate limits need longer recovery |
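The table above translates directly into a configuration map; a sketch (the service keys are illustrative, the values are taken from the table):

```typescript
interface ServiceCircuitConfig {
  failureRateThreshold: number; // percent
  waitDurationMs: number;
}

const perServiceConfig: Record<string, ServiceCircuitConfig> = {
  paymentGateway:         { failureRateThreshold: 30, waitDurationMs: 120_000 },
  userService:            { failureRateThreshold: 50, waitDurationMs: 30_000 },
  recommendationEngine:   { failureRateThreshold: 70, waitDurationMs: 30_000 },
  analyticsService:       { failureRateThreshold: 80, waitDurationMs: 60_000 },
  externalRateLimitedApi: { failureRateThreshold: 40, waitDurationMs: 300_000 },
};

// Unknown services fall back to balanced, userService-style defaults
function configFor(service: string): ServiceCircuitConfig {
  return perServiceConfig[service] ?? { failureRateThreshold: 50, waitDurationMs: 30_000 };
}
```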
Consider using a configuration service (Consul, etcd, AWS AppConfig) to manage circuit breaker settings. This allows tuning in production without redeployment—invaluable during incidents when you need to quickly adjust thresholds.
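One way to structure this (the settings shape and merge helper are illustrative, not a real client API for any of those systems): keep compiled-in defaults and overlay whatever the config service currently returns, so an unreachable config service safely leaves the defaults in effect.

```typescript
interface CircuitSettings {
  failureRateThreshold: number;
  waitDurationInOpenStateMs: number;
}

// Safe defaults baked into the deployment
const compiledDefaults: CircuitSettings = {
  failureRateThreshold: 50,
  waitDurationInOpenStateMs: 60_000,
};

// Overlay remote overrides on defaults; a null result (config service down
// or key missing) leaves the compiled-in defaults in effect
function mergeRemoteSettings(remote: Partial<CircuitSettings> | null): CircuitSettings {
  return { ...compiledDefaults, ...(remote ?? {}) };
}

// During an incident, an operator pushes { failureRateThreshold: 30 } and the
// next refresh tightens the circuit without a redeploy
```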
We've comprehensively examined every major circuit breaker configuration parameter, providing guidance for deriving appropriate values from your system's characteristics.
What's next:
With configuration mastered, the next page explores monitoring circuit state—how to observe circuit behavior, build effective dashboards, and set up alerting that makes circuit breaker activity visible to operators.
You now have a comprehensive understanding of circuit breaker configuration parameters. You can derive appropriate values from system characteristics, avoid common misconfiguration pitfalls, and tune circuits for different environments and service types.