Throughout this module, we've explored bulkheads in depth—understanding failure isolation, resource partitioning, thread pool implementations, and semaphore alternatives. But bulkheads address only part of the fault tolerance picture.
Bulkheads prevent cascade failures: When a downstream service is slow or failing, bulkheads contain the impact to a specific resource partition, protecting other workloads from starvation.
Circuit breakers prevent futile calls: When a downstream service is known to be failing, circuit breakers stop making calls entirely, failing fast rather than consuming resources on doomed requests.
Together, they provide comprehensive protection: Bulkheads handle the 'what if calls are slow?' scenario, while circuit breakers handle the 'what if the service is completely down?' scenario. Combining them creates a layered defense against distributed system failures.
By the end of this page, you will understand how bulkheads and circuit breakers complement each other, the correct composition order for these patterns, how to coordinate configuration between them, real-world integration strategies, and common mistakes when combining these patterns.
Bulkheads and circuit breakers solve different problems in the failure continuum. Understanding their distinct roles is essential for effective combination.
The failure timeline:
Consider a downstream service that's experiencing problems:
t=0: Service is healthy. Both bulkhead and circuit breaker are in normal state.
t=1m: Service starts responding slowly (2x normal latency). Bulkhead threads accumulate but don't exhaust. Circuit breaker sees increased latency but not enough failures to trip.
t=2m: Service latency increases further (5x normal). Bulkhead starts rejecting some requests. Circuit breaker still closed (errors below threshold).
t=3m: Service starts returning errors mixed with slow responses. Bulkhead is near exhaustion. Circuit breaker begins recording failures.
t=4m: Service failure rate exceeds threshold. Circuit breaker opens. Now both patterns are active—bulkhead protects capacity, circuit breaker stops new calls entirely.
t=5m: Circuit breaker enters half-open state, allowing test requests. If successful, circuit closes. Bulkhead capacity becomes available again.
Each pattern handles a different phase:
| Aspect | Bulkhead | Circuit Breaker |
|---|---|---|
| Primary Problem | Resource exhaustion from slow calls | Repeated calls to failing service |
| Trigger Condition | Capacity limit reached | Error rate threshold exceeded |
| Rejection Reason | No capacity available | Known-failing destination |
| Protection Type | Resource isolation | Failure detection and avoidance |
| Recovery Mode | Capacity frees as calls complete | Half-open state probes for recovery |
| Latency Protection | Excellent (limits concurrency) | Limited (only prevents known-bad calls) |
| Error Protection | Partial (limits blast radius) | Excellent (stops all calls when failing) |
A bulkhead without a circuit breaker will keep trying to call a completely dead service, consuming capacity on doomed requests. A circuit breaker without a bulkhead can't prevent slow (but successful) calls from exhausting resources. The combination addresses both failure modes.
When composing resilience patterns, the order of application significantly affects behavior. For bulkheads and circuit breakers, there is a correct order.
The correct order: Circuit Breaker → Bulkhead → Call
Read from left to right, the request first encounters the circuit breaker, then the bulkhead, then makes the actual call.
Why this order?
Circuit breaker rejects before consuming bulkhead capacity: If the circuit is open (service known to be failing), the request is rejected immediately. No bulkhead permit is acquired, preserving capacity for other operations.
Bulkhead capacity isn't wasted on doomed requests: When a circuit is open, bulkhead resources remain available for other bulkheads that are still operational.
Clean separation of concerns: Circuit breaker decides 'should we even try?' Bulkhead decides 'do we have capacity to try right now?'
The wrong order: Bulkhead → Circuit Breaker → Call
If the bulkhead is evaluated first, every request, including those an open circuit would have rejected, consumes a bulkhead permit or thread before the circuit breaker can turn it away. Capacity is spent on calls that were never going to be made, and under load the bulkhead may reject work that the circuit breaker could have short-circuited for free.
Think of the circuit breaker as a 'should we even try?' check and the bulkhead as 'committing resources to try.' You should always check before committing. This naturally leads to circuit breaker → bulkhead order.
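To make the ordering concrete, here is a minimal hand-rolled sketch of the check-then-commit sequence, assuming a semaphore-style Resilience4j Bulkhead and a CircuitBreaker instance. The GuardedCall helper and its parameters are illustrative; in practice you would let the Decorators composition shown later on this page do this for you.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative helper: "should we even try?" before "commit resources to try".
public final class GuardedCall {

    public static <T> T execute(CircuitBreaker circuitBreaker,
                                Bulkhead bulkhead,
                                Supplier<T> call,
                                Supplier<T> fallback) {
        // 1. Circuit breaker first: if the destination is known to be failing, stop here.
        if (!circuitBreaker.tryAcquirePermission()) {
            return fallback.get();                      // no bulkhead permit was consumed
        }
        // 2. Bulkhead second: only now commit capacity to the attempt.
        if (!bulkhead.tryAcquirePermission()) {
            circuitBreaker.releasePermission();         // the call never happened; hand the permission back
            return fallback.get();
        }
        long start = System.nanoTime();
        try {
            T result = call.get();                      // 3. The actual call
            circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
            return result;
        } catch (RuntimeException e) {
            circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, e);
            throw e;
        } finally {
            bulkhead.onComplete();                      // always release the bulkhead permit
        }
    }
}
```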
In practice, you almost always want a third pattern: timeouts. Without timeouts, a slow call can hold a bulkhead permit indefinitely, and the circuit breaker never records a failure (because the call technically hasn't failed—it's just very slow).
The complete pattern: Circuit Breaker → Bulkhead → Timeout → Call
Each layer adds a specific protection:
```java
import io.github.resilience4j.bulkhead.*;
import io.github.resilience4j.circuitbreaker.*;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.timelimiter.*;

import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.*;
import java.util.function.Supplier;

// Configure each pattern independently
CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService",
    CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                          // Open at 50% failures
        .slowCallRateThreshold(80)                         // Also trip on slow calls
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .slidingWindowSize(10)
        .minimumNumberOfCalls(5)
        .permittedNumberOfCallsInHalfOpenState(3)
        .build());

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("paymentService",
    ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(50)
        .coreThreadPoolSize(50)
        .queueCapacity(10)
        .build());

TimeLimiter timeLimiter = TimeLimiter.of(
    TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(3))            // 3-second timeout
        .cancelRunningFuture(true)                         // Cancel on timeout
        .build());

// Compose with correct order: Circuit Breaker (outermost) → Bulkhead → Timeout → Call
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);

Supplier<CompletionStage<PaymentResult>> decorated = Decorators
    .ofSupplier(() -> paymentGateway.process(payment))     // The actual call
    .withThreadPoolBulkhead(bulkhead)                      // Isolation layer
    .withTimeLimiter(timeLimiter, scheduler)               // Timeout layer
    .withCircuitBreaker(circuitBreaker)                    // Protection layer (outermost)
    .withFallback(Arrays.asList(
        CallNotPermittedException.class,                   // Circuit open
        BulkheadFullException.class,                       // Bulkhead exhausted
        TimeoutException.class                             // Call took too long
    ), throwable -> fallbackPaymentResult())               // Graceful degradation
    .decorate();

// Execute:
try {
    PaymentResult result = decorated.get().toCompletableFuture().join();
} catch (Exception e) {
    // Handle unexpected exceptions (not covered by the fallback)
    logger.error("Unexpected payment failure", e);
}
```

The timeout must be inside the bulkhead. If the timeout is outside, the caller might time out and move on, but the bulkhead thread continues executing the slow call—consuming capacity for a result no one will use. With the timeout inside, the operation is cancelled, freeing the bulkhead thread promptly.
When combining patterns, configuration values must be coordinated. Misaligned values can cause unexpected behavior or render one pattern ineffective.
Key coordination points:
The most important relationship is timeout ≤ slowCallThreshold: the circuit breaker's slow-call duration threshold should be no less than the timeout, so the two layers agree on what 'slow' means. In the example below they are equal, so a slow call is exactly a call that times out.

| Parameter | Value | Rationale |
|---|---|---|
| Timeout | 3 seconds | User-acceptable wait time |
| CB slow call threshold | 3 seconds | Matches timeout; slow = timeout |
| CB slow call rate threshold | 80% | Trip if 80% of calls are slow (timing out) |
| CB failure rate threshold | 50% | Trip if 50% of calls fail outright |
| CB sliding window size | 20 calls | Small enough to trip quickly |
| CB wait in open state | 30 seconds | Time for downstream to recover |
| Bulkhead pool size | 50 threads | Supports 50 concurrent in-flight calls |
| Bulkhead queue size | 10 | Small buffer for bursts |
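One way to keep these values coordinated is to derive the related ones from a single constant, so the timeout and the slow-call threshold cannot drift apart. The sketch below expresses the table's values with Resilience4j config builders; the variable names are illustrative.

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;

// Single source of truth for "how long are we willing to wait?"
Duration callBudget = Duration.ofSeconds(3);             // user-acceptable wait time

TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(callBudget)
    .cancelRunningFuture(true)
    .build();

CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .slowCallDurationThreshold(callBudget)                // matches the timeout: slow = timing out
    .slowCallRateThreshold(80)                            // trip if 80% of calls are slow
    .failureRateThreshold(50)                             // trip if 50% of calls fail outright
    .slidingWindowSize(20)                                // small enough to trip quickly
    .waitDurationInOpenState(Duration.ofSeconds(30))      // time for downstream to recover
    .build();

ThreadPoolBulkheadConfig bulkheadConfig = ThreadPoolBulkheadConfig.custom()
    .coreThreadPoolSize(50)
    .maxThreadPoolSize(50)                                // 50 concurrent in-flight calls
    .queueCapacity(10)                                    // small buffer for bursts
    .build();
```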
Calculating the critical interval:
How long from first slow request to circuit open?
Worst case: all 60 slots (50 threads + 10 queue) fill with requests that will time out.
Time to circuit open: ~3 seconds after degradation starts. Nothing can be recorded as slow or failed until the first calls hit the 3-second timeout; with an 80% slow-call threshold over a 20-call window, the breaker opens almost as soon as that first wave of timeouts is recorded.
During this 3-second window: up to 60 requests are tied up waiting to time out, and any further arrivals are rejected immediately by the full bulkhead rather than piling up.
This is acceptable: limited blast radius during the detection window, then full protection.
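As a rough sanity check, you can estimate the blast radius of the detection window. The sketch below is a back-of-the-envelope calculation in which the arrival rate is an assumed input and the other numbers come from the table above.

```java
// Back-of-the-envelope blast-radius estimate for the detection window.
// arrivalRatePerSecond is an assumed input; the other values come from the configuration above.
double arrivalRatePerSecond = 100.0;
int bulkheadCapacity = 50 + 10;                   // threads + queue
double timeToOpenSeconds = 3.0;                   // first timeouts are recorded at the timeout

double arrivals = arrivalRatePerSecond * timeToOpenSeconds;
double heldUntilTimeout = Math.min(arrivals, bulkheadCapacity);        // tied up waiting to time out
double rejectedFast = Math.max(0, arrivals - bulkheadCapacity);        // bulkhead-full rejections, not hangs

System.out.printf("During the window: %.0f requests held, %.0f rejected fast%n",
    heldUntilTimeout, rejectedFast);
```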
Write integration tests that simulate downstream degradation and measure: (1) time to circuit open, (2) requests affected during that time, (3) behavior of the system while circuit is open, (4) recovery behavior in half-open state. These tests validate that your pattern composition works as expected.
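A sketch of such a test might look like the following. It covers only the first measurement (time to circuit open); simulateDegradedDownstream() and callThroughResiliencePipeline() are hypothetical hooks standing in for your fault-injection tool and your real call path, and circuitBreaker is the instance under test.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.time.Duration;
import java.time.Instant;

// Measure how long the composed pipeline takes to open the circuit once the
// downstream degrades. The helper methods are placeholders for your own harness.
Instant degradationStart = Instant.now();
simulateDegradedDownstream();                     // e.g., inject a fixed delay longer than the timeout

int requestsDuringWindow = 0;
while (circuitBreaker.getState() != CircuitBreaker.State.OPEN) {
    callThroughResiliencePipeline();              // circuit breaker → bulkhead → timeout → call
    requestsDuringWindow++;                       // requests sent before the breaker opened
}

Duration timeToOpen = Duration.between(degradationStart, Instant.now());
System.out.printf("Circuit opened after %s; %d requests sent during the window%n",
    timeToOpen, requestsDuringWindow);
// Continue the scenario from here: assert behaviour while the circuit is open
// (fast fallbacks, no bulkhead pressure) and recovery via half-open probes.
```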
A key architectural decision is whether to share or separate pattern instances across different callers of the same service.
| Strategy | Circuit Breaker | Bulkhead | Best For |
|---|---|---|---|
| Full Sharing | Shared | Shared | Simple applications, single responsibility |
| Full Isolation | Per-caller | Per-caller | Multi-tenant, diverse workloads |
| Global Detection, Local Capacity | Shared | Per-caller | Shared services with varied consumers |
| Local Detection, Global Capacity | Per-caller | Shared | Rare; usually doesn't make sense |
Example: Shared circuit breaker, separate bulkheads
Consider a Payment Service called by several different consumers, for example an interactive checkout flow and a nightly billing job.
Shared circuit breaker: If Payment Service is down, all callers see the circuit open immediately. No one wastes resources on doomed calls.
Separate bulkheads: each caller gets its own capacity partition, so a traffic surge or slowdown in one caller cannot starve the others.
Benefits: failure detection is shared (one caller's errors warn everyone off the failing service immediately), while capacity stays isolated (no caller can exhaust another's threads).
Most resilience libraries (Resilience4j, Polly) use a registry pattern where instances are named and retrieved by name. This makes sharing intentional: callers that want to share use the same name, callers that don't use different names. Be explicit about your sharing strategy in naming conventions.
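For example, the 'global detection, local capacity' strategy falls directly out of the naming: every caller asks the registry for the same circuit breaker name but for its own bulkhead name. A minimal Resilience4j sketch, with the caller names ("checkout", "billing-batch") as illustrative placeholders:

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadRegistry;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Registries make sharing an explicit, named decision.
CircuitBreakerRegistry cbRegistry = CircuitBreakerRegistry.of(CircuitBreakerConfig.ofDefaults());
ThreadPoolBulkheadRegistry bhRegistry = ThreadPoolBulkheadRegistry.of(ThreadPoolBulkheadConfig.ofDefaults());

// One shared circuit breaker per downstream: every caller asks for the same name.
CircuitBreaker paymentCircuit = cbRegistry.circuitBreaker("paymentService");

// One bulkhead per caller of that downstream: the name encodes the sharing decision.
ThreadPoolBulkhead checkoutBulkhead = bhRegistry.bulkhead("paymentService-checkout");
ThreadPoolBulkhead billingBulkhead  = bhRegistry.bulkhead("paymentService-billing-batch");
```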
Even experienced engineers make mistakes when combining resilience patterns. Here are the most common pitfalls.
Another frequent mistake is scoping circuit breakers too narrowly: tracking /api/v1/users and /api/v1/users?active=true as separate circuits splits the signal. If they share a backend, share a circuit breaker.
```java
// MISTAKE 1: Bulkhead rejection not recorded by circuit breaker
// WRONG:
ThreadPoolBulkhead bulkhead = ...;
CircuitBreaker circuitBreaker = ...;

try {
    bulkhead.executeSupplier(() -> service.call());
} catch (BulkheadFullException e) {
    // Bulkhead rejected, but the circuit breaker doesn't know!
    return fallback();
}

// CORRECT:
try {
    bulkhead.executeSupplier(() -> service.call());
} catch (BulkheadFullException e) {
    circuitBreaker.onError(0, TimeUnit.NANOSECONDS, e);   // Record the rejection as a failure
    return fallback();
}

// Or use proper decorator composition:
Decorators.ofSupplier(() -> service.call())
    .withThreadPoolBulkhead(bulkhead)
    .withCircuitBreaker(circuitBreaker)                    // Failures propagate correctly
    .decorate();

// MISTAKE 2: Fallback that repeats the failure
// WRONG:
Supplier<Result> decorated = Decorators.ofSupplier(() -> primaryService.call())
    .withCircuitBreaker(circuitBreaker)
    .withFallback(ex -> secondaryService.call())           // Same service, different endpoint!
    .decorate();

// CORRECT:
Supplier<Result> decorated = Decorators.ofSupplier(() -> primaryService.call())
    .withCircuitBreaker(circuitBreaker)
    .withFallback(ex -> {
        // Fall back to cached data, degraded mode, or a truly independent service
        return getCachedResult();
    })
    .decorate();

// MISTAKE 3: Timeout outside the bulkhead (thread continues after timeout)
// WRONG:
CompletableFuture.supplyAsync(() -> service.call(), bulkhead.getExecutorService())
    .orTimeout(3, TimeUnit.SECONDS)    // Caller gives up, but the executor thread keeps running!
    .join();

// CORRECT:
// Timeout INSIDE the call, or use a TimeLimiter that cancels:
TimeLimiter timeLimiter = TimeLimiter.of(Duration.ofSeconds(3));
bulkhead.executeSupplier(() ->
    timeLimiter.executeFutureSupplier(() ->
        CompletableFuture.supplyAsync(() -> service.call())
    ));
```

When combining patterns, don't assume independent failures. If Service A and Service B are both hosted on the same degraded infrastructure, their circuit breakers may trip simultaneously, and cascaded fallbacks may all fail. Architect fallbacks to be truly independent—different infrastructure, cached data, or degraded functionality.
With multiple patterns active, monitoring must provide a unified view of the resilience state while allowing drill-down into individual pattern behavior.
Dashboard design for combined patterns:
```
┌─────────────────────────────────────────────────────────────┐
│ Payment Service Health │
├──────────────────┬──────────────────┬───────────────────────┤
│ Circuit Breaker │ Bulkhead │ Success Rate │
│ │ │ │
│ ● CLOSED │ [████████░░] 80%│ 98.5% │
│ │ 40/50 threads │ (last 5 min) │
├──────────────────┴──────────────────┴───────────────────────┤
│ Rejection Breakdown │
├─────────────────────────────────────────────────────────────┤
│ Circuit Open: 0 Bulkhead Full: 12 Timeout: 3 │
├─────────────────────────────────────────────────────────────┤
│ Response Time (p99) │
│ [Graph showing latency over time with timeout threshold] │
└─────────────────────────────────────────────────────────────┘
```
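The figures on a dashboard like this can be read directly from the pattern instances; how you export them (Micrometer, logs, a health endpoint) depends on your stack. A minimal sketch, assuming circuitBreaker and bulkhead are the instances configured earlier:

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// Snapshot of the combined resilience state for the Payment Service dependency.
CircuitBreaker.Metrics cbMetrics = circuitBreaker.getMetrics();
ThreadPoolBulkhead.Metrics bhMetrics = bulkhead.getMetrics();

System.out.printf(
    "circuit=%s failureRate=%.1f%% slowCallRate=%.1f%% rejectedByOpenCircuit=%d queue=%d/%d%n",
    circuitBreaker.getState(),                    // CLOSED / OPEN / HALF_OPEN
    cbMetrics.getFailureRate(),                   // -1 until the minimum number of calls is recorded
    cbMetrics.getSlowCallRate(),
    cbMetrics.getNumberOfNotPermittedCalls(),     // calls rejected because the circuit was open
    bhMetrics.getQueueDepth(),                    // bulkhead pressure
    bhMetrics.getQueueCapacity());
```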
Alert correlation:
Escalation should correlate across patterns: if the circuit is open AND bulkhead was saturated before it opened, the root cause is likely downstream latency. If the circuit is open without prior bulkhead pressure, the cause is likely quick failures (errors, not slowness).
Document what to do for each resilience state combination: 'Circuit Open + Bulkhead Clear' means the downstream is completely failing. 'Circuit Closed + Bulkhead Saturated' means latency but no errors—check downstream latency. 'Circuit Half-Open + Bulkhead Low' means we're testing recovery. Each state has diagnostic steps and potential actions.
We've completed a comprehensive journey through the Bulkhead Pattern—from foundational concepts to advanced integration with circuit breakers. Let's consolidate the key takeaways from this entire module.
You now have the knowledge to identify workloads that need isolation, choose between thread pool and semaphore bulkheads, size and coordinate their configuration, and compose them with circuit breakers, timeouts, and fallbacks into a layered defense.
The Bulkhead Pattern is a cornerstone of building reliable distributed systems. Combined with the other fault tolerance patterns in this chapter—Circuit Breakers, Timeouts, Retry, and Fallbacks—you have a comprehensive toolkit for building systems that gracefully handle the inevitable failures of distributed computing.
Congratulations! You've completed the Bulkhead Pattern module. You now understand failure isolation at a deep level—from the principles that make bulkheads effective to the implementation details of thread pool and semaphore variants to the integration with circuit breakers. Apply this knowledge to build systems that remain partially operational even when parts fail—the hallmark of resilient distributed systems.