Throughout this module, we've explored bulkheads in depth—understanding failure isolation, resource partitioning, thread pool implementations, and semaphore alternatives. But bulkheads address only part of the fault tolerance picture.
Bulkheads prevent cascade failures: When a downstream service is slow or failing, bulkheads contain the impact to a specific resource partition, protecting other workloads from starvation.
Circuit breakers prevent futile calls: When a downstream service is known to be failing, circuit breakers stop making calls entirely, failing fast rather than consuming resources on doomed requests.
Together, they provide comprehensive protection: Bulkheads handle the 'what if calls are slow?' scenario, while circuit breakers handle the 'what if the service is completely down?' scenario. Combining them creates a layered defense against distributed system failures.
By the end of this page, you will understand how bulkheads and circuit breakers complement each other, the correct composition order for these patterns, how to coordinate configuration between them, real-world integration strategies, and common mistakes when combining these patterns.
Bulkheads and circuit breakers solve different problems in the failure continuum. Understanding their distinct roles is essential for effective combination.
The failure timeline:
Consider a downstream service that's experiencing problems:
t=0: Service is healthy. Both bulkhead and circuit breaker are in normal state.
t=1m: Service starts responding slowly (2x normal latency). Bulkhead threads accumulate but don't exhaust. Circuit breaker sees increased latency but not enough failures to trip.
t=2m: Service latency increases further (5x normal). Bulkhead starts rejecting some requests. Circuit breaker still closed (errors below threshold).
t=3m: Service starts returning errors mixed with slow responses. Bulkhead is near exhaustion. Circuit breaker begins recording failures.
t=4m: Service failure rate exceeds threshold. Circuit breaker opens. Now both patterns are active—bulkhead protects capacity, circuit breaker stops new calls entirely.
t=5m: Circuit breaker enters half-open state, allowing test requests. If successful, circuit closes. Bulkhead capacity becomes available again.
Each pattern handles a different phase:
| Aspect | Bulkhead | Circuit Breaker |
|---|---|---|
| Primary Problem | Resource exhaustion from slow calls | Repeated calls to failing service |
| Trigger Condition | Capacity limit reached | Error rate threshold exceeded |
| Rejection Reason | No capacity available | Known-failing destination |
| Protection Type | Resource isolation | Failure detection and avoidance |
| Recovery Mode | Capacity frees as calls complete | Half-open state probes for recovery |
| Latency Protection | Excellent (limits concurrency) | Limited (only prevents known-bad calls) |
| Error Protection | Partial (limits blast radius) | Excellent (stops all calls when failing) |
A bulkhead without a circuit breaker will keep trying to call a completely dead service, consuming capacity on doomed requests. A circuit breaker without a bulkhead can't prevent slow (but successful) calls from exhausting resources. The combination addresses both failure modes.
When composing resilience patterns, the order of application significantly affects behavior. For bulkheads and circuit breakers, there is a correct order.
The correct order: Circuit Breaker → Bulkhead → Call
Read from left to right, the request first encounters the circuit breaker, then the bulkhead, then makes the actual call.
Why this order?
Circuit breaker rejects before consuming bulkhead capacity: If the circuit is open (service known to be failing), the request is rejected immediately. No bulkhead permit is acquired, preserving capacity for other operations.
Bulkhead capacity isn't wasted on doomed requests: When a circuit is open, bulkhead resources remain available for other bulkheads that are still operational.
Clean separation of concerns: Circuit breaker decides 'should we even try?' Bulkhead decides 'do we have capacity to try right now?'
The wrong order: Bulkhead → Circuit Breaker → Call
If the bulkhead is evaluated first, every request, including those an open circuit would have rejected, consumes a bulkhead permit or thread before the circuit breaker can turn it away. Capacity is spent on calls that were never going to be made, and under load the bulkhead may reject work that the circuit breaker could have short-circuited for free.
Think of the circuit breaker as a 'should we even try?' check and the bulkhead as 'committing resources to try.' You should always check before committing. This naturally leads to circuit breaker → bulkhead order.
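To make the ordering concrete, here is a minimal hand-rolled sketch of the check-then-commit sequence, assuming a semaphore-style Resilience4j Bulkhead and a CircuitBreaker instance. The GuardedCall helper and its parameters are illustrative; in practice you would let the Decorators composition shown later on this page do this for you.

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative helper: "should we even try?" before "commit resources to try".
public final class GuardedCall {

    public static <T> T execute(CircuitBreaker circuitBreaker,
                                Bulkhead bulkhead,
                                Supplier<T> call,
                                Supplier<T> fallback) {
        // 1. Circuit breaker first: if the destination is known to be failing, stop here.
        if (!circuitBreaker.tryAcquirePermission()) {
            return fallback.get();                      // no bulkhead permit was consumed
        }
        // 2. Bulkhead second: only now commit capacity to the attempt.
        if (!bulkhead.tryAcquirePermission()) {
            circuitBreaker.releasePermission();         // the call never happened; hand the permission back
            return fallback.get();
        }
        long start = System.nanoTime();
        try {
            T result = call.get();                      // 3. The actual call
            circuitBreaker.onSuccess(System.nanoTime() - start, TimeUnit.NANOSECONDS);
            return result;
        } catch (RuntimeException e) {
            circuitBreaker.onError(System.nanoTime() - start, TimeUnit.NANOSECONDS, e);
            throw e;
        } finally {
            bulkhead.onComplete();                      // always release the bulkhead permit
        }
    }
}
```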
In practice, you almost always want a third pattern: timeouts. Without timeouts, a slow call can hold a bulkhead permit indefinitely, and the circuit breaker never records a failure (because the call technically hasn't failed—it's just very slow).
The complete pattern: Circuit Breaker → Bulkhead → Timeout → Call
Each layer adds a specific protection:
```java
import io.github.resilience4j.bulkhead.*;
import io.github.resilience4j.circuitbreaker.*;
import io.github.resilience4j.decorators.Decorators;
import io.github.resilience4j.timelimiter.*;

import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.*;
import java.util.function.Supplier;

// Configure each pattern independently
CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService",
    CircuitBreakerConfig.custom()
        .failureRateThreshold(50)                          // Open at 50% failures
        .slowCallRateThreshold(80)                         // Also trip on slow calls
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .slidingWindowSize(10)
        .minimumNumberOfCalls(5)
        .permittedNumberOfCallsInHalfOpenState(3)
        .build());

ThreadPoolBulkhead bulkhead = ThreadPoolBulkhead.of("paymentService",
    ThreadPoolBulkheadConfig.custom()
        .maxThreadPoolSize(50)
        .coreThreadPoolSize(50)
        .queueCapacity(10)
        .build());

TimeLimiter timeLimiter = TimeLimiter.of(
    TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(3))            // 3-second timeout
        .cancelRunningFuture(true)                         // Cancel on timeout
        .build());

// Compose with correct order: Circuit Breaker (outermost) → Bulkhead → Timeout → Call
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);

Supplier<CompletionStage<PaymentResult>> decorated = Decorators
    .ofSupplier(() -> paymentGateway.process(payment))     // The actual call
    .withThreadPoolBulkhead(bulkhead)                      // Isolation layer
    .withTimeLimiter(timeLimiter, scheduler)               // Timeout layer
    .withCircuitBreaker(circuitBreaker)                    // Protection layer (outermost)
    .withFallback(Arrays.asList(
        CallNotPermittedException.class,                   // Circuit open
        BulkheadFullException.class,                       // Bulkhead exhausted
        TimeoutException.class                             // Call took too long
    ), throwable -> fallbackPaymentResult())               // Graceful degradation
    .decorate();

// Execute:
try {
    PaymentResult result = decorated.get().toCompletableFuture().join();
} catch (Exception e) {
    // Handle unexpected exceptions (not covered by the fallback)
    logger.error("Unexpected payment failure", e);
}
```

The timeout must be inside the bulkhead. If the timeout is outside, the caller might time out and move on, but the bulkhead thread continues executing the slow call—consuming capacity for a result no one will use. With the timeout inside, the operation is cancelled, freeing the bulkhead thread promptly.
When combining patterns, configuration values must be coordinated. Misaligned values can cause unexpected behavior or render one pattern ineffective.
Key coordination points:
The most important relationship is timeout ≤ slowCallThreshold: the circuit breaker's slow-call duration threshold should be no less than the timeout, so the two layers agree on what 'slow' means. In the example below they are equal, so a slow call is exactly a call that times out.

| Parameter | Value | Rationale |
|---|---|---|
| Timeout | 3 seconds | User-acceptable wait time |
| CB slow call threshold | 3 seconds | Matches timeout; slow = timeout |
| CB slow call rate threshold | 80% | Trip if 80% of calls are slow (timing out) |
| CB failure rate threshold | 50% | Trip if 50% of calls fail outright |
| CB sliding window size | 20 calls | Small enough to trip quickly |
| CB wait in open state | 30 seconds | Time for downstream to recover |
| Bulkhead pool size | 50 threads | Supports 50 concurrent in-flight calls |
| Bulkhead queue size | 10 | Small buffer for bursts |
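One way to keep these values coordinated is to derive the related ones from a single constant, so the timeout and the slow-call threshold cannot drift apart. The sketch below expresses the table's values with Resilience4j config builders; the variable names are illustrative.

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;

import java.time.Duration;

// Single source of truth for "how long are we willing to wait?"
Duration callBudget = Duration.ofSeconds(3);             // user-acceptable wait time

TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(callBudget)
    .cancelRunningFuture(true)
    .build();

CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .slowCallDurationThreshold(callBudget)                // matches the timeout: slow = timing out
    .slowCallRateThreshold(80)                            // trip if 80% of calls are slow
    .failureRateThreshold(50)                             // trip if 50% of calls fail outright
    .slidingWindowSize(20)                                // small enough to trip quickly
    .waitDurationInOpenState(Duration.ofSeconds(30))      // time for downstream to recover
    .build();

ThreadPoolBulkheadConfig bulkheadConfig = ThreadPoolBulkheadConfig.custom()
    .coreThreadPoolSize(50)
    .maxThreadPoolSize(50)                                // 50 concurrent in-flight calls
    .queueCapacity(10)                                    // small buffer for bursts
    .build();
```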
Calculating the critical interval:
How long from first slow request to circuit open?
Worst case: all 60 slots (50 threads + 10 queue) fill with requests that will time out.
Time to circuit open: ~3 seconds after degradation starts. Nothing can be recorded as slow or failed until the first calls hit the 3-second timeout; with an 80% slow-call threshold over a 20-call window, the breaker opens almost as soon as that first wave of timeouts is recorded.
During this 3-second window: up to 60 requests are tied up waiting to time out, and any further arrivals are rejected immediately by the full bulkhead rather than piling up.
This is acceptable: limited blast radius during the detection window, then full protection.
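As a rough sanity check, you can estimate the blast radius of the detection window. The sketch below is a back-of-the-envelope calculation in which the arrival rate is an assumed input and the other numbers come from the table above.

```java
// Back-of-the-envelope blast-radius estimate for the detection window.
// arrivalRatePerSecond is an assumed input; the other values come from the configuration above.
double arrivalRatePerSecond = 100.0;
int bulkheadCapacity = 50 + 10;                   // threads + queue
double timeToOpenSeconds = 3.0;                   // first timeouts are recorded at the timeout

double arrivals = arrivalRatePerSecond * timeToOpenSeconds;
double heldUntilTimeout = Math.min(arrivals, bulkheadCapacity);        // tied up waiting to time out
double rejectedFast = Math.max(0, arrivals - bulkheadCapacity);        // bulkhead-full rejections, not hangs

System.out.printf("During the window: %.0f requests held, %.0f rejected fast%n",
    heldUntilTimeout, rejectedFast);
```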
Write integration tests that simulate downstream degradation and measure: (1) time to circuit open, (2) requests affected during that time, (3) behavior of the system while circuit is open, (4) recovery behavior in half-open state. These tests validate that your pattern composition works as expected.
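A sketch of such a test might look like the following. It covers only the first measurement (time to circuit open); simulateDegradedDownstream() and callThroughResiliencePipeline() are hypothetical hooks standing in for your fault-injection tool and your real call path, and circuitBreaker is the instance under test.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

import java.time.Duration;
import java.time.Instant;

// Measure how long the composed pipeline takes to open the circuit once the
// downstream degrades. The helper methods are placeholders for your own harness.
Instant degradationStart = Instant.now();
simulateDegradedDownstream();                     // e.g., inject a fixed delay longer than the timeout

int requestsDuringWindow = 0;
while (circuitBreaker.getState() != CircuitBreaker.State.OPEN) {
    callThroughResiliencePipeline();              // circuit breaker → bulkhead → timeout → call
    requestsDuringWindow++;                       // requests sent before the breaker opened
}

Duration timeToOpen = Duration.between(degradationStart, Instant.now());
System.out.printf("Circuit opened after %s; %d requests sent during the window%n",
    timeToOpen, requestsDuringWindow);
// Continue the scenario from here: assert behaviour while the circuit is open
// (fast fallbacks, no bulkhead pressure) and recovery via half-open probes.
```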
A key architectural decision is whether to share or separate pattern instances across different callers of the same service.
| Strategy | Circuit Breaker | Bulkhead | Best For |
|---|---|---|---|
| Full Sharing | Shared | Shared | Simple applications, single responsibility |
| Full Isolation | Per-caller | Per-caller | Multi-tenant, diverse workloads |
| Global Detection, Local Capacity | Shared | Per-caller | Shared services with varied consumers |
| Local Detection, Global Capacity | Per-caller | Shared | Rare; usually doesn't make sense |
Example: Shared circuit breaker, separate bulkheads
Consider a Payment Service called by several different consumers, for example an interactive checkout flow and a nightly billing job.
Shared circuit breaker: If Payment Service is down, all callers see the circuit open immediately. No one wastes resources on doomed calls.
Separate bulkheads: each caller gets its own capacity partition, so a traffic surge or slowdown in one caller cannot starve the others.
Benefits: failure detection is shared (one caller's errors warn everyone off the failing service immediately), while capacity stays isolated (no caller can exhaust another's threads).
Most resilience libraries (Resilience4j, Polly) use a registry pattern where instances are named and retrieved by name. This makes sharing intentional: callers that want to share use the same name, callers that don't use different names. Be explicit about your sharing strategy in naming conventions.
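For example, the 'global detection, local capacity' strategy falls directly out of the naming: every caller asks the registry for the same circuit breaker name but for its own bulkhead name. A minimal Resilience4j sketch, with the caller names ("checkout", "billing-batch") as illustrative placeholders:

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadRegistry;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

// Registries make sharing an explicit, named decision.
CircuitBreakerRegistry cbRegistry = CircuitBreakerRegistry.of(CircuitBreakerConfig.ofDefaults());
ThreadPoolBulkheadRegistry bhRegistry = ThreadPoolBulkheadRegistry.of(ThreadPoolBulkheadConfig.ofDefaults());

// One shared circuit breaker per downstream: every caller asks for the same name.
CircuitBreaker paymentCircuit = cbRegistry.circuitBreaker("paymentService");

// One bulkhead per caller of that downstream: the name encodes the sharing decision.
ThreadPoolBulkhead checkoutBulkhead = bhRegistry.bulkhead("paymentService-checkout");
ThreadPoolBulkhead billingBulkhead  = bhRegistry.bulkhead("paymentService-billing-batch");
```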
Even experienced engineers make mistakes when combining resilience patterns. Here are the most common pitfalls.
Another frequent mistake is scoping circuit breakers too narrowly: tracking /api/v1/users and /api/v1/users?active=true as separate circuits splits the signal. If they share a backend, share a circuit breaker.
```java
// MISTAKE 1: Bulkhead rejection not recorded by circuit breaker
// WRONG:
ThreadPoolBulkhead bulkhead = ...;
CircuitBreaker circuitBreaker = ...;

try {
    bulkhead.executeSupplier(() -> service.call());
} catch (BulkheadFullException e) {
    // Bulkhead rejected, but the circuit breaker doesn't know!
    return fallback();
}

// CORRECT:
try {
    bulkhead.executeSupplier(() -> service.call());
} catch (BulkheadFullException e) {
    circuitBreaker.onError(0, TimeUnit.NANOSECONDS, e);   // Record the rejection as a failure
    return fallback();
}

// Or use proper decorator composition:
Decorators.ofSupplier(() -> service.call())
    .withThreadPoolBulkhead(bulkhead)
    .withCircuitBreaker(circuitBreaker)                    // Failures propagate correctly
    .decorate();

// MISTAKE 2: Fallback that repeats the failure
// WRONG:
Supplier<Result> decorated = Decorators.ofSupplier(() -> primaryService.call())
    .withCircuitBreaker(circuitBreaker)
    .withFallback(ex -> secondaryService.call())           // Same service, different endpoint!
    .decorate();

// CORRECT:
Supplier<Result> decorated = Decorators.ofSupplier(() -> primaryService.call())
    .withCircuitBreaker(circuitBreaker)
    .withFallback(ex -> {
        // Fall back to cached data, degraded mode, or a truly independent service
        return getCachedResult();
    })
    .decorate();

// MISTAKE 3: Timeout outside the bulkhead (thread continues after timeout)
// WRONG:
CompletableFuture.supplyAsync(() -> service.call(), bulkhead.getExecutorService())
    .orTimeout(3, TimeUnit.SECONDS)    // Caller gives up, but the executor thread keeps running!
    .join();

// CORRECT:
// Timeout INSIDE the call, or use a TimeLimiter that cancels:
TimeLimiter timeLimiter = TimeLimiter.of(Duration.ofSeconds(3));
bulkhead.executeSupplier(() ->
    timeLimiter.executeFutureSupplier(() ->
        CompletableFuture.supplyAsync(() -> service.call())
    ));
```

When combining patterns, don't assume independent failures. If Service A and Service B are both hosted on the same degraded infrastructure, their circuit breakers may trip simultaneously, and cascaded fallbacks may all fail. Architect fallbacks to be truly independent—different infrastructure, cached data, or degraded functionality.
With multiple patterns active, monitoring must provide a unified view of the resilience state while allowing drill-down into individual pattern behavior.
Dashboard design for combined patterns:
```
┌─────────────────────────────────────────────────────────────┐
│ Payment Service Health │
├──────────────────┬──────────────────┬───────────────────────┤
│ Circuit Breaker │ Bulkhead │ Success Rate │
│ │ │ │
│ ● CLOSED │ [████████░░] 80%│ 98.5% │
│ │ 40/50 threads │ (last 5 min) │
├──────────────────┴──────────────────┴───────────────────────┤
│ Rejection Breakdown │
├─────────────────────────────────────────────────────────────┤
│ Circuit Open: 0 Bulkhead Full: 12 Timeout: 3 │
├─────────────────────────────────────────────────────────────┤
│ Response Time (p99) │
│ [Graph showing latency over time with timeout threshold] │
└─────────────────────────────────────────────────────────────┘
```
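The figures on a dashboard like this can be read directly from the pattern instances; how you export them (Micrometer, logs, a health endpoint) depends on your stack. A minimal sketch, assuming circuitBreaker and bulkhead are the instances configured earlier:

```java
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// Snapshot of the combined resilience state for the Payment Service dependency.
CircuitBreaker.Metrics cbMetrics = circuitBreaker.getMetrics();
ThreadPoolBulkhead.Metrics bhMetrics = bulkhead.getMetrics();

System.out.printf(
    "circuit=%s failureRate=%.1f%% slowCallRate=%.1f%% rejectedByOpenCircuit=%d queue=%d/%d%n",
    circuitBreaker.getState(),                    // CLOSED / OPEN / HALF_OPEN
    cbMetrics.getFailureRate(),                   // -1 until the minimum number of calls is recorded
    cbMetrics.getSlowCallRate(),
    cbMetrics.getNumberOfNotPermittedCalls(),     // calls rejected because the circuit was open
    bhMetrics.getQueueDepth(),                    // bulkhead pressure
    bhMetrics.getQueueCapacity());
```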
Alert correlation:
Escalation should correlate across patterns: if the circuit is open AND bulkhead was saturated before it opened, the root cause is likely downstream latency. If the circuit is open without prior bulkhead pressure, the cause is likely quick failures (errors, not slowness).
Document what to do for each resilience state combination: 'Circuit Open + Bulkhead Clear' means the downstream is completely failing. 'Circuit Closed + Bulkhead Saturated' means latency but no errors—check downstream latency. 'Circuit Half-Open + Bulkhead Low' means we're testing recovery. Each state has diagnostic steps and potential actions.
We've completed a comprehensive journey through the Bulkhead Pattern—from foundational concepts to advanced integration with circuit breakers. Let's consolidate the key takeaways from this entire module.
You now have the knowledge to identify workloads that need isolation, choose between thread pool and semaphore bulkheads, size and coordinate their configuration, and compose them with circuit breakers, timeouts, and fallbacks into a layered defense.
The Bulkhead Pattern is a cornerstone of building reliable distributed systems. Combined with the other fault tolerance patterns in this chapter—Circuit Breakers, Timeouts, Retry, and Fallbacks—you have a comprehensive toolkit for building systems that gracefully handle the inevitable failures of distributed computing.
Congratulations! You've completed the Bulkhead Pattern module. You now understand failure isolation at a deep level—from the principles that make bulkheads effective to the implementation details of thread pool and semaphore variants to the integration with circuit breakers. Apply this knowledge to build systems that remain partially operational even when parts fail—the hallmark of resilient distributed systems.