In December 2020, a major cloud provider experienced a cascading outage that took down authentication services for thousands of companies. The root cause wasn't a single component failure—it was the response to that failure. When the authentication cache became overloaded, services that couldn't authenticate users returned errors. Those errors triggered retry storms. The retries overloaded the authentication service further. Within minutes, a partial degradation became a complete outage.
The irony: many of those services could have operated with stale authentication tokens for minutes without security risk. They could have degraded gracefully—accepting slightly increased risk in exchange for continued operation. Instead, they pursued 'correctness' and achieved 'nothing.'
This is the essence of graceful degradation: accepting that partial failure is inevitable, and designing systems that provide the best possible experience within degraded constraints. It's the recognition that serving 80% of requests perfectly is better than serving 0% of requests while waiting for perfection.
By the end of this page, you will understand the principles and patterns of graceful degradation. You'll learn how to identify degradation opportunities, implement fallback mechanisms, prioritize critical functionality, and design systems that bend without breaking when components fail.
Graceful degradation requires a fundamental shift in how we think about system design. Traditional thinking asks: 'How do we prevent failures?' Degradation thinking asks: 'Given that components will fail, how do we minimize impact?'
The Degradation Hierarchy:
Not all functionality is equally important. Graceful degradation requires understanding which features are essential and which can be sacrificed:
Core Functionality: The primary value proposition. An e-commerce site must allow purchases. A messaging app must deliver messages.
Supporting Functionality: Features that enhance core but aren't essential. Personalized recommendations, real-time analytics, AI-powered suggestions.
Optimization Features: Nice-to-haves that improve experience. Animations, preview images, predictive prefetching.
Cosmetic Features: Purely aesthetic elements. Custom fonts, fancy transitions, decorative images.
When resources become constrained, the system should shed load from the bottom up—protecting core functionality at all costs.
| Priority | Functionality | Degradation Strategy | User Impact |
|---|---|---|---|
| Core | Product catalog, cart, checkout, payment | Never degrade - protect at all costs | None - always functional |
| Supporting | Reviews, recommendations, wish lists | Serve cached/stale data, hide if unavailable | Reduced personalization |
| Optimization | Real-time inventory, dynamic pricing | Use cached values, skip updates | Slightly stale information |
| Cosmetic | High-res images, animations, videos | Use lower quality, disable effects | Minimal - mostly aesthetic |
Every system should have an explicit, documented degradation hierarchy. The middle of an incident is not the time to debate which features to sacrifice. Define the hierarchy upfront, get stakeholder buy-in, and implement automated shedding against it, as sketched below.
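To make "automated shedding" concrete, here is a minimal sketch of a controller that disables features tier by tier as load rises. The tier thresholds and feature names are illustrative assumptions, not taken from the text; in a real system the load signal would come from your existing monitoring.

```typescript
// Minimal sketch: shed features from the bottom of the hierarchy up.
// Thresholds and feature names are hypothetical examples.
type Tier = 'core' | 'supporting' | 'optimization' | 'cosmetic';

interface FeatureFlag {
  name: string;
  tier: Tier;
}

class DegradationController {
  constructor(private features: FeatureFlag[]) {}

  /**
   * loadLevel is a 0-1 pressure signal (e.g. derived from CPU or queue depth).
   * Returns the features that should stay enabled at this level.
   */
  enabledFeatures(loadLevel: number): Set<string> {
    // Each threshold disables one more tier as pressure rises; core is never shed.
    const tiersToShed = new Set<Tier>();
    if (loadLevel > 0.5) tiersToShed.add('cosmetic');
    if (loadLevel > 0.7) tiersToShed.add('optimization');
    if (loadLevel > 0.9) tiersToShed.add('supporting');

    return new Set(
      this.features
        .filter(f => !tiersToShed.has(f.tier))
        .map(f => f.name)
    );
  }
}

// Usage with hypothetical feature names:
const controller = new DegradationController([
  { name: 'checkout', tier: 'core' },
  { name: 'recommendations', tier: 'supporting' },
  { name: 'dynamic-pricing', tier: 'optimization' },
  { name: 'hero-animations', tier: 'cosmetic' }
]);

console.log(controller.enabledFeatures(0.8)); // only checkout and recommendations remain
```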
Fallbacks are the tactical implementation of degradation—alternative code paths that execute when the primary path fails. Well-designed fallbacks provide reduced-but-functional service rather than errors.
Types of Fallbacks:
Cached Data: Serve the last known good response when the live source fails, flagging it as stale.
Alternative Providers: Fall back to a secondary or non-personalized source (for example, product-level recommendations instead of personalized ones).
Static Defaults: Hard-coded values such as a bestseller list or a placeholder product.
Fail-Open Estimates: For non-critical data, assume a safe default (for example, "in stock") rather than blocking the request.
Minimal Responses: Return a valid but sparse payload so callers can still render something.
The implementation below layers these fallbacks in order, from richest to most minimal:
```typescript
// TypeScript: Comprehensive Fallback Implementation

import { Redis } from 'ioredis';
import { CircuitBreaker } from './circuit-breaker';

interface Product {
  id: string;
  name: string;
  price: number;
  description: string;
  recommendations?: string[];
  realTimeInventory?: number;
  reviews?: Review[];
}

interface Review {
  rating: number;
  text: string;
  author: string;
}

class ProductService {
  private redis: Redis;
  private circuitBreaker: CircuitBreaker;
  private defaultProducts: Map<string, Product> = new Map();

  /**
   * Get product with multi-layer fallback
   */
  async getProduct(productId: string): Promise<Product> {
    // Layer 1: Try primary service with circuit breaker
    try {
      return await this.circuitBreaker.execute(async () => {
        return await this.fetchFromPrimaryService(productId);
      });
    } catch (primaryError) {
      console.warn(`Primary service failed: ${primaryError.message}`);
    }

    // Layer 2: Try cache fallback
    try {
      const cached = await this.getCachedProduct(productId);
      if (cached) {
        console.log(`Serving cached product for ${productId}`);
        return {
          ...cached,
          _degraded: true,
          _degradedReason: 'Served from cache'
        } as Product;
      }
    } catch (cacheError) {
      console.warn(`Cache fallback failed: ${cacheError.message}`);
    }

    // Layer 3: Try static fallback
    const staticFallback = this.defaultProducts.get(productId);
    if (staticFallback) {
      console.log(`Serving static fallback for ${productId}`);
      return {
        ...staticFallback,
        _degraded: true,
        _degradedReason: 'Served from static fallback'
      } as Product;
    }

    // Layer 4: Return minimal valid response
    return this.createMinimalProduct(productId);
  }

  /**
   * Get recommendations with graceful fallback
   */
  async getRecommendations(
    userId: string,
    productId: string
  ): Promise<string[]> {
    // Try personalized recommendations
    try {
      return await this.fetchPersonalizedRecommendations(userId, productId);
    } catch (error) {
      console.warn('Personalized recommendations failed, trying fallbacks');
    }

    // Fallback 1: Non-personalized recommendations for this product
    try {
      return await this.fetchProductRecommendations(productId);
    } catch (error) {
      console.warn('Product recommendations failed');
    }

    // Fallback 2: Cached popular products
    try {
      return await this.getCachedPopularProducts();
    } catch (error) {
      console.warn('Cached popular products failed');
    }

    // Fallback 3: Static bestsellers
    return this.staticBestsellers;
  }

  /**
   * Get inventory with fail-open strategy
   */
  async getInventory(productId: string): Promise<{
    count: number;
    source: 'realtime' | 'cached' | 'estimated';
    confidence: number;
  }> {
    // Try real-time inventory
    try {
      const realtime = await this.fetchRealtimeInventory(productId);
      return { count: realtime, source: 'realtime', confidence: 1.0 };
    } catch (error) {
      console.warn('Real-time inventory unavailable');
    }

    // Try cached inventory with staleness tracking
    try {
      const cached = await this.getCachedInventory(productId);
      if (cached) {
        const ageMs = Date.now() - cached.timestamp;
        const ageMinutes = ageMs / 60000;
        // Confidence degrades with age: 100% fresh, 50% at 30min, 0% at 60min
        const confidence = Math.max(0, 1 - (ageMinutes / 60));
        return { count: cached.count, source: 'cached', confidence };
      }
    } catch (error) {
      console.warn('Cached inventory unavailable');
    }

    // Fail open: assume in stock to prevent lost sales
    // Better to oversell slightly than block purchases
    return {
      count: 999, // Arbitrary "in stock" value
      source: 'estimated',
      confidence: 0.1
    };
  }

  private createMinimalProduct(productId: string): Product {
    return {
      id: productId,
      name: 'Product Temporarily Unavailable',
      price: 0,
      description: 'Details are being loaded. Please refresh.',
      _degraded: true,
      _degradedReason: 'Minimal fallback - all data sources unavailable'
    } as Product;
  }

  private staticBestsellers = [
    'bestseller-1',
    'bestseller-2',
    'bestseller-3'
  ];
}

// Fallback-aware API response
interface DegradedResponse<T> {
  data: T;
  degraded: boolean;
  degradedFeatures?: string[];
  message?: string;
}

function createDegradedResponse<T>(
  data: T,
  degradedFeatures: string[] = []
): DegradedResponse<T> {
  return {
    data,
    degraded: degradedFeatures.length > 0,
    degradedFeatures,
    message: degradedFeatures.length > 0
      ? `Some features unavailable: ${degradedFeatures.join(', ')}`
      : undefined
  };
}
```

Fallback code paths are rarely exercised and easily bit-rot. Use chaos engineering to regularly trigger fallbacks in production (or staging). A fallback that fails when needed is worse than no fallback—it gives false confidence.
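One lightweight way to keep fallbacks exercised is to deliberately fail a small fraction of primary calls. The sketch below assumes a hypothetical FAULT_INJECTION_RATE environment variable; dedicated chaos tooling (fault-injection proxies, chaos platforms) is usually preferable, but this shows the idea at the code level.

```typescript
// Illustrative sketch only: a tiny fault-injection wrapper for exercising
// fallback paths. FAULT_INJECTION_RATE is a hypothetical configuration value.
const FAULT_INJECTION_RATE = Number(process.env.FAULT_INJECTION_RATE ?? '0');

async function withFaultInjection<T>(
  label: string,
  primary: () => Promise<T>
): Promise<T> {
  // Deliberately fail a small fraction of calls so fallback paths run regularly.
  if (Math.random() < FAULT_INJECTION_RATE) {
    throw new Error(`Injected fault for '${label}' (chaos testing)`);
  }
  return primary();
}

// Example: wrap the primary fetch inside getProduct's first layer.
// await withFaultInjection('product-service', () => this.fetchFromPrimaryService(productId));
```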
Load shedding is the controlled rejection of requests to prevent overload. Rather than attempting to serve all requests and failing at all of them, the system intentionally drops some requests to preserve capacity for others.
Why Load Shedding Matters:
When a system receives more load than it can handle, response times increase for everyone. Beyond a tipping point, the overhead of managing requests exceeds the capacity to complete them, and throughput actually decreases as load increases. This is 'thrashing.'
Load shedding prevents thrashing by maintaining a sustainable queue depth. Requests beyond capacity are rejected immediately (fast failure) rather than queued indefinitely (slow failure).
Load Shedding Strategies:
| Strategy | Description | Best For | Limitation |
|---|---|---|---|
| Queue Depth | Reject when queue exceeds threshold | Stable latency requirements | Doesn't distinguish request priority |
| Latency-Based | Reject when response times exceed SLA | SLA-driven systems | Reactive - damage already done when triggered |
| Token Bucket | Fixed rate of request tokens replenished over time | Rate limiting, API quotas | Doesn't adapt to capacity changes |
| Adaptive Concurrency | Dynamically adjust based on latency/throughput | Variable capacity systems | Complex to tune correctly |
| Priority-Based | Shed low-priority traffic first | Mixed-criticality workloads | Requires priority assignment infrastructure |
| User-Based | Shed based on user tier or quota | Multi-tenant systems | May violate fairness expectations |
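For contrast with the adaptive, priority-based implementation that follows, here is a minimal token-bucket shedder corresponding to the "Token Bucket" row above. The capacity and refill rate are illustrative values, not from the text.

```typescript
// Minimal token-bucket shedder: a fixed rate of tokens replenished over time.
// Capacity and refill rate below are illustrative assumptions.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private capacity: number,        // maximum burst size
    private refillPerSecond: number  // sustained request rate
  ) {
    this.tokens = capacity;
  }

  /** Returns true if the request may proceed, false if it should be shed. */
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Replenish tokens at a fixed rate, capped at bucket capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage: allow bursts of 100 with a sustained rate of 50 requests/second.
const bucket = new TokenBucket(100, 50);
if (!bucket.tryAcquire()) {
  // Shed the request (e.g. return 503 with a Retry-After header).
}
```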
```typescript
// TypeScript: Priority-Based Load Shedding Implementation

interface Request {
  id: string;
  userId: string;
  priority: 'critical' | 'high' | 'normal' | 'low' | 'background';
  timestamp: number;
}

interface LoadSheddingConfig {
  maxQueueDepth: number;
  targetLatencyMs: number;
  priorityQuotas: Map<Request['priority'], number>; // % of capacity per priority
  minHealthyCapacityPercent: number;
}

class AdaptiveLoadShedder {
  private config: LoadSheddingConfig;
  private currentQueueDepth: number = 0;
  private recentLatencies: number[] = [];
  private sheddingLevel: number = 0; // 0 = no shedding, 1 = max shedding

  constructor(config: LoadSheddingConfig) {
    this.config = config;
  }

  /**
   * Decide whether to accept or shed a request
   */
  shouldAccept(request: Request): {
    accept: boolean;
    reason?: string;
    retryAfterMs?: number;
  } {
    // Always accept critical requests (payments, auth)
    if (request.priority === 'critical') {
      return { accept: true };
    }

    // Calculate current load pressure (0-1 scale)
    const loadPressure = this.calculateLoadPressure();

    // Update shedding level with smoothing
    this.sheddingLevel = this.sheddingLevel * 0.8 + loadPressure * 0.2;

    // Determine acceptance based on priority and shedding level
    const acceptanceThresholds: Record<Request['priority'], number> = {
      critical: 1.0,   // Always accept
      high: 0.8,       // Shed when shedding level > 80%
      normal: 0.5,     // Shed when shedding level > 50%
      low: 0.3,        // Shed when shedding level > 30%
      background: 0.1  // Shed when shedding level > 10%
    };

    const threshold = acceptanceThresholds[request.priority];

    if (this.sheddingLevel > threshold) {
      const retryAfter = this.calculateRetryDelay(request.priority);
      return {
        accept: false,
        reason: `Load shedding active (level: ${(this.sheddingLevel * 100).toFixed(1)}%, threshold for ${request.priority}: ${threshold * 100}%)`,
        retryAfterMs: retryAfter
      };
    }

    return { accept: true };
  }

  /**
   * Calculate load pressure from multiple signals
   */
  private calculateLoadPressure(): number {
    // Signal 1: Queue depth (normalized to 0-1)
    const queuePressure = Math.min(1, this.currentQueueDepth / this.config.maxQueueDepth);

    // Signal 2: Latency vs target (normalized to 0-1)
    const avgLatency = this.recentLatencies.length > 0
      ? this.recentLatencies.reduce((a, b) => a + b, 0) / this.recentLatencies.length
      : 0;
    const latencyPressure = Math.min(1, avgLatency / (this.config.targetLatencyMs * 2));

    // Combine signals (weighted average)
    return queuePressure * 0.6 + latencyPressure * 0.4;
  }

  /**
   * Calculate retry delay based on priority (lower priority = longer wait)
   */
  private calculateRetryDelay(priority: Request['priority']): number {
    const baseDelayMs = 1000;
    const priorityMultipliers: Record<Request['priority'], number> = {
      critical: 0,
      high: 1,
      normal: 2,
      low: 5,
      background: 10
    };
    // Add jitter to prevent thundering herd
    const jitter = Math.random() * 500;
    return baseDelayMs * priorityMultipliers[priority] + jitter;
  }

  /**
   * Record completed request latency for adaptive adjustment
   */
  recordLatency(latencyMs: number) {
    this.recentLatencies.push(latencyMs);
    // Keep sliding window of last 100 requests
    if (this.recentLatencies.length > 100) {
      this.recentLatencies.shift();
    }
  }

  /**
   * Update current queue depth
   */
  updateQueueDepth(depth: number) {
    this.currentQueueDepth = depth;
  }

  /**
   * Get current shedding status for monitoring
   */
  getStatus(): {
    sheddingLevel: number;
    queueDepth: number;
    avgLatencyMs: number;
    acceptanceRates: Record<Request['priority'], boolean>;
  } {
    const avgLatency = this.recentLatencies.length > 0
      ? this.recentLatencies.reduce((a, b) => a + b, 0) / this.recentLatencies.length
      : 0;

    return {
      sheddingLevel: this.sheddingLevel,
      queueDepth: this.currentQueueDepth,
      avgLatencyMs: avgLatency,
      acceptanceRates: {
        critical: true,
        high: this.sheddingLevel <= 0.8,
        normal: this.sheddingLevel <= 0.5,
        low: this.sheddingLevel <= 0.3,
        background: this.sheddingLevel <= 0.1
      }
    };
  }
}
```

When shedding requests, return HTTP 503 Service Unavailable with a Retry-After header indicating when to retry. Well-behaved clients will respect this, reducing retry storm pressure. Avoid 500 errors, which clients may interpret as immediately retryable.
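One way to apply that advice is at the HTTP layer. The sketch below assumes an Express app and a hypothetical getPriorityFor() classifier; neither is from the original text. It wires the shedder in as middleware and translates a shed decision into 503 plus Retry-After.

```typescript
// Sketch: AdaptiveLoadShedder as Express middleware (assumed setup, illustrative values).
import express from 'express';

const app = express();
const shedder = new AdaptiveLoadShedder({
  maxQueueDepth: 200,
  targetLatencyMs: 250,
  priorityQuotas: new Map(),
  minHealthyCapacityPercent: 20
});

// Hypothetical classifier: in practice this might look at the route or user tier.
function getPriorityFor(req: express.Request): 'critical' | 'high' | 'normal' | 'low' | 'background' {
  return req.path.startsWith('/checkout') ? 'critical' : 'normal';
}

app.use((req, res, next) => {
  const decision = shedder.shouldAccept({
    id: `${Date.now()}-${Math.random()}`,
    userId: req.header('x-user-id') ?? 'anonymous',
    priority: getPriorityFor(req),
    timestamp: Date.now()
  });

  if (!decision.accept) {
    // 503 + Retry-After tells well-behaved clients when to come back.
    res.set('Retry-After', String(Math.ceil((decision.retryAfterMs ?? 1000) / 1000)));
    res.status(503).json({ error: 'Service temporarily overloaded', reason: decision.reason });
    return;
  }

  // Feed completed-request latency back into the shedder.
  const start = Date.now();
  res.on('finish', () => shedder.recordLatency(Date.now() - start));
  next();
});
```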
The circuit breaker pattern is a cornerstone of graceful degradation. When a dependency fails repeatedly, the circuit breaker 'opens,' preventing further calls to the failing service. This protects both the caller (from timeout overhead) and the callee (from overload during recovery).
Circuit Breaker States:
Closed: Normal operation. Calls pass through while failures are counted; when the failure threshold or failure rate is exceeded, the circuit trips to open.
Open: Calls fail fast without reaching the dependency, giving it time to recover. After a cooldown period, the circuit moves to half-open.
Half-Open: A limited number of test calls are allowed through. Sustained success closes the circuit again; any failure reopens it.
Advanced Circuit Breaker Considerations:
Production circuit breakers typically go beyond the basic state machine: failure-rate thresholds with a minimum call count (so one early failure doesn't trip the circuit), a cap on test calls while half-open, gradual multi-step recovery rather than an immediate return to full traffic, a retry delay exposed to callers, and metrics emitted on every state transition. The implementation below demonstrates each of these.
```typescript
// TypeScript: Advanced Circuit Breaker with Gradual Recovery

type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitBreakerConfig {
  failureThreshold: number;      // Failures to trip circuit
  failureRateThreshold: number;  // Failure rate to trip (0-1)
  minimumCalls: number;          // Minimum calls before rate calculation
  openDurationMs: number;        // How long to stay open
  halfOpenMaxCalls: number;      // Test calls allowed in half-open
  recoverySteps: number;         // Gradual recovery steps
  resetTimeoutMs: number;        // Time before counters reset
}

interface CircuitMetrics {
  totalCalls: number;
  failedCalls: number;
  successfulCalls: number;
  lastFailureTime: number;
  lastSuccessTime: number;
  consecutiveSuccesses: number;
  consecutiveFailures: number;
}

class AdvancedCircuitBreaker {
  private state: CircuitState = 'closed';
  private config: CircuitBreakerConfig;
  private metrics: CircuitMetrics;
  private stateChangedAt: number = Date.now();
  private halfOpenCallCount: number = 0;
  private recoveryStep: number = 0; // Current step in gradual recovery

  constructor(private name: string, config: CircuitBreakerConfig) {
    this.config = config;
    this.metrics = this.createEmptyMetrics();
  }

  /**
   * Execute a call through the circuit breaker
   */
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if we should allow this call
    if (!this.allowRequest()) {
      throw new CircuitOpenError(
        `Circuit '${this.name}' is open. Retry after ${this.getRetryDelay()}ms`
      );
    }

    const startTime = Date.now();

    try {
      const result = await operation();
      this.recordSuccess(Date.now() - startTime);
      return result;
    } catch (error) {
      this.recordFailure(Date.now() - startTime);
      throw error;
    }
  }

  /**
   * Determine if request should be allowed based on circuit state
   */
  private allowRequest(): boolean {
    const now = Date.now();

    switch (this.state) {
      case 'closed':
        return true;

      case 'open':
        // Check if cooldown has passed
        if (now - this.stateChangedAt >= this.config.openDurationMs) {
          this.transitionTo('half-open');
          return this.allowHalfOpenRequest();
        }
        return false;

      case 'half-open':
        return this.allowHalfOpenRequest();
    }
  }

  /**
   * Gradual recovery in half-open state
   */
  private allowHalfOpenRequest(): boolean {
    // Limit total calls in half-open
    if (this.halfOpenCallCount >= this.config.halfOpenMaxCalls) {
      return false;
    }
    this.halfOpenCallCount++;
    return true;
  }

  /**
   * Record successful call
   */
  private recordSuccess(latencyMs: number) {
    this.metrics.totalCalls++;
    this.metrics.successfulCalls++;
    this.metrics.consecutiveSuccesses++;
    this.metrics.consecutiveFailures = 0;
    this.metrics.lastSuccessTime = Date.now();

    if (this.state === 'half-open') {
      // Check if we should recover
      if (this.metrics.consecutiveSuccesses >= this.getSuccessesNeededForRecovery()) {
        this.recoveryStep++;

        if (this.recoveryStep >= this.config.recoverySteps) {
          // Full recovery
          this.transitionTo('closed');
        } else {
          // Partial recovery - reset half-open counter for more tests
          this.halfOpenCallCount = 0;
          console.log(`Circuit '${this.name}': Recovery step ${this.recoveryStep}/${this.config.recoverySteps}`);
        }
      }
    }
  }

  /**
   * Record failed call
   */
  private recordFailure(latencyMs: number) {
    this.metrics.totalCalls++;
    this.metrics.failedCalls++;
    this.metrics.consecutiveFailures++;
    this.metrics.consecutiveSuccesses = 0;
    this.metrics.lastFailureTime = Date.now();

    if (this.state === 'closed') {
      // Check if we should trip the circuit
      if (this.shouldTrip()) {
        this.transitionTo('open');
      }
    } else if (this.state === 'half-open') {
      // Any failure in half-open returns to open
      this.transitionTo('open');
    }
  }

  /**
   * Determine if circuit should trip based on failure rate
   */
  private shouldTrip(): boolean {
    // Check consecutive failures
    if (this.metrics.consecutiveFailures >= this.config.failureThreshold) {
      return true;
    }

    // Check failure rate (with minimum call threshold)
    if (this.metrics.totalCalls >= this.config.minimumCalls) {
      const failureRate = this.metrics.failedCalls / this.metrics.totalCalls;
      if (failureRate >= this.config.failureRateThreshold) {
        return true;
      }
    }

    return false;
  }

  /**
   * Transition to new state with logging
   */
  private transitionTo(newState: CircuitState) {
    const oldState = this.state;
    this.state = newState;
    this.stateChangedAt = Date.now();

    if (newState === 'half-open') {
      this.halfOpenCallCount = 0;
    }

    if (newState === 'closed') {
      this.metrics = this.createEmptyMetrics();
      this.recoveryStep = 0;
    }

    console.log(`Circuit '${this.name}': ${oldState} → ${newState}`);

    // Emit metrics (circuitStateGauge is assumed to be defined elsewhere, e.g. a Prometheus gauge)
    circuitStateGauge.set({ circuit: this.name, state: newState }, 1);
  }

  /**
   * Calculate successes needed at current recovery step
   */
  private getSuccessesNeededForRecovery(): number {
    // More successes needed at each recovery step
    return 2 + this.recoveryStep;
  }

  /**
   * Get retry delay for clients
   */
  getRetryDelay(): number {
    if (this.state === 'open') {
      const elapsed = Date.now() - this.stateChangedAt;
      const remaining = this.config.openDurationMs - elapsed;
      return Math.max(0, remaining);
    }
    return 0;
  }

  private createEmptyMetrics(): CircuitMetrics {
    return {
      totalCalls: 0,
      failedCalls: 0,
      successfulCalls: 0,
      lastFailureTime: 0,
      lastSuccessTime: 0,
      consecutiveSuccesses: 0,
      consecutiveFailures: 0
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}
```

Monitor circuit state transitions closely. Frequent trips indicate unstable dependencies. Circuits that never trip may have thresholds set too high. Track the time spent in each state and correlate it with incident timelines.
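A caller typically pairs the breaker with a fallback so that an open circuit degrades rather than errors. The sketch below is illustrative: the configuration values, fetchProfileFromService(), and serveCachedProfile() are assumptions standing in for your real dependency call and fallback.

```typescript
// Sketch: combining the circuit breaker with a fallback path.
// The helpers below are hypothetical stand-ins for real code.
declare function fetchProfileFromService(userId: string): Promise<unknown>;
declare function serveCachedProfile(userId: string): Promise<unknown>;

const profileBreaker = new AdvancedCircuitBreaker('profile-service', {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  minimumCalls: 20,
  openDurationMs: 30_000,
  halfOpenMaxCalls: 3,
  recoverySteps: 3,
  resetTimeoutMs: 60_000
});

async function getProfile(userId: string) {
  try {
    return await profileBreaker.execute(() => fetchProfileFromService(userId));
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit is open: skip the doomed call entirely and serve degraded data.
      return serveCachedProfile(userId);
    }
    throw error;
  }
}
```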
Cascading failures occur when one component's failure triggers failures in dependent components, which trigger further failures, until the entire system collapses. Understanding and preventing these cascades is essential for robust degradation.
Cascade Trigger Patterns:
| Trigger Pattern | How It Cascades | Prevention Strategy |
|---|---|---|
| Retry Storms | Failed requests trigger retries, overloading already stressed services | Exponential backoff, jitter, circuit breakers |
| Connection Pool Exhaustion | Slow dependencies hold connections, starving other operations | Timeout connections, separate pools per dependency |
| Queue Backup | Processing slows, queues grow, memory exhausted, OOM crash | Queue depth limits, dead letter queues, backpressure |
| Cache Stampede | Cache expires, all requests hit database simultaneously | Staggered expiration, cache warming, request coalescing |
| Resource Contention | Heavy operations starve lightweight operations | Resource isolation, priority queues, rate limiting |
| Health Check Overload | Health checks consume resources during stress, worsening degradation | Lightweight health checks, separate thread pool |
```typescript
// TypeScript: Cascade Prevention Patterns

import Bottleneck from 'bottleneck';

/**
 * Request Coalescing: Prevent cache stampede by combining identical requests
 */
class RequestCoalescer<T> {
  private inFlightRequests: Map<string, Promise<T>> = new Map();

  async coalesce(key: string, fetchFn: () => Promise<T>): Promise<T> {
    // If request for this key is already in flight, wait for it
    const existing = this.inFlightRequests.get(key);
    if (existing) {
      console.log(`Coalescing request for key: ${key}`);
      return existing;
    }

    // Start new request
    const promise = fetchFn().finally(() => {
      this.inFlightRequests.delete(key);
    });

    this.inFlightRequests.set(key, promise);
    return promise;
  }
}

/**
 * Bulkhead Pattern: Isolate resources per dependency
 */
class BulkheadedClient {
  private limiters: Map<string, Bottleneck>;

  constructor(private config: {
    services: string[];
    maxConcurrent: number;
    maxQueued: number;
  }) {
    this.limiters = new Map();
    for (const service of config.services) {
      this.limiters.set(service, new Bottleneck({
        maxConcurrent: config.maxConcurrent,
        highWater: config.maxQueued,
        strategy: Bottleneck.strategy.OVERFLOW // Reject when full
      }));
    }
  }

  async call<T>(
    service: string,
    operation: () => Promise<T>
  ): Promise<T> {
    const limiter = this.limiters.get(service);
    if (!limiter) {
      throw new Error(`Unknown service: ${service}`);
    }

    try {
      return await limiter.schedule(operation);
    } catch (error) {
      if (error.message === 'This limiter is overflowed') {
        throw new BulkheadFullError(
          `Bulkhead for ${service} is full. Request rejected.`
        );
      }
      throw error;
    }
  }
}

class BulkheadFullError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BulkheadFullError';
  }
}

/**
 * Backpressure Handler: Propagate capacity limits upstream
 */
class BackpressureHandler {
  private currentPressure: number = 0;
  private pressureThresholds = {
    low: 0.5,
    medium: 0.7,
    high: 0.9,
    critical: 0.95
  };

  updatePressure(queueDepth: number, maxDepth: number) {
    this.currentPressure = queueDepth / maxDepth;
  }

  /**
   * Get recommended delay for upstream callers
   */
  getBackpressureSignal(): {
    level: 'none' | 'low' | 'medium' | 'high' | 'critical';
    recommendedDelayMs: number;
    acceptingNewRequests: boolean;
  } {
    if (this.currentPressure < this.pressureThresholds.low) {
      return { level: 'none', recommendedDelayMs: 0, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.medium) {
      return { level: 'low', recommendedDelayMs: 100, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.high) {
      return { level: 'medium', recommendedDelayMs: 500, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.critical) {
      return { level: 'high', recommendedDelayMs: 2000, acceptingNewRequests: true };
    }
    return { level: 'critical', recommendedDelayMs: 5000, acceptingNewRequests: false };
  }
}

/**
 * Retry with exponential backoff and jitter
 */
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  config: {
    maxRetries: number;
    baseDelayMs: number;
    maxDelayMs: number;
    jitterMs: number;
  }
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (attempt === config.maxRetries) {
        throw lastError;
      }

      // Calculate delay with exponential backoff
      const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

      // Add jitter to prevent thundering herd
      const jitter = Math.random() * config.jitterMs;
      const finalDelay = cappedDelay + jitter;

      console.log(`Retry attempt ${attempt + 1}/${config.maxRetries} after ${Math.round(finalDelay)}ms`);
      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }

  throw lastError;
}
```

In complex systems, the first failure is rarely the catastrophic one—it's the second, third, and fourth failures triggered by the response to the first. Design your degradation strategies to reduce system load during failures, not amplify it.
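These patterns are most effective in combination. The sketch below shows one way to compose them on a single read path; the service name, cache key, and loadUserFromDb() helper are illustrative assumptions.

```typescript
// Sketch: composing coalescing, bulkheading, and bounded retries on one read path.
// loadUserFromDb is a hypothetical stand-in for the real database call.
declare function loadUserFromDb(userId: string): Promise<string>;

const coalescer = new RequestCoalescer<string>();
const bulkheads = new BulkheadedClient({
  services: ['user-db'],
  maxConcurrent: 20,
  maxQueued: 100
});

async function getUser(userId: string): Promise<string> {
  // Identical concurrent lookups collapse into one in-flight request,
  // the bulkhead caps concurrency against the database, and retries
  // back off with jitter instead of hammering a struggling dependency.
  return coalescer.coalesce(`user:${userId}`, () =>
    bulkheads.call('user-db', () =>
      retryWithBackoff(() => loadUserFromDb(userId), {
        maxRetries: 2,
        baseDelayMs: 100,
        maxDelayMs: 1000,
        jitterMs: 100
      })
    )
  );
}
```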
Graceful degradation is the art of maintaining useful service when perfection is impossible. It requires intentional design, clear prioritization, and robust fallback mechanisms.
What's next:
Graceful degradation handles failures at the application level. But when servers fail entirely, the system needs to replace them—automatically routing traffic to surviving instances and potentially spinning up new capacity. The final page in this module explores failover strategies: how traffic is redirected and how systems recover.
You now understand the principles and patterns of graceful degradation—how to maintain partial service during failures, prioritize critical functionality, and prevent cascade failures. Next, we'll explore comprehensive failover strategies.