In December 2020, a major cloud provider experienced a cascading outage that took down authentication services for thousands of companies. The root cause wasn't a single component failure—it was the response to that failure. When the authentication cache became overloaded, services that couldn't authenticate users returned errors. Those errors triggered retry storms. The retries overloaded the authentication service further. Within minutes, a partial degradation became a complete outage.
The irony: many of those services could have operated with stale authentication tokens for minutes without security risk. They could have degraded gracefully—accepting slightly increased risk in exchange for continued operation. Instead, they pursued 'correctness' and achieved 'nothing.'
This is the essence of graceful degradation: accepting that partial failure is inevitable, and designing systems that provide the best possible experience within degraded constraints. It's the recognition that serving 80% of requests perfectly is better than serving 0% of requests while waiting for perfection.
By the end of this page, you will understand the principles and patterns of graceful degradation. You'll learn how to identify degradation opportunities, implement fallback mechanisms, prioritize critical functionality, and design systems that bend without breaking when components fail.
Graceful degradation requires a fundamental shift in how we think about system design. Traditional thinking asks: 'How do we prevent failures?' Degradation thinking asks: 'Given that components will fail, how do we minimize impact?'
The Degradation Hierarchy:
Not all functionality is equally important. Graceful degradation requires understanding which features are essential and which can be sacrificed:
Core Functionality: The primary value proposition. An e-commerce site must allow purchases. A messaging app must deliver messages.
Supporting Functionality: Features that enhance core but aren't essential. Personalized recommendations, real-time analytics, AI-powered suggestions.
Optimization Features: Nice-to-haves that improve experience. Animations, preview images, predictive prefetching.
Cosmetic Features: Purely aesthetic elements. Custom fonts, fancy transitions, decorative images.
When resources become constrained, the system should shed load from the bottom up—protecting core functionality at all costs.
| Priority | Functionality | Degradation Strategy | User Impact |
|---|---|---|---|
| Core | Product catalog, cart, checkout, payment | Never degrade - protect at all costs | None - always functional |
| Supporting | Reviews, recommendations, wish lists | Serve cached/stale data, hide if unavailable | Reduced personalization |
| Optimization | Real-time inventory, dynamic pricing | Use cached values, skip updates | Slightly stale information |
| Cosmetic | High-res images, animations, videos | Use lower quality, disable effects | Minimal - mostly aesthetic |
Every system should have an explicit, documented degradation hierarchy. The middle of an incident is not the time to debate which features to sacrifice. Define the hierarchy upfront, get stakeholder buy-in, and implement automated shedding against it, as sketched below.
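To make "automated shedding" concrete, here is a minimal sketch of a controller that disables features tier by tier as load rises. The tier thresholds and feature names are illustrative assumptions, not taken from the text; in a real system the load signal would come from your existing monitoring.

```typescript
// Minimal sketch: shed features from the bottom of the hierarchy up.
// Thresholds and feature names are hypothetical examples.
type Tier = 'core' | 'supporting' | 'optimization' | 'cosmetic';

interface FeatureFlag {
  name: string;
  tier: Tier;
}

class DegradationController {
  constructor(private features: FeatureFlag[]) {}

  /**
   * loadLevel is a 0-1 pressure signal (e.g. derived from CPU or queue depth).
   * Returns the features that should stay enabled at this level.
   */
  enabledFeatures(loadLevel: number): Set<string> {
    // Each threshold disables one more tier as pressure rises; core is never shed.
    const tiersToShed = new Set<Tier>();
    if (loadLevel > 0.5) tiersToShed.add('cosmetic');
    if (loadLevel > 0.7) tiersToShed.add('optimization');
    if (loadLevel > 0.9) tiersToShed.add('supporting');

    return new Set(
      this.features
        .filter(f => !tiersToShed.has(f.tier))
        .map(f => f.name)
    );
  }
}

// Usage with hypothetical feature names:
const controller = new DegradationController([
  { name: 'checkout', tier: 'core' },
  { name: 'recommendations', tier: 'supporting' },
  { name: 'dynamic-pricing', tier: 'optimization' },
  { name: 'hero-animations', tier: 'cosmetic' }
]);

console.log(controller.enabledFeatures(0.8)); // only checkout and recommendations remain
```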
Fallbacks are the tactical implementation of degradation—alternative code paths that execute when the primary path fails. Well-designed fallbacks provide reduced-but-functional service rather than errors.
Types of Fallbacks:
Cached Data: Serve the last known good response when the live source fails, flagging it as stale.
Alternative Providers: Fall back to a secondary or non-personalized source (for example, product-level recommendations instead of personalized ones).
Static Defaults: Hard-coded values such as a bestseller list or a placeholder product.
Fail-Open Estimates: For non-critical data, assume a safe default (for example, "in stock") rather than blocking the request.
Minimal Responses: Return a valid but sparse payload so callers can still render something.
The implementation below layers these fallbacks in order, from richest to most minimal:
```typescript
// TypeScript: Comprehensive Fallback Implementation

import { Redis } from 'ioredis';
import { CircuitBreaker } from './circuit-breaker';

interface Product {
  id: string;
  name: string;
  price: number;
  description: string;
  recommendations?: string[];
  realTimeInventory?: number;
  reviews?: Review[];
}

interface Review {
  rating: number;
  text: string;
  author: string;
}

class ProductService {
  private redis: Redis;
  private circuitBreaker: CircuitBreaker;
  private defaultProducts: Map<string, Product> = new Map();

  /**
   * Get product with multi-layer fallback
   */
  async getProduct(productId: string): Promise<Product> {
    // Layer 1: Try primary service with circuit breaker
    try {
      return await this.circuitBreaker.execute(async () => {
        return await this.fetchFromPrimaryService(productId);
      });
    } catch (primaryError) {
      console.warn(`Primary service failed: ${primaryError.message}`);
    }

    // Layer 2: Try cache fallback
    try {
      const cached = await this.getCachedProduct(productId);
      if (cached) {
        console.log(`Serving cached product for ${productId}`);
        return {
          ...cached,
          _degraded: true,
          _degradedReason: 'Served from cache'
        } as Product;
      }
    } catch (cacheError) {
      console.warn(`Cache fallback failed: ${cacheError.message}`);
    }

    // Layer 3: Try static fallback
    const staticFallback = this.defaultProducts.get(productId);
    if (staticFallback) {
      console.log(`Serving static fallback for ${productId}`);
      return {
        ...staticFallback,
        _degraded: true,
        _degradedReason: 'Served from static fallback'
      } as Product;
    }

    // Layer 4: Return minimal valid response
    return this.createMinimalProduct(productId);
  }

  /**
   * Get recommendations with graceful fallback
   */
  async getRecommendations(
    userId: string,
    productId: string
  ): Promise<string[]> {
    // Try personalized recommendations
    try {
      return await this.fetchPersonalizedRecommendations(userId, productId);
    } catch (error) {
      console.warn('Personalized recommendations failed, trying fallbacks');
    }

    // Fallback 1: Non-personalized recommendations for this product
    try {
      return await this.fetchProductRecommendations(productId);
    } catch (error) {
      console.warn('Product recommendations failed');
    }

    // Fallback 2: Cached popular products
    try {
      return await this.getCachedPopularProducts();
    } catch (error) {
      console.warn('Cached popular products failed');
    }

    // Fallback 3: Static bestsellers
    return this.staticBestsellers;
  }

  /**
   * Get inventory with fail-open strategy
   */
  async getInventory(productId: string): Promise<{
    count: number;
    source: 'realtime' | 'cached' | 'estimated';
    confidence: number;
  }> {
    // Try real-time inventory
    try {
      const realtime = await this.fetchRealtimeInventory(productId);
      return { count: realtime, source: 'realtime', confidence: 1.0 };
    } catch (error) {
      console.warn('Real-time inventory unavailable');
    }

    // Try cached inventory with staleness tracking
    try {
      const cached = await this.getCachedInventory(productId);
      if (cached) {
        const ageMs = Date.now() - cached.timestamp;
        const ageMinutes = ageMs / 60000;
        // Confidence degrades with age: 100% fresh, 50% at 30min, 0% at 60min
        const confidence = Math.max(0, 1 - (ageMinutes / 60));
        return { count: cached.count, source: 'cached', confidence };
      }
    } catch (error) {
      console.warn('Cached inventory unavailable');
    }

    // Fail open: assume in stock to prevent lost sales
    // Better to oversell slightly than block purchases
    return {
      count: 999, // Arbitrary "in stock" value
      source: 'estimated',
      confidence: 0.1
    };
  }

  private createMinimalProduct(productId: string): Product {
    return {
      id: productId,
      name: 'Product Temporarily Unavailable',
      price: 0,
      description: 'Details are being loaded. Please refresh.',
      _degraded: true,
      _degradedReason: 'Minimal fallback - all data sources unavailable'
    } as Product;
  }

  private staticBestsellers = [
    'bestseller-1',
    'bestseller-2',
    'bestseller-3'
  ];
}

// Fallback-aware API response
interface DegradedResponse<T> {
  data: T;
  degraded: boolean;
  degradedFeatures?: string[];
  message?: string;
}

function createDegradedResponse<T>(
  data: T,
  degradedFeatures: string[] = []
): DegradedResponse<T> {
  return {
    data,
    degraded: degradedFeatures.length > 0,
    degradedFeatures,
    message: degradedFeatures.length > 0
      ? `Some features unavailable: ${degradedFeatures.join(', ')}`
      : undefined
  };
}
```

Fallback code paths are rarely exercised and easily bit-rot. Use chaos engineering to regularly trigger fallbacks in production (or staging). A fallback that fails when needed is worse than no fallback—it gives false confidence.
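One lightweight way to keep fallbacks exercised is to deliberately fail a small fraction of primary calls. The sketch below assumes a hypothetical FAULT_INJECTION_RATE environment variable; dedicated chaos tooling (fault-injection proxies, chaos platforms) is usually preferable, but this shows the idea at the code level.

```typescript
// Illustrative sketch only: a tiny fault-injection wrapper for exercising
// fallback paths. FAULT_INJECTION_RATE is a hypothetical configuration value.
const FAULT_INJECTION_RATE = Number(process.env.FAULT_INJECTION_RATE ?? '0');

async function withFaultInjection<T>(
  label: string,
  primary: () => Promise<T>
): Promise<T> {
  // Deliberately fail a small fraction of calls so fallback paths run regularly.
  if (Math.random() < FAULT_INJECTION_RATE) {
    throw new Error(`Injected fault for '${label}' (chaos testing)`);
  }
  return primary();
}

// Example: wrap the primary fetch inside getProduct's first layer.
// await withFaultInjection('product-service', () => this.fetchFromPrimaryService(productId));
```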
Load shedding is the controlled rejection of requests to prevent overload. Rather than attempting to serve all requests and failing at all of them, the system intentionally drops some requests to preserve capacity for others.
Why Load Shedding Matters:
When a system receives more load than it can handle, response times increase for everyone. Beyond a tipping point, the overhead of managing requests exceeds the capacity to complete them, and throughput actually decreases as load increases. This is 'thrashing.'
Load shedding prevents thrashing by maintaining a sustainable queue depth. Requests beyond capacity are rejected immediately (fast failure) rather than queued indefinitely (slow failure).
Load Shedding Strategies:
| Strategy | Description | Best For | Limitation |
|---|---|---|---|
| Queue Depth | Reject when queue exceeds threshold | Stable latency requirements | Doesn't distinguish request priority |
| Latency-Based | Reject when response times exceed SLA | SLA-driven systems | Reactive - damage already done when triggered |
| Token Bucket | Fixed rate of request tokens replenished over time | Rate limiting, API quotas | Doesn't adapt to capacity changes |
| Adaptive Concurrency | Dynamically adjust based on latency/throughput | Variable capacity systems | Complex to tune correctly |
| Priority-Based | Shed low-priority traffic first | Mixed-criticality workloads | Requires priority assignment infrastructure |
| User-Based | Shed based on user tier or quota | Multi-tenant systems | May violate fairness expectations |
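For contrast with the adaptive, priority-based implementation that follows, here is a minimal token-bucket shedder corresponding to the "Token Bucket" row above. The capacity and refill rate are illustrative values, not from the text.

```typescript
// Minimal token-bucket shedder: a fixed rate of tokens replenished over time.
// Capacity and refill rate below are illustrative assumptions.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private capacity: number,        // maximum burst size
    private refillPerSecond: number  // sustained request rate
  ) {
    this.tokens = capacity;
  }

  /** Returns true if the request may proceed, false if it should be shed. */
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    // Replenish tokens at a fixed rate, capped at bucket capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage: allow bursts of 100 with a sustained rate of 50 requests/second.
const bucket = new TokenBucket(100, 50);
if (!bucket.tryAcquire()) {
  // Shed the request (e.g. return 503 with a Retry-After header).
}
```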
```typescript
// TypeScript: Priority-Based Load Shedding Implementation

interface Request {
  id: string;
  userId: string;
  priority: 'critical' | 'high' | 'normal' | 'low' | 'background';
  timestamp: number;
}

interface LoadSheddingConfig {
  maxQueueDepth: number;
  targetLatencyMs: number;
  priorityQuotas: Map<Request['priority'], number>; // % of capacity per priority
  minHealthyCapacityPercent: number;
}

class AdaptiveLoadShedder {
  private config: LoadSheddingConfig;
  private currentQueueDepth: number = 0;
  private recentLatencies: number[] = [];
  private sheddingLevel: number = 0; // 0 = no shedding, 1 = max shedding

  constructor(config: LoadSheddingConfig) {
    this.config = config;
  }

  /**
   * Decide whether to accept or shed a request
   */
  shouldAccept(request: Request): {
    accept: boolean;
    reason?: string;
    retryAfterMs?: number;
  } {
    // Always accept critical requests (payments, auth)
    if (request.priority === 'critical') {
      return { accept: true };
    }

    // Calculate current load pressure (0-1 scale)
    const loadPressure = this.calculateLoadPressure();

    // Update shedding level with smoothing
    this.sheddingLevel = this.sheddingLevel * 0.8 + loadPressure * 0.2;

    // Determine acceptance based on priority and shedding level
    const acceptanceThresholds: Record<Request['priority'], number> = {
      critical: 1.0,   // Always accept
      high: 0.8,       // Shed when shedding level > 80%
      normal: 0.5,     // Shed when shedding level > 50%
      low: 0.3,        // Shed when shedding level > 30%
      background: 0.1  // Shed when shedding level > 10%
    };

    const threshold = acceptanceThresholds[request.priority];

    if (this.sheddingLevel > threshold) {
      const retryAfter = this.calculateRetryDelay(request.priority);
      return {
        accept: false,
        reason: `Load shedding active (level: ${(this.sheddingLevel * 100).toFixed(1)}%, threshold for ${request.priority}: ${threshold * 100}%)`,
        retryAfterMs: retryAfter
      };
    }

    return { accept: true };
  }

  /**
   * Calculate load pressure from multiple signals
   */
  private calculateLoadPressure(): number {
    // Signal 1: Queue depth (normalized to 0-1)
    const queuePressure = Math.min(1, this.currentQueueDepth / this.config.maxQueueDepth);

    // Signal 2: Latency vs target (normalized to 0-1)
    const avgLatency = this.recentLatencies.length > 0
      ? this.recentLatencies.reduce((a, b) => a + b, 0) / this.recentLatencies.length
      : 0;
    const latencyPressure = Math.min(1, avgLatency / (this.config.targetLatencyMs * 2));

    // Combine signals (weighted average)
    return queuePressure * 0.6 + latencyPressure * 0.4;
  }

  /**
   * Calculate retry delay based on priority (lower priority = longer wait)
   */
  private calculateRetryDelay(priority: Request['priority']): number {
    const baseDelayMs = 1000;
    const priorityMultipliers: Record<Request['priority'], number> = {
      critical: 0,
      high: 1,
      normal: 2,
      low: 5,
      background: 10
    };
    // Add jitter to prevent thundering herd
    const jitter = Math.random() * 500;
    return baseDelayMs * priorityMultipliers[priority] + jitter;
  }

  /**
   * Record completed request latency for adaptive adjustment
   */
  recordLatency(latencyMs: number) {
    this.recentLatencies.push(latencyMs);
    // Keep sliding window of last 100 requests
    if (this.recentLatencies.length > 100) {
      this.recentLatencies.shift();
    }
  }

  /**
   * Update current queue depth
   */
  updateQueueDepth(depth: number) {
    this.currentQueueDepth = depth;
  }

  /**
   * Get current shedding status for monitoring
   */
  getStatus(): {
    sheddingLevel: number;
    queueDepth: number;
    avgLatencyMs: number;
    acceptanceRates: Record<Request['priority'], boolean>;
  } {
    const avgLatency = this.recentLatencies.length > 0
      ? this.recentLatencies.reduce((a, b) => a + b, 0) / this.recentLatencies.length
      : 0;

    return {
      sheddingLevel: this.sheddingLevel,
      queueDepth: this.currentQueueDepth,
      avgLatencyMs: avgLatency,
      acceptanceRates: {
        critical: true,
        high: this.sheddingLevel <= 0.8,
        normal: this.sheddingLevel <= 0.5,
        low: this.sheddingLevel <= 0.3,
        background: this.sheddingLevel <= 0.1
      }
    };
  }
}
```

When shedding requests, return HTTP 503 Service Unavailable with a Retry-After header indicating when to retry. Well-behaved clients will respect this, reducing retry storm pressure. Avoid 500 errors, which clients may interpret as immediately retryable.
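One way to apply that advice is at the HTTP layer. The sketch below assumes an Express app and a hypothetical getPriorityFor() classifier; neither is from the original text. It wires the shedder in as middleware and translates a shed decision into 503 plus Retry-After.

```typescript
// Sketch: AdaptiveLoadShedder as Express middleware (assumed setup, illustrative values).
import express from 'express';

const app = express();
const shedder = new AdaptiveLoadShedder({
  maxQueueDepth: 200,
  targetLatencyMs: 250,
  priorityQuotas: new Map(),
  minHealthyCapacityPercent: 20
});

// Hypothetical classifier: in practice this might look at the route or user tier.
function getPriorityFor(req: express.Request): 'critical' | 'high' | 'normal' | 'low' | 'background' {
  return req.path.startsWith('/checkout') ? 'critical' : 'normal';
}

app.use((req, res, next) => {
  const decision = shedder.shouldAccept({
    id: `${Date.now()}-${Math.random()}`,
    userId: req.header('x-user-id') ?? 'anonymous',
    priority: getPriorityFor(req),
    timestamp: Date.now()
  });

  if (!decision.accept) {
    // 503 + Retry-After tells well-behaved clients when to come back.
    res.set('Retry-After', String(Math.ceil((decision.retryAfterMs ?? 1000) / 1000)));
    res.status(503).json({ error: 'Service temporarily overloaded', reason: decision.reason });
    return;
  }

  // Feed completed-request latency back into the shedder.
  const start = Date.now();
  res.on('finish', () => shedder.recordLatency(Date.now() - start));
  next();
});
```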
The circuit breaker pattern is a cornerstone of graceful degradation. When a dependency fails repeatedly, the circuit breaker 'opens,' preventing further calls to the failing service. This protects both the caller (from timeout overhead) and the callee (from overload during recovery).
Circuit Breaker States:
Closed: Normal operation. Calls pass through while failures are counted; when the failure threshold or failure rate is exceeded, the circuit trips to open.
Open: Calls fail fast without reaching the dependency, giving it time to recover. After a cooldown period, the circuit moves to half-open.
Half-Open: A limited number of test calls are allowed through. Sustained success closes the circuit again; any failure reopens it.
Advanced Circuit Breaker Considerations:
Production circuit breakers typically go beyond the basic state machine: failure-rate thresholds with a minimum call count (so one early failure doesn't trip the circuit), a cap on test calls while half-open, gradual multi-step recovery rather than an immediate return to full traffic, a retry delay exposed to callers, and metrics emitted on every state transition. The implementation below demonstrates each of these.
```typescript
// TypeScript: Advanced Circuit Breaker with Gradual Recovery

type CircuitState = 'closed' | 'open' | 'half-open';

interface CircuitBreakerConfig {
  failureThreshold: number;      // Failures to trip circuit
  failureRateThreshold: number;  // Failure rate to trip (0-1)
  minimumCalls: number;          // Minimum calls before rate calculation
  openDurationMs: number;        // How long to stay open
  halfOpenMaxCalls: number;      // Test calls allowed in half-open
  recoverySteps: number;         // Gradual recovery steps
  resetTimeoutMs: number;        // Time before counters reset
}

interface CircuitMetrics {
  totalCalls: number;
  failedCalls: number;
  successfulCalls: number;
  lastFailureTime: number;
  lastSuccessTime: number;
  consecutiveSuccesses: number;
  consecutiveFailures: number;
}

class AdvancedCircuitBreaker {
  private state: CircuitState = 'closed';
  private config: CircuitBreakerConfig;
  private metrics: CircuitMetrics;
  private stateChangedAt: number = Date.now();
  private halfOpenCallCount: number = 0;
  private recoveryStep: number = 0; // Current step in gradual recovery

  constructor(private name: string, config: CircuitBreakerConfig) {
    this.config = config;
    this.metrics = this.createEmptyMetrics();
  }

  /**
   * Execute a call through the circuit breaker
   */
  async execute<T>(operation: () => Promise<T>): Promise<T> {
    // Check if we should allow this call
    if (!this.allowRequest()) {
      throw new CircuitOpenError(
        `Circuit '${this.name}' is open. Retry after ${this.getRetryDelay()}ms`
      );
    }

    const startTime = Date.now();

    try {
      const result = await operation();
      this.recordSuccess(Date.now() - startTime);
      return result;
    } catch (error) {
      this.recordFailure(Date.now() - startTime);
      throw error;
    }
  }

  /**
   * Determine if request should be allowed based on circuit state
   */
  private allowRequest(): boolean {
    const now = Date.now();

    switch (this.state) {
      case 'closed':
        return true;

      case 'open':
        // Check if cooldown has passed
        if (now - this.stateChangedAt >= this.config.openDurationMs) {
          this.transitionTo('half-open');
          return this.allowHalfOpenRequest();
        }
        return false;

      case 'half-open':
        return this.allowHalfOpenRequest();
    }
  }

  /**
   * Gradual recovery in half-open state
   */
  private allowHalfOpenRequest(): boolean {
    // Limit total calls in half-open
    if (this.halfOpenCallCount >= this.config.halfOpenMaxCalls) {
      return false;
    }
    this.halfOpenCallCount++;
    return true;
  }

  /**
   * Record successful call
   */
  private recordSuccess(latencyMs: number) {
    this.metrics.totalCalls++;
    this.metrics.successfulCalls++;
    this.metrics.consecutiveSuccesses++;
    this.metrics.consecutiveFailures = 0;
    this.metrics.lastSuccessTime = Date.now();

    if (this.state === 'half-open') {
      // Check if we should recover
      if (this.metrics.consecutiveSuccesses >= this.getSuccessesNeededForRecovery()) {
        this.recoveryStep++;

        if (this.recoveryStep >= this.config.recoverySteps) {
          // Full recovery
          this.transitionTo('closed');
        } else {
          // Partial recovery - reset half-open counter for more tests
          this.halfOpenCallCount = 0;
          console.log(`Circuit '${this.name}': Recovery step ${this.recoveryStep}/${this.config.recoverySteps}`);
        }
      }
    }
  }

  /**
   * Record failed call
   */
  private recordFailure(latencyMs: number) {
    this.metrics.totalCalls++;
    this.metrics.failedCalls++;
    this.metrics.consecutiveFailures++;
    this.metrics.consecutiveSuccesses = 0;
    this.metrics.lastFailureTime = Date.now();

    if (this.state === 'closed') {
      // Check if we should trip the circuit
      if (this.shouldTrip()) {
        this.transitionTo('open');
      }
    } else if (this.state === 'half-open') {
      // Any failure in half-open returns to open
      this.transitionTo('open');
    }
  }

  /**
   * Determine if circuit should trip based on failure rate
   */
  private shouldTrip(): boolean {
    // Check consecutive failures
    if (this.metrics.consecutiveFailures >= this.config.failureThreshold) {
      return true;
    }

    // Check failure rate (with minimum call threshold)
    if (this.metrics.totalCalls >= this.config.minimumCalls) {
      const failureRate = this.metrics.failedCalls / this.metrics.totalCalls;
      if (failureRate >= this.config.failureRateThreshold) {
        return true;
      }
    }

    return false;
  }

  /**
   * Transition to new state with logging
   */
  private transitionTo(newState: CircuitState) {
    const oldState = this.state;
    this.state = newState;
    this.stateChangedAt = Date.now();

    if (newState === 'half-open') {
      this.halfOpenCallCount = 0;
    }

    if (newState === 'closed') {
      this.metrics = this.createEmptyMetrics();
      this.recoveryStep = 0;
    }

    console.log(`Circuit '${this.name}': ${oldState} → ${newState}`);

    // Emit metrics (circuitStateGauge is assumed to be defined elsewhere, e.g. a Prometheus gauge)
    circuitStateGauge.set({ circuit: this.name, state: newState }, 1);
  }

  /**
   * Calculate successes needed at current recovery step
   */
  private getSuccessesNeededForRecovery(): number {
    // More successes needed at each recovery step
    return 2 + this.recoveryStep;
  }

  /**
   * Get retry delay for clients
   */
  getRetryDelay(): number {
    if (this.state === 'open') {
      const elapsed = Date.now() - this.stateChangedAt;
      const remaining = this.config.openDurationMs - elapsed;
      return Math.max(0, remaining);
    }
    return 0;
  }

  private createEmptyMetrics(): CircuitMetrics {
    return {
      totalCalls: 0,
      failedCalls: 0,
      successfulCalls: 0,
      lastFailureTime: 0,
      lastSuccessTime: 0,
      consecutiveSuccesses: 0,
      consecutiveFailures: 0
    };
  }
}

class CircuitOpenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'CircuitOpenError';
  }
}
```

Monitor circuit state transitions closely. Frequent trips indicate unstable dependencies. Circuits that never trip may have thresholds set too high. Track the time spent in each state and correlate it with incident timelines.
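A caller typically pairs the breaker with a fallback so that an open circuit degrades rather than errors. The sketch below is illustrative: the configuration values, fetchProfileFromService(), and serveCachedProfile() are assumptions standing in for your real dependency call and fallback.

```typescript
// Sketch: combining the circuit breaker with a fallback path.
// The helpers below are hypothetical stand-ins for real code.
declare function fetchProfileFromService(userId: string): Promise<unknown>;
declare function serveCachedProfile(userId: string): Promise<unknown>;

const profileBreaker = new AdvancedCircuitBreaker('profile-service', {
  failureThreshold: 5,
  failureRateThreshold: 0.5,
  minimumCalls: 20,
  openDurationMs: 30_000,
  halfOpenMaxCalls: 3,
  recoverySteps: 3,
  resetTimeoutMs: 60_000
});

async function getProfile(userId: string) {
  try {
    return await profileBreaker.execute(() => fetchProfileFromService(userId));
  } catch (error) {
    if (error instanceof CircuitOpenError) {
      // Circuit is open: skip the doomed call entirely and serve degraded data.
      return serveCachedProfile(userId);
    }
    throw error;
  }
}
```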
Cascading failures occur when one component's failure triggers failures in dependent components, which trigger further failures, until the entire system collapses. Understanding and preventing these cascades is essential for robust degradation.
Cascade Trigger Patterns:
| Trigger Pattern | How It Cascades | Prevention Strategy |
|---|---|---|
| Retry Storms | Failed requests trigger retries, overloading already stressed services | Exponential backoff, jitter, circuit breakers |
| Connection Pool Exhaustion | Slow dependencies hold connections, starving other operations | Timeout connections, separate pools per dependency |
| Queue Backup | Processing slows, queues grow, memory exhausted, OOM crash | Queue depth limits, dead letter queues, backpressure |
| Cache Stampede | Cache expires, all requests hit database simultaneously | Staggered expiration, cache warming, request coalescing |
| Resource Contention | Heavy operations starve lightweight operations | Resource isolation, priority queues, rate limiting |
| Health Check Overload | Health checks consume resources during stress, worsening degradation | Lightweight health checks, separate thread pool |
```typescript
// TypeScript: Cascade Prevention Patterns

import Bottleneck from 'bottleneck';

/**
 * Request Coalescing: Prevent cache stampede by combining identical requests
 */
class RequestCoalescer<T> {
  private inFlightRequests: Map<string, Promise<T>> = new Map();

  async coalesce(key: string, fetchFn: () => Promise<T>): Promise<T> {
    // If request for this key is already in flight, wait for it
    const existing = this.inFlightRequests.get(key);
    if (existing) {
      console.log(`Coalescing request for key: ${key}`);
      return existing;
    }

    // Start new request
    const promise = fetchFn().finally(() => {
      this.inFlightRequests.delete(key);
    });

    this.inFlightRequests.set(key, promise);
    return promise;
  }
}

/**
 * Bulkhead Pattern: Isolate resources per dependency
 */
class BulkheadedClient {
  private limiters: Map<string, Bottleneck>;

  constructor(private config: {
    services: string[];
    maxConcurrent: number;
    maxQueued: number;
  }) {
    this.limiters = new Map();
    for (const service of config.services) {
      this.limiters.set(service, new Bottleneck({
        maxConcurrent: config.maxConcurrent,
        highWater: config.maxQueued,
        strategy: Bottleneck.strategy.OVERFLOW // Reject when full
      }));
    }
  }

  async call<T>(
    service: string,
    operation: () => Promise<T>
  ): Promise<T> {
    const limiter = this.limiters.get(service);
    if (!limiter) {
      throw new Error(`Unknown service: ${service}`);
    }

    try {
      return await limiter.schedule(operation);
    } catch (error) {
      if (error.message === 'This limiter is overflowed') {
        throw new BulkheadFullError(
          `Bulkhead for ${service} is full. Request rejected.`
        );
      }
      throw error;
    }
  }
}

class BulkheadFullError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'BulkheadFullError';
  }
}

/**
 * Backpressure Handler: Propagate capacity limits upstream
 */
class BackpressureHandler {
  private currentPressure: number = 0;
  private pressureThresholds = {
    low: 0.5,
    medium: 0.7,
    high: 0.9,
    critical: 0.95
  };

  updatePressure(queueDepth: number, maxDepth: number) {
    this.currentPressure = queueDepth / maxDepth;
  }

  /**
   * Get recommended delay for upstream callers
   */
  getBackpressureSignal(): {
    level: 'none' | 'low' | 'medium' | 'high' | 'critical';
    recommendedDelayMs: number;
    acceptingNewRequests: boolean;
  } {
    if (this.currentPressure < this.pressureThresholds.low) {
      return { level: 'none', recommendedDelayMs: 0, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.medium) {
      return { level: 'low', recommendedDelayMs: 100, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.high) {
      return { level: 'medium', recommendedDelayMs: 500, acceptingNewRequests: true };
    }
    if (this.currentPressure < this.pressureThresholds.critical) {
      return { level: 'high', recommendedDelayMs: 2000, acceptingNewRequests: true };
    }
    return { level: 'critical', recommendedDelayMs: 5000, acceptingNewRequests: false };
  }
}

/**
 * Retry with exponential backoff and jitter
 */
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  config: {
    maxRetries: number;
    baseDelayMs: number;
    maxDelayMs: number;
    jitterMs: number;
  }
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (attempt === config.maxRetries) {
        throw lastError;
      }

      // Calculate delay with exponential backoff
      const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
      const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

      // Add jitter to prevent thundering herd
      const jitter = Math.random() * config.jitterMs;
      const finalDelay = cappedDelay + jitter;

      console.log(`Retry attempt ${attempt + 1}/${config.maxRetries} after ${Math.round(finalDelay)}ms`);
      await new Promise(resolve => setTimeout(resolve, finalDelay));
    }
  }

  throw lastError;
}
```

In complex systems, the first failure is rarely the catastrophic one—it's the second, third, and fourth failures triggered by the response to the first. Design your degradation strategies to reduce system load during failures, not amplify it.
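These patterns are most effective in combination. The sketch below shows one way to compose them on a single read path; the service name, cache key, and loadUserFromDb() helper are illustrative assumptions.

```typescript
// Sketch: composing coalescing, bulkheading, and bounded retries on one read path.
// loadUserFromDb is a hypothetical stand-in for the real database call.
declare function loadUserFromDb(userId: string): Promise<string>;

const coalescer = new RequestCoalescer<string>();
const bulkheads = new BulkheadedClient({
  services: ['user-db'],
  maxConcurrent: 20,
  maxQueued: 100
});

async function getUser(userId: string): Promise<string> {
  // Identical concurrent lookups collapse into one in-flight request,
  // the bulkhead caps concurrency against the database, and retries
  // back off with jitter instead of hammering a struggling dependency.
  return coalescer.coalesce(`user:${userId}`, () =>
    bulkheads.call('user-db', () =>
      retryWithBackoff(() => loadUserFromDb(userId), {
        maxRetries: 2,
        baseDelayMs: 100,
        maxDelayMs: 1000,
        jitterMs: 100
      })
    )
  );
}
```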
Graceful degradation is the art of maintaining useful service when perfection is impossible. It requires intentional design, clear prioritization, and robust fallback mechanisms.
What's next:
Graceful degradation handles failures at the application level. But when servers fail entirely, the system needs to replace them—automatically routing traffic to surviving instances and potentially spinning up new capacity. The final page in this module explores failover strategies: how traffic is redirected and how systems recover.
You now understand the principles and patterns of graceful degradation—how to maintain partial service during failures, prioritize critical functionality, and prevent cascade failures. Next, we'll explore comprehensive failover strategies.