At 3:47 AM on a quiet Sunday morning, a network switch in a major cloud provider's region began dropping packets intermittently. Not all packets—just enough to cause occasional connection timeouts. The affected servers weren't down; they were partially reachable.
The on-call engineer faced a dilemma that cuts to the heart of failure detection: the health checks were failing 30% of the time. Were these servers 'failed' or 'healthy'? Setting the failure threshold at 3 consecutive failures meant some truly struggling servers kept receiving traffic. Setting it at 1 failure meant healthy servers were constantly bouncing in and out of rotation due to transient network hiccups.
This scenario illustrates the fundamental challenge of failure detection: distinguishing genuine failures from transient issues, quickly enough to matter but accurately enough to avoid false alarms. Get it wrong in one direction, and failed servers receive traffic. Get it wrong in the other, and constant false positives destabilize your entire routing layer.
By the end of this page, you will understand the algorithms and strategies that convert raw health observations into actionable failure decisions. You'll learn how to configure detection thresholds for different failure modes, understand the mathematical trade-offs between detection speed and accuracy, and apply advanced detection techniques like adaptive thresholds and statistical anomaly detection.
Before designing detection mechanisms, we must understand what we're detecting. Failures in distributed systems come in many forms, each with distinct characteristics that affect detection strategies.
Failure Taxonomy:
| Failure Type | Characteristics | Detection Challenge | Typical Signal |
|---|---|---|---|
| Crash Failure | Process terminates abruptly | Easy - clear signal | Connection refused, port closed |
| Hang Failure | Process alive but unresponsive | Medium - timeouts required | Request timeouts, no response |
| Byzantine Failure | Process responds incorrectly | Hard - need correctness checks | Wrong data, corrupted responses |
| Performance Degradation | Process slow but functional | Medium - latency thresholds | Increased response times |
| Partial Failure | Some operations fail, others succeed | Hard - workload dependent | Intermittent errors |
| Network Partition | Node unreachable from some peers | Hard - perspective dependent | Asymmetric connectivity |
| Resource Exhaustion | OOM, disk full, connection limits | Medium - resource monitoring | Specific error codes |
The Observer Problem:
A fundamental challenge in failure detection is that, from the outside, a crashed process, a slow process, and an unreachable process can look identical. Consider a health check that times out after 3 seconds: the target may have crashed, may be paused in garbage collection, may be overloaded, or may be cut off by a network partition.
The health checker sees the same signal (a timeout) for very different underlying conditions. This inherent ambiguity shapes everything about failure detection design.
The FLP impossibility result from distributed systems theory shows that, in an asynchronous system, a failed process cannot be reliably distinguished from a merely slow one; this is precisely why consensus cannot be guaranteed when even one process may crash. Failure detectors must therefore accept some degree of inaccuracy: either missing real failures (false negatives) or incorrectly flagging healthy nodes (false positives).
The most common approach to failure detection is threshold-based: mark a server as failed after a specified number of consecutive failures or after exceeding an error rate over a time window.
Consecutive Failure Counting:
The simplest model tracks consecutive probe failures:
```
if consecutiveFailures >= failureThreshold:
    markUnhealthy()

if consecutiveSuccesses >= successThreshold:
    markHealthy()
```
This creates a state machine with hysteresis—the asymmetric thresholds prevent rapid oscillation between healthy and unhealthy states.
Configuring Thresholds:
The failure threshold directly controls the trade-off between detection speed and false positive rate:
| Failure Threshold | Detection Time | False Positive Risk | Use Case |
|---|---|---|---|
| 1 | Immediate | High | Ultra-low-latency systems |
| 2-3 | 10-15 seconds* | Medium | Real-time applications |
| 3-5 | 15-25 seconds* | Low | General web services |
| 5-10 | 25-50 seconds* | Very Low | Batch processing, stable workloads |
*Assuming 5-second probe intervals
Success Threshold for Recovery:
The success threshold controls how quickly recovered servers return to rotation. Setting this too low risks adding servers that passed one check but are still unstable. Setting it too high delays capacity restoration:
```typescript
// TypeScript: Threshold-Based Failure Detection Implementation

interface ServerHealth {
  serverId: string;
  state: 'healthy' | 'failing' | 'unhealthy' | 'recovering';
  consecutiveFailures: number;
  consecutiveSuccesses: number;
  lastProbeTime: Date;
  lastStateChange: Date;
  failureHistory: ProbeResult[];
}

interface ProbeResult {
  timestamp: Date;
  success: boolean;
  latencyMs: number;
  errorType?: string;
}

interface DetectorConfig {
  failureThreshold: number;  // Failures before marking unhealthy
  successThreshold: number;  // Successes before marking healthy
  historySize: number;       // Number of probe results to retain
}

// Stub metrics counter so the example is self-contained; in production this
// would come from your metrics library (e.g., a Prometheus-style client).
const stateTransitionCounter = {
  inc: (_labels: { from_state: string; to_state: string }) => { /* emit metric */ }
};

class ThresholdFailureDetector {
  private servers: Map<string, ServerHealth> = new Map();
  private config: DetectorConfig;

  constructor(config: DetectorConfig) {
    this.config = config;
  }

  /**
   * Process a probe result and update server health state
   */
  recordProbe(serverId: string, result: ProbeResult): ServerHealth {
    let server = this.servers.get(serverId);
    if (!server) {
      server = this.initializeServer(serverId);
    }

    // Update history (sliding window)
    server.failureHistory.push(result);
    if (server.failureHistory.length > this.config.historySize) {
      server.failureHistory.shift();
    }
    server.lastProbeTime = result.timestamp;

    // State machine transitions
    const previousState = server.state;

    if (result.success) {
      server.consecutiveFailures = 0;
      server.consecutiveSuccesses++;

      if (server.state === 'unhealthy' || server.state === 'recovering') {
        server.state = 'recovering';
        if (server.consecutiveSuccesses >= this.config.successThreshold) {
          server.state = 'healthy';
          server.consecutiveSuccesses = 0;
        }
      } else {
        server.state = 'healthy';
      }
    } else {
      server.consecutiveSuccesses = 0;
      server.consecutiveFailures++;

      if (server.state === 'healthy' || server.state === 'failing') {
        server.state = 'failing';
        if (server.consecutiveFailures >= this.config.failureThreshold) {
          server.state = 'unhealthy';
        }
      } else if (server.state === 'recovering') {
        // Recovery interrupted
        server.state = 'unhealthy';
      }
    }

    // Record state change timestamp
    if (server.state !== previousState) {
      server.lastStateChange = new Date();
      this.onStateChange(serverId, previousState, server.state);
    }

    this.servers.set(serverId, server);
    return server;
  }

  /**
   * Check if server should receive traffic
   */
  isHealthy(serverId: string): boolean {
    const server = this.servers.get(serverId);
    if (!server) return false;
    return server.state === 'healthy' || server.state === 'failing';
  }

  private initializeServer(serverId: string): ServerHealth {
    return {
      serverId,
      state: 'healthy',
      consecutiveFailures: 0,
      consecutiveSuccesses: 0,
      lastProbeTime: new Date(),
      lastStateChange: new Date(),
      failureHistory: []
    };
  }

  private onStateChange(
    serverId: string,
    from: ServerHealth['state'],
    to: ServerHealth['state']
  ) {
    console.log(`[FailureDetector] ${serverId}: ${from} → ${to}`);
    // Emit metrics
    stateTransitionCounter.inc({ from_state: from, to_state: to });
  }
}

// Example configuration for different scenarios
const configs = {
  // High-frequency trading: ultra-fast detection, accept more false positives
  lowLatency: { failureThreshold: 1, successThreshold: 1, historySize: 10 },

  // Standard web service: balanced detection
  standard: { failureThreshold: 3, successThreshold: 2, historySize: 20 },

  // Batch processing: conservative, minimize disruption
  stable: { failureThreshold: 5, successThreshold: 3, historySize: 50 }
};
```

When failure and success thresholds are both low, servers can 'flap' rapidly between healthy and unhealthy states. This creates routing instability, connection storms as clients retry, and alert fatigue. Always use asymmetric thresholds, making it harder to leave the unhealthy state than to enter it.
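To see the hysteresis in action, here is a minimal usage sketch built on the ThresholdFailureDetector class and configs object above; the server ID and probe results are fabricated for illustration.

```typescript
// Usage sketch: the 'standard' config requires 3 failures to eject and 2
// successes to recover.
const detector = new ThresholdFailureDetector(configs.standard);

const probe = (success: boolean): ProbeResult => ({
  timestamp: new Date(),
  success,
  latencyMs: success ? 20 : 2000
});

// Three consecutive failures: healthy → failing → failing → unhealthy
detector.recordProbe('server-1', probe(false));
detector.recordProbe('server-1', probe(false));
detector.recordProbe('server-1', probe(false));
console.log(detector.isHealthy('server-1')); // false

// Two consecutive successes: unhealthy → recovering → healthy
detector.recordProbe('server-1', probe(true));
detector.recordProbe('server-1', probe(true));
console.log(detector.isHealthy('server-1')); // true
```

Note that a server in the intermediate 'failing' state still receives traffic; only crossing the failure threshold removes it from rotation.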
Consecutive failure counting has a significant limitation: it's all-or-nothing. A server that fails every other probe (50% failure rate) will never trigger consecutive failure thresholds but is clearly problematic.
Error Rate Detection:
Rate-based detection tracks the proportion of failures over a sliding time window:
```
errorRate = failedProbes / totalProbes   (over last N seconds)

if errorRate >= errorRateThreshold:
    markUnhealthy()
```
This approach catches servers with intermittent failures that evade consecutive counting.
| Parameter | Description | Typical Values | Trade-off |
|---|---|---|---|
| Time Window | Period over which to calculate rate | 30s - 5min | Shorter = faster detection, more noise |
| Minimum Samples | Required probes before rate calculation | 5-20 | Higher = more stable, slower cold-start |
| Error Rate Threshold | Failure rate to trigger ejection | 50-90% | Higher = more tolerant, slower detection |
| Request Volume Minimum | Minimum requests to evaluate (passive) | 50-100 | Higher = reject low-traffic noise |
```typescript
// TypeScript: Rate-Based Failure Detection with Sliding Window

interface SlidingWindowConfig {
  windowSizeMs: number;        // Time window for rate calculation
  bucketSizeMs: number;        // Granularity of time buckets
  minSamples: number;          // Minimum samples before calculating rate
  errorRateThreshold: number;  // Error rate to trigger failure (0.0-1.0)
}

interface TimeBucket {
  startTime: number;
  successes: number;
  failures: number;
}

class SlidingWindowDetector {
  private config: SlidingWindowConfig;
  private buckets: Map<string, TimeBucket[]> = new Map();

  constructor(config: SlidingWindowConfig) {
    this.config = config;
  }

  /**
   * Record a probe result in the appropriate time bucket
   */
  recordProbe(serverId: string, success: boolean, timestamp: number = Date.now()) {
    const buckets = this.getOrCreateBuckets(serverId);
    const bucketIndex = Math.floor(timestamp / this.config.bucketSizeMs);

    // Find or create the appropriate bucket
    let bucket = buckets.find(b =>
      Math.floor(b.startTime / this.config.bucketSizeMs) === bucketIndex
    );

    if (!bucket) {
      bucket = {
        startTime: bucketIndex * this.config.bucketSizeMs,
        successes: 0,
        failures: 0
      };
      buckets.push(bucket);
    }

    if (success) {
      bucket.successes++;
    } else {
      bucket.failures++;
    }

    // Prune old buckets
    this.pruneBuckets(serverId, timestamp);
  }

  /**
   * Calculate current error rate for a server
   */
  getErrorRate(serverId: string, timestamp: number = Date.now()): {
    errorRate: number;
    totalSamples: number;
    isReliable: boolean;
  } {
    this.pruneBuckets(serverId, timestamp);
    const buckets = this.buckets.get(serverId) || [];
    const windowStart = timestamp - this.config.windowSizeMs;

    let totalSuccesses = 0;
    let totalFailures = 0;

    for (const bucket of buckets) {
      if (bucket.startTime >= windowStart) {
        totalSuccesses += bucket.successes;
        totalFailures += bucket.failures;
      }
    }

    const totalSamples = totalSuccesses + totalFailures;
    const isReliable = totalSamples >= this.config.minSamples;
    const errorRate = totalSamples > 0 ? totalFailures / totalSamples : 0;

    return { errorRate, totalSamples, isReliable };
  }

  /**
   * Determine if server should be marked unhealthy based on error rate
   */
  shouldEject(serverId: string, timestamp: number = Date.now()): {
    shouldEject: boolean;
    reason?: string;
  } {
    const { errorRate, totalSamples, isReliable } = this.getErrorRate(serverId, timestamp);

    if (!isReliable) {
      return {
        shouldEject: false,
        reason: `Insufficient samples (${totalSamples}/${this.config.minSamples})`
      };
    }

    if (errorRate >= this.config.errorRateThreshold) {
      return {
        shouldEject: true,
        reason: `Error rate ${(errorRate * 100).toFixed(1)}% exceeds threshold ${(this.config.errorRateThreshold * 100).toFixed(1)}%`
      };
    }

    return { shouldEject: false };
  }

  private getOrCreateBuckets(serverId: string): TimeBucket[] {
    if (!this.buckets.has(serverId)) {
      this.buckets.set(serverId, []);
    }
    return this.buckets.get(serverId)!;
  }

  private pruneBuckets(serverId: string, currentTime: number) {
    const buckets = this.buckets.get(serverId);
    if (!buckets) return;

    const windowStart = currentTime - this.config.windowSizeMs;
    const prunedBuckets = buckets.filter(b => b.startTime >= windowStart);
    this.buckets.set(serverId, prunedBuckets);
  }
}

// Example: Envoy-style outlier detection parameters
const envoyStyleConfig: SlidingWindowConfig = {
  windowSizeMs: 60000,       // 1 minute window
  bucketSizeMs: 5000,        // 5 second buckets
  minSamples: 10,            // Need at least 10 probes
  errorRateThreshold: 0.85   // 85% failure rate triggers ejection
};
```

Production systems often combine both approaches: consecutive failure detection catches rapid, complete failures quickly, while rate-based detection catches slower degradation. The 'OR' of both signals provides comprehensive coverage.
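As one possible sketch of that combination, the hypothetical wrapper below (not taken from any particular load balancer) feeds each probe to both detectors defined earlier and ejects a server when either signal fires:

```typescript
// Sketch: combine consecutive-failure and rate-based signals with a logical OR.
// Assumes the ThresholdFailureDetector and SlidingWindowDetector classes above.
class CombinedDetector {
  constructor(
    private consecutive: ThresholdFailureDetector,
    private rateBased: SlidingWindowDetector
  ) {}

  recordProbe(serverId: string, result: ProbeResult) {
    this.consecutive.recordProbe(serverId, result);
    this.rateBased.recordProbe(serverId, result.success, result.timestamp.getTime());
  }

  // Eject when EITHER detector considers the server unhealthy.
  shouldEject(serverId: string): boolean {
    const consecutiveUnhealthy = !this.consecutive.isHealthy(serverId);
    const rateUnhealthy = this.rateBased.shouldEject(serverId).shouldEject;
    return consecutiveUnhealthy || rateUnhealthy;
  }
}
```

The consecutive detector reacts within a few probes to a hard crash, while the sliding window catches the every-other-probe failure pattern that consecutive counting misses.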
Fixed thresholds work well when you know what 'normal' looks like. But in dynamic systems where baselines shift—traffic patterns change, infrastructure scales, dependencies vary—fixed thresholds become stale. Statistical methods adapt detection thresholds to observed behavior.
Success Rate Anomaly Detection:
Instead of a fixed error rate threshold, compare each server's success rate to its peers:
This automatically adjusts for system-wide conditions. If the entire cluster is experiencing 10% errors (maybe due to a dependency issue), no individual server is an outlier. But if one server has 50% errors while peers have 10%, it's ejected.
```typescript
// TypeScript: Statistical Outlier Detection

interface ServerSuccessRate {
  serverId: string;
  successRate: number;
  requestCount: number;
}

interface OutlierDetectionConfig {
  stdevFactor: number;          // Standard deviations from mean (e.g., 1.9)
  minClusterSize: number;       // Minimum servers to calculate statistics
  minRequestVolume: number;     // Minimum requests for reliable rate
  successRateMinHosts: number;  // Minimum hosts with sufficient requests
}

class StatisticalOutlierDetector {
  private config: OutlierDetectionConfig;

  constructor(config: OutlierDetectionConfig) {
    this.config = config;
  }

  /**
   * Identify outliers using success rate deviation from cluster mean
   */
  detectOutliers(serverRates: ServerSuccessRate[]): {
    outliers: string[];
    clusterMean: number;
    clusterStdev: number;
    threshold: number;
  } {
    // Filter to servers with sufficient request volume
    const reliableServers = serverRates.filter(
      s => s.requestCount >= this.config.minRequestVolume
    );

    // Need minimum hosts for meaningful statistics
    if (reliableServers.length < this.config.successRateMinHosts) {
      return { outliers: [], clusterMean: 0, clusterStdev: 0, threshold: 0 };
    }

    // Calculate mean
    const successRates = reliableServers.map(s => s.successRate);
    const mean = successRates.reduce((a, b) => a + b, 0) / successRates.length;

    // Calculate standard deviation
    const squaredDiffs = successRates.map(rate => Math.pow(rate - mean, 2));
    const avgSquaredDiff = squaredDiffs.reduce((a, b) => a + b, 0) / squaredDiffs.length;
    const stdev = Math.sqrt(avgSquaredDiff);

    // Calculate threshold
    const threshold = mean - (stdev * this.config.stdevFactor);

    // Identify outliers (success rate below threshold)
    const outliers = reliableServers
      .filter(s => s.successRate < threshold)
      .map(s => s.serverId);

    return { outliers, clusterMean: mean, clusterStdev: stdev, threshold };
  }

  /**
   * Enhanced detection with multiple signals
   */
  detectOutliersMultiSignal(
    serverRates: ServerSuccessRate[],
    serverLatencies: { serverId: string; p99LatencyMs: number }[]
  ): Map<string, { isOutlier: boolean; reasons: string[] }> {
    const results = new Map<string, { isOutlier: boolean; reasons: string[] }>();

    // Success rate outliers
    const successRateOutliers = this.detectOutliers(serverRates);

    // Latency outliers (similar approach)
    const latencies = serverLatencies.map(s => s.p99LatencyMs);
    const latencyMean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
    const latencySquaredDiffs = latencies.map(l => Math.pow(l - latencyMean, 2));
    const latencyStdev = Math.sqrt(
      latencySquaredDiffs.reduce((a, b) => a + b, 0) / latencies.length
    );
    const latencyThreshold = latencyMean + (latencyStdev * this.config.stdevFactor);

    // Combine signals
    for (const server of serverRates) {
      const reasons: string[] = [];

      if (successRateOutliers.outliers.includes(server.serverId)) {
        reasons.push(
          `Success rate ${(server.successRate * 100).toFixed(1)}% below threshold ${(successRateOutliers.threshold * 100).toFixed(1)}%`
        );
      }

      const latencyData = serverLatencies.find(l => l.serverId === server.serverId);
      if (latencyData && latencyData.p99LatencyMs > latencyThreshold) {
        reasons.push(
          `P99 latency ${latencyData.p99LatencyMs}ms above threshold ${latencyThreshold.toFixed(0)}ms`
        );
      }

      results.set(server.serverId, { isOutlier: reasons.length > 0, reasons });
    }

    return results;
  }
}

// Envoy-style parameters (stdev factor of 1.9 = ~3% false positive rate)
const envoyConfig: OutlierDetectionConfig = {
  stdevFactor: 1.9,
  minClusterSize: 5,
  minRequestVolume: 100,
  successRateMinHosts: 5
};
```

The stdev factor determines outlier sensitivity.
Assuming normally distributed success rates: 1.0 stdev catches ~16% of hosts, 1.5 stdev catches ~7%, 1.9 stdev catches ~3%, and 2.0 stdev catches ~2.5%. Envoy defaults to 1.9 because it provides good sensitivity while limiting false positives.
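To make those percentages concrete, the sketch below estimates how many healthy hosts a given stdev factor would flag purely by chance. The tail probabilities are standard normal-table values, and the normality assumption is the same one stated above.

```typescript
// Sketch: expected number of healthy hosts flagged by chance for a given
// stdev factor, assuming cluster success rates are roughly normally distributed.
// The values below are standard normal tail probabilities Φ(−k).
const normalTail: Record<string, number> = {
  '1.0': 0.159,
  '1.5': 0.067,
  '1.9': 0.029,
  '2.0': 0.023
};

function expectedFalseEjections(stdevFactor: number, hostCount: number): number {
  const p = normalTail[stdevFactor.toFixed(1)];
  if (p === undefined) throw new Error(`No tabulated tail value for ${stdevFactor}`);
  return p * hostCount;
}

console.log(expectedFalseEjections(1.9, 100)); // ≈ 2.9 hosts flagged by chance
console.log(expectedFalseEjections(1.0, 100)); // ≈ 15.9 hosts (far too aggressive)
```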
Static detection parameters assume stable failure characteristics. In reality, systems experience varied conditions that warrant different detection sensitivity. Adaptive detection adjusts thresholds based on current conditions.
Adaptive Approaches:
```typescript
// TypeScript: Adaptive Detection with Exponential Backoff

interface AdaptiveServerState {
  serverId: string;
  ejectionCount: number;            // How many times this server has been ejected
  lastEjectionTime: number;         // Last time server was ejected
  currentEjectionDuration: number;  // Current ejection duration
  isEjected: boolean;
}

interface AdaptiveConfig {
  baseEjectionTimeMs: number;   // Initial ejection time
  maxEjectionTimeMs: number;    // Maximum ejection time
  ejectionMultiplier: number;   // Multiplier per subsequent ejection
  maxEjectionPercent: number;   // Max percentage of hosts to eject
}

class AdaptiveEjectionManager {
  private config: AdaptiveConfig;
  private serverStates: Map<string, AdaptiveServerState> = new Map();
  private totalHosts: number = 0;

  constructor(config: AdaptiveConfig) {
    this.config = config;
  }

  setTotalHosts(count: number) {
    this.totalHosts = count;
  }

  /**
   * Attempt to eject a server, respecting adaptive constraints
   */
  ejectServer(serverId: string, reason: string): {
    ejected: boolean;
    ejectionDuration?: number;
    rejectionReason?: string;
  } {
    // Check max ejection percentage
    const currentlyEjected = Array.from(this.serverStates.values())
      .filter(s => s.isEjected).length;
    const ejectionPercent = (currentlyEjected + 1) / this.totalHosts * 100;

    if (ejectionPercent > this.config.maxEjectionPercent) {
      return {
        ejected: false,
        rejectionReason: `Would exceed max ejection ${this.config.maxEjectionPercent}% (${currentlyEjected}/${this.totalHosts} already ejected)`
      };
    }

    // Get or create server state
    let state = this.serverStates.get(serverId);
    if (!state) {
      state = {
        serverId,
        ejectionCount: 0,
        lastEjectionTime: 0,
        currentEjectionDuration: 0,
        isEjected: false
      };
    }

    // Calculate ejection duration with exponential backoff
    state.ejectionCount++;
    state.currentEjectionDuration = Math.min(
      this.config.baseEjectionTimeMs * Math.pow(
        this.config.ejectionMultiplier,
        state.ejectionCount - 1
      ),
      this.config.maxEjectionTimeMs
    );

    state.isEjected = true;
    state.lastEjectionTime = Date.now();
    this.serverStates.set(serverId, state);

    console.log(
      `[Ejection] ${serverId}: Ejected for ${state.currentEjectionDuration}ms ` +
      `(ejection #${state.ejectionCount}). Reason: ${reason}`
    );

    // Schedule automatic recovery check
    setTimeout(() => {
      this.checkRecovery(serverId);
    }, state.currentEjectionDuration);

    return { ejected: true, ejectionDuration: state.currentEjectionDuration };
  }

  /**
   * Check if server can be returned to rotation
   */
  private checkRecovery(serverId: string) {
    const state = this.serverStates.get(serverId);
    if (!state || !state.isEjected) return;

    const now = Date.now();
    if (now >= state.lastEjectionTime + state.currentEjectionDuration) {
      state.isEjected = false;
      this.serverStates.set(serverId, state);
      console.log(`[Recovery] ${serverId}: Returned to rotation (will verify with active probes)`);
    }
  }

  /**
   * Reset ejection count after sustained healthy period
   */
  resetEjectionCount(serverId: string, healthyDurationMs: number) {
    const state = this.serverStates.get(serverId);
    if (!state) return;

    // Reset after being healthy for 5x the max ejection time
    if (healthyDurationMs > this.config.maxEjectionTimeMs * 5) {
      state.ejectionCount = 0;
      this.serverStates.set(serverId, state);
      console.log(`[Reset] ${serverId}: Ejection count reset after sustained health`);
    }
  }
}

// Example configuration
const adaptiveConfig: AdaptiveConfig = {
  baseEjectionTimeMs: 30000,   // 30 seconds base
  maxEjectionTimeMs: 300000,   // 5 minutes max
  ejectionMultiplier: 2,       // Double each time
  maxEjectionPercent: 50       // Never eject more than 50%
};

// Ejection progression: 30s → 60s → 120s → 240s → 300s (capped)
```

The max ejection percentage is your panic button. If your detection logic is flawed or an external factor affects all servers, the max ejection percentage prevents complete traffic loss. Never set this to 100%. Serving with some broken servers is better than serving with none.
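A brief usage sketch of the ejection manager above (server IDs and ejection reasons are illustrative) shows the exponential backoff and the ejection cap interacting:

```typescript
// Usage sketch: four hosts, 50% max ejection.
const manager = new AdaptiveEjectionManager(adaptiveConfig);
manager.setTotalHosts(4);

// First ejection of a server lasts 30s; re-ejecting the same server later
// would last 60s, then 120s, following the progression noted above.
console.log(manager.ejectServer('server-1', 'error rate 90% over 1m'));
// → { ejected: true, ejectionDuration: 30000 }

console.log(manager.ejectServer('server-2', 'consecutive probe failures'));
// → { ejected: true, ejectionDuration: 30000 } (2/4 = 50%, still within the cap)

// A third ejection would put 3/4 hosts (75%) out of rotation, above the 50% cap,
// so the manager refuses and the server keeps receiving traffic.
console.log(manager.ejectServer('server-3', 'p99 latency outlier'));
// → { ejected: false, rejectionReason: 'Would exceed max ejection 50% ...' }
```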
There's no universal 'best' configuration for failure detection. The right settings depend on your failure modes, traffic patterns, and business requirements. Here's a framework for tuning detection parameters.
Step 1: Understand Your Failure Modes
Analyze historical incidents: which failure types from the taxonomy above occur most often, how long they typically last, and whether they show up as hard crashes or as intermittent, partial errors. Detection tuned for crash failures will miss slow degradation, and vice versa.
Step 2: Model Detection Latency
Detection Time = (Failure Threshold - 1) × Probe Interval + Probe Timeout
Example: 3 consecutive failures with a 5-second interval and 2-second timeout gives (3 − 1) × 5s + 2s = 12 seconds from the first failed probe to ejection.
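A small sketch of that calculation (the function name is illustrative):

```typescript
// Sketch: detection latency from the formula above. All times in seconds.
function detectionTimeSeconds(
  failureThreshold: number,
  probeIntervalSec: number,
  probeTimeoutSec: number
): number {
  return (failureThreshold - 1) * probeIntervalSec + probeTimeoutSec;
}

console.log(detectionTimeSeconds(3, 5, 2));  // 12, the worked example above
console.log(detectionTimeSeconds(2, 1, 1));  // 2, an aggressive real-time profile
console.log(detectionTimeSeconds(5, 30, 5)); // 125, a conservative batch profile
```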
Step 3: Calculate False Positive Rate
If your network has a 1% packet loss rate and you require 3 consecutive failures, and probe losses are independent, the chance of a spurious ejection in any given probe sequence is 0.01 × 0.01 × 0.01 = 10⁻⁶, roughly one in a million.
But if the probe interval is 5 seconds and network issues are bursty, a single 15-second blip can swallow all three probes. Failures are then correlated rather than independent, and the real false positive rate can be orders of magnitude higher than the naive calculation suggests.
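The sketch below computes the per-sequence false positive probability under the independence assumption; treat it as an optimistic lower bound, since bursty loss violates that assumption:

```typescript
// Sketch: false positive probability assuming probe failures are independent.
// Correlated (bursty) loss pushes the real rate well above this estimate.
function independentFalsePositiveProbability(
  packetLossRate: number,
  consecutiveFailuresRequired: number
): number {
  return Math.pow(packetLossRate, consecutiveFailuresRequired);
}

console.log(independentFalsePositiveProbability(0.01, 3)); // ≈ 1e-6 (one in a million)
console.log(independentFalsePositiveProbability(0.01, 1)); // 0.01, which is why threshold 1 flaps
```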
| Use Case | Probe Interval | Failure Threshold | Detection Time | False Positive Risk |
|---|---|---|---|---|
| Gaming / Real-time | 1-2s | 2 | ~3s | Higher - acceptable if pools are large |
| E-commerce / API | 5s | 3 | ~12s | Medium - balanced approach |
| Content Delivery | 10s | 3 | ~25s | Low - stability prioritized |
| Batch Processing | 30s | 5 | ~2.5min | Very Low - throughput matters more than latency |
| Database Proxying | 5s | 2 | ~7s | Higher - fast detection critical |
Track 'time to first error reaching user after server failure.' This end-to-end metric captures whether your detection is fast enough, regardless of implementation details. If users see errors for 30 seconds after a server dies, your detection window plus routing update time is at least 30 seconds.
Failure detection is the bridge between health observations and routing decisions. The algorithms and configurations you choose fundamentally determine how quickly your system responds to failures and how stable your traffic routing remains.
What's next:
Detecting failures is only half the battle. Once a failure is detected, how does the system respond? The next page explores graceful degradation—maintaining partial service when components fail, prioritizing critical functionality, and preventing cascade failures.
You now understand the algorithms and strategies for converting health observations into actionable failure decisions. You've learned threshold-based, rate-based, and statistical detection approaches, along with adaptive strategies for dynamic environments. Next, we'll explore graceful degradation.