At 3:47 AM on a quiet Sunday morning, a network switch in a major cloud provider's region began dropping packets intermittently. Not all packets—just enough to cause occasional connection timeouts. The affected servers weren't down; they were partially reachable.
The on-call engineer faced a dilemma that cuts to the heart of failure detection: the health checks were failing 30% of the time. Were these servers 'failed' or 'healthy'? Setting the failure threshold at 3 consecutive failures meant some truly struggling servers kept receiving traffic. Setting it at 1 failure meant healthy servers were constantly bouncing in and out of rotation due to transient network hiccups.
This scenario illustrates the fundamental challenge of failure detection: distinguishing genuine failures from transient issues, quickly enough to matter but accurately enough to avoid false alarms. Get it wrong in one direction, and failed servers receive traffic. Get it wrong in the other, and constant false positives destabilize your entire routing layer.
By the end of this page, you will understand the algorithms and strategies that convert raw health observations into actionable failure decisions. You'll learn how to configure detection thresholds for different failure modes, understand the mathematical trade-offs between detection speed and accuracy, and apply advanced detection techniques like adaptive thresholds and statistical anomaly detection.
Before designing detection mechanisms, we must understand what we're detecting. Failures in distributed systems come in many forms, each with distinct characteristics that affect detection strategies.
Failure Taxonomy:
| Failure Type | Characteristics | Detection Challenge | Typical Signal |
|---|---|---|---|
| Crash Failure | Process terminates abruptly | Easy - clear signal | Connection refused, port closed |
| Hang Failure | Process alive but unresponsive | Medium - timeouts required | Request timeouts, no response |
| Byzantine Failure | Process responds incorrectly | Hard - need correctness checks | Wrong data, corrupted responses |
| Performance Degradation | Process slow but functional | Medium - latency thresholds | Increased response times |
| Partial Failure | Some operations fail, others succeed | Hard - workload dependent | Intermittent errors |
| Network Partition | Node unreachable from some peers | Hard - perspective dependent | Asymmetric connectivity |
| Resource Exhaustion | OOM, disk full, connection limits | Medium - resource monitoring | Specific error codes |
The Observer Problem:
A fundamental challenge in failure detection is that, from the outside, a crashed process, a slow process, and an unreachable process can look identical. Consider a health check that times out after 3 seconds: the target may have crashed, may be paused in garbage collection, may be overloaded, or may be cut off by a network partition.
The health checker sees the same signal (a timeout) for very different underlying conditions. This inherent ambiguity shapes everything about failure detection design.
The FLP impossibility result from distributed systems theory shows that, in an asynchronous system, a failed process cannot be reliably distinguished from a merely slow one; this is precisely why consensus cannot be guaranteed when even one process may crash. Failure detectors must therefore accept some degree of inaccuracy: either missing real failures (false negatives) or incorrectly flagging healthy nodes (false positives).
The most common approach to failure detection is threshold-based: mark a server as failed after a specified number of consecutive failures or after exceeding an error rate over a time window.
Consecutive Failure Counting:
The simplest model tracks consecutive probe failures:
```
if consecutiveFailures >= failureThreshold:
    markUnhealthy()

if consecutiveSuccesses >= successThreshold:
    markHealthy()
```
This creates a state machine with hysteresis—the asymmetric thresholds prevent rapid oscillation between healthy and unhealthy states.
Configuring Thresholds:
The failure threshold directly controls the trade-off between detection speed and false positive rate:
| Failure Threshold | Detection Time | False Positive Risk | Use Case |
|---|---|---|---|
| 1 | Immediate | High | Ultra-low-latency systems |
| 2-3 | 10-15 seconds* | Medium | Real-time applications |
| 3-5 | 15-25 seconds* | Low | General web services |
| 5-10 | 25-50 seconds* | Very Low | Batch processing, stable workloads |
*Assuming 5-second probe intervals
Success Threshold for Recovery:
The success threshold controls how quickly recovered servers return to rotation. Setting this too low risks adding servers that passed one check but are still unstable. Setting it too high delays capacity restoration:
```typescript
// TypeScript: Threshold-Based Failure Detection Implementation

interface ServerHealth {
  serverId: string;
  state: 'healthy' | 'failing' | 'unhealthy' | 'recovering';
  consecutiveFailures: number;
  consecutiveSuccesses: number;
  lastProbeTime: Date;
  lastStateChange: Date;
  failureHistory: ProbeResult[];
}

interface ProbeResult {
  timestamp: Date;
  success: boolean;
  latencyMs: number;
  errorType?: string;
}

interface DetectorConfig {
  failureThreshold: number;  // Failures before marking unhealthy
  successThreshold: number;  // Successes before marking healthy
  historySize: number;       // Number of probe results to retain
}

// Stub metrics counter so the example is self-contained; in production this
// would come from your metrics library (e.g., a Prometheus-style client).
const stateTransitionCounter = {
  inc: (_labels: { from_state: string; to_state: string }) => { /* emit metric */ }
};

class ThresholdFailureDetector {
  private servers: Map<string, ServerHealth> = new Map();
  private config: DetectorConfig;

  constructor(config: DetectorConfig) {
    this.config = config;
  }

  /**
   * Process a probe result and update server health state
   */
  recordProbe(serverId: string, result: ProbeResult): ServerHealth {
    let server = this.servers.get(serverId);
    if (!server) {
      server = this.initializeServer(serverId);
    }

    // Update history (sliding window)
    server.failureHistory.push(result);
    if (server.failureHistory.length > this.config.historySize) {
      server.failureHistory.shift();
    }
    server.lastProbeTime = result.timestamp;

    // State machine transitions
    const previousState = server.state;

    if (result.success) {
      server.consecutiveFailures = 0;
      server.consecutiveSuccesses++;

      if (server.state === 'unhealthy' || server.state === 'recovering') {
        server.state = 'recovering';
        if (server.consecutiveSuccesses >= this.config.successThreshold) {
          server.state = 'healthy';
          server.consecutiveSuccesses = 0;
        }
      } else {
        server.state = 'healthy';
      }
    } else {
      server.consecutiveSuccesses = 0;
      server.consecutiveFailures++;

      if (server.state === 'healthy' || server.state === 'failing') {
        server.state = 'failing';
        if (server.consecutiveFailures >= this.config.failureThreshold) {
          server.state = 'unhealthy';
        }
      } else if (server.state === 'recovering') {
        // Recovery interrupted
        server.state = 'unhealthy';
      }
    }

    // Record state change timestamp
    if (server.state !== previousState) {
      server.lastStateChange = new Date();
      this.onStateChange(serverId, previousState, server.state);
    }

    this.servers.set(serverId, server);
    return server;
  }

  /**
   * Check if server should receive traffic
   */
  isHealthy(serverId: string): boolean {
    const server = this.servers.get(serverId);
    if (!server) return false;
    return server.state === 'healthy' || server.state === 'failing';
  }

  private initializeServer(serverId: string): ServerHealth {
    return {
      serverId,
      state: 'healthy',
      consecutiveFailures: 0,
      consecutiveSuccesses: 0,
      lastProbeTime: new Date(),
      lastStateChange: new Date(),
      failureHistory: []
    };
  }

  private onStateChange(
    serverId: string,
    from: ServerHealth['state'],
    to: ServerHealth['state']
  ) {
    console.log(`[FailureDetector] ${serverId}: ${from} → ${to}`);
    // Emit metrics
    stateTransitionCounter.inc({ from_state: from, to_state: to });
  }
}

// Example configuration for different scenarios
const configs = {
  // High-frequency trading: ultra-fast detection, accept more false positives
  lowLatency: { failureThreshold: 1, successThreshold: 1, historySize: 10 },

  // Standard web service: balanced detection
  standard: { failureThreshold: 3, successThreshold: 2, historySize: 20 },

  // Batch processing: conservative, minimize disruption
  stable: { failureThreshold: 5, successThreshold: 3, historySize: 50 }
};
```

When failure and success thresholds are both low, servers can 'flap' rapidly between healthy and unhealthy states. This creates routing instability, connection storms as clients retry, and alert fatigue. Always use asymmetric thresholds, making it harder to leave the unhealthy state than to enter it.
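To see the hysteresis in action, here is a minimal usage sketch built on the ThresholdFailureDetector class and configs object above; the server ID and probe results are fabricated for illustration.

```typescript
// Usage sketch: the 'standard' config requires 3 failures to eject and 2
// successes to recover.
const detector = new ThresholdFailureDetector(configs.standard);

const probe = (success: boolean): ProbeResult => ({
  timestamp: new Date(),
  success,
  latencyMs: success ? 20 : 2000
});

// Three consecutive failures: healthy → failing → failing → unhealthy
detector.recordProbe('server-1', probe(false));
detector.recordProbe('server-1', probe(false));
detector.recordProbe('server-1', probe(false));
console.log(detector.isHealthy('server-1')); // false

// Two consecutive successes: unhealthy → recovering → healthy
detector.recordProbe('server-1', probe(true));
detector.recordProbe('server-1', probe(true));
console.log(detector.isHealthy('server-1')); // true
```

Note that a server in the intermediate 'failing' state still receives traffic; only crossing the failure threshold removes it from rotation.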
Consecutive failure counting has a significant limitation: it's all-or-nothing. A server that fails every other probe (50% failure rate) will never trigger consecutive failure thresholds but is clearly problematic.
Error Rate Detection:
Rate-based detection tracks the proportion of failures over a sliding time window:
```
errorRate = failedProbes / totalProbes   (over last N seconds)

if errorRate >= errorRateThreshold:
    markUnhealthy()
```
This approach catches servers with intermittent failures that evade consecutive counting.
| Parameter | Description | Typical Values | Trade-off |
|---|---|---|---|
| Time Window | Period over which to calculate rate | 30s - 5min | Shorter = faster detection, more noise |
| Minimum Samples | Required probes before rate calculation | 5-20 | Higher = more stable, slower cold-start |
| Error Rate Threshold | Failure rate to trigger ejection | 50-90% | Higher = more tolerant, slower detection |
| Request Volume Minimum | Minimum requests to evaluate (passive) | 50-100 | Higher = reject low-traffic noise |
```typescript
// TypeScript: Rate-Based Failure Detection with Sliding Window

interface SlidingWindowConfig {
  windowSizeMs: number;        // Time window for rate calculation
  bucketSizeMs: number;        // Granularity of time buckets
  minSamples: number;          // Minimum samples before calculating rate
  errorRateThreshold: number;  // Error rate to trigger failure (0.0-1.0)
}

interface TimeBucket {
  startTime: number;
  successes: number;
  failures: number;
}

class SlidingWindowDetector {
  private config: SlidingWindowConfig;
  private buckets: Map<string, TimeBucket[]> = new Map();

  constructor(config: SlidingWindowConfig) {
    this.config = config;
  }

  /**
   * Record a probe result in the appropriate time bucket
   */
  recordProbe(serverId: string, success: boolean, timestamp: number = Date.now()) {
    const buckets = this.getOrCreateBuckets(serverId);
    const bucketIndex = Math.floor(timestamp / this.config.bucketSizeMs);

    // Find or create the appropriate bucket
    let bucket = buckets.find(b =>
      Math.floor(b.startTime / this.config.bucketSizeMs) === bucketIndex
    );

    if (!bucket) {
      bucket = {
        startTime: bucketIndex * this.config.bucketSizeMs,
        successes: 0,
        failures: 0
      };
      buckets.push(bucket);
    }

    if (success) {
      bucket.successes++;
    } else {
      bucket.failures++;
    }

    // Prune old buckets
    this.pruneBuckets(serverId, timestamp);
  }

  /**
   * Calculate current error rate for a server
   */
  getErrorRate(serverId: string, timestamp: number = Date.now()): {
    errorRate: number;
    totalSamples: number;
    isReliable: boolean;
  } {
    this.pruneBuckets(serverId, timestamp);
    const buckets = this.buckets.get(serverId) || [];
    const windowStart = timestamp - this.config.windowSizeMs;

    let totalSuccesses = 0;
    let totalFailures = 0;

    for (const bucket of buckets) {
      if (bucket.startTime >= windowStart) {
        totalSuccesses += bucket.successes;
        totalFailures += bucket.failures;
      }
    }

    const totalSamples = totalSuccesses + totalFailures;
    const isReliable = totalSamples >= this.config.minSamples;
    const errorRate = totalSamples > 0 ? totalFailures / totalSamples : 0;

    return { errorRate, totalSamples, isReliable };
  }

  /**
   * Determine if server should be marked unhealthy based on error rate
   */
  shouldEject(serverId: string, timestamp: number = Date.now()): {
    shouldEject: boolean;
    reason?: string;
  } {
    const { errorRate, totalSamples, isReliable } = this.getErrorRate(serverId, timestamp);

    if (!isReliable) {
      return {
        shouldEject: false,
        reason: `Insufficient samples (${totalSamples}/${this.config.minSamples})`
      };
    }

    if (errorRate >= this.config.errorRateThreshold) {
      return {
        shouldEject: true,
        reason: `Error rate ${(errorRate * 100).toFixed(1)}% exceeds threshold ${(this.config.errorRateThreshold * 100).toFixed(1)}%`
      };
    }

    return { shouldEject: false };
  }

  private getOrCreateBuckets(serverId: string): TimeBucket[] {
    if (!this.buckets.has(serverId)) {
      this.buckets.set(serverId, []);
    }
    return this.buckets.get(serverId)!;
  }

  private pruneBuckets(serverId: string, currentTime: number) {
    const buckets = this.buckets.get(serverId);
    if (!buckets) return;

    const windowStart = currentTime - this.config.windowSizeMs;
    const prunedBuckets = buckets.filter(b => b.startTime >= windowStart);
    this.buckets.set(serverId, prunedBuckets);
  }
}

// Example: Envoy-style outlier detection parameters
const envoyStyleConfig: SlidingWindowConfig = {
  windowSizeMs: 60000,       // 1 minute window
  bucketSizeMs: 5000,        // 5 second buckets
  minSamples: 10,            // Need at least 10 probes
  errorRateThreshold: 0.85   // 85% failure rate triggers ejection
};
```

Production systems often combine both approaches: consecutive failure detection catches rapid, complete failures quickly, while rate-based detection catches slower degradation. The 'OR' of both signals provides comprehensive coverage.
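As one possible sketch of that combination, the hypothetical wrapper below (not taken from any particular load balancer) feeds each probe to both detectors defined earlier and ejects a server when either signal fires:

```typescript
// Sketch: combine consecutive-failure and rate-based signals with a logical OR.
// Assumes the ThresholdFailureDetector and SlidingWindowDetector classes above.
class CombinedDetector {
  constructor(
    private consecutive: ThresholdFailureDetector,
    private rateBased: SlidingWindowDetector
  ) {}

  recordProbe(serverId: string, result: ProbeResult) {
    this.consecutive.recordProbe(serverId, result);
    this.rateBased.recordProbe(serverId, result.success, result.timestamp.getTime());
  }

  // Eject when EITHER detector considers the server unhealthy.
  shouldEject(serverId: string): boolean {
    const consecutiveUnhealthy = !this.consecutive.isHealthy(serverId);
    const rateUnhealthy = this.rateBased.shouldEject(serverId).shouldEject;
    return consecutiveUnhealthy || rateUnhealthy;
  }
}
```

The consecutive detector reacts within a few probes to a hard crash, while the sliding window catches the every-other-probe failure pattern that consecutive counting misses.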
Fixed thresholds work well when you know what 'normal' looks like. But in dynamic systems where baselines shift—traffic patterns change, infrastructure scales, dependencies vary—fixed thresholds become stale. Statistical methods adapt detection thresholds to observed behavior.
Success Rate Anomaly Detection:
Instead of a fixed error rate threshold, compare each server's success rate to its peers:
This automatically adjusts for system-wide conditions. If the entire cluster is experiencing 10% errors (maybe due to a dependency issue), no individual server is an outlier. But if one server has 50% errors while peers have 10%, it's ejected.
```typescript
// TypeScript: Statistical Outlier Detection

interface ServerSuccessRate {
  serverId: string;
  successRate: number;
  requestCount: number;
}

interface OutlierDetectionConfig {
  stdevFactor: number;          // Standard deviations from mean (e.g., 1.9)
  minClusterSize: number;       // Minimum servers to calculate statistics
  minRequestVolume: number;     // Minimum requests for reliable rate
  successRateMinHosts: number;  // Minimum hosts with sufficient requests
}

class StatisticalOutlierDetector {
  private config: OutlierDetectionConfig;

  constructor(config: OutlierDetectionConfig) {
    this.config = config;
  }

  /**
   * Identify outliers using success rate deviation from cluster mean
   */
  detectOutliers(serverRates: ServerSuccessRate[]): {
    outliers: string[];
    clusterMean: number;
    clusterStdev: number;
    threshold: number;
  } {
    // Filter to servers with sufficient request volume
    const reliableServers = serverRates.filter(
      s => s.requestCount >= this.config.minRequestVolume
    );

    // Need minimum hosts for meaningful statistics
    if (reliableServers.length < this.config.successRateMinHosts) {
      return { outliers: [], clusterMean: 0, clusterStdev: 0, threshold: 0 };
    }

    // Calculate mean
    const successRates = reliableServers.map(s => s.successRate);
    const mean = successRates.reduce((a, b) => a + b, 0) / successRates.length;

    // Calculate standard deviation
    const squaredDiffs = successRates.map(rate => Math.pow(rate - mean, 2));
    const avgSquaredDiff = squaredDiffs.reduce((a, b) => a + b, 0) / squaredDiffs.length;
    const stdev = Math.sqrt(avgSquaredDiff);

    // Calculate threshold
    const threshold = mean - (stdev * this.config.stdevFactor);

    // Identify outliers (success rate below threshold)
    const outliers = reliableServers
      .filter(s => s.successRate < threshold)
      .map(s => s.serverId);

    return { outliers, clusterMean: mean, clusterStdev: stdev, threshold };
  }

  /**
   * Enhanced detection with multiple signals
   */
  detectOutliersMultiSignal(
    serverRates: ServerSuccessRate[],
    serverLatencies: { serverId: string; p99LatencyMs: number }[]
  ): Map<string, { isOutlier: boolean; reasons: string[] }> {
    const results = new Map<string, { isOutlier: boolean; reasons: string[] }>();

    // Success rate outliers
    const successRateOutliers = this.detectOutliers(serverRates);

    // Latency outliers (similar approach)
    const latencies = serverLatencies.map(s => s.p99LatencyMs);
    const latencyMean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
    const latencySquaredDiffs = latencies.map(l => Math.pow(l - latencyMean, 2));
    const latencyStdev = Math.sqrt(
      latencySquaredDiffs.reduce((a, b) => a + b, 0) / latencies.length
    );
    const latencyThreshold = latencyMean + (latencyStdev * this.config.stdevFactor);

    // Combine signals
    for (const server of serverRates) {
      const reasons: string[] = [];

      if (successRateOutliers.outliers.includes(server.serverId)) {
        reasons.push(
          `Success rate ${(server.successRate * 100).toFixed(1)}% below threshold ${(successRateOutliers.threshold * 100).toFixed(1)}%`
        );
      }

      const latencyData = serverLatencies.find(l => l.serverId === server.serverId);
      if (latencyData && latencyData.p99LatencyMs > latencyThreshold) {
        reasons.push(
          `P99 latency ${latencyData.p99LatencyMs}ms above threshold ${latencyThreshold.toFixed(0)}ms`
        );
      }

      results.set(server.serverId, { isOutlier: reasons.length > 0, reasons });
    }

    return results;
  }
}

// Envoy-style parameters (stdev factor of 1.9 = ~3% false positive rate)
const envoyConfig: OutlierDetectionConfig = {
  stdevFactor: 1.9,
  minClusterSize: 5,
  minRequestVolume: 100,
  successRateMinHosts: 5
};
```

The stdev factor determines outlier sensitivity.
Assuming normally distributed success rates: 1.0 stdev catches ~16% of hosts, 1.5 stdev catches ~7%, 1.9 stdev catches ~3%, and 2.0 stdev catches ~2.5%. Envoy defaults to 1.9 because it provides good sensitivity while limiting false positives.
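To make those percentages concrete, the sketch below estimates how many healthy hosts a given stdev factor would flag purely by chance. The tail probabilities are standard normal-table values, and the normality assumption is the same one stated above.

```typescript
// Sketch: expected number of healthy hosts flagged by chance for a given
// stdev factor, assuming cluster success rates are roughly normally distributed.
// The values below are standard normal tail probabilities Φ(−k).
const normalTail: Record<string, number> = {
  '1.0': 0.159,
  '1.5': 0.067,
  '1.9': 0.029,
  '2.0': 0.023
};

function expectedFalseEjections(stdevFactor: number, hostCount: number): number {
  const p = normalTail[stdevFactor.toFixed(1)];
  if (p === undefined) throw new Error(`No tabulated tail value for ${stdevFactor}`);
  return p * hostCount;
}

console.log(expectedFalseEjections(1.9, 100)); // ≈ 2.9 hosts flagged by chance
console.log(expectedFalseEjections(1.0, 100)); // ≈ 15.9 hosts (far too aggressive)
```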
Static detection parameters assume stable failure characteristics. In reality, systems experience varied conditions that warrant different detection sensitivity. Adaptive detection adjusts thresholds based on current conditions.
Adaptive Approaches:
```typescript
// TypeScript: Adaptive Detection with Exponential Backoff

interface AdaptiveServerState {
  serverId: string;
  ejectionCount: number;            // How many times this server has been ejected
  lastEjectionTime: number;         // Last time server was ejected
  currentEjectionDuration: number;  // Current ejection duration
  isEjected: boolean;
}

interface AdaptiveConfig {
  baseEjectionTimeMs: number;   // Initial ejection time
  maxEjectionTimeMs: number;    // Maximum ejection time
  ejectionMultiplier: number;   // Multiplier per subsequent ejection
  maxEjectionPercent: number;   // Max percentage of hosts to eject
}

class AdaptiveEjectionManager {
  private config: AdaptiveConfig;
  private serverStates: Map<string, AdaptiveServerState> = new Map();
  private totalHosts: number = 0;

  constructor(config: AdaptiveConfig) {
    this.config = config;
  }

  setTotalHosts(count: number) {
    this.totalHosts = count;
  }

  /**
   * Attempt to eject a server, respecting adaptive constraints
   */
  ejectServer(serverId: string, reason: string): {
    ejected: boolean;
    ejectionDuration?: number;
    rejectionReason?: string;
  } {
    // Check max ejection percentage
    const currentlyEjected = Array.from(this.serverStates.values())
      .filter(s => s.isEjected).length;
    const ejectionPercent = (currentlyEjected + 1) / this.totalHosts * 100;

    if (ejectionPercent > this.config.maxEjectionPercent) {
      return {
        ejected: false,
        rejectionReason: `Would exceed max ejection ${this.config.maxEjectionPercent}% (${currentlyEjected}/${this.totalHosts} already ejected)`
      };
    }

    // Get or create server state
    let state = this.serverStates.get(serverId);
    if (!state) {
      state = {
        serverId,
        ejectionCount: 0,
        lastEjectionTime: 0,
        currentEjectionDuration: 0,
        isEjected: false
      };
    }

    // Calculate ejection duration with exponential backoff
    state.ejectionCount++;
    state.currentEjectionDuration = Math.min(
      this.config.baseEjectionTimeMs * Math.pow(
        this.config.ejectionMultiplier,
        state.ejectionCount - 1
      ),
      this.config.maxEjectionTimeMs
    );

    state.isEjected = true;
    state.lastEjectionTime = Date.now();
    this.serverStates.set(serverId, state);

    console.log(
      `[Ejection] ${serverId}: Ejected for ${state.currentEjectionDuration}ms ` +
      `(ejection #${state.ejectionCount}). Reason: ${reason}`
    );

    // Schedule automatic recovery check
    setTimeout(() => {
      this.checkRecovery(serverId);
    }, state.currentEjectionDuration);

    return { ejected: true, ejectionDuration: state.currentEjectionDuration };
  }

  /**
   * Check if server can be returned to rotation
   */
  private checkRecovery(serverId: string) {
    const state = this.serverStates.get(serverId);
    if (!state || !state.isEjected) return;

    const now = Date.now();
    if (now >= state.lastEjectionTime + state.currentEjectionDuration) {
      state.isEjected = false;
      this.serverStates.set(serverId, state);
      console.log(`[Recovery] ${serverId}: Returned to rotation (will verify with active probes)`);
    }
  }

  /**
   * Reset ejection count after sustained healthy period
   */
  resetEjectionCount(serverId: string, healthyDurationMs: number) {
    const state = this.serverStates.get(serverId);
    if (!state) return;

    // Reset after being healthy for 5x the max ejection time
    if (healthyDurationMs > this.config.maxEjectionTimeMs * 5) {
      state.ejectionCount = 0;
      this.serverStates.set(serverId, state);
      console.log(`[Reset] ${serverId}: Ejection count reset after sustained health`);
    }
  }
}

// Example configuration
const adaptiveConfig: AdaptiveConfig = {
  baseEjectionTimeMs: 30000,   // 30 seconds base
  maxEjectionTimeMs: 300000,   // 5 minutes max
  ejectionMultiplier: 2,       // Double each time
  maxEjectionPercent: 50       // Never eject more than 50%
};

// Ejection progression: 30s → 60s → 120s → 240s → 300s (capped)
```

The max ejection percentage is your panic button. If your detection logic is flawed or an external factor affects all servers, the max ejection percentage prevents complete traffic loss. Never set this to 100%. Serving with some broken servers is better than serving with none.
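A brief usage sketch of the ejection manager above (server IDs and ejection reasons are illustrative) shows the exponential backoff and the ejection cap interacting:

```typescript
// Usage sketch: four hosts, 50% max ejection.
const manager = new AdaptiveEjectionManager(adaptiveConfig);
manager.setTotalHosts(4);

// First ejection of a server lasts 30s; re-ejecting the same server later
// would last 60s, then 120s, following the progression noted above.
console.log(manager.ejectServer('server-1', 'error rate 90% over 1m'));
// → { ejected: true, ejectionDuration: 30000 }

console.log(manager.ejectServer('server-2', 'consecutive probe failures'));
// → { ejected: true, ejectionDuration: 30000 } (2/4 = 50%, still within the cap)

// A third ejection would put 3/4 hosts (75%) out of rotation, above the 50% cap,
// so the manager refuses and the server keeps receiving traffic.
console.log(manager.ejectServer('server-3', 'p99 latency outlier'));
// → { ejected: false, rejectionReason: 'Would exceed max ejection 50% ...' }
```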
There's no universal 'best' configuration for failure detection. The right settings depend on your failure modes, traffic patterns, and business requirements. Here's a framework for tuning detection parameters.
Step 1: Understand Your Failure Modes
Analyze historical incidents: which failure types from the taxonomy above occur most often, how long they typically last, and whether they show up as hard crashes or as intermittent, partial errors. Detection tuned for crash failures will miss slow degradation, and vice versa.
Step 2: Model Detection Latency
Detection Time = (Failure Threshold - 1) × Probe Interval + Probe Timeout
Example: 3 consecutive failures with a 5-second interval and 2-second timeout gives (3 − 1) × 5s + 2s = 12 seconds from the first failed probe to ejection.
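A small sketch of that calculation (the function name is illustrative):

```typescript
// Sketch: detection latency from the formula above. All times in seconds.
function detectionTimeSeconds(
  failureThreshold: number,
  probeIntervalSec: number,
  probeTimeoutSec: number
): number {
  return (failureThreshold - 1) * probeIntervalSec + probeTimeoutSec;
}

console.log(detectionTimeSeconds(3, 5, 2));  // 12, the worked example above
console.log(detectionTimeSeconds(2, 1, 1));  // 2, an aggressive real-time profile
console.log(detectionTimeSeconds(5, 30, 5)); // 125, a conservative batch profile
```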
Step 3: Calculate False Positive Rate
If your network has a 1% packet loss rate and you require 3 consecutive failures, and probe losses are independent, the chance of a spurious ejection in any given probe sequence is 0.01 × 0.01 × 0.01 = 10⁻⁶, roughly one in a million.
But if the probe interval is 5 seconds and network issues are bursty, a single 15-second blip can swallow all three probes. Failures are then correlated rather than independent, and the real false positive rate can be orders of magnitude higher than the naive calculation suggests.
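The sketch below computes the per-sequence false positive probability under the independence assumption; treat it as an optimistic lower bound, since bursty loss violates that assumption:

```typescript
// Sketch: false positive probability assuming probe failures are independent.
// Correlated (bursty) loss pushes the real rate well above this estimate.
function independentFalsePositiveProbability(
  packetLossRate: number,
  consecutiveFailuresRequired: number
): number {
  return Math.pow(packetLossRate, consecutiveFailuresRequired);
}

console.log(independentFalsePositiveProbability(0.01, 3)); // ≈ 1e-6 (one in a million)
console.log(independentFalsePositiveProbability(0.01, 1)); // 0.01, which is why threshold 1 flaps
```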
| Use Case | Probe Interval | Failure Threshold | Detection Time | False Positive Risk |
|---|---|---|---|---|
| Gaming / Real-time | 1-2s | 2 | ~3s | Higher - acceptable if pools are large |
| E-commerce / API | 5s | 3 | ~12s | Medium - balanced approach |
| Content Delivery | 10s | 3 | ~25s | Low - stability prioritized |
| Batch Processing | 30s | 5 | ~2.5min | Very Low - throughput matters more than latency |
| Database Proxying | 5s | 2 | ~7s | Higher - fast detection critical |
Track 'time to first error reaching user after server failure.' This end-to-end metric captures whether your detection is fast enough, regardless of implementation details. If users see errors for 30 seconds after a server dies, your detection window plus routing update time is at least 30 seconds.
Failure detection is the bridge between health observations and routing decisions. The algorithms and configurations you choose fundamentally determine how quickly your system responds to failures and how stable your traffic routing remains.
What's next:
Detecting failures is only half the battle. Once a failure is detected, how does the system respond? The next page explores graceful degradation—maintaining partial service when components fail, prioritizing critical functionality, and preventing cascade failures.
You now understand the algorithms and strategies for converting health observations into actionable failure decisions. You've learned threshold-based, rate-based, and statistical detection approaches, along with adaptive strategies for dynamic environments. Next, we'll explore graceful degradation.