On the morning of February 28, 2017, a typo in an Amazon S3 command cascaded into one of the most significant cloud outages in history. Massive swaths of the internet went dark. But here's what many don't realize: the detection systems worked perfectly. They detected the failures within seconds. The problem was that too many things failed at once, overwhelming both automated and manual response capabilities.
This incident illustrates a fundamental truth about failover detection: detecting that something is wrong is only the beginning. The real challenge is everything that follows: deciding whether the signal can be trusted, and reacting quickly without making the situation worse.
This page provides the comprehensive foundation you need to design detection systems that are accurate, timely, and resilient.
By the end of this page, you will master: heartbeat mechanisms and their configuration, health check design for different system types, failure detection algorithms and their tradeoffs, handling partial failures and gray states, detection system architecture for high availability, and common detection pitfalls with their solutions.
At its core, failure detection in distributed systems faces an impossible problem: you cannot reliably distinguish between a failed node and a slow node.
This is not a limitation of current technology; it's a fundamental property of asynchronous distributed systems, closely related to the FLP impossibility result. If a node stops responding, we cannot know with certainty whether the node has crashed, the node is merely slow, the network has dropped our messages, or the replies are simply delayed in transit.
Since we cannot achieve perfect detection, we must instead design systems that make probabilistic decisions with understood tradeoffs. Every detection system balances two types of errors: false positives (declaring a healthy node failed, which triggers unnecessary failover) and false negatives (failing to detect a genuinely failed node, which prolongs the outage).
The Detection Tradeoff Space:
Every detection system operates within a tradeoff space defined by a few parameters: the check interval (how often you test), the timeout applied to each check, and the failure threshold (how many misses before declaring failure).
Aggressive detection (short timeouts, frequent checks) reduces detection latency but increases false positives. Conservative detection (long timeouts, multiple confirmations) reduces false positives but increases detection latency. There is no universally optimal point—the right configuration depends on your system's specific requirements.
Academic research has produced sophisticated approaches like the Phi Accrual Failure Detector (used in Cassandra and Akka). Instead of returning binary healthy/unhealthy, it returns a suspicion level (phi) that increases over time since the last heartbeat. This allows applications to make nuanced decisions based on their own risk tolerance. A phi of 1 means roughly a 10% chance that declaring failure now would be a mistake; a phi of 3, roughly a 0.1% chance.
Heartbeats are the most fundamental failure detection mechanism. One component periodically sends a signal to another to indicate it's alive. Absence of heartbeats for a configured duration triggers failure detection.
Heartbeat Architecture Patterns:
Pattern 1: Push Heartbeats (Node to Monitor)
The monitored node periodically sends heartbeat messages to a monitoring service. If the monitor doesn't receive a heartbeat within the expected window, it marks the node as potentially failed.
Advantages:
- Simple to implement; the monitor passively records heartbeat arrivals
- Low network overhead: one small message per node per interval
- Outbound-only traffic from nodes, which is firewall-friendly

Disadvantages:
- The central monitor is a single point of failure
- Every node must be configured with the monitor's address
- A silent node tells you nothing about why it went silent
```typescript
// Assumed helpers for local metrics collection (not shown here)
declare function getCpuUsage(): number;
declare function getMemoryUsage(): number;
declare function getConnectionCount(): number;
declare function getPendingRequestCount(): number;

interface Heartbeat {
  nodeId: string;
  timestamp: number;
  sequenceNumber: number;
  metrics: Record<string, number>;
}

interface NodeState {
  lastHeartbeat: number;
  lastSequence: number;
  consecutiveMisses: number;
  status: 'healthy' | 'suspicious' | 'dead';
  metrics: Record<string, number>;
}

interface HttpClient {
  post(url: string, body: unknown, options: { timeout: number }): Promise<void>;
}

// Monitored Node - Heartbeat Sender
class HeartbeatSender {
  private intervalId: NodeJS.Timeout | null = null;
  private sequenceNumber: number = 0;

  constructor(
    private nodeId: string,
    private monitorEndpoint: string,
    private intervalMs: number = 5000,
    private httpClient: HttpClient
  ) {}

  start(): void {
    this.intervalId = setInterval(() => this.sendHeartbeat(), this.intervalMs);
    this.sendHeartbeat(); // Send immediately on start
  }

  private async sendHeartbeat(): Promise<void> {
    const heartbeat: Heartbeat = {
      nodeId: this.nodeId,
      timestamp: Date.now(),
      sequenceNumber: ++this.sequenceNumber,
      // Include health indicators in heartbeat
      metrics: {
        cpuPercent: getCpuUsage(),
        memoryPercent: getMemoryUsage(),
        activeConnections: getConnectionCount(),
        pendingRequests: getPendingRequestCount(),
      }
    };

    try {
      await this.httpClient.post(this.monitorEndpoint, heartbeat, {
        timeout: this.intervalMs / 2 // Timeout before next heartbeat
      });
    } catch (error) {
      // Log but don't crash - heartbeat sender should be resilient
      console.error('Failed to send heartbeat', error);
    }
  }

  stop(): void {
    if (this.intervalId) {
      clearInterval(this.intervalId);
      this.intervalId = null;
    }
  }
}

// Central Monitor - Heartbeat Receiver
class HeartbeatMonitor {
  private nodeStates = new Map<string, NodeState>();
  private readonly deadlineMultiplier = 3; // Miss 3 heartbeats = dead

  constructor(
    private expectedIntervalMs: number,
    private onNodeDead: (nodeId: string) => void,
    private onNodeRecovered: (nodeId: string) => void
  ) {
    // Check for dead nodes periodically
    setInterval(() => this.checkDeadNodes(), this.expectedIntervalMs);
  }

  receiveHeartbeat(heartbeat: Heartbeat): void {
    const existingState = this.nodeStates.get(heartbeat.nodeId);

    const newState: NodeState = {
      lastHeartbeat: heartbeat.timestamp,
      lastSequence: heartbeat.sequenceNumber,
      consecutiveMisses: 0,
      status: 'healthy',
      metrics: heartbeat.metrics,
    };

    this.nodeStates.set(heartbeat.nodeId, newState);

    // Node recovery detection
    if (existingState?.status === 'dead') {
      this.onNodeRecovered(heartbeat.nodeId);
    }
  }

  private checkDeadNodes(): void {
    const now = Date.now();
    const deadline = this.expectedIntervalMs * this.deadlineMultiplier;

    for (const [nodeId, state] of this.nodeStates.entries()) {
      const timeSinceHeartbeat = now - state.lastHeartbeat;

      if (timeSinceHeartbeat > deadline && state.status !== 'dead') {
        state.status = 'dead';
        state.consecutiveMisses = Math.floor(timeSinceHeartbeat / this.expectedIntervalMs);
        this.onNodeDead(nodeId);
      } else if (timeSinceHeartbeat > this.expectedIntervalMs && state.status !== 'dead') {
        // Guard on status so a dead node isn't flipped back to 'suspicious',
        // which would re-fire onNodeDead on the next sweep
        state.status = 'suspicious';
        state.consecutiveMisses++;
      }
    }
  }
}
```

Pattern 2: Pull Heartbeats (Monitor Polls Node)
The monitoring service periodically queries monitored nodes with health check requests. If a node fails to respond, it's marked as potentially failed.
Advantages:
- The monitor controls the check cadence and can adjust it centrally
- Integrates naturally with health check endpoints, so checks can verify real functionality
- No heartbeat code is required on the monitored nodes

Disadvantages:
- The central monitor remains a single point of failure
- Higher network traffic: each check is a full request/response round trip
- The monitor must maintain an accurate inventory of all nodes
Pattern 3: Peer-to-Peer Heartbeats (Gossip)
Nodes periodically exchange heartbeats directly with each other. Each node maintains a local view of cluster membership. Node failures are detected locally and gossiped to other nodes.
Advantages:
- No single point of failure; detection survives the loss of any individual node
- Scales naturally, since each node exchanges state with only a few peers per round

Disadvantages:
- Significantly more complex to implement correctly
- Detection time is variable, since failure information spreads probabilistically
- Nodes can briefly hold inconsistent views of cluster membership
| Characteristic | Push (Node→Monitor) | Pull (Monitor→Node) | Gossip (Peer-to-Peer) |
|---|---|---|---|
| Single point of failure | Yes (monitor) | Yes (monitor) | No |
| Network traffic | Low | Medium | High |
| Detection speed | Fast | Medium | Variable |
| Implementation complexity | Simple | Simple | Complex |
| Scales to 1000+ nodes | With sharded monitors | With sharded monitors | Yes, naturally |
| Best for | Central monitoring | Health check integration | Decentralized clusters |
The heartbeat interval directly impacts detection latency. If you heartbeat every 5 seconds and require 3 missed heartbeats before declaring failure, your minimum detection time is 15 seconds. For sub-second detection, you need sub-second heartbeats—but this increases network load and the risk of false positives from momentary delays.
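The arithmetic above can be captured in a one-line helper (a sketch; the function name is illustrative):

```typescript
// Minimum time to declare failure for a heartbeat scheme:
// `missThreshold` consecutive intervals must elapse with no heartbeat.
function minDetectionMs(intervalMs: number, missThreshold: number): number {
  return intervalMs * missThreshold;
}

console.log(minDetectionMs(5000, 3)); // 15000 - the 15s from the example above
console.log(minDetectionMs(500, 3));  // 1500 - sub-second heartbeats still need 1.5s
```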
While heartbeats verify that a process is alive, health checks verify that a service is functioning correctly. A database might be alive (heartbeating) but unable to accept queries (unhealthy). Effective health checks bridge this gap.
The Health Check Spectrum:
Health checks exist on a spectrum from shallow to deep: shallow checks confirm only that the process responds at all; intermediate checks verify that critical dependencies are reachable; deep checks exercise real functionality and gather diagnostic detail.
Designing Effective Health Endpoints:
A well-designed health check endpoint provides graduated information for different consumers:
```typescript
// Assumed database client (e.g., node-postgres), shown as a declaration
declare const db: { query(sql: string): Promise<{ rows: any[] }> };

interface HealthCheckResponse {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  version: string;
  uptime: number;
  checks: ComponentCheck[];
}

interface ComponentCheck {
  name: string;
  status: 'pass' | 'warn' | 'fail';
  duration_ms: number;
  message?: string;
  details?: Record<string, unknown>;
}

class HealthController {
  // Liveness probe: Is the process alive?
  // Used by orchestrators to know when to restart.
  // Should be extremely fast and never fail due to dependencies.
  async getLiveness(): Promise<{ status: 'ok' }> {
    return { status: 'ok' };
  }

  // Readiness probe: Is the service ready to receive traffic?
  // Used by load balancers to know when to route traffic.
  // Should check critical dependencies.
  async getReadiness(): Promise<HealthCheckResponse> {
    const checks: ComponentCheck[] = [];

    // Check database connection
    checks.push(await this.checkDatabase());
    // Check cache connection
    checks.push(await this.checkCache());
    // Check required external services
    checks.push(await this.checkPaymentGateway());

    const overallStatus = this.determineOverallStatus(checks);

    return {
      status: overallStatus,
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
      uptime: process.uptime(),
      checks,
    };
  }

  // Deep health: comprehensive diagnostic information.
  // Used by operators for troubleshooting.
  // Can be expensive; call sparingly.
  async getDeepHealth(): Promise<HealthCheckResponse> {
    const checks: ComponentCheck[] = [];

    // All readiness checks
    checks.push(await this.checkDatabase());
    checks.push(await this.checkCache());
    checks.push(await this.checkPaymentGateway());

    // Additional diagnostic checks
    checks.push(await this.checkDatabaseReplicationLag());
    checks.push(await this.checkMessageQueueDepth());
    checks.push(await this.checkDiskSpace());
    checks.push(await this.checkMemoryPressure());
    checks.push(await this.checkCertificateExpiry());

    // Performance verification
    checks.push(await this.executeTestQuery());

    return {
      status: this.determineOverallStatus(checks),
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
      uptime: process.uptime(),
      checks,
    };
  }

  // (checkCache, checkPaymentGateway, and the other diagnostic
  // checks follow the same shape as checkDatabase; omitted for brevity)

  private async checkDatabase(): Promise<ComponentCheck> {
    const start = Date.now();
    try {
      await db.query('SELECT 1');
      return {
        name: 'database',
        status: 'pass',
        duration_ms: Date.now() - start,
      };
    } catch (error) {
      return {
        name: 'database',
        status: 'fail',
        duration_ms: Date.now() - start,
        message: (error as Error).message,
      };
    }
  }

  private async checkDatabaseReplicationLag(): Promise<ComponentCheck> {
    const start = Date.now();
    try {
      // On a PostgreSQL replica: bytes of WAL received but not yet replayed
      const lag = await db.query(
        'SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes'
      );
      const lagBytes = Number(lag.rows[0].lag_bytes) || 0;
      const lagMb = lagBytes / (1024 * 1024);

      let status: 'pass' | 'warn' | 'fail';
      if (lagMb < 1) status = 'pass';
      else if (lagMb < 100) status = 'warn';
      else status = 'fail';

      return {
        name: 'database_replication_lag',
        status,
        duration_ms: Date.now() - start,
        details: { lag_mb: lagMb },
      };
    } catch (error) {
      return {
        name: 'database_replication_lag',
        status: 'fail',
        duration_ms: Date.now() - start,
        message: (error as Error).message,
      };
    }
  }

  private determineOverallStatus(checks: ComponentCheck[]): 'healthy' | 'degraded' | 'unhealthy' {
    if (checks.some(c => c.status === 'fail')) return 'unhealthy';
    if (checks.some(c => c.status === 'warn')) return 'degraded';
    return 'healthy';
  }
}
```

Common mistakes:
1. Health checks that are too slow (>1s) cause load balancers to time out.
2. Health checks that fail due to optional dependencies cause false positives.
3. Health checks that mutate state (create test records) cause operational issues.
4. Deep health checks called too frequently overwhelm the system.
Liveness vs Readiness in Kubernetes:
Kubernetes distinguishes between liveness and readiness probes, a pattern valuable even outside Kubernetes:
Liveness Probe: "Should we restart this container?" If liveness fails, the container is killed and restarted. Should only fail if the process is genuinely broken and restart would help. Should NOT fail due to dependency issues—restarting won't fix those.
Readiness Probe: "Should we send traffic to this container?" If readiness fails, the container is removed from load balancer rotation but not restarted. Should fail during startup, during graceful shutdown, and when dependencies are unavailable.
Startup Probe: "Has this container finished starting?" Prevents liveness checks from killing slow-starting applications. Only used during initial startup.
Raw heartbeat data must be processed through detection algorithms to produce actionable failure signals. Different algorithms offer different tradeoffs between detection speed and false positive rates.
Algorithm 1: Fixed Timeout (Naive)
The simplest approach: if no heartbeat arrives within T seconds, declare failure.
Advantages: Simple to implement and understand.

Disadvantages: Vulnerable to network jitter; requires manual tuning of T.
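A minimal sketch of the fixed-timeout approach, with an injectable clock so the behavior is easy to exercise (names are illustrative):

```typescript
// Fixed-timeout detector: declare failure if no heartbeat has
// arrived within `timeoutMs`. A single network hiccup longer
// than the timeout is enough to cause a false positive.
class FixedTimeoutDetector {
  private lastHeartbeat: number | null = null;

  constructor(
    private timeoutMs: number,
    private now: () => number = Date.now
  ) {}

  recordHeartbeat(): void {
    this.lastHeartbeat = this.now();
  }

  isFailed(): boolean {
    // No heartbeat seen yet counts as failed once monitoring starts
    if (this.lastHeartbeat === null) return true;
    return this.now() - this.lastHeartbeat > this.timeoutMs;
  }
}

let clock = 0;
const detector = new FixedTimeoutDetector(10_000, () => clock);
detector.recordHeartbeat();
clock = 9_000;
console.log(detector.isFailed()); // false - still within the timeout
clock = 11_000;
console.log(detector.isFailed()); // true - timeout T exceeded
```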
Algorithm 2: Consecutive Failures
Require N consecutive missed heartbeats (or failed health checks) before declaring failure. This smooths over transient issues.
Configuration: failureThreshold = 3, checkInterval = 5s → Detection in 15s minimum
Advantages: Filters transient failures.

Disadvantages: Adds up to N × interval to detection latency.
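A minimal sketch of the consecutive-failures rule (illustrative; any single success resets the counter):

```typescript
// Consecutive-failures detector: N failed checks in a row are
// required before declaring failure; one success resets the count.
class ConsecutiveFailureDetector {
  private consecutiveFailures = 0;

  constructor(private failureThreshold: number) {}

  recordResult(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }

  isFailed(): boolean {
    return this.consecutiveFailures >= this.failureThreshold;
  }
}

const det = new ConsecutiveFailureDetector(3);
det.recordResult(false);
det.recordResult(false);
det.recordResult(true);  // transient blip absorbed: counter resets
det.recordResult(false);
det.recordResult(false);
console.log(det.isFailed()); // false - only 2 consecutive failures
det.recordResult(false);
console.log(det.isFailed()); // true - 3 in a row
```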
```typescript
interface DetectionResult {
  isHealthy: boolean;
  failureCount: number;
  failureRate: number;
  windowSizeMs: number;
  threshold: number;
}

/**
 * Sliding Window Failure Detector
 *
 * Instead of requiring consecutive failures, this algorithm
 * tracks failures within a time window. This is more resilient
 * to intermittent issues while still detecting real failures.
 */
class SlidingWindowDetector {
  private failureTimestamps: number[] = [];

  constructor(
    private windowSizeMs: number,     // e.g., 60000 (1 minute)
    private failureThreshold: number, // e.g., 5 failures
    private successThreshold: number  // e.g., 3 successes to recover
  ) {}

  recordResult(success: boolean): DetectionResult {
    const now = Date.now();

    // Clean old failures outside window
    this.failureTimestamps = this.failureTimestamps.filter(
      ts => now - ts < this.windowSizeMs
    );

    if (!success) {
      this.failureTimestamps.push(now);
    }

    const failureCount = this.failureTimestamps.length;
    const failureRate = failureCount / (this.windowSizeMs / 1000); // failures per second

    return {
      isHealthy: failureCount < this.failureThreshold,
      failureCount,
      failureRate,
      windowSizeMs: this.windowSizeMs,
      threshold: this.failureThreshold,
    };
  }
}

/**
 * Phi Accrual Failure Detector
 *
 * Returns a suspicion level instead of binary healthy/unhealthy.
 * Based on the paper "The φ Accrual Failure Detector" by Hayashibara et al.
 * Used in Apache Cassandra and Akka.
 */
class PhiAccrualDetector {
  private arrivalTimes: number[] = [];
  private readonly maxSamples = 1000;

  recordHeartbeat(): void {
    this.arrivalTimes.push(Date.now());
    if (this.arrivalTimes.length > this.maxSamples) {
      this.arrivalTimes.shift();
    }
  }

  getPhi(): number {
    if (this.arrivalTimes.length < 2) {
      return 0; // Not enough data
    }

    // Calculate inter-arrival times
    const intervals: number[] = [];
    for (let i = 1; i < this.arrivalTimes.length; i++) {
      intervals.push(this.arrivalTimes[i] - this.arrivalTimes[i - 1]);
    }

    // Calculate mean and standard deviation
    const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
    const variance = intervals.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / intervals.length;
    const stdDev = Math.sqrt(variance);

    // Time since last heartbeat
    const timeSinceLast = Date.now() - this.arrivalTimes[this.arrivalTimes.length - 1];

    // Phi calculation (simplified)
    // Higher phi = stronger suspicion of failure:
    // phi of 1 ≈ 10% chance a failure declaration is wrong,
    // phi of 2 ≈ 1%, phi of 3 ≈ 0.1%.
    return this.calculatePhi(timeSinceLast, mean, stdDev);
  }

  private calculatePhi(timeSinceLast: number, mean: number, stdDev: number): number {
    // Guard against zero standard deviation (perfectly regular heartbeats)
    const safeStdDev = Math.max(stdDev, 0.1);

    // Crude normal-distribution stand-in for P(heartbeat arrives later
    // than now), following phi = -log10(P_later); only meaningful once
    // timeSinceLast exceeds the mean inter-arrival time
    const y = (timeSinceLast - mean) / safeStdDev;
    const pLater = (1 / Math.sqrt(2 * Math.PI)) * Math.exp(-y * y / 2);
    return -Math.log10(pLater);
  }

  isLikelyFailed(threshold: number = 8): boolean {
    return this.getPhi() > threshold;
  }
}
```

Algorithm 3: Adaptive Detection
Adaptive detectors automatically tune their parameters based on observed behavior. If a node typically responds in 10ms but occasionally takes 100ms, the detector adjusts its thresholds accordingly.
The Phi Accrual Detector is the canonical example: it builds a statistical model of inter-arrival times and uses this to calculate the probability that a node has failed given the time since the last heartbeat.
Key insight: A 5-second delay means different things for different nodes. For a node that typically responds in 1ms, it's almost certainly failed. For a node that sometimes takes 4 seconds due to garbage collection, it might be fine.
Algorithm 4: Quorum-Based Detection
Multiple monitors independently observe the node. Failure is only declared if a quorum of monitors agree. This prevents false positives from a single monitor's network issues.
| Algorithm | Detection Speed | False Positive Risk | Complexity | Best For |
|---|---|---|---|---|
| Fixed Timeout | Fast | High | Low | Simple systems, local monitoring |
| Consecutive Failures | Slow | Medium | Low | Most production systems |
| Sliding Window | Medium | Medium | Medium | Fluctuating workloads |
| Phi Accrual | Adaptive | Low | High | Heterogeneous nodes |
| Quorum-Based | Medium | Very Low | High | Critical systems, distributed monitoring |
Production systems often combine multiple algorithms. For example: use consecutive failures for quick detection, then require quorum confirmation before triggering failover. This captures the speed of simple algorithms while adding the reliability of quorum consensus.
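That combination might be sketched like this (illustrative names; the quorum check is assumed to be provided by separate monitoring infrastructure, shown here as a synchronous callback for simplicity):

```typescript
// Hybrid detection: a fast local consecutive-miss counter raises
// suspicion, but failover fires only after quorum confirmation.
type QuorumCheck = (nodeId: string) => boolean; // true = quorum agrees node is down

class HybridDetector {
  private misses = new Map<string, number>();

  constructor(
    private missThreshold: number,
    private confirmWithQuorum: QuorumCheck,
    private onConfirmedFailure: (nodeId: string) => void
  ) {}

  recordMiss(nodeId: string): void {
    const count = (this.misses.get(nodeId) ?? 0) + 1;
    this.misses.set(nodeId, count);

    // Cheap local signal first, expensive quorum confirmation second
    if (count >= this.missThreshold && this.confirmWithQuorum(nodeId)) {
      this.onConfirmedFailure(nodeId);
    }
  }

  recordHeartbeat(nodeId: string): void {
    this.misses.set(nodeId, 0);
  }
}

const confirmed: string[] = [];
const hybrid = new HybridDetector(2, () => true, id => confirmed.push(id));
hybrid.recordMiss('node-a'); // below threshold: quorum never consulted
hybrid.recordMiss('node-a'); // threshold met and quorum agrees
console.log(confirmed); // ['node-a']
```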
Real-world failures rarely present as clean binary states. More commonly, systems experience partial failures or gray states where behavior is degraded but not completely failed. Effective detection must handle these nuanced states.
Types of Partial Failures:
1. Degraded Response Times
The service responds, but significantly slower than normal. It might be under heavy load, experiencing resource contention, or suffering from a slowly degrading component (filling disk, memory leak).
Detection approach: Monitor latency percentiles (p50, p95, p99). Alert when latency exceeds thresholds. Consider removing from load balancer rotation even if technically healthy.
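A rough sketch of a percentile-based degradation check (the nearest-rank percentile method and the 3x threshold are illustrative choices):

```typescript
// Nearest-rank percentile over a sorted sample of latencies
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sortedMs.length) - 1;
  return sortedMs[Math.max(0, Math.min(idx, sortedMs.length - 1))];
}

// Flag degradation when recent p99 exceeds a multiple of the baseline,
// even though every request still technically succeeds
function isDegraded(samplesMs: number[], baselineP99Ms: number, factor = 3): boolean {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return percentile(sorted, 99) > baselineP99Ms * factor;
}

const normal = [10, 12, 11, 9, 14, 13, 10, 12];
console.log(isDegraded(normal, 15)); // false
const slow = [10, 12, 500, 480, 510, 13, 490, 505];
console.log(isDegraded(slow, 15)); // true - p99 far beyond 3x baseline
```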
2. Elevated Error Rates
Some requests succeed, but an unusually high percentage fail. This could indicate database connection exhaustion, downstream dependency issues, or partial data corruption.
Detection approach: Track error rate as a percentage of total requests. Use statistical significance testing to distinguish real problems from normal variance.
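One simple form of that statistical test is a one-sided z-test on the error proportion (a sketch; the threshold of 3 standard errors is an illustrative choice):

```typescript
// Is the observed error rate significantly above the baseline rate,
// or within normal sampling variance for this many requests?
function errorRateIsSignificant(
  errors: number,
  total: number,
  baselineRate: number,
  zThreshold = 3 // roughly 99.9% one-sided confidence
): boolean {
  if (total === 0) return false;
  const observed = errors / total;
  // Standard error of the proportion under the baseline hypothesis
  const stdErr = Math.sqrt((baselineRate * (1 - baselineRate)) / total);
  if (stdErr === 0) return observed > baselineRate;
  return (observed - baselineRate) / stdErr > zThreshold;
}

// 30 errors in 1000 requests against a 1% baseline: clearly elevated
console.log(errorRateIsSignificant(30, 1000, 0.01)); // true
// 12 errors in 1000 requests: plausibly just noise
console.log(errorRateIsSignificant(12, 1000, 0.01)); // false
```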
Multi-Signal Detection:
To catch partial failures, effective detection systems combine multiple signals:
Latency Signals: p50, p95, p99 response times compared to baseline

Error Signals: 4xx rate, 5xx rate, timeout rate

Saturation Signals: CPU utilization, memory pressure, connection pool usage, queue depth

Traffic Signals: Request rate compared to expected patterns

Synthetic Signals: Active probing that exercises code paths beyond /health
No single signal catches all failure modes. The combination provides comprehensive coverage.
```typescript
interface Signal {
  name: string;
  value: number;
  thresholds: {
    warning: number;
    critical: number;
  };
  weight: number; // Importance in overall score
}

interface HealthEvaluation {
  score: number;
  status: 'healthy' | 'degraded' | 'unhealthy';
  issues: string[];
  recommendation: string;
}

class MultiSignalHealthEvaluator {
  evaluateHealth(signals: Signal[]): HealthEvaluation {
    let totalWeight = 0;
    let healthScore = 0; // 0-100, higher is healthier
    const issues: string[] = [];

    for (const signal of signals) {
      totalWeight += signal.weight;

      if (signal.value >= signal.thresholds.critical) {
        healthScore += 0; // Add nothing for critical signals
        issues.push(`CRITICAL: ${signal.name} = ${signal.value}`);
      } else if (signal.value >= signal.thresholds.warning) {
        healthScore += signal.weight * 50; // Half credit for warning
        issues.push(`WARNING: ${signal.name} = ${signal.value}`);
      } else {
        healthScore += signal.weight * 100; // Full credit for healthy
      }
    }

    const normalizedScore = healthScore / totalWeight;

    return {
      score: normalizedScore,
      status: this.scoreToStatus(normalizedScore),
      issues,
      recommendation: this.getRecommendation(normalizedScore, issues),
    };
  }

  private scoreToStatus(score: number): 'healthy' | 'degraded' | 'unhealthy' {
    if (score >= 80) return 'healthy';
    if (score >= 50) return 'degraded';
    return 'unhealthy';
  }

  private getRecommendation(score: number, issues: string[]): string {
    if (score >= 80) return 'No action required';
    if (score >= 50) {
      if (issues.some(i => i.includes('latency'))) {
        return 'Consider reducing traffic weight in load balancer';
      }
      return 'Monitor closely, prepare for possible failover';
    }
    if (issues.some(i => i.includes('CRITICAL'))) {
      return 'Initiate failover or remove from rotation immediately';
    }
    return 'Investigate and consider failover';
  }
}

// Usage example
const signals: Signal[] = [
  {
    name: 'error_rate_percent',
    value: 2.5,
    thresholds: { warning: 1, critical: 5 },
    weight: 3, // High importance
  },
  {
    name: 'p99_latency_ms',
    value: 450,
    thresholds: { warning: 500, critical: 2000 },
    weight: 2,
  },
  {
    name: 'cpu_percent',
    value: 75,
    thresholds: { warning: 80, critical: 95 },
    weight: 1,
  },
  {
    name: 'memory_percent',
    value: 85,
    thresholds: { warning: 80, critical: 95 },
    weight: 1,
  },
];

const evaluator = new MultiSignalHealthEvaluator();
const result = evaluator.evaluateHealth(signals);
// result: { score: ~71.4, status: 'degraded',
//   issues: ['WARNING: error_rate_percent = 2.5', 'WARNING: memory_percent = 85'], ... }
```

Effective partial failure detection requires mature observability infrastructure. Metrics, logs, and traces provide the raw signals. Alerting rules and health evaluators process them into actionable information. Investment in observability directly improves detection capability.
The detection system itself must be highly available—otherwise, you can't detect failures when you most need to. This creates a recursive design challenge: how do you build a reliable system to detect unreliable systems?
Distributed Monitoring Architecture:
Production detection systems distribute monitoring across multiple independent vantage points:
1. Local Monitors: Run on the same host or rack as the monitored component. Low latency, high accuracy for local failures. Vulnerable to correlated failures (host dies, monitor dies too).
2. Regional Monitors: Run in the same datacenter/region but different availability zones. Can detect zone-specific issues. Still vulnerable to region-wide failures.
3. Global Monitors: Run in different regions. Can detect region-wide failures. Higher latency, more false positives due to network distance.
4. External Monitors: Commercial services monitoring from outside your infrastructure (Pingdom, Datadog Synthetics, New Relic). Independent failure domain, but limited visibility into internal state.
Consensus Among Monitors:
When monitors disagree, how do you decide? This is where consensus protocols become essential:
Simple Majority: If more than half of monitors report failure, trigger failover. Simple but vulnerable to network partitions where minority has the truth.
Weighted Voting: Assign different weights to different monitors based on their reliability or proximity. Local monitors might get higher weight than global ones.
Leader-Based Decision: One monitor is elected as the decision-maker. Others provide input, but the leader makes the final call. Simpler to reason about, but it introduces a single point of failure for decisions.
```typescript
interface MonitoringVote {
  monitorId: string;
  region: string;
  targetHealthy: boolean;
  confidence: number; // 0-1, based on connection quality
  timestamp: number;
}

interface QuorumDecision {
  isHealthy: boolean;
  confidence: number;
  healthyVotes: number;
  unhealthyVotes: number;
  quorumMet: boolean;
  dissent: string[];
}

class QuorumBasedDetector {
  private votes = new Map<string, MonitoringVote>();

  constructor(
    private quorumSize: number,
    private staleVoteThresholdMs: number = 30000
  ) {}

  recordVote(vote: MonitoringVote): void {
    this.votes.set(vote.monitorId, vote);
  }

  getDecision(): QuorumDecision {
    const now = Date.now();

    // Filter out stale votes
    const freshVotes = Array.from(this.votes.values()).filter(
      v => now - v.timestamp < this.staleVoteThresholdMs
    );

    if (freshVotes.length < this.quorumSize) {
      return {
        isHealthy: true, // Fail open - assume healthy if unsure
        confidence: 0,
        healthyVotes: freshVotes.filter(v => v.targetHealthy).length,
        unhealthyVotes: freshVotes.filter(v => !v.targetHealthy).length,
        quorumMet: false,
        dissent: ['Insufficient monitors reporting'],
      };
    }

    // Weight votes by confidence
    let healthyScore = 0;
    let unhealthyScore = 0;
    for (const vote of freshVotes) {
      if (vote.targetHealthy) {
        healthyScore += vote.confidence;
      } else {
        unhealthyScore += vote.confidence;
      }
    }

    const totalScore = healthyScore + unhealthyScore;
    const isHealthy = healthyScore > unhealthyScore;

    // Calculate dissent (monitors that disagree with majority)
    const majority = isHealthy;
    const dissent = freshVotes
      .filter(v => v.targetHealthy !== majority)
      .map(v => `${v.monitorId} (${v.region}) voted ${v.targetHealthy ? 'healthy' : 'unhealthy'}`);

    return {
      isHealthy,
      confidence: Math.abs(healthyScore - unhealthyScore) / totalScore,
      healthyVotes: freshVotes.filter(v => v.targetHealthy).length,
      unhealthyVotes: freshVotes.filter(v => !v.targetHealthy).length,
      quorumMet: true,
      dissent,
    };
  }
}
```

When the detection system itself fails (not enough monitors reporting, or a network partition), you must choose a default. Fail-Open (assume the monitored component is healthy) means you might miss real failures. Fail-Closed (assume the monitored component is unhealthy) means you might cause unnecessary failovers. Most systems fail open, because unnecessary failovers are usually worse than delayed detection.
Even well-designed detection systems fall into common traps. Understanding these pitfalls helps you avoid them in your own implementations.
Pitfall 1: Detection via Control Plane
A common mistake is having the detection system share failure modes with the monitored system. If health checks go through the same load balancer as user traffic, a load balancer failure blinds your detection.
Solution: Use independent network paths for detection. Health checks should bypass load balancers or use dedicated health check endpoints with direct routing.
Pitfall 2: Health Check Thundering Herd
When multiple monitors all check health at the same interval, they can synchronize and create periodic load spikes. If checkInterval = 5s for 100 monitors, the monitored service sees 100 health checks simultaneously every 5 seconds.
Solution: Add jitter to check intervals. A random offset between zero and one full interval spreads the checks evenly over time.
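A sketch of jittered scheduling (illustrative names; the random source is injectable so the offset math is testable):

```typescript
// Delay until the next check: the base interval plus a random
// offset in [0, interval), so monitors started together drift apart
// instead of checking in lockstep.
function nextCheckDelayMs(
  baseIntervalMs: number,
  rand: () => number = Math.random
): number {
  return baseIntervalMs + Math.floor(rand() * baseIntervalMs);
}

// Self-rescheduling check loop; each iteration picks a fresh jittered delay
function scheduleJitteredCheck(baseIntervalMs: number, check: () => void): void {
  setTimeout(() => {
    check();
    scheduleJitteredCheck(baseIntervalMs, check);
  }, nextCheckDelayMs(baseIntervalMs));
}

console.log(nextCheckDelayMs(5000, () => 0));   // 5000 - no jitter this round
console.log(nextCheckDelayMs(5000, () => 0.5)); // 7500 - half-interval offset
```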
Pitfall 3: The GC Pause Problem
JVM and other garbage-collected runtimes can pause for seconds during full GC. This looks exactly like a failure to detection systems but the service will recover automatically.
Solution: Tune thresholds to accommodate expected GC pauses. Use GC logging to correlate false positives with GC events. Consider using GC-aware health endpoints that pause checks during known GC.
Pitfall 4: Network Partition Blindness
If detection monitors are all in the same network segment as the monitored service, a partition that isolates clients also isolates monitors. From monitors' perspective, everything looks healthy.
Solution: Distribute monitors across network boundaries. Include external synthetic monitoring that simulates real client access patterns.
Regularly test that your detection actually detects failures: 1) Introduce fake failures (chaos engineering) and verify detection triggers. 2) Review false positive and false negative rates from production data. 3) Measure actual time-to-detection compared to your targets. 4) Verify detection works during partial failures, not just complete outages.
Failure detection is the foundation upon which all failover mechanisms rest. Without accurate, timely detection, even the best failover automation is useless. Let's consolidate the key principles: perfect detection is impossible, so design for explicit tradeoffs; tune aggressiveness to your tolerance for false positives versus detection latency; combine multiple signals to catch partial failures; make the detection system itself highly available; and test detection regularly against both real and injected failures.
What's Next:
With detection established as our eyes into system health, we turn to the critical question of timing: Once we detect a failure, how quickly should we react? The next page explores failover timing—the careful balance between speed and safety that determines whether failover helps or harms.
You now understand the fundamental challenges of failure detection, can implement heartbeat and health check mechanisms, evaluate detection algorithms for your use case, and design detection architectures that are themselves highly available. Next: Failover Timing.