On the morning of February 28, 2017, a typo in an Amazon S3 command cascaded into one of the most significant cloud outages in history. Massive swaths of the internet went dark. But here's what many don't realize: the detection systems worked perfectly. They detected the failures within seconds. The problem was that too many things failed at once, overwhelming both automated and manual response capabilities.
This incident illustrates a fundamental truth about failover detection: detecting that something is wrong is only the beginning. The real challenge is everything that follows: deciding whether the signal can be trusted, and reacting quickly without making the situation worse.
This page provides the comprehensive foundation you need to design detection systems that are accurate, timely, and resilient.
By the end of this page, you will master: heartbeat mechanisms and their configuration, health check design for different system types, failure detection algorithms and their tradeoffs, handling partial failures and gray states, detection system architecture for high availability, and common detection pitfalls with their solutions.
At its core, failure detection in distributed systems faces an impossible problem: you cannot reliably distinguish between a failed node and a slow node.
This is not a limitation of current technology; it's a fundamental property of asynchronous distributed systems, closely related to the FLP impossibility result. If a node stops responding, we cannot know with certainty whether the node has crashed, the node is merely slow, the network has dropped our messages, or the replies are simply delayed in transit.
Since we cannot achieve perfect detection, we must instead design systems that make probabilistic decisions with understood tradeoffs. Every detection system balances two types of errors: false positives (declaring a healthy node failed, which triggers unnecessary failover) and false negatives (failing to detect a genuinely failed node, which prolongs the outage).
The Detection Tradeoff Space:
Every detection system operates within a tradeoff space defined by a few parameters: the check interval (how often you test), the timeout applied to each check, and the failure threshold (how many misses before declaring failure).
Aggressive detection (short timeouts, frequent checks) reduces detection latency but increases false positives. Conservative detection (long timeouts, multiple confirmations) reduces false positives but increases detection latency. There is no universally optimal point—the right configuration depends on your system's specific requirements.
Academic research has produced sophisticated approaches like the Phi Accrual Failure Detector (used in Cassandra and Akka). Instead of returning binary healthy/unhealthy, it returns a suspicion level (phi) that increases over time since the last heartbeat. This allows applications to make nuanced decisions based on their own risk tolerance. A phi of 1 means roughly a 10% chance that declaring failure now would be a mistake; a phi of 3, roughly a 0.1% chance.
Heartbeats are the most fundamental failure detection mechanism. One component periodically sends a signal to another to indicate it's alive. Absence of heartbeats for a configured duration triggers failure detection.
Heartbeat Architecture Patterns:
Pattern 1: Push Heartbeats (Node to Monitor)
The monitored node periodically sends heartbeat messages to a monitoring service. If the monitor doesn't receive a heartbeat within the expected window, it marks the node as potentially failed.
Advantages:
- Simple to implement; the monitor passively records heartbeat arrivals
- Low network overhead: one small message per node per interval
- Outbound-only traffic from nodes, which is firewall-friendly

Disadvantages:
- The central monitor is a single point of failure
- Every node must be configured with the monitor's address
- A silent node tells you nothing about why it went silent
```typescript
// Assumed helpers for local metrics collection (not shown here)
declare function getCpuUsage(): number;
declare function getMemoryUsage(): number;
declare function getConnectionCount(): number;
declare function getPendingRequestCount(): number;

interface Heartbeat {
  nodeId: string;
  timestamp: number;
  sequenceNumber: number;
  metrics: Record<string, number>;
}

interface NodeState {
  lastHeartbeat: number;
  lastSequence: number;
  consecutiveMisses: number;
  status: 'healthy' | 'suspicious' | 'dead';
  metrics: Record<string, number>;
}

interface HttpClient {
  post(url: string, body: unknown, options: { timeout: number }): Promise<void>;
}

// Monitored Node - Heartbeat Sender
class HeartbeatSender {
  private intervalId: NodeJS.Timeout | null = null;
  private sequenceNumber: number = 0;

  constructor(
    private nodeId: string,
    private monitorEndpoint: string,
    private intervalMs: number = 5000,
    private httpClient: HttpClient
  ) {}

  start(): void {
    this.intervalId = setInterval(() => this.sendHeartbeat(), this.intervalMs);
    this.sendHeartbeat(); // Send immediately on start
  }

  private async sendHeartbeat(): Promise<void> {
    const heartbeat: Heartbeat = {
      nodeId: this.nodeId,
      timestamp: Date.now(),
      sequenceNumber: ++this.sequenceNumber,
      // Include health indicators in heartbeat
      metrics: {
        cpuPercent: getCpuUsage(),
        memoryPercent: getMemoryUsage(),
        activeConnections: getConnectionCount(),
        pendingRequests: getPendingRequestCount(),
      }
    };

    try {
      await this.httpClient.post(this.monitorEndpoint, heartbeat, {
        timeout: this.intervalMs / 2 // Timeout before next heartbeat
      });
    } catch (error) {
      // Log but don't crash - heartbeat sender should be resilient
      console.error('Failed to send heartbeat', error);
    }
  }

  stop(): void {
    if (this.intervalId) {
      clearInterval(this.intervalId);
      this.intervalId = null;
    }
  }
}

// Central Monitor - Heartbeat Receiver
class HeartbeatMonitor {
  private nodeStates = new Map<string, NodeState>();
  private readonly deadlineMultiplier = 3; // Miss 3 heartbeats = dead

  constructor(
    private expectedIntervalMs: number,
    private onNodeDead: (nodeId: string) => void,
    private onNodeRecovered: (nodeId: string) => void
  ) {
    // Check for dead nodes periodically
    setInterval(() => this.checkDeadNodes(), this.expectedIntervalMs);
  }

  receiveHeartbeat(heartbeat: Heartbeat): void {
    const existingState = this.nodeStates.get(heartbeat.nodeId);

    const newState: NodeState = {
      lastHeartbeat: heartbeat.timestamp,
      lastSequence: heartbeat.sequenceNumber,
      consecutiveMisses: 0,
      status: 'healthy',
      metrics: heartbeat.metrics,
    };

    this.nodeStates.set(heartbeat.nodeId, newState);

    // Node recovery detection
    if (existingState?.status === 'dead') {
      this.onNodeRecovered(heartbeat.nodeId);
    }
  }

  private checkDeadNodes(): void {
    const now = Date.now();
    const deadline = this.expectedIntervalMs * this.deadlineMultiplier;

    for (const [nodeId, state] of this.nodeStates.entries()) {
      const timeSinceHeartbeat = now - state.lastHeartbeat;

      if (timeSinceHeartbeat > deadline && state.status !== 'dead') {
        state.status = 'dead';
        state.consecutiveMisses = Math.floor(timeSinceHeartbeat / this.expectedIntervalMs);
        this.onNodeDead(nodeId);
      } else if (timeSinceHeartbeat > this.expectedIntervalMs && state.status !== 'dead') {
        // Guard on status so a dead node isn't flipped back to 'suspicious',
        // which would re-fire onNodeDead on the next sweep
        state.status = 'suspicious';
        state.consecutiveMisses++;
      }
    }
  }
}
```

Pattern 2: Pull Heartbeats (Monitor Polls Node)
The monitoring service periodically queries monitored nodes with health check requests. If a node fails to respond, it's marked as potentially failed.
Advantages:
- The monitor controls the check cadence and can adjust it centrally
- Integrates naturally with health check endpoints, so checks can verify real functionality
- No heartbeat code is required on the monitored nodes

Disadvantages:
- The central monitor remains a single point of failure
- Higher network traffic: each check is a full request/response round trip
- The monitor must maintain an accurate inventory of all nodes
Pattern 3: Peer-to-Peer Heartbeats (Gossip)
Nodes periodically exchange heartbeats directly with each other. Each node maintains a local view of cluster membership. Node failures are detected locally and gossiped to other nodes.
Advantages:
- No single point of failure; detection survives the loss of any individual node
- Scales naturally, since each node exchanges state with only a few peers per round

Disadvantages:
- Significantly more complex to implement correctly
- Detection time is variable, since failure information spreads probabilistically
- Nodes can briefly hold inconsistent views of cluster membership
| Characteristic | Push (Node→Monitor) | Pull (Monitor→Node) | Gossip (Peer-to-Peer) |
|---|---|---|---|
| Single point of failure | Yes (monitor) | Yes (monitor) | No |
| Network traffic | Low | Medium | High |
| Detection speed | Fast | Medium | Variable |
| Implementation complexity | Simple | Simple | Complex |
| Scales to 1000+ nodes | With sharded monitors | With sharded monitors | Yes, naturally |
| Best for | Central monitoring | Health check integration | Decentralized clusters |
The heartbeat interval directly impacts detection latency. If you heartbeat every 5 seconds and require 3 missed heartbeats before declaring failure, your minimum detection time is 15 seconds. For sub-second detection, you need sub-second heartbeats—but this increases network load and the risk of false positives from momentary delays.
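The arithmetic above can be captured in a one-line helper (a sketch; the function name is illustrative):

```typescript
// Minimum time to declare failure for a heartbeat scheme:
// `missThreshold` consecutive intervals must elapse with no heartbeat.
function minDetectionMs(intervalMs: number, missThreshold: number): number {
  return intervalMs * missThreshold;
}

console.log(minDetectionMs(5000, 3)); // 15000 - the 15s from the example above
console.log(minDetectionMs(500, 3));  // 1500 - sub-second heartbeats still need 1.5s
```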
While heartbeats verify that a process is alive, health checks verify that a service is functioning correctly. A database might be alive (heartbeating) but unable to accept queries (unhealthy). Effective health checks bridge this gap.
The Health Check Spectrum:
Health checks exist on a spectrum from shallow to deep: shallow checks confirm only that the process responds at all; intermediate checks verify that critical dependencies are reachable; deep checks exercise real functionality and gather diagnostic detail.
Designing Effective Health Endpoints:
A well-designed health check endpoint provides graduated information for different consumers:
```typescript
// Assumed database client (e.g., node-postgres), shown as a declaration
declare const db: { query(sql: string): Promise<{ rows: any[] }> };

interface HealthCheckResponse {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  version: string;
  uptime: number;
  checks: ComponentCheck[];
}

interface ComponentCheck {
  name: string;
  status: 'pass' | 'warn' | 'fail';
  duration_ms: number;
  message?: string;
  details?: Record<string, unknown>;
}

class HealthController {
  // Liveness probe: Is the process alive?
  // Used by orchestrators to know when to restart.
  // Should be extremely fast and never fail due to dependencies.
  async getLiveness(): Promise<{ status: 'ok' }> {
    return { status: 'ok' };
  }

  // Readiness probe: Is the service ready to receive traffic?
  // Used by load balancers to know when to route traffic.
  // Should check critical dependencies.
  async getReadiness(): Promise<HealthCheckResponse> {
    const checks: ComponentCheck[] = [];

    // Check database connection
    checks.push(await this.checkDatabase());
    // Check cache connection
    checks.push(await this.checkCache());
    // Check required external services
    checks.push(await this.checkPaymentGateway());

    const overallStatus = this.determineOverallStatus(checks);

    return {
      status: overallStatus,
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
      uptime: process.uptime(),
      checks,
    };
  }

  // Deep health: comprehensive diagnostic information.
  // Used by operators for troubleshooting.
  // Can be expensive; call sparingly.
  async getDeepHealth(): Promise<HealthCheckResponse> {
    const checks: ComponentCheck[] = [];

    // All readiness checks
    checks.push(await this.checkDatabase());
    checks.push(await this.checkCache());
    checks.push(await this.checkPaymentGateway());

    // Additional diagnostic checks
    checks.push(await this.checkDatabaseReplicationLag());
    checks.push(await this.checkMessageQueueDepth());
    checks.push(await this.checkDiskSpace());
    checks.push(await this.checkMemoryPressure());
    checks.push(await this.checkCertificateExpiry());

    // Performance verification
    checks.push(await this.executeTestQuery());

    return {
      status: this.determineOverallStatus(checks),
      timestamp: new Date().toISOString(),
      version: process.env.APP_VERSION || 'unknown',
      uptime: process.uptime(),
      checks,
    };
  }

  // (checkCache, checkPaymentGateway, and the other diagnostic
  // checks follow the same shape as checkDatabase; omitted for brevity)

  private async checkDatabase(): Promise<ComponentCheck> {
    const start = Date.now();
    try {
      await db.query('SELECT 1');
      return {
        name: 'database',
        status: 'pass',
        duration_ms: Date.now() - start,
      };
    } catch (error) {
      return {
        name: 'database',
        status: 'fail',
        duration_ms: Date.now() - start,
        message: (error as Error).message,
      };
    }
  }

  private async checkDatabaseReplicationLag(): Promise<ComponentCheck> {
    const start = Date.now();
    try {
      // On a PostgreSQL replica: bytes of WAL received but not yet replayed
      const lag = await db.query(
        'SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS lag_bytes'
      );
      const lagBytes = Number(lag.rows[0].lag_bytes) || 0;
      const lagMb = lagBytes / (1024 * 1024);

      let status: 'pass' | 'warn' | 'fail';
      if (lagMb < 1) status = 'pass';
      else if (lagMb < 100) status = 'warn';
      else status = 'fail';

      return {
        name: 'database_replication_lag',
        status,
        duration_ms: Date.now() - start,
        details: { lag_mb: lagMb },
      };
    } catch (error) {
      return {
        name: 'database_replication_lag',
        status: 'fail',
        duration_ms: Date.now() - start,
        message: (error as Error).message,
      };
    }
  }

  private determineOverallStatus(checks: ComponentCheck[]): 'healthy' | 'degraded' | 'unhealthy' {
    if (checks.some(c => c.status === 'fail')) return 'unhealthy';
    if (checks.some(c => c.status === 'warn')) return 'degraded';
    return 'healthy';
  }
}
```

Common mistakes:
1. Health checks that are too slow (>1s) cause load balancers to time out.
2. Health checks that fail due to optional dependencies cause false positives.
3. Health checks that mutate state (create test records) cause operational issues.
4. Deep health checks called too frequently overwhelm the system.
Liveness vs Readiness in Kubernetes:
Kubernetes distinguishes between liveness and readiness probes, a pattern valuable even outside Kubernetes:
Liveness Probe: "Should we restart this container?" If liveness fails, the container is killed and restarted. Should only fail if the process is genuinely broken and restart would help. Should NOT fail due to dependency issues—restarting won't fix those.
Readiness Probe: "Should we send traffic to this container?" If readiness fails, the container is removed from load balancer rotation but not restarted. Should fail during startup, during graceful shutdown, and when dependencies are unavailable.
Startup Probe: "Has this container finished starting?" Prevents liveness checks from killing slow-starting applications. Only used during initial startup.
Raw heartbeat data must be processed through detection algorithms to produce actionable failure signals. Different algorithms offer different tradeoffs between detection speed and false positive rates.
Algorithm 1: Fixed Timeout (Naive)
The simplest approach: if no heartbeat arrives within T seconds, declare failure.
Advantages: Simple to implement and understand.

Disadvantages: Vulnerable to network jitter; requires manual tuning of T.
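A minimal sketch of the fixed-timeout approach, with an injectable clock so the behavior is easy to exercise (names are illustrative):

```typescript
// Fixed-timeout detector: declare failure if no heartbeat has
// arrived within `timeoutMs`. A single network hiccup longer
// than the timeout is enough to cause a false positive.
class FixedTimeoutDetector {
  private lastHeartbeat: number | null = null;

  constructor(
    private timeoutMs: number,
    private now: () => number = Date.now
  ) {}

  recordHeartbeat(): void {
    this.lastHeartbeat = this.now();
  }

  isFailed(): boolean {
    // No heartbeat seen yet counts as failed once monitoring starts
    if (this.lastHeartbeat === null) return true;
    return this.now() - this.lastHeartbeat > this.timeoutMs;
  }
}

let clock = 0;
const detector = new FixedTimeoutDetector(10_000, () => clock);
detector.recordHeartbeat();
clock = 9_000;
console.log(detector.isFailed()); // false - still within the timeout
clock = 11_000;
console.log(detector.isFailed()); // true - timeout T exceeded
```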
Algorithm 2: Consecutive Failures
Require N consecutive missed heartbeats (or failed health checks) before declaring failure. This smooths over transient issues.
Configuration: failureThreshold = 3, checkInterval = 5s → Detection in 15s minimum
Advantages: Filters transient failures.

Disadvantages: Adds up to N × interval to detection latency.
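A minimal sketch of the consecutive-failures rule (illustrative; any single success resets the counter):

```typescript
// Consecutive-failures detector: N failed checks in a row are
// required before declaring failure; one success resets the count.
class ConsecutiveFailureDetector {
  private consecutiveFailures = 0;

  constructor(private failureThreshold: number) {}

  recordResult(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }

  isFailed(): boolean {
    return this.consecutiveFailures >= this.failureThreshold;
  }
}

const det = new ConsecutiveFailureDetector(3);
det.recordResult(false);
det.recordResult(false);
det.recordResult(true);  // transient blip absorbed: counter resets
det.recordResult(false);
det.recordResult(false);
console.log(det.isFailed()); // false - only 2 consecutive failures
det.recordResult(false);
console.log(det.isFailed()); // true - 3 in a row
```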
```typescript
interface DetectionResult {
  isHealthy: boolean;
  failureCount: number;
  failureRate: number;
  windowSizeMs: number;
  threshold: number;
}

/**
 * Sliding Window Failure Detector
 *
 * Instead of requiring consecutive failures, this algorithm
 * tracks failures within a time window. This is more resilient
 * to intermittent issues while still detecting real failures.
 */
class SlidingWindowDetector {
  private failureTimestamps: number[] = [];

  constructor(
    private windowSizeMs: number,     // e.g., 60000 (1 minute)
    private failureThreshold: number, // e.g., 5 failures
    private successThreshold: number  // e.g., 3 successes to recover
  ) {}

  recordResult(success: boolean): DetectionResult {
    const now = Date.now();

    // Clean old failures outside window
    this.failureTimestamps = this.failureTimestamps.filter(
      ts => now - ts < this.windowSizeMs
    );

    if (!success) {
      this.failureTimestamps.push(now);
    }

    const failureCount = this.failureTimestamps.length;
    const failureRate = failureCount / (this.windowSizeMs / 1000); // failures per second

    return {
      isHealthy: failureCount < this.failureThreshold,
      failureCount,
      failureRate,
      windowSizeMs: this.windowSizeMs,
      threshold: this.failureThreshold,
    };
  }
}

/**
 * Phi Accrual Failure Detector
 *
 * Returns a suspicion level instead of binary healthy/unhealthy.
 * Based on the paper "The φ Accrual Failure Detector" by Hayashibara et al.
 * Used in Apache Cassandra and Akka.
 */
class PhiAccrualDetector {
  private arrivalTimes: number[] = [];
  private readonly maxSamples = 1000;

  recordHeartbeat(): void {
    this.arrivalTimes.push(Date.now());
    if (this.arrivalTimes.length > this.maxSamples) {
      this.arrivalTimes.shift();
    }
  }

  getPhi(): number {
    if (this.arrivalTimes.length < 2) {
      return 0; // Not enough data
    }

    // Calculate inter-arrival times
    const intervals: number[] = [];
    for (let i = 1; i < this.arrivalTimes.length; i++) {
      intervals.push(this.arrivalTimes[i] - this.arrivalTimes[i - 1]);
    }

    // Calculate mean and standard deviation
    const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
    const variance = intervals.reduce((sum, val) => sum + Math.pow(val - mean, 2), 0) / intervals.length;
    const stdDev = Math.sqrt(variance);

    // Time since last heartbeat
    const timeSinceLast = Date.now() - this.arrivalTimes[this.arrivalTimes.length - 1];

    // Phi calculation (simplified)
    // Higher phi = stronger suspicion of failure:
    // phi of 1 ≈ 10% chance a failure declaration is wrong,
    // phi of 2 ≈ 1%, phi of 3 ≈ 0.1%.
    return this.calculatePhi(timeSinceLast, mean, stdDev);
  }

  private calculatePhi(timeSinceLast: number, mean: number, stdDev: number): number {
    // Guard against zero standard deviation (perfectly regular heartbeats)
    const safeStdDev = Math.max(stdDev, 0.1);

    // Crude normal-distribution stand-in for P(heartbeat arrives later
    // than now), following phi = -log10(P_later); only meaningful once
    // timeSinceLast exceeds the mean inter-arrival time
    const y = (timeSinceLast - mean) / safeStdDev;
    const pLater = (1 / Math.sqrt(2 * Math.PI)) * Math.exp(-y * y / 2);
    return -Math.log10(pLater);
  }

  isLikelyFailed(threshold: number = 8): boolean {
    return this.getPhi() > threshold;
  }
}
```

Algorithm 3: Adaptive Detection
Adaptive detectors automatically tune their parameters based on observed behavior. If a node typically responds in 10ms but occasionally takes 100ms, the detector adjusts its thresholds accordingly.
The Phi Accrual Detector is the canonical example: it builds a statistical model of inter-arrival times and uses this to calculate the probability that a node has failed given the time since the last heartbeat.
Key insight: A 5-second delay means different things for different nodes. For a node that typically responds in 1ms, it's almost certainly failed. For a node that sometimes takes 4 seconds due to garbage collection, it might be fine.
Algorithm 4: Quorum-Based Detection
Multiple monitors independently observe the node. Failure is only declared if a quorum of monitors agree. This prevents false positives from a single monitor's network issues.
| Algorithm | Detection Speed | False Positive Risk | Complexity | Best For |
|---|---|---|---|---|
| Fixed Timeout | Fast | High | Low | Simple systems, local monitoring |
| Consecutive Failures | Slow | Medium | Low | Most production systems |
| Sliding Window | Medium | Medium | Medium | Fluctuating workloads |
| Phi Accrual | Adaptive | Low | High | Heterogeneous nodes |
| Quorum-Based | Medium | Very Low | High | Critical systems, distributed monitoring |
Production systems often combine multiple algorithms. For example: use consecutive failures for quick detection, then require quorum confirmation before triggering failover. This captures the speed of simple algorithms while adding the reliability of quorum consensus.
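That combination might be sketched like this (illustrative names; the quorum check is assumed to be provided by separate monitoring infrastructure, shown here as a synchronous callback for simplicity):

```typescript
// Hybrid detection: a fast local consecutive-miss counter raises
// suspicion, but failover fires only after quorum confirmation.
type QuorumCheck = (nodeId: string) => boolean; // true = quorum agrees node is down

class HybridDetector {
  private misses = new Map<string, number>();

  constructor(
    private missThreshold: number,
    private confirmWithQuorum: QuorumCheck,
    private onConfirmedFailure: (nodeId: string) => void
  ) {}

  recordMiss(nodeId: string): void {
    const count = (this.misses.get(nodeId) ?? 0) + 1;
    this.misses.set(nodeId, count);

    // Cheap local signal first, expensive quorum confirmation second
    if (count >= this.missThreshold && this.confirmWithQuorum(nodeId)) {
      this.onConfirmedFailure(nodeId);
    }
  }

  recordHeartbeat(nodeId: string): void {
    this.misses.set(nodeId, 0);
  }
}

const confirmed: string[] = [];
const hybrid = new HybridDetector(2, () => true, id => confirmed.push(id));
hybrid.recordMiss('node-a'); // below threshold: quorum never consulted
hybrid.recordMiss('node-a'); // threshold met and quorum agrees
console.log(confirmed); // ['node-a']
```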
Real-world failures rarely present as clean binary states. More commonly, systems experience partial failures or gray states where behavior is degraded but not completely failed. Effective detection must handle these nuanced states.
Types of Partial Failures:
1. Degraded Response Times
The service responds, but significantly slower than normal. It might be under heavy load, experiencing resource contention, or suffering from a slowly degrading component (filling disk, memory leak).
Detection approach: Monitor latency percentiles (p50, p95, p99). Alert when latency exceeds thresholds. Consider removing from load balancer rotation even if technically healthy.
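A rough sketch of a percentile-based degradation check (the nearest-rank percentile method and the 3x threshold are illustrative choices):

```typescript
// Nearest-rank percentile over a sorted sample of latencies
function percentile(sortedMs: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sortedMs.length) - 1;
  return sortedMs[Math.max(0, Math.min(idx, sortedMs.length - 1))];
}

// Flag degradation when recent p99 exceeds a multiple of the baseline,
// even though every request still technically succeeds
function isDegraded(samplesMs: number[], baselineP99Ms: number, factor = 3): boolean {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return percentile(sorted, 99) > baselineP99Ms * factor;
}

const normal = [10, 12, 11, 9, 14, 13, 10, 12];
console.log(isDegraded(normal, 15)); // false
const slow = [10, 12, 500, 480, 510, 13, 490, 505];
console.log(isDegraded(slow, 15)); // true - p99 far beyond 3x baseline
```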
2. Elevated Error Rates
Some requests succeed, but an unusually high percentage fail. This could indicate database connection exhaustion, downstream dependency issues, or partial data corruption.
Detection approach: Track error rate as a percentage of total requests. Use statistical significance testing to distinguish real problems from normal variance.
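One simple form of that statistical test is a one-sided z-test on the error proportion (a sketch; the threshold of 3 standard errors is an illustrative choice):

```typescript
// Is the observed error rate significantly above the baseline rate,
// or within normal sampling variance for this many requests?
function errorRateIsSignificant(
  errors: number,
  total: number,
  baselineRate: number,
  zThreshold = 3 // roughly 99.9% one-sided confidence
): boolean {
  if (total === 0) return false;
  const observed = errors / total;
  // Standard error of the proportion under the baseline hypothesis
  const stdErr = Math.sqrt((baselineRate * (1 - baselineRate)) / total);
  if (stdErr === 0) return observed > baselineRate;
  return (observed - baselineRate) / stdErr > zThreshold;
}

// 30 errors in 1000 requests against a 1% baseline: clearly elevated
console.log(errorRateIsSignificant(30, 1000, 0.01)); // true
// 12 errors in 1000 requests: plausibly just noise
console.log(errorRateIsSignificant(12, 1000, 0.01)); // false
```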
Multi-Signal Detection:
To catch partial failures, effective detection systems combine multiple signals:
Latency Signals: p50, p95, p99 response times compared to baseline

Error Signals: 4xx rate, 5xx rate, timeout rate

Saturation Signals: CPU utilization, memory pressure, connection pool usage, queue depth

Traffic Signals: Request rate compared to expected patterns

Synthetic Signals: Active probing that exercises code paths beyond /health
No single signal catches all failure modes. The combination provides comprehensive coverage.
```typescript
interface Signal {
  name: string;
  value: number;
  thresholds: {
    warning: number;
    critical: number;
  };
  weight: number; // Importance in overall score
}

interface HealthEvaluation {
  score: number;
  status: 'healthy' | 'degraded' | 'unhealthy';
  issues: string[];
  recommendation: string;
}

class MultiSignalHealthEvaluator {
  evaluateHealth(signals: Signal[]): HealthEvaluation {
    let totalWeight = 0;
    let healthScore = 0; // 0-100, higher is healthier
    const issues: string[] = [];

    for (const signal of signals) {
      totalWeight += signal.weight;

      if (signal.value >= signal.thresholds.critical) {
        healthScore += 0; // Add nothing for critical signals
        issues.push(`CRITICAL: ${signal.name} = ${signal.value}`);
      } else if (signal.value >= signal.thresholds.warning) {
        healthScore += signal.weight * 50; // Half credit for warning
        issues.push(`WARNING: ${signal.name} = ${signal.value}`);
      } else {
        healthScore += signal.weight * 100; // Full credit for healthy
      }
    }

    const normalizedScore = healthScore / totalWeight;

    return {
      score: normalizedScore,
      status: this.scoreToStatus(normalizedScore),
      issues,
      recommendation: this.getRecommendation(normalizedScore, issues),
    };
  }

  private scoreToStatus(score: number): 'healthy' | 'degraded' | 'unhealthy' {
    if (score >= 80) return 'healthy';
    if (score >= 50) return 'degraded';
    return 'unhealthy';
  }

  private getRecommendation(score: number, issues: string[]): string {
    if (score >= 80) return 'No action required';
    if (score >= 50) {
      if (issues.some(i => i.includes('latency'))) {
        return 'Consider reducing traffic weight in load balancer';
      }
      return 'Monitor closely, prepare for possible failover';
    }
    if (issues.some(i => i.includes('CRITICAL'))) {
      return 'Initiate failover or remove from rotation immediately';
    }
    return 'Investigate and consider failover';
  }
}

// Usage example
const signals: Signal[] = [
  {
    name: 'error_rate_percent',
    value: 2.5,
    thresholds: { warning: 1, critical: 5 },
    weight: 3, // High importance
  },
  {
    name: 'p99_latency_ms',
    value: 450,
    thresholds: { warning: 500, critical: 2000 },
    weight: 2,
  },
  {
    name: 'cpu_percent',
    value: 75,
    thresholds: { warning: 80, critical: 95 },
    weight: 1,
  },
  {
    name: 'memory_percent',
    value: 85,
    thresholds: { warning: 80, critical: 95 },
    weight: 1,
  },
];

const evaluator = new MultiSignalHealthEvaluator();
const result = evaluator.evaluateHealth(signals);
// result: { score: ~71.4, status: 'degraded',
//   issues: ['WARNING: error_rate_percent = 2.5', 'WARNING: memory_percent = 85'], ... }
```

Effective partial failure detection requires mature observability infrastructure. Metrics, logs, and traces provide the raw signals. Alerting rules and health evaluators process them into actionable information. Investment in observability directly improves detection capability.
The detection system itself must be highly available—otherwise, you can't detect failures when you most need to. This creates a recursive design challenge: how do you build a reliable system to detect unreliable systems?
Distributed Monitoring Architecture:
Production detection systems distribute monitoring across multiple independent vantage points:
1. Local Monitors: Run on the same host or rack as the monitored component. Low latency, high accuracy for local failures. Vulnerable to correlated failures (host dies, monitor dies too).
2. Regional Monitors: Run in the same datacenter/region but different availability zones. Can detect zone-specific issues. Still vulnerable to region-wide failures.
3. Global Monitors: Run in different regions. Can detect region-wide failures. Higher latency, more false positives due to network distance.
4. External Monitors: Commercial services monitoring from outside your infrastructure (Pingdom, Datadog Synthetics, New Relic). Independent failure domain, but limited visibility into internal state.
Consensus Among Monitors:
When monitors disagree, how do you decide? This is where consensus protocols become essential:
Simple Majority: If more than half of monitors report failure, trigger failover. Simple but vulnerable to network partitions where minority has the truth.
Weighted Voting: Assign different weights to different monitors based on their reliability or proximity. Local monitors might get higher weight than global ones.
Leader-Based Decision: One monitor is elected as the decision-maker. Others provide input, but the leader makes the final call. Simpler to reason about, but it introduces a single point of failure for decisions.
```typescript
interface MonitoringVote {
  monitorId: string;
  region: string;
  targetHealthy: boolean;
  confidence: number; // 0-1, based on connection quality
  timestamp: number;
}

interface QuorumDecision {
  isHealthy: boolean;
  confidence: number;
  healthyVotes: number;
  unhealthyVotes: number;
  quorumMet: boolean;
  dissent: string[];
}

class QuorumBasedDetector {
  private votes = new Map<string, MonitoringVote>();

  constructor(
    private quorumSize: number,
    private staleVoteThresholdMs: number = 30000
  ) {}

  recordVote(vote: MonitoringVote): void {
    this.votes.set(vote.monitorId, vote);
  }

  getDecision(): QuorumDecision {
    const now = Date.now();

    // Filter out stale votes
    const freshVotes = Array.from(this.votes.values()).filter(
      v => now - v.timestamp < this.staleVoteThresholdMs
    );

    if (freshVotes.length < this.quorumSize) {
      return {
        isHealthy: true, // Fail open - assume healthy if unsure
        confidence: 0,
        healthyVotes: freshVotes.filter(v => v.targetHealthy).length,
        unhealthyVotes: freshVotes.filter(v => !v.targetHealthy).length,
        quorumMet: false,
        dissent: ['Insufficient monitors reporting'],
      };
    }

    // Weight votes by confidence
    let healthyScore = 0;
    let unhealthyScore = 0;
    for (const vote of freshVotes) {
      if (vote.targetHealthy) {
        healthyScore += vote.confidence;
      } else {
        unhealthyScore += vote.confidence;
      }
    }

    const totalScore = healthyScore + unhealthyScore;
    const isHealthy = healthyScore > unhealthyScore;

    // Calculate dissent (monitors that disagree with majority)
    const majority = isHealthy;
    const dissent = freshVotes
      .filter(v => v.targetHealthy !== majority)
      .map(v => `${v.monitorId} (${v.region}) voted ${v.targetHealthy ? 'healthy' : 'unhealthy'}`);

    return {
      isHealthy,
      confidence: Math.abs(healthyScore - unhealthyScore) / totalScore,
      healthyVotes: freshVotes.filter(v => v.targetHealthy).length,
      unhealthyVotes: freshVotes.filter(v => !v.targetHealthy).length,
      quorumMet: true,
      dissent,
    };
  }
}
```

When the detection system itself fails (not enough monitors reporting, or a network partition), you must choose a default. Fail-Open (assume the monitored component is healthy) means you might miss real failures. Fail-Closed (assume the monitored component is unhealthy) means you might cause unnecessary failovers. Most systems fail open, because unnecessary failovers are usually worse than delayed detection.
Even well-designed detection systems fall into common traps. Understanding these pitfalls helps you avoid them in your own implementations.
Pitfall 1: Detection via Control Plane
A common mistake is having the detection system share failure modes with the monitored system. If health checks go through the same load balancer as user traffic, a load balancer failure blinds your detection.
Solution: Use independent network paths for detection. Health checks should bypass load balancers or use dedicated health check endpoints with direct routing.
Pitfall 2: Health Check Thundering Herd
When multiple monitors all check health at the same interval, they can synchronize and create periodic load spikes. If checkInterval = 5s for 100 monitors, the monitored service sees 100 health checks simultaneously every 5 seconds.
Solution: Add jitter to check intervals. A random offset between zero and one full interval spreads the checks evenly over time.
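A sketch of jittered scheduling (illustrative names; the random source is injectable so the offset math is testable):

```typescript
// Delay until the next check: the base interval plus a random
// offset in [0, interval), so monitors started together drift apart
// instead of checking in lockstep.
function nextCheckDelayMs(
  baseIntervalMs: number,
  rand: () => number = Math.random
): number {
  return baseIntervalMs + Math.floor(rand() * baseIntervalMs);
}

// Self-rescheduling check loop; each iteration picks a fresh jittered delay
function scheduleJitteredCheck(baseIntervalMs: number, check: () => void): void {
  setTimeout(() => {
    check();
    scheduleJitteredCheck(baseIntervalMs, check);
  }, nextCheckDelayMs(baseIntervalMs));
}

console.log(nextCheckDelayMs(5000, () => 0));   // 5000 - no jitter this round
console.log(nextCheckDelayMs(5000, () => 0.5)); // 7500 - half-interval offset
```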
Pitfall 3: The GC Pause Problem
JVM and other garbage-collected runtimes can pause for seconds during full GC. This looks exactly like a failure to detection systems but the service will recover automatically.
Solution: Tune thresholds to accommodate expected GC pauses. Use GC logging to correlate false positives with GC events. Consider using GC-aware health endpoints that pause checks during known GC.
Pitfall 4: Network Partition Blindness
If detection monitors are all in the same network segment as the monitored service, a partition that isolates clients also isolates monitors. From monitors' perspective, everything looks healthy.
Solution: Distribute monitors across network boundaries. Include external synthetic monitoring that simulates real client access patterns.
Regularly test that your detection actually detects failures: 1) Introduce fake failures (chaos engineering) and verify detection triggers. 2) Review false positive and false negative rates from production data. 3) Measure actual time-to-detection compared to your targets. 4) Verify detection works during partial failures, not just complete outages.
Failure detection is the foundation upon which all failover mechanisms rest. Without accurate, timely detection, even the best failover automation is useless. Let's consolidate the key principles: perfect detection is impossible, so design for explicit tradeoffs; tune aggressiveness to your tolerance for false positives versus detection latency; combine multiple signals to catch partial failures; make the detection system itself highly available; and test detection regularly against both real and injected failures.
What's Next:
With detection established as our eyes into system health, we turn to the critical question of timing: Once we detect a failure, how quickly should we react? The next page explores failover timing—the careful balance between speed and safety that determines whether failover helps or harms.
You now understand the fundamental challenges of failure detection, can implement heartbeat and health check mechanisms, evaluate detection algorithms for your use case, and design detection architectures that are themselves highly available. Next: Failover Timing.