In the autumn of 2017, a major US airline's reservation system went dark. Flights were grounded, passengers stranded, and the company lost an estimated $150 million in a single day. The postmortem revealed a chilling truth: the failed servers had shown warning signs for hours, but the health monitoring system had categorized them as 'healthy' until the moment of complete collapse.
This catastrophe illustrates a fundamental truth about distributed systems: a system is only as reliable as its ability to detect and respond to failures. Health checks are the nervous system of modern infrastructure—the mechanism by which load balancers, orchestrators, and service meshes determine which instances can serve traffic and which have become liabilities.
But not all health checks are created equal. The distinction between active and passive health checking fundamentally shapes how quickly you detect failures, how much overhead you incur, and how accurately you model server health. Understanding this distinction isn't merely academic—it's the difference between systems that gracefully handle failures and systems that amplify them into cascading outages.
By the end of this page, you will understand the fundamental mechanisms of active and passive health checks, their architectural trade-offs, when to use each approach, and how to combine them for robust failure detection. You'll be equipped to design health checking strategies that balance detection speed, system overhead, and accuracy.
Before diving into the technical mechanisms, we must understand why health checks exist and what problem they solve in distributed systems.
The Fundamental Challenge:
In a distributed system, traffic must be routed only to servers that can actually handle requests. Without health checks, a load balancer faces an impossible situation: it must choose a backend for every request while knowing nothing about which backends are alive, overloaded, or silently broken. Every request sent to a failed server becomes a user-visible error or a timeout.
Health checks solve this by establishing a continuous feedback loop between the traffic routing layer and the backend servers. This feedback loop must answer a deceptively simple question: Is this server capable of serving traffic right now?
The complexity lies in what 'capable' means:
| Dimension | What It Tests | Failure Mode Detected |
|---|---|---|
| Network Reachability | Can packets reach the server? | Network partitions, firewall issues, server offline |
| Port Availability | Is the service listening on expected port? | Process crash, misconfiguration, binding failures |
| Application Responsiveness | Can the app handle HTTP requests? | Application deadlock, resource exhaustion, startup issues |
| Dependency Health | Can the app reach its dependencies? | Database failures, cache unavailability, downstream outages |
| Performance Adequacy | Can the app respond within SLA bounds? | Overload, resource contention, garbage collection pauses |
Servers rarely fail completely and obviously. More often, they experience partial failures—they can accept connections but respond slowly, or handle some request types but not others. A robust health check strategy must detect these nuanced failure modes, not just binary alive/dead states.
Active health checking operates on a simple principle: the monitoring system periodically sends probe requests to backend servers and evaluates their responses. This is the 'polling' approach to health monitoring.
Mechanism:
The monitoring component (typically the load balancer itself or a sidecar agent) sends a lightweight probe to each backend at a fixed interval and waits up to a configured timeout for a valid response. Consecutive failures and successes are counted against thresholds: once failures cross the failure threshold, the server is removed from rotation; once successes cross the recovery threshold, it is restored.
This creates a state machine for each backend server: a server starts healthy, transitions to unhealthy after N consecutive failed probes, and transitions back to healthy after M consecutive successful probes. A minimal sketch of this loop appears below.
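The following is a minimal, illustrative sketch of that state machine, not any particular load balancer's implementation; the `probe` callable, thresholds, and `BackendHealth` class are hypothetical names chosen for this example. The thresholds mirror the `fails=3 passes=2` parameters in the NGINX example later in this page.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class BackendHealth:
    """Tracks the active-health-check state machine for one backend."""
    healthy: bool = True            # current state: in or out of rotation
    consecutive_failures: int = 0
    consecutive_successes: int = 0

def run_active_checks(
    backends: dict[str, BackendHealth],
    probe: Callable[[str], bool],   # returns True if the probe succeeded
    interval: float = 5.0,          # seconds between probe rounds
    fail_threshold: int = 3,        # failures before marking unhealthy
    pass_threshold: int = 2,        # successes before restoring
) -> None:
    """Periodically probe every backend and update its health state."""
    while True:
        for address, state in backends.items():
            if probe(address):
                state.consecutive_successes += 1
                state.consecutive_failures = 0
                if not state.healthy and state.consecutive_successes >= pass_threshold:
                    state.healthy = True      # restore to rotation
            else:
                state.consecutive_failures += 1
                state.consecutive_successes = 0
                if state.healthy and state.consecutive_failures >= fail_threshold:
                    state.healthy = False     # remove from rotation
        time.sleep(interval)
```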
Probe Types:
Active health checks can operate at different protocol layers, each with distinct trade-offs:

- Layer 4 (TCP) probes simply open a connection to the service port. They are cheap and catch crashed processes or unreachable hosts, but a server can accept connections while its application logic is broken.
- Layer 7 (HTTP/HTTPS) probes issue a real request, typically against a dedicated health endpoint, and verify the response status code.
- Content-validating probes go further and inspect headers or the response body (as the `match` block in the NGINX example below does), confirming that the application returns meaningful output rather than an error page.

The sketch below shows what layer 4 and layer 7 probes look like in practice.
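These two probe functions are a hedged illustration of the layer 4 versus layer 7 distinction in Python; the endpoint path and timeout values are assumptions for this example, not values mandated by any particular load balancer.

```python
import socket
import urllib.request

def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Layer 4 probe: succeed if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_probe(host: str, port: int, path: str = "/health/ready",
               timeout: float = 3.0) -> bool:
    """Layer 7 probe: succeed only on an HTTP 200 from the health endpoint."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```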
```nginx
# NGINX Plus Active Health Check Configuration Example

upstream backend_servers {
    zone backend_zone 64k;
    server 10.0.0.1:8080 weight=1;
    server 10.0.0.2:8080 weight=1;
    server 10.0.0.3:8080 weight=1;
    # Active health check parameters
    # Probe every 5 seconds, timeout after 3 seconds
    # Mark unhealthy after 3 failures, restore after 2 successes
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_servers;
        # Active health check directive
        health_check interval=5s fails=3 passes=2 uri=/health/ready match=healthy;
    }
}

# Response matching block
match healthy {
    status 200;
    header Content-Type ~ "application/json";
    body ~ '"status":\s*"healthy"';
}
```

The health check interval represents a fundamental trade-off: shorter intervals detect failures faster but increase system load. For most production systems, intervals between 5-30 seconds provide reasonable detection times without excessive overhead. Critical services may warrant intervals as low as 1-2 seconds.
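As a rough back-of-the-envelope check, assuming consecutive-failure counting and that each failed probe runs to the 3-second timeout mentioned in the comments, the worst-case detection latency for this configuration is approximately:

fails × interval + timeout ≈ 3 × 5 s + 3 s = 18 s

Dropping the interval to 1 s with the same thresholds would cut that to roughly 6 s, at five times the probe traffic.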
Passive health checking takes a fundamentally different approach: instead of sending dedicated probe requests, the system observes the outcomes of real production traffic to infer server health. This is the 'reactive' or 'observation-based' approach.
Mechanism:
The proxy or load balancer records the outcome of every real request it forwards: connection errors, timeouts, response codes, and resets. These outcomes are aggregated per backend, and a server whose recent failures cross a threshold (consecutive failures, a failure percentage, or a statistical outlier relative to its peers) is temporarily ejected from the rotation.
The key insight is that passive health checking uses production traffic as the probe. This has profound implications for both accuracy and failure detection dynamics. The table below maps observed request outcomes to the health signals they imply; a small sketch after the table shows how such outcomes feed a failure counter.
| Observed Behavior | Health Implication | Typical Response |
|---|---|---|
| TCP connection refused | Port not listening, process crashed | Immediate failure count increment |
| TCP connection timeout | Network issue or severely overloaded | Failure count increment after timeout |
| HTTP 500-599 responses | Application error | Configurable: count as failure or track separately |
| HTTP 503 Service Unavailable | Application deliberately refusing traffic | Often treated as explicit unhealthy signal |
| Response timeout | Overloaded or deadlocked application | Failure count increment after timeout |
| Connection reset | Process crash during request | Immediate failure count increment |
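To make the table concrete, here is a minimal sketch of how a proxy might fold these observed outcomes into a per-backend failure count. The outcome labels, the `eject` callback, and the threshold are illustrative assumptions; the threshold mirrors the `consecutive_5xx: 5` setting in the Envoy example below.

```python
from collections import defaultdict

# Outcomes that count against a backend's health (see table above).
FAILURE_OUTCOMES = {
    "connection_refused",   # port not listening, process crashed
    "connection_timeout",   # network issue or severe overload
    "response_timeout",     # overloaded or deadlocked application
    "connection_reset",     # process crashed mid-request
    "http_5xx",             # application error
}

consecutive_failures: dict[str, int] = defaultdict(int)
EJECT_THRESHOLD = 5

def record_outcome(backend: str, outcome: str, eject) -> None:
    """Update the passive failure counter after every real request."""
    if outcome in FAILURE_OUTCOMES:
        consecutive_failures[backend] += 1
        if consecutive_failures[backend] >= EJECT_THRESHOLD:
            eject(backend)                    # temporarily remove from rotation
            consecutive_failures[backend] = 0
    else:
        consecutive_failures[backend] = 0     # any success resets the streak
```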
The Circuit Breaker Pattern:
Passive health checking often integrates with circuit breaker patterns. The circuit breaker operates as a state machine that tracks failure rates and 'opens' when failures exceed a threshold, stopping traffic to the failing server. In the closed state, requests flow normally while their outcomes are counted. When the failure threshold is crossed, the circuit opens and requests are rejected or rerouted immediately rather than sent to the failing backend. After a cooldown period, the circuit moves to a half-open state and allows a limited number of test requests through.
If those test requests succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again. A minimal sketch of this state machine follows.
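This sketch shows the three states in Python; the class name, cooldown, and thresholds are illustrative assumptions rather than a reference implementation.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 half_open_probes: int = 3):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown                 # seconds before testing again
        self.half_open_probes = half_open_probes
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_remaining = 0

    def allow_request(self) -> bool:
        """Decide whether the next request may be sent to this backend."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"         # let a few test requests through
                self.probes_remaining = self.half_open_probes
            else:
                return False
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.probes_remaining -= 1
            if self.probes_remaining <= 0:
                self.state = "CLOSED"            # backend looks healthy again
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._open()                         # test request failed: reopen
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self.failures = 0
```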
```yaml
# Envoy Proxy Outlier Detection (Passive Health Check) Configuration
clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # Passive health checking via outlier detection
    outlier_detection:
      # Trigger after 5 consecutive 5xx responses
      consecutive_5xx: 5
      # OR trigger after 5 consecutive gateway errors (502, 503, 504)
      consecutive_gateway_failure: 5
      # OR trigger after 5 consecutive local origin failures
      consecutive_local_origin_failure: 5
      # Time between ejection analysis sweeps
      interval: 10s
      # Base ejection time (actual time = base * ejection count)
      base_ejection_time: 30s
      # Maximum ejection percentage (protects against ejecting all hosts)
      max_ejection_percent: 50
      # Success-rate outlier ejection (statistical comparison across hosts)
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 100
      success_rate_stdev_factor: 1900  # 1.9 standard deviations
      # Failure percentage-based ejection
      failure_percentage_threshold: 85
      failure_percentage_minimum_hosts: 5
      failure_percentage_request_volume: 50
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 10.0.0.1
                    port_value: 8080
```

Passive health checks have a fundamental limitation: they can only detect failures in servers that receive traffic. A server that has just started (or just recovered) has no traffic history to analyze. This is why passive-only health checking is rarely sufficient—new or recovering servers have no established health signal.
Understanding when to use active versus passive health checking requires a nuanced analysis of their respective strengths and weaknesses. Neither approach is universally superior—the right choice depends on your failure detection requirements, infrastructure constraints, and operational maturity.
| Criterion | Active Health Checks | Passive Health Checks |
|---|---|---|
| Detection Speed (Idle Server) | Bounded by check interval | Never detected (no traffic) |
| Detection Speed (Busy Server) | Bounded by check interval | Immediate (first failed request) |
| False Positives | Possible if health endpoint fails independently | Lower—based on real traffic outcomes |
| False Negatives | Possible if health endpoint too simple | Possible if failure mode is intermittent |
| Network Overhead | Linear with server count × check frequency | Zero additional overhead |
| Dependency Verification | Can probe deep health including dependencies | Only observes outcomes, not root causes |
| New Instance Handling | Can verify before adding to rotation | Cannot assess until traffic is routed |
| Recovery Detection | Automatic with continued probing | Requires separate mechanism |
| Configuration Complexity | Requires health endpoint implementation | Minimal—observes existing traffic |
When using active health checks with many load balancers (e.g., in a replicated or globally distributed setup), synchronization of check intervals can create 'thundering herd' effects where all checkers probe simultaneously. Implement jitter (random delay) in check intervals to spread the load.
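One common mitigation, sketched here under the assumption of a simple sleep-based check loop, is to add a small random offset to each cycle so replicated checkers drift apart instead of probing in lockstep.

```python
import random
import time

def jittered_interval(base_interval: float, jitter_fraction: float = 0.2) -> float:
    """Return the base interval perturbed by up to +/- jitter_fraction."""
    jitter = base_interval * jitter_fraction
    return base_interval + random.uniform(-jitter, jitter)

# Example: a 5 s check loop whose cycles land between 4 s and 6 s,
# so independent load balancers do not all probe at the same instant.
def check_loop(probe_all_backends, base_interval: float = 5.0) -> None:
    while True:
        probe_all_backends()
        time.sleep(jittered_interval(base_interval))
```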
In production systems, the most robust approach is typically a hybrid strategy that leverages the complementary strengths of both active and passive health checking. This defense-in-depth approach provides multiple layers of failure detection.
The Layered Health Check Architecture:
Active Checks for Baseline Health: Periodic probes verify that servers are fundamentally operational—process running, port bound, able to respond to requests. This catches servers that have crashed, hung during startup, or become isolated from the network.
Passive Checks for Runtime Health: Continuous observation of production traffic catches degradation that emerges under load—memory leaks that develop over time, resource exhaustion under specific request patterns, or intermittent failures that don't show up in synthetic probes.
Smart Recovery Orchestration: Active probing restores servers to the rotation after passive checks have ejected them, but with a warmup period that limits initial traffic to verify production-level health before full restoration.
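One way to picture that warmup is a gradual weight ramp for the recovered backend; the linear ramp, duration, and 10% starting floor in this sketch are arbitrary illustrative choices, not a prescribed policy.

```python
import time

def warmup_weight(recovered_at: float, full_weight: int = 100,
                  warmup_seconds: float = 60.0, floor: float = 0.1) -> int:
    """Linearly ramp a recovered backend from 10% to 100% of its normal weight."""
    elapsed = time.monotonic() - recovered_at
    fraction = min(1.0, floor + (1.0 - floor) * (elapsed / warmup_seconds))
    return max(1, int(full_weight * fraction))
```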
```yaml
# Kubernetes-style Hybrid Health Check Strategy
# Combining readiness probes (active) with service mesh outlier detection (passive)

# Active Health Checks via Kubernetes Probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api:latest
          ports:
            - containerPort: 8080
          # Liveness probe: Is the process fundamentally alive?
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe: Can this instance handle traffic?
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 2
          # Startup probe: Wait for slow-starting applications
          startupProbe:
            httpGet:
              path: /health/live
              port: 8080
            periodSeconds: 5
            failureThreshold: 30  # Allow 150 seconds for startup
---
# Passive Health Checks via Istio Service Mesh
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-server-destination
spec:
  host: api-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Outlier detection for passive health checking
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
      # Track local-origin (connection-level) failures separately from HTTP errors
      consecutiveLocalOriginFailures: 5
      splitExternalLocalOriginErrors: true
```

Kubernetes popularized the three-probe pattern: Liveness (should we restart the container?), Readiness (should we send traffic?), and Startup (has the app finished initializing?). This separation allows efficient handling of different failure modes—a slow-starting app isn't killed, a temporarily overloaded app stops receiving traffic without restarting.
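To ground the liveness/readiness distinction, here is a minimal sketch of the two endpoints using Flask. The paths mirror the probe configuration above; the `dependencies_ok` check is a placeholder assumption, not a prescribed implementation.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def dependencies_ok() -> bool:
    """Placeholder: check database, cache, etc. with short timeouts."""
    return True  # assumption: replace with real, lightweight dependency checks

@app.route("/health/live")
def live():
    # Liveness: the process is up and able to serve this trivial request.
    # Failing this probe tells the orchestrator to restart the container.
    return jsonify(status="healthy"), 200

@app.route("/health/ready")
def ready():
    # Readiness: the instance can usefully handle traffic right now.
    # Failing this probe removes the pod from load balancing without a restart.
    if dependencies_ok():
        return jsonify(status="healthy"), 200
    return jsonify(status="unhealthy"), 503

if __name__ == "__main__":
    app.run(port=8080)
```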
Designing health check strategies for production requires careful attention to edge cases, failure modes, and operational requirements that aren't apparent in simple configurations.
| System Type | Recommended Interval | Failure Threshold | Recovery Threshold |
|---|---|---|---|
| High-frequency trading | 100-500ms | 2-3 failures | 2 successes |
| Real-time web applications | 1-5s | 2-3 failures | 2-3 successes |
| Standard web services | 5-15s | 3-5 failures | 2-3 successes |
| Batch processing systems | 30-60s | 3-5 failures | 2 successes |
| Long-running jobs | 60-120s | 5+ failures | 3 successes |
Overly aggressive health checks can cause the failures they're meant to detect. If health check probes consume significant resources, a busy server might fail health checks simply because it's prioritizing real traffic over probe responses. Design health endpoints to be extremely lightweight—they should complete in single-digit milliseconds even under load.
This deep dive into active and passive health checking reveals that effective health monitoring isn't about choosing one approach—it's about understanding the trade-offs and designing layered strategies that provide comprehensive failure detection.
What's next:
Understanding how health checks work is foundational, but the quality of health information depends on well-designed health check endpoints. In the next page, we'll explore the art and science of designing health endpoints that accurately represent server health without creating false signals or excessive overhead.
You now understand the fundamental mechanisms of active and passive health checking, their respective trade-offs, and how to combine them into robust hybrid strategies. Next, we'll dive into designing effective health check endpoints.