In the autumn of 2017, a major US airline's reservation system went dark. Flights were grounded, passengers stranded, and the company lost an estimated $150 million in a single day. The postmortem revealed a chilling truth: the failed servers had shown warning signs for hours, but the health monitoring system had categorized them as 'healthy' until the moment of complete collapse.
This catastrophe illustrates a fundamental truth about distributed systems: a system is only as reliable as its ability to detect and respond to failures. Health checks are the nervous system of modern infrastructure—the mechanism by which load balancers, orchestrators, and service meshes determine which instances can serve traffic and which have become liabilities.
But not all health checks are created equal. The distinction between active and passive health checking fundamentally shapes how quickly you detect failures, how much overhead you incur, and how accurately you model server health. Understanding this distinction isn't merely academic—it's the difference between systems that gracefully handle failures and systems that amplify them into cascading outages.
By the end of this page, you will understand the fundamental mechanisms of active and passive health checks, their architectural trade-offs, when to use each approach, and how to combine them for robust failure detection. You'll be equipped to design health checking strategies that balance detection speed, system overhead, and accuracy.
Before diving into the technical mechanisms, we must understand why health checks exist and what problem they solve in distributed systems.
The Fundamental Challenge:
In a distributed system, traffic must be routed only to servers that can actually handle requests. Without health checks, a load balancer faces an impossible situation: it must choose a backend for every request while knowing nothing about which backends are alive, overloaded, or silently broken. Every request sent to a failed server becomes a user-visible error or a timeout.
Health checks solve this by establishing a continuous feedback loop between the traffic routing layer and the backend servers. This feedback loop must answer a deceptively simple question: Is this server capable of serving traffic right now?
The complexity lies in what 'capable' means:
| Dimension | What It Tests | Failure Mode Detected |
|---|---|---|
| Network Reachability | Can packets reach the server? | Network partitions, firewall issues, server offline |
| Port Availability | Is the service listening on expected port? | Process crash, misconfiguration, binding failures |
| Application Responsiveness | Can the app handle HTTP requests? | Application deadlock, resource exhaustion, startup issues |
| Dependency Health | Can the app reach its dependencies? | Database failures, cache unavailability, downstream outages |
| Performance Adequacy | Can the app respond within SLA bounds? | Overload, resource contention, garbage collection pauses |
Servers rarely fail completely and obviously. More often, they experience partial failures—they can accept connections but respond slowly, or handle some request types but not others. A robust health check strategy must detect these nuanced failure modes, not just binary alive/dead states.
Active health checking operates on a simple principle: the monitoring system periodically sends probe requests to backend servers and evaluates their responses. This is the 'polling' approach to health monitoring.
Mechanism:
The monitoring component (typically the load balancer itself or a sidecar agent) sends a lightweight probe to each backend at a fixed interval and waits up to a configured timeout for a valid response. Consecutive failures and successes are counted against thresholds: once failures cross the failure threshold, the server is removed from rotation; once successes cross the recovery threshold, it is restored.
This creates a state machine for each backend server: a server starts healthy, transitions to unhealthy after N consecutive failed probes, and transitions back to healthy after M consecutive successful probes. A minimal sketch of this loop appears below.
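The following is a minimal, illustrative sketch of that state machine, not any particular load balancer's implementation; the `probe` callable, thresholds, and `BackendHealth` class are hypothetical names chosen for this example. The thresholds mirror the `fails=3 passes=2` parameters in the NGINX example later in this page.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class BackendHealth:
    """Tracks the active-health-check state machine for one backend."""
    healthy: bool = True            # current state: in or out of rotation
    consecutive_failures: int = 0
    consecutive_successes: int = 0

def run_active_checks(
    backends: dict[str, BackendHealth],
    probe: Callable[[str], bool],   # returns True if the probe succeeded
    interval: float = 5.0,          # seconds between probe rounds
    fail_threshold: int = 3,        # failures before marking unhealthy
    pass_threshold: int = 2,        # successes before restoring
) -> None:
    """Periodically probe every backend and update its health state."""
    while True:
        for address, state in backends.items():
            if probe(address):
                state.consecutive_successes += 1
                state.consecutive_failures = 0
                if not state.healthy and state.consecutive_successes >= pass_threshold:
                    state.healthy = True      # restore to rotation
            else:
                state.consecutive_failures += 1
                state.consecutive_successes = 0
                if state.healthy and state.consecutive_failures >= fail_threshold:
                    state.healthy = False     # remove from rotation
        time.sleep(interval)
```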
Probe Types:
Active health checks can operate at different protocol layers, each with distinct trade-offs:

- Layer 4 (TCP) probes simply open a connection to the service port. They are cheap and catch crashed processes or unreachable hosts, but a server can accept connections while its application logic is broken.
- Layer 7 (HTTP/HTTPS) probes issue a real request, typically against a dedicated health endpoint, and verify the response status code.
- Content-validating probes go further and inspect headers or the response body (as the `match` block in the NGINX example below does), confirming that the application returns meaningful output rather than an error page.

The sketch below shows what layer 4 and layer 7 probes look like in practice.
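These two probe functions are a hedged illustration of the layer 4 versus layer 7 distinction in Python; the endpoint path and timeout values are assumptions for this example, not values mandated by any particular load balancer.

```python
import socket
import urllib.request

def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Layer 4 probe: succeed if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_probe(host: str, port: int, path: str = "/health/ready",
               timeout: float = 3.0) -> bool:
    """Layer 7 probe: succeed only on an HTTP 200 from the health endpoint."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```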
```nginx
# NGINX Plus Active Health Check Configuration Example

upstream backend_servers {
    zone backend_zone 64k;
    server 10.0.0.1:8080 weight=1;
    server 10.0.0.2:8080 weight=1;
    server 10.0.0.3:8080 weight=1;
    # Active health check parameters
    # Probe every 5 seconds, timeout after 3 seconds
    # Mark unhealthy after 3 failures, restore after 2 successes
}

server {
    listen 80;
    location / {
        proxy_pass http://backend_servers;
        # Active health check directive
        health_check interval=5s fails=3 passes=2 uri=/health/ready match=healthy;
    }
}

# Response matching block
match healthy {
    status 200;
    header Content-Type ~ "application/json";
    body ~ '"status":\s*"healthy"';
}
```

The health check interval represents a fundamental trade-off: shorter intervals detect failures faster but increase system load. For most production systems, intervals between 5-30 seconds provide reasonable detection times without excessive overhead. Critical services may warrant intervals as low as 1-2 seconds.
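As a rough back-of-the-envelope check, assuming consecutive-failure counting and that each failed probe runs to the 3-second timeout mentioned in the comments, the worst-case detection latency for this configuration is approximately:

fails × interval + timeout ≈ 3 × 5 s + 3 s = 18 s

Dropping the interval to 1 s with the same thresholds would cut that to roughly 6 s, at five times the probe traffic.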
Passive health checking takes a fundamentally different approach: instead of sending dedicated probe requests, the system observes the outcomes of real production traffic to infer server health. This is the 'reactive' or 'observation-based' approach.
Mechanism:
The proxy or load balancer records the outcome of every real request it forwards: connection errors, timeouts, response codes, and resets. These outcomes are aggregated per backend, and a server whose recent failures cross a threshold (consecutive failures, a failure percentage, or a statistical outlier relative to its peers) is temporarily ejected from the rotation.
The key insight is that passive health checking uses production traffic as the probe. This has profound implications for both accuracy and failure detection dynamics. The table below maps observed request outcomes to the health signals they imply; a small sketch after the table shows how such outcomes feed a failure counter.
| Observed Behavior | Health Implication | Typical Response |
|---|---|---|
| TCP connection refused | Port not listening, process crashed | Immediate failure count increment |
| TCP connection timeout | Network issue or severely overloaded | Failure count increment after timeout |
| HTTP 500-599 responses | Application error | Configurable: count as failure or track separately |
| HTTP 503 Service Unavailable | Application deliberately refusing traffic | Often treated as explicit unhealthy signal |
| Response timeout | Overloaded or deadlocked application | Failure count increment after timeout |
| Connection reset | Process crash during request | Immediate failure count increment |
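To make the table concrete, here is a minimal sketch of how a proxy might fold these observed outcomes into a per-backend failure count. The outcome labels, the `eject` callback, and the threshold are illustrative assumptions; the threshold mirrors the `consecutive_5xx: 5` setting in the Envoy example below.

```python
from collections import defaultdict

# Outcomes that count against a backend's health (see table above).
FAILURE_OUTCOMES = {
    "connection_refused",   # port not listening, process crashed
    "connection_timeout",   # network issue or severe overload
    "response_timeout",     # overloaded or deadlocked application
    "connection_reset",     # process crashed mid-request
    "http_5xx",             # application error
}

consecutive_failures: dict[str, int] = defaultdict(int)
EJECT_THRESHOLD = 5

def record_outcome(backend: str, outcome: str, eject) -> None:
    """Update the passive failure counter after every real request."""
    if outcome in FAILURE_OUTCOMES:
        consecutive_failures[backend] += 1
        if consecutive_failures[backend] >= EJECT_THRESHOLD:
            eject(backend)                    # temporarily remove from rotation
            consecutive_failures[backend] = 0
    else:
        consecutive_failures[backend] = 0     # any success resets the streak
```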
The Circuit Breaker Pattern:
Passive health checking often integrates with circuit breaker patterns. The circuit breaker operates as a state machine that tracks failure rates and 'opens' when failures exceed a threshold, stopping traffic to the failing server. In the closed state, requests flow normally while their outcomes are counted. When the failure threshold is crossed, the circuit opens and requests are rejected or rerouted immediately rather than sent to the failing backend. After a cooldown period, the circuit moves to a half-open state and allows a limited number of test requests through.
If those test requests succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again. A minimal sketch of this state machine follows.
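This sketch shows the three states in Python; the class name, cooldown, and thresholds are illustrative assumptions rather than a reference implementation.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0,
                 half_open_probes: int = 3):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown                 # seconds before testing again
        self.half_open_probes = half_open_probes
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_remaining = 0

    def allow_request(self) -> bool:
        """Decide whether the next request may be sent to this backend."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "HALF_OPEN"         # let a few test requests through
                self.probes_remaining = self.half_open_probes
            else:
                return False
        return True

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.probes_remaining -= 1
            if self.probes_remaining <= 0:
                self.state = "CLOSED"            # backend looks healthy again
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._open()                         # test request failed: reopen
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "OPEN"
        self.opened_at = time.monotonic()
        self.failures = 0
```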
```yaml
# Envoy Proxy Outlier Detection (Passive Health Check) Configuration
clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # Passive health checking via outlier detection
    outlier_detection:
      # Trigger after 5 consecutive 5xx responses
      consecutive_5xx: 5
      # OR trigger after 5 consecutive gateway errors (502, 503, 504)
      consecutive_gateway_failure: 5
      # OR trigger after 5 consecutive local origin failures
      consecutive_local_origin_failure: 5
      # Time between ejection analysis sweeps
      interval: 10s
      # Base ejection time (actual time = base * ejection count)
      base_ejection_time: 30s
      # Maximum ejection percentage (protects against ejecting all hosts)
      max_ejection_percent: 50
      # Success-rate outlier ejection (statistical comparison across hosts)
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 100
      success_rate_stdev_factor: 1900  # 1.9 standard deviations
      # Failure percentage-based ejection
      failure_percentage_threshold: 85
      failure_percentage_minimum_hosts: 5
      failure_percentage_request_volume: 50
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 10.0.0.1
                    port_value: 8080
```

Passive health checks have a fundamental limitation: they can only detect failures in servers that receive traffic. A server that has just started (or just recovered) has no traffic history to analyze. This is why passive-only health checking is rarely sufficient—new or recovering servers have no established health signal.
Understanding when to use active versus passive health checking requires a nuanced analysis of their respective strengths and weaknesses. Neither approach is universally superior—the right choice depends on your failure detection requirements, infrastructure constraints, and operational maturity.
| Criterion | Active Health Checks | Passive Health Checks |
|---|---|---|
| Detection Speed (Idle Server) | Bounded by check interval | Never detected (no traffic) |
| Detection Speed (Busy Server) | Bounded by check interval | Immediate (first failed request) |
| False Positives | Possible if health endpoint fails independently | Lower—based on real traffic outcomes |
| False Negatives | Possible if health endpoint too simple | Possible if failure mode is intermittent |
| Network Overhead | Linear with server count × check frequency | Zero additional overhead |
| Dependency Verification | Can probe deep health including dependencies | Only observes outcomes, not root causes |
| New Instance Handling | Can verify before adding to rotation | Cannot assess until traffic is routed |
| Recovery Detection | Automatic with continued probing | Requires separate mechanism |
| Configuration Complexity | Requires health endpoint implementation | Minimal—observes existing traffic |
When using active health checks with many load balancers (e.g., in a replicated or globally distributed setup), synchronization of check intervals can create 'thundering herd' effects where all checkers probe simultaneously. Implement jitter (random delay) in check intervals to spread the load.
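One common mitigation, sketched here under the assumption of a simple sleep-based check loop, is to add a small random offset to each cycle so replicated checkers drift apart instead of probing in lockstep.

```python
import random
import time

def jittered_interval(base_interval: float, jitter_fraction: float = 0.2) -> float:
    """Return the base interval perturbed by up to +/- jitter_fraction."""
    jitter = base_interval * jitter_fraction
    return base_interval + random.uniform(-jitter, jitter)

# Example: a 5 s check loop whose cycles land between 4 s and 6 s,
# so independent load balancers do not all probe at the same instant.
def check_loop(probe_all_backends, base_interval: float = 5.0) -> None:
    while True:
        probe_all_backends()
        time.sleep(jittered_interval(base_interval))
```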
In production systems, the most robust approach is typically a hybrid strategy that leverages the complementary strengths of both active and passive health checking. This defense-in-depth approach provides multiple layers of failure detection.
The Layered Health Check Architecture:
Active Checks for Baseline Health: Periodic probes verify that servers are fundamentally operational—process running, port bound, able to respond to requests. This catches servers that have crashed, hung during startup, or become isolated from the network.
Passive Checks for Runtime Health: Continuous observation of production traffic catches degradation that emerges under load—memory leaks that develop over time, resource exhaustion under specific request patterns, or intermittent failures that don't show up in synthetic probes.
Smart Recovery Orchestration: Active probing restores servers to the rotation after passive checks have ejected them, but with a warmup period that limits initial traffic to verify production-level health before full restoration.
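One way to picture that warmup is a gradual weight ramp for the recovered backend; the linear ramp, duration, and 10% starting floor in this sketch are arbitrary illustrative choices, not a prescribed policy.

```python
import time

def warmup_weight(recovered_at: float, full_weight: int = 100,
                  warmup_seconds: float = 60.0, floor: float = 0.1) -> int:
    """Linearly ramp a recovered backend from 10% to 100% of its normal weight."""
    elapsed = time.monotonic() - recovered_at
    fraction = min(1.0, floor + (1.0 - floor) * (elapsed / warmup_seconds))
    return max(1, int(full_weight * fraction))
```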
```yaml
# Kubernetes-style Hybrid Health Check Strategy
# Combining readiness probes (active) with service mesh outlier detection (passive)

# Active Health Checks via Kubernetes Probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
        - name: api
          image: api:latest
          ports:
            - containerPort: 8080
          # Liveness probe: Is the process fundamentally alive?
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe: Can this instance handle traffic?
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3
            successThreshold: 2
          # Startup probe: Wait for slow-starting applications
          startupProbe:
            httpGet:
              path: /health/live
              port: 8080
            periodSeconds: 5
            failureThreshold: 30  # Allow 150 seconds for startup
---
# Passive Health Checks via Istio Service Mesh
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: api-server-destination
spec:
  host: api-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Outlier detection for passive health checking
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
      # Track local-origin (connection-level) failures separately from HTTP errors
      consecutiveLocalOriginFailures: 5
      splitExternalLocalOriginErrors: true
```

Kubernetes popularized the three-probe pattern: Liveness (should we restart the container?), Readiness (should we send traffic?), and Startup (has the app finished initializing?). This separation allows efficient handling of different failure modes—a slow-starting app isn't killed, a temporarily overloaded app stops receiving traffic without restarting.
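To ground the liveness/readiness distinction, here is a minimal sketch of the two endpoints using Flask. The paths mirror the probe configuration above; the `dependencies_ok` check is a placeholder assumption, not a prescribed implementation.

```python
from flask import Flask, jsonify

app = Flask(__name__)

def dependencies_ok() -> bool:
    """Placeholder: check database, cache, etc. with short timeouts."""
    return True  # assumption: replace with real, lightweight dependency checks

@app.route("/health/live")
def live():
    # Liveness: the process is up and able to serve this trivial request.
    # Failing this probe tells the orchestrator to restart the container.
    return jsonify(status="healthy"), 200

@app.route("/health/ready")
def ready():
    # Readiness: the instance can usefully handle traffic right now.
    # Failing this probe removes the pod from load balancing without a restart.
    if dependencies_ok():
        return jsonify(status="healthy"), 200
    return jsonify(status="unhealthy"), 503

if __name__ == "__main__":
    app.run(port=8080)
```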
Designing health check strategies for production requires careful attention to edge cases, failure modes, and operational requirements that aren't apparent in simple configurations.
| System Type | Recommended Interval | Failure Threshold | Recovery Threshold |
|---|---|---|---|
| High-frequency trading | 100-500ms | 2-3 failures | 2 successes |
| Real-time web applications | 1-5s | 2-3 failures | 2-3 successes |
| Standard web services | 5-15s | 3-5 failures | 2-3 successes |
| Batch processing systems | 30-60s | 3-5 failures | 2 successes |
| Long-running jobs | 60-120s | 5+ failures | 3 successes |
Overly aggressive health checks can cause the failures they're meant to detect. If health check probes consume significant resources, a busy server might fail health checks simply because it's prioritizing real traffic over probe responses. Design health endpoints to be extremely lightweight—they should complete in single-digit milliseconds even under load.
This deep dive into active and passive health checking reveals that effective health monitoring isn't about choosing one approach—it's about understanding the trade-offs and designing layered strategies that provide comprehensive failure detection.
What's next:
Understanding how health checks work is foundational, but the quality of health information depends on well-designed health check endpoints. In the next page, we'll explore the art and science of designing health endpoints that accurately represent server health without creating false signals or excessive overhead.
You now understand the fundamental mechanisms of active and passive health checking, their respective trade-offs, and how to combine them into robust hybrid strategies. Next, we'll dive into designing effective health check endpoints.