A load balancer is only as effective as its ability to distinguish healthy backends from unhealthy ones. Without proper health checking, a load balancer becomes a liability—blindly routing traffic to dead servers, causing user-facing errors, and failing to fulfill its core purpose of maintaining availability.
Health checking is the mechanism by which load balancers continuously verify that backend servers are capable of handling requests. It's the guardian at the gate, ensuring that only functioning servers receive traffic and automatically removing problematic servers from rotation.
This page provides a comprehensive exploration of health check mechanisms, from basic port checks to sophisticated application-level probes, including the configuration nuances that determine whether your system gracefully handles failures or cascades into an outage.
By the end of this page, you will understand: the types of health checks and their tradeoffs, configuration parameters and their impact, active vs passive health checking, graceful degradation patterns, health check anti-patterns that cause outages, and production best practices used by major services.
To understand the critical importance of health checks, consider what happens without them:
Scenario: Health Checks Disabled
[Time 0] Backend B crashes due to OOM
[Time 0+] Load balancer continues routing 33% of traffic to B
[Time 0+] All requests to B fail with connection refused
[Time 0+] 33% of users experience immediate errors
[Time 5m] Engineers notice errors in monitoring
[Time 8m] Manual intervention removes B from load balancer
[Time 8m] Error rate drops to 0%
Scenario: Health Checks Enabled
[Time 0] Backend B crashes due to OOM
[Time 10s] Health check fails (first attempt)
[Time 20s] Health check fails (second attempt)
[Time 30s] Health check fails (third attempt) — B marked unhealthy
[Time 30s] Load balancer stops routing to B
[Time 30s] All traffic goes to A and C (no user impact)
[Time 60s] B restarts (auto-healing or manual)
[Time 90s] Health check succeeds (B marked healthy)
[Time 90s] Traffic resumes to B
The difference: 8 minutes of 33% error rate vs. 30 seconds of potential minor impact (and often zero impact if the failed server was handling no active connections).
| Benefit | Description |
|---|---|
| Automatic Failure Detection | Remove crashed/hung backends within seconds |
| Zero-Downtime Deployments | Drain traffic before shutdown, add after startup |
| Graceful Degradation | Under load, unhealthy backends stop receiving traffic |
| Self-Healing Integration | Works with auto-scaling to replace failed instances |
| Visibility | Health status provides operational insight |
| SLA Protection | Reduces blast radius of individual failures |
Routing to an unhealthy backend isn't just a failed request—it's wasted time. The user waits for a timeout (often seconds), then either sees an error or gets retried. Meanwhile, healthy backends are underutilized. Poor health checking turns partial failures into user-visible degradation.
Health checks range from simple connectivity tests to sophisticated application-level probes. Each level provides more information but requires more configuration and can have more failure modes.
Level 1: TCP Health Checks (L4)
Load Balancer → SYN → Backend
Load Balancer ← SYN-ACK ← Backend (Success: port is listening)
Load Balancer → RST → Backend (Close immediately)
What it Detects: process down or crashed, port not listening, host unreachable, hard network partitions
What it Misses: an application that is hung but still holds the port open, an application returning errors for every request, failed dependencies (database, cache)
Use Cases: Non-HTTP services, databases, Redis, when application-level checks aren't possible
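The SYN/SYN-ACK exchange above is just a TCP connect attempt. A minimal sketch (the helper name `tcp_health_check` is illustrative, not any load balancer's actual implementation):

```python
import socket

def tcp_health_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """A Level 1 (L4) check: success means only that the port is listening."""
    try:
        # create_connection completes the SYN/SYN-ACK handshake;
        # the `with` block closes the socket immediately afterward.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Note that this returns True even if the application behind the port is hung or erroring, which is exactly the blind spot listed above.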
Level 2: HTTP Health Checks (L7)
Load Balancer → GET /health HTTP/1.1
Load Balancer ← HTTP/1.1 200 OK (Success: 2xx response)
What it Detects: everything a TCP check detects, plus whether the application can actually parse and answer an HTTP request and whether the health path returns a success status
What it Misses: failed dependencies the handler doesn't exercise (database, cache), functional correctness of real endpoints
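The Level 2 probe above is an HTTP GET that treats any 2xx status as success. A standard-library sketch (the helper name is illustrative):

```python
from urllib.request import urlopen
from urllib.error import URLError

def http_health_check(url: str, timeout: float = 5.0) -> bool:
    """A Level 2 (L7) check: healthy only if the endpoint answers 2xx."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except URLError:
        # Covers connection failures and non-2xx responses
        # (HTTPError is a subclass of URLError).
        return False
```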
Level 3: Application-Level Health Checks (Deep)
Load Balancer → GET /health/ready HTTP/1.1
Load Balancer ← HTTP/1.1 200 OK
Content-Type: application/json
```json
{
  "status": "healthy",
  "checks": {
    "database": "connected",
    "cache": "connected",
    "disk": "ok"
  }
}
```
What it Detects: everything an HTTP check detects, plus the state of critical dependencies (database, cache, disk), so the load balancer can pull a backend whose downstream resources have failed
| Level | Protocol | Detects | Complexity | Resource Usage |
|---|---|---|---|---|
| TCP Connect | TCP SYN/ACK | Port listening | Very Low | Very Low |
| HTTP Request | HTTP GET | App responding | Low | Low |
| HTTP Content | HTTP GET + body check | Correct response | Medium | Medium |
| Deep Health | HTTP GET + dependency check | Full functionality | High | High |
Kubernetes distinguishes 'liveness' (is the process alive?) from 'readiness' (can it accept traffic?). Apply this to load balancer health checks: TCP checks establish liveness; HTTP checks verify readiness. A server can be live (process running) but not ready (dependencies unavailable).
Properly configuring health checks requires understanding the key parameters and their impact on detection speed and stability.
Core Parameters:
| Parameter | Description | Typical Value | Tradeoff |
|---|---|---|---|
| Interval | Time between checks | 5-30 seconds | Faster detection vs. resource usage |
| Timeout | Max wait for response | 2-10 seconds | Sensitivity vs. false positives |
| Healthy Threshold | Successes to mark healthy | 2-3 | Stability vs. recovery speed |
| Unhealthy Threshold | Failures to mark unhealthy | 2-3 | Sensitivity vs. stability |
| Path (HTTP) | URL to check | /health | Simple vs. thorough |
| Expected Codes (HTTP) | Success status codes | 200-299 | Strict vs. lenient |
| Expected Body (HTTP) | Content to match | None or specific | Thoroughness vs. brittleness |
Detection Time Calculation:
Time to Detect Failure = Interval × Unhealthy Threshold
Example:
Interval = 10 seconds
Unhealthy Threshold = 3
Detection Time = 10 × 3 = 30 seconds
Recovery Time Calculation:
Time to Recover = Interval × Healthy Threshold
Example:
Interval = 10 seconds
Healthy Threshold = 2
Recovery Time = 10 × 2 = 20 seconds
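The two formulas above can be captured as trivial helpers. One nuance worth knowing: in the worst case, detection also includes most of one extra interval (if the failure happens just after a successful check) plus the probe timeout.

```python
def time_to_detect(interval_s: float, unhealthy_threshold: int) -> float:
    """Seconds of consecutive failed probes before a backend is marked unhealthy."""
    return interval_s * unhealthy_threshold

def time_to_recover(interval_s: float, healthy_threshold: int) -> float:
    """Seconds of consecutive successful probes before a backend is marked healthy."""
    return interval_s * healthy_threshold
```

With the example values: `time_to_detect(10, 3)` gives 30 seconds and `time_to_recover(10, 2)` gives 20 seconds, matching the calculations above.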
Configuration Example: AWS ALB
```json
{
  "TargetGroup": {
    "HealthCheckProtocol": "HTTP",
    "HealthCheckPath": "/health",
    "HealthCheckPort": "traffic-port",
    "HealthCheckIntervalSeconds": 15,
    "HealthCheckTimeoutSeconds": 5,
    "HealthyThresholdCount": 2,
    "UnhealthyThresholdCount": 3,
    "Matcher": {
      "HttpCode": "200-299"
    }
  }
}
```
Configuration Example: NGINX
Note that the active `health_check` directive is an NGINX Plus feature (open-source NGINX offers only passive checks via the `max_fails` and `fail_timeout` server parameters), and it requires a shared-memory `zone` on the upstream:

```nginx
upstream backend {
    zone backend 64k;   # shared memory zone required by health_check
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check uri=/health interval=10s fails=3 passes=2;
    }
}
```
Health checking can be active (load balancer probes backends) or passive (load balancer observes real traffic). Each approach has distinct characteristics.
Active Health Checking:
┌──────────────────┐ Health Probe ┌─────────────┐
│ Load Balancer │ ─────────────────►│ Backend │
│ │ ◄─────────────────│ │
└──────────────────┘ Health Response └─────────────┘
(Separate from production traffic)
How It Works: the load balancer sends synthetic probe requests (TCP connect or HTTP GET) to each backend on a fixed schedule, independent of production traffic.
Pros: detects failures even on idle backends; predictable detection time (interval × threshold); probe behavior is fully under your control.
Cons: adds background load (probes × backends × LB instances); a passing probe doesn't guarantee real requests succeed; detection speed is bounded by the probe interval.
Passive Health Checking (Outlier Detection):
┌──────────────────┐ Production Traffic ┌─────────────┐
│ Load Balancer │ ◄───────────────────►│ Backend │
│ (observes │ │ │
│ failures) │ │ │
└──────────────────┘ └─────────────┘
(Monitors real request outcomes)
How It Works: the load balancer observes the outcomes of real requests (status codes, timeouts, connection failures) and ejects backends whose error rate makes them statistical outliers.
Pros: no extra probe traffic; catches exactly what users experience, including intermittent errors and slowness that a simple probe would miss.
Cons: some real requests must fail before detection kicks in; cannot detect failures on backends receiving no traffic.
Envoy Outlier Detection Configuration:
```yaml
outlier_detection:
  consecutive_5xx: 5              # 5 consecutive 5xx responses = eject
  interval: 10s                   # evaluation sweep every 10 seconds
  base_ejection_time: 30s         # minimum ejection duration
  max_ejection_percent: 50        # never eject more than 50% of the pool
  consecutive_gateway_failure: 5  # 5 connection-level failures = eject
```
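The consecutive-5xx criterion can be sketched in a few lines. This is an illustration of the idea, not Envoy's internals (which also track ejection time, success-rate statistics, and re-admission):

```python
class OutlierDetector:
    """Passively eject a backend after N consecutive 5xx responses."""

    def __init__(self, consecutive_5xx: int = 5):
        self.threshold = consecutive_5xx
        self.failures: dict[str, int] = {}  # backend -> consecutive 5xx count
        self.ejected: set[str] = set()

    def record(self, backend: str, status_code: int) -> None:
        """Called for every real (production) response observed by the LB."""
        if 500 <= status_code < 600:
            self.failures[backend] = self.failures.get(backend, 0) + 1
            if self.failures[backend] >= self.threshold:
                self.ejected.add(backend)
        else:
            self.failures[backend] = 0  # any success resets the streak

    def is_healthy(self, backend: str) -> bool:
        return backend not in self.ejected
```

Note that a real implementation also re-admits backends after `base_ejection_time` and enforces `max_ejection_percent`, both omitted here for brevity.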
| Aspect | Active | Passive |
|---|---|---|
| Detection Trigger | Scheduled probes | Real traffic failures |
| Detection Speed | Interval-dependent (seconds) | Immediate (per-request) |
| Zero-Traffic Detection | Yes | No |
| Additional Load | Yes (health check requests) | No |
| Accuracy | As accurate as probe | Depends on traffic patterns |
| Configuration | Probe settings | Statistical thresholds |
Best-in-class systems use both active and passive health checking. Active checks catch dead backends quickly; passive checks catch subtle issues (slowness, intermittent errors) that active checks might miss. Envoy, for example, supports both simultaneously.
The design of your application's health endpoint significantly impacts the effectiveness of health checking.
Health Endpoint Patterns:
Pattern 1: Simple OK Response
```python
@app.route('/health')
def health():
    return 'OK', 200
```
✓ Fast, low overhead ✗ Doesn't verify actual functionality
Pattern 2: Dependency-Aware Health Check
```python
@app.route('/health')
def health():
    try:
        # Check database
        db.execute('SELECT 1')
        # Check cache
        redis.ping()
        # Check external API
        requests.get(external_api, timeout=1)
        return 'OK', 200
    except Exception as e:
        return f'Unhealthy: {e}', 503
```
✓ Verifies application can actually work ✗ Couples availability to dependencies ✗ Can cascade failures (dependency down = all backends unhealthy)
Pattern 3: Structured Health Response
```python
@app.route('/health')
def health():
    checks = {
        'database': check_database(),
        'cache': check_cache(),
        'disk_space': check_disk(),
    }
    all_healthy = all(c['status'] == 'healthy' for c in checks.values())
    return jsonify({
        'status': 'healthy' if all_healthy else 'degraded',
        'checks': checks,
        'timestamp': datetime.utcnow().isoformat(),
    }), 200 if all_healthy else 503
```
Response Example:
```json
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "latency_ms": 2},
    "cache": {"status": "healthy", "latency_ms": 1},
    "disk_space": {"status": "healthy", "free_gb": 45}
  },
  "timestamp": "2024-01-15T10:30:00Z"
}
```
Pattern 4: Separate Liveness and Readiness
```python
@app.route('/health/live')
def liveness():
    # Is the process alive and not deadlocked?
    return 'OK', 200

@app.route('/health/ready')
def readiness():
    # Is the app ready to receive traffic?
    if not db_connected or not warmed_cache:
        return 'Not Ready', 503
    return 'Ready', 200
```
Use liveness for "should I restart this?" and readiness for "should I send traffic?"
If your health check fails because Redis is down, all backends become unhealthy simultaneously. The load balancer has nowhere to send traffic—total outage. Consider: is 'degraded with Redis down' better than 'completely unavailable'? Often, returning a cache miss is better than returning no response.
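One way to avoid that cascade is to classify dependencies as critical or non-critical and only fail the check for critical ones. A sketch of the decision logic, with dependency states passed in as booleans for illustration (a real endpoint would probe them):

```python
def health_status(db_ok: bool, cache_ok: bool) -> tuple[str, int]:
    """Fail the health check only for critical dependencies.

    The database is treated as critical: without it the backend cannot
    serve requests. The cache is non-critical: its loss degrades service
    (cache misses) but the backend should stay in rotation.
    """
    if not db_ok:
        return ('unhealthy: database unavailable', 503)
    if not cache_ok:
        return ('degraded: cache unavailable', 200)  # still accepts traffic
    return ('healthy', 200)
```

With this design, a Redis outage leaves every backend returning 200 with a "degraded" status, rather than taking the entire pool out of rotation at once.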
When a backend becomes unhealthy (intentionally for deployment or unexpectedly due to failure), existing connections need to be handled gracefully. Connection draining is the process of completing in-flight requests before removing a backend.
Without Connection Draining:
[Time 0] Backend marked unhealthy
[Time 0] LB immediately stops routing to backend
[Time 0] 100 in-flight requests receive connection reset
[Time 0] 100 users see errors
With Connection Draining:
[Time 0] Backend marked unhealthy
[Time 0] LB stops sending NEW requests to backend
[Time 0] LB continues forwarding existing connections
[Time 5s] 95 requests complete normally
[Time 10s] Remaining 5 requests complete
[Time 10s] Backend fully drained, can be shut down
Draining Timeout:
A maximum draining duration prevents hung connections from blocking shutdown indefinitely:
Draining Timeout = 30 seconds
After 30 seconds, remaining connections are forcibly closed.
This is a tradeoff: longer timeout = more graceful, but slower deployments.
Implementing Graceful Shutdown (Backend Side):
```python
import signal
import sys
import time

class GracefulServer:
    def __init__(self):
        self.is_ready = True
        self.active_requests = 0
        signal.signal(signal.SIGTERM, self.shutdown_handler)

    def shutdown_handler(self, signum, frame):
        print("Received SIGTERM, starting graceful shutdown")
        # Step 1: Stop accepting new requests (fail readiness checks)
        self.is_ready = False
        # Step 2: Wait for health checks to fail (LB stops sending traffic)
        time.sleep(10)  # roughly 2-3 health check intervals
        # Step 3: Wait for in-flight requests to complete
        timeout = 30
        start = time.time()
        while self.active_requests > 0 and (time.time() - start) < timeout:
            time.sleep(1)
        # Step 4: Exit
        print(f"Shutdown complete, {self.active_requests} requests aborted")
        sys.exit(0)

server = GracefulServer()

@app.route('/health/ready')
def readiness():
    if server.is_ready:
        return 'Ready', 200
    return 'Draining', 503  # Tell LB to stop sending traffic
```
Configuration: AWS Target Group Deregistration
```json
{
  "TargetGroup": {
    "TargetGroupAttributes": [
      {
        "Key": "deregistration_delay.timeout_seconds",
        "Value": "30"
      }
    ]
  }
}
```
| Application Type | Recommended Timeout | Reasoning |
|---|---|---|
| Web APIs (fast) | 30 seconds | Most requests complete in <1 second |
| Batch processing | 5 minutes | Jobs may take minutes |
| WebSocket/long polling | 5-30 minutes | Long-lived connections |
| Database connections | 60 seconds | Transaction completion |
| Streaming media | Session-based | Don't interrupt active streams |
In Kubernetes, use a preStop lifecycle hook to delay shutdown, giving the Endpoints controller time to remove the pod from Service endpoints. A simple 'sleep 5' can prevent traffic from being sent to a terminating pod.
Production systems often need more sophisticated health checking than basic pass/fail. Here are advanced patterns used by large-scale services.
Pattern 1: Gradual Degradation (Weight Adjustment)
Instead of binary healthy/unhealthy, adjust backend weight based on health metrics:
Backend Response Time Increasing:
Normal (50ms): weight = 100
Elevated (100ms): weight = 50
High (200ms): weight = 25
Critical (500ms): weight = 0 (removed)
This naturally shifts traffic away from struggling backends before they fail completely.
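The latency-to-weight mapping above can be expressed as a small function. Thresholds are taken from the table; for simplicity this sketch treats anything above the 200 ms band as critical (the original table leaves the 200-500 ms range unspecified):

```python
def weight_for_latency(p50_ms: float) -> int:
    """Map observed backend latency to a routing weight (illustrative thresholds)."""
    if p50_ms <= 50:    # normal
        return 100
    if p50_ms <= 100:   # elevated
        return 50
    if p50_ms <= 200:   # high
        return 25
    return 0            # critical: effectively removed from rotation
```

A weighted load balancer would then re-read these weights each interval, so traffic drains gradually from a degrading backend instead of dropping to zero in one step.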
Pattern 2: Panic Mode Prevention
Prevent removing too many backends simultaneously:
```yaml
# Envoy outlier detection
max_ejection_percent: 50  # Never eject more than 50% of backends
```
Even if 80% of backends are unhealthy, keep routing to the "least bad" ones. Total outage is worse than degraded service.
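Enforcing the cap can be sketched as below. The helper is hypothetical; real implementations also decide *which* candidates to eject first (e.g. worst error rate), whereas this sketch simply takes them in order:

```python
def apply_ejections(pool: list[str], candidates: list[str],
                    max_ejection_percent: int = 50) -> set[str]:
    """Eject unhealthy candidates, but never more than the configured
    percentage of the pool. Backends beyond the cap keep receiving
    traffic even though they look unhealthy ("least bad" routing)."""
    max_ejected = len(pool) * max_ejection_percent // 100
    return set(candidates[:max_ejected])
```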
Pattern 3: Hystrix-Style Circuit Breaker
[Closed State] - Normal operation
↓ (Failure rate > threshold)
[Open State] - Stop sending requests (fail fast)
↓ (After timeout)
[Half-Open State] - Send limited test requests
↓ (Tests succeed) ↓ (Tests fail)
[Closed State] [Open State]
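A minimal version of that state machine follows. This is an illustration of the pattern, not Hystrix's actual implementation; the injectable clock is just a testing convenience:

```python
import time

class CircuitBreaker:
    """Closed -> Open on repeated failures; Open -> Half-Open after a timeout;
    Half-Open -> Closed on success, back to Open on failure."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let a limited probe through
                return True
            return False  # fail fast without touching the backend
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0
```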
Pattern 4: Differentiated Health Checks
Different consumers may need different health criteria:
@app.route('/health/lb') # For load balancer
def lb_health():
# Simple: can process requests?
return 'OK', 200
@app.route('/health/k8s') # For Kubernetes
def k8s_health():
# Is container alive?
return 'OK', 200
@app.route('/health/monitoring') # For observability
def monitoring_health():
# Detailed metrics
return jsonify(get_detailed_metrics()), 200
@app.route('/health/canary') # For canary deployment
def canary_health():
# Extra strict checks for canary
run_synthetic_transactions()
return 'OK', 200
Pattern 5: Warm-Up Period
Newly started backends may not perform at full capacity (cold caches, JIT not optimized). Delay adding to pool:
[Container Starts]
↓
[Liveness check passes] → Container won't be restarted
↓ (30 second warm-up)
[Readiness check passes] → LB starts sending traffic
↓ (Gradual weight ramp-up)
[Full traffic after 60 seconds]
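The gradual ramp-up step can be sketched as a weight function of time since the readiness check first passed. The linear shape and function name are assumptions; some systems use step functions or slow-start curves instead:

```python
def rampup_weight(seconds_since_ready: float, ramp_duration: float = 60.0,
                  full_weight: int = 100) -> int:
    """Linearly ramp a newly ready backend from 0 to full routing weight,
    giving caches and JIT compilers time to warm up under partial load."""
    if seconds_since_ready <= 0:
        return 0
    if seconds_since_ready >= ramp_duration:
        return full_weight
    return int(full_weight * seconds_since_ready / ramp_duration)
```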
Pattern 6: Subsetting with Health Priority
With many backends, each LB instance may only check a subset:
Backend Pool: 1000 servers
LB Instance A checks: 100 servers (random subset)
LB Instance B checks: 100 servers (different subset)
...
Gossip protocol shares health state between LB instances.
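Subset selection can be made deterministic per LB instance by hash-ranking backends. This is an illustrative sketch; production deterministic-subsetting algorithms are more careful about balancing how much load each backend receives across all subsets:

```python
import hashlib

def subset_for_lb(lb_id: str, backends: list[str], subset_size: int) -> list[str]:
    """Pick a stable pseudo-random subset of backends for one LB instance.

    Each LB ranks every backend by a hash of (lb_id, backend), so different
    LB instances get different overlapping subsets, and the same instance
    always picks the same subset (no coordination required)."""
    def score(backend: str) -> str:
        return hashlib.sha256(f"{lb_id}:{backend}".encode()).hexdigest()
    return sorted(backends, key=score)[:subset_size]
```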
Setting thresholds (failure rates, ejection percentages, timeouts) is a tuning exercise. Start conservative, observe in production, and adjust. What works for one service may not work for another. Build alerting to notify when thresholds are approached.
We've covered the essential aspects of health checking for load balancers. Let's consolidate the key principles.
What's Next:
We've mastered health checking for single regions and data centers. But what about global systems serving users around the world? Next, we'll explore global load balancing—how organizations distribute traffic across geographic regions, implement resilient multi-region architectures, and provide optimal performance regardless of where users are located.
You now understand health checking from basic TCP probes to sophisticated application-level patterns. You can design health endpoints, configure appropriate thresholds, and implement graceful degradation. Next, we'll scale up to global load balancing.