In 2019, a major e-commerce platform experienced an unusual outage. Their health check endpoint returned 200 OK. The load balancer dutifully marked all servers as healthy. Yet customers couldn't complete checkouts—the payment processing service was down, and the health endpoint hadn't been designed to verify it.
The health check was technically correct: the servers were running. But it was operationally useless: the system couldn't fulfill its core business function.
This scenario illustrates a critical truth: a health check endpoint is only as valuable as the health conditions it verifies. The seemingly simple task of returning a status code masks profound design decisions about what 'health' means for your application.
Should a health endpoint verify database connectivity? What about downstream services? Should it check for sufficient disk space or memory? Should it test business-critical workflows or just process availability? The answers shape whether your health checks provide meaningful signals or false reassurance.
By the end of this page, you will understand how to design health check endpoints that accurately represent your application's ability to serve its intended purpose. You'll learn the distinction between liveness, readiness, and deep health checks, how to verify dependencies without creating brittleness, and how to structure endpoints for operational clarity.
Not all health checks serve the same purpose. Modern distributed systems typically implement a taxonomy of health endpoints, each designed to answer a different question about the application's state.
The Three-Tier Health Check Model:
This model, popularized by Kubernetes but applicable to any infrastructure, separates health checking into three probe types (liveness, readiness, and startup), supplemented by a deep health check used for diagnostics rather than routing:
| Check Type | Question Answered | Failure Response | Typical Endpoint |
|---|---|---|---|
| Liveness | Is the process fundamentally alive? | Restart the container/process | /health/live or /livez |
| Readiness | Can this instance handle traffic right now? | Remove from load balancer rotation | /health/ready or /readyz |
| Startup | Has the application finished initializing? | Wait longer before checking liveness/readiness | /health/startup or /startupz |
| Deep Health | Is the full application stack operational? | Diagnostic and alerting purposes | /health or /health/full |
Understanding the Distinctions:
Liveness answers: Should we keep this instance running, or should we restart it?
A liveness check should fail only when the instance is in an unrecoverable state—deadlocked, memory-corrupted, or otherwise broken in a way that requires a restart to fix. Liveness checks should be extremely simple and fast. They should not verify dependencies, because if your database goes down, restarting your application servers won't fix it.
Readiness answers: Can this instance productively handle a request right now?
Readiness checks can and should verify dependencies. If the database is unreachable, the readiness check should fail, removing the instance from load balancer rotation. But the instance itself isn't broken—when the database recovers, readiness should restore automatically. Crucially, a readiness failure should not trigger a restart.
Startup answers: Has the application completed its initialization sequence?
For applications with slow startup (loading caches, establishing connection pools, running migrations), the startup check prevents liveness probes from killing an instance that's still initializing. Once the startup check passes, the liveness and readiness probes take over.
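A startup endpoint often needs nothing more than a flag that flips once initialization finishes. Here is a minimal sketch in Express, assuming a hypothetical `runStartupTasks()` placeholder for whatever warm-up your application performs:

```typescript
// Minimal startup endpoint sketch. `runStartupTasks` is a placeholder for
// whatever initialization your application performs (cache warming,
// connection pools, migrations).
import express from 'express';

const app = express();
let startupComplete = false;

async function runStartupTasks(): Promise<void> {
  // e.g. warm caches, establish connection pools, run migrations
}

runStartupTasks()
  .then(() => { startupComplete = true; })
  .catch((err) => {
    console.error('Startup failed:', err);
    process.exit(1); // let the orchestrator restart the instance
  });

app.get('/health/startup', (_req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

app.listen(3000);
```

Once this endpoint returns 200, the orchestrator begins probing liveness and readiness as described above.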
A common production catastrophe: including database connectivity in liveness checks. When the database goes down briefly, all application servers fail liveness and get restarted simultaneously. Now you have a database outage AND all your application servers trying to restart at once. Always keep liveness checks free of external dependencies.
The liveness endpoint has a singular purpose: determine whether the application process is in a state where restarting would help. This seemingly simple requirement demands careful consideration of what constitutes an 'unrecoverable' state.
What Liveness Should Check:
```typescript
// TypeScript/Express Example: Minimal Liveness Endpoint

import express from 'express';
import { threadPoolHealth } from './monitoring';
import { database } from './db'; // used only by the anti-pattern example below

const app = express();

/**
 * Liveness endpoint - answers "should we restart this process?"
 *
 * Rules:
 * - Must respond extremely fast (< 10ms target)
 * - Must NOT check external dependencies
 * - Should only fail for unrecoverable local state
 */
app.get('/health/live', (req, res) => {
  // Check 1: Can we even respond? (Implicit - if we got here, yes)

  // Check 2: Are worker threads responsive?
  const threadHealth = threadPoolHealth();
  if (threadHealth.deadlockedThreads > 0) {
    return res.status(503).json({
      status: 'unhealthy',
      reason: 'thread_deadlock',
      deadlockedThreads: threadHealth.deadlockedThreads
    });
  }

  // Check 3: Is our event loop responsive?
  // (If this handler runs, the event loop is alive)

  // All checks pass - we're alive
  res.status(200).json({
    status: 'healthy',
    timestamp: new Date().toISOString()
  });
});

// Anti-pattern: DO NOT include database checks in liveness
// ❌ BAD - This can cause cascading restarts
app.get('/health/live-bad', async (req, res) => {
  try {
    await database.query('SELECT 1'); // DON'T DO THIS IN LIVENESS
    res.json({ status: 'healthy' });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy' }); // Will restart!
  }
});
```
```go
// Go/Gin Example: Minimal Liveness Endpoint

package main

import (
	"net/http"
	"runtime"
	"time"

	"github.com/gin-gonic/gin"
)

// LivenessHandler checks if the process is fundamentally alive
// and should continue running (vs being restarted)
func LivenessHandler(c *gin.Context) {
	// Check 1: Verify the runtime is healthy
	// If we couldn't call NumGoroutine, we'd have a serious problem
	numGoroutines := runtime.NumGoroutine()

	// Check 2: Watch for goroutine leaks that might indicate deadlock
	// This threshold should be tuned per-application
	const maxGoroutines = 10000
	if numGoroutines > maxGoroutines {
		c.JSON(http.StatusServiceUnavailable, gin.H{
			"status":     "unhealthy",
			"reason":     "goroutine_leak_suspected",
			"goroutines": numGoroutines,
			"threshold":  maxGoroutines,
		})
		return
	}

	// Check 3: Verify memory isn't critically exhausted
	var memStats runtime.MemStats
	runtime.ReadMemStats(&memStats)

	// If heap is over 90% of system memory, we may need restart
	const heapThreshold = 0.9
	heapUsageRatio := float64(memStats.HeapAlloc) / float64(memStats.HeapSys)
	if heapUsageRatio > heapThreshold {
		c.JSON(http.StatusServiceUnavailable, gin.H{
			"status":     "unhealthy",
			"reason":     "heap_exhaustion",
			"heapAlloc":  memStats.HeapAlloc,
			"heapSys":    memStats.HeapSys,
			"usageRatio": heapUsageRatio,
		})
		return
	}

	// All checks pass
	c.JSON(http.StatusOK, gin.H{
		"status":     "healthy",
		"goroutines": numGoroutines,
		"timestamp":  time.Now().UTC().Format(time.RFC3339),
	})
}

func main() {
	r := gin.Default()
	r.GET("/health/live", LivenessHandler)
	r.Run(":8080")
}
```

The simplest valid liveness check is returning 200 OK with no logic at all. If the HTTP handler can execute, the process is alive. Only add additional checks if you have specific failure modes (like deadlocks) that don't prevent HTTP responses.
The readiness endpoint answers a more nuanced question than liveness: Can this instance productively serve a request right now? This requires understanding both what 'productively' means for your application and what transient conditions might prevent it.
Readiness Check Principles:
Dependency Verification: Unlike liveness, readiness should check dependencies. If your database is down, you can't serve requests—you should be temporarily removed from rotation.
Fast Failure Recovery: Readiness failures should be recoverable without restart. When the database comes back, readiness should automatically restore.
Load Awareness: Consider whether your instance is too overloaded to accept additional traffic. An overwhelmed instance might be 'alive' but not 'ready' for more load.
Initialization Status: Before all startup tasks complete (warming caches, establishing connections), the instance isn't ready.
| Component | What to Check | Failure Meaning |
|---|---|---|
| Primary Database | Connection pool has available connections | Can't process data-dependent requests |
| Cache Layer | Cache connection is established | Performance may be degraded |
| Message Queue | Queue connection is active | Can't process async workflows |
| Initialization | All startup tasks have completed | Not yet ready to serve |
| Resource Limits | Under configured ceiling for connections/memory | May be overloaded |
| Feature Flags | Feature flag service is reachable (if critical) | May serve incorrect feature states |
```typescript
// TypeScript/Express: Comprehensive Readiness Endpoint

import express, { Request, Response } from 'express';
import { Pool } from 'pg';
import Redis from 'ioredis';

interface HealthCheckResult {
  service: string;
  status: 'healthy' | 'unhealthy' | 'degraded';
  latencyMs?: number;
  message?: string;
}

interface ReadinessResponse {
  status: 'ready' | 'not_ready';
  timestamp: string;
  checks: HealthCheckResult[];
  version?: string;
}

// Dependency references
let pgPool: Pool;
let redisClient: Redis;
let isInitialized = false;

// Configurable timeouts
const CHECK_TIMEOUT_MS = 2000;
const MAX_CONNECTION_POOL_USAGE = 0.8; // 80% threshold

/**
 * Check PostgreSQL connectivity and pool health
 */
async function checkDatabase(): Promise<HealthCheckResult> {
  const start = Date.now();
  try {
    // Verify we can execute a query
    await Promise.race([
      pgPool.query('SELECT 1'),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('timeout')), CHECK_TIMEOUT_MS)
      )
    ]);

    // Check connection pool saturation
    const totalCount = pgPool.totalCount;
    const idleCount = pgPool.idleCount;
    const waitingCount = pgPool.waitingCount;

    if (waitingCount > 0) {
      return {
        service: 'postgresql',
        status: 'degraded',
        latencyMs: Date.now() - start,
        message: `${waitingCount} requests waiting for connections`
      };
    }

    return {
      service: 'postgresql',
      status: 'healthy',
      latencyMs: Date.now() - start
    };
  } catch (error) {
    return {
      service: 'postgresql',
      status: 'unhealthy',
      latencyMs: Date.now() - start,
      message: error instanceof Error ? error.message : 'Unknown error'
    };
  }
}

/**
 * Check Redis connectivity
 */
async function checkRedis(): Promise<HealthCheckResult> {
  const start = Date.now();
  try {
    await Promise.race([
      redisClient.ping(),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('timeout')), CHECK_TIMEOUT_MS)
      )
    ]);

    return {
      service: 'redis',
      status: 'healthy',
      latencyMs: Date.now() - start
    };
  } catch (error) {
    return {
      service: 'redis',
      status: 'unhealthy',
      latencyMs: Date.now() - start,
      message: error instanceof Error ? error.message : 'Unknown error'
    };
  }
}

/**
 * Check initialization status
 */
function checkInitialization(): HealthCheckResult {
  if (!isInitialized) {
    return {
      service: 'initialization',
      status: 'unhealthy',
      message: 'Application initialization not complete'
    };
  }
  return {
    service: 'initialization',
    status: 'healthy'
  };
}

/**
 * Readiness endpoint handler
 */
async function readinessHandler(req: Request, res: Response) {
  // Run all checks in parallel
  const checkResults = await Promise.all([
    checkInitialization(),
    checkDatabase(),
    checkRedis()
  ]);

  // Determine overall readiness
  const hasUnhealthy = checkResults.some(c => c.status === 'unhealthy');
  const hasDegraded = checkResults.some(c => c.status === 'degraded');

  const response: ReadinessResponse = {
    status: hasUnhealthy ? 'not_ready' : 'ready',
    timestamp: new Date().toISOString(),
    checks: checkResults,
    version: process.env.APP_VERSION
  };

  const statusCode = hasUnhealthy ? 503 : 200;
  res.status(statusCode).json(response);
}

const app = express();
app.get('/health/ready', readinessHandler);
```

Should every dependency failure cause a readiness failure? Not necessarily. Consider a service that can partially function without its cache layer—perhaps with degraded performance but still serving requests. Use 'degraded' status for non-critical dependencies and reserve 'unhealthy' for dependencies without which processing is impossible.
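The example above covers dependencies and initialization but not the load-awareness principle. A minimal sketch of that idea, assuming an in-flight request counter maintained by middleware (the ceiling of 200 is illustrative, not a recommendation):

```typescript
// Load-awareness sketch: report 'degraded' when too many requests are in flight.
// Reuses the HealthCheckResult shape from the readiness example above.
import { Request, Response, NextFunction } from 'express';

const MAX_IN_FLIGHT = 200; // illustrative ceiling; tune per service
let inFlight = 0;

// Attach early in the middleware chain to count concurrent requests
function trackInFlight(req: Request, res: Response, next: NextFunction) {
  inFlight++;
  res.on('finish', () => { inFlight--; });
  next();
}

function checkLoad(): HealthCheckResult {
  if (inFlight > MAX_IN_FLIGHT) {
    return {
      service: 'load',
      status: 'degraded',
      message: `${inFlight} requests in flight (ceiling ${MAX_IN_FLIGHT})`
    };
  }
  return { service: 'load', status: 'healthy' };
}
```

Adding `checkLoad()` to the `Promise.all` in `readinessHandler` folds load into the same degraded/unhealthy decision as the other checks.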
Beyond liveness and readiness, many production systems implement a 'deep health' or 'full health' endpoint that provides comprehensive diagnostic information. This endpoint isn't used for routing decisions—it's used for monitoring, debugging, and operational visibility.
Purpose of Deep Health Checks:
Diagnostic Context: When investigating issues, operators need visibility into all system components, not just pass/fail status.
Trend Analysis: By recording deep health responses over time, you can identify gradual degradation before it becomes a hard failure.
Dependency Mapping: Deep health responses document which external services your application depends on, serving as living documentation.
Version Tracking: Including application version, deployment timestamp, and configuration details helps correlate health with changes.
```typescript
// TypeScript: Deep Health Endpoint for Diagnostics

import os from 'os';
import { Request, Response } from 'express';

interface DeepHealthResponse {
  status: 'healthy' | 'unhealthy' | 'degraded';
  timestamp: string;
  uptime: number;
  version: {
    app: string;
    git_sha: string;
    deployed_at: string;
  };
  system: {
    hostname: string;
    node_version: string;
    memory: {
      heap_used_mb: number;
      heap_total_mb: number;
      external_mb: number;
      rss_mb: number;
    };
    cpu: {
      user_percent: number;
      system_percent: number;
    };
  };
  dependencies: DependencyHealth[];
  configuration: {
    environment: string;
    region: string;
    log_level: string;
    feature_flags: Record<string, boolean>;
  };
}

interface DependencyHealth {
  name: string;
  type: 'database' | 'cache' | 'queue' | 'service' | 'storage';
  status: 'healthy' | 'unhealthy' | 'degraded' | 'unknown';
  latency_ms?: number;
  details?: Record<string, unknown>;
  error?: string;
}

async function deepHealthHandler(req: Request, res: Response) {
  const startTime = process.hrtime.bigint();

  // Gather system metrics
  const memUsage = process.memoryUsage();
  const cpuUsage = process.cpuUsage();

  // Check all dependencies with detailed information
  // (checkRedisDeep, checkKafkaDeep, checkS3Deep, checkDownstreamServicesDeep,
  //  and getFeatureFlags are analogous helpers, not shown here)
  const dependencies: DependencyHealth[] = await Promise.all([
    checkDatabaseDeep(),
    checkRedisDeep(),
    checkKafkaDeep(),
    checkS3Deep(),
    checkDownstreamServicesDeep()
  ]).then(results => results.flat());

  // Determine overall status
  const unhealthyCount = dependencies.filter(d => d.status === 'unhealthy').length;
  const degradedCount = dependencies.filter(d => d.status === 'degraded').length;

  let status: 'healthy' | 'unhealthy' | 'degraded' = 'healthy';
  if (unhealthyCount > 0) status = 'unhealthy';
  else if (degradedCount > 0) status = 'degraded';

  const response: DeepHealthResponse = {
    status,
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: {
      app: process.env.APP_VERSION || 'unknown',
      git_sha: process.env.GIT_SHA || 'unknown',
      deployed_at: process.env.DEPLOYED_AT || 'unknown'
    },
    system: {
      hostname: os.hostname(),
      node_version: process.version,
      memory: {
        heap_used_mb: Math.round(memUsage.heapUsed / 1024 / 1024),
        heap_total_mb: Math.round(memUsage.heapTotal / 1024 / 1024),
        external_mb: Math.round(memUsage.external / 1024 / 1024),
        rss_mb: Math.round(memUsage.rss / 1024 / 1024)
      },
      cpu: {
        // process.cpuUsage() reports microseconds; converted to seconds here
        user_percent: cpuUsage.user / 1000000,
        system_percent: cpuUsage.system / 1000000
      }
    },
    dependencies,
    configuration: {
      environment: process.env.NODE_ENV || 'development',
      region: process.env.AWS_REGION || 'unknown',
      log_level: process.env.LOG_LEVEL || 'info',
      feature_flags: await getFeatureFlags()
    }
  };

  // Always return 200 for deep health - it's diagnostic, not routing
  // Use the response body for status interpretation
  res.status(200).json(response);
}

async function checkDatabaseDeep(): Promise<DependencyHealth[]> {
  try {
    const start = Date.now();

    // Get detailed database stats (pg's Pool exposes pool counters directly)
    const versionResult = await pgPool.query('SELECT version()');
    const statsResult = await pgPool.query(`
      SELECT
        numbackends as connections,
        xact_commit as transactions_committed,
        xact_rollback as transactions_rolled_back,
        blks_hit as cache_hits,
        blks_read as disk_reads
      FROM pg_stat_database
      WHERE datname = current_database()
    `);

    return [{
      name: 'postgresql',
      type: 'database',
      status: 'healthy',
      latency_ms: Date.now() - start,
      details: {
        version: versionResult.rows[0].version,
        pool_total: pgPool.totalCount,
        pool_idle: pgPool.idleCount,
        pool_waiting: pgPool.waitingCount,
        stats: statsResult.rows[0]
      }
    }];
  } catch (error) {
    return [{
      name: 'postgresql',
      type: 'database',
      status: 'unhealthy',
      error: error instanceof Error ? error.message : 'Unknown error'
    }];
  }
}
```

Deep health endpoints expose sensitive operational details including hostnames, version numbers, and configuration. Always restrict access through authentication, network segmentation, or IP allowlisting. Never expose deep health to the public internet.
Health check endpoints face a unique performance paradox: they're called frequently, often under the exact conditions where the system is already stressed, yet they're typically expected to respond extremely quickly. Poor health endpoint performance can create a vicious cycle where stressed systems appear unhealthy due to slow health checks, causing traffic removal and directing more load to remaining instances.
Performance Guidelines:
```typescript
// TypeScript: Performance-Optimized Health Checking

import { performance } from 'perf_hooks';
import { Request, Response, NextFunction } from 'express';

// Cache for expensive health check results
interface CachedHealth {
  result: HealthCheckResult;
  cachedAt: number;
}

const healthCache = new Map<string, CachedHealth>();
const CACHE_TTL_MS = 2000; // 2 second cache

/**
 * Get cached health check or execute if stale
 */
async function getCachedHealthCheck(
  name: string,
  checker: () => Promise<HealthCheckResult>
): Promise<HealthCheckResult & { cached?: boolean }> {
  const now = Date.now();
  const cached = healthCache.get(name);

  if (cached && (now - cached.cachedAt) < CACHE_TTL_MS) {
    return { ...cached.result, cached: true };
  }

  const result = await checker();
  healthCache.set(name, { result, cachedAt: now });
  return result;
}

/**
 * Run multiple health checks concurrently with global timeout
 */
async function runHealthChecks(
  checks: Array<{ name: string; check: () => Promise<HealthCheckResult> }>,
  globalTimeoutMs: number
): Promise<HealthCheckResult[]> {
  const start = performance.now();

  // Create promise for global timeout
  const timeout = new Promise<never>((_, reject) => {
    setTimeout(() => {
      reject(new Error(`Health checks exceeded global timeout of ${globalTimeoutMs}ms`));
    }, globalTimeoutMs);
  });

  // Run all checks with individual handling
  const checkPromises = checks.map(async ({ name, check }) => {
    try {
      return await getCachedHealthCheck(name, check);
    } catch (error) {
      // Individual check failure doesn't fail the batch
      return {
        service: name,
        status: 'unhealthy' as const,
        message: error instanceof Error ? error.message : 'Unknown error'
      };
    }
  });

  try {
    // Race against global timeout
    const results = await Promise.race([
      Promise.all(checkPromises),
      timeout
    ]);

    const duration = performance.now() - start;
    console.log(`Health checks completed in ${duration.toFixed(2)}ms`);
    return results;
  } catch (error) {
    // Global timeout exceeded - return partial results with timeout markers
    const duration = performance.now() - start;
    console.error(`Health checks timed out after ${duration.toFixed(2)}ms`);

    return checks.map(({ name }) => ({
      service: name,
      status: 'unhealthy' as const,
      message: 'Health check timed out'
    }));
  }
}

// Express middleware to track health endpoint performance
function healthMetricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = performance.now();

  res.on('finish', () => {
    const duration = performance.now() - start;

    // Record metrics (healthCheckDuration is a histogram defined elsewhere,
    // e.g. a prom-client Histogram)
    healthCheckDuration.observe({
      path: req.path,
      status: res.statusCode.toString()
    }, duration);

    // Warn if health check is slow
    if (duration > 100 && req.path.includes('/ready')) {
      console.warn(`Slow readiness check: ${duration.toFixed(2)}ms`);
    } else if (duration > 10 && req.path.includes('/live')) {
      console.warn(`Slow liveness check: ${duration.toFixed(2)}ms`);
    }
  });

  next();
}
```

Instead of running 'SELECT 1' to verify database health, check the connection pool's internal state. Most connection pools track how many connections are available, waiting, and failed. Reading this metadata is instant and doesn't consume a connection or execute a query.
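As a concrete illustration of that tip, node-postgres exposes `totalCount`, `idleCount`, and `waitingCount` directly on the `Pool`. A sketch of a query-free check, with illustrative thresholds:

```typescript
// Sketch: infer database readiness from pool metadata instead of a live query.
import { Pool } from 'pg';

const pool = new Pool({ max: 20 });

function checkPoolHealth(): { service: string; status: 'healthy' | 'degraded'; message?: string } {
  const { totalCount, idleCount, waitingCount } = pool;

  // Requests queuing for a connection is an early sign of saturation.
  if (waitingCount > 0) {
    return {
      service: 'postgresql',
      status: 'degraded',
      message: `${waitingCount} requests waiting for a connection`
    };
  }

  // Every connection busy: still serving, but no headroom left.
  if (totalCount > 0 && idleCount === 0) {
    return {
      service: 'postgresql',
      status: 'degraded',
      message: 'connection pool fully utilized'
    };
  }

  return { service: 'postgresql', status: 'healthy' };
}
```

Because no query runs, this check stays fast even when the database itself is slow; pair it with an occasional real query (cached, as above) to confirm end-to-end connectivity.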
While load balancers typically only care about HTTP status codes, well-designed health endpoints provide structured responses that aid debugging, monitoring, and automation. Several standards have emerged for health check response formats.
HTTP Status Code Conventions:
| Status Code | Meaning | When to Use |
|---|---|---|
| 200 OK | Healthy/Ready | All checks pass |
| 503 Service Unavailable | Unhealthy/Not Ready | Critical checks fail; remove from rotation |
| 500 Internal Server Error | Check Failed | The health check itself encountered an error |
| 429 Too Many Requests | Rate Limited | Health check load is excessive |
| 207 Multi-Status | Mixed Results | Some checks pass, some fail (informational only) |
Response Body Standards:
The RFC draft for Health Check Response Format (draft-inadarei-api-health-check) proposes a standardized JSON structure:
{ "status": "pass", "version": "1.2.3", "releaseId": "abc123", "serviceId": "api-gateway", "description": "Primary API Gateway Service", "output": "", "notes": ["All systems operational"], "checks": { "postgresql:connections": [ { "componentId": "database-001", "componentType": "datastore", "observedValue": 25, "observedUnit": "connections", "status": "pass", "time": "2024-01-15T10:30:00Z" } ], "redis:responseTime": [ { "componentId": "cache-001", "componentType": "datastore", "observedValue": 2.5, "observedUnit": "ms", "status": "pass", "time": "2024-01-15T10:30:00Z" } ], "memory:utilization": [ { "componentId": "self", "componentType": "system", "observedValue": 68.5, "observedUnit": "percent", "status": "warn", "time": "2024-01-15T10:30:00Z", "output": "Memory usage approaching threshold" } ] }, "links": { "about": "https://docs.company.com/api-gateway", "docs": "https://docs.company.com/api-gateway/health" }}Load balancers typically don't parse response bodies—they use status codes. The detailed response body is for human operators and monitoring systems. Keep liveness and readiness endpoints simple. Reserve detailed responses for deep health/diagnostic endpoints.
Health check endpoints are the interface between your application and the infrastructure that routes traffic to it. Their design determines whether your system can accurately represent its operational state.
What's next:
Health checks tell us whether something is wrong. The next critical skill is understanding how to detect failures and when they become actionable. We'll explore failure detection mechanisms—the algorithms and strategies that convert health observations into routing decisions.
You now understand how to design health check endpoints that accurately represent your application's ability to serve traffic. You've learned the taxonomy of health checks, how to implement each type, and how to optimize them for production.