In 2019, a major e-commerce platform experienced an unusual outage. Their health check endpoint returned 200 OK. The load balancer dutifully marked all servers as healthy. Yet customers couldn't complete checkouts—the payment processing service was down, and the health endpoint hadn't been designed to verify it.
The health check was technically correct: the servers were running. But it was operationally useless: the system couldn't fulfill its core business function.
This scenario illustrates a critical truth: a health check endpoint is only as valuable as the health conditions it verifies. The seemingly simple task of returning a status code masks profound design decisions about what 'health' means for your application.
Should a health endpoint verify database connectivity? What about downstream services? Should it check for sufficient disk space or memory? Should it test business-critical workflows or just process availability? The answers shape whether your health checks provide meaningful signals or false reassurance.
By the end of this page, you will understand how to design health check endpoints that accurately represent your application's ability to serve its intended purpose. You'll learn the distinction between liveness, readiness, and deep health checks, how to verify dependencies without creating brittleness, and how to structure endpoints for operational clarity.
Not all health checks serve the same purpose. Modern distributed systems typically implement a taxonomy of health endpoints, each designed to answer a different question about the application's state.
The Three-Tier Health Check Model:
This model, popularized by Kubernetes but applicable to any infrastructure, separates health checking into three probe types (liveness, readiness, and startup), supplemented by a deep health check used for diagnostics rather than routing:
| Check Type | Question Answered | Failure Response | Typical Endpoint |
|---|---|---|---|
| Liveness | Is the process fundamentally alive? | Restart the container/process | /health/live or /livez |
| Readiness | Can this instance handle traffic right now? | Remove from load balancer rotation | /health/ready or /readyz |
| Startup | Has the application finished initializing? | Wait longer before checking liveness/readiness | /health/startup or /startupz |
| Deep Health | Is the full application stack operational? | Diagnostic and alerting purposes | /health or /health/full |
Understanding the Distinctions:
Liveness answers: Should we keep this instance running, or should we restart it?
A liveness check should fail only when the instance is in an unrecoverable state—deadlocked, memory-corrupted, or otherwise broken in a way that requires a restart to fix. Liveness checks should be extremely simple and fast. They should not verify dependencies, because if your database goes down, restarting your application servers won't fix it.
Readiness answers: Can this instance productively handle a request right now?
Readiness checks can and should verify dependencies. If the database is unreachable, the readiness check should fail, removing the instance from load balancer rotation. But the instance itself isn't broken—when the database recovers, readiness should restore automatically. Crucially, a readiness failure should not trigger a restart.
Startup answers: Has the application completed its initialization sequence?
For applications with slow startup (loading caches, establishing connection pools, running migrations), the startup check prevents liveness probes from killing an instance that's still initializing. Once the startup check passes, the liveness and readiness probes take over.
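A startup endpoint often needs nothing more than a flag that flips once initialization finishes. Here is a minimal sketch in Express, assuming a hypothetical `runStartupTasks()` placeholder for whatever warm-up your application performs:

```typescript
// Minimal startup endpoint sketch. `runStartupTasks` is a placeholder for
// whatever initialization your application performs (cache warming,
// connection pools, migrations).
import express from 'express';

const app = express();
let startupComplete = false;

async function runStartupTasks(): Promise<void> {
  // e.g. warm caches, establish connection pools, run migrations
}

runStartupTasks()
  .then(() => { startupComplete = true; })
  .catch((err) => {
    console.error('Startup failed:', err);
    process.exit(1); // let the orchestrator restart the instance
  });

app.get('/health/startup', (_req, res) => {
  if (startupComplete) {
    res.status(200).json({ status: 'started' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
});

app.listen(3000);
```

Once this endpoint returns 200, the orchestrator begins probing liveness and readiness as described above.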
A common production catastrophe: including database connectivity in liveness checks. When the database goes down briefly, all application servers fail liveness and get restarted simultaneously. Now you have a database outage AND all your application servers trying to restart at once. Always keep liveness checks free of external dependencies.
The liveness endpoint has a singular purpose: determine whether the application process is in a state where restarting would help. This seemingly simple requirement demands careful consideration of what constitutes an 'unrecoverable' state.
What Liveness Should Check:
```typescript
// TypeScript/Express Example: Minimal Liveness Endpoint

import express from 'express';
import { threadPoolHealth } from './monitoring';
import { database } from './db'; // used only by the anti-pattern example below

const app = express();

/**
 * Liveness endpoint - answers "should we restart this process?"
 *
 * Rules:
 * - Must respond extremely fast (< 10ms target)
 * - Must NOT check external dependencies
 * - Should only fail for unrecoverable local state
 */
app.get('/health/live', (req, res) => {
  // Check 1: Can we even respond? (Implicit - if we got here, yes)

  // Check 2: Are worker threads responsive?
  const threadHealth = threadPoolHealth();
  if (threadHealth.deadlockedThreads > 0) {
    return res.status(503).json({
      status: 'unhealthy',
      reason: 'thread_deadlock',
      deadlockedThreads: threadHealth.deadlockedThreads
    });
  }

  // Check 3: Is our event loop responsive?
  // (If this handler runs, the event loop is alive)

  // All checks pass - we're alive
  res.status(200).json({
    status: 'healthy',
    timestamp: new Date().toISOString()
  });
});

// Anti-pattern: DO NOT include database checks in liveness
// ❌ BAD - This can cause cascading restarts
app.get('/health/live-bad', async (req, res) => {
  try {
    await database.query('SELECT 1'); // DON'T DO THIS IN LIVENESS
    res.json({ status: 'healthy' });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy' }); // Will restart!
  }
});
```
```go
// Go/Gin Example: Minimal Liveness Endpoint

package main

import (
	"net/http"
	"runtime"
	"time"

	"github.com/gin-gonic/gin"
)

// LivenessHandler checks if the process is fundamentally alive
// and should continue running (vs being restarted)
func LivenessHandler(c *gin.Context) {
	// Check 1: Verify the runtime is healthy
	// If we couldn't call NumGoroutine, we'd have a serious problem
	numGoroutines := runtime.NumGoroutine()

	// Check 2: Watch for goroutine leaks that might indicate deadlock
	// This threshold should be tuned per-application
	const maxGoroutines = 10000
	if numGoroutines > maxGoroutines {
		c.JSON(http.StatusServiceUnavailable, gin.H{
			"status":     "unhealthy",
			"reason":     "goroutine_leak_suspected",
			"goroutines": numGoroutines,
			"threshold":  maxGoroutines,
		})
		return
	}

	// Check 3: Verify memory isn't critically exhausted
	var memStats runtime.MemStats
	runtime.ReadMemStats(&memStats)

	// If heap is over 90% of system memory, we may need restart
	const heapThreshold = 0.9
	heapUsageRatio := float64(memStats.HeapAlloc) / float64(memStats.HeapSys)
	if heapUsageRatio > heapThreshold {
		c.JSON(http.StatusServiceUnavailable, gin.H{
			"status":     "unhealthy",
			"reason":     "heap_exhaustion",
			"heapAlloc":  memStats.HeapAlloc,
			"heapSys":    memStats.HeapSys,
			"usageRatio": heapUsageRatio,
		})
		return
	}

	// All checks pass
	c.JSON(http.StatusOK, gin.H{
		"status":     "healthy",
		"goroutines": numGoroutines,
		"timestamp":  time.Now().UTC().Format(time.RFC3339),
	})
}

func main() {
	r := gin.Default()
	r.GET("/health/live", LivenessHandler)
	r.Run(":8080")
}
```

The simplest valid liveness check is returning 200 OK with no logic at all. If the HTTP handler can execute, the process is alive. Only add additional checks if you have specific failure modes (like deadlocks) that don't prevent HTTP responses.
The readiness endpoint answers a more nuanced question than liveness: Can this instance productively serve a request right now? This requires understanding both what 'productively' means for your application and what transient conditions might prevent it.
Readiness Check Principles:
Dependency Verification: Unlike liveness, readiness should check dependencies. If your database is down, you can't serve requests—you should be temporarily removed from rotation.
Fast Failure Recovery: Readiness failures should be recoverable without restart. When the database comes back, readiness should automatically restore.
Load Awareness: Consider whether your instance is too overloaded to accept additional traffic. An overwhelmed instance might be 'alive' but not 'ready' for more load.
Initialization Status: Before all startup tasks complete (warming caches, establishing connections), the instance isn't ready.
| Component | What to Check | Failure Meaning |
|---|---|---|
| Primary Database | Connection pool has available connections | Can't process data-dependent requests |
| Cache Layer | Cache connection is established | Performance may be degraded |
| Message Queue | Queue connection is active | Can't process async workflows |
| Initialization | All startup tasks have completed | Not yet ready to serve |
| Resource Limits | Under configured ceiling for connections/memory | May be overloaded |
| Feature Flags | Feature flag service is reachable (if critical) | May serve incorrect feature states |
```typescript
// TypeScript/Express: Comprehensive Readiness Endpoint

import express, { Request, Response } from 'express';
import { Pool } from 'pg';
import Redis from 'ioredis';

interface HealthCheckResult {
  service: string;
  status: 'healthy' | 'unhealthy' | 'degraded';
  latencyMs?: number;
  message?: string;
}

interface ReadinessResponse {
  status: 'ready' | 'not_ready';
  timestamp: string;
  checks: HealthCheckResult[];
  version?: string;
}

// Dependency references
let pgPool: Pool;
let redisClient: Redis;
let isInitialized = false;

// Configurable timeouts
const CHECK_TIMEOUT_MS = 2000;
const MAX_CONNECTION_POOL_USAGE = 0.8; // 80% threshold

/**
 * Check PostgreSQL connectivity and pool health
 */
async function checkDatabase(): Promise<HealthCheckResult> {
  const start = Date.now();
  try {
    // Verify we can execute a query
    await Promise.race([
      pgPool.query('SELECT 1'),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('timeout')), CHECK_TIMEOUT_MS)
      )
    ]);

    // Check connection pool saturation
    const totalCount = pgPool.totalCount;
    const idleCount = pgPool.idleCount;
    const waitingCount = pgPool.waitingCount;

    if (waitingCount > 0) {
      return {
        service: 'postgresql',
        status: 'degraded',
        latencyMs: Date.now() - start,
        message: `${waitingCount} requests waiting for connections`
      };
    }

    return {
      service: 'postgresql',
      status: 'healthy',
      latencyMs: Date.now() - start
    };
  } catch (error) {
    return {
      service: 'postgresql',
      status: 'unhealthy',
      latencyMs: Date.now() - start,
      message: error instanceof Error ? error.message : 'Unknown error'
    };
  }
}

/**
 * Check Redis connectivity
 */
async function checkRedis(): Promise<HealthCheckResult> {
  const start = Date.now();
  try {
    await Promise.race([
      redisClient.ping(),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('timeout')), CHECK_TIMEOUT_MS)
      )
    ]);

    return {
      service: 'redis',
      status: 'healthy',
      latencyMs: Date.now() - start
    };
  } catch (error) {
    return {
      service: 'redis',
      status: 'unhealthy',
      latencyMs: Date.now() - start,
      message: error instanceof Error ? error.message : 'Unknown error'
    };
  }
}

/**
 * Check initialization status
 */
function checkInitialization(): HealthCheckResult {
  if (!isInitialized) {
    return {
      service: 'initialization',
      status: 'unhealthy',
      message: 'Application initialization not complete'
    };
  }
  return {
    service: 'initialization',
    status: 'healthy'
  };
}

/**
 * Readiness endpoint handler
 */
async function readinessHandler(req: Request, res: Response) {
  // Run all checks in parallel
  const checkResults = await Promise.all([
    checkInitialization(),
    checkDatabase(),
    checkRedis()
  ]);

  // Determine overall readiness
  const hasUnhealthy = checkResults.some(c => c.status === 'unhealthy');
  const hasDegraded = checkResults.some(c => c.status === 'degraded');

  const response: ReadinessResponse = {
    status: hasUnhealthy ? 'not_ready' : 'ready',
    timestamp: new Date().toISOString(),
    checks: checkResults,
    version: process.env.APP_VERSION
  };

  const statusCode = hasUnhealthy ? 503 : 200;
  res.status(statusCode).json(response);
}

const app = express();
app.get('/health/ready', readinessHandler);
```

Should every dependency failure cause a readiness failure? Not necessarily. Consider a service that can partially function without its cache layer—perhaps with degraded performance but still serving requests. Use 'degraded' status for non-critical dependencies and reserve 'unhealthy' for dependencies without which processing is impossible.
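The example above covers dependencies and initialization but not the load-awareness principle. A minimal sketch of that idea, assuming an in-flight request counter maintained by middleware (the ceiling of 200 is illustrative, not a recommendation):

```typescript
// Load-awareness sketch: report 'degraded' when too many requests are in flight.
// Reuses the HealthCheckResult shape from the readiness example above.
import { Request, Response, NextFunction } from 'express';

const MAX_IN_FLIGHT = 200; // illustrative ceiling; tune per service
let inFlight = 0;

// Attach early in the middleware chain to count concurrent requests
function trackInFlight(req: Request, res: Response, next: NextFunction) {
  inFlight++;
  res.on('finish', () => { inFlight--; });
  next();
}

function checkLoad(): HealthCheckResult {
  if (inFlight > MAX_IN_FLIGHT) {
    return {
      service: 'load',
      status: 'degraded',
      message: `${inFlight} requests in flight (ceiling ${MAX_IN_FLIGHT})`
    };
  }
  return { service: 'load', status: 'healthy' };
}
```

Adding `checkLoad()` to the `Promise.all` in `readinessHandler` folds load into the same degraded/unhealthy decision as the other checks.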
Beyond liveness and readiness, many production systems implement a 'deep health' or 'full health' endpoint that provides comprehensive diagnostic information. This endpoint isn't used for routing decisions—it's used for monitoring, debugging, and operational visibility.
Purpose of Deep Health Checks:
Diagnostic Context: When investigating issues, operators need visibility into all system components, not just pass/fail status.
Trend Analysis: By recording deep health responses over time, you can identify gradual degradation before it becomes a hard failure.
Dependency Mapping: Deep health responses document which external services your application depends on, serving as living documentation.
Version Tracking: Including application version, deployment timestamp, and configuration details helps correlate health with changes.
```typescript
// TypeScript: Deep Health Endpoint for Diagnostics

import os from 'os';
import { Request, Response } from 'express';

interface DeepHealthResponse {
  status: 'healthy' | 'unhealthy' | 'degraded';
  timestamp: string;
  uptime: number;
  version: {
    app: string;
    git_sha: string;
    deployed_at: string;
  };
  system: {
    hostname: string;
    node_version: string;
    memory: {
      heap_used_mb: number;
      heap_total_mb: number;
      external_mb: number;
      rss_mb: number;
    };
    cpu: {
      user_percent: number;
      system_percent: number;
    };
  };
  dependencies: DependencyHealth[];
  configuration: {
    environment: string;
    region: string;
    log_level: string;
    feature_flags: Record<string, boolean>;
  };
}

interface DependencyHealth {
  name: string;
  type: 'database' | 'cache' | 'queue' | 'service' | 'storage';
  status: 'healthy' | 'unhealthy' | 'degraded' | 'unknown';
  latency_ms?: number;
  details?: Record<string, unknown>;
  error?: string;
}

async function deepHealthHandler(req: Request, res: Response) {
  const startTime = process.hrtime.bigint();

  // Gather system metrics
  const memUsage = process.memoryUsage();
  const cpuUsage = process.cpuUsage();

  // Check all dependencies with detailed information
  // (checkRedisDeep, checkKafkaDeep, checkS3Deep, checkDownstreamServicesDeep,
  //  and getFeatureFlags are analogous helpers, not shown here)
  const dependencies: DependencyHealth[] = await Promise.all([
    checkDatabaseDeep(),
    checkRedisDeep(),
    checkKafkaDeep(),
    checkS3Deep(),
    checkDownstreamServicesDeep()
  ]).then(results => results.flat());

  // Determine overall status
  const unhealthyCount = dependencies.filter(d => d.status === 'unhealthy').length;
  const degradedCount = dependencies.filter(d => d.status === 'degraded').length;

  let status: 'healthy' | 'unhealthy' | 'degraded' = 'healthy';
  if (unhealthyCount > 0) status = 'unhealthy';
  else if (degradedCount > 0) status = 'degraded';

  const response: DeepHealthResponse = {
    status,
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    version: {
      app: process.env.APP_VERSION || 'unknown',
      git_sha: process.env.GIT_SHA || 'unknown',
      deployed_at: process.env.DEPLOYED_AT || 'unknown'
    },
    system: {
      hostname: os.hostname(),
      node_version: process.version,
      memory: {
        heap_used_mb: Math.round(memUsage.heapUsed / 1024 / 1024),
        heap_total_mb: Math.round(memUsage.heapTotal / 1024 / 1024),
        external_mb: Math.round(memUsage.external / 1024 / 1024),
        rss_mb: Math.round(memUsage.rss / 1024 / 1024)
      },
      cpu: {
        // process.cpuUsage() reports microseconds; converted to seconds here
        user_percent: cpuUsage.user / 1000000,
        system_percent: cpuUsage.system / 1000000
      }
    },
    dependencies,
    configuration: {
      environment: process.env.NODE_ENV || 'development',
      region: process.env.AWS_REGION || 'unknown',
      log_level: process.env.LOG_LEVEL || 'info',
      feature_flags: await getFeatureFlags()
    }
  };

  // Always return 200 for deep health - it's diagnostic, not routing
  // Use the response body for status interpretation
  res.status(200).json(response);
}

async function checkDatabaseDeep(): Promise<DependencyHealth[]> {
  try {
    const start = Date.now();

    // Get detailed database stats (pg's Pool exposes pool counters directly)
    const versionResult = await pgPool.query('SELECT version()');
    const statsResult = await pgPool.query(`
      SELECT
        numbackends as connections,
        xact_commit as transactions_committed,
        xact_rollback as transactions_rolled_back,
        blks_hit as cache_hits,
        blks_read as disk_reads
      FROM pg_stat_database
      WHERE datname = current_database()
    `);

    return [{
      name: 'postgresql',
      type: 'database',
      status: 'healthy',
      latency_ms: Date.now() - start,
      details: {
        version: versionResult.rows[0].version,
        pool_total: pgPool.totalCount,
        pool_idle: pgPool.idleCount,
        pool_waiting: pgPool.waitingCount,
        stats: statsResult.rows[0]
      }
    }];
  } catch (error) {
    return [{
      name: 'postgresql',
      type: 'database',
      status: 'unhealthy',
      error: error instanceof Error ? error.message : 'Unknown error'
    }];
  }
}
```

Deep health endpoints expose sensitive operational details including hostnames, version numbers, and configuration. Always restrict access through authentication, network segmentation, or IP allowlisting. Never expose deep health to the public internet.
Health check endpoints face a unique performance paradox: they're called frequently, often under the exact conditions where the system is already stressed, yet they're typically expected to respond extremely quickly. Poor health endpoint performance can create a vicious cycle where stressed systems appear unhealthy due to slow health checks, causing traffic removal and directing more load to remaining instances.
Performance Guidelines:
```typescript
// TypeScript: Performance-Optimized Health Checking

import { performance } from 'perf_hooks';
import { Request, Response, NextFunction } from 'express';

// Cache for expensive health check results
interface CachedHealth {
  result: HealthCheckResult;
  cachedAt: number;
}

const healthCache = new Map<string, CachedHealth>();
const CACHE_TTL_MS = 2000; // 2 second cache

/**
 * Get cached health check or execute if stale
 */
async function getCachedHealthCheck(
  name: string,
  checker: () => Promise<HealthCheckResult>
): Promise<HealthCheckResult & { cached?: boolean }> {
  const now = Date.now();
  const cached = healthCache.get(name);

  if (cached && (now - cached.cachedAt) < CACHE_TTL_MS) {
    return { ...cached.result, cached: true };
  }

  const result = await checker();
  healthCache.set(name, { result, cachedAt: now });
  return result;
}

/**
 * Run multiple health checks concurrently with global timeout
 */
async function runHealthChecks(
  checks: Array<{ name: string; check: () => Promise<HealthCheckResult> }>,
  globalTimeoutMs: number
): Promise<HealthCheckResult[]> {
  const start = performance.now();

  // Create promise for global timeout
  const timeout = new Promise<never>((_, reject) => {
    setTimeout(() => {
      reject(new Error(`Health checks exceeded global timeout of ${globalTimeoutMs}ms`));
    }, globalTimeoutMs);
  });

  // Run all checks with individual handling
  const checkPromises = checks.map(async ({ name, check }) => {
    try {
      return await getCachedHealthCheck(name, check);
    } catch (error) {
      // Individual check failure doesn't fail the batch
      return {
        service: name,
        status: 'unhealthy' as const,
        message: error instanceof Error ? error.message : 'Unknown error'
      };
    }
  });

  try {
    // Race against global timeout
    const results = await Promise.race([
      Promise.all(checkPromises),
      timeout
    ]);

    const duration = performance.now() - start;
    console.log(`Health checks completed in ${duration.toFixed(2)}ms`);
    return results;
  } catch (error) {
    // Global timeout exceeded - return partial results with timeout markers
    const duration = performance.now() - start;
    console.error(`Health checks timed out after ${duration.toFixed(2)}ms`);

    return checks.map(({ name }) => ({
      service: name,
      status: 'unhealthy' as const,
      message: 'Health check timed out'
    }));
  }
}

// Express middleware to track health endpoint performance
function healthMetricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = performance.now();

  res.on('finish', () => {
    const duration = performance.now() - start;

    // Record metrics (healthCheckDuration is a histogram defined elsewhere,
    // e.g. a prom-client Histogram)
    healthCheckDuration.observe({
      path: req.path,
      status: res.statusCode.toString()
    }, duration);

    // Warn if health check is slow
    if (duration > 100 && req.path.includes('/ready')) {
      console.warn(`Slow readiness check: ${duration.toFixed(2)}ms`);
    } else if (duration > 10 && req.path.includes('/live')) {
      console.warn(`Slow liveness check: ${duration.toFixed(2)}ms`);
    }
  });

  next();
}
```

Instead of running 'SELECT 1' to verify database health, check the connection pool's internal state. Most connection pools track how many connections are available, waiting, and failed. Reading this metadata is instant and doesn't consume a connection or execute a query.
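As a concrete illustration of that tip, node-postgres exposes `totalCount`, `idleCount`, and `waitingCount` directly on the `Pool`. A sketch of a query-free check, with illustrative thresholds:

```typescript
// Sketch: infer database readiness from pool metadata instead of a live query.
import { Pool } from 'pg';

const pool = new Pool({ max: 20 });

function checkPoolHealth(): { service: string; status: 'healthy' | 'degraded'; message?: string } {
  const { totalCount, idleCount, waitingCount } = pool;

  // Requests queuing for a connection is an early sign of saturation.
  if (waitingCount > 0) {
    return {
      service: 'postgresql',
      status: 'degraded',
      message: `${waitingCount} requests waiting for a connection`
    };
  }

  // Every connection busy: still serving, but no headroom left.
  if (totalCount > 0 && idleCount === 0) {
    return {
      service: 'postgresql',
      status: 'degraded',
      message: 'connection pool fully utilized'
    };
  }

  return { service: 'postgresql', status: 'healthy' };
}
```

Because no query runs, this check stays fast even when the database itself is slow; pair it with an occasional real query (cached, as above) to confirm end-to-end connectivity.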
While load balancers typically only care about HTTP status codes, well-designed health endpoints provide structured responses that aid debugging, monitoring, and automation. Several standards have emerged for health check response formats.
HTTP Status Code Conventions:
| Status Code | Meaning | When to Use |
|---|---|---|
| 200 OK | Healthy/Ready | All checks pass |
| 503 Service Unavailable | Unhealthy/Not Ready | Critical checks fail; remove from rotation |
| 500 Internal Server Error | Check Failed | The health check itself encountered an error |
| 429 Too Many Requests | Rate Limited | Health check load is excessive |
| 207 Multi-Status | Mixed Results | Some checks pass, some fail (informational only) |
Response Body Standards:
The RFC draft for Health Check Response Format (draft-inadarei-api-health-check) proposes a standardized JSON structure:
{ "status": "pass", "version": "1.2.3", "releaseId": "abc123", "serviceId": "api-gateway", "description": "Primary API Gateway Service", "output": "", "notes": ["All systems operational"], "checks": { "postgresql:connections": [ { "componentId": "database-001", "componentType": "datastore", "observedValue": 25, "observedUnit": "connections", "status": "pass", "time": "2024-01-15T10:30:00Z" } ], "redis:responseTime": [ { "componentId": "cache-001", "componentType": "datastore", "observedValue": 2.5, "observedUnit": "ms", "status": "pass", "time": "2024-01-15T10:30:00Z" } ], "memory:utilization": [ { "componentId": "self", "componentType": "system", "observedValue": 68.5, "observedUnit": "percent", "status": "warn", "time": "2024-01-15T10:30:00Z", "output": "Memory usage approaching threshold" } ] }, "links": { "about": "https://docs.company.com/api-gateway", "docs": "https://docs.company.com/api-gateway/health" }}Load balancers typically don't parse response bodies—they use status codes. The detailed response body is for human operators and monitoring systems. Keep liveness and readiness endpoints simple. Reserve detailed responses for deep health/diagnostic endpoints.
Health check endpoints are the interface between your application and the infrastructure that routes traffic to it. Their design determines whether your system can accurately represent its operational state.
What's next:
Health checks tell us whether something is wrong. The next critical skill is understanding how to detect failures and when they become actionable. We'll explore failure detection mechanisms—the algorithms and strategies that convert health observations into routing decisions.
You now understand how to design health check endpoints that accurately represent your application's ability to serve traffic. You've learned the taxonomy of health checks, how to implement each type, and how to optimize them for production.