On the previous page, we learned that connection establishment (DNS + TCP + TLS) can consume 75% or more of a request's latency on cold paths. This reveals a fundamental truth: how you manage connections is as important as what you send over them.
Connection management—the art of establishing, reusing, monitoring, and retiring network connections—is one of the most impactful aspects of distributed systems engineering. Poor connection management manifests as inflated latency on cold paths, exhausted ephemeral ports and file descriptors, TIME_WAIT buildup, and connection storms that overwhelm downstream services.
Conversely, well-tuned connection management enables systems to handle high throughput with minimal latency overhead. This page provides comprehensive coverage of connection management strategies, from basic pooling to advanced failure handling.
By the end of this page, you will understand connection pooling mechanics, know how to size and configure connection pools, implement health checking for pooled connections, handle connection failures gracefully, and optimize keep-alive settings for different scenarios.
Connection pooling is the practice of maintaining a cache of database/service connections that can be reused across requests, rather than creating a new connection for each request.
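For a concrete picture, here is a minimal sketch using the node-postgres (pg) Pool; the host, credentials, and table are illustrative placeholders:

```typescript
import { Pool } from 'pg';

// One pool per process, created at startup and shared by all requests.
// Connection details below are placeholders.
const pool = new Pool({
  host: 'db.internal',
  user: 'app',
  password: process.env.DB_PASSWORD,
  database: 'orders',
  max: 10,                         // upper bound on open connections
  idleTimeoutMillis: 30_000,       // close connections idle longer than 30s
  connectionTimeoutMillis: 5_000,  // fail if establishing a connection takes > 5s
});

// Each query borrows a connection from the pool and returns it when done.
export async function getOrder(id: string) {
  const { rows } = await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
  return rows[0];
}
```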
The Cost of Connection Per Request:
Without pooling, each request pays the full connection establishment cost:
Request 1: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
Request 2: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
Request 3: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
...
100 requests = 23,000ms = 23 seconds total, 20 seconds of which is connection overhead alone!
With Connection Pooling:
Request 1: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms (establishes pool)
Request 2: [Request 30ms] = 30ms (reuses connection)
Request 3: [Request 30ms] = 30ms (reuses connection)
...
100 requests = 230ms + (99 × 30ms) = 3,200ms = 3.2 seconds
Savings: 86% latency reduction!
Resource Benefits:
Beyond latency, pooling provides critical resource benefits:
| Resource | Per-Request Connections | Pooled Connections |
|---|---|---|
| Server file descriptors | New FD per request | Fixed pool of FDs |
| Server memory | Buffer per connection | Bounded buffers |
| Client ephemeral ports | Port per request | Port per pool entry |
| TLS session overhead | Full handshake each time | Session reuse |
| DNS queries | Per request (if no cache) | Once per connection |
| TCP TIME_WAIT | Accumulates rapidly | Connections rarely close |
At 1,000 requests/second without pooling: 1,000 TCP connections opened/second, 60,000 connections in TIME_WAIT state (assuming 60s TIME_WAIT), port exhaustion, file descriptor exhaustion, and potential connection storms that overwhelm servers.
Connection State Management:
Connection pools must manage connection lifecycle:
┌───────────────────────────────────────────────────────────────────┐
│                   Connection Pool State Machine                   │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐  │
│  │ Creating │────▶│ Available│────▶│  In Use  │────▶│ Released │  │
│  └──────────┘     └──────────┘     └──────────┘     └──────────┘  │
│       │                │                │                │        │
│       ▼                ▼                ▼                ▼        │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐  │
│  │  Failed  │     │ Expired  │     │  Error   │     │ Validated│  │
│  └──────────┘     └──────────┘     └──────────┘     └──────────┘  │
│       │                │                │                │        │
│       ▼                ▼                ▼                ▼        │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                           Removed                           │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
Key transitions: a connection moves from Creating to Available once established (or to Failed if establishment fails), from Available to In Use when acquired, and from In Use back through Released when returned. Connections that expire, error out, or exceed their maximum lifetime are eventually Removed from the pool.
Correctly sizing connection pools is both art and science. Pool too small, and requests queue waiting for connections. Pool too large, and you waste resources and potentially overwhelm downstream services.
Key Pool Parameters:
| Parameter | Description | Typical Values |
|---|---|---|
| Min Size | Minimum connections to maintain | 1-10 |
| Max Size | Maximum connections allowed | 10-100 |
| Idle Timeout | How long idle connections live | 30s-5min |
| Connection Timeout | Max time to establish connection | 5-30s |
| Acquire Timeout | Max time to wait for available connection | 1-10s |
| Validation Interval | How often to health-check idle connections | 30s-60s |
| Max Lifetime | Maximum age of a connection | 30min-1hr |
Pool Size Formula:
A starting point for pool sizing:
Minimal Pool Size = Average_Concurrent_Requests_In_Flight
Safe Pool Size = Peak_Concurrent_Requests × Safety_Factor
Where:
Concurrent_Requests = Request_Rate × Average_Request_Duration
Safety_Factor = 1.5 - 2.0
Example:
Request rate: 100 requests/second
Average request duration: 50ms = 0.05s
Concurrent requests: 100 × 0.05 = 5
Safe pool size: 5 × 2 = 10 connections
But consider variability: traffic bursts, latency spikes, and slow downstream responses all push peak concurrency above the average, which is what the safety factor is there to absorb.
Practical recommendation: Start with 10-20 connections per downstream service, monitor queue depth and wait times, adjust based on observed behavior.
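As a sketch of the formula above (function name and defaults are mine, not a standard API):

```typescript
// Little's Law: concurrent requests ≈ request rate × average request duration.
function recommendedPoolSize(
  peakRequestsPerSecond: number,
  avgRequestDurationMs: number,
  safetyFactor = 2.0,
): number {
  const concurrent = peakRequestsPerSecond * (avgRequestDurationMs / 1000);
  return Math.ceil(concurrent * safetyFactor);
}

// The example above: 100 req/s × 50ms → 5 concurrent → pool of 10
console.log(recommendedPoolSize(100, 50)); // 10
```

The configuration sketch below shows how numbers like these feed into full pool settings for different traffic profiles.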
```typescript
// Production-grade connection pool configuration

interface ConnectionPoolConfig {
  // Pool sizing
  minSize: number;               // Minimum connections to maintain
  maxSize: number;               // Maximum connections allowed

  // Timeouts
  connectionTimeoutMs: number;   // Max time to establish new connection
  acquireTimeoutMs: number;      // Max time to wait for available connection
  idleTimeoutMs: number;         // Close idle connections after this duration
  maxLifetimeMs: number;         // Max age of any connection

  // Health checking
  validationIntervalMs: number;  // How often to validate idle connections
  validationQuery?: string;      // Query/request for health check
  validationTimeoutMs: number;   // Max time for health check

  // Behavior
  waitForConnections: boolean;   // Queue if pool empty, or fail fast
  evictionRunIntervalMs: number; // How often to check for stale connections
}

// Configuration for different scenarios
const configs = {
  // High-throughput API to frequently-called service
  highThroughput: {
    minSize: 10,
    maxSize: 50,
    connectionTimeoutMs: 5000,
    acquireTimeoutMs: 3000,
    idleTimeoutMs: 120000,        // 2 minutes
    maxLifetimeMs: 1800000,       // 30 minutes
    validationIntervalMs: 30000,
    validationTimeoutMs: 5000,
    waitForConnections: true,
    evictionRunIntervalMs: 60000,
  } as ConnectionPoolConfig,

  // Infrequently-called service (e.g., nightly batch)
  lowFrequency: {
    minSize: 1,
    maxSize: 5,
    connectionTimeoutMs: 10000,
    acquireTimeoutMs: 10000,
    idleTimeoutMs: 30000,         // 30 seconds
    maxLifetimeMs: 300000,        // 5 minutes
    validationIntervalMs: 60000,
    validationTimeoutMs: 5000,
    waitForConnections: true,
    evictionRunIntervalMs: 30000,
  } as ConnectionPoolConfig,

  // Latency-critical service (fail fast)
  latencySensitive: {
    minSize: 20,                  // More connections to avoid waiting
    maxSize: 100,
    connectionTimeoutMs: 1000,    // Fail fast on slow connections
    acquireTimeoutMs: 100,        // Very short wait
    idleTimeoutMs: 60000,
    maxLifetimeMs: 600000,        // 10 minutes
    validationIntervalMs: 10000,
    validationTimeoutMs: 1000,
    waitForConnections: false,    // Fail immediately if no connections
    evictionRunIntervalMs: 10000,
  } as ConnectionPoolConfig,
};

// Adaptive pool that adjusts based on load
class AdaptiveConnectionPool {
  private currentSize: number;
  private config: ConnectionPoolConfig;
  private metrics: PoolMetrics;

  constructor(config: ConnectionPoolConfig) {
    this.config = config;
    this.currentSize = config.minSize;
    this.metrics = new PoolMetrics();

    // Start adaptive sizing loop
    setInterval(() => this.adjustPoolSize(), 10000);
  }

  private adjustPoolSize(): void {
    const avgWaitTime = this.metrics.getAverageAcquireWaitTime();
    const utilizationRate = this.metrics.getUtilizationRate();

    // Scale up if waiting too long or utilization too high
    if (avgWaitTime > 10 || utilizationRate > 0.8) {
      const newSize = Math.min(
        this.currentSize * 1.5,
        this.config.maxSize
      );
      this.scaleToSize(Math.ceil(newSize));
    }

    // Scale down if utilization consistently low
    if (utilizationRate < 0.2 && this.currentSize > this.config.minSize) {
      const newSize = Math.max(
        this.currentSize * 0.7,
        this.config.minSize
      );
      this.scaleToSize(Math.floor(newSize));
    }
  }

  private scaleToSize(targetSize: number): void {
    console.log(`Adjusting pool size: ${this.currentSize} → ${targetSize}`);
    this.currentSize = targetSize;
    // Actual connection creation/destruction logic here
  }
}
```

HTTP Keep-Alive (persistent connections) allows multiple HTTP requests over a single TCP connection. This is now the default behavior in HTTP/1.1 and is fundamental to HTTP/2 and HTTP/3.
Keep-Alive Headers:
HTTP/1.1 Response Headers:
Connection: keep-alive
Keep-Alive: timeout=60, max=100
Meaning:
timeout=60 → Server will close idle connection after 60 seconds
max=100 → Server will close connection after 100 requests
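On the client side, Node.js reuses connections through a keep-alive agent; a minimal sketch (URL and sizing values are illustrative):

```typescript
import https from 'node:https';

// Reuse idle sockets across requests instead of opening a new TLS connection each time.
const agent = new https.Agent({
  keepAlive: true,      // keep idle sockets open for reuse
  maxSockets: 50,       // cap concurrent sockets per host
  maxFreeSockets: 10,   // cap idle sockets retained in the agent
});

https.get('https://api.example.com/health', { agent }, (res) => {
  res.resume();         // drain the body so the socket can be returned for reuse
  console.log(`status=${res.statusCode}`);
});
```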
Server Configuration (Nginx):
# nginx.conf
# Keep-alive settings for client connections
keepalive_timeout 65s; # Close idle connections after 65s
keepalive_requests 1000; # Max requests per connection
# Keep-alive for upstream (proxy) connections
upstream backend {
server backend1:8080;
server backend2:8080;
keepalive 32; # Pool of 32 keep-alive connections
keepalive_timeout 60s; # Idle timeout for upstream connections
keepalive_requests 100; # Max requests per upstream connection
}
server {
location / {
proxy_pass http://backend;
# Required for keep-alive to upstream
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
TCP Keep-Alive vs HTTP Keep-Alive:
These are different mechanisms:
| Aspect | HTTP Keep-Alive | TCP Keep-Alive |
|---|---|---|
| Layer | Application (HTTP) | Transport (TCP) |
| Purpose | Connection reuse | Dead connection detection |
| Mechanism | Connection header | TCP probe packets |
| Default | On (HTTP/1.1+) | Off (OS-specific) |
| Timing | Seconds-minutes | Minutes-hours |
TCP Keep-Alive for Dead Connection Detection:
TCP keep-alive sends probe packets to detect broken connections (e.g., network partition, crashed remote host):
Linux TCP keep-alive parameters (sysctl):
net.ipv4.tcp_keepalive_time = 7200 # Seconds before first probe (2 hours!)
net.ipv4.tcp_keepalive_intvl = 75 # Seconds between probes
net.ipv4.tcp_keepalive_probes = 9 # Number of failed probes before closing
Total time to detect dead connection: 7200 + (75 × 9) = 7875 seconds = 2.2 hours!
For production systems, configure more aggressive TCP keep-alive:
# Faster dead connection detection
net.ipv4.tcp_keepalive_time = 60 # First probe after 60 seconds
net.ipv4.tcp_keepalive_intvl = 10 # Probe every 10 seconds
net.ipv4.tcp_keepalive_probes = 6 # Give up after 6 failed probes
Total: 60 + (10 × 6) = 120 seconds = 2 minutes to detect dead connection
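Applications can also opt in per socket rather than relying only on system-wide sysctl values. A minimal Node.js sketch (host and delay are illustrative; probe interval and count still come from the kernel settings above):

```typescript
import net from 'node:net';

const socket = net.connect({ host: 'backend.internal', port: 8080 }, () => {
  // Enable TCP keep-alive with the first probe after 60s of inactivity.
  socket.setKeepAlive(true, 60_000);
});
```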
Application-Level Keep-Alive:
For greater control, implement application-level health checks:
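A minimal sketch of such a check, assuming a caller-supplied ping() that performs any cheap request the peer always answers (for example GET /health):

```typescript
type Ping = () => Promise<void>; // hypothetical: any cheap request/response

function startAppKeepAlive(ping: Ping, intervalMs = 15_000, onDead?: () => void) {
  const timer = setInterval(async () => {
    try {
      await ping();           // healthy: connection is alive and responsive
    } catch {
      clearInterval(timer);   // unhealthy: stop pinging...
      onDead?.();             // ...and let the pool replace the connection
    }
  }, intervalMs);
  return () => clearInterval(timer); // call to stop the keep-alive loop
}
```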
Load balancers often close idle connections before your client timeout. If your client expects 60s timeout but the LB closes at 30s, you'll get 'connection reset' errors. Always set client idle timeout SHORTER than intermediary timeouts: Client (25s) < LB (30s) < Server (60s).
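A quick way to encode that rule, with illustrative values standing in for your real client, load balancer, and server timeouts:

```typescript
const idleTimeouts = { clientMs: 25_000, loadBalancerMs: 30_000, serverMs: 60_000 };

// The client must retire idle connections before any intermediary silently closes them.
if (!(idleTimeouts.clientMs < idleTimeouts.loadBalancerMs &&
      idleTimeouts.loadBalancerMs < idleTimeouts.serverMs)) {
  throw new Error('Idle timeout hierarchy violated: expected client < LB < server');
}
```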
Pooled connections can become unhealthy due to server restarts and deployments, idle timeouts on intermediaries (load balancers, NAT devices), network partitions, server-side connection limits, and half-open connections where one side still believes the connection is alive.
Health Checking Strategies:
1. Test-on-Borrow:
Validate connection before each use:
Acquire connection from pool
↓
Run validation query/request
↓
If success → return connection to caller
If failure → destroy connection, try another
Pros: Guaranteed healthy connection
Cons: Adds latency to every request
2. Test-on-Return:
Validate connection when returned to pool:
Request completes
↓
Return connection to pool
↓
Run validation query/request
↓
If success → mark available
If failure → destroy connection
Pros: Validation off critical path
Cons: Unhealthy connection might be borrowed before test completes
3. Background Validation:
Periodically validate idle connections:
Every N seconds:
For each idle connection:
If idle > validation_interval:
Run validation
If failure → destroy
Pros: No latency impact, catches stale connections
Cons: Unhealthy connection might be borrowed between checks
4. Hybrid (Recommended):
Combine strategies: run background validation for idle connections, validate on borrow only when a connection has been idle longer than a staleness threshold, and evict connections that accumulate consecutive errors. The implementation below follows this hybrid approach.
```typescript
// Robust connection health checking

interface PooledConnection {
  id: string;
  createdAt: Date;
  lastUsedAt: Date;
  lastValidatedAt: Date;
  errorCount: number;
  isHealthy: boolean;
  inUse: boolean;                 // Whether the connection is currently borrowed
}

interface HealthCheckConfig {
  validationIntervalMs: number;   // How often to check idle connections
  validationTimeoutMs: number;    // Max time for validation
  maxConsecutiveErrors: number;   // Errors before marking unhealthy
  staleThresholdMs: number;       // Idle time before checking on borrow
}

class ConnectionHealthChecker {
  private config: HealthCheckConfig;

  constructor(config: HealthCheckConfig) {
    this.config = config;
  }

  // Validate connection with timeout
  async validateConnection(
    conn: PooledConnection,
    validateFn: () => Promise<boolean>
  ): Promise<boolean> {
    return new Promise(async (resolve) => {
      const timeout = setTimeout(() => {
        console.warn(`Connection ${conn.id} validation timeout`);
        resolve(false);
      }, this.config.validationTimeoutMs);

      try {
        const isValid = await validateFn();
        clearTimeout(timeout);

        conn.lastValidatedAt = new Date();
        conn.isHealthy = isValid;

        if (isValid) {
          conn.errorCount = 0;
        } else {
          conn.errorCount++;
        }

        resolve(isValid);
      } catch (error) {
        clearTimeout(timeout);
        conn.errorCount++;
        conn.isHealthy = conn.errorCount < this.config.maxConsecutiveErrors;
        resolve(false);
      }
    });
  }

  // Check if validation needed before borrow
  needsValidationOnBorrow(conn: PooledConnection): boolean {
    const idleTime = Date.now() - conn.lastUsedAt.getTime();
    return idleTime > this.config.staleThresholdMs;
  }

  // Background validation for idle connections
  async validateIdleConnections(
    connections: PooledConnection[],
    validateFn: (conn: PooledConnection) => Promise<boolean>
  ): Promise<PooledConnection[]> {
    const unhealthy: PooledConnection[] = [];

    for (const conn of connections) {
      const timeSinceValidation = Date.now() - conn.lastValidatedAt.getTime();

      if (timeSinceValidation > this.config.validationIntervalMs) {
        const isHealthy = await this.validateConnection(
          conn,
          () => validateFn(conn)
        );

        if (!isHealthy) {
          unhealthy.push(conn);
        }
      }
    }

    return unhealthy;
  }
}

// HTTP-specific health check
async function httpHealthCheck(
  connection: any, // HTTP connection object
  healthEndpoint: string = '/health'
): Promise<boolean> {
  try {
    const response = await connection.request({
      method: 'GET',
      path: healthEndpoint,
      headers: {
        'User-Agent': 'HealthCheck/1.0',
      },
    });

    // Consider 2xx and 3xx as healthy
    return response.statusCode >= 200 && response.statusCode < 400;
  } catch (error) {
    return false;
  }
}

// Database-specific health check
async function databaseHealthCheck(connection: any): Promise<boolean> {
  try {
    // Simple query that should always succeed
    const result = await connection.query('SELECT 1');
    return result.rows.length > 0;
  } catch (error) {
    return false;
  }
}

// Usage in connection pool
class HealthAwareConnectionPool {
  private connections: Map<string, PooledConnection> = new Map();
  private healthChecker: ConnectionHealthChecker;
  private healthCheckInterval: NodeJS.Timer;

  constructor(config: HealthCheckConfig) {
    this.healthChecker = new ConnectionHealthChecker(config);

    // Start background health checking
    this.healthCheckInterval = setInterval(
      () => this.runBackgroundHealthChecks(),
      config.validationIntervalMs
    );
  }

  async acquire(): Promise<PooledConnection> {
    for (const [id, conn] of this.connections) {
      if (conn.isHealthy && !conn.inUse) {
        // Check if stale connection needs validation
        if (this.healthChecker.needsValidationOnBorrow(conn)) {
          const isValid = await this.healthChecker.validateConnection(
            conn,
            () => httpHealthCheck(conn)
          );
          if (!isValid) {
            this.removeConnection(id);
            continue;
          }
        }

        conn.inUse = true;
        conn.lastUsedAt = new Date();
        return conn;
      }
    }

    // No available connections, create new one
    return this.createConnection();
  }

  private async runBackgroundHealthChecks(): Promise<void> {
    const idleConnections = Array.from(this.connections.values())
      .filter(c => !c.inUse);

    const unhealthy = await this.healthChecker.validateIdleConnections(
      idleConnections,
      (conn) => httpHealthCheck(conn)
    );

    // Remove unhealthy connections
    for (const conn of unhealthy) {
      this.removeConnection(conn.id);
    }
  }
}
```

Connection failures are inevitable in distributed systems. Robust failure handling is essential for reliability.
Types of Connection Failures:
| Failure Type | Manifestation | Recovery Strategy |
|---|---|---|
| Connection refused | ECONNREFUSED | Retry with backoff, try different host |
| Connection timeout | ETIMEDOUT | Retry with backoff, check network |
| Connection reset | ECONNRESET | Destroy connection, retry |
| DNS failure | ENOTFOUND | Retry, check DNS resolver |
| TLS error | Various | Check certificates, retry |
| Read timeout | Application-level | Retry (if idempotent), log |
| Half-open connection | Data corruption | Detect via health check, destroy |
Error Classification:
┌───────────────────────────────────────────────────────────────────┐
│                        Error Classification                       │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  RETRYABLE (Transient)           NON-RETRYABLE (Permanent)        │
│  ─────────────────────           ─────────────────────────        │
│  • Connection timeout            • 4xx client errors              │
│  • Connection reset              • Invalid request                │
│  • 503 Service Unavailable       • Authentication failure         │
│  • 502 Bad Gateway               • Invalid data                   │
│  • 504 Gateway Timeout           • 404 Not Found                  │
│  • Network unreachable           • 400 Bad Request                │
│  • DNS temporary failure         • SSL/TLS errors (invalid cert)  │
│                                                                   │
│  RETRY DECISION:                                                  │
│    If RETRYABLE && (retries < max) && (within deadline)           │
│      → Apply backoff, retry                                       │
│    Else                                                           │
│      → Return error to caller                                     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```typescript
// Comprehensive connection failure handling

enum ErrorCategory {
  TRANSIENT_NETWORK,    // Connection timeout, reset, etc.
  TRANSIENT_SERVER,     // 502, 503, 504
  CLIENT_ERROR,         // 4xx errors
  CONFIGURATION_ERROR,  // TLS, auth setup issues
  UNKNOWN,
}

interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  jitterFactor: number;  // 0-1, random variance
}

function categorizeError(error: Error | Response): ErrorCategory {
  // Network-level errors
  if (error instanceof Error) {
    const code = (error as NodeJS.ErrnoException).code;

    const transientNetworkCodes = [
      'ECONNRESET',
      'ECONNREFUSED',
      'ETIMEDOUT',
      'ENOTFOUND',       // DNS failure (might be transient)
      'EPIPE',
      'EHOSTUNREACH',
      'ENETUNREACH',
    ];

    if (code && transientNetworkCodes.includes(code)) {
      return ErrorCategory.TRANSIENT_NETWORK;
    }

    // TLS errors
    if (error.message.includes('certificate') ||
        error.message.includes('SSL') ||
        error.message.includes('TLS')) {
      return ErrorCategory.CONFIGURATION_ERROR;
    }
  }

  // HTTP response errors
  if ('status' in error) {
    const status = (error as Response).status;

    if (status >= 500 && status < 600) {
      // Most 5xx are transient, but not all
      if ([502, 503, 504].includes(status)) {
        return ErrorCategory.TRANSIENT_SERVER;
      }
    }

    if (status >= 400 && status < 500) {
      return ErrorCategory.CLIENT_ERROR;
    }
  }

  return ErrorCategory.UNKNOWN;
}

function isRetryable(error: Error | Response): boolean {
  const category = categorizeError(error);
  return category === ErrorCategory.TRANSIENT_NETWORK ||
         category === ErrorCategory.TRANSIENT_SERVER;
}

function calculateBackoff(
  attempt: number,
  config: RetryConfig
): number {
  // Exponential backoff with jitter
  const exponentialDelay =
    config.initialDelayMs * Math.pow(config.backoffMultiplier, attempt);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

  // Add jitter to prevent thundering herd
  const jitter = cappedDelay * config.jitterFactor * Math.random();

  return cappedDelay + jitter;
}

async function executeWithRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  onRetry?: (attempt: number, error: Error, nextDelayMs: number) => void
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (!isRetryable(error as Error) || attempt >= config.maxRetries) {
        throw lastError;
      }

      const delayMs = calculateBackoff(attempt, config);

      if (onRetry) {
        onRetry(attempt + 1, lastError, delayMs);
      }

      await sleep(delayMs);
    }
  }

  throw lastError!;
}

// Connection pool with failure handling
class ResilientConnectionPool {
  private pool: ConnectionPool;
  private retryConfig: RetryConfig;
  private failureTracker: Map<string, number> = new Map();

  async executeRequest<T>(
    endpoint: string,
    request: () => Promise<T>
  ): Promise<T> {
    return executeWithRetry(
      async () => {
        const connection = await this.pool.acquire();

        try {
          const result = await request();
          this.recordSuccess(endpoint);
          return result;
        } catch (error) {
          this.recordFailure(endpoint, error as Error);

          // Return connection or destroy based on error
          if (isConnectionBroken(error as Error)) {
            this.pool.destroy(connection);
          } else {
            this.pool.release(connection);
          }

          throw error;
        }
      },
      this.retryConfig,
      (attempt, error, delay) => {
        console.log(
          `Retry ${attempt}/${this.retryConfig.maxRetries} for ${endpoint}`,
          `after ${delay}ms due to: ${error.message}`
        );
      }
    );
  }

  private recordFailure(endpoint: string, error: Error): void {
    const current = this.failureTracker.get(endpoint) || 0;
    this.failureTracker.set(endpoint, current + 1);

    // Could trigger circuit breaker here
    if (current + 1 >= 5) {
      console.warn(`High failure rate for ${endpoint}`);
    }
  }

  private recordSuccess(endpoint: string): void {
    this.failureTracker.delete(endpoint);
  }
}

function isConnectionBroken(error: Error): boolean {
  const code = (error as NodeJS.ErrnoException).code;
  return ['ECONNRESET', 'EPIPE', 'ETIMEDOUT'].includes(code || '');
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Connections consume resources on both client and server. Without limits, a misbehaving client or traffic spike can exhaust resources and cause system failure.
Client-Side Limits:
| Resource | Limit | Symptom When Exhausted |
|---|---|---|
| Ephemeral ports | ~64K per destination IP | EADDRNOTAVAIL |
| File descriptors | ulimit -n | EMFILE, ENFILE |
| Memory (buffers) | Physical RAM | OOM, swap |
| Thread pool | Pool size | Request queuing |
Server-Side Limits:
| Resource | Limit | Symptom When Exhausted |
|---|---|---|
| File descriptors | ulimit -n | Cannot accept connections |
| Listen backlog | somaxconn | Connection drops |
| Worker threads/processes | Configuration | Slow response, queuing |
| Memory | Physical RAM | OOM killer, swap |
| Connection slots | Max connections | Connection refused |
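Either side can watch its own consumption. A Linux-only sketch that counts the current process's open file descriptors (sockets included) via /proc:

```typescript
import { readdir } from 'node:fs/promises';

// Each entry in /proc/self/fd is one open file descriptor for this process.
async function openFdCount(): Promise<number> {
  const fds = await readdir('/proc/self/fd');
  return fds.length;
}

openFdCount().then(n => console.log(`open file descriptors: ${n}`));
```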
Linux Kernel Tuning for High Connection Loads:
# Increase file descriptor limits
ulimit -n 100000
# Or persist in /etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
# Kernel parameters (/etc/sysctl.conf)
# Increase system-wide file descriptor limit
fs.file-max = 1000000
# Increase socket buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase connection queue
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Allow reuse of sockets in TIME_WAIT for new outbound connections
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# Increase ephemeral port range
net.ipv4.ip_local_port_range = 1024 65535
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
Connection Rate Limiting:
Protect against connection storms with rate limiting:
# Nginx rate limiting for connections
limit_conn_zone $binary_remote_addr zone=addr:10m;
limit_conn addr 10; # Max 10 connections per IP
limit_conn_status 429; # Return 429 when exceeded
limit_req_zone $binary_remote_addr zone=req_limit:10m rate=10r/s;
limit_req zone=req_limit burst=20 nodelay;
Scenario: Server restarts, 10,000 clients immediately try to reconnect. Without limits: All connections accepted → server overwhelmed → fails again → clients retry → infinite loop. With limits: Accept what can be handled, reject rest with 503, clients back off exponentially → graceful recovery.
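On the client side, the "back off exponentially" part can be as simple as the sketch below, which mirrors calculateBackoff() above; connect() is a placeholder for whatever reconnect call your client uses:

```typescript
async function reconnectWithBackoff(
  connect: () => Promise<void>,  // placeholder for your client's reconnect call
  maxDelayMs = 30_000,
): Promise<void> {
  let delayMs = 1_000;
  for (;;) {
    try {
      await connect();
      return;                                        // reconnected successfully
    } catch {
      const jittered = Math.random() * delayMs;      // full jitter spreads the herd
      await new Promise(r => setTimeout(r, jittered));
      delayMs = Math.min(delayMs * 2, maxDelayMs);   // cap the exponential growth
    }
  }
}
```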
Effective connection management requires visibility into pool behavior. Key metrics to monitor:
Essential Metrics:
| Metric | What It Shows | Alert Threshold |
|---|---|---|
| Pool size | Current connection count | > 80% of max |
| Available connections | Idle connections ready | < 2 connections |
| Wait time | Time to acquire connection | > 100ms |
| Wait queue depth | Requests waiting for connection | > 0 sustained |
| Connection errors | Failed connection attempts | Rate spike |
| Connection churn | Connections opened/closed rate | High churn |
| Stale connections | Connections failing validation | > 0 sustained |
| Connection age | Average lifetime | Too short indicates churn |
```typescript
// Connection pool metrics collection

interface PoolMetrics {
  // Current state
  totalConnections: number;
  availableConnections: number;
  inUseConnections: number;
  waitingRequests: number;

  // Timing
  averageAcquireTimeMs: number;
  maxAcquireTimeMs: number;
  p99AcquireTimeMs: number;

  // Errors
  connectionFailures: number;
  validationFailures: number;
  timeoutErrors: number;

  // Churn
  connectionsCreated: number;
  connectionsDestroyed: number;
  connectionReuses: number;
}

class MetricsCollector {
  private acquireTimes: number[] = [];
  private counters: Map<string, number> = new Map();
  private gauges: Map<string, number> = new Map();

  recordAcquireTime(durationMs: number): void {
    this.acquireTimes.push(durationMs);

    // Keep last 1000 samples
    if (this.acquireTimes.length > 1000) {
      this.acquireTimes.shift();
    }
  }

  incrementCounter(name: string, value: number = 1): void {
    const current = this.counters.get(name) || 0;
    this.counters.set(name, current + value);
  }

  setGauge(name: string, value: number): void {
    this.gauges.set(name, value);
  }

  getMetrics(): PoolMetrics {
    const sorted = [...this.acquireTimes].sort((a, b) => a - b);

    return {
      totalConnections: this.gauges.get('total_connections') || 0,
      availableConnections: this.gauges.get('available_connections') || 0,
      inUseConnections: this.gauges.get('in_use_connections') || 0,
      waitingRequests: this.gauges.get('waiting_requests') || 0,

      averageAcquireTimeMs: this.average(sorted),
      maxAcquireTimeMs: sorted[sorted.length - 1] || 0,
      p99AcquireTimeMs: this.percentile(sorted, 0.99),

      connectionFailures: this.counters.get('connection_failures') || 0,
      validationFailures: this.counters.get('validation_failures') || 0,
      timeoutErrors: this.counters.get('timeout_errors') || 0,

      connectionsCreated: this.counters.get('connections_created') || 0,
      connectionsDestroyed: this.counters.get('connections_destroyed') || 0,
      connectionReuses: this.counters.get('connection_reuses') || 0,
    };
  }

  private average(sorted: number[]): number {
    if (sorted.length === 0) return 0;
    return sorted.reduce((a, b) => a + b, 0) / sorted.length;
  }

  private percentile(sorted: number[], p: number): number {
    if (sorted.length === 0) return 0;
    const index = Math.ceil(sorted.length * p) - 1;
    return sorted[Math.max(0, index)];
  }

  // Export for Prometheus/Grafana
  toPrometheusFormat(): string {
    const metrics = this.getMetrics();

    return `
# HELP connection_pool_size Current number of connections
# TYPE connection_pool_size gauge
connection_pool_size{state="total"} ${metrics.totalConnections}
connection_pool_size{state="available"} ${metrics.availableConnections}
connection_pool_size{state="in_use"} ${metrics.inUseConnections}

# HELP connection_pool_wait_queue Requests waiting for connection
# TYPE connection_pool_wait_queue gauge
connection_pool_wait_queue ${metrics.waitingRequests}

# HELP connection_acquire_time_ms Time to acquire connection
# TYPE connection_acquire_time_ms summary
connection_acquire_time_ms{quantile="0.5"} ${this.percentile([...this.acquireTimes].sort((a, b) => a - b), 0.5)}
connection_acquire_time_ms{quantile="0.99"} ${metrics.p99AcquireTimeMs}

# HELP connection_failures_total Total connection failures
# TYPE connection_failures_total counter
connection_failures_total ${metrics.connectionFailures}
`.trim();
  }
}

// Alert conditions
function checkAlerts(metrics: PoolMetrics): string[] {
  const alerts: string[] = [];

  // Pool exhaustion
  const utilizationPct =
    metrics.inUseConnections / metrics.totalConnections * 100;
  if (utilizationPct > 80) {
    alerts.push(`High pool utilization: ${utilizationPct.toFixed(1)}%`);
  }

  // Wait queue
  if (metrics.waitingRequests > 0) {
    alerts.push(`Requests waiting for connections: ${metrics.waitingRequests}`);
  }

  // Slow acquisition
  if (metrics.p99AcquireTimeMs > 100) {
    alerts.push(`High p99 acquire time: ${metrics.p99AcquireTimeMs}ms`);
  }

  // Connection churn
  const churnRate =
    (metrics.connectionsCreated + metrics.connectionsDestroyed) /
    Math.max(metrics.connectionReuses, 1);
  if (churnRate > 0.1) {
    alerts.push(`High connection churn: ${(churnRate * 100).toFixed(1)}%`);
  }

  return alerts;
}
```

Connection management is a foundational skill for building performant, reliable distributed systems. Proper connection handling can mean the difference between a system that scales gracefully and one that collapses under load.
Module Complete:
You have now completed Module 1: Request-Response Pattern. You understand how request-response communication works on the wire, where its latency comes from, how connections are established, and how to manage those connections for performance and reliability.
With this foundation, you're prepared to explore the remaining modules in this chapter: Service Discovery, Circuit Breaker Pattern, Retry Strategies, Timeout Patterns, and Bulkhead Pattern—each building on the request-response fundamentals covered here.
You have mastered the Request-Response Pattern module. You now possess comprehensive knowledge of synchronous communication—from protocol mechanics to production-grade connection management. This knowledge forms the foundation for understanding more advanced patterns like circuit breakers, retries, and bulkheads.