On the previous page, we learned that connection establishment (DNS + TCP + TLS) can consume 75% or more of a request's latency on cold paths. This reveals a fundamental truth: how you manage connections is as important as what you send over them.
Connection management—the art of establishing, reusing, monitoring, and retiring network connections—is one of the most impactful aspects of distributed systems engineering. Poor connection management manifests as inflated latency on cold paths, exhausted ephemeral ports and file descriptors, TIME_WAIT buildup, and connection storms that overwhelm downstream services.
Conversely, well-tuned connection management enables systems to handle high throughput with minimal latency overhead. This page provides comprehensive coverage of connection management strategies, from basic pooling to advanced failure handling.
By the end of this page, you will understand connection pooling mechanics, know how to size and configure connection pools, implement health checking for pooled connections, handle connection failures gracefully, and optimize keep-alive settings for different scenarios.
Connection pooling is the practice of maintaining a cache of database/service connections that can be reused across requests, rather than creating a new connection for each request.
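For a concrete picture, here is a minimal sketch using the node-postgres (pg) Pool; the host, credentials, and table are illustrative placeholders:

```typescript
import { Pool } from 'pg';

// One pool per process, created at startup and shared by all requests.
// Connection details below are placeholders.
const pool = new Pool({
  host: 'db.internal',
  user: 'app',
  password: process.env.DB_PASSWORD,
  database: 'orders',
  max: 10,                         // upper bound on open connections
  idleTimeoutMillis: 30_000,       // close connections idle longer than 30s
  connectionTimeoutMillis: 5_000,  // fail if establishing a connection takes > 5s
});

// Each query borrows a connection from the pool and returns it when done.
export async function getOrder(id: string) {
  const { rows } = await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
  return rows[0];
}
```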
The Cost of Connection Per Request:
Without pooling, each request pays the full connection establishment cost:
Request 1: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
Request 2: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
Request 3: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms
...
100 requests = 23,000ms = 23 seconds total, 20 seconds of which is connection overhead alone!
With Connection Pooling:
Request 1: [DNS 50ms][TCP 50ms][TLS 100ms][Request 30ms] = 230ms (establishes pool)
Request 2: [Request 30ms] = 30ms (reuses connection)
Request 3: [Request 30ms] = 30ms (reuses connection)
...
100 requests = 230ms + (99 × 30ms) = 3,200ms = 3.2 seconds
Savings: 86% latency reduction!
Resource Benefits:
Beyond latency, pooling provides critical resource benefits:
| Resource | Per-Request Connections | Pooled Connections |
|---|---|---|
| Server file descriptors | New FD per request | Fixed pool of FDs |
| Server memory | Buffer per connection | Bounded buffers |
| Client ephemeral ports | Port per request | Port per pool entry |
| TLS session overhead | Full handshake each time | Session reuse |
| DNS queries | Per request (if no cache) | Once per connection |
| TCP TIME_WAIT | Accumulates rapidly | Connections rarely close |
At 1,000 requests/second without pooling: 1,000 TCP connections opened/second, 60,000 connections in TIME_WAIT state (assuming 60s TIME_WAIT), port exhaustion, file descriptor exhaustion, and potential connection storms that overwhelm servers.
Connection State Management:
Connection pools must manage connection lifecycle:
┌───────────────────────────────────────────────────────────────────┐
│                   Connection Pool State Machine                   │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐  │
│  │ Creating │────▶│ Available│────▶│  In Use  │────▶│ Released │  │
│  └──────────┘     └──────────┘     └──────────┘     └──────────┘  │
│       │                │                │                │        │
│       ▼                ▼                ▼                ▼        │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐  │
│  │  Failed  │     │ Expired  │     │  Error   │     │ Validated│  │
│  └──────────┘     └──────────┘     └──────────┘     └──────────┘  │
│       │                │                │                │        │
│       ▼                ▼                ▼                ▼        │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │                           Removed                           │  │
│  └─────────────────────────────────────────────────────────────┘  │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
Key transitions: a connection moves from Creating to Available once established (or to Failed if establishment fails), from Available to In Use when acquired, and from In Use back through Released when returned. Connections that expire, error out, or exceed their maximum lifetime are eventually Removed from the pool.
Correctly sizing connection pools is both art and science. Pool too small, and requests queue waiting for connections. Pool too large, and you waste resources and potentially overwhelm downstream services.
Key Pool Parameters:
| Parameter | Description | Typical Values |
|---|---|---|
| Min Size | Minimum connections to maintain | 1-10 |
| Max Size | Maximum connections allowed | 10-100 |
| Idle Timeout | How long idle connections live | 30s-5min |
| Connection Timeout | Max time to establish connection | 5-30s |
| Acquire Timeout | Max time to wait for available connection | 1-10s |
| Validation Interval | How often to health-check idle connections | 30s-60s |
| Max Lifetime | Maximum age of a connection | 30min-1hr |
Pool Size Formula:
A starting point for pool sizing:
Minimal Pool Size = Average_Concurrent_Requests_In_Flight
Safe Pool Size = Peak_Concurrent_Requests × Safety_Factor
Where:
Concurrent_Requests = Request_Rate × Average_Request_Duration
Safety_Factor = 1.5 - 2.0
Example:
Request rate: 100 requests/second
Average request duration: 50ms = 0.05s
Concurrent requests: 100 × 0.05 = 5
Safe pool size: 5 × 2 = 10 connections
But consider variability: traffic bursts, latency spikes, and slow downstream responses all push peak concurrency above the average, which is what the safety factor is there to absorb.
Practical recommendation: Start with 10-20 connections per downstream service, monitor queue depth and wait times, adjust based on observed behavior.
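As a sketch of the formula above (function name and defaults are mine, not a standard API):

```typescript
// Little's Law: concurrent requests ≈ request rate × average request duration.
function recommendedPoolSize(
  peakRequestsPerSecond: number,
  avgRequestDurationMs: number,
  safetyFactor = 2.0,
): number {
  const concurrent = peakRequestsPerSecond * (avgRequestDurationMs / 1000);
  return Math.ceil(concurrent * safetyFactor);
}

// The example above: 100 req/s × 50ms → 5 concurrent → pool of 10
console.log(recommendedPoolSize(100, 50)); // 10
```

The configuration sketch below shows how numbers like these feed into full pool settings for different traffic profiles.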
```typescript
// Production-grade connection pool configuration

interface ConnectionPoolConfig {
  // Pool sizing
  minSize: number;               // Minimum connections to maintain
  maxSize: number;               // Maximum connections allowed

  // Timeouts
  connectionTimeoutMs: number;   // Max time to establish new connection
  acquireTimeoutMs: number;      // Max time to wait for available connection
  idleTimeoutMs: number;         // Close idle connections after this duration
  maxLifetimeMs: number;         // Max age of any connection

  // Health checking
  validationIntervalMs: number;  // How often to validate idle connections
  validationQuery?: string;      // Query/request for health check
  validationTimeoutMs: number;   // Max time for health check

  // Behavior
  waitForConnections: boolean;   // Queue if pool empty, or fail fast
  evictionRunIntervalMs: number; // How often to check for stale connections
}

// Configuration for different scenarios
const configs = {
  // High-throughput API to frequently-called service
  highThroughput: {
    minSize: 10,
    maxSize: 50,
    connectionTimeoutMs: 5000,
    acquireTimeoutMs: 3000,
    idleTimeoutMs: 120000,        // 2 minutes
    maxLifetimeMs: 1800000,       // 30 minutes
    validationIntervalMs: 30000,
    validationTimeoutMs: 5000,
    waitForConnections: true,
    evictionRunIntervalMs: 60000,
  } as ConnectionPoolConfig,

  // Infrequently-called service (e.g., nightly batch)
  lowFrequency: {
    minSize: 1,
    maxSize: 5,
    connectionTimeoutMs: 10000,
    acquireTimeoutMs: 10000,
    idleTimeoutMs: 30000,         // 30 seconds
    maxLifetimeMs: 300000,        // 5 minutes
    validationIntervalMs: 60000,
    validationTimeoutMs: 5000,
    waitForConnections: true,
    evictionRunIntervalMs: 30000,
  } as ConnectionPoolConfig,

  // Latency-critical service (fail fast)
  latencySensitive: {
    minSize: 20,                  // More connections to avoid waiting
    maxSize: 100,
    connectionTimeoutMs: 1000,    // Fail fast on slow connections
    acquireTimeoutMs: 100,        // Very short wait
    idleTimeoutMs: 60000,
    maxLifetimeMs: 600000,        // 10 minutes
    validationIntervalMs: 10000,
    validationTimeoutMs: 1000,
    waitForConnections: false,    // Fail immediately if no connections
    evictionRunIntervalMs: 10000,
  } as ConnectionPoolConfig,
};

// Adaptive pool that adjusts based on load
class AdaptiveConnectionPool {
  private currentSize: number;
  private config: ConnectionPoolConfig;
  private metrics: PoolMetrics;

  constructor(config: ConnectionPoolConfig) {
    this.config = config;
    this.currentSize = config.minSize;
    this.metrics = new PoolMetrics();

    // Start adaptive sizing loop
    setInterval(() => this.adjustPoolSize(), 10000);
  }

  private adjustPoolSize(): void {
    const avgWaitTime = this.metrics.getAverageAcquireWaitTime();
    const utilizationRate = this.metrics.getUtilizationRate();

    // Scale up if waiting too long or utilization too high
    if (avgWaitTime > 10 || utilizationRate > 0.8) {
      const newSize = Math.min(
        this.currentSize * 1.5,
        this.config.maxSize
      );
      this.scaleToSize(Math.ceil(newSize));
    }

    // Scale down if utilization consistently low
    if (utilizationRate < 0.2 && this.currentSize > this.config.minSize) {
      const newSize = Math.max(
        this.currentSize * 0.7,
        this.config.minSize
      );
      this.scaleToSize(Math.floor(newSize));
    }
  }

  private scaleToSize(targetSize: number): void {
    console.log(`Adjusting pool size: ${this.currentSize} → ${targetSize}`);
    this.currentSize = targetSize;
    // Actual connection creation/destruction logic here
  }
}
```

HTTP Keep-Alive (persistent connections) allows multiple HTTP requests over a single TCP connection. This is now the default behavior in HTTP/1.1 and is fundamental to HTTP/2 and HTTP/3.
Keep-Alive Headers:
HTTP/1.1 Response Headers:
Connection: keep-alive
Keep-Alive: timeout=60, max=100
Meaning:
timeout=60 → Server will close idle connection after 60 seconds
max=100 → Server will close connection after 100 requests
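On the client side, Node.js reuses connections through a keep-alive agent; a minimal sketch (URL and sizing values are illustrative):

```typescript
import https from 'node:https';

// Reuse idle sockets across requests instead of opening a new TLS connection each time.
const agent = new https.Agent({
  keepAlive: true,      // keep idle sockets open for reuse
  maxSockets: 50,       // cap concurrent sockets per host
  maxFreeSockets: 10,   // cap idle sockets retained in the agent
});

https.get('https://api.example.com/health', { agent }, (res) => {
  res.resume();         // drain the body so the socket can be returned for reuse
  console.log(`status=${res.statusCode}`);
});
```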
Server Configuration (Nginx):
# nginx.conf
# Keep-alive settings for client connections
keepalive_timeout 65s; # Close idle connections after 65s
keepalive_requests 1000; # Max requests per connection
# Keep-alive for upstream (proxy) connections
upstream backend {
server backend1:8080;
server backend2:8080;
keepalive 32; # Pool of 32 keep-alive connections
keepalive_timeout 60s; # Idle timeout for upstream connections
keepalive_requests 100; # Max requests per upstream connection
}
server {
location / {
proxy_pass http://backend;
# Required for keep-alive to upstream
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
TCP Keep-Alive vs HTTP Keep-Alive:
These are different mechanisms:
| Aspect | HTTP Keep-Alive | TCP Keep-Alive |
|---|---|---|
| Layer | Application (HTTP) | Transport (TCP) |
| Purpose | Connection reuse | Dead connection detection |
| Mechanism | Connection header | TCP probe packets |
| Default | On (HTTP/1.1+) | Off (OS-specific) |
| Timing | Seconds-minutes | Minutes-hours |
TCP Keep-Alive for Dead Connection Detection:
TCP keep-alive sends probe packets to detect broken connections (e.g., network partition, crashed remote host):
Linux TCP keep-alive parameters (sysctl):
net.ipv4.tcp_keepalive_time = 7200 # Seconds before first probe (2 hours!)
net.ipv4.tcp_keepalive_intvl = 75 # Seconds between probes
net.ipv4.tcp_keepalive_probes = 9 # Number of failed probes before closing
Total time to detect dead connection: 7200 + (75 × 9) = 7875 seconds = 2.2 hours!
For production systems, configure more aggressive TCP keep-alive:
# Faster dead connection detection
net.ipv4.tcp_keepalive_time = 60 # First probe after 60 seconds
net.ipv4.tcp_keepalive_intvl = 10 # Probe every 10 seconds
net.ipv4.tcp_keepalive_probes = 6 # Give up after 6 failed probes
Total: 60 + (10 × 6) = 120 seconds = 2 minutes to detect dead connection
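Applications can also opt in per socket rather than relying only on system-wide sysctl values. A minimal Node.js sketch (host and delay are illustrative; probe interval and count still come from the kernel settings above):

```typescript
import net from 'node:net';

const socket = net.connect({ host: 'backend.internal', port: 8080 }, () => {
  // Enable TCP keep-alive with the first probe after 60s of inactivity.
  socket.setKeepAlive(true, 60_000);
});
```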
Application-Level Keep-Alive:
For greater control, implement application-level health checks:
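A minimal sketch of such a check, assuming a caller-supplied ping() that performs any cheap request the peer always answers (for example GET /health):

```typescript
type Ping = () => Promise<void>; // hypothetical: any cheap request/response

function startAppKeepAlive(ping: Ping, intervalMs = 15_000, onDead?: () => void) {
  const timer = setInterval(async () => {
    try {
      await ping();           // healthy: connection is alive and responsive
    } catch {
      clearInterval(timer);   // unhealthy: stop pinging...
      onDead?.();             // ...and let the pool replace the connection
    }
  }, intervalMs);
  return () => clearInterval(timer); // call to stop the keep-alive loop
}
```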
Load balancers often close idle connections before your client timeout. If your client expects 60s timeout but the LB closes at 30s, you'll get 'connection reset' errors. Always set client idle timeout SHORTER than intermediary timeouts: Client (25s) < LB (30s) < Server (60s).
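A quick way to encode that rule, with illustrative values standing in for your real client, load balancer, and server timeouts:

```typescript
const idleTimeouts = { clientMs: 25_000, loadBalancerMs: 30_000, serverMs: 60_000 };

// The client must retire idle connections before any intermediary silently closes them.
if (!(idleTimeouts.clientMs < idleTimeouts.loadBalancerMs &&
      idleTimeouts.loadBalancerMs < idleTimeouts.serverMs)) {
  throw new Error('Idle timeout hierarchy violated: expected client < LB < server');
}
```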
Pooled connections can become unhealthy due to server restarts and deployments, idle timeouts on intermediaries (load balancers, NAT devices), network partitions, server-side connection limits, and half-open connections where one side still believes the connection is alive.
Health Checking Strategies:
1. Test-on-Borrow:
Validate connection before each use:
Acquire connection from pool
↓
Run validation query/request
↓
If success → return connection to caller
If failure → destroy connection, try another
Pros: Guaranteed healthy connection
Cons: Adds latency to every request
2. Test-on-Return:
Validate connection when returned to pool:
Request completes
↓
Return connection to pool
↓
Run validation query/request
↓
If success → mark available
If failure → destroy connection
Pros: Validation off critical path
Cons: Unhealthy connection might be borrowed before test completes
3. Background Validation:
Periodically validate idle connections:
Every N seconds:
For each idle connection:
If idle > validation_interval:
Run validation
If failure → destroy
Pros: No latency impact, catches stale connections
Cons: Unhealthy connection might be borrowed between checks
4. Hybrid (Recommended):
Combine strategies: run background validation for idle connections, validate on borrow only when a connection has been idle longer than a staleness threshold, and evict connections that accumulate consecutive errors. The implementation below follows this hybrid approach.
```typescript
// Robust connection health checking

interface PooledConnection {
  id: string;
  createdAt: Date;
  lastUsedAt: Date;
  lastValidatedAt: Date;
  errorCount: number;
  isHealthy: boolean;
  inUse: boolean;                 // Whether the connection is currently borrowed
}

interface HealthCheckConfig {
  validationIntervalMs: number;   // How often to check idle connections
  validationTimeoutMs: number;    // Max time for validation
  maxConsecutiveErrors: number;   // Errors before marking unhealthy
  staleThresholdMs: number;       // Idle time before checking on borrow
}

class ConnectionHealthChecker {
  private config: HealthCheckConfig;

  constructor(config: HealthCheckConfig) {
    this.config = config;
  }

  // Validate connection with timeout
  async validateConnection(
    conn: PooledConnection,
    validateFn: () => Promise<boolean>
  ): Promise<boolean> {
    return new Promise(async (resolve) => {
      const timeout = setTimeout(() => {
        console.warn(`Connection ${conn.id} validation timeout`);
        resolve(false);
      }, this.config.validationTimeoutMs);

      try {
        const isValid = await validateFn();
        clearTimeout(timeout);

        conn.lastValidatedAt = new Date();
        conn.isHealthy = isValid;

        if (isValid) {
          conn.errorCount = 0;
        } else {
          conn.errorCount++;
        }

        resolve(isValid);
      } catch (error) {
        clearTimeout(timeout);
        conn.errorCount++;
        conn.isHealthy = conn.errorCount < this.config.maxConsecutiveErrors;
        resolve(false);
      }
    });
  }

  // Check if validation needed before borrow
  needsValidationOnBorrow(conn: PooledConnection): boolean {
    const idleTime = Date.now() - conn.lastUsedAt.getTime();
    return idleTime > this.config.staleThresholdMs;
  }

  // Background validation for idle connections
  async validateIdleConnections(
    connections: PooledConnection[],
    validateFn: (conn: PooledConnection) => Promise<boolean>
  ): Promise<PooledConnection[]> {
    const unhealthy: PooledConnection[] = [];

    for (const conn of connections) {
      const timeSinceValidation = Date.now() - conn.lastValidatedAt.getTime();

      if (timeSinceValidation > this.config.validationIntervalMs) {
        const isHealthy = await this.validateConnection(
          conn,
          () => validateFn(conn)
        );

        if (!isHealthy) {
          unhealthy.push(conn);
        }
      }
    }

    return unhealthy;
  }
}

// HTTP-specific health check
async function httpHealthCheck(
  connection: any, // HTTP connection object
  healthEndpoint: string = '/health'
): Promise<boolean> {
  try {
    const response = await connection.request({
      method: 'GET',
      path: healthEndpoint,
      headers: {
        'User-Agent': 'HealthCheck/1.0',
      },
    });

    // Consider 2xx and 3xx as healthy
    return response.statusCode >= 200 && response.statusCode < 400;
  } catch (error) {
    return false;
  }
}

// Database-specific health check
async function databaseHealthCheck(connection: any): Promise<boolean> {
  try {
    // Simple query that should always succeed
    const result = await connection.query('SELECT 1');
    return result.rows.length > 0;
  } catch (error) {
    return false;
  }
}

// Usage in connection pool
class HealthAwareConnectionPool {
  private connections: Map<string, PooledConnection> = new Map();
  private healthChecker: ConnectionHealthChecker;
  private healthCheckInterval: NodeJS.Timer;

  constructor(config: HealthCheckConfig) {
    this.healthChecker = new ConnectionHealthChecker(config);

    // Start background health checking
    this.healthCheckInterval = setInterval(
      () => this.runBackgroundHealthChecks(),
      config.validationIntervalMs
    );
  }

  async acquire(): Promise<PooledConnection> {
    for (const [id, conn] of this.connections) {
      if (conn.isHealthy && !conn.inUse) {
        // Check if stale connection needs validation
        if (this.healthChecker.needsValidationOnBorrow(conn)) {
          const isValid = await this.healthChecker.validateConnection(
            conn,
            () => httpHealthCheck(conn)
          );
          if (!isValid) {
            this.removeConnection(id);
            continue;
          }
        }

        conn.inUse = true;
        conn.lastUsedAt = new Date();
        return conn;
      }
    }

    // No available connections, create new one
    return this.createConnection();
  }

  private async runBackgroundHealthChecks(): Promise<void> {
    const idleConnections = Array.from(this.connections.values())
      .filter(c => !c.inUse);

    const unhealthy = await this.healthChecker.validateIdleConnections(
      idleConnections,
      (conn) => httpHealthCheck(conn)
    );

    // Remove unhealthy connections
    for (const conn of unhealthy) {
      this.removeConnection(conn.id);
    }
  }
}
```

Connection failures are inevitable in distributed systems. Robust failure handling is essential for reliability.
Types of Connection Failures:
| Failure Type | Manifestation | Recovery Strategy |
|---|---|---|
| Connection refused | ECONNREFUSED | Retry with backoff, try different host |
| Connection timeout | ETIMEDOUT | Retry with backoff, check network |
| Connection reset | ECONNRESET | Destroy connection, retry |
| DNS failure | ENOTFOUND | Retry, check DNS resolver |
| TLS error | Various | Check certificates, retry |
| Read timeout | Application-level | Retry (if idempotent), log |
| Half-open connection | Data corruption | Detect via health check, destroy |
Error Classification:
┌───────────────────────────────────────────────────────────────────┐
│                        Error Classification                       │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  RETRYABLE (Transient)           NON-RETRYABLE (Permanent)        │
│  ─────────────────────           ─────────────────────────        │
│  • Connection timeout            • 4xx client errors              │
│  • Connection reset              • Invalid request                │
│  • 503 Service Unavailable       • Authentication failure         │
│  • 502 Bad Gateway               • Invalid data                   │
│  • 504 Gateway Timeout           • 404 Not Found                  │
│  • Network unreachable           • 400 Bad Request                │
│  • DNS temporary failure         • SSL/TLS errors (invalid cert)  │
│                                                                   │
│  RETRY DECISION:                                                  │
│    If RETRYABLE && (retries < max) && (within deadline)           │
│      → Apply backoff, retry                                       │
│    Else                                                           │
│      → Return error to caller                                     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```typescript
// Comprehensive connection failure handling

enum ErrorCategory {
  TRANSIENT_NETWORK,    // Connection timeout, reset, etc.
  TRANSIENT_SERVER,     // 502, 503, 504
  CLIENT_ERROR,         // 4xx errors
  CONFIGURATION_ERROR,  // TLS, auth setup issues
  UNKNOWN,
}

interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  jitterFactor: number;  // 0-1, random variance
}

function categorizeError(error: Error | Response): ErrorCategory {
  // Network-level errors
  if (error instanceof Error) {
    const code = (error as NodeJS.ErrnoException).code;

    const transientNetworkCodes = [
      'ECONNRESET',
      'ECONNREFUSED',
      'ETIMEDOUT',
      'ENOTFOUND',       // DNS failure (might be transient)
      'EPIPE',
      'EHOSTUNREACH',
      'ENETUNREACH',
    ];

    if (code && transientNetworkCodes.includes(code)) {
      return ErrorCategory.TRANSIENT_NETWORK;
    }

    // TLS errors
    if (error.message.includes('certificate') ||
        error.message.includes('SSL') ||
        error.message.includes('TLS')) {
      return ErrorCategory.CONFIGURATION_ERROR;
    }
  }

  // HTTP response errors
  if ('status' in error) {
    const status = (error as Response).status;

    if (status >= 500 && status < 600) {
      // Most 5xx are transient, but not all
      if ([502, 503, 504].includes(status)) {
        return ErrorCategory.TRANSIENT_SERVER;
      }
    }

    if (status >= 400 && status < 500) {
      return ErrorCategory.CLIENT_ERROR;
    }
  }

  return ErrorCategory.UNKNOWN;
}

function isRetryable(error: Error | Response): boolean {
  const category = categorizeError(error);
  return category === ErrorCategory.TRANSIENT_NETWORK ||
         category === ErrorCategory.TRANSIENT_SERVER;
}

function calculateBackoff(
  attempt: number,
  config: RetryConfig
): number {
  // Exponential backoff with jitter
  const exponentialDelay =
    config.initialDelayMs * Math.pow(config.backoffMultiplier, attempt);
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

  // Add jitter to prevent thundering herd
  const jitter = cappedDelay * config.jitterFactor * Math.random();

  return cappedDelay + jitter;
}

async function executeWithRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  onRetry?: (attempt: number, error: Error, nextDelayMs: number) => void
): Promise<T> {
  let lastError: Error;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error as Error;

      if (!isRetryable(error as Error) || attempt >= config.maxRetries) {
        throw lastError;
      }

      const delayMs = calculateBackoff(attempt, config);

      if (onRetry) {
        onRetry(attempt + 1, lastError, delayMs);
      }

      await sleep(delayMs);
    }
  }

  throw lastError!;
}

// Connection pool with failure handling
class ResilientConnectionPool {
  private pool: ConnectionPool;
  private retryConfig: RetryConfig;
  private failureTracker: Map<string, number> = new Map();

  async executeRequest<T>(
    endpoint: string,
    request: () => Promise<T>
  ): Promise<T> {
    return executeWithRetry(
      async () => {
        const connection = await this.pool.acquire();

        try {
          const result = await request();
          this.recordSuccess(endpoint);
          return result;
        } catch (error) {
          this.recordFailure(endpoint, error as Error);

          // Return connection or destroy based on error
          if (isConnectionBroken(error as Error)) {
            this.pool.destroy(connection);
          } else {
            this.pool.release(connection);
          }

          throw error;
        }
      },
      this.retryConfig,
      (attempt, error, delay) => {
        console.log(
          `Retry ${attempt}/${this.retryConfig.maxRetries} for ${endpoint}`,
          `after ${delay}ms due to: ${error.message}`
        );
      }
    );
  }

  private recordFailure(endpoint: string, error: Error): void {
    const current = this.failureTracker.get(endpoint) || 0;
    this.failureTracker.set(endpoint, current + 1);

    // Could trigger circuit breaker here
    if (current + 1 >= 5) {
      console.warn(`High failure rate for ${endpoint}`);
    }
  }

  private recordSuccess(endpoint: string): void {
    this.failureTracker.delete(endpoint);
  }
}

function isConnectionBroken(error: Error): boolean {
  const code = (error as NodeJS.ErrnoException).code;
  return ['ECONNRESET', 'EPIPE', 'ETIMEDOUT'].includes(code || '');
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

Connections consume resources on both client and server. Without limits, a misbehaving client or traffic spike can exhaust resources and cause system failure.
Client-Side Limits:
| Resource | Limit | Symptom When Exhausted |
|---|---|---|
| Ephemeral ports | ~64K per destination IP | EADDRNOTAVAIL |
| File descriptors | ulimit -n | EMFILE, ENFILE |
| Memory (buffers) | Physical RAM | OOM, swap |
| Thread pool | Pool size | Request queuing |
Server-Side Limits:
| Resource | Limit | Symptom When Exhausted |
|---|---|---|
| File descriptors | ulimit -n | Cannot accept connections |
| Listen backlog | somaxconn | Connection drops |
| Worker threads/processes | Configuration | Slow response, queuing |
| Memory | Physical RAM | OOM killer, swap |
| Connection slots | Max connections | Connection refused |
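Either side can watch its own consumption. A Linux-only sketch that counts the current process's open file descriptors (sockets included) via /proc:

```typescript
import { readdir } from 'node:fs/promises';

// Each entry in /proc/self/fd is one open file descriptor for this process.
async function openFdCount(): Promise<number> {
  const fds = await readdir('/proc/self/fd');
  return fds.length;
}

openFdCount().then(n => console.log(`open file descriptors: ${n}`));
```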
Linux Kernel Tuning for High Connection Loads:
# Increase file descriptor limits
ulimit -n 100000
# Or persist in /etc/security/limits.conf
* soft nofile 100000
* hard nofile 100000
# Kernel parameters (/etc/sysctl.conf)
# Increase system-wide file descriptor limit
fs.file-max = 1000000
# Increase socket buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase connection queue
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Allow reuse of sockets in TIME_WAIT for new outbound connections
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
# Increase ephemeral port range
net.ipv4.ip_local_port_range = 1024 65535
# Enable TCP window scaling
net.ipv4.tcp_window_scaling = 1
Connection Rate Limiting:
Protect against connection storms with rate limiting:
# Nginx rate limiting for connections
limit_conn_zone $binary_remote_addr zone=addr:10m;
limit_conn addr 10; # Max 10 connections per IP
limit_conn_status 429; # Return 429 when exceeded
limit_req_zone $binary_remote_addr zone=req_limit:10m rate=10r/s;
limit_req zone=req_limit burst=20 nodelay;
Scenario: Server restarts, 10,000 clients immediately try to reconnect. Without limits: All connections accepted → server overwhelmed → fails again → clients retry → infinite loop. With limits: Accept what can be handled, reject rest with 503, clients back off exponentially → graceful recovery.
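On the client side, the "back off exponentially" part can be as simple as the sketch below, which mirrors calculateBackoff() above; connect() is a placeholder for whatever reconnect call your client uses:

```typescript
async function reconnectWithBackoff(
  connect: () => Promise<void>,  // placeholder for your client's reconnect call
  maxDelayMs = 30_000,
): Promise<void> {
  let delayMs = 1_000;
  for (;;) {
    try {
      await connect();
      return;                                        // reconnected successfully
    } catch {
      const jittered = Math.random() * delayMs;      // full jitter spreads the herd
      await new Promise(r => setTimeout(r, jittered));
      delayMs = Math.min(delayMs * 2, maxDelayMs);   // cap the exponential growth
    }
  }
}
```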
Effective connection management requires visibility into pool behavior. Key metrics to monitor:
Essential Metrics:
| Metric | What It Shows | Alert Threshold |
|---|---|---|
| Pool size | Current connection count | > 80% of max |
| Available connections | Idle connections ready | < 2 connections |
| Wait time | Time to acquire connection | > 100ms |
| Wait queue depth | Requests waiting for connection | > 0 sustained |
| Connection errors | Failed connection attempts | Rate spike |
| Connection churn | Connections opened/closed rate | High churn |
| Stale connections | Connections failing validation | > 0 sustained |
| Connection age | Average lifetime | Too short indicates churn |
```typescript
// Connection pool metrics collection

interface PoolMetrics {
  // Current state
  totalConnections: number;
  availableConnections: number;
  inUseConnections: number;
  waitingRequests: number;

  // Timing
  averageAcquireTimeMs: number;
  maxAcquireTimeMs: number;
  p99AcquireTimeMs: number;

  // Errors
  connectionFailures: number;
  validationFailures: number;
  timeoutErrors: number;

  // Churn
  connectionsCreated: number;
  connectionsDestroyed: number;
  connectionReuses: number;
}

class MetricsCollector {
  private acquireTimes: number[] = [];
  private counters: Map<string, number> = new Map();
  private gauges: Map<string, number> = new Map();

  recordAcquireTime(durationMs: number): void {
    this.acquireTimes.push(durationMs);

    // Keep last 1000 samples
    if (this.acquireTimes.length > 1000) {
      this.acquireTimes.shift();
    }
  }

  incrementCounter(name: string, value: number = 1): void {
    const current = this.counters.get(name) || 0;
    this.counters.set(name, current + value);
  }

  setGauge(name: string, value: number): void {
    this.gauges.set(name, value);
  }

  getMetrics(): PoolMetrics {
    const sorted = [...this.acquireTimes].sort((a, b) => a - b);

    return {
      totalConnections: this.gauges.get('total_connections') || 0,
      availableConnections: this.gauges.get('available_connections') || 0,
      inUseConnections: this.gauges.get('in_use_connections') || 0,
      waitingRequests: this.gauges.get('waiting_requests') || 0,

      averageAcquireTimeMs: this.average(sorted),
      maxAcquireTimeMs: sorted[sorted.length - 1] || 0,
      p99AcquireTimeMs: this.percentile(sorted, 0.99),

      connectionFailures: this.counters.get('connection_failures') || 0,
      validationFailures: this.counters.get('validation_failures') || 0,
      timeoutErrors: this.counters.get('timeout_errors') || 0,

      connectionsCreated: this.counters.get('connections_created') || 0,
      connectionsDestroyed: this.counters.get('connections_destroyed') || 0,
      connectionReuses: this.counters.get('connection_reuses') || 0,
    };
  }

  private average(sorted: number[]): number {
    if (sorted.length === 0) return 0;
    return sorted.reduce((a, b) => a + b, 0) / sorted.length;
  }

  private percentile(sorted: number[], p: number): number {
    if (sorted.length === 0) return 0;
    const index = Math.ceil(sorted.length * p) - 1;
    return sorted[Math.max(0, index)];
  }

  // Export for Prometheus/Grafana
  toPrometheusFormat(): string {
    const metrics = this.getMetrics();

    return `
# HELP connection_pool_size Current number of connections
# TYPE connection_pool_size gauge
connection_pool_size{state="total"} ${metrics.totalConnections}
connection_pool_size{state="available"} ${metrics.availableConnections}
connection_pool_size{state="in_use"} ${metrics.inUseConnections}

# HELP connection_pool_wait_queue Requests waiting for connection
# TYPE connection_pool_wait_queue gauge
connection_pool_wait_queue ${metrics.waitingRequests}

# HELP connection_acquire_time_ms Time to acquire connection
# TYPE connection_acquire_time_ms summary
connection_acquire_time_ms{quantile="0.5"} ${this.percentile([...this.acquireTimes].sort((a, b) => a - b), 0.5)}
connection_acquire_time_ms{quantile="0.99"} ${metrics.p99AcquireTimeMs}

# HELP connection_failures_total Total connection failures
# TYPE connection_failures_total counter
connection_failures_total ${metrics.connectionFailures}
`.trim();
  }
}

// Alert conditions
function checkAlerts(metrics: PoolMetrics): string[] {
  const alerts: string[] = [];

  // Pool exhaustion
  const utilizationPct =
    metrics.inUseConnections / metrics.totalConnections * 100;
  if (utilizationPct > 80) {
    alerts.push(`High pool utilization: ${utilizationPct.toFixed(1)}%`);
  }

  // Wait queue
  if (metrics.waitingRequests > 0) {
    alerts.push(`Requests waiting for connections: ${metrics.waitingRequests}`);
  }

  // Slow acquisition
  if (metrics.p99AcquireTimeMs > 100) {
    alerts.push(`High p99 acquire time: ${metrics.p99AcquireTimeMs}ms`);
  }

  // Connection churn
  const churnRate =
    (metrics.connectionsCreated + metrics.connectionsDestroyed) /
    Math.max(metrics.connectionReuses, 1);
  if (churnRate > 0.1) {
    alerts.push(`High connection churn: ${(churnRate * 100).toFixed(1)}%`);
  }

  return alerts;
}
```

Connection management is a foundational skill for building performant, reliable distributed systems. Proper connection handling can mean the difference between a system that scales gracefully and one that collapses under load.
Module Complete:
You have now completed Module 1: Request-Response Pattern. You understand how request-response communication works on the wire, where its latency comes from, how connections are established, and how to manage those connections for performance and reliability.
With this foundation, you're prepared to explore the remaining modules in this chapter: Service Discovery, Circuit Breaker Pattern, Retry Strategies, Timeout Patterns, and Bulkhead Pattern—each building on the request-response fundamentals covered here.
You have mastered the Request-Response Pattern module. You now possess comprehensive knowledge of synchronous communication—from protocol mechanics to production-grade connection management. This knowledge forms the foundation for understanding more advanced patterns like circuit breakers, retries, and bulkheads.