When engineers first implement timeouts, a common mistake is treating "timeout" as a single, monolithic concept—setting one timeout value and expecting it to protect against all waiting scenarios. This oversimplification misses a crucial distinction: the failure modes during connection establishment are fundamentally different from the failure modes during data transfer.
A connection timeout protects you during the handshake phase, when your client is attempting to establish a transport-layer connection with a server. A read timeout (sometimes called a socket timeout or response timeout) protects you after the connection is established, during the actual exchange of data. Understanding this distinction is essential for effective timeout configuration—too short on connection means rejecting reachable servers; too short on read means abandoning in-progress work.
By the end of this page, you will understand the precise distinction between connection and read timeouts, the different failure modes each protects against, how to size each timeout appropriately for your use case, and the common pitfalls in timeout configuration.
To understand where different timeouts apply, we must first understand the phases of a network request. Consider a simple HTTP request from a client to a server:
Phase 1: DNS Resolution
Before anything else, the client must resolve the server's hostname to an IP address. This typically involves checking the local DNS cache, then querying the configured resolver and waiting for recursive resolution to complete.
DNS resolution can fail silently or hang if DNS servers are unreachable. Many libraries have a separate DNS timeout, though it's often bundled with the connection timeout.
Phase 2: TCP Handshake (Connection Establishment)
Once the IP address is known, the client initiates a TCP connection: it sends a SYN packet, the server replies with SYN-ACK, and the client completes the handshake with an ACK.
This three-way handshake typically completes in milliseconds on a healthy network but can hang indefinitely if the host is unreachable, a firewall silently drops the packets, the network path is broken, or the server's accept backlog is full.
The connection timeout governs this phase.
Phase 3: TLS Handshake (For HTTPS)
For secure connections, an additional handshake occurs after TCP establishment: the client and server exchange hello messages, the server presents its certificate for validation, and the two sides negotiate session keys.
This adds latency (potentially 1-2 additional round trips) and introduces more failure points. TLS handshake issues can cause connections to hang during certificate validation or key exchange. Some libraries include TLS handshake time in the connection timeout; others have a separate TLS timeout.
Phase 4: Request Transmission and Response Reception
Once the connection is established (and optionally secured), the actual HTTP transaction begins: the client transmits the request headers and body, the server processes the request, and the server streams back the response headers and body.
The read timeout governs the waiting between sending a request and receiving the complete response.
This phase can hang if the server is slow or deadlocked, if it crashed after receiving the request, or if the network stalls mid-transfer.
The connection timeout defines the maximum time your client will wait to establish a transport-layer connection with a server. This timeout fires if the TCP (and optionally TLS) handshake doesn't complete within the specified duration.
What it protects against:
Unreachable hosts — The IP is valid but the host is powered off or network-disconnected. Without RST packets (which require an active host), the client keeps retrying SYN packets according to OS-level configurations (often waiting minutes).
Firewall black-holing — Some firewalls silently drop packets rather than returning ICMP unreachable or TCP RST. From the client's perspective, this is indistinguishable from a very slow network.
Network partitions — The path to the server is broken somewhere in the network. Packets leave but never arrive.
Server overwhelm — The server's connection backlog is full. New SYN packets are ignored until existing connections are accepted.
| Failure Mode | Symptom | Without Connection Timeout | With Connection Timeout |
|---|---|---|---|
| Host unreachable (no RST) | SYN sent, no response | Wait for OS-level timeout (2+ minutes) | Fail after configured timeout |
| Firewall dropping packets | SYN sent, silently dropped | Wait indefinitely | Fail after configured timeout |
| Server backlog full | SYN sent, no SYN-ACK | Wait for backlog to drain (may never happen) | Fail after configured timeout |
| DNS failure | Resolution hangs | Wait for DNS timeout (varies) | May be covered if DNS is included |
| TLS certificate issue | Handshake stalls | Wait indefinitely (depending on library) | Fail if included in connection timeout |
Typical connection timeout values:
Connection timeouts are generally short—measured in hundreds of milliseconds to a few seconds. The reasoning: a healthy handshake completes within a few network round trips, so anything longer almost always signals a problem, and failing fast frees the caller to retry or try another host.
Common ranges: 200-500ms within a datacenter, 1-3s across regions, and 2-5s for external services over the internet.
Connection pooling implications:
When using connection pools, the connection timeout applies only when the pool must establish a new connection: when the pool is empty, below its minimum size, or replacing a broken or expired connection. Requests served by an already-established pooled connection skip the handshake entirely.
Many connection pool implementations have a separate "acquire timeout" for waiting to get a connection from the pool, distinct from the timeout for establishing new connections.
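The acquire/connect distinction can be sketched in a few lines. This is an illustrative pattern, not any specific pool library's API: `Pool` and `acquire()` are hypothetical stand-ins for whatever your pool exposes.

```typescript
// Sketch: a pool "acquire timeout" distinct from the connection timeout.
// `Pool` and `acquire()` are hypothetical stand-ins for your pool library.
interface Pool<T> {
  acquire(): Promise<T>;
}

function acquireWithTimeout<T>(pool: Pool<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`pool acquire timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
    pool.acquire().then(
      (conn) => { clearTimeout(timer); resolve(conn); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}
```

If the pool has to open a brand-new connection to satisfy the acquire, the separate connection timeout applies inside `acquire()` itself; the wrapper above only bounds the wait for *a* connection, new or reused.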
```java
import java.net.http.HttpClient;
import java.time.Duration;

import org.apache.http.client.config.RequestConfig;

// Java 11+ HttpClient with explicit connection timeout
HttpClient client = HttpClient.newBuilder()
    .connectTimeout(Duration.ofSeconds(2)) // Connection timeout
    .build();

// Legacy Apache HttpClient
RequestConfig config = RequestConfig.custom()
    .setConnectTimeout(2000)               // Connection establishment timeout (ms)
    .setConnectionRequestTimeout(1000)     // Time to get connection from pool
    .build();
```

DNS resolution often happens before the connection timeout starts counting. If DNS takes 30 seconds and connection establishment then times out after 2 seconds, your effective timeout is 32 seconds. Configure DNS timeouts explicitly or use IP addresses for critical paths.
The read timeout (also called socket timeout, response timeout, or data timeout) defines the maximum time your client will wait to receive data after the connection is established and the request is sent. Crucially, this timeout applies to the gaps between data reception, not the total response time.
What it protects against:
Server processing delays — The server received your request but is slow processing it (database locks, external dependencies, CPU saturation).
Server crashes during processing — The server crashed after receiving your request but before sending a response. The TCP connection might remain half-open.
Network issues during transfer — Packets are being lost or delayed, causing slow or stalled data transfer.
Server bugs — The server entered an infinite loop or deadlock and will never respond.
Slow streams — The server is streaming a response very slowly (perhaps because its upstream dependency is slow).
Critical distinction: Socket timeout vs. total timeout
This is where confusion often arises. A read/socket timeout typically measures the time between receiving any data, not the total time for the complete response.
Socket timeout behavior (most HTTP libraries): the timer starts when the client begins waiting for data and resets every time any bytes arrive; it fires only if the gap between data events exceeds the configured duration.
Example: 5-second socket timeout with streaming response:
T+0s: Request sent, waiting for response...
T+4s: First byte received, clock resets
T+8s: More data received, clock resets
T+12s: More data received, clock resets
T+15s: Response complete (no timeout, despite 15s total time)
The response took 15 seconds total, but because data arrived every 4 seconds, the 5-second socket timeout never fired.
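This clock-reset behavior is easy to model. The sketch below is a toy simulation, not a real socket API: given the arrival times of data chunks, it reports when a socket timeout would fire.

```typescript
// Model of a read/socket timeout: the clock resets each time data arrives.
// Returns the time (s) at which the timeout fires, or null if it never does.
function socketTimeoutFiresAt(
  chunkArrivals: number[], // times (s) at which data chunks arrive, ascending
  socketTimeout: number    // max allowed gap (s) between data events
): number | null {
  let lastActivity = 0; // clock starts when the request is sent
  for (const t of chunkArrivals) {
    if (t - lastActivity > socketTimeout) {
      return lastActivity + socketTimeout; // gap too long: timeout fires
    }
    lastActivity = t; // data arrived: reset the clock
  }
  return null; // every gap was within the timeout
}

// The streaming example from the text: chunks at 4s, 8s, 12s, complete at
// 15s, with a 5s socket timeout. Every gap is 4s or less, so no timeout.
socketTimeoutFiresAt([4, 8, 12, 15], 5); // → null
```

Change the timeout to 3 seconds and the same stream fails at T+3s, before the first byte ever arrives, which is exactly the trade-off the text describes.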
Total timeout behavior:
Some clients offer a separate "total" or "request" timeout that bounds the entire operation regardless of streaming:
T+0s: Request sent, waiting for response...
T+4s: First byte received
T+8s: More data received
T+10s: TIMEOUT! Total timeout of 10s exceeded
This is safer for many use cases—you don't want a malicious server keeping your connection open indefinitely by dribbling bytes.
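In environments with the standard `AbortSignal` API (Node 17.3+, modern browsers), a total timeout can be attached to a fetch-style client in one line. A minimal sketch; the helper name is illustrative:

```typescript
// Sketch: a total/request timeout using the standard AbortSignal API.
// Unlike a socket timeout, AbortSignal.timeout never resets: it aborts the
// whole operation at the deadline, even if bytes are still trickling in.
async function fetchWithTotalTimeout(url: string, totalMs: number) {
  return fetch(url, { signal: AbortSignal.timeout(totalMs) });
}
```

When the signal fires, both the in-flight request and any in-progress body read are aborted, so a server dribbling one byte every few seconds cannot hold the client past the deadline.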
| Timeout Type | Clock Behavior | Fires When | Best For |
|---|---|---|---|
| Socket/Read Timeout | Resets on any data | No data for N seconds | Detecting dead connections |
| Response Timeout | Starts on first byte wait | Headers not received in N seconds | Detecting slow servers |
| Total/Request Timeout | Never resets | Total time exceeds N seconds | Bounding overall latency |
| Idle Timeout | Resets on any activity | No activity in either direction | Connection pool cleanup |
Typical read timeout values:
Read timeouts are generally longer than connection timeouts because they include server processing time:
Common ranges by operation type: sub-second for caches, 5-10s for internal service calls, 30-60s for external APIs, and 30s or more for complex analytics queries.
The timeout should reflect your SLA, not your hope. A common mistake is setting read timeouts based on average response time: if your service averages 100ms but occasionally takes 5s, a 200ms timeout fails every slow-but-legitimate response. Instead, derive timeouts from a high percentile of observed latency (p99 or p99.9) plus headroom, and let your SLA cap the upper bound.
If your overall SLA is 500ms, and you call 3 downstream services sequentially, you cannot give each service a 500ms timeout—you'd potentially wait 1.5 seconds. Budget your timeouts: perhaps 150ms each, with 50ms buffer for your own processing.
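The budgeting arithmetic from this example can be captured in a small helper. A sketch, assuming the SLA is split evenly across sequential calls after reserving a fixed buffer for the service's own processing:

```typescript
// Split an overall SLA across N sequential downstream calls, reserving a
// buffer for local processing. Hypothetical helper for illustration.
function perCallBudgetMs(
  slaMs: number,
  sequentialCalls: number,
  bufferMs: number
): number {
  const available = slaMs - bufferMs;
  if (available <= 0) throw new Error("buffer consumes the entire SLA");
  return Math.floor(available / sequentialCalls);
}

// The example from the text: 500ms SLA, 3 sequential calls, 50ms buffer.
perCallBudgetMs(500, 3, 50); // → 150
```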
Configuring timeouts correctly requires understanding both the technical mechanics and the operational context. These patterns help avoid common pitfalls.
Pattern: Layered timeout configuration
In complex systems, you often have multiple layers that each need timeout configuration:
API Gateway (total request timeout: 30s)
└── Service A (client timeout: 10s)
└── Database Pool (acquire timeout: 2s)
└── Database Query (statement timeout: 8s)
Each layer should have timeouts, but inner layer timeouts should be shorter than outer layers. If the database query timeout is 8s and Service A's overall timeout is 10s, there's 2s buffer for connection, network, and processing overhead.
Anti-pattern: Timeouts longer than outer layer:
API Gateway (total request timeout: 10s) # Outer layer
└── Service A (client timeout: 30s) # WRONG: Inner > Outer
Here, Service A might wait 30 seconds for a dependency, but the API Gateway already gave up after 10 seconds. The client got an error, but Service A continues working on a doomed request, wasting resources.
```typescript
// Layered timeout configuration example
const config = {
  // External gateway/proxy timeout (if applicable)
  externalTimeout: 30_000, // 30s

  // Service-level request timeout
  requestTimeout: 25_000, // 25s, less than external

  // Individual downstream call timeouts
  downstreamTimeouts: {
    paymentService: {
      connection: 2_000, // 2s to connect
      read: 15_000,      // 15s for payment processing
    },
    inventoryService: {
      connection: 1_000, // 1s to connect (same DC)
      read: 5_000,       // 5s for inventory check
    },
    cache: {
      connection: 500, // 500ms to Redis
      read: 1_000,     // 1s max for cache operations
    },
    database: {
      acquire: 2_000, // 2s to get connection from pool
      query: 10_000,  // 10s for query execution
    },
  },
};

// Validation: ensure inner timeouts don't exceed outer
function validateTimeouts(cfg: typeof config): void {
  const maxDownstream = Math.max(
    cfg.downstreamTimeouts.paymentService.connection +
      cfg.downstreamTimeouts.paymentService.read,
    cfg.downstreamTimeouts.inventoryService.connection +
      cfg.downstreamTimeouts.inventoryService.read,
    // ... etc
  );
  if (maxDownstream > cfg.requestTimeout) {
    throw new Error(
      `Downstream timeout (${maxDownstream}ms) exceeds request timeout (${cfg.requestTimeout}ms)`
    );
  }
}
```

Consider using different timeout values for different environments. Development environments might use longer timeouts for debugging ease, while production uses strict timeouts. CI/CD environments might use the strictest timeouts to catch timeout-sensitive code early.
Even experienced engineers stumble over certain timeout-related issues. These pitfalls are common enough to deserve explicit attention.
Pitfall deep dive: The retry multiplication problem
Retries and timeouts interact in subtle ways. Consider this configuration:
Read timeout: 10 seconds
Max retries: 3 (meaning 4 total attempts)
No backoff or exponential delay
Worst case latency: 40 seconds (10s × 4 attempts)
If your SLA is 30 seconds, this configuration violates the SLA even with aggressive per-request timeouts.
The solution: Calculate your timeout budget considering retries:
SLA: 30 seconds
Max attempts: 4 (1 initial + 3 retries)
Per-attempt timeout: 30s ÷ 4 = 7.5s (round down to 7s for safety)
Alternatively, implement a total timeout that spans all retry attempts:
Total request timeout: 30 seconds (hard deadline)
Per-attempt timeout: 10 seconds
Retries continue until total timeout exhausted
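Both budgeting approaches reduce to simple arithmetic. A sketch of the calculations above; the helper names are illustrative:

```typescript
// Worst-case latency when every attempt runs to its full timeout
// (no backoff between attempts).
function worstCaseLatencyMs(perAttemptTimeoutMs: number, maxAttempts: number): number {
  return perAttemptTimeoutMs * maxAttempts;
}

// Per-attempt budget that keeps all attempts within the SLA, rounded
// down to whole seconds for safety, as in the text.
function perAttemptBudgetMs(slaMs: number, maxAttempts: number): number {
  return Math.floor(slaMs / maxAttempts / 1000) * 1000;
}

worstCaseLatencyMs(10_000, 4); // → 40000: violates a 30s SLA
perAttemptBudgetMs(30_000, 4); // → 7000 (30s ÷ 4 = 7.5s, rounded down to 7s)
```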
A subtle pitfall: when using connection pools, a long read timeout doesn't just affect one request—it holds a connection from the pool. If all pooled connections are stuck waiting for slow responses, new requests can't acquire connections. Set connection pool acquire timeouts to fail fast rather than queue indefinitely.
While connection and read timeouts receive most of the attention, there's a third timeout type that's often overlooked: the write timeout. This timeout governs how long the client waits to send data to the server.
When write timeouts matter: sending large request bodies (file uploads, batch payloads), operating over slow or congested networks, or talking to a receiver that has stopped reading from its socket.
The mechanics:
TCP uses flow control to prevent fast senders from overwhelming slow receivers. If the receiver's buffer is full, the sender blocks until buffer space is available. A write timeout protects the sender from blocking indefinitely in this situation.
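A toy model makes the blocking behavior concrete. This assumes a fixed receiver buffer and a constant drain rate (real TCP windows are dynamic), so treat it as an illustration rather than a protocol simulation:

```typescript
// Toy model of TCP flow control from the sender's side: a write of `bytes`
// can only proceed as fast as the receiver drains its buffer. Returns how
// long the sender blocks (s), or null if the write never completes.
function writeBlockTimeSecs(
  bytes: number,
  receiverBufferBytes: number,
  drainBytesPerSec: number
): number | null {
  if (bytes <= receiverBufferBytes) return 0; // fits in the window: no blocking
  if (drainBytesPerSec <= 0) return null;     // receiver stalled: blocks forever
  return (bytes - receiverBufferBytes) / drainBytesPerSec;
}

writeBlockTimeSecs(3000, 1000, 100); // → 20: sender blocks for 20s
writeBlockTimeSecs(3000, 1000, 0);   // → null: without a write timeout,
                                     //   the sender waits forever
```

The `null` case is precisely the scenario a write timeout converts from an indefinite hang into a bounded failure.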
| Timeout Type | Governs | Typical Values | When It Matters |
|---|---|---|---|
| Connection | TCP + TLS handshake | 1-5 seconds | Every connection |
| Read | Waiting for response data | 5-60 seconds | Every request |
| Write | Sending request data | 5-30 seconds | Large request bodies, slow networks |
Write timeout configuration:
Many HTTP client libraries don't expose write timeouts directly, or bundle them with read timeouts. Here's how to configure them in libraries that support it:
```go
import (
	"context"
	"net"
	"net/http"
	"time"
)

transport := &http.Transport{
	DialContext: (&net.Dialer{
		Timeout: 2 * time.Second, // Connection timeout
	}).DialContext,
	// Write timeout is not directly configurable on Transport.
	// Use a context deadline for clients.
	// For servers, configure WriteTimeout:
	// server := &http.Server{
	//     WriteTimeout: 10 * time.Second,
	// }
}

// For clients, use a context with a deadline that covers the entire operation
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

req, _ := http.NewRequestWithContext(ctx, "POST", url, largeBody)
resp, err := client.Do(req)
```

If your service accepts file uploads or large POST bodies, write timeouts are essential. Without them, a client with a slow network (or a malicious slow-loris attack) can hold your server's connection and memory indefinitely while slowly trickling data.
Let's synthesize everything into actionable configuration guidance for common scenarios.
| Scenario | Connection Timeout | Read Timeout | Total Timeout | Rationale |
|---|---|---|---|---|
| Internal service (same DC) | 500ms | 5s | 10s | Low latency expected; fast failure on issues |
| Internal service (different region) | 2s | 10s | 30s | Network latency is higher; more variability |
| External API (payment) | 3s | 30s | 45s | Third-party may be slow; their SLA matters |
| Cache (Redis/Memcached) | 200ms | 1s | 2s | Cache should be fast; a slow cache defeats its purpose |
| Database query (simple) | 1s | 5s | 10s | Simple queries should be quick |
| Database query (complex) | 1s | 30s | 60s | Analytics queries can be slow |
| Health check | 500ms | 2s | 3s | Health checks must be fast; slow check = unhealthy |
| File upload endpoint | 2s | 10s per MB | Based on size | Scale with expected file sizes |
Template configuration for a typical web service:
```yaml
# Timeout configuration template
timeouts:
  # Default timeouts for service-to-service calls
  default:
    connection_ms: 1000   # 1s to establish connection
    read_ms: 10000        # 10s to receive response
    write_ms: 10000       # 10s to send request
    total_ms: 15000       # 15s overall deadline

  # Override per service/operation
  services:
    payment-service:
      connection_ms: 2000 # Payment infra may be in different network
      read_ms: 30000      # Payment processing can be slow
      total_ms: 45000     # Give plenty of room

    inventory-service:
      connection_ms: 500  # Same datacenter
      read_ms: 5000       # Quick lookups
      total_ms: 8000

    recommendation-engine:
      connection_ms: 1000
      read_ms: 3000
      total_ms: 5000      # Note: recommendations are non-critical; fail fast

  # Infrastructure
  cache:
    connection_ms: 200
    read_ms: 500
    total_ms: 1000

  database:
    connection_pool:
      acquire_ms: 2000    # Time to get connection from pool
      connection_ms: 1000
    default_query_ms: 5000   # Simple queries
    complex_query_ms: 30000  # Reports, aggregations
```

It's better to start with aggressive timeouts and loosen them based on metrics than to start permissive and tighten. Aggressive timeouts surface latency issues early; permissive timeouts hide problems until they become outages.
What's next:
You now understand the crucial distinction between connection and read timeouts, the failure modes each protects against, and how to size them for your use case. The next page explores timeout propagation: how timeout values flow through a chain of service calls, and how to ensure deadlines are respected across service boundaries.