Every HTTP request, every API call, every WebSocket connection rides on top of TCP (Transmission Control Protocol). TCP provides the reliable, ordered delivery that applications depend on—but it was designed in the 1980s for a very different internet than today's global, high-bandwidth, latency-sensitive environment.
CDNs don't just use TCP—they optimize it aggressively. The difference between default TCP settings and CDN-tuned TCP can mean 2-3x throughput improvements and 30-50% latency reductions, particularly for dynamic content where every request traverses the full protocol stack.
This page explores TCP at the level CDN engineers work with it: congestion control algorithms, window sizing strategies, TCP Fast Open, and advanced kernel tuning. You'll understand why standard TCP configurations leave significant performance on the table.
TCP was designed for reliability over performance. Its default behaviors, while ensuring correctness, create significant performance penalties for modern web traffic. Understanding these challenges is essential before exploring optimizations.
```
Scenario: Sydney → Virginia, 160 ms RTT, 0% loss
Target throughput: 100 Mbps
Bandwidth-Delay Product: 100 Mbps × 0.16 s = 16 Mbit = 2 MB
Initial Congestion Window (IW): 10 segments (14 KB)

Round Trip | Window Size | Throughput
-----------+-------------+-----------
     1     |    14 KB    |   0.7 Mbps
     2     |    28 KB    |   1.4 Mbps
     3     |    56 KB    |   2.8 Mbps
     4     |   112 KB    |   5.6 Mbps
     5     |   224 KB    |  11.2 Mbps
     6     |   448 KB    |  22.4 Mbps
     7     |   896 KB    |  44.8 Mbps
     8     |   1.79 MB   |  89.6 Mbps
     9     |   2.0 MB    | 100.0 Mbps (saturated)

Time to full throughput: 9 RTT × 160 ms = 1.44 seconds

For a 50 KB dynamic API response, average throughput is ~5 Mbps,
not the 100 Mbps available, because slow start never finishes.
```

Slow start particularly penalizes dynamic content. With typical API responses under 100 KB, connections often complete before reaching full throughput. Each request pays the slow start penalty without ever benefiting from a warmed-up connection.
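The ramp in the table above can be reproduced in a few lines. This is a simplified model: the window doubles each round and caps at the BDP, ignoring ACK clocking and header overhead.

```typescript
// Simplified slow-start model for the Sydney -> Virginia example above.
const RTT_S = 0.16;          // 160 ms round trip
const BDP_BYTES = 2_000_000; // 100 Mbps x 0.16 s = 2 MB

// Window size after a given round, starting from iwBytes and doubling.
function windowAfterRound(round: number, iwBytes: number): number {
  return Math.min(iwBytes * 2 ** (round - 1), BDP_BYTES);
}

// Rounds until the window reaches the BDP (pipe fully utilized).
function roundsToSaturate(iwBytes: number): number {
  let round = 1;
  while (windowAfterRound(round, iwBytes) < BDP_BYTES) round++;
  return round;
}

const rounds = roundsToSaturate(14_000); // IW = 10 segments ≈ 14 KB
console.log(rounds, rounds * RTT_S);     // 9 rounds, 1.44 s to full throughput
```

Raising the initial window shifts the whole ramp left, which is exactly the lever discussed later in this page.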
Congestion control determines how TCP adjusts its sending rate in response to network conditions. The choice of algorithm dramatically impacts performance. CDNs typically use modern algorithms optimized for their specific traffic patterns.
| Algorithm | Loss Response | Characteristics | Best For |
|---|---|---|---|
| CUBIC (default) | Halve window on loss | Conservative, fair, widely deployed | General internet, shared links |
| BBR (Google) | Model-based, not loss-based | Aggressive, high throughput, may be unfair | Long-distance, high-bandwidth paths |
| BBRv2 | Improved fairness over BBR | Better coexistence with CUBIC | Production CDN deployments |
| QUIC CC | Pluggable, evolving | Works above UDP, avoids OS kernel limits | HTTP/3, performance-critical apps |
Deep dive: BBR (Bottleneck Bandwidth and Round-trip propagation time)
BBR represents a fundamental rethinking of congestion control. While CUBIC and its predecessors react to packet loss as a congestion signal, BBR actively measures two properties: the bottleneck bandwidth (BtlBw), the maximum delivery rate the path sustains, and the round-trip propagation time (RTprop), the minimum RTT observed when no queues are building.
BBR uses these measurements to operate at the optimal point—sending at exactly the bottleneck rate without building queues. This avoids the buffer bloat problem where CUBIC fills intermediate buffers, increasing latency.
```typescript
// BBR operates in four phases, cycling continuously
enum BBRPhase {
  STARTUP,   // Quickly find bottleneck bandwidth (exponential search)
  DRAIN,     // Drain queues created during startup
  PROBE_BW,  // Steady state: probe for more bandwidth periodically
  PROBE_RTT, // Periodically drain queues to measure true RTprop
}

interface BBRState {
  btlBw: number;      // Estimated bottleneck bandwidth (max filter)
  rtProp: number;     // Estimated propagation delay (min filter)
  cwnd: number;       // Congestion window
  pacingRate: number; // Sending rate (pacing, not burst)
  phase: BBRPhase;
}

// The key insight: pacing_rate = btlBw, cwnd = btlBw × rtProp
// This keeps exactly one BDP (bandwidth-delay product) in flight
function calculateSendingParameters(state: BBRState): void {
  // Target: send at bottleneck rate, keep pipe exactly full
  state.pacingRate = state.btlBw;

  // Congestion window = one bandwidth-delay product
  // This is the minimum buffer needed to keep the pipe full
  const bdp = state.btlBw * state.rtProp;
  state.cwnd = bdp * 1.25; // Small margin for measurement variance

  // In PROBE_BW, briefly increase rate to test for more capacity
  // In PROBE_RTT, reduce cwnd to drain queues and measure true RTT
}
```

BBR performance advantages for CDNs:

- Sustained high throughput on long-distance, high-bandwidth paths
- Tolerance of random (non-congestion) packet loss, since loss is not the primary congestion signal
- Lower latency, because intermediate buffers are not filled (no bufferbloat)
BBRv1 was criticized for being too aggressive—it could starve CUBIC flows competing on the same link. BBRv2 addresses this with improved fairness, making it more suitable for production CDN deployments where the CDN doesn't control all traffic on the path.
The initial congestion window (IW) determines how much data TCP can send before receiving the first acknowledgment. For short transfers (typical API responses), IW directly determines performance—there's no time for slow start to increase the window.
```shell
# Check current default initial window
$ ss -i | grep -o "cwnd:[0-9]*" | head -1
cwnd:10

# The default IW of 10 segments ≈ 14 KB
# Modern recommendations: IW of 20-30 segments

# Increase initial window via route configuration
$ sudo ip route change default via 10.0.0.1 initcwnd 20 initrwnd 20

# For CDN servers, set system-wide via sysctl
$ sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# Prevents cwnd reset after idle periods

# CDN production configuration often includes:
$ cat /etc/sysctl.d/99-tcp-optimization.conf
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
```

| Response Size | IW=10 (14KB) | IW=20 (28KB) | IW=30 (42KB) | Improvement |
|---|---|---|---|---|
| 10 KB | 1 RTT | 1 RTT | 1 RTT | None (fits in IW) |
| 20 KB | 2 RTTs | 1 RTT | 1 RTT | 50% less latency |
| 40 KB | 3 RTTs | 2 RTTs | 1 RTT | 33-67% less latency |
| 80 KB | 4 RTTs | 3 RTTs | 2 RTTs | 25-50% less latency |
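The table can be sanity-checked with a toy model. This sketch assumes the window doubles each round and ignores ACK clocking and header overhead, so its counts can differ from the table by a round for sizes near a window boundary.

```typescript
const MSS = 1460; // bytes per segment (typical Ethernet MSS)

// RTTs needed to deliver `bytes` when the window starts at
// `iwSegments` and doubles each round (slow start, no loss).
function rttsToDeliver(bytes: number, iwSegments: number): number {
  let window = iwSegments * MSS;
  let sent = 0;
  let rtts = 0;
  while (sent < bytes) {
    sent += window;
    window *= 2; // slow start doubles cwnd every RTT
    rtts += 1;
  }
  return rtts;
}

console.log(rttsToDeliver(20_000, 20)); // a 20 KB response fits in IW=20: 1 RTT
console.log(rttsToDeliver(40_000, 20)); // a 40 KB response at IW=20: 2 RTTs
```

The pattern matches the table's message: for short responses, the initial window, not the link bandwidth, sets the delivery time.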
Calculating the optimal initial window:
The ideal IW depends on typical response sizes. Most CDNs serving dynamic API responses set IW between 20 and 40 segments (28-56 KB), covering the majority of responses in a single round trip.
By default, TCP resets the congestion window after a connection goes idle (tcp_slow_start_after_idle). This devastates performance for keep-alive connections with bursty traffic. CDNs disable this, preserving the accumulated window for reuse.
TCP Fast Open (TFO) eliminates the three-way handshake penalty for repeat visitors by allowing data in the initial SYN packet. This can reduce connection latency by an entire round trip—significant for geographically distant users.
```
STANDARD TCP (3-way handshake):

Client                            Server
  |                                 |
  |-------- SYN ------------------->|  RTT 1
  |<------- SYN-ACK ----------------|
  |-------- ACK + HTTP Request ---->|  RTT 2
  |<------- HTTP Response ----------|
  |                                 |
Total: 2 RTT before response begins

TCP FAST OPEN (with cached cookie):

Client                            Server
  |                                 |
  |-- SYN + Cookie + HTTP --------->|  RTT 1
  |<-- SYN-ACK + HTTP Response -----|
  |-------- ACK ------------------->|  (concurrent)
  |                                 |
Total: 1 RTT before response begins

SAVINGS: 1 full RTT (80-200 ms for distant users)
```

How TFO works:
1. Initial connection: The client requests a TFO cookie in the SYN options. The server generates a cryptographic cookie based on the client IP and a server secret.
2. Cookie caching: The client stores the cookie locally (typically for hours to days).
3. Subsequent connections: The client includes the cookie and application data in the SYN packet. The server validates the cookie and immediately processes the request.
4. Security: The cookie prevents abuse. Attackers can't forge cookies for IP addresses they don't control, mitigating amplification attacks.
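The cookie generation and validation described above can be sketched as an HMAC over the client IP. This is an illustrative model only: the secret, hash choice, and truncation length are placeholders, not the kernel's actual construction.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical server secret; real deployments rotate this periodically.
const SERVER_SECRET = Buffer.from("rotate-me-periodically");

// Issue a cookie bound to the client's IP. TFO cookies are small
// (4-16 bytes), so the HMAC is truncated.
function issueCookie(clientIp: string): Buffer {
  return createHmac("sha256", SERVER_SECRET)
    .update(clientIp)
    .digest()
    .subarray(0, 8);
}

// Validate a cookie presented in a later SYN. Constant-time comparison
// avoids leaking information about partial matches.
function validateCookie(clientIp: string, cookie: Buffer): boolean {
  const expected = issueCookie(clientIp);
  return cookie.length === expected.length && timingSafeEqual(cookie, expected);
}

console.log(validateCookie("203.0.113.7", issueCookie("203.0.113.7"))); // true
console.log(validateCookie("198.51.100.9", issueCookie("203.0.113.7"))); // false
```

Because the cookie depends on the client IP, a valid cookie captured from one address is useless when replayed from another, which is the property the amplification-attack mitigation relies on.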
```shell
# Linux: Enable TFO server-side (CDN edge servers)
$ sudo sysctl -w net.ipv4.tcp_fastopen=3
# 1 = client only, 2 = server only, 3 = both

# The per-listener TFO queue length (pending SYN+data requests) is set
# by the application via the TCP_FASTOPEN socket option
# (see the nginx fastopen= directive below)

# View TFO statistics
$ cat /proc/net/netstat | grep TFO
TcpExt: ... TCPFastOpenActive 15234 TCPFastOpenActiveFail 12
        TCPFastOpenPassive 45678 TCPFastOpenPassiveFail 5

# Nginx configuration for TFO
listen 443 ssl fastopen=256;  # 256 pending TFO connections allowed

# Verify TFO is working with curl
$ curl --tcp-fastopen https://example.com -I
```

TFO requires support from the client OS, server OS, and all middleboxes. Some firewalls and NATs strip TFO options, breaking the optimization. CDNs often see TFO work for 40-70% of connections, not 100%. Still significant, but not universal.
TFO security considerations:
TFO introduces a replay attack surface: if attackers capture a SYN+data packet, they can replay it until the cookie expires. Mitigations include restricting TFO data to idempotent requests, expiring and rotating cookies regularly, and rate-limiting TFO acceptance so suspicious traffic falls back to the standard three-way handshake.
TCP buffers directly limit achievable throughput. For high-bandwidth, high-latency links, undersized buffers become the bottleneck—a crucial consideration for CDN edge servers handling global traffic.
```
Maximum possible throughput is limited by:

    Throughput ≤ (Window Size) / RTT

Example: Sydney → Virginia, 160 ms RTT

Default Linux receive buffer max: 212,992 bytes (208 KB)
Maximum throughput: 208 KB / 0.16 s = 1.3 MB/s = 10.4 Mbps

Actual available bandwidth: 100 Mbps
Bandwidth-Delay Product: 100 Mbps × 0.16 s = 2 MB

To fully utilize the path, the window must reach 2 MB.
With a 208 KB max buffer, we can only use 10% of available bandwidth!

Solution: Increase buffer limits to exceed BDP
Target max buffer: 16-128 MB for long-haul connections
```
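The arithmetic above generalizes to any path; a small helper pair, with units as in the example:

```typescript
// Throughput ceiling imposed by the window: throughput <= window / RTT.
function maxThroughputMbps(windowBytes: number, rttSeconds: number): number {
  return (windowBytes * 8) / rttSeconds / 1e6;
}

// Bandwidth-delay product: bytes in flight needed to fill the pipe.
function bdpBytes(bandwidthMbps: number, rttSeconds: number): number {
  return ((bandwidthMbps * 1e6) / 8) * rttSeconds;
}

// The Sydney -> Virginia example:
console.log(maxThroughputMbps(212_992, 0.16)); // ~10.6 Mbps with default buffers
console.log(bdpBytes(100, 0.16));              // 2,000,000 bytes = 2 MB needed
```

Any buffer smaller than the BDP caps throughput below the link rate, which is why the tuning below raises the maximums well past the largest expected BDP.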
```shell
# Check current buffer settings
$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992

# Check TCP-specific tuning (min, default, max)
$ sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304

# CDN-optimized buffer settings:
$ sudo sysctl -w net.core.rmem_max=134217728    # 128 MB
$ sudo sysctl -w net.core.wmem_max=134217728    # 128 MB
$ sudo sysctl -w net.core.rmem_default=1048576  # 1 MB
$ sudo sysctl -w net.core.wmem_default=1048576  # 1 MB

# TCP auto-tuning settings (min, default, max)
$ sudo sysctl -w net.ipv4.tcp_rmem='4096 1048576 134217728'
$ sudo sysctl -w net.ipv4.tcp_wmem='4096 1048576 134217728'

# Enable window scaling for large windows
$ sudo sysctl -w net.ipv4.tcp_window_scaling=1

# Memory pressure tuning (in pages, not bytes: low, pressure, high)
$ sudo sysctl -w net.ipv4.tcp_mem='786432 1048576 1572864'
```

Buffer sizing strategy:
CDN servers typically configure maximums of 64-128 MB to handle the longest, highest-bandwidth paths (intercontinental transfers on gigabit links).
Large buffer limits don't immediately consume memory—TCP auto-tuning expands buffers only as needed. However, many concurrent long-haul connections can accumulate significant memory usage. CDN servers need sufficient RAM and careful memory pressure configuration.
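The memory concern in the note above is easy to quantify. A rough worst-case estimate, with hypothetical connection counts and buffer sizes:

```typescript
// Worst-case kernel buffer memory if every connection auto-tunes up:
// connections × (receive buffer + send buffer).
function bufferMemoryGiB(
  connections: number,
  rcvBytes: number,
  sndBytes: number,
): number {
  return (connections * (rcvBytes + sndBytes)) / 2 ** 30;
}

// 10,000 long-haul flows, each tuned to 4 MiB in both directions:
console.log(bufferMemoryGiB(10_000, 4 * 2 ** 20, 4 * 2 ** 20)); // 78.125 GiB
```

In practice auto-tuning keeps most connections far below the maximum, but the calculation shows why tcp_mem pressure thresholds and server RAM sizing matter on busy edge nodes.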
Traditional TCP sends data in bursts—when an ACK arrives acknowledging 10 packets, it immediately sends 10 more. These micro-bursts can overwhelm buffers in switches and routers, causing packet loss and triggering congestion control. Packet pacing smooths transmission for better performance.
```
BURSTY TCP (Traditional):

Time    Packets
0ms     [1][2][3][4][5][6][7][8][9][10]            <- all at once!
...     (waiting for ACKs)
160ms   [11][12][13][14][15][16][17][18][19][20]   <- burst again

Problem: 10 packets arrive at a switch simultaneously.
If the switch buffer holds 8 packets, 2 are lost immediately.

PACED TCP (e.g., BBR):

Time    Packets
0ms     [1]
2ms     [2]
4ms     [3]
6ms     [4]
...
18ms    [10]
(continuous stream vs bursts)

Result: The switch never sees more than 1-2 packets queued.
No buffer overflow, no loss, lower latency.
```

Enabling packet pacing:
BBR uses pacing inherently—its design centers on sending at exactly the measured bottleneck rate. For other congestion control algorithms:
```shell
# Set Fair Queue as the default qdisc for pacing
$ sudo sysctl -w net.core.default_qdisc=fq

# Apply FQ to existing interfaces
$ sudo tc qdisc replace dev eth0 root fq

# Check current qdisc
$ tc qdisc show dev eth0
qdisc fq 8001: root refcnt 2 limit 10000p flow_limit 100p buckets 1024
  orphan_mask 1023 quantum 3028b initial_quantum 15140b

# For BBR to work correctly, the FQ qdisc is recommended
# BBR + FQ = paced transmission with model-based congestion control

# Verify pacing is active on connections
$ ss -ti | grep pacing
    rtt:0.25/0.125 ... pacing_rate 125000bps
```

Pacing doesn't just prevent loss. By avoiding buffer buildup at intermediate nodes, paced connections see lower and more consistent RTT, which benefits congestion control accuracy and user-perceived performance.
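The pacing rate reported by `ss` translates directly into an inter-packet gap; the arithmetic, with illustrative numbers:

```typescript
// Inter-packet gap implied by a pacing rate: gap = packet bits / rate.
function pacingGapMs(packetBytes: number, pacingRateBps: number): number {
  return ((packetBytes * 8) / pacingRateBps) * 1000;
}

// 1500-byte packets paced at 6 Mbps go out one every 2 ms,
// instead of the whole window bursting at once.
console.log(pacingGapMs(1500, 6_000_000)); // 2
```

The qdisc enforces these gaps per flow, which is why FQ is the recommended companion to BBR's rate-based model.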
Beyond the major optimizations, CDNs configure numerous TCP options that collectively contribute to performance. Here are the key settings used in production deployments:
```shell
# /etc/sysctl.d/99-cdn-tcp.conf

# Congestion control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq

# Buffer tuning (min, default, max bytes)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 1048576 134217728
net.ipv4.tcp_wmem = 4096 1048576 134217728

# Initial window and slow start
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1

# TCP Fast Open
net.ipv4.tcp_fastopen = 3

# TIME_WAIT handling (connection reuse optimization)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Keepalive tuning (detect dead connections faster)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

# SYN handling (resist SYN floods, maintain performance)
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_syncookies = 1
net.core.somaxconn = 65536

# SACK and DSACK (loss recovery efficiency)
net.ipv4.tcp_sack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_early_retrans = 3

# MTU probing (find optimal packet size)
net.ipv4.tcp_mtu_probing = 1

# Timestamps (RTT measurement accuracy)
net.ipv4.tcp_timestamps = 1
```

| Setting | Purpose | CDN Value |
|---|---|---|
| tcp_slow_start_after_idle | Reset cwnd after idle? | 0 = Keep cwnd for reused connections |
| tcp_no_metrics_save | Store per-host metrics? | 1 = Don't let bad history penalize current connections |
| tcp_tw_reuse | Reuse TIME_WAIT sockets | 1 = Faster connection recycling under load |
| tcp_sack | Selective acknowledgment | 1 = Efficient recovery from multiple losses |
| tcp_early_retrans | Faster loss detection | 3 = Retransmit without waiting for full timeout |
| tcp_mtu_probing | Discover path MTU | 1 = Use larger packets when possible |
TCP tuning changes can have unexpected interactions with network equipment, firewalls, and client implementations. Always test changes in staging environments with realistic traffic patterns before deploying to production CDN nodes.
TCP optimization is a primary weapon in the CDN performance arsenal. By tuning the protocol layer, CDNs extract dramatically better performance from the same network infrastructure.
What's next:
The next page explores connection reuse—how CDNs maintain persistent, warm connections between edge servers and origin servers to completely bypass connection establishment overhead for forwarded requests.
You now understand TCP optimization at the level practiced by CDN engineers. These protocol-layer tunings complement the network-layer optimizations, together delivering the 50-70% latency improvements possible with dynamic content acceleration.