Every HTTP request, every API call, every WebSocket connection rides on top of TCP (Transmission Control Protocol). TCP provides the reliable, ordered delivery that applications depend on—but it was designed in the 1980s for a very different internet than today's global, high-bandwidth, latency-sensitive environment.
CDNs don't just use TCP—they optimize it aggressively. The difference between default TCP settings and CDN-tuned TCP can mean 2-3x throughput improvements and 30-50% latency reductions, particularly for dynamic content where every request traverses the full protocol stack.
This page explores TCP at the level CDN engineers work with it: congestion control algorithms, window sizing strategies, TCP Fast Open, and advanced kernel tuning. You'll understand why standard TCP configurations leave significant performance on the table.
TCP was designed for reliability over performance. Its default behaviors, while ensuring correctness, create significant performance penalties for modern web traffic. Understanding these challenges is essential before exploring optimizations.
```
Scenario: Sydney → Virginia, 160 ms RTT, 0% loss
Target throughput: 100 Mbps
Bandwidth-Delay Product: 100 Mbps × 0.16 s = 16 Mbit = 2 MB
Initial Congestion Window (IW): 10 segments (14 KB)

Round Trip | Window Size | Throughput
-----------+-------------+-----------
     1     |    14 KB    |   0.7 Mbps
     2     |    28 KB    |   1.4 Mbps
     3     |    56 KB    |   2.8 Mbps
     4     |   112 KB    |   5.6 Mbps
     5     |   224 KB    |  11.2 Mbps
     6     |   448 KB    |  22.4 Mbps
     7     |   896 KB    |  44.8 Mbps
     8     |   1.79 MB   |  89.6 Mbps
     9     |   2.0 MB    | 100.0 Mbps (saturated)

Time to full throughput: 9 RTT × 160 ms = 1.44 seconds

For a 50 KB dynamic API response, average throughput is ~5 Mbps,
not the 100 Mbps available, because slow start never finishes.
```

Slow start particularly penalizes dynamic content. With typical API responses under 100 KB, connections often complete before reaching full throughput. Each request pays the slow start penalty without ever benefiting from a warmed-up connection.
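The ramp in the table above can be reproduced in a few lines. This is a simplified model: the window doubles each round and caps at the BDP, ignoring ACK clocking and header overhead.

```typescript
// Simplified slow-start model for the Sydney -> Virginia example above.
const RTT_S = 0.16;          // 160 ms round trip
const BDP_BYTES = 2_000_000; // 100 Mbps x 0.16 s = 2 MB

// Window size after a given round, starting from iwBytes and doubling.
function windowAfterRound(round: number, iwBytes: number): number {
  return Math.min(iwBytes * 2 ** (round - 1), BDP_BYTES);
}

// Rounds until the window reaches the BDP (pipe fully utilized).
function roundsToSaturate(iwBytes: number): number {
  let round = 1;
  while (windowAfterRound(round, iwBytes) < BDP_BYTES) round++;
  return round;
}

const rounds = roundsToSaturate(14_000); // IW = 10 segments ≈ 14 KB
console.log(rounds, rounds * RTT_S);     // 9 rounds, 1.44 s to full throughput
```

Raising the initial window shifts the whole ramp left, which is exactly the lever discussed later in this page.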
Congestion control determines how TCP adjusts its sending rate in response to network conditions. The choice of algorithm dramatically impacts performance. CDNs typically use modern algorithms optimized for their specific traffic patterns.
| Algorithm | Loss Response | Characteristics | Best For |
|---|---|---|---|
| CUBIC (default) | Halve window on loss | Conservative, fair, widely deployed | General internet, shared links |
| BBR (Google) | Model-based, not loss-based | Aggressive, high throughput, may be unfair | Long-distance, high-bandwidth paths |
| BBRv2 | Improved fairness over BBR | Better coexistence with CUBIC | Production CDN deployments |
| QUIC CC | Pluggable, evolving | Works above UDP, avoids OS kernel limits | HTTP/3, performance-critical apps |
Deep dive: BBR (Bottleneck Bandwidth and Round-trip propagation time)
BBR represents a fundamental rethinking of congestion control. While CUBIC and its predecessors react to packet loss as a congestion signal, BBR actively measures two properties: the bottleneck bandwidth (BtlBw), the maximum delivery rate the path sustains, and the round-trip propagation time (RTprop), the minimum RTT observed when no queues are building.
BBR uses these measurements to operate at the optimal point—sending at exactly the bottleneck rate without building queues. This avoids the buffer bloat problem where CUBIC fills intermediate buffers, increasing latency.
```typescript
// BBR operates in four phases, cycling continuously
enum BBRPhase {
  STARTUP,   // Quickly find bottleneck bandwidth (exponential search)
  DRAIN,     // Drain queues created during startup
  PROBE_BW,  // Steady state: probe for more bandwidth periodically
  PROBE_RTT, // Periodically drain queues to measure true RTprop
}

interface BBRState {
  btlBw: number;      // Estimated bottleneck bandwidth (max filter)
  rtProp: number;     // Estimated propagation delay (min filter)
  cwnd: number;       // Congestion window
  pacingRate: number; // Sending rate (pacing, not burst)
  phase: BBRPhase;
}

// The key insight: pacing_rate = btlBw, cwnd = btlBw × rtProp
// This keeps exactly one BDP (bandwidth-delay product) in flight
function calculateSendingParameters(state: BBRState): void {
  // Target: send at bottleneck rate, keep pipe exactly full
  state.pacingRate = state.btlBw;

  // Congestion window = one bandwidth-delay product
  // This is the minimum buffer needed to keep the pipe full
  const bdp = state.btlBw * state.rtProp;
  state.cwnd = bdp * 1.25; // Small margin for measurement variance

  // In PROBE_BW, briefly increase rate to test for more capacity
  // In PROBE_RTT, reduce cwnd to drain queues and measure true RTT
}
```

BBR performance advantages for CDNs:

- Sustained high throughput on long-distance, high-bandwidth paths
- Tolerance of random (non-congestion) packet loss, since loss is not the primary congestion signal
- Lower latency, because intermediate buffers are not filled (no bufferbloat)
BBRv1 was criticized for being too aggressive—it could starve CUBIC flows competing on the same link. BBRv2 addresses this with improved fairness, making it more suitable for production CDN deployments where the CDN doesn't control all traffic on the path.
The initial congestion window (IW) determines how much data TCP can send before receiving the first acknowledgment. For short transfers (typical API responses), IW directly determines performance—there's no time for slow start to increase the window.
```shell
# Check current default initial window
$ ss -i | grep -o "cwnd:[0-9]*" | head -1
cwnd:10

# The default IW of 10 segments ≈ 14 KB
# Modern recommendations: IW of 20-30 segments

# Increase initial window via route configuration
$ sudo ip route change default via 10.0.0.1 initcwnd 20 initrwnd 20

# For CDN servers, set system-wide via sysctl
$ sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0
# Prevents cwnd reset after idle periods

# CDN production configuration often includes:
$ cat /etc/sysctl.d/99-tcp-optimization.conf
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
```

| Response Size | IW=10 (14KB) | IW=20 (28KB) | IW=30 (42KB) | Improvement |
|---|---|---|---|---|
| 10 KB | 1 RTT | 1 RTT | 1 RTT | None (fits in IW) |
| 20 KB | 2 RTTs | 1 RTT | 1 RTT | 50% less latency |
| 40 KB | 3 RTTs | 2 RTTs | 1 RTT | 33-67% less latency |
| 80 KB | 4 RTTs | 3 RTTs | 2 RTTs | 25-50% less latency |
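The table can be sanity-checked with a toy model. This sketch assumes the window doubles each round and ignores ACK clocking and header overhead, so its counts can differ from the table by a round for sizes near a window boundary.

```typescript
const MSS = 1460; // bytes per segment (typical Ethernet MSS)

// RTTs needed to deliver `bytes` when the window starts at
// `iwSegments` and doubles each round (slow start, no loss).
function rttsToDeliver(bytes: number, iwSegments: number): number {
  let window = iwSegments * MSS;
  let sent = 0;
  let rtts = 0;
  while (sent < bytes) {
    sent += window;
    window *= 2; // slow start doubles cwnd every RTT
    rtts += 1;
  }
  return rtts;
}

console.log(rttsToDeliver(20_000, 20)); // a 20 KB response fits in IW=20: 1 RTT
console.log(rttsToDeliver(40_000, 20)); // a 40 KB response at IW=20: 2 RTTs
```

The pattern matches the table's message: for short responses, the initial window, not the link bandwidth, sets the delivery time.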
Calculating the optimal initial window:
The ideal IW depends on typical response sizes. Most CDNs serving dynamic API responses set IW between 20 and 40 segments (28-56 KB), covering the majority of responses in a single round trip.
By default, TCP resets the congestion window after a connection goes idle (tcp_slow_start_after_idle). This devastates performance for keep-alive connections with bursty traffic. CDNs disable this, preserving the accumulated window for reuse.
TCP Fast Open (TFO) eliminates the three-way handshake penalty for repeat visitors by allowing data in the initial SYN packet. This can reduce connection latency by an entire round trip—significant for geographically distant users.
```
STANDARD TCP (3-way handshake):

Client                            Server
  |                                 |
  |-------- SYN ------------------->|  RTT 1
  |<------- SYN-ACK ----------------|
  |-------- ACK + HTTP Request ---->|  RTT 2
  |<------- HTTP Response ----------|
  |                                 |
Total: 2 RTT before response begins

TCP FAST OPEN (with cached cookie):

Client                            Server
  |                                 |
  |-- SYN + Cookie + HTTP --------->|  RTT 1
  |<-- SYN-ACK + HTTP Response -----|
  |-------- ACK ------------------->|  (concurrent)
  |                                 |
Total: 1 RTT before response begins

SAVINGS: 1 full RTT (80-200 ms for distant users)
```

How TFO works:
1. Initial connection: The client requests a TFO cookie in the SYN options. The server generates a cryptographic cookie based on the client IP and a server secret.
2. Cookie caching: The client stores the cookie locally (typically for hours to days).
3. Subsequent connections: The client includes the cookie and application data in the SYN packet. The server validates the cookie and immediately processes the request.
4. Security: The cookie prevents abuse. Attackers can't forge cookies for IP addresses they don't control, mitigating amplification attacks.
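The cookie generation and validation described above can be sketched as an HMAC over the client IP. This is an illustrative model only: the secret, hash choice, and truncation length are placeholders, not the kernel's actual construction.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical server secret; real deployments rotate this periodically.
const SERVER_SECRET = Buffer.from("rotate-me-periodically");

// Issue a cookie bound to the client's IP. TFO cookies are small
// (4-16 bytes), so the HMAC is truncated.
function issueCookie(clientIp: string): Buffer {
  return createHmac("sha256", SERVER_SECRET)
    .update(clientIp)
    .digest()
    .subarray(0, 8);
}

// Validate a cookie presented in a later SYN. Constant-time comparison
// avoids leaking information about partial matches.
function validateCookie(clientIp: string, cookie: Buffer): boolean {
  const expected = issueCookie(clientIp);
  return cookie.length === expected.length && timingSafeEqual(cookie, expected);
}

console.log(validateCookie("203.0.113.7", issueCookie("203.0.113.7"))); // true
console.log(validateCookie("198.51.100.9", issueCookie("203.0.113.7"))); // false
```

Because the cookie depends on the client IP, a valid cookie captured from one address is useless when replayed from another, which is the property the amplification-attack mitigation relies on.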
```shell
# Linux: Enable TFO server-side (CDN edge servers)
$ sudo sysctl -w net.ipv4.tcp_fastopen=3
# 1 = client only, 2 = server only, 3 = both

# The per-listener TFO queue length (pending SYN+data requests) is set
# by the application via the TCP_FASTOPEN socket option
# (see the nginx fastopen= directive below)

# View TFO statistics
$ cat /proc/net/netstat | grep TFO
TcpExt: ... TCPFastOpenActive 15234 TCPFastOpenActiveFail 12
        TCPFastOpenPassive 45678 TCPFastOpenPassiveFail 5

# Nginx configuration for TFO
listen 443 ssl fastopen=256;  # 256 pending TFO connections allowed

# Verify TFO is working with curl
$ curl --tcp-fastopen https://example.com -I
```

TFO requires support from the client OS, server OS, and all middleboxes. Some firewalls and NATs strip TFO options, breaking the optimization. CDNs often see TFO work for 40-70% of connections, not 100%. Still significant, but not universal.
TFO security considerations:
TFO introduces a replay attack surface: if attackers capture a SYN+data packet, they can replay it until the cookie expires. Mitigations include restricting TFO data to idempotent requests, expiring and rotating cookies regularly, and rate-limiting TFO acceptance so suspicious traffic falls back to the standard three-way handshake.
TCP buffers directly limit achievable throughput. For high-bandwidth, high-latency links, undersized buffers become the bottleneck—a crucial consideration for CDN edge servers handling global traffic.
```
Maximum possible throughput is limited by:

    Throughput ≤ (Window Size) / RTT

Example: Sydney → Virginia, 160 ms RTT

Default Linux receive buffer max: 212,992 bytes (208 KB)
Maximum throughput: 208 KB / 0.16 s = 1.3 MB/s = 10.4 Mbps

Actual available bandwidth: 100 Mbps
Bandwidth-Delay Product: 100 Mbps × 0.16 s = 2 MB

To fully utilize the path, the window must reach 2 MB.
With a 208 KB max buffer, we can only use 10% of available bandwidth!

Solution: Increase buffer limits to exceed BDP
Target max buffer: 16-128 MB for long-haul connections
```
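The arithmetic above generalizes to any path; a small helper pair, with units as in the example:

```typescript
// Throughput ceiling imposed by the window: throughput <= window / RTT.
function maxThroughputMbps(windowBytes: number, rttSeconds: number): number {
  return (windowBytes * 8) / rttSeconds / 1e6;
}

// Bandwidth-delay product: bytes in flight needed to fill the pipe.
function bdpBytes(bandwidthMbps: number, rttSeconds: number): number {
  return ((bandwidthMbps * 1e6) / 8) * rttSeconds;
}

// The Sydney -> Virginia example:
console.log(maxThroughputMbps(212_992, 0.16)); // ~10.6 Mbps with default buffers
console.log(bdpBytes(100, 0.16));              // 2,000,000 bytes = 2 MB needed
```

Any buffer smaller than the BDP caps throughput below the link rate, which is why the tuning below raises the maximums well past the largest expected BDP.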
```shell
# Check current buffer settings
$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992

# Check TCP-specific tuning (min, default, max)
$ sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304

# CDN-optimized buffer settings:
$ sudo sysctl -w net.core.rmem_max=134217728    # 128 MB
$ sudo sysctl -w net.core.wmem_max=134217728    # 128 MB
$ sudo sysctl -w net.core.rmem_default=1048576  # 1 MB
$ sudo sysctl -w net.core.wmem_default=1048576  # 1 MB

# TCP auto-tuning settings (min, default, max)
$ sudo sysctl -w net.ipv4.tcp_rmem='4096 1048576 134217728'
$ sudo sysctl -w net.ipv4.tcp_wmem='4096 1048576 134217728'

# Enable window scaling for large windows
$ sudo sysctl -w net.ipv4.tcp_window_scaling=1

# Memory pressure tuning (in pages, not bytes: low, pressure, high)
$ sudo sysctl -w net.ipv4.tcp_mem='786432 1048576 1572864'
```

Buffer sizing strategy:
CDN servers typically configure maximums of 64-128 MB to handle the longest, highest-bandwidth paths (intercontinental transfers on gigabit links).
Large buffer limits don't immediately consume memory—TCP auto-tuning expands buffers only as needed. However, many concurrent long-haul connections can accumulate significant memory usage. CDN servers need sufficient RAM and careful memory pressure configuration.
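The memory concern in the note above is easy to quantify. A rough worst-case estimate, with hypothetical connection counts and buffer sizes:

```typescript
// Worst-case kernel buffer memory if every connection auto-tunes up:
// connections × (receive buffer + send buffer).
function bufferMemoryGiB(
  connections: number,
  rcvBytes: number,
  sndBytes: number,
): number {
  return (connections * (rcvBytes + sndBytes)) / 2 ** 30;
}

// 10,000 long-haul flows, each tuned to 4 MiB in both directions:
console.log(bufferMemoryGiB(10_000, 4 * 2 ** 20, 4 * 2 ** 20)); // 78.125 GiB
```

In practice auto-tuning keeps most connections far below the maximum, but the calculation shows why tcp_mem pressure thresholds and server RAM sizing matter on busy edge nodes.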
Traditional TCP sends data in bursts—when an ACK arrives acknowledging 10 packets, it immediately sends 10 more. These micro-bursts can overwhelm buffers in switches and routers, causing packet loss and triggering congestion control. Packet pacing smooths transmission for better performance.
```
BURSTY TCP (Traditional):

Time    Packets
0ms     [1][2][3][4][5][6][7][8][9][10]            <- all at once!
...     (waiting for ACKs)
160ms   [11][12][13][14][15][16][17][18][19][20]   <- burst again

Problem: 10 packets arrive at a switch simultaneously.
If the switch buffer holds 8 packets, 2 are lost immediately.

PACED TCP (e.g., BBR):

Time    Packets
0ms     [1]
2ms     [2]
4ms     [3]
6ms     [4]
...
18ms    [10]
(continuous stream vs bursts)

Result: The switch never sees more than 1-2 packets queued.
No buffer overflow, no loss, lower latency.
```

Enabling packet pacing:
BBR uses pacing inherently—its design centers on sending at exactly the measured bottleneck rate. For other congestion control algorithms:
```shell
# Set Fair Queue as the default qdisc for pacing
$ sudo sysctl -w net.core.default_qdisc=fq

# Apply FQ to existing interfaces
$ sudo tc qdisc replace dev eth0 root fq

# Check current qdisc
$ tc qdisc show dev eth0
qdisc fq 8001: root refcnt 2 limit 10000p flow_limit 100p buckets 1024
  orphan_mask 1023 quantum 3028b initial_quantum 15140b

# For BBR to work correctly, the FQ qdisc is recommended
# BBR + FQ = paced transmission with model-based congestion control

# Verify pacing is active on connections
$ ss -ti | grep pacing
    rtt:0.25/0.125 ... pacing_rate 125000bps
```

Pacing doesn't just prevent loss. By avoiding buffer buildup at intermediate nodes, paced connections see lower and more consistent RTT, which benefits congestion control accuracy and user-perceived performance.
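The pacing rate reported by `ss` translates directly into an inter-packet gap; the arithmetic, with illustrative numbers:

```typescript
// Inter-packet gap implied by a pacing rate: gap = packet bits / rate.
function pacingGapMs(packetBytes: number, pacingRateBps: number): number {
  return ((packetBytes * 8) / pacingRateBps) * 1000;
}

// 1500-byte packets paced at 6 Mbps go out one every 2 ms,
// instead of the whole window bursting at once.
console.log(pacingGapMs(1500, 6_000_000)); // 2
```

The qdisc enforces these gaps per flow, which is why FQ is the recommended companion to BBR's rate-based model.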
Beyond the major optimizations, CDNs configure numerous TCP options that collectively contribute to performance. Here are the key settings used in production deployments:
```shell
# /etc/sysctl.d/99-cdn-tcp.conf

# Congestion control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq

# Buffer tuning (min, default, max bytes)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 1048576 134217728
net.ipv4.tcp_wmem = 4096 1048576 134217728

# Initial window and slow start
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1

# TCP Fast Open
net.ipv4.tcp_fastopen = 3

# TIME_WAIT handling (connection reuse optimization)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10

# Keepalive tuning (detect dead connections faster)
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6

# SYN handling (resist SYN floods, maintain performance)
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_syncookies = 1
net.core.somaxconn = 65536

# SACK and DSACK (loss recovery efficiency)
net.ipv4.tcp_sack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_early_retrans = 3

# MTU probing (find optimal packet size)
net.ipv4.tcp_mtu_probing = 1

# Timestamps (RTT measurement accuracy)
net.ipv4.tcp_timestamps = 1
```

| Setting | Purpose | CDN Value |
|---|---|---|
| tcp_slow_start_after_idle | Reset cwnd after idle? | 0 = Keep cwnd for reused connections |
| tcp_no_metrics_save | Store per-host metrics? | 1 = Don't let bad history penalize current connections |
| tcp_tw_reuse | Reuse TIME_WAIT sockets | 1 = Faster connection recycling under load |
| tcp_sack | Selective acknowledgment | 1 = Efficient recovery from multiple losses |
| tcp_early_retrans | Faster loss detection | 3 = Retransmit without waiting for full timeout |
| tcp_mtu_probing | Discover path MTU | 1 = Use larger packets when possible |
TCP tuning changes can have unexpected interactions with network equipment, firewalls, and client implementations. Always test changes in staging environments with realistic traffic patterns before deploying to production CDN nodes.
TCP optimization is a primary weapon in the CDN performance arsenal. By tuning the protocol layer, CDNs extract dramatically better performance from the same network infrastructure.
What's next:
The next page explores connection reuse—how CDNs maintain persistent, warm connections between edge servers and origin servers to completely bypass connection establishment overhead for forwarded requests.
You now understand TCP optimization at the level practiced by CDN engineers. These protocol-layer tunings complement the network-layer optimizations, together delivering the 50-70% latency improvements possible with dynamic content acceleration.