Consider a tightrope walker who has successfully reached the middle of the rope. The aggressive climbing phase is over—now comes the delicate task of maintaining balance while inching forward. One misstep could mean disaster, yet standing still means never reaching the goal.
This is TCP's situation in congestion avoidance. Having used slow start to rapidly discover network capacity, TCP now faces a more subtle challenge: how to fully utilize available bandwidth while avoiding the congestion that would force a painful restart. The answer is conservative, linear growth—adding just enough capacity to test for headroom without overwhelming the network.
Congestion avoidance represents the steady-state behavior of most long-lived TCP connections. Understanding it is essential for optimizing throughput, diagnosing performance issues, and appreciating how millions of connections cooperatively share the Internet's capacity.
This page covers why congestion avoidance uses linear growth instead of exponential, the exact mechanics of the cwnd increase algorithm, how connections probe for additional capacity, the relationship between congestion avoidance and network stability, and the AIMD (Additive Increase, Multiplicative Decrease) principle that ensures fairness.
Congestion avoidance begins when slow start ends. Recall the two termination conditions for slow start: cwnd reaches ssthresh, or a loss event signals that the network's limit has already been crossed.
In both cases, continuing exponential growth would be dangerous. The network is either near its limit (ssthresh approached) or has already signaled overload (loss). Congestion avoidance responds with a fundamentally different growth strategy.
The philosophical shift:
Slow start asks: "What's the maximum capacity I can discover?"
Congestion avoidance asks: "Is there any additional capacity I can safely claim?"
This shift from exploration to exploitation changes the growth rate from exponential to linear. The sender transitions from doubling the transmission rate each RTT to adding one segment per RTT—roughly a factor of N difference for a window of N segments.
| Property | Slow Start | Congestion Avoidance |
|---|---|---|
| Goal | Discover capacity rapidly | Maintain near-capacity utilization |
| Growth type | Exponential (×2 per RTT) | Linear (+1 segment per RTT) |
| Risk level | High (will overshoot) | Low (gentle probing) |
| Typical duration | 3-10 RTTs | Indefinite (entire connection) |
| cwnd update rule | cwnd += MSS per ACK | cwnd += MSS²/cwnd per ACK |
| Triggered by | Connection start, loss events | cwnd ≥ ssthresh |
| Terminated by | cwnd ≥ ssthresh or loss | Loss only |
The seamless transition:
The transition from slow start to congestion avoidance is often invisible to the user. There's no pause or reset—the sender simply changes the formula for cwnd updates. What was "add one MSS per ACK" becomes "add a fraction of MSS per ACK."
Example transition (cwnd = ssthresh = 64 segments):
Last slow start RTT (cwnd = 32→64):
- 32 segments sent
- 32 ACKs received
- cwnd grows by 32 × MSS, reaching 64 MSS total
- Growth: 2×
First congestion avoidance RTT (cwnd = 64→65):
- 64 segments sent
- 64 ACKs received
- cwnd grows by 64 × (MSS/64) = 1 MSS, reaching 65 MSS total
- Growth: 1.015625×
The growth rate drops dramatically, from doubling to a roughly 1.6% increase. This is the intended behavior: slow start found the capacity, now congestion avoidance maintains it.
Many short connections (HTTP requests, DNS queries) never leave slow start. If the transfer completes before cwnd reaches ssthresh, congestion avoidance is never entered. This is why initial window size (IW=10) and connection reuse matter so much for web performance.
The congestion avoidance algorithm can be expressed in several equivalent ways. Let's explore each formulation:
Per-ACK formulation (standard):
For each ACK received:
cwnd = cwnd + (MSS × MSS) / cwnd
Or equivalently:
cwnd = cwnd + MSS / cwnd (when cwnd is in MSS units)
Why this formula?
Let's analyze what happens over one RTT when cwnd = N segments: the sender transmits N segments and receives N ACKs, and each ACK adds MSS/N to cwnd, for a total increase of N × (MSS/N) = 1 MSS.
The result: cwnd increases by exactly one MSS per RTT, regardless of the current cwnd value. This is linear growth—additive, not multiplicative.
```
// Congestion Avoidance Implementation
// Called when ACK is received and cwnd >= ssthresh

function on_ack_received_congestion_avoidance():

    // Method 1: Fractional MSS per ACK (common)
    cwnd_increment = MSS * MSS / cwnd
    cwnd = cwnd + cwnd_increment

    // Method 2: Counter-based (avoids fractions)
    // Increment counter for each ACK
    cwnd_counter = cwnd_counter + 1
    if cwnd_counter >= cwnd / MSS:
        cwnd = cwnd + MSS
        cwnd_counter = 0

    // Method 3: Direct formula per RTT
    // (Used in some implementations)
    // After receiving cwnd/MSS ACKs:
    //     new_cwnd = old_cwnd + MSS

// Parameters
MSS = 1460        // Maximum Segment Size (bytes)
cwnd = 65535      // Example: ~45 segments
ssthresh = 60000  // Below cwnd, so in congestion avoidance

// Per-ACK calculation
increment = 1460 * 1460 / 65535   // ≈ 32.5 bytes
// After 45 ACKs: 45 × 32.5 ≈ 1462 bytes ≈ 1 MSS
```
Mathematical analysis:
Let cwnd(t) be the congestion window at time t (measured in RTTs). During congestion avoidance:
cwnd(t+1) = cwnd(t) + 1 (in MSS units)
Solving this linear recurrence:
cwnd(t) = cwnd(0) + t (in MSS units)
This is pure linear growth—a straight line on a cwnd vs. time graph. Compare to slow start's exponential:
Slow start: cwnd(t) = cwnd(0) × 2^t
Congestion avoidance: cwnd(t) = cwnd(0) + t
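To make the contrast concrete, here's a tiny script (illustrative only) that evaluates both closed forms from a common starting window:

```python
# Evaluate the two closed forms above in MSS units, starting from
# cwnd(0) = 10 segments. Purely illustrative numbers.

cwnd0 = 10
for t in range(6):
    slow_start = cwnd0 * 2 ** t   # exponential: doubles each RTT
    avoidance = cwnd0 + t         # linear: one extra segment each RTT
    print(f"RTT {t}: slow start {slow_start:4d} MSS, avoidance {avoidance:3d} MSS")
# By RTT 5: slow start reaches 320 MSS, congestion avoidance just 15 MSS.
```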
Practical implications:
The formula cwnd += MSS²/cwnd assumes one ACK per segment. With delayed ACKs (one ACK per two segments), uncorrected implementations would grow at half the expected rate. Appropriate Byte Counting (ABC, RFC 3465) addresses this by incrementing based on bytes acknowledged, not ACK count.
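A minimal sketch of the byte-counting idea, assuming illustrative names (on_ack, bytes_acked_accum) of our own rather than any real stack's API:

```python
# Appropriate Byte Counting (RFC 3465) sketch: key cwnd growth to bytes
# acknowledged rather than to the number of ACKs received.

MSS = 1460             # Maximum Segment Size (bytes)
cwnd = 45 * MSS        # congestion window in bytes
bytes_acked_accum = 0  # bytes ACKed since the last cwnd increase

def on_ack(acked_bytes: int) -> None:
    """Add 1 MSS per cwnd-worth of ACKed bytes: delayed ACKs (one ACK
    covering two segments) no longer halve the growth rate."""
    global cwnd, bytes_acked_accum
    bytes_acked_accum += acked_bytes
    if bytes_acked_accum >= cwnd:   # a full window has been acknowledged
        bytes_acked_accum -= cwnd
        cwnd += MSS                 # additive increase: +1 MSS per RTT
```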
TCP's congestion control follows a principle called AIMD: Additive Increase, Multiplicative Decrease. This pattern is fundamental to understanding both individual connection behavior and network-wide fairness.
Additive Increase (AI):
During congestion avoidance, cwnd increases by a constant amount each RTT:
cwnd(t+1) = cwnd(t) + α
where α = 1 MSS in standard TCP. This is the "additive" part.
Multiplicative Decrease (MD):
When loss is detected, cwnd is reduced by a multiplicative factor:
cwnd(new) = cwnd(old) × β
where β = 0.5 in standard TCP (i.e., cwnd is halved). This is the "multiplicative" part: the reduction is proportional to the current rate, so faster flows shed more.
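The two rules fit in a few lines. This sketch uses α and β as parameters with the standard Reno values; the function names are ours, not from any particular implementation:

```python
# AIMD in MSS units: constant growth per RTT, proportional cut on loss.

ALPHA = 1.0  # additive increase: segments added per loss-free RTT
BETA = 0.5   # multiplicative decrease: factor applied on loss

def on_rtt_without_loss(cwnd: float) -> float:
    """Additive increase: probe upward by a fixed amount."""
    return cwnd + ALPHA

def on_loss(cwnd: float) -> float:
    """Multiplicative decrease: shed load in proportion to current rate."""
    return cwnd * BETA
```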
Why this combination?
AIMD has a remarkable property: it converges to fair bandwidth allocation. Let's see why:
Fairness convergence proof (intuitive):
Imagine two flows sharing a bottleneck. Flow A has cwnd = 100, Flow B has cwnd = 20 (unfair).
Scenario: Bottleneck capacity = 60 segments
Phase 1 - Additive Increase:
A: 100→101, B: 20→21 (both add 1)
Combined: 101 + 21 = 122, exceeds capacity, loss occurs
Phase 2 - Multiplicative Decrease (both flows experience loss):
A: 101→50.5, B: 21→10.5
Phase 3 - Additive Increase again:
A: 50.5→51.5, B: 10.5→11.5
... continues ...
After many cycles:
Both flows converge toward 30 segments each (fair share)
The key insight:
Over many cycles, the ratio of flow rates approaches 1:1.
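You can watch this convergence numerically. The toy simulation below (our own construction) assumes both flows see the same bottleneck, add 1 MSS per RTT, and halve simultaneously when the shared capacity is exceeded:

```python
# Two AIMD flows sharing one bottleneck: unequal windows converge because
# the gap between them halves at every synchronized loss.

def aimd_cycles(a: float, b: float, capacity: float, cycles: int = 30):
    for _ in range(cycles):
        while a + b <= capacity:  # additive increase: both gain 1 per RTT
            a, b = a + 1, b + 1
        a, b = a * 0.5, b * 0.5   # multiplicative decrease on shared loss
    return a, b

a, b = aimd_cycles(100.0, 20.0, capacity=60.0)
print(f"A = {a:.2f}, B = {b:.2f}, ratio = {a / b:.3f}")
# The ratio approaches 1:1; each post-loss window settles near
# capacity/4, oscillating up toward the fair share of capacity/2.
```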
| Variant | Additive Increase (α) | Multiplicative Decrease (β) | Fairness Trade-off |
|---|---|---|---|
| TCP Reno | 1 MSS/RTT | 0.5 | Classical fairness |
| TCP CUBIC | ~1 MSS/RTT (adjusted) | 0.7 | Better high-BDP performance |
| TCP Vegas | 1 MSS/RTT while measured RTT stays low | Backs off when RTT rises (before loss) | Delay-based, preemptive |
| Compound TCP | 1 MSS/RTT + dwnd | 0.5 | Hybrid loss/delay |
| DCTCP | 1 MSS/RTT | Variable (ECN-based) | Fine-grained for data centers |
AIMD isn't just about fairness—it also ensures network stability. Multiplicative decrease quickly reduces aggregate load when congestion occurs, creating headroom for recovery. Additive increase slowly fills this headroom, preventing synchronized oscillations. This combination makes the Internet stable even with millions of competing flows.
The combination of additive increase and multiplicative decrease creates a characteristic sawtooth pattern in the cwnd over time. This pattern is one of the most recognizable signatures of TCP behavior.
Understanding the sawtooth:
cwnd ▲
│ ╱│ ╱│ ╱│
│ ╱ │ ╱ │ ╱ │
│ ╱ │ ╱ │ ╱ │
│ ╱ │ ╱ │ ╱ │
│ ╱ │ ╱ │ ╱ │
│ ╱ │╱ │╱ │
└─────────────────────────────────────► time
Rise Drop Rise Drop
(linear) (half) (linear) (half)
Phases of the sawtooth: a linear rise as additive increase probes for headroom, a loss event when cwnd exceeds capacity, then a multiplicative drop to half, after which the cycle repeats.
Sawtooth characteristics:
Let W be the point where loss occurs (the network capacity in segments). The sawtooth oscillates between W/2 and W:
Minimum cwnd: W/2 (immediately after loss)
Maximum cwnd: W (immediately before loss)
Average cwnd: 3W/4 (due to linear rise)
Cycle duration: W/2 RTTs (to grow from W/2 to W)
Throughput analysis:
The average throughput during the sawtooth is:
Avg throughput = 3W/4 × MSS / RTT
Compared to the ideal throughput (cwnd always at W):
Efficiency = (3W/4) / W = 75%
This 25% loss is the cost of TCP's probing behavior. The connection cannot maintain exactly the right rate—it must probe upward until loss confirms the limit.
Real-world deviations:
Practical sawtooths differ from the ideal: losses arrive irregularly, competing flows interact, and router buffers smooth the peaks. The table below shows how the ideal cycle scales with link speed and RTT, assuming MSS = 1460 bytes:
| Network Capacity | BDP | Sawtooth Period | Avg Throughput | RTT Impact |
|---|---|---|---|---|
| 10 Mbps, 50ms RTT | 62 KB | ~21 RTTs (1.05s) | ~7.5 Mbps | Moderate oscillation |
| 100 Mbps, 50ms RTT | 625 KB | ~214 RTTs (10.7s) | ~75 Mbps | Long smooth cycles |
| 1 Gbps, 5ms RTT | 625 KB | ~214 RTTs (1.07s) | ~750 Mbps | Fast cycles |
| 1 Gbps, 100ms RTT | 12.5 MB | ~4267 RTTs (7.1min) | ~750 Mbps | Very slow recovery |
| 10 Gbps, 1ms RTT | 1.25 MB | ~427 RTTs (0.43s) | ~7.5 Gbps | Rapid oscillation |
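The table's numbers follow from the ideal-sawtooth formulas above. Here's a small back-of-the-envelope calculator, assuming MSS = 1460 bytes:

```python
# Ideal Reno sawtooth statistics: cwnd oscillates between W/2 and W,
# the period is W/2 RTTs, and average throughput is 3/4 of capacity.

MSS = 1460  # bytes

def sawtooth_stats(link_bps: float, rtt_s: float):
    bdp_bytes = link_bps * rtt_s / 8     # bandwidth-delay product
    w_segments = bdp_bytes / MSS         # window at the loss point
    period_rtts = w_segments / 2         # RTTs to regrow from W/2 to W
    return bdp_bytes, period_rtts, period_rtts * rtt_s, 0.75 * link_bps

bdp, rtts, secs, avg = sawtooth_stats(100e6, 0.050)  # 100 Mbps, 50 ms RTT
print(f"BDP {bdp / 1e3:.0f} KB, period ~{rtts:.0f} RTTs ({secs:.1f} s), "
      f"avg ~{avg / 1e6:.0f} Mbps")
# → BDP 625 KB, period ~214 RTTs (10.7 s), avg ~75 Mbps
```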
On high-bandwidth, high-latency links ("long fat networks" or LFNs), the sawtooth period becomes extremely long. A 10 Gbps transatlantic link (100ms RTT) has a BDP of 125 MB. After a loss halves cwnd, recovering to full capacity takes over 42,000 RTTs—more than an hour! This is why CUBIC and BBR were developed—to accelerate recovery on high-BDP networks.
Even in congestion avoidance, TCP continues to probe for additional capacity. Network conditions change—competing flows end, routing changes, or capacity increases. Congestion avoidance's linear growth is a continuous probe: each RTT, the sender tests whether one more segment can fit.
Why linear probing works:
The question "Is there more capacity?" has an unknown answer that changes over time. TCP's approach:
This probing continues indefinitely. Even a connection that has been running for hours will still add 1 MSS per RTT, constantly testing for available capacity.
Probing efficiency:
Linear probing has a specific cost-benefit trade-off: each RTT risks at most one segment of overshoot, while each successful probe claims one additional segment of capacity.
When probing finds new capacity:
Suppose a competing flow ends, leaving the bottleneck's full capacity of 100 MSS to our flow:
RTT 0: Available capacity = 100 MSS, our cwnd = 50 MSS
RTT 1: cwnd = 51, capacity used = 51, no loss
RTT 2: cwnd = 52, capacity used = 52, no loss
...
RTT 50: cwnd = 100, all capacity claimed
RTT 51: cwnd = 101, exceeds capacity, loss, halve cwnd
The probing eventually discovers and claims the new capacity, though it takes O(capacity) RTTs. This is slow compared to slow start but safe.
Faster probing approaches:
Recognizing that linear probing is slow for large capacity changes, some TCP variants use hybrid approaches, such as CUBIC's faster-than-linear growth once the window moves well past the last loss point (described below).
To see congestion avoidance in action, use ss -ti on Linux during a file transfer. The cwnd field should show gradual increases (typically 1-2 segments per snapshot interval). If cwnd is growing rapidly after the initial phase, something unusual may be happening (application-limited, or non-standard congestion control).
One of congestion avoidance's most important properties is its contribution to fair bandwidth allocation among competing flows. Let's analyze how this works:
The fairness definition:
N flows sharing a bottleneck of capacity C are "fair" when each receives C/N throughput. In practice, some variation is acceptable, measured by Jain's Fairness Index:
Fairness Index = (Σxᵢ)² / (N × Σxᵢ²)
where xᵢ is flow i's throughput. Perfect fairness = 1.0.
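The index is easy to compute; the example inputs below are illustrative:

```python
# Jain's Fairness Index: 1.0 means all flows get equal throughput;
# it falls toward 1/N as one flow dominates.

def jain_index(throughputs: list[float]) -> float:
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))

print(jain_index([50, 50]))  # 1.0    (perfectly fair split)
print(jain_index([80, 20]))  # ~0.735 (the unfair split analyzed below)
```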
Why congestion avoidance promotes fairness:
Consider two flows, A and B, sharing a 100 Mbps link:
Case 1: Both start fair (50 Mbps each)
- Both add 1 MSS/RTT
- Both reach loss at same time
- Both halve → still fair
Case 2: A has 80 Mbps, B has 20 Mbps (unfair)
- Both add 1 MSS/RTT (equal absolute increase)
- Both lose when combined exceeds 100 Mbps
- A loses 40 Mbps, B loses 10 Mbps (proportional decrease)
- A now has 40 Mbps, B has 10 Mbps; the ratio is unchanged by the halving, but each additive phase narrows it
After many cycles:
- Ratio A:B approaches 1:1
| Cycle | Flow A (Mbps) | Flow B (Mbps) | Ratio A:B | Combined |
|---|---|---|---|---|
| 0 (initial) | 80 | 20 | 4:1 | 100 (loss) |
| After halve | 40 | 10 | 4:1 | 50 |
| After growth | 60 | 30 | 2:1 | 90 |
| After growth | 70 | 40 | 1.75:1 | 110 (loss) |
| After halve | 35 | 20 | 1.75:1 | 55 |
| ... | ... | ... | ... | ... |
| Converged | 50 | 50 | 1:1 | 100 (loss) |
RTT fairness issue:
There's an important caveat: flows with different RTTs won't converge to equal throughput. A flow with half the RTT completes its feedback loop twice as often, so its additive increase runs twice as fast in real time and it claims a disproportionate share of the bottleneck.
This "RTT unfairness" means that connections to nearby servers tend to dominate over connections to distant servers. Some TCP variants (CUBIC, BBR) attempt to reduce RTT bias.
Multiple bottleneck fairness:
The fairness analysis assumes a single bottleneck. In practice, networks have complex topologies. A flow's fair share depends on which specific links it traverses and their individual capacities. TCP's AIMD ensures local fairness at each congested link but doesn't guarantee end-to-end max-min fairness.
Perfect fairness (equal throughput for all flows) may not be efficient. A large file transfer might benefit from more bandwidth than a small HTTP request. Some argue for 'proportional fairness' or 'utility-based fairness' instead. However, TCP's simple AIMD provides a reasonable default that doesn't require explicit coordination.
The original congestion avoidance algorithm (Reno-style linear growth) has limitations on modern high-speed networks. Several enhancements have been developed:
CUBIC Congestion Avoidance (Linux default):
CUBIC replaces linear growth with a cubic function centered on Wmax (the window size before the last loss):
W(t) = C × (t - K)³ + Wmax
where K = ∛(Wmax × (1 - β) / C) is the time needed to climb back to Wmax, and β is CUBIC's multiplicative decrease factor (0.7).
This creates three regions: steep concave growth while the window is well below Wmax, a near-flat plateau around Wmax (cautious probing near the last known limit), and convex, accelerating growth beyond Wmax once no loss appears.
The result: faster recovery from loss with the same stability near capacity.
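Here's a sketch of that window curve using the default constants from RFC 8312 (C = 0.4, β = 0.7); note that W(0) = β × Wmax, matching the post-loss window:

```python
# CUBIC window curve: W(t) = C*(t - K)^3 + Wmax, with K chosen so the
# curve starts at beta*Wmax right after the loss and regains Wmax at t = K.

C = 0.4      # CUBIC scaling constant (RFC 8312 default)
BETA = 0.7   # multiplicative decrease factor (RFC 8312 default)

def cubic_window(t: float, w_max: float) -> float:
    """Window size in MSS, t seconds after a loss event."""
    k = (w_max * (1 - BETA) / C) ** (1 / 3)  # time to return to w_max
    return C * (t - k) ** 3 + w_max

for t in (0.0, 2.0, 4.0, 6.0):
    print(f"t = {t:.0f} s: W = {cubic_window(t, 100.0):.1f} MSS")
# 70.0 → 95.6 → 100.0 → 102.3: steep, then flat near Wmax, then convex.
```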
Pacing:
Modern implementations often pace segment transmissions rather than sending bursts. During congestion avoidance, segments are spread across the RTT:
Classic: |seg1|seg2|seg3|......wait......|
Paced: |.seg1.|..seg2.|..seg3.|..wait..|
Pacing reduces burstiness, smooths queue occupancy, and improves coexistence with other traffic.
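The pacing rate itself is simple arithmetic: one window per RTT, spread evenly. A sketch with illustrative numbers:

```python
# Pacing interval: instead of bursting cwnd segments back-to-back,
# transmit one segment every RTT / cwnd.

cwnd_segments = 40
rtt_ms = 50.0

pacing_rate = cwnd_segments / (rtt_ms / 1000)  # segments per second
interval_ms = rtt_ms / cwnd_segments           # gap between segments

print(f"{pacing_rate:.0f} segments/s, one every {interval_ms:.2f} ms")
# → 800 segments/s, one every 1.25 ms
```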
| Enhancement | Mechanism | Benefit | Deployed In |
|---|---|---|---|
| CUBIC | Cubic function instead of linear | Faster recovery on high-BDP | Linux (default) |
| Compound | Delay + loss windows combined | Better high-BDP performance | Windows Server 2008+ |
| Proportional Rate Reduction | Smoother recovery after loss | Fewer spurious retransmits | Linux 3.2+ |
| Pacing | Spread packets over RTT | Reduced burstiness | Most modern stacks |
| HyStart++ | Improved slow start exit | Fewer exit-induced losses | QUIC, modern TCP |
| ECN support | Explicit congestion notification | React before loss | Widely supported |
ECN-based congestion response:
With Explicit Congestion Notification, routers can signal congestion before dropping packets. When an ECN congestion signal (CE mark) is received:
Classic response (similar to loss):
ssthresh = cwnd / 2
cwnd = cwnd / 2
Continue in congestion avoidance
DCTCP response (data centers):
Reduce in proportion to the measured congestion level. DCTCP maintains a moving average α of the fraction of ECN-marked packets and cuts cwnd by α/2 each window:
Light marking (α ≈ 2%) → reduce cwnd by ~1%
Every packet marked (α = 1) → halve cwnd, as in the classic response
ECN allows earlier, more precise congestion signals, enabling smoother congestion avoidance without packet loss.
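A sketch of the DCTCP update rules from RFC 8257, where α is an exponentially weighted moving average of the marked fraction and g = 1/16 is the RFC's default gain:

```python
# DCTCP congestion response: track the fraction of ECN-marked packets
# and cut cwnd by alpha/2 once per window.

G = 1.0 / 16.0  # EWMA gain (RFC 8257 default)

def update_alpha(alpha: float, marked: int, total: int) -> float:
    """Once per window: fold this window's marked fraction into alpha."""
    return (1 - G) * alpha + G * (marked / total)

def dctcp_decrease(cwnd: float, alpha: float) -> float:
    """Scale the cut to the congestion level: all-marked means halve."""
    return cwnd * (1 - alpha / 2)

print(dctcp_decrease(100.0, 0.02))  # light marking: cwnd 100 -> 99
print(dctcp_decrease(100.0, 1.0))   # every packet marked: cwnd 100 -> 50
```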
BBR's different approach:
Google's BBR algorithm doesn't use traditional congestion avoidance at all. Instead of additive increase until loss, BBR continuously estimates the bottleneck bandwidth and the minimum RTT, paces transmissions at the estimated bandwidth, and caps the data in flight near the measured bandwidth-delay product, periodically probing to refresh both estimates.
This represents a fundamental departure from AIMD while preserving the goal of efficient, fair bandwidth utilization.
On Linux, view the current algorithm with sysctl net.ipv4.tcp_congestion_control. List available options with sysctl net.ipv4.tcp_available_congestion_control. Per-socket selection is possible via setsockopt(), allowing applications to choose algorithms suited to their traffic patterns.
We've thoroughly explored congestion avoidance, the phase where TCP connections spend most of their time. Let's consolidate the key insights:
- cwnd grows linearly, one MSS per RTT, rather than exponentially
- AIMD (α = 1 MSS, β = 0.5) produces the characteristic sawtooth between W/2 and W, averaging about 75% utilization
- Equal additive steps and proportional decreases drive competing flows toward fair shares
- Modern refinements (CUBIC, pacing, ECN, DCTCP, BBR) address its limitations on high-BDP paths
What's next:
The final topic in this module examines linear growth in greater detail—analyzing its mathematical properties, its impact on long-term throughput, and its interaction with other aspects of TCP behavior.
You now understand TCP congestion avoidance—the steady-state phase where connections carefully balance throughput against congestion risk. The additive increase, multiplicative decrease pattern creates a stable, fair, and efficient system that enables millions of connections to share Internet capacity. Next, we'll examine the linear growth pattern in greater depth.