Understanding fast retransmit conceptually is essential, but understanding its implementation reveals the real-world engineering decisions that make it work reliably at scale. Production TCP stacks—in Linux, Windows, FreeBSD, and network devices—implement fast retransmit with careful attention to edge cases, performance, and interoperability.
This page bridges theory and practice, examining the data structures, state machines, and code patterns that bring fast retransmit to life. Whether you're implementing a custom TCP stack, debugging production issues, or simply deepening your understanding, this implementation-focused perspective is invaluable.
By the end of this page, you will understand: (1) the data structures used to track retransmission state, (2) the state machine governing fast retransmit and recovery, (3) pseudo-code and real code patterns, (4) platform-specific implementation differences, and (5) common bugs and debugging techniques.
Fast retransmit implementation requires several key data structures to track connection state, identify losses, and manage recovery. Let's examine each component.
Transmission Control Block (TCB):
The TCB is the primary per-connection data structure, containing all state for a TCP connection. Fast retransmit uses these fields:
```c
// TCP Control Block - Fast Retransmit Fields
struct tcp_sock {
    // === Sequence Number State ===
    u32 snd_una;            // Send unacknowledged - oldest unacked byte
    u32 snd_nxt;            // Send next - next byte to transmit
    u32 snd_wnd;            // Send window (from receiver)

    // === Congestion Control State ===
    u32 cwnd;               // Congestion window (segments or bytes)
    u32 ssthresh;           // Slow start threshold

    // === Fast Retransmit State ===
    u32 dupacks;            // Count of duplicate ACKs received
    u32 high_seq;           // Highest seq sent at time of fast recovery entry
    u32 prior_ssthresh;     // ssthresh before entering fast recovery
    u8  frto_counter;       // F-RTO algorithm counter

    // === State Flags ===
    u8  ca_state;           // Congestion control state (OPEN, DISORDER, CWR, RECOVERY, LOSS)

    // === SACK State (if enabled) ===
    struct tcp_sack_block recv_sack_cache[4]; // SACK blocks from receiver
    u32 sacked_out;         // Count of SACKed segments
    u32 lost_out;           // Estimated lost segments (based on SACK)
    u32 retrans_out;        // Segments currently being retransmitted

    // === Scoreboard (for SACK-based recovery) ===
    struct rb_root retransmit_tree; // Red-black tree of sent segments

    // === Timing ===
    u32 rto;                // Retransmission timeout (in jiffies/ms)
    u64 retrans_stamp;      // Time of last retransmission
};

// Congestion control states
enum tcp_ca_state {
    TCP_CA_Open     = 0,    // Normal operation
    TCP_CA_Disorder = 1,    // Received dupacks, possible reordering
    TCP_CA_CWR      = 2,    // Congestion Window Reduced (ECN)
    TCP_CA_Recovery = 3,    // Fast recovery in progress
    TCP_CA_Loss     = 4,    // RTO fired, in loss recovery
};
```

The Send Buffer and Retransmission Queue:
TCP must maintain sent-but-unacknowledged data for potential retransmission. This is typically implemented as:
```c
// Segment metadata for retransmission tracking
struct tcp_skb_cb {
    u32 seq;            // Sequence number of first byte
    u32 end_seq;        // Sequence number after last byte
    u8  sacked;         // SACK state flags
    u8  tcp_flags;      // SYN, FIN, etc.
    u64 tx_time;        // Transmission timestamp (for RTT and RACK)
    u8  retrans_count;  // Number of times retransmitted
};

// SACK state flags per segment
#define TCPCB_SACKED_ACKED   0x01  // SACKed by receiver
#define TCPCB_SACKED_RETRANS 0x02  // Currently being retransmitted
#define TCPCB_LOST           0x04  // Declared lost
#define TCPCB_TAGBITS        0x07  // All flags

// The send queue - list of unacknowledged segments
struct sk_buff_head write_queue; // Ordered by sequence number
```

With SACK enabled, implementations maintain a 'scoreboard'—a data structure tracking the SACK state of each segment individually. This allows identifying exactly which segments are lost vs SACKed, enabling precise retransmission. The scoreboard adds memory overhead but dramatically improves multi-loss recovery.
Fast retransmit operates as part of TCP's broader congestion control state machine. Understanding state transitions is key to correct implementation.
Linux TCP Congestion Control States:
State Descriptions:
OPEN (Normal Operation): The default state. ACKs advance snd_una, cwnd grows via slow start or congestion avoidance, and no loss is suspected.

DISORDER (Possible Reordering): One or more duplicate ACKs have arrived, but fewer than the threshold. The missing segment may merely be reordered, so TCP waits before retransmitting.

RECOVERY (Fast Recovery): The duplicate ACK threshold was reached. The presumed-lost segment has been retransmitted, and cwnd is governed by fast recovery until all data sent before recovery entry (high_seq) is acknowledged.

LOSS (RTO Recovery): The retransmission timer fired. cwnd collapses to one segment and the connection restarts in slow start.
```
// State transition logic (simplified from Linux tcp_fastretrans_alert)
function on_ack_received(ack):
    if is_new_ack(ack):
        handle_new_ack(ack)
    else if is_duplicate_ack(ack):
        handle_dup_ack(ack)

function handle_new_ack(ack):
    if ca_state == RECOVERY:
        if ack >= high_seq:
            // Recovery complete: all pre-recovery data acked
            exit_recovery()
            ca_state = OPEN
        else:
            // Partial ACK during recovery
            // NewReno: retransmit next lost segment
            // SACK: retransmit based on scoreboard
            maybe_retransmit_on_partial_ack()
    else:
        ca_state = OPEN
    dupacks = 0
    update_cwnd_on_ack(ack)

function handle_dup_ack(ack):
    dupacks += 1
    if ca_state == OPEN:
        ca_state = DISORDER
    if ca_state == DISORDER or ca_state == RECOVERY:
        if dupacks == 3 and ca_state != RECOVERY:
            // === FAST RETRANSMIT TRIGGER ===
            enter_recovery()
            retransmit_segment(snd_una)
        else if ca_state == RECOVERY:
            // Window inflation: one SMSS per additional dup ACK
            cwnd += SMSS
            try_send_more_data()
```

The 'high_seq' variable (snd_nxt at recovery entry) is critical for knowing when recovery is complete. TCP must ACK all data sent before entering recovery, not just the retransmitted segment. Bugs in high_seq tracking cause premature or delayed recovery exit.
Let's examine a complete pseudo-implementation of fast retransmit with fast recovery, incorporating all the details discussed so far.
Full Algorithm Implementation:
```c
// Complete Fast Retransmit + Fast Recovery Implementation
// Based on TCP Reno (RFC 5681) with NewReno extensions (RFC 6582)

struct tcp_connection {
    // ... all fields from earlier ...
    uint32_t snd_una, snd_nxt, snd_wnd;
    uint32_t cwnd, ssthresh;
    uint32_t dupacks;
    uint32_t high_seq;
    enum tcp_ca_state ca_state;
    uint32_t mss;  // Max segment size
};

#define DUPACK_THRESHOLD 3

void tcp_process_ack(struct tcp_connection *conn, uint32_t ack_num) {
    // Check if this is a new ACK (advances snd_una)
    if (seq_after(ack_num, conn->snd_una)) {
        //
        // NEW ACK PROCESSING
        //
        uint32_t bytes_acked = ack_num - conn->snd_una;
        conn->snd_una = ack_num;

        // Remove acknowledged data from send buffer
        tcp_clean_rtx_queue(conn, ack_num);

        // State-specific handling
        switch (conn->ca_state) {
        case TCP_CA_Recovery:
            if (seq_geq(ack_num, conn->high_seq)) {
                // Full recovery complete
                tcp_end_recovery(conn);
                conn->ca_state = TCP_CA_Open;
                // Deflate window to ssthresh
                conn->cwnd = min(conn->ssthresh,
                                 tcp_packets_in_flight(conn) + conn->mss);
            } else {
                // Partial ACK - more segments to retransmit (NewReno)
                tcp_retransmit_one(conn);  // Retransmit next lost
                // Deflate cwnd by amount acked, then add 1 mss
                conn->cwnd -= bytes_acked;
                conn->cwnd += conn->mss;
            }
            break;

        case TCP_CA_Loss:
            // In loss state, new ACK allows progress
            conn->ca_state = TCP_CA_Open;
            break;

        case TCP_CA_Disorder:
        case TCP_CA_Open:
        default:
            // Normal cwnd increase
            tcp_cong_avoid(conn, bytes_acked);
            conn->ca_state = TCP_CA_Open;
            break;
        }

        // Reset duplicate ACK counter on new ACK
        conn->dupacks = 0;

        // Try to send more data
        tcp_write_xmit(conn);

    } else if (ack_num == conn->snd_una) {
        //
        // DUPLICATE ACK PROCESSING
        //
        if (!tcp_is_valid_dupack(conn, ack_num)) {
            return;  // Not a true dup ACK (RFC 5681 criteria)
        }

        conn->dupacks++;

        switch (conn->ca_state) {
        case TCP_CA_Open:
            // First dup ACK - enter disorder
            conn->ca_state = TCP_CA_Disorder;
            // Fall through
        case TCP_CA_Disorder:
            if (conn->dupacks == DUPACK_THRESHOLD) {
                //
                // === FAST RETRANSMIT TRIGGER ===
                //
                tcp_enter_recovery(conn);
                tcp_retransmit_one(conn);  // Retransmit snd_una
            }
            break;

        case TCP_CA_Recovery:
            // Already in recovery - inflate window
            conn->cwnd += conn->mss;
            tcp_write_xmit(conn);  // Send new data if window allows
            break;

        case TCP_CA_Loss:
            // In loss state, dup ACKs ignored
            break;
        }
    }
}

void tcp_enter_recovery(struct tcp_connection *conn) {
    // Record high_seq for recovery exit condition
    conn->high_seq = conn->snd_nxt;

    // Set new ssthresh (RFC 5681: max(FlightSize/2, 2*MSS))
    uint32_t flight = tcp_packets_in_flight(conn);
    conn->ssthresh = max(flight / 2, 2);  // In segments

    // Set cwnd for recovery phase:
    // ssthresh + 3*MSS for the 3 segments that triggered dup ACKs
    conn->cwnd = conn->ssthresh + DUPACK_THRESHOLD * conn->mss;

    // Enter recovery state
    conn->ca_state = TCP_CA_Recovery;

    // Record stats
    TCP_INC_STATS(TCP_MIB_FASTRETR);
}

void tcp_end_recovery(struct tcp_connection *conn) {
    conn->dupacks = 0;
    conn->ca_state = TCP_CA_Open;
}

bool tcp_is_valid_dupack(struct tcp_connection *conn, uint32_t ack) {
    // RFC 5681 duplicate ACK criteria:
    // 1. ACK carries no data
    // 2. ACK carries no window change
    // 3. ACK carries no SYN or FIN
    // 4. ACK number equals snd_una
    // (These checks done before calling this function)
    return true;  // Simplified - actual impl checks packet flags
}
```

Production implementations track statistics (TCP_MIB_FASTRETR, etc.) for monitoring. These counters appear in /proc/net/snmp and /proc/net/netstat on Linux, enabling runtime visibility into TCP behavior.
When Selective Acknowledgment (SACK) is negotiated, fast retransmit becomes significantly more powerful. SACK provides explicit information about which segments the receiver has, enabling precise identification of lost segments.
SACK Processing:
```c
// SACK block structure
struct tcp_sack_block {
    uint32_t start;  // First sequence number in block
    uint32_t end;    // Sequence number after last in block
};

// Process SACK option from ACK packet
void tcp_process_sack(struct tcp_connection *conn,
                      struct tcp_sack_block *blocks, int num_blocks) {
    // Update scoreboard based on SACK blocks
    for (int i = 0; i < num_blocks; i++) {
        tcp_sack_mark_range(conn, blocks[i].start, blocks[i].end);
    }

    // Identify lost segments using SACK information
    tcp_update_lost_estimation(conn);
}

// Mark segments in range as SACKed
void tcp_sack_mark_range(struct tcp_connection *conn,
                         uint32_t start, uint32_t end) {
    struct sk_buff *skb;

    // Walk send queue, mark matching segments
    foreach_skb_in_write_queue(skb, conn) {
        struct tcp_skb_cb *cb = TCP_SKB_CB(skb);

        if (seq_within(cb->seq, start, end) ||
            seq_within(cb->end_seq - 1, start, end)) {
            // Segment overlaps SACK block
            cb->sacked |= TCPCB_SACKED_ACKED;
            conn->sacked_out++;
        }
    }
}

// Identify lost segments based on SACK
void tcp_update_lost_estimation(struct tcp_connection *conn) {
    struct sk_buff *skb;
    int fack_count = 0;  // Forward ACK count

    // FACK-based loss detection:
    // Segment is lost if 3 or more segments after it have been SACKed
    foreach_skb_in_write_queue_reverse(skb, conn) {
        struct tcp_skb_cb *cb = TCP_SKB_CB(skb);

        if (cb->sacked & TCPCB_SACKED_ACKED) {
            fack_count++;
        } else {
            // Not SACKed - check if lost
            if (fack_count >= DUPACK_THRESHOLD) {
                // 3+ SACKed segments after this one = lost
                if (!(cb->sacked & TCPCB_LOST)) {
                    cb->sacked |= TCPCB_LOST;
                    conn->lost_out++;
                }
            }
        }
    }
}

// Retransmit all segments marked as lost
void tcp_retransmit_lost(struct tcp_connection *conn) {
    struct sk_buff *skb;

    foreach_skb_in_write_queue(skb, conn) {
        struct tcp_skb_cb *cb = TCP_SKB_CB(skb);

        if ((cb->sacked & TCPCB_LOST) &&
            !(cb->sacked & TCPCB_SACKED_RETRANS)) {
            // Lost and not already being retransmitted
            tcp_retransmit_skb(conn, skb);
            cb->sacked |= TCPCB_SACKED_RETRANS;
            conn->retrans_out++;
        }
    }
}
```

| Capability | Without SACK | With SACK |
|---|---|---|
| Know which segments lost | Only first (snd_una) | All lost segments explicitly |
| Retransmit strategy | One at a time | All lost at once |
| Recovery for N losses | ~N RTTs | ~1 RTT |
| Spurious retransmit risk | Higher | Lower (explicit info) |
| Memory overhead | Lower | Higher (scoreboard) |
FACK is a loss detection algorithm that uses SACK information. It considers a segment lost if 3 or more segments after it have been SACKed. This is similar to the 3 dup ACK rule but based on explicit SACK data rather than counting ACKs.
While the core fast retransmit algorithm is standardized, implementations vary across operating systems. Understanding these differences is valuable for debugging and optimization.
Linux TCP:
- Key files: net/ipv4/tcp_input.c (main ACK processing), tcp_output.c (retransmission), tcp_recovery.c (RACK)
- Key function: tcp_fastretrans_alert() - called on every ACK, handles all fast retransmit/recovery logic
- Tunables: /proc/sys/net/ipv4/tcp_early_retrans, tcp_recovery (RACK on/off)
```c
// Simplified view of Linux's tcp_fastretrans_alert()
// Called from tcp_ack() for every incoming ACK

static void tcp_fastretrans_alert(struct sock *sk,
                                  struct tcp_ack_state *ack_state)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    int dup_ackers = ack_state->dup_acks;

    // Update reordering estimation
    if (dup_ackers > tp->reordering) {
        tcp_check_reordering(sk, dup_ackers);
    }

    // State machine transitions
    switch (icsk->icsk_ca_state) {
    case TCP_CA_Open:
        if (dup_ackers >= 1) {
            tcp_enter_cwr(sk);  // Or Disorder
        }
        break;

    case TCP_CA_Disorder:
        // Check for fast retransmit trigger
        if (tcp_time_to_recover(sk, dup_ackers)) {
            tcp_enter_recovery(sk, true);
            tcp_xmit_retransmit_queue(sk);
        }
        break;

    case TCP_CA_Recovery:
        // Handle partial/full ACKs
        if (ack_state->flag & FLAG_SND_UNA_ADVANCED) {
            tcp_try_undo_partial(sk);
            tcp_check_space(sk);
        }
        // Inflate cwnd on dup ACKs
        tcp_cwnd_reduction(sk, dup_ackers);
        break;

    case TCP_CA_Loss:
        tcp_try_undo_loss(sk, true);
        break;
    }

    // Retransmit lost segments (SACK-based if available)
    if (tcp_is_in_recovery(sk)) {
        tcp_xmit_retransmit_queue(sk);
    }
}
```

Windows TCP:
- Tunable: TcpMaxDupAcks (default 2 on older versions, 3 since Win10)

FreeBSD/macOS (BSD Stack):
- SACK recovery logic: tcp_sack.c
- Main ACK processing: sys/netinet/tcp_input.c

For deep understanding, reading the actual kernel source is invaluable. Start with Linux's tcp_input.c and trace the path from tcp_rcv_established() → tcp_ack() → tcp_fastretrans_alert(). The code is well-commented with RFC references.
Fast retransmit implementations are notoriously tricky. Subtle bugs can cause performance degradation, spurious retransmissions, or complete failure to recover. Here are common pitfalls:
Bug 1: Incorrect Duplicate ACK Counting
```c
// BUG: Counting ACKs instead of duplicate ACKs
void buggy_handle_ack(struct tcp_conn *c, u32 ack_num) {
    if (ack_num == c->snd_una) {
        c->ack_count++;  // WRONG: counts ALL acks with same number.
                         // Should filter by RFC 5681 criteria.
        if (c->ack_count >= 3) {
            fast_retransmit();
        }
    }
}

// CORRECT: Filter by RFC 5681 duplicate ACK criteria
void correct_handle_ack(struct tcp_conn *c, packet *pkt) {
    if (pkt->ack_num == c->snd_una &&
        pkt->data_len == 0 &&             // No data
        pkt->window == c->last_window &&  // No window change
        !(pkt->flags & (SYN | FIN))) {    // No SYN/FIN
        c->dupacks++;
        if (c->dupacks == 3) {
            fast_retransmit();
        }
    }
}
```

Bug 2: Not Resetting Counter on New ACK
```c
// BUG: Forgetting to reset dupacks on new ACK
void buggy_new_ack(struct tcp_conn *c, u32 ack_num) {
    c->snd_una = ack_num;
    // MISSING: c->dupacks = 0;
    // Result: Can trigger spurious fast retransmit later
}

// CORRECT: Always reset on new ACK
void correct_new_ack(struct tcp_conn *c, u32 ack_num) {
    c->snd_una = ack_num;
    c->dupacks = 0;  // Critical!
}
```

A historically famous bug: early implementations exited fast recovery on any new ACK, even partial ACKs (that advance snd_una but don't acknowledge all data). NewReno (RFC 6582) requires staying in recovery until all pre-recovery data is acked. This bug caused severe performance issues with multiple losses.
Verifying fast retransmit correctness requires careful testing. Production-grade TCP stacks use extensive test suites and packet traces.
Testing Approaches:
```bash
#!/bin/bash
# Test fast retransmit using network emulation

# Setup: Add packet loss
sudo tc qdisc add dev eth0 root netem loss 5%

# Start packet capture
tcpdump -i eth0 -w test_capture.pcap tcp port 5201 &

# Run iperf test
iperf3 -c 10.0.0.1 -t 30

# Stop capture
kill %1

# Analyze with tshark
echo "=== Fast Retransmit Analysis ==="
tshark -r test_capture.pcap -Y "tcp.analysis.fast_retransmission" \
    -T fields -e frame.number -e tcp.seq | wc -l
echo "Fast retransmissions triggered"

echo "=== Duplicate ACK Analysis ==="
tshark -r test_capture.pcap -Y "tcp.analysis.duplicate_ack" \
    -T fields -e frame.number | wc -l
echo "Duplicate ACKs sent"

# Check kernel counters
nstat -sz | grep TcpFastRetrans

# Cleanup
sudo tc qdisc del dev eth0 root
```

Debugging Techniques:
Using ss for live connection state:
ss -ti dst 10.0.0.1 | grep -E "(retrans|cwnd|ssthresh)"
Kernel tracing (Linux):
# Trace TCP events
echo 1 > /sys/kernel/debug/tracing/events/tcp/tcp_retransmit_skb/enable
cat /sys/kernel/debug/tracing/trace_pipe
Wireshark filters:
- tcp.analysis.fast_retransmission: Show fast retransmissions
- tcp.analysis.duplicate_ack: Show duplicate ACKs
- tcp.analysis.retransmission: All retransmissions
- tcp.analysis.spurious_retransmission: Unnecessary retransmissions

Google's packetdrill tool allows scripting precise packet sequences to test TCP behavior. It's invaluable for testing fast retransmit edge cases—you can inject exactly 3 dup ACKs and verify the exact retransmission behavior. See: https://github.com/google/packetdrill
Implementing fast retransmit correctly is a non-trivial engineering challenge. The algorithm is elegant in concept but demands careful attention to state management, edge cases, and interaction with other TCP mechanisms.
Module Complete:
You have now completed the comprehensive exploration of TCP Fast Retransmit. From the foundational understanding of duplicate ACKs, through the trigger mechanism, timeout avoidance benefits, performance analysis, and implementation details—you possess expert-level knowledge of this critical TCP optimization.
This knowledge is directly applicable to implementing custom TCP stacks, debugging production network issues, and tuning congestion control behavior.
Congratulations! You have mastered TCP Fast Retransmit—from the theoretical foundations of duplicate ACKs to the practical realities of production implementation. This knowledge positions you to understand, debug, and optimize TCP behavior in any environment.