Loading learning content...
Error detection tells us that something went wrong. Forward Error Correction can fix minor damage. But what happens when:
In these cases, the only solution is retransmission—the sender must transmit the affected data again. This sounds simple, but the details are complex: How does the sender know retransmission is needed? How do we avoid sending duplicates? What if the acknowledgment itself gets lost?
Retransmission mechanisms form the backbone of reliable data link layer protocols, ensuring that despite the chaos of physical transmission, data arrives correctly and in order.
This page explores the mechanics of retransmission: why it's needed, the fundamental components (ACKs, NAKs, timeouts, sequence numbers), the challenges of reliable coordination, and how retransmission protocols handle various failure scenarios including lost frames, lost acknowledgments, and duplicate frames.
Retransmission seems straightforward: if something goes wrong, send it again. But the distributed nature of communication creates subtle challenges.
The Fundamental Difficulty:
The sender and receiver are physically separated, communicating only through the unreliable channel we're trying to make reliable. Each party has incomplete information:
This asymmetry leads to scenarios where correct protocol behavior is non-obvious.
Duplicates are particularly insidious. If a bank transfer frame is delivered twice, money moves twice. Unlike 'lost data' which is obviously wrong, a duplicate looks valid—it has correct CRC, correct format, correct addresses. Only careful sequencing prevents accepting the same data twice.
Reliable retransmission protocols are built from a small set of fundamental mechanisms. Understanding each component is essential before studying complete protocols.
1. Acknowledgments (ACKs)
Positive acknowledgment confirms successful receipt. When the receiver gets a valid frame, it sends an ACK back to the sender. The ACK may include:
2. Negative Acknowledgments (NAKs)
Explicit signal that something went wrong. NAKs can accelerate recovery by immediately requesting retransmission rather than waiting for timeout. However, NAKs add complexity:
Many protocols omit NAKs entirely, relying solely on timeouts.
3. Timeouts
The sender sets a timer when transmitting a frame. If no ACK arrives before timeout, assume something failed and retransmit. Timeout design is critical:
| Network Type | Typical RTT | Recommended Timeout | Considerations |
|---|---|---|---|
| Local Ethernet | < 1 ms | 10-50 ms | Fast but rare losses; timeout rarely triggers |
| Wi-Fi LAN | 1-10 ms | 50-100 ms | Variable delays; retransmissions more common |
| WAN (Internet) | 10-100 ms | 200-500 ms | Highly variable; adaptive timeout essential |
| Satellite | 500-700 ms | 2-3 seconds | Long but predictable; FEC preferred over ARQ |
| Deep Space | 3-22 minutes | N/A | Retransmission impractical; must use FEC |
4. Sequence Numbers
Uniquely identify each frame to:
Sequence numbers don't need to count forever—they cycle through a finite range. The range must be large enough to distinguish frames 'in flight' from their predecessors/successors.
5. Retransmission Buffer
The sender must retain transmitted frames until acknowledged. If retransmission is needed, the frame is retransmitted from buffer. Buffer management:
In bidirectional communication, ACKs can be 'piggybacked' on data frames going the opposite direction. This reduces overhead: instead of separate ACK frames, the ACK information rides along in the header of a data frame. Piggybacking works best when traffic is roughly symmetric; if acknowledger has nothing to send, ACK must eventually be sent alone.
A protocol must decide what event triggers retransmission. Different approaches have different tradeoffs:
Timeout-Based Retransmission:
The simplest approach: if no ACK arrives within timeout period, retransmit.
Advantages:
Disadvantages:
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
import timeimport threading class TimeoutRetransmitter: """ Simple timeout-based retransmission implementation. """ def __init__(self, timeout_ms: float, max_retries: int = 5): self.timeout = timeout_ms / 1000 # Convert to seconds self.max_retries = max_retries self.pending_acks = {} # seq_num -> (frame, timer, retries) self.lock = threading.Lock() def send_frame(self, seq_num: int, frame: bytes, transmit_func): """Send a frame and start timeout timer.""" with self.lock: if seq_num in self.pending_acks: raise ValueError(f"Sequence {seq_num} already pending") # Transmit the frame transmit_func(frame) # Start timeout timer timer = threading.Timer( self.timeout, self._handle_timeout, args=(seq_num, frame, transmit_func, 1) ) timer.start() self.pending_acks[seq_num] = (frame, timer, 0) def _handle_timeout(self, seq_num: int, frame: bytes, transmit_func, retry_count: int): """Handle timeout: retransmit or give up.""" with self.lock: if seq_num not in self.pending_acks: return # Already ACKed, ignore if retry_count >= self.max_retries: print(f"Frame {seq_num}: max retries exceeded, giving up") del self.pending_acks[seq_num] # Notify upper layer of failure return print(f"Frame {seq_num}: timeout, retransmit #{retry_count}") transmit_func(frame) # Restart timer timer = threading.Timer( self.timeout, self._handle_timeout, args=(seq_num, frame, transmit_func, retry_count + 1) ) timer.start() self.pending_acks[seq_num] = (frame, timer, retry_count) def receive_ack(self, seq_num: int): """Process incoming ACK.""" with self.lock: if seq_num in self.pending_acks: frame, timer, retries = self.pending_acks[seq_num] timer.cancel() # Stop the timeout del self.pending_acks[seq_num] print(f"Frame {seq_num}: ACKed after {retries} retries")NAK-Based Retransmission:
Receiver explicitly requests retransmission when error is detected.
Advantages:
Disadvantages:
Hybrid Approach:
Most real protocols use both:
Sequence numbers are finite—they eventually wrap around to 0. This creates the risk of confusing old frames with new ones. Sequence number space design requires careful analysis.
The Wraparound Problem:
Imagine 3-bit sequence numbers (0-7). Sender sends frames 0, 1, 2, ..., 7, 0, 1, 2, ... If a very old duplicate of 'frame 0' suddenly appears, the receiver might accept it as the new frame 0.
Sequence Number Requirements:
For window-based protocols, the sequence number space must satisfy:
Stop-and-Wait: Only 1 frame outstanding → 1 bit (0, 1) sufficient Go-Back-N: Window size W → Sequence space ≥ W + 1 Selective Repeat: Window size W → Sequence space ≥ 2W
The difference arises because Selective Repeat accepts out-of-order frames, requiring the receiver to distinguish between:
| Protocol | Sequence Space | With 3-bit Seq (0-7) | Reason |
|---|---|---|---|
| Stop-and-Wait | ≥ 2 | Window = 1 ✓ | Distinguish current from previous frame |
| Go-Back-N | ≥ W + 1 | Window ≤ 7 | After W frames, first seq reused for W+1st |
| Selective Repeat | ≥ 2W | Window ≤ 4 | Sender window + receiver window cannot overlap |
Why Selective Repeat needs sequence space ≥ 2W: Scenario with 3-bit sequence (0-7) and window size 5 (WRONG): Sender window: [0, 1, 2, 3, 4] - sends all 5 framesReceiver window: expects [0, 1, 2, 3, 4] All 5 ACKs are lost in transit! Sender times out, retransmits frame 0.Meanwhile, sender's window conceptually wants to advance to [5, 6, 7, 0, 1] ↑ Reusing 0! Receiver sees "frame 0" arrive - is this: (a) The retransmitted old frame 0? - should discard (b) The new frame 0 from window [5,6,7,0,1]? - should accept AMBIGUOUS! Receiver cannot distinguish. With sequence space ≥ 2W (i.e., ≥ 10), or equivalently window ≤ 4: Sender window: [0, 1, 2, 3] - sends 4 framesReceiver window: expects [0, 1, 2, 3] All ACKs lost, sender retransmits frame 0.Sender's next window would be [4, 5, 6, 7] - no overlap with [0,1,2,3]! Receiver can now distinguish: - If expecting [0,1,2,3]: Accept frame 0 (or discard as dup) - If already at [4,5,6,7]: Definitely old, discardTCP uses 32-bit sequence numbers and defines a Maximum Segment Lifetime (MSL, typically 2 minutes). Duplicates older than 2×MSL are assumed to have died from the network. The sequence space is large enough (4 billion) that wraparound during 4 minutes is impossible at any realistic data rate—ensuring even delayed duplicates are distinguishable.
A timeout fires—but the sender doesn't know if the frame was lost or the ACK was lost. The protocol must handle both cases correctly.
Case 1: Frame Lost
Frame never reached receiver → No ACK was ever sent → Sender retransmits → Receiver accepts first copy → Everything works as expected.
Case 2: ACK Lost
Original frame reached receiver → Receiver ACKed → ACK was lost → Sender retransmits → Receiver gets duplicate!
The receiver must recognize and discard the duplicate. This is accomplished via sequence numbers:
Time Sender Receiver================================================ 0 → Send Frame 0 ──────X (lost) 50 → (waiting...) (nothing received)100 → TIMEOUT! Retransmit 0 ───────→ Receive Frame 0 ✓ Start timer Send ACK 0 ←─── ACK 0 arrives Stop timer Send Frame 1 ───────→ ... Result: Recovery after one timeout. No duplicates. Clean recovery.Time Sender Receiver================================================ 0 → Send Frame 0 ───────→ Receive Frame 0 ✓ Deliver to layer 3 expect = 1 Send ACK 0 ──X── ACK 0 lost!100 → TIMEOUT! Retransmit 0 ───────→ Receive Frame 0 Start timer seq 0 < expect 1 DUPLICATE detected! Discard data Re-send ACK 0 ←─── ACK 0 arrives Stop timer Send Frame 1 ───────→ ... Result: Duplicate handled correctly.When the receiver detects a duplicate frame, it must re-send the ACK. The duplicate arrived because the sender didn't receive an ACK—if we silently discard without re-ACKing, the sender will keep retransmitting forever. The rule: always ACK valid frames, even duplicates.
Frame Corruption Handling:
If a frame arrives with CRC error:
Option 1: Silent Discard
Option 2: Send NAK
Option 3: Send Duplicate ACK
Setting the retransmission timeout is a critical design decision. Too short causes premature retransmission (wasted bandwidth, duplicates). Too long causes slow recovery (poor throughput on lossy channels).
The Estimation Challenge:
Ideal timeout = RTT + ε (small safety margin)
But RTT varies due to:
At the Data Link Layer, RTT is typically stable (single-hop), but at transport layer (TCP), RTT can vary dramatically.
Adaptive Timeout Estimation (TCP Approach):
While primarily a transport-layer concern, the principles apply to any retransmission protocol:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869
class AdaptiveTimeout: """ Implements Jacobson/Karels adaptive timeout estimation. Used in TCP and applicable to any ARQ protocol. """ def __init__(self, initial_rto: float = 1.0): self.srtt = None # Smoothed RTT estimate self.rttvar = None # RTT variance estimate self.rto = initial_rto # Retransmission timeout # Smoothing factors (RFC 6298 recommends these) self.alpha = 1/8 # SRTT smoothing factor self.beta = 1/4 # RTTVAR smoothing factor self.K = 4 # Variance multiplier for RTO # Bounds self.min_rto = 0.2 # Minimum RTO (200ms for TCP) self.max_rto = 60.0 # Maximum RTO (60 seconds) def update(self, measured_rtt: float): """ Update timeout estimate based on measured RTT. Called when ACK received for a non-retransmitted segment. """ if self.srtt is None: # First measurement self.srtt = measured_rtt self.rttvar = measured_rtt / 2 else: # Subsequent measurements # RTTVAR = (1-beta) * RTTVAR + beta * |SRTT - R| self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - measured_rtt) # SRTT = (1-alpha) * SRTT + alpha * R self.srtt = (1 - self.alpha) * self.srtt + self.alpha * measured_rtt # RTO = SRTT + K * RTTVAR self.rto = self.srtt + self.K * self.rttvar # Apply bounds self.rto = max(self.min_rto, min(self.max_rto, self.rto)) return self.rto def backoff(self): """ Exponential backoff after timeout. Doubles RTO to avoid congestion collapse. """ self.rto = min(self.rto * 2, self.max_rto) return self.rto # Example: Timeout adaptation over varying network conditionstimeout = AdaptiveTimeout(initial_rto=1.0) # Simulate RTT measurementsrtts = [0.1, 0.12, 0.11, 0.15, 0.13, 0.2, 0.18, 0.22, 0.19, 0.17]print("Adaptive Timeout Estimation:")print(f"{'RTT':>8} {'SRTT':>8} {'RTTVAR':>8} {'RTO':>8}")for rtt in rtts: rto = timeout.update(rtt) print(f"{rtt:8.3f} {timeout.srtt:8.3f} {timeout.rttvar:8.3f} {rto:8.3f}") # Simulate timeout and backoffprint(f"Timeout occurred! Backing off...")print(f"RTO after backoff: {timeout.backoff():.3f}")print(f"RTO after backoff: {timeout.backoff():.3f}")When an ACK arrives for a retransmitted frame, you can't tell if it's acknowledging the original transmission (delayed ACK) or the retransmission. This makes RTT measurement ambiguous. Karn's algorithm: don't update RTT estimate from retransmitted segments. Only measure RTT from clean, first-transmission ACKs.
When retransmission fails repeatedly, something is seriously wrong—possibly network congestion. Continuing to retransmit aggressively makes things worse. Exponential backoff increases timeout after each failure.
The Congestion Problem:
Imagine many senders all experiencing timeouts (due to congestion causing losses). If they all immediately retransmit:
Exponential Backoff Solution:
After each timeout:
Ethernet Binary Exponential Backoff for Collision: After n collisions (n = 1, 2, 3, ..., 10): - Choose random k from [0, 2^n - 1] - Wait k × slot_time before retransmit After collisions 10-16: - Choose random k from [0, 1023] (capped at 2^10 - 1) - Wait k × slot_time After 16 collisions: - Give up, report failure to upper layer Example progression (collision numbers):┌───────────┬─────────────┬───────────────────┐│ Collision │ Max Backoff │ Average Wait │├───────────┼─────────────┼───────────────────┤│ 1 │ 1 │ 0.5 slots ││ 2 │ 3 │ 1.5 slots ││ 3 │ 7 │ 3.5 slots ││ 4 │ 15 │ 7.5 slots ││ 5 │ 31 │ 15.5 slots ││ ... │ ... │ ... ││ 10+ │ 1023 │ 511.5 slots │└───────────┴─────────────┴───────────────────┘ Why exponential?1. Light load: Collisions resolve quickly (low backoff)2. Heavy load: Collisions spread out over time3. Random choice: Avoids synchronized retransmission4. Adapts to congestion level automaticallyData Link Layer backoff (like Ethernet's) handles collision/channel access. Transport layer backoff (TCP's) handles end-to-end congestion. Both use exponential principles but operate at different layers and timescales. TCP's congestion control is more sophisticated, incorporating window adjustments beyond simple timeout doubling.
Infinite retransmission isn't practical—resources are consumed, latency grows without bound, and the underlying problem may be permanent (cable unplugged, receiver crashed). Protocols must define when to give up.
Maximum Retry Strategies:
| Protocol | Max Retries | Action After Limit | Rationale |
|---|---|---|---|
| Ethernet CSMA/CD | 16 | Discard frame, report error | Persistent collision = likely hardware failure |
| 802.11 Wi-Fi | 7 (default) | Discard, report to upper layer | Wireless interference may be transient |
| HDLC | Configurable (N2) | Reset link or report failure | Depends on underlying medium |
| TCP | ~15 (configurable) | Close connection | Several minutes of failure = peer unreachable |
| UDP | N/A | No retransmission | Unreliable by design; app handles retry |
Failure Escalation:
When the DLL exhausts retries, it must notify higher layers:
The upper layers then decide how to handle it:
While waiting for a troubled frame's retransmission, subsequent frames may be stalled (depending on protocol). Go-Back-N retransmits all frames after the lost one. Even Selective Repeat may block if receiver buffer fills. Design retransmission limits considering the total impact on data flow, not just the individual frame.
We've explored the mechanisms that enable recovery from errors through retransmission—the coordination required between sender and receiver to ensure corrupted or lost data is resent correctly.
What's Next:
We now understand individual retransmission mechanisms. The final page brings it all together: ARQ Protocols—the complete protocol designs (Stop-and-Wait, Go-Back-N, Selective Repeat) that orchestrate error detection, retransmission, and flow control into coherent, reliable communication systems.
You now understand the mechanics of retransmission—from triggering conditions and duplicate handling to timeout estimation and backoff strategies. These mechanisms form the foundation of reliable protocols. Next, we'll see how they're assembled into complete ARQ protocols.