Computer NetworksError Control

Error Control in the Data Link Layer

LevelIntermediate

Duration60 mins

TopicError Control

4 / 5

Retransmission

When Detection Fails to Correct: The Need for Retransmission

Error detection tells us that something went wrong. Forward Error Correction can fix minor damage. But what happens when:

Detected errors exceed FEC correction capability?
The channel is too reliable to justify FEC overhead?
A frame is lost entirely (buffer overflow, collision)?

In these cases, the only solution is retransmission—the sender must transmit the affected data again. This sounds simple, but the details are complex: How does the sender know retransmission is needed? How do we avoid sending duplicates? What if the acknowledgment itself gets lost?

Retransmission mechanisms form the backbone of reliable data link layer protocols, ensuring that despite the chaos of physical transmission, data arrives correctly and in order.

What You Will Learn

This page explores the mechanics of retransmission: why it's needed, the fundamental components (ACKs, NAKs, timeouts, sequence numbers), the challenges of reliable coordination, and how retransmission protocols handle various failure scenarios including lost frames, lost acknowledgments, and duplicate frames.

The Retransmission Problem

Retransmission seems straightforward: if something goes wrong, send it again. But the distributed nature of communication creates subtle challenges.

The Fundamental Difficulty:

The sender and receiver are physically separated, communicating only through the unreliable channel we're trying to make reliable. Each party has incomplete information:

Sender knows: What it sent, when it sent it
Sender doesn't know: Whether it arrived correctly, if ACK was sent
Receiver knows: What it received (if anything), whether CRC passed
Receiver doesn't know: Whether its ACK arrived, if sender is retransmitting

This asymmetry leads to scenarios where correct protocol behavior is non-obvious.

Failure Scenarios Requiring Careful Handling

•Frame corrupted in transit — Receiver detects error (CRC fails), discards frame. How does sender learn to retransmit?
•Frame lost entirely — Buffer overflow, collision damage, or routing failure. Receiver never sees it—cannot even send NAK.
•ACK corrupted — Sender can't interpret response. Is it ACK or NAK? Retransmit or wait?
•ACK lost — Receiver is happy; sender never learns it. Sender may retransmit—creating duplicate.
•Premature timeout — Network delay exceeds sender's patience. Sender retransmits; receiver gets duplicate.
•Out-of-order delivery — Late retransmission arrives after receiver already accepted a later frame.

The Duplicate Problem

Duplicates are particularly insidious. If a bank transfer frame is delivered twice, money moves twice. Unlike 'lost data' which is obviously wrong, a duplicate looks valid—it has correct CRC, correct format, correct addresses. Only careful sequencing prevents accepting the same data twice.

Building Blocks of Retransmission Protocols

Reliable retransmission protocols are built from a small set of fundamental mechanisms. Understanding each component is essential before studying complete protocols.

1. Acknowledgments (ACKs)

Positive acknowledgment confirms successful receipt. When the receiver gets a valid frame, it sends an ACK back to the sender. The ACK may include:

Sequence number of the acknowledged frame
Possibly cumulative acknowledgment ('all frames up to N received')
Sometimes piggyback on a data frame going the other direction

2. Negative Acknowledgments (NAKs)

Explicit signal that something went wrong. NAKs can accelerate recovery by immediately requesting retransmission rather than waiting for timeout. However, NAKs add complexity:

Receiver must detect the error to send NAK
Lost frames produce no NAK (nothing received to NAK)
NAK can itself be lost or corrupted

Many protocols omit NAKs entirely, relying solely on timeouts.

3. Timeouts

The sender sets a timer when transmitting a frame. If no ACK arrives before timeout, assume something failed and retransmit. Timeout design is critical:

Too short: Premature retransmission wastes bandwidth, creates duplicates
Too long: Slow recovery, poor throughput on lossy channels
Optimal: Approximately RTT + processing time + safety margin

Timeout Considerations by Network Type
Network Type	Typical RTT	Recommended Timeout	Considerations
Local Ethernet	< 1 ms	10-50 ms	Fast but rare losses; timeout rarely triggers
Wi-Fi LAN	1-10 ms	50-100 ms	Variable delays; retransmissions more common
WAN (Internet)	10-100 ms	200-500 ms	Highly variable; adaptive timeout essential
Satellite	500-700 ms	2-3 seconds	Long but predictable; FEC preferred over ARQ
Deep Space	3-22 minutes	N/A	Retransmission impractical; must use FEC

4. Sequence Numbers

Uniquely identify each frame to:

Match ACKs to the correct frame
Detect duplicates (same sequence number received twice)
Maintain ordering (know which frame comes next)

Sequence numbers don't need to count forever—they cycle through a finite range. The range must be large enough to distinguish frames 'in flight' from their predecessors/successors.

5. Retransmission Buffer

The sender must retain transmitted frames until acknowledged. If retransmission is needed, the frame is retransmitted from buffer. Buffer management:

Allocate buffer space for each outstanding frame
Release buffer when ACK confirms delivery
Limit outstanding frames to available buffer space

Piggybacking: Efficiency Through Combination

In bidirectional communication, ACKs can be 'piggybacked' on data frames going the opposite direction. This reduces overhead: instead of separate ACK frames, the ACK information rides along in the header of a data frame. Piggybacking works best when traffic is roughly symmetric; if acknowledger has nothing to send, ACK must eventually be sent alone.

Retransmission Triggers: When to Resend

A protocol must decide what event triggers retransmission. Different approaches have different tradeoffs:

Timeout-Based Retransmission:

The simplest approach: if no ACK arrives within timeout period, retransmit.

Advantages:

Handles all failure cases (lost frame, lost ACK, corruption)
Simple implementation
No NAK mechanism required

Disadvantages:

Always waits full timeout even when error is known
Must estimate timeout correctly (hard in variable-delay networks)
Premature timeout wastes bandwidth

timeout_retransmission.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
import time
import threading
 
class TimeoutRetransmitter:
    """
    Simple timeout-based retransmission implementation.
    """
    def __init__(self, timeout_ms: float, max_retries: int = 5):
        self.timeout = timeout_ms / 1000  # Convert to seconds
        self.max_retries = max_retries
        self.pending_acks = {}  # seq_num -> (frame, timer, retries)
        self.lock = threading.Lock()
    
    def send_frame(self, seq_num: int, frame: bytes, transmit_func):
        """Send a frame and start timeout timer."""
        with self.lock:
            if seq_num in self.pending_acks:
                raise ValueError(f"Sequence {seq_num} already pending")
            
            # Transmit the frame
            transmit_func(frame)
            
            # Start timeout timer
            timer = threading.Timer(
                self.timeout, 
                self._handle_timeout,
                args=(seq_num, frame, transmit_func, 1)
            )
            timer.start()
            
            self.pending_acks[seq_num] = (frame, timer, 0)
    
    def _handle_timeout(self, seq_num: int, frame: bytes, 
                        transmit_func, retry_count: int):
        """Handle timeout: retransmit or give up."""
        with self.lock:
            if seq_num not in self.pending_acks:
                return  # Already ACKed, ignore
            
            if retry_count >= self.max_retries:
                print(f"Frame {seq_num}: max retries exceeded, giving up")
                del self.pending_acks[seq_num]
                # Notify upper layer of failure
                return
            
            print(f"Frame {seq_num}: timeout, retransmit #{retry_count}")
            transmit_func(frame)
            
            # Restart timer
            timer = threading.Timer(
                self.timeout,
                self._handle_timeout,
                args=(seq_num, frame, transmit_func, retry_count + 1)
            )
            timer.start()
            self.pending_acks[seq_num] = (frame, timer, retry_count)
    
    def receive_ack(self, seq_num: int):
        """Process incoming ACK."""
        with self.lock:
            if seq_num in self.pending_acks:
                frame, timer, retries = self.pending_acks[seq_num]
                timer.cancel()  # Stop the timeout
                del self.pending_acks[seq_num]
                print(f"Frame {seq_num}: ACKed after {retries} retries")

NAK-Based Retransmission:

Receiver explicitly requests retransmission when error is detected.

Advantages:

Faster recovery (no timeout wait)
Sender learns immediately of corruption

Disadvantages:

Cannot handle lost frames (nothing received to NAK)
NAK can itself be lost
More complex receiver logic
Duplicate NAKs can cause redundant retransmissions

Hybrid Approach:

Most real protocols use both:

NAK for immediate notification of detected errors
Timeout as backup for lost frames or lost NAKs
NAK often rate-limited to avoid duplicate retransmissions

Fast Retransmit Techniques

•Duplicate ACKs — In sliding window protocols, if receiver gets out-of-order frame, it resends ACK for last in-order frame. Three duplicate ACKs signal loss—retransmit without waiting for timeout.
•SACK (Selective ACK) — Receiver reports exactly which frames it has, letting sender target specific gaps for retransmission.
•Explicit Gap Notification — Some protocols let receiver specify 'I'm missing frame N'—more informative than simple NAK.

Sequence Number Space and Wraparound

Sequence numbers are finite—they eventually wrap around to 0. This creates the risk of confusing old frames with new ones. Sequence number space design requires careful analysis.

The Wraparound Problem:

Imagine 3-bit sequence numbers (0-7). Sender sends frames 0, 1, 2, ..., 7, 0, 1, 2, ... If a very old duplicate of 'frame 0' suddenly appears, the receiver might accept it as the new frame 0.

Sequence Number Requirements:

For window-based protocols, the sequence number space must satisfy:

Stop-and-Wait: Only 1 frame outstanding → 1 bit (0, 1) sufficient Go-Back-N: Window size W → Sequence space ≥ W + 1 Selective Repeat: Window size W → Sequence space ≥ 2W

The difference arises because Selective Repeat accepts out-of-order frames, requiring the receiver to distinguish between:

A new frame with sequence S
A retransmission of old frame S (within current window)
A very old duplicate of ancient frame S

Sequence Number Space Requirements
Protocol	Sequence Space	With 3-bit Seq (0-7)	Reason
Stop-and-Wait	≥ 2	Window = 1 ✓	Distinguish current from previous frame
Go-Back-N	≥ W + 1	Window ≤ 7	After W frames, first seq reused for W+1st
Selective Repeat	≥ 2W	Window ≤ 4	Sender window + receiver window cannot overlap

Selective Repeat Sequence Space Example
Why Selective Repeat needs sequence space ≥ 2W:
 
Scenario with 3-bit sequence (0-7) and window size 5 (WRONG):
 
Sender window: [0, 1, 2, 3, 4]  - sends all 5 frames
Receiver window: expects [0, 1, 2, 3, 4]
 
All 5 ACKs are lost in transit!
 
Sender times out, retransmits frame 0.
Meanwhile, sender's window conceptually wants to advance to [5, 6, 7, 0, 1]
                                                               ↑
                                                     Reusing 0!
 
Receiver sees "frame 0" arrive - is this:
  (a) The retransmitted old frame 0? - should discard
  (b) The new frame 0 from window [5,6,7,0,1]? - should accept
 
AMBIGUOUS! Receiver cannot distinguish.
 
With sequence space ≥ 2W (i.e., ≥ 10), or equivalently window ≤ 4:
 
Sender window: [0, 1, 2, 3]  - sends 4 frames
Receiver window: expects [0, 1, 2, 3]
 
All ACKs lost, sender retransmits frame 0.
Sender's next window would be [4, 5, 6, 7] - no overlap with [0,1,2,3]!
 
Receiver can now distinguish:
  - If expecting [0,1,2,3]: Accept frame 0 (or discard as dup)
  - If already at [4,5,6,7]: Definitely old, discard

Maximum Segment Lifetime

TCP uses 32-bit sequence numbers and defines a Maximum Segment Lifetime (MSL, typically 2 minutes). Duplicates older than 2×MSL are assumed to have died from the network. The sequence space is large enough (4 billion) that wraparound during 4 minutes is impossible at any realistic data rate—ensuring even delayed duplicates are distinguishable.

Handling Lost Frames vs. Lost ACKs

A timeout fires—but the sender doesn't know if the frame was lost or the ACK was lost. The protocol must handle both cases correctly.

Case 1: Frame Lost

Frame never reached receiver → No ACK was ever sent → Sender retransmits → Receiver accepts first copy → Everything works as expected.

Case 2: ACK Lost

Original frame reached receiver → Receiver ACKed → ACK was lost → Sender retransmits → Receiver gets duplicate!

The receiver must recognize and discard the duplicate. This is accomplished via sequence numbers:

Receiver tracks 'expected next sequence number'
If received frame has older sequence number → It's a duplicate → Discard data, but re-ACK (sender needs the acknowledgment)

Lost Frame Scenario
Time    Sender              Receiver
================================================
  0  →  Send Frame 0 ──────X (lost)
 50  →  (waiting...)        (nothing received)
100  →  TIMEOUT!            
       Retransmit 0 ───────→ Receive Frame 0 ✓
       Start timer          Send ACK 0
                        ←─── ACK 0 arrives
       Stop timer           
       Send Frame 1 ───────→ ...
 
Result: Recovery after one timeout.
        No duplicates.
        Clean recovery.

Lost ACK Scenario
Time    Sender              Receiver
================================================
  0  →  Send Frame 0 ───────→ Receive Frame 0 ✓
                             Deliver to layer 3
                             expect = 1
                             Send ACK 0
                        ──X── ACK 0 lost!
100  →  TIMEOUT!            
       Retransmit 0 ───────→ Receive Frame 0
       Start timer          seq 0 < expect 1
                             DUPLICATE detected!
                             Discard data
                             Re-send ACK 0
                        ←─── ACK 0 arrives
       Stop timer           
       Send Frame 1 ───────→ ...
 
Result: Duplicate handled correctly.

Always Re-ACK Duplicates

When the receiver detects a duplicate frame, it must re-send the ACK. The duplicate arrived because the sender didn't receive an ACK—if we silently discard without re-ACKing, the sender will keep retransmitting forever. The rule: always ACK valid frames, even duplicates.

Frame Corruption Handling:

If a frame arrives with CRC error:

Option 1: Silent Discard

Receiver discards corrupted frame
Sender times out and retransmits
Simple but adds delay

Option 2: Send NAK

Receiver sends NAK (if protocol supports it)
Sender immediately retransmits
Faster recovery

Option 3: Send Duplicate ACK

If using sliding window, receiver ACKs last good frame
This implicitly signals loss of subsequent frames
Used in TCP fast retransmit

Timeout Estimation: The Goldilocks Problem

Setting the retransmission timeout is a critical design decision. Too short causes premature retransmission (wasted bandwidth, duplicates). Too long causes slow recovery (poor throughput on lossy channels).

The Estimation Challenge:

Ideal timeout = RTT + ε (small safety margin)

But RTT varies due to:

Queue delays at routers/switches
Congestion fluctuations
Path changes
Receiver processing variation

At the Data Link Layer, RTT is typically stable (single-hop), but at transport layer (TCP), RTT can vary dramatically.

Adaptive Timeout Estimation (TCP Approach):

While primarily a transport-layer concern, the principles apply to any retransmission protocol:

adaptive_timeout.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
class AdaptiveTimeout:
    """
    Implements Jacobson/Karels adaptive timeout estimation.
    Used in TCP and applicable to any ARQ protocol.
    """
    def __init__(self, initial_rto: float = 1.0):
        self.srtt = None      # Smoothed RTT estimate
        self.rttvar = None    # RTT variance estimate
        self.rto = initial_rto  # Retransmission timeout
        
        # Smoothing factors (RFC 6298 recommends these)
        self.alpha = 1/8   # SRTT smoothing factor
        self.beta = 1/4    # RTTVAR smoothing factor
        self.K = 4         # Variance multiplier for RTO
        
        # Bounds
        self.min_rto = 0.2  # Minimum RTO (200ms for TCP)
        self.max_rto = 60.0 # Maximum RTO (60 seconds)
    
    def update(self, measured_rtt: float):
        """
        Update timeout estimate based on measured RTT.
        Called when ACK received for a non-retransmitted segment.
        """
        if self.srtt is None:
            # First measurement
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:
            # Subsequent measurements
            # RTTVAR = (1-beta) * RTTVAR + beta * |SRTT - R|
            self.rttvar = (1 - self.beta) * self.rttvar +                           self.beta * abs(self.srtt - measured_rtt)
            
            # SRTT = (1-alpha) * SRTT + alpha * R
            self.srtt = (1 - self.alpha) * self.srtt +                         self.alpha * measured_rtt
        
        # RTO = SRTT + K * RTTVAR
        self.rto = self.srtt + self.K * self.rttvar
        
        # Apply bounds
        self.rto = max(self.min_rto, min(self.max_rto, self.rto))
        
        return self.rto
    
    def backoff(self):
        """
        Exponential backoff after timeout.
        Doubles RTO to avoid congestion collapse.
        """
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
 
 
# Example: Timeout adaptation over varying network conditions
timeout = AdaptiveTimeout(initial_rto=1.0)
 
# Simulate RTT measurements
rtts = [0.1, 0.12, 0.11, 0.15, 0.13, 0.2, 0.18, 0.22, 0.19, 0.17]
print("Adaptive Timeout Estimation:")
print(f"{'RTT':>8} {'SRTT':>8} {'RTTVAR':>8} {'RTO':>8}")
for rtt in rtts:
    rto = timeout.update(rtt)
    print(f"{rtt:8.3f} {timeout.srtt:8.3f} {timeout.rttvar:8.3f} {rto:8.3f}")
 
# Simulate timeout and backoff
print(f"
Timeout occurred! Backing off...")
print(f"RTO after backoff: {timeout.backoff():.3f}")
print(f"RTO after backoff: {timeout.backoff():.3f}")

Retransmission Ambiguity

When an ACK arrives for a retransmitted frame, you can't tell if it's acknowledging the original transmission (delayed ACK) or the retransmission. This makes RTT measurement ambiguous. Karn's algorithm: don't update RTT estimate from retransmitted segments. Only measure RTT from clean, first-transmission ACKs.

Exponential Backoff and Congestion

When retransmission fails repeatedly, something is seriously wrong—possibly network congestion. Continuing to retransmit aggressively makes things worse. Exponential backoff increases timeout after each failure.

The Congestion Problem:

Imagine many senders all experiencing timeouts (due to congestion causing losses). If they all immediately retransmit:

More traffic enters congested network
More losses occur
More retransmissions
Positive feedback loop → Congestion collapse

Exponential Backoff Solution:

After each timeout:

Double the timeout interval
Retransmit
If timeout again, double again
Continue until maximum reached or maximum retries exhausted

Binary Exponential Backoff (Ethernet-style)
Ethernet Binary Exponential Backoff for Collision:
 
After n collisions (n = 1, 2, 3, ..., 10):
  - Choose random k from [0, 2^n - 1]
  - Wait k × slot_time before retransmit
  
After collisions 10-16:
  - Choose random k from [0, 1023]  (capped at 2^10 - 1)
  - Wait k × slot_time
  
After 16 collisions:
  - Give up, report failure to upper layer
 
Example progression (collision numbers):
┌───────────┬─────────────┬───────────────────┐
│ Collision │ Max Backoff │ Average Wait      │
├───────────┼─────────────┼───────────────────┤
│     1     │     1       │   0.5 slots       │
│     2     │     3       │   1.5 slots       │
│     3     │     7       │   3.5 slots       │
│     4     │    15       │   7.5 slots       │
│     5     │    31       │    15.5 slots     │
│    ...    │    ...      │    ...            │
│    10+    │   1023      │   511.5 slots     │
└───────────┴─────────────┴───────────────────┘
 
Why exponential?
1. Light load: Collisions resolve quickly (low backoff)
2. Heavy load: Collisions spread out over time
3. Random choice: Avoids synchronized retransmission
4. Adapts to congestion level automatically

Benefits of Exponential Backoff

•Adaptive load shedding — Under congestion, retransmissions space out, reducing offered load
•Stability — System naturally finds equilibrium rather than entering positive feedback loops
•Fairness — Randomization prevents synchronized thundering herd problems
•Graceful degradation — Heavy load causes delay but not collapse
•No explicit signaling needed — Each sender independently reduces rate based on local observations

DLL vs. Transport Backoff

Data Link Layer backoff (like Ethernet's) handles collision/channel access. Transport layer backoff (TCP's) handles end-to-end congestion. Both use exponential principles but operate at different layers and timescales. TCP's congestion control is more sophisticated, incorporating window adjustments beyond simple timeout doubling.

Retransmission Limits and Failure Handling

Infinite retransmission isn't practical—resources are consumed, latency grows without bound, and the underlying problem may be permanent (cable unplugged, receiver crashed). Protocols must define when to give up.

Maximum Retry Strategies:

Retransmission Limits in Common Protocols
Protocol	Max Retries	Action After Limit	Rationale
Ethernet CSMA/CD	16	Discard frame, report error	Persistent collision = likely hardware failure
802.11 Wi-Fi	7 (default)	Discard, report to upper layer	Wireless interference may be transient
HDLC	Configurable (N2)	Reset link or report failure	Depends on underlying medium
TCP	~15 (configurable)	Close connection	Several minutes of failure = peer unreachable
UDP	N/A	No retransmission	Unreliable by design; app handles retry

Failure Escalation:

When the DLL exhausts retries, it must notify higher layers:

Return error to network layer — Frame could not be delivered
Update link status — Mark link as potentially down
Trigger link monitoring — Protocol may start sending keepalives
Initiate recovery — Spanning Tree reconvergence, routing protocol updates

The upper layers then decide how to handle it:

TCP retransmits end-to-end
Routing protocols find alternate paths
Application reports error to user

Design Considerations for Retransmission Limits

•Total Maximum Delay — Maximum time = initial_timeout × 2^(max_retries) × factor. Ensure acceptable latency bound.
•Resource Consumption — Each pending retransmission consumes buffer space, timers, and state. Limits prevent exhaustion.
•Failure Detection Speed — Lower limits detect failures faster but may give up prematurely on temporarily congested links.
•Layer Coordination — DLL retransmission interacts with transport retransmission. Excessive DLL retries can confuse TCP congestion control.
•Applications Sensitivity — Real-time apps prefer fast failure; bulk transfer prefers exhaustive retry.

The Head-of-Line Blocking Problem

While waiting for a troubled frame's retransmission, subsequent frames may be stalled (depending on protocol). Go-Back-N retransmits all frames after the lost one. Even Selective Repeat may block if receiver buffer fills. Design retransmission limits considering the total impact on data flow, not just the individual frame.

Summary: Mastering Retransmission

We've explored the mechanisms that enable recovery from errors through retransmission—the coordination required between sender and receiver to ensure corrupted or lost data is resent correctly.

Key Takeaways

•Retransmission handles what FEC cannot — Lost frames, excessive errors, and corruption beyond correction capability all require resending.
•Core building blocks: ACKs, timeouts, sequence numbers — These mechanisms work together to coordinate recovery between sender and receiver.
•Timeout vs. NAK triggers — Timeouts handle all failure cases reliably; NAKs accelerate recovery when errors are detected promptly.
•Sequence numbers prevent duplicates — When ACKs are lost, retransmissions create duplicates; sequence comparison allows discarding them.
•Sequence space must be sized correctly — Selective Repeat requires 2× window size in sequence space to avoid ambiguity.
•Timeout estimation is crucial — Too short = premature retransmit; too long = slow recovery. Adaptive algorithms track RTT variation.
•Exponential backoff prevents congestion collapse — Repeated failures trigger increasing delays, naturally reducing load under stress.
•Retransmission limits bound resource consumption — Infinite retry isn't practical; failure must eventually escalate to higher layers.

What's Next:

We now understand individual retransmission mechanisms. The final page brings it all together: ARQ Protocols—the complete protocol designs (Stop-and-Wait, Go-Back-N, Selective Repeat) that orchestrate error detection, retransmission, and flow control into coherent, reliable communication systems.

Page Complete

You now understand the mechanics of retransmission—from triggering conditions and duplicate handling to timeout estimation and backoff strategies. These mechanisms form the foundation of reliable protocols. Next, we'll see how they're assembled into complete ARQ protocols.

4 / 5

Loading learning content...

Computer NetworksError Control

Error Control in the Data Link Layer

LevelIntermediate

Duration60 mins

TopicError Control

4 / 5

Retransmission

When Detection Fails to Correct: The Need for Retransmission

Error detection tells us that something went wrong. Forward Error Correction can fix minor damage. But what happens when:

Detected errors exceed FEC correction capability?
The channel is too reliable to justify FEC overhead?
A frame is lost entirely (buffer overflow, collision)?

Retransmission mechanisms form the backbone of reliable data link layer protocols, ensuring that despite the chaos of physical transmission, data arrives correctly and in order.

What You Will Learn

The Retransmission Problem

Retransmission seems straightforward: if something goes wrong, send it again. But the distributed nature of communication creates subtle challenges.

The Fundamental Difficulty:

The sender and receiver are physically separated, communicating only through the unreliable channel we're trying to make reliable. Each party has incomplete information:

Sender knows: What it sent, when it sent it
Sender doesn't know: Whether it arrived correctly, if ACK was sent
Receiver knows: What it received (if anything), whether CRC passed
Receiver doesn't know: Whether its ACK arrived, if sender is retransmitting

This asymmetry leads to scenarios where correct protocol behavior is non-obvious.

Failure Scenarios Requiring Careful Handling

•Frame corrupted in transit — Receiver detects error (CRC fails), discards frame. How does sender learn to retransmit?
•Frame lost entirely — Buffer overflow, collision damage, or routing failure. Receiver never sees it—cannot even send NAK.
•ACK corrupted — Sender can't interpret response. Is it ACK or NAK? Retransmit or wait?
•ACK lost — Receiver is happy; sender never learns it. Sender may retransmit—creating duplicate.
•Premature timeout — Network delay exceeds sender's patience. Sender retransmits; receiver gets duplicate.
•Out-of-order delivery — Late retransmission arrives after receiver already accepted a later frame.

The Duplicate Problem

Building Blocks of Retransmission Protocols

Reliable retransmission protocols are built from a small set of fundamental mechanisms. Understanding each component is essential before studying complete protocols.

1. Acknowledgments (ACKs)

Positive acknowledgment confirms successful receipt. When the receiver gets a valid frame, it sends an ACK back to the sender. The ACK may include:

Sequence number of the acknowledged frame
Possibly cumulative acknowledgment ('all frames up to N received')
Sometimes piggyback on a data frame going the other direction

2. Negative Acknowledgments (NAKs)

Explicit signal that something went wrong. NAKs can accelerate recovery by immediately requesting retransmission rather than waiting for timeout. However, NAKs add complexity:

Receiver must detect the error to send NAK
Lost frames produce no NAK (nothing received to NAK)
NAK can itself be lost or corrupted

Many protocols omit NAKs entirely, relying solely on timeouts.

3. Timeouts

The sender sets a timer when transmitting a frame. If no ACK arrives before timeout, assume something failed and retransmit. Timeout design is critical:

Too short: Premature retransmission wastes bandwidth, creates duplicates
Too long: Slow recovery, poor throughput on lossy channels
Optimal: Approximately RTT + processing time + safety margin

Timeout Considerations by Network Type
Network Type	Typical RTT	Recommended Timeout	Considerations
Local Ethernet	< 1 ms	10-50 ms	Fast but rare losses; timeout rarely triggers
Wi-Fi LAN	1-10 ms	50-100 ms	Variable delays; retransmissions more common
WAN (Internet)	10-100 ms	200-500 ms	Highly variable; adaptive timeout essential
Satellite	500-700 ms	2-3 seconds	Long but predictable; FEC preferred over ARQ
Deep Space	3-22 minutes	N/A	Retransmission impractical; must use FEC

4. Sequence Numbers

Uniquely identify each frame to:

Match ACKs to the correct frame
Detect duplicates (same sequence number received twice)
Maintain ordering (know which frame comes next)

Sequence numbers don't need to count forever—they cycle through a finite range. The range must be large enough to distinguish frames 'in flight' from their predecessors/successors.

5. Retransmission Buffer

The sender must retain transmitted frames until acknowledged. If retransmission is needed, the frame is retransmitted from buffer. Buffer management:

Allocate buffer space for each outstanding frame
Release buffer when ACK confirms delivery
Limit outstanding frames to available buffer space

Piggybacking: Efficiency Through Combination

Retransmission Triggers: When to Resend

A protocol must decide what event triggers retransmission. Different approaches have different tradeoffs:

Timeout-Based Retransmission:

The simplest approach: if no ACK arrives within timeout period, retransmit.

Advantages:

Handles all failure cases (lost frame, lost ACK, corruption)
Simple implementation
No NAK mechanism required

Disadvantages:

Always waits full timeout even when error is known
Must estimate timeout correctly (hard in variable-delay networks)
Premature timeout wastes bandwidth

timeout_retransmission.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
import time
import threading
 
class TimeoutRetransmitter:
    """
    Simple timeout-based retransmission implementation.
    """
    def __init__(self, timeout_ms: float, max_retries: int = 5):
        self.timeout = timeout_ms / 1000  # Convert to seconds
        self.max_retries = max_retries
        self.pending_acks = {}  # seq_num -> (frame, timer, retries)
        self.lock = threading.Lock()
    
    def send_frame(self, seq_num: int, frame: bytes, transmit_func):
        """Send a frame and start timeout timer."""
        with self.lock:
            if seq_num in self.pending_acks:
                raise ValueError(f"Sequence {seq_num} already pending")
            
            # Transmit the frame
            transmit_func(frame)
            
            # Start timeout timer
            timer = threading.Timer(
                self.timeout, 
                self._handle_timeout,
                args=(seq_num, frame, transmit_func, 1)
            )
            timer.start()
            
            self.pending_acks[seq_num] = (frame, timer, 0)
    
    def _handle_timeout(self, seq_num: int, frame: bytes, 
                        transmit_func, retry_count: int):
        """Handle timeout: retransmit or give up."""
        with self.lock:
            if seq_num not in self.pending_acks:
                return  # Already ACKed, ignore
            
            if retry_count >= self.max_retries:
                print(f"Frame {seq_num}: max retries exceeded, giving up")
                del self.pending_acks[seq_num]
                # Notify upper layer of failure
                return
            
            print(f"Frame {seq_num}: timeout, retransmit #{retry_count}")
            transmit_func(frame)
            
            # Restart timer
            timer = threading.Timer(
                self.timeout,
                self._handle_timeout,
                args=(seq_num, frame, transmit_func, retry_count + 1)
            )
            timer.start()
            self.pending_acks[seq_num] = (frame, timer, retry_count)
    
    def receive_ack(self, seq_num: int):
        """Process incoming ACK."""
        with self.lock:
            if seq_num in self.pending_acks:
                frame, timer, retries = self.pending_acks[seq_num]
                timer.cancel()  # Stop the timeout
                del self.pending_acks[seq_num]
                print(f"Frame {seq_num}: ACKed after {retries} retries")

NAK-Based Retransmission:

Receiver explicitly requests retransmission when error is detected.

Advantages:

Faster recovery (no timeout wait)
Sender learns immediately of corruption

Disadvantages:

Cannot handle lost frames (nothing received to NAK)
NAK can itself be lost
More complex receiver logic
Duplicate NAKs can cause redundant retransmissions

Hybrid Approach:

Most real protocols use both:

NAK for immediate notification of detected errors
Timeout as backup for lost frames or lost NAKs
NAK often rate-limited to avoid duplicate retransmissions

Fast Retransmit Techniques

•Duplicate ACKs — In sliding window protocols, if receiver gets out-of-order frame, it resends ACK for last in-order frame. Three duplicate ACKs signal loss—retransmit without waiting for timeout.
•SACK (Selective ACK) — Receiver reports exactly which frames it has, letting sender target specific gaps for retransmission.
•Explicit Gap Notification — Some protocols let receiver specify 'I'm missing frame N'—more informative than simple NAK.

Sequence Number Space and Wraparound

Sequence numbers are finite—they eventually wrap around to 0. This creates the risk of confusing old frames with new ones. Sequence number space design requires careful analysis.

The Wraparound Problem:

Imagine 3-bit sequence numbers (0-7). Sender sends frames 0, 1, 2, ..., 7, 0, 1, 2, ... If a very old duplicate of 'frame 0' suddenly appears, the receiver might accept it as the new frame 0.

Sequence Number Requirements:

For window-based protocols, the sequence number space must satisfy:

Stop-and-Wait: Only 1 frame outstanding → 1 bit (0, 1) sufficient Go-Back-N: Window size W → Sequence space ≥ W + 1 Selective Repeat: Window size W → Sequence space ≥ 2W

The difference arises because Selective Repeat accepts out-of-order frames, requiring the receiver to distinguish between:

A new frame with sequence S
A retransmission of old frame S (within current window)
A very old duplicate of ancient frame S

Sequence Number Space Requirements
Protocol	Sequence Space	With 3-bit Seq (0-7)	Reason
Stop-and-Wait	≥ 2	Window = 1 ✓	Distinguish current from previous frame
Go-Back-N	≥ W + 1	Window ≤ 7	After W frames, first seq reused for W+1st
Selective Repeat	≥ 2W	Window ≤ 4	Sender window + receiver window cannot overlap

Selective Repeat Sequence Space Example
Why Selective Repeat needs sequence space ≥ 2W:
 
Scenario with 3-bit sequence (0-7) and window size 5 (WRONG):
 
Sender window: [0, 1, 2, 3, 4]  - sends all 5 frames
Receiver window: expects [0, 1, 2, 3, 4]
 
All 5 ACKs are lost in transit!
 
Sender times out, retransmits frame 0.
Meanwhile, sender's window conceptually wants to advance to [5, 6, 7, 0, 1]
                                                               ↑
                                                     Reusing 0!
 
Receiver sees "frame 0" arrive - is this:
  (a) The retransmitted old frame 0? - should discard
  (b) The new frame 0 from window [5,6,7,0,1]? - should accept
 
AMBIGUOUS! Receiver cannot distinguish.
 
With sequence space ≥ 2W (i.e., ≥ 10), or equivalently window ≤ 4:
 
Sender window: [0, 1, 2, 3]  - sends 4 frames
Receiver window: expects [0, 1, 2, 3]
 
All ACKs lost, sender retransmits frame 0.
Sender's next window would be [4, 5, 6, 7] - no overlap with [0,1,2,3]!
 
Receiver can now distinguish:
  - If expecting [0,1,2,3]: Accept frame 0 (or discard as dup)
  - If already at [4,5,6,7]: Definitely old, discard

Maximum Segment Lifetime

Handling Lost Frames vs. Lost ACKs

A timeout fires—but the sender doesn't know if the frame was lost or the ACK was lost. The protocol must handle both cases correctly.

Case 1: Frame Lost

Frame never reached receiver → No ACK was ever sent → Sender retransmits → Receiver accepts first copy → Everything works as expected.

Case 2: ACK Lost

Original frame reached receiver → Receiver ACKed → ACK was lost → Sender retransmits → Receiver gets duplicate!

The receiver must recognize and discard the duplicate. This is accomplished via sequence numbers:

Receiver tracks 'expected next sequence number'
If received frame has older sequence number → It's a duplicate → Discard data, but re-ACK (sender needs the acknowledgment)

Lost Frame Scenario
Time    Sender              Receiver
================================================
  0  →  Send Frame 0 ──────X (lost)
 50  →  (waiting...)        (nothing received)
100  →  TIMEOUT!            
       Retransmit 0 ───────→ Receive Frame 0 ✓
       Start timer          Send ACK 0
                        ←─── ACK 0 arrives
       Stop timer           
       Send Frame 1 ───────→ ...
 
Result: Recovery after one timeout.
        No duplicates.
        Clean recovery.

Lost ACK Scenario
Time    Sender              Receiver
================================================
  0  →  Send Frame 0 ───────→ Receive Frame 0 ✓
                             Deliver to layer 3
                             expect = 1
                             Send ACK 0
                        ──X── ACK 0 lost!
100  →  TIMEOUT!            
       Retransmit 0 ───────→ Receive Frame 0
       Start timer          seq 0 < expect 1
                             DUPLICATE detected!
                             Discard data
                             Re-send ACK 0
                        ←─── ACK 0 arrives
       Stop timer           
       Send Frame 1 ───────→ ...
 
Result: Duplicate handled correctly.

Always Re-ACK Duplicates

Frame Corruption Handling:

If a frame arrives with CRC error:

Option 1: Silent Discard

Receiver discards corrupted frame
Sender times out and retransmits
Simple but adds delay

Option 2: Send NAK

Receiver sends NAK (if protocol supports it)
Sender immediately retransmits
Faster recovery

Option 3: Send Duplicate ACK

If using sliding window, receiver ACKs last good frame
This implicitly signals loss of subsequent frames
Used in TCP fast retransmit

Timeout Estimation: The Goldilocks Problem

The Estimation Challenge:

Ideal timeout = RTT + ε (small safety margin)

But RTT varies due to:

Queue delays at routers/switches
Congestion fluctuations
Path changes
Receiver processing variation

At the Data Link Layer, RTT is typically stable (single-hop), but at transport layer (TCP), RTT can vary dramatically.

Adaptive Timeout Estimation (TCP Approach):

While primarily a transport-layer concern, the principles apply to any retransmission protocol:

adaptive_timeout.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
class AdaptiveTimeout:
    """
    Implements Jacobson/Karels adaptive timeout estimation.
    Used in TCP and applicable to any ARQ protocol.
    """
    def __init__(self, initial_rto: float = 1.0):
        self.srtt = None      # Smoothed RTT estimate
        self.rttvar = None    # RTT variance estimate
        self.rto = initial_rto  # Retransmission timeout
        
        # Smoothing factors (RFC 6298 recommends these)
        self.alpha = 1/8   # SRTT smoothing factor
        self.beta = 1/4    # RTTVAR smoothing factor
        self.K = 4         # Variance multiplier for RTO
        
        # Bounds
        self.min_rto = 0.2  # Minimum RTO (200ms for TCP)
        self.max_rto = 60.0 # Maximum RTO (60 seconds)
    
    def update(self, measured_rtt: float):
        """
        Update timeout estimate based on measured RTT.
        Called when ACK received for a non-retransmitted segment.
        """
        if self.srtt is None:
            # First measurement
            self.srtt = measured_rtt
            self.rttvar = measured_rtt / 2
        else:
            # Subsequent measurements
            # RTTVAR = (1-beta) * RTTVAR + beta * |SRTT - R|
            self.rttvar = (1 - self.beta) * self.rttvar +                           self.beta * abs(self.srtt - measured_rtt)
            
            # SRTT = (1-alpha) * SRTT + alpha * R
            self.srtt = (1 - self.alpha) * self.srtt +                         self.alpha * measured_rtt
        
        # RTO = SRTT + K * RTTVAR
        self.rto = self.srtt + self.K * self.rttvar
        
        # Apply bounds
        self.rto = max(self.min_rto, min(self.max_rto, self.rto))
        
        return self.rto
    
    def backoff(self):
        """
        Exponential backoff after timeout.
        Doubles RTO to avoid congestion collapse.
        """
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
 
 
# Example: Timeout adaptation over varying network conditions
timeout = AdaptiveTimeout(initial_rto=1.0)
 
# Simulate RTT measurements
rtts = [0.1, 0.12, 0.11, 0.15, 0.13, 0.2, 0.18, 0.22, 0.19, 0.17]
print("Adaptive Timeout Estimation:")
print(f"{'RTT':>8} {'SRTT':>8} {'RTTVAR':>8} {'RTO':>8}")
for rtt in rtts:
    rto = timeout.update(rtt)
    print(f"{rtt:8.3f} {timeout.srtt:8.3f} {timeout.rttvar:8.3f} {rto:8.3f}")
 
# Simulate timeout and backoff
print(f"
Timeout occurred! Backing off...")
print(f"RTO after backoff: {timeout.backoff():.3f}")
print(f"RTO after backoff: {timeout.backoff():.3f}")

Retransmission Ambiguity

Exponential Backoff and Congestion

The Congestion Problem:

Imagine many senders all experiencing timeouts (due to congestion causing losses). If they all immediately retransmit:

More traffic enters congested network
More losses occur
More retransmissions
Positive feedback loop → Congestion collapse

Exponential Backoff Solution:

After each timeout:

Double the timeout interval
Retransmit
If timeout again, double again
Continue until maximum reached or maximum retries exhausted

Binary Exponential Backoff (Ethernet-style)
Ethernet Binary Exponential Backoff for Collision:
 
After n collisions (n = 1, 2, 3, ..., 10):
  - Choose random k from [0, 2^n - 1]
  - Wait k × slot_time before retransmit
  
After collisions 10-16:
  - Choose random k from [0, 1023]  (capped at 2^10 - 1)
  - Wait k × slot_time
  
After 16 collisions:
  - Give up, report failure to upper layer
 
Example progression (collision numbers):
┌───────────┬─────────────┬───────────────────┐
│ Collision │ Max Backoff │ Average Wait      │
├───────────┼─────────────┼───────────────────┤
│     1     │     1       │   0.5 slots       │
│     2     │     3       │   1.5 slots       │
│     3     │     7       │   3.5 slots       │
│     4     │    15       │   7.5 slots       │
│     5     │    31       │    15.5 slots     │
│    ...    │    ...      │    ...            │
│    10+    │   1023      │   511.5 slots     │
└───────────┴─────────────┴───────────────────┘
 
Why exponential?
1. Light load: Collisions resolve quickly (low backoff)
2. Heavy load: Collisions spread out over time
3. Random choice: Avoids synchronized retransmission
4. Adapts to congestion level automatically

Benefits of Exponential Backoff

•Adaptive load shedding — Under congestion, retransmissions space out, reducing offered load
•Stability — System naturally finds equilibrium rather than entering positive feedback loops
•Fairness — Randomization prevents synchronized thundering herd problems
•Graceful degradation — Heavy load causes delay but not collapse
•No explicit signaling needed — Each sender independently reduces rate based on local observations

DLL vs. Transport Backoff

Retransmission Limits and Failure Handling

Maximum Retry Strategies:

Retransmission Limits in Common Protocols
Protocol	Max Retries	Action After Limit	Rationale
Ethernet CSMA/CD	16	Discard frame, report error	Persistent collision = likely hardware failure
802.11 Wi-Fi	7 (default)	Discard, report to upper layer	Wireless interference may be transient
HDLC	Configurable (N2)	Reset link or report failure	Depends on underlying medium
TCP	~15 (configurable)	Close connection	Several minutes of failure = peer unreachable
UDP	N/A	No retransmission	Unreliable by design; app handles retry

Failure Escalation:

When the DLL exhausts retries, it must notify higher layers:

Return error to network layer — Frame could not be delivered
Update link status — Mark link as potentially down
Trigger link monitoring — Protocol may start sending keepalives
Initiate recovery — Spanning Tree reconvergence, routing protocol updates

The upper layers then decide how to handle it:

TCP retransmits end-to-end
Routing protocols find alternate paths
Application reports error to user

Design Considerations for Retransmission Limits

•Total Maximum Delay — Maximum time = initial_timeout × 2^(max_retries) × factor. Ensure acceptable latency bound.
•Resource Consumption — Each pending retransmission consumes buffer space, timers, and state. Limits prevent exhaustion.
•Failure Detection Speed — Lower limits detect failures faster but may give up prematurely on temporarily congested links.
•Layer Coordination — DLL retransmission interacts with transport retransmission. Excessive DLL retries can confuse TCP congestion control.
•Applications Sensitivity — Real-time apps prefer fast failure; bulk transfer prefers exhaustive retry.

The Head-of-Line Blocking Problem

Summary: Mastering Retransmission

We've explored the mechanisms that enable recovery from errors through retransmission—the coordination required between sender and receiver to ensure corrupted or lost data is resent correctly.

Key Takeaways

•Retransmission handles what FEC cannot — Lost frames, excessive errors, and corruption beyond correction capability all require resending.
•Core building blocks: ACKs, timeouts, sequence numbers — These mechanisms work together to coordinate recovery between sender and receiver.
•Timeout vs. NAK triggers — Timeouts handle all failure cases reliably; NAKs accelerate recovery when errors are detected promptly.
•Sequence numbers prevent duplicates — When ACKs are lost, retransmissions create duplicates; sequence comparison allows discarding them.
•Sequence space must be sized correctly — Selective Repeat requires 2× window size in sequence space to avoid ambiguity.
•Timeout estimation is crucial — Too short = premature retransmit; too long = slow recovery. Adaptive algorithms track RTT variation.
•Exponential backoff prevents congestion collapse — Repeated failures trigger increasing delays, naturally reducing load under stress.
•Retransmission limits bound resource consumption — Infinite retry isn't practical; failure must eventually escalate to higher layers.

What's Next:

Page Complete

4 / 5