TCP Overview - Learning Module

Reliable Delivery

The Promise of Reliability

When you send an email, you expect it to arrive complete—not with missing paragraphs or scrambled sentences. When you download a file, you expect every byte to match the original—not a corrupted version missing chunks. When you submit a bank transaction, you expect it to either succeed completely or fail cleanly—not partially execute with uncertain results.

This expectation of reliability is so fundamental that we rarely think about it. Yet the underlying internet provides no such guarantee. IP packets can be lost to congestion, corrupted by transmission errors, duplicated by retransmissions, or delivered out of order. The network is chaos.

TCP transforms this chaos into order. It takes the internet's "best-effort" delivery and builds upon it a rock-solid guarantee: your data will arrive correctly and completely, or the connection will fail with an explicit error. There is no middle ground, no partial success, no silent data loss.

In this page, we'll explore the mechanisms that make this possible—the elegant engineering that converts unreliable packet delivery into reliable stream transport.

What You Will Learn

By the end of this page, you will understand how TCP achieves reliable delivery through sequence numbers, acknowledgments, retransmission mechanisms, duplicate detection, and checksum verification. You'll see how these mechanisms work together as a coordinated system, and understand the trade-offs TCP makes between reliability and performance.

The Reliability Problem

To appreciate TCP's reliability mechanisms, we must first understand what can go wrong in a packet-switched network:

Types of packet errors:

Network Layer Unreliability

•Packet Loss — Routers drop packets when queues overflow (congestion). Bit errors may cause checksum failures. Hardware failures can lose packets.
•Packet Corruption — Electromagnetic interference, cosmic rays, faulty hardware can flip bits. Corrupted packets may have invalid checksums or undetected errors.
•Packet Reordering — Different packets may take different paths with different delays. A packet sent first may arrive last.
•Packet Duplication — Retransmissions (at any layer) can cause the same data to arrive multiple times.
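•Packet Delay — Network congestion, routing changes, or long paths can delay packets significantly; a stray packet may surface long after it was sent (TCP's design assumes a Maximum Segment Lifetime, 2 minutes in RFC 793).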

IP's position: "Not my problem"

The Internet Protocol explicitly provides only best-effort delivery. RFC 791 states that IP "does not provide a reliable communication facility": there are no acknowledgments, no retransmissions, and no error control for data beyond a header checksum. This isn't a bug; it's a deliberate design decision that keeps the network layer simple and scalable.

But most applications need reliability: a file transfer, a web page, or a database update cannot tolerate missing or corrupted bytes. Something must fill this gap, and that something is TCP.

The reliability requirements:

Complete delivery: Every byte sent must eventually arrive or the sender must know it failed
No corruption: Data must be checked and corrupted data rejected
No duplicates: The same data shouldn't be delivered multiple times
Correct ordering: Data must be delivered in the exact order it was sent

The End-to-End Argument

Reliability is implemented at the endpoints (TCP hosts) rather than in the network (routers) because endpoints can do it correctly and completely—the network cannot. A router that retransmits lost packets doesn't know if the ultimate destination received them. Only the final destination can confirm receipt. This is the essence of the end-to-end principle.

Sequence Numbers: The Foundation of Reliability

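TCP's reliability begins with sequence numbers. Every byte in the data stream is assigned a 32-bit sequence number, which identifies it uniquely among the data in flight (the space eventually wraps around, as discussed below). These numbers are the foundation for all other reliability mechanisms.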

Byte-oriented numbering:

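Unlike protocols that number packets or messages, TCP numbers individual bytes. If a connection's Initial Sequence Number (ISN) is 1000 and the sender sends 500 bytes, those bytes are numbered 1000-1499, and the next segment starts at sequence 1500. (For simplicity, the examples on this page ignore the fact that the SYN itself consumes the ISN; in a real connection the first data byte is ISN + 1.)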

ISN = 1000

First segment:   Seq=1000, Len=500  →  Bytes 1000-1499
Second segment:  Seq=1500, Len=500  →  Bytes 1500-1999
Third segment:   Seq=2000, Len=300  →  Bytes 2000-2299
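
The numbering rule above amounts to a running sum of segment lengths. A minimal sketch with a hypothetical helper, `segment_ranges`, following the page's simplified convention that the first data byte is numbered ISN and ignoring 32-bit wraparound:

```python
def segment_ranges(isn, lengths):
    """Return (seq, last_byte) for each consecutive segment.

    Simplified like the example above: data starts at the ISN (in a real
    connection the SYN consumes the ISN, so data starts at ISN + 1), and
    32-bit wraparound is ignored.
    """
    ranges = []
    seq = isn
    for length in lengths:
        ranges.append((seq, seq + length - 1))
        seq += length  # the next segment starts right after this one's last byte
    return ranges


for seq, last in segment_ranges(1000, [500, 500, 300]):
    print(f"Seq={seq}  ->  Bytes {seq}-{last}")
```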

This byte-oriented approach enables TCP to:

Identify exactly which bytes are missing (gaps in sequence space)
Retransmit only lost portions, not entire messages
Handle segments of varying sizes efficiently

Sequence Number Functions:

Sequence numbers serve multiple critical purposes:

Uses of TCP Sequence Numbers
Function	How Sequence Numbers Help	Example
Gap Detection	Receiver identifies missing bytes by sequence gaps	Got 1000-1499, 2000-2499... where's 1500-1999?
Reordering	Receiver reassembles out-of-order segments correctly	Segments 3,1,2 arrive → assemble as 1,2,3
Duplicate Detection	Receiver discards bytes with already-received sequence numbers	Retransmission of bytes 1000-1499 ignored if already received
Acknowledgment	Receiver tells sender which bytes arrived via ACK	ACK=2000 means 'received all bytes before 2000'
Retransmission	Sender knows exactly which bytes to retransmit	No ACK for 1500-1999? Retransmit that range

sequence_space.txt

Visualization

TCP Sequence Space (32-bit, wraps at 2^32)
 
        0                           2^31                        2^32-1
        |---------------------------|---------------------------|
                                    ↻ wraps around to 0
 
For a connection with ISN=1000:
        
        1000        1500        2000        2500        3000
        |-----------|-----------|-----------|-----------|
        ↑           ↑           ↑           ↑
        |           |           |           Next to send (SND.NXT)
        |           |           Start of last segment sent
        |           Last acknowledged (SND.UNA)
        ISN (connection start)
 
Send Window (bytes sender can transmit):
        
        SND.UNA             SND.NXT                 SND.UNA + SND.WND
   ...  |===================|~~~~~~~~~~~~~~~~~~~~~~~|  ...
 ACKed    Sent but unACKed    Can send now            Beyond window
 (can     (kept for           (not sent yet)          (must wait)
 discard) retransmission)
 
Receive Window (bytes receiver expects):
 
        RCV.NXT                                     RCV.NXT + RCV.WND
        |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
        |                                            |
        Next expected                                First byte beyond the window
        (anything before is duplicate/old)           (this and later bytes are too far ahead)

Sequence Number Wraparound

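At 32 bits, sequence numbers wrap around after 4 GiB (2^32 bytes) of data. On very high-speed connections (10 Gbps and above), this can happen in a few seconds. PAWS (Protection Against Wrapped Sequences, RFC 7323) uses TCP timestamps to distinguish old segments from new ones that happen to carry the same sequence numbers. Without PAWS, an old duplicate segment delayed in the network could reappear with a sequence number that is valid again after wraparound, be accepted as new data, and silently corrupt the stream.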
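
Because the space wraps, "earlier" and "later" must be decided with modular arithmetic rather than a plain `<`. A minimal sketch, assuming the common serial-number convention of treating the difference as a signed 32-bit value; `seq_before`, `seconds_to_wrap`, and `paws_accept` are hypothetical names, and the PAWS check is heavily simplified (it ignores, for example, RFC 7323's special handling of long-idle connections):

```python
SEQ_SPACE = 2 ** 32


def seq_before(a, b):
    """True if 32-bit sequence number a precedes b, allowing for wraparound.

    (a - b) mod 2**32 lands in the upper half of the space exactly when
    the difference is negative as a signed 32-bit integer.
    """
    return (a - b) % SEQ_SPACE >= 2 ** 31


def seconds_to_wrap(bits_per_second):
    """Time to send 2**32 bytes, i.e. one full trip around the sequence space."""
    return SEQ_SPACE * 8 / bits_per_second


def paws_accept(ts_val, ts_recent):
    """Simplified PAWS test: drop a segment whose timestamp is older than the
    most recent timestamp accepted on this connection. Timestamps are also
    32-bit values, compared with the same modular rule."""
    return not seq_before(ts_val, ts_recent)


print(seq_before(0xFFFFFFF0, 0x10))            # True: 0x10 comes after the wrap
print(f"{seconds_to_wrap(10e9):.2f} s")        # 3.44 s at 10 Gbps
print(paws_accept(ts_val=500, ts_recent=900))  # False: stale timestamp
```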

The Acknowledgment Mechanism

Sequence numbers would be useless without a way for the receiver to tell the sender what arrived. TCP's acknowledgment (ACK) mechanism closes this loop.

Cumulative Acknowledgments:

TCP uses cumulative ACKs: the ACK number indicates the next byte the receiver expects—meaning all bytes before that number have been received.

ACK = 2000  →  "I have received all bytes up to (but not including) 2000.
                I'm expecting byte 2000 next."

This cumulative approach has a major advantage: if an ACK gets lost, the next ACK still confirms everything. If ACK 2000 is lost but ACK 2500 arrives, the sender knows every byte before 2500 was received.

However, cumulative ACKs have a drawback: they can't tell the sender which segments arrived after a gap. If the segment carrying bytes 1500-1999 is lost but bytes 2000-2999 arrive, the receiver can only keep sending ACK=1500 ("I'm still waiting for 1500").
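
Both sides of this behaviour fit in a few lines. A minimal sketch, assuming received data is tracked as inclusive byte ranges and ignoring wraparound (`cumulative_ack` and `on_ack` are hypothetical helpers):

```python
def cumulative_ack(rcv_nxt, received):
    """Receiver: advance rcv_nxt over contiguous received ranges.

    received holds inclusive (first, last) byte ranges. The ACK can never
    move past a hole, no matter how much later data has arrived.
    """
    for first, last in sorted(received):
        if first > rcv_nxt:
            break                       # hole at rcv_nxt: the ACK stops here
        rcv_nxt = max(rcv_nxt, last + 1)
    return rcv_nxt


def on_ack(snd_una, ack):
    """Sender: a cumulative ACK confirms every byte below it."""
    return max(snd_una, ack)


print(cumulative_ack(1000, [(1000, 1499), (2000, 2999)]))  # 1500: hole at 1500-1999
print(on_ack(1000, 2500))  # 2500: ACK 2000 was lost, but ACK 2500 covers it
```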

Selective Acknowledgments (SACK):

SACK is a TCP option that addresses the cumulative ACK limitation. SACK allows the receiver to report non-contiguous blocks of received data:

ACK=1500
SACK blocks: (2000-2499), (3000-3499)

This tells the sender: "I'm missing 1500-1999 and 2500-2999, but I have 2000-2499 and 3000-3499." The sender can now retransmit only the missing ranges instead of everything from 1500 onward.
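
On the sender side, the missing ranges fall out of the ACK and SACK blocks directly. A minimal sketch, assuming SACK blocks are given as inclusive byte ranges like the example above and ignoring wraparound; `sack_holes` is a hypothetical helper, and real senders follow the more careful loss-recovery rules of RFC 6675:

```python
def sack_holes(ack, sack_blocks):
    """Return inclusive byte ranges between the cumulative ACK and the
    highest SACKed byte that the receiver has not reported.

    Data above the highest SACK block is not included: it may simply
    still be in flight.
    """
    holes = []
    cursor = ack                                # first byte not known to be received
    for first, last in sorted(sack_blocks):
        if first > cursor:
            holes.append((cursor, first - 1))   # gap before this block
        cursor = max(cursor, last + 1)
    return holes


print(sack_holes(1500, [(2000, 2499), (3000, 3499)]))
# [(1500, 1999), (2500, 2999)]
```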

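SACK format in TCP options (each edge is a 32-bit sequence number; the right edge is the sequence number just past the block, so bytes 2000-2499 are sent as left edge 2000, right edge 2500):

Kind	Length	Block 1 Left Edge	Block 1 Right Edge	Block 2 Left Edge	Block 2 Right Edge	...
5	2+8*n	32-bit	32-bit	32-bit	32-bit

Here n is the number of blocks. The 40-byte option space allows at most 4 blocks, or 3 when the timestamp option is also present.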

SACK must be negotiated during the handshake (SACK-Permitted option in SYN). If both sides support SACK, the receiver can use it to report gaps precisely.
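
To make the byte layout concrete, here is a sketch that encodes the example blocks with Python's standard `struct` module. `encode_sack_option` is a hypothetical helper; it takes inclusive byte ranges and converts each right edge to last byte + 1, as the wire format expects. Choosing the block order (RFC 2018 wants the block containing the most recently received segment first) is left to the caller:

```python
import struct

SACK_KIND = 5  # TCP option kind for SACK (RFC 2018)


def encode_sack_option(blocks):
    """Encode SACK blocks given as inclusive (first, last) byte ranges."""
    if not 1 <= len(blocks) <= 4:
        raise ValueError("a SACK option carries 1 to 4 blocks")
    length = 2 + 8 * len(blocks)         # kind + length bytes, then 8 bytes per block
    option = struct.pack("!BB", SACK_KIND, length)
    for first, last in blocks:
        left_edge = first % 2**32
        right_edge = (last + 1) % 2**32  # sequence number just past the block
        option += struct.pack("!II", left_edge, right_edge)
    return option


opt = encode_sack_option([(2000, 2499), (3000, 3499)])
print(len(opt), opt.hex())  # 18 0512000007d0000009c400000bb800000dac
```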

Delayed ACKs

TCP doesn't necessarily send an ACK for every segment received. With delayed ACKs, the receiver waits briefly, hoping to piggyback the ACK on outgoing data. RFC 5681 requires an ACK for at least every second full-sized segment and caps the delay at 500 ms; real stacks use shorter timers (Linux, for example, adapts its delay between 40 ms and 200 ms). If no data is pending when the timer fires, a bare ACK is sent. This reduces ACK traffic but can hurt latency. Segments that arrive out of order, or that fill a gap, are acknowledged immediately to help the sender detect and recover from loss.
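
The receiver-side decision can be sketched as a small state machine. This is a simplified model, assuming every in-order segment is full-sized and leaving the timer itself abstract; `DelayedAcker` is a hypothetical class, and piggybacking on outgoing data is not modelled:

```python
class DelayedAcker:
    """Receiver-side delayed-ACK decisions (simplified model)."""

    def __init__(self, ack_every=2):
        self.ack_every = ack_every  # ACK at least every second full-sized segment
        self.pending = 0            # in-order segments received but not yet ACKed

    def on_segment(self, in_order, fills_gap=False):
        """Return True if an ACK should be sent immediately."""
        if not in_order or fills_gap:
            self.pending = 0
            return True             # out-of-order or gap-filling: ACK at once
        self.pending += 1
        if self.pending >= self.ack_every:
            self.pending = 0
            return True
        return False                # hold the ACK and let the delay timer run

    def on_timer(self):
        """Delay timer fired: send an ACK if anything is still unacknowledged."""
        if self.pending:
            self.pending = 0
            return True
        return False


acker = DelayedAcker()
print([acker.on_segment(in_order) for in_order in [True, True, True, False]])
# [False, True, False, True]
```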

Retransmission Strategies

When data goes unacknowledged, TCP must retransmit. But when should it retransmit? Retransmitting too quickly wastes bandwidth on data that's merely delayed. Waiting too long leaves the receiver waiting unnecessarily. TCP uses multiple strategies to get this balance right.

1. Timeout-Based Retransmission:

The most fundamental mechanism: if an ACK doesn't arrive within the Retransmission Timeout (RTO), TCP assumes the segment was lost and retransmits.

Calculating RTO is tricky—networks have varying and dynamic delays. TCP estimates the Round-Trip Time (RTT) and sets RTO based on it:

RTO = SRTT + 4 × RTTVAR

Where:

SRTT (Smoothed RTT) = exponentially weighted moving average of RTT samples
RTTVAR = smoothed mean deviation of the RTT samples (a cheap stand-in for their variance)

This adaptive timeout adjusts to network conditions—fast networks get short RTOs, slow networks get longer ones.

rto_calculation.py
# TCP RTO calculation following the RFC 6298 formulas.
# Simplifications: the clock-granularity term G is omitted, and Karn's
# algorithm (never take RTT samples from retransmitted segments) is left
# to the caller.
 
class RTOEstimator:
    def __init__(self):
        self.srtt = None       # Smoothed RTT (None means no sample yet)
        self.rttvar = None     # RTT variation (smoothed mean deviation)
        self.rto = 1.0         # Initial RTO = 1 second (RFC 6298)
        
        # Constants from RFC 6298
        self.alpha = 0.125     # SRTT smoothing factor (1/8)
        self.beta = 0.25       # RTTVAR smoothing factor (1/4)
        self.k = 4             # Variance multiplier
        
        # Bounds: RFC 6298 recommends a 1-second minimum, but many stacks
        # (Linux, for example) use 200 ms; any maximum must be >= 60 s.
        self.min_rto = 0.2     # Minimum RTO (200 ms, Linux-style)
        self.max_rto = 60.0    # Maximum RTO (60 seconds)
    
    def update(self, rtt_sample):
        """Update RTO estimate with a new RTT measurement (in seconds)."""
        if self.srtt is None:
            # First measurement - initialize
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # Subsequent measurements - update RTTVAR first, using the old SRTT
            # RTTVAR = (1-β) × RTTVAR + β × |SRTT - R'|
            self.rttvar = (1 - self.beta) * self.rttvar + \
                          self.beta * abs(self.srtt - rtt_sample)
            
            # SRTT = (1-α) × SRTT + α × R'
            self.srtt = (1 - self.alpha) * self.srtt + \
                        self.alpha * rtt_sample
        
        # RTO = SRTT + K × RTTVAR
        self.rto = self.srtt + self.k * self.rttvar
        
        # Clamp to the configured bounds
        self.rto = max(self.min_rto, min(self.max_rto, self.rto))
        
        return self.rto
    
    def timeout_occurred(self):
        """Handle RTO expiration: apply exponential backoff."""
        # Double RTO for each timeout (capped at max_rto)
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
 
# Example usage:
estimator = RTOEstimator()
print(f"Initial RTO: {estimator.rto}s")
 
# Receive RTT samples (about 53 ms on average). With RTTs this short the
# computed RTO stays below min_rto, so every result is clamped to 200 ms.
samples = [0.05, 0.055, 0.048, 0.062, 0.051]
for sample in samples:
    rto = estimator.update(sample)
    print(f"RTT: {sample*1000:.0f}ms → RTO: {rto*1000:.1f}ms")
 
# If a timeout occurs, back off exponentially (200 ms → 400 ms)
print("\nTimeout occurred!")
print(f"New RTO: {estimator.timeout_occurred()*1000:.1f}ms")

2. Fast Retransmit:

Waiting for timeout can be slow—RTOs are typically hundreds of milliseconds. Fast Retransmit accelerates loss detection using duplicate ACKs.

When the receiver gets an out-of-order segment, it immediately sends an ACK repeating the last in-order byte. Each subsequent out-of-order segment triggers another duplicate ACK.

The sender interprets 3 duplicate ACKs (4 ACKs with the same number total) as strong evidence that the next expected segment was lost:

Duplicate ACK could mean: reordering, loss, or duplication
3 duplicate ACKs strongly suggest loss (segments are arriving, but one is missing)
Retransmit immediately without waiting for timeout
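
The sender's side of this rule is essentially a counter. A minimal sketch, assuming the threshold of three duplicates described here and ignoring wraparound; `FastRetransmitDetector` is a hypothetical class, and it skips RFC 5681's finer points (for example, an ACK that carries data or changes the advertised window does not count as a duplicate) as well as what happens after the retransmission (fast recovery):

```python
class FastRetransmitDetector:
    DUP_THRESHOLD = 3  # duplicate ACKs needed before retransmitting

    def __init__(self, snd_una):
        self.snd_una = snd_una  # oldest unacknowledged sequence number
        self.dup_acks = 0

    def on_ack(self, ack):
        """Process an ACK; return the sequence number to retransmit, or None."""
        if ack > self.snd_una:            # new data acknowledged
            self.snd_una = ack
            self.dup_acks = 0
        elif ack == self.snd_una:         # duplicate ACK
            self.dup_acks += 1
            if self.dup_acks == self.DUP_THRESHOLD:
                return self.snd_una       # resend the segment starting here
        return None


sender = FastRetransmitDetector(snd_una=1000)
print([sender.on_ack(a) for a in [1500, 1500, 1500, 1500, 1500]])
# [None, None, None, 1500, None]: the original ACK plus 3 duplicates triggers one retransmit
```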

Why 3 Duplicate ACKs?

Why not retransmit after 1 or 2 duplicate ACKs? Because minor packet reordering is common and not a sign of loss. A packet delayed by one or two positions causes 1-2 duplicate ACKs, then resolves. Three duplicate ACKs represent enough out-of-order segments that loss is likely. This threshold balances early detection against false positives.

Duplicate Detection and Handling

Retransmissions mean the same data might arrive multiple times. TCP must detect and discard duplicates to deliver data exactly once.

Why duplicates occur:

Retransmission of non-lost segment: The original segment was just delayed; both it and the retransmission arrive
Spurious retransmission during recovery: a fast retransmit and a later timeout-based retransmission may both resend the same data
Lower-layer retransmissions: Link layers (Wi-Fi, etc.) may retransmit frames, occasionally causing duplicates
Route changes: Old copies of segments may arrive via old routes long after being retransmitted

How TCP detects duplicates:

Sequence numbers make detection straightforward:

Receiver state: RCV.NXT = 2000 (expecting byte 2000 next)

Incoming segment: Seq=1000, Len=500 (bytes 1000-1499)

Compare: 1000 ≤ 1499 < 2000
Conclusion: All bytes in this segment are before RCV.NXT
           → Duplicate! Discard the data.
           → Still send ACK (ACK=2000) to confirm receipt

The receiver's RCV.NXT serves as a watermark. Any data with sequence numbers below it has already been received and acknowledged, so a new copy is a duplicate.

duplicate_handling.py
# Simplified segment reception logic.
# Sketch only: ignores 32-bit sequence wraparound and does not merge
# buffered out-of-order segments that overlap one another.
from dataclasses import dataclass, field


@dataclass
class Segment:
    sequence_number: int
    data: bytes


@dataclass
class ReceiveBuffer:
    delivered: bytearray = field(default_factory=bytearray)  # in-order bytes for the app
    out_of_order: dict = field(default_factory=dict)         # seq -> bytes awaiting a gap fill

    def append(self, data):
        self.delivered += data


def receive_segment(segment, rcv_nxt, rcv_wnd, receive_buffer):
    """
    Process incoming TCP segment.
    Returns: (updated rcv_nxt, should_ack, ack_value)
    """
    seq = segment.sequence_number
    data = segment.data
    seg_len = len(data)
    seg_end = seq + seg_len  # Last byte + 1
    
    # Calculate receive window boundaries
    rcv_window_end = rcv_nxt + rcv_wnd
    
    # Case 1: Entirely duplicate (completely before RCV.NXT)
    if seg_end <= rcv_nxt:
        print(f"Duplicate segment: {seq}-{seg_end-1} (already received)")
        # Still ACK to help sender know current state
        return rcv_nxt, True, rcv_nxt
    
    # Case 2: Entirely beyond window (too far ahead)
    if seq >= rcv_window_end:
        print(f"Segment beyond window: {seq} > {rcv_window_end-1}")
        # ACK current position, but don't process
        return rcv_nxt, True, rcv_nxt
    
    # Case 3: Partially overlapping with already-received data
    if seq < rcv_nxt:
        # Trim the duplicate prefix
        trim_bytes = rcv_nxt - seq
        data = data[trim_bytes:]
        seq = rcv_nxt
        print(f"Trimmed {trim_bytes} duplicate bytes")
    
    # Case 4: Partially beyond window
    if seg_end > rcv_window_end:
        # Trim the out-of-window suffix
        trim_bytes = seg_end - rcv_window_end
        data = data[:-trim_bytes]
        print(f"Trimmed {trim_bytes} bytes beyond window")
    
    # Case 5: In-order segment (starts exactly at RCV.NXT)
    if seq == rcv_nxt:
        # Deliver immediately
        receive_buffer.append(data)
        rcv_nxt = seq + len(data)
        
        # Deliver any buffered out-of-order segment that now starts at RCV.NXT
        while rcv_nxt in receive_buffer.out_of_order:
            buffered = receive_buffer.out_of_order.pop(rcv_nxt)
            receive_buffer.append(buffered)
            rcv_nxt += len(buffered)
        
        return rcv_nxt, True, rcv_nxt
    
    # Case 6: Out-of-order segment (seq > RCV.NXT)
    # Buffer it for later; send duplicate ACK
    receive_buffer.out_of_order[seq] = data
    print(f"Out-of-order: buffered {seq}-{seq+len(data)-1}")
    return rcv_nxt, True, rcv_nxt  # ACK still indicates gap


# Example: 1500-1999 arrives late, then an old copy of 1000-1499 shows up.
buf = ReceiveBuffer()
rcv_nxt = 1000
for seg in [Segment(1000, b"a" * 500), Segment(2000, b"c" * 500),
            Segment(1500, b"b" * 500), Segment(1000, b"a" * 500)]:
    rcv_nxt, _, ack = receive_segment(seg, rcv_nxt, 65535, buf)
    print(f"ACK={ack}")  # 1500, 1500, 2500, 2500

Partial Overlap Handling

Segments may partially overlap with already-received data—perhaps the sender retransmitted more than necessary. TCP handles this by accepting only the new bytes and discarding duplicates. This is why sequence numbers track bytes, not segments: TCP can precisely identify which bytes are new.

Checksum: Integrity Verification

Reliable delivery also means integrity—the data received must exactly match the data sent. TCP uses a 16-bit checksum to detect corruption.

The TCP Checksum:

The TCP checksum covers:

A pseudo-header (includes IP source/destination addresses)
The entire TCP header
The TCP payload (data)

The pseudo-header includes IP addresses because TCP wants to verify the segment arrived at the correct destination—not just that the TCP header is valid. Including IP addresses catches misrouted segments.

Pseudo-header format (IPv4):

Field	Size
Source IP Address	4 bytes
Destination IP Address	4 bytes
Zero (padding)	1 byte
Protocol (6 for TCP)	1 byte
TCP Length	2 bytes

Checksum calculation:

The checksum is the 16-bit one's complement of the one's complement sum of all 16-bit words in the pseudo-header, TCP header, and data (with the checksum field set to zero during calculation). Odd-length data is padded with a zero byte for the calculation only; the pad byte is not transmitted.

tcp_checksum.py
import socket


def ones_complement_checksum(data: bytes) -> int:
    """
    Calculate TCP/IP one's complement checksum.
    
    1. Sum all 16-bit words
    2. Add any carries back into the sum
    3. Take one's complement of result
    """
    # Pad to even length if necessary (the pad byte is not transmitted)
    if len(data) % 2 == 1:
        data = data + b'\x00'
    
    # Sum all 16-bit words (network byte order)
    total = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) + data[i + 1]
        total += word
    
    # Fold the sum to 16 bits (add carries back in)
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    
    # One's complement
    return ~total & 0xFFFF
 
 
def create_pseudo_header(src_ip, dst_ip, tcp_length):
    """Create the IPv4 TCP pseudo-header for checksum calculation."""
    # Pack IP addresses (4 bytes each)
    src_bytes = socket.inet_aton(src_ip)
    dst_bytes = socket.inet_aton(dst_ip)
    
    # Pseudo-header: src_ip + dst_ip + zero + protocol + tcp_length
    pseudo = (
        src_bytes +          # 4 bytes: source IP
        dst_bytes +          # 4 bytes: dest IP
        b'\x00' +            # 1 byte: zero/reserved
        b'\x06' +            # 1 byte: protocol (6 = TCP)
        tcp_length.to_bytes(2, 'big')  # 2 bytes: TCP segment length
    )
    return pseudo
 
 
def compute_tcp_checksum(src_ip, dst_ip, tcp_header, tcp_data):
    """
    Compute TCP checksum including pseudo-header.
    
    Args:
        src_ip: Source IP address string
        dst_ip: Destination IP address string  
        tcp_header: TCP header bytes (checksum field should be zero)
        tcp_data: TCP payload bytes
    
    Returns:
        16-bit checksum value
    """
    tcp_length = len(tcp_header) + len(tcp_data)
    pseudo_header = create_pseudo_header(src_ip, dst_ip, tcp_length)
    
    # Concatenate pseudo-header + TCP header + TCP data
    checksum_data = pseudo_header + tcp_header + tcp_data
    
    return ones_complement_checksum(checksum_data)
 
 
# Example verification at receiver:
def verify_tcp_segment(src_ip, dst_ip, tcp_header, tcp_data):
    """Verify a received segment's checksum.

    tcp_header must contain the checksum field exactly as received.
    """
    checksum_data = (
        create_pseudo_header(src_ip, dst_ip, len(tcp_header) + len(tcp_data)) +
        tcp_header + tcp_data
    )
    # With a correct checksum included, the one's complement sum of all
    # words is 0xFFFF (all ones), so its complement, which is what
    # ones_complement_checksum returns, is 0.
    return ones_complement_checksum(checksum_data) == 0
 
 
# Example: a bare 20-byte header (ports, sequence numbers, etc. left at zero
# for brevity) with the checksum field, bytes 16-17, zeroed while computing.
header = bytearray(20)
header[12] = 5 << 4                        # data offset = 5 words (20 bytes)
payload = b"hello"                         # odd length: exercises the padding
csum = compute_tcp_checksum("10.0.0.1", "10.0.0.2", bytes(header), payload)
header[16:18] = csum.to_bytes(2, "big")    # store the checksum in the header
print(verify_tcp_segment("10.0.0.1", "10.0.0.2", bytes(header), payload))  # True

Checksum Limitations

The TCP checksum is a simple sum: it detects accidental bit errors reasonably well but provides no protection against malicious modification. It can also miss certain error patterns, such as reordered 16-bit words or pairs of errors that cancel each other out in the sum. Modern links typically add their own error detection (CRCs), and applications requiring cryptographic integrity should use TLS or similar protocols.
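
The word-reordering blind spot is easy to demonstrate, because one's complement addition does not care about order. A self-contained sketch: the `internet_checksum` helper repeats the algorithm from the checksum code above, and the payload bytes are arbitrary:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum (same algorithm as the TCP checksum)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd" + b"payload!"
swapped = b"\xab\xcd\x12\x34" + b"payload!"   # first two 16-bit words exchanged
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # a single bit flipped

print(internet_checksum(original) == internet_checksum(swapped))  # True: not detected
print(internet_checksum(original) == internet_checksum(flipped))  # False: detected
```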

The Reliability Contract

All the mechanisms we've discussed work together to provide TCP's reliability contract:

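Every byte written to a TCP socket will be delivered to the receiving TCP, and made available to the application there, in order and exactly once, or the connection will be terminated with an error.

The failure mode is explicit: the stream may stop partway, but everything delivered is an exact, in-order prefix of what was sent, and the break is reported as an error rather than happening silently. There is no reordering, no silent data loss, and, within the limits of the 16-bit checksum, no undetected corruption.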

What "reliable" means:

TCP's Reliability Guarantees
Problem	TCP Solution	Guarantee
Packet loss	Sequence numbers + ACKs + retransmission	All data eventually arrives or connection fails
Packet corruption	Checksum verification	Corrupted data is discarded and retransmitted
Packet duplication	Sequence number tracking	Data is delivered exactly once
Packet reordering	Sequence-based reassembly	Data is delivered in send order
Connection failure	Timeout + probe mechanisms	Persistent failures are reported to the application

What "reliable" does NOT mean:

Guaranteed delivery speed: TCP will keep trying, but network conditions limit throughput
Immediate failure detection: Detecting a dead connection can take minutes or longer; keepalives are off by default, and when enabled on Linux they start only after 2 hours of idle time unless reconfigured
Protection against malicious attacks: TCP can't prevent a determined attacker from disrupting communication
Delivery to the right application: TCP delivers to the socket; ensuring the right process handles it is an OS concern

The application's role:

Reliability is a contract between TCP endpoints. The application must also cooperate:

Call recv() regularly: If the application is slow, the receive buffer fills, and the sender is throttled
Handle connection failures: When TCP reports an error, the application must decide how to respond
Use appropriate timeouts: For user-facing applications, a 60-second connection timeout may be too long
Consider application-level acknowledgments: For critical operations (like financial transactions), waiting for an application response may be necessary

TCP Doesn't Mean End-to-End Application Reliability

TCP guarantees byte delivery to the remote TCP stack, not to the remote application. If the remote host receives the data, ACKs it, then crashes before the application reads it, the sending TCP has its ACK, yet the remote application never saw the data, and the sending application (whose send() call returned as soon as the data was buffered) is never told. Critical systems need application-level acknowledgment on top of TCP.
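
One way to picture application-level acknowledgment: the sender treats a message as handled only when the receiving application replies. A minimal sketch using `socket.socketpair()` so both ends fit in one process (on Linux and macOS this gives a Unix-domain stream socket standing in for a TCP connection); `send_with_app_ack` and the `ACK:<length>` reply format are a hypothetical helper and protocol made up for illustration:

```python
import socket


def send_with_app_ack(sender, receiver, payload: bytes) -> bytes:
    """Send payload, then wait for the receiving application's confirmation."""
    sender.sendall(payload)   # returns once the kernel has buffered the data
    # The receiving application reads the data. A real protocol must frame
    # messages, since a stream may split or merge them; one small recv()
    # is enough for this demonstration.
    data = receiver.recv(1024)
    receiver.sendall(b"ACK:" + str(len(data)).encode())  # confirm only after reading
    return sender.recv(1024)  # the sender waits for that confirmation


a, b = socket.socketpair()
try:
    print(send_with_app_ack(a, b, b"transfer $100"))  # b'ACK:13'
finally:
    a.close()
    b.close()
```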

Performance Implications of Reliability

Reliability comes at a cost. Every mechanism that makes TCP reliable also affects performance. Understanding these trade-offs helps when tuning TCP or choosing between TCP and alternatives.

Latency costs:

Reliability-Induced Latency

•Connection setup: The client can send its first data only after one full RTT (SYN out, SYN-ACK back), so the server receives the first request bytes about 1.5 RTT after the SYN was sent
•Head-of-line blocking: One lost segment delays delivery of all subsequent data
•Retransmission delay: Discovering loss via timeout adds RTO time (often hundreds of ms)
•Fast retransmit delay: Discovering loss via dup ACKs still requires 3 additional segments to arrive
•Delayed ACKs: Waiting to piggyback ACKs can add tens to hundreds of milliseconds per ACK (RFC 5681 caps the delay at 500 ms)
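
Head-of-line blocking follows from a simple rule: a byte can be handed to the application only after every earlier byte has been. A minimal sketch of that rule with a made-up timeline (all times hypothetical, in milliseconds), where segment 1 is lost and its retransmission lands at t=250; `delivery_times` is a hypothetical helper:

```python
def delivery_times(arrivals):
    """Time at which each segment can be delivered in order.

    arrivals[i] is when segment i (finally) reaches the receiver. It can be
    delivered only once segments 0..i-1 have been, so
    delivery[i] = max(arrival[i], delivery[i-1]).
    """
    delivered = []
    latest = 0
    for t in arrivals:
        latest = max(latest, t)
        delivered.append(latest)
    return delivered


# Segment 1 is lost; its retransmission arrives at t=250 ms.
arrivals = [10, 250, 14, 16, 18]
print(delivery_times(arrivals))  # [10, 250, 250, 250, 250]
```

Segments 2-4 reached the receiver by t=18 but sit in the buffer until t=250, which is exactly the latency cost the bullet describes.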

Throughput costs:

Mechanism	Cost
ACKs	Consume bandwidth in the reverse direction
Retransmissions	Waste bandwidth resending already-received data (partial overlaps)
Header overhead	20-60 bytes of TCP header per segment (sequence and ACK numbers, checksum, options)
Rate limiting	Congestion control may limit rate below available capacity

Memory costs:

Send buffer: Keeps unacknowledged data for potential retransmission
Receive buffer: Buffers out-of-order segments pending gap fill
TCB state: Per-connection state consumes kernel memory

The Reliability-Performance Trade-off

This is why UDP exists and why protocols like QUIC were created. Applications with different requirements make different trade-offs: video streaming tolerates loss to achieve low latency; file transfers require perfect reliability regardless of latency. TCP offers one specific trade-off—strong reliability with its associated costs.

Summary: The Mechanics of Reliability

TCP's reliability is not magic—it's a carefully engineered set of mechanisms working together. Let's summarize what we've learned:

Key Takeaways

•Sequence numbers uniquely identify every byte, enabling gap detection, reordering, and duplicate detection.
•Acknowledgments inform the sender what arrived. Cumulative ACKs are simple; SACK provides precise gap information.
•Retransmission via timeout (RTO) catches all losses; fast retransmit (3 dup ACKs) accelerates common cases.
•Duplicate detection uses sequence numbers to ensure data is delivered exactly once.
•Checksums detect corruption, causing corrupted segments to be discarded and retransmitted.
•The reliability contract is binary: complete delivery or explicit connection failure.
•Reliability has costs: latency, throughput overhead, memory, and complexity.

From Chaos to Order

TCP takes the internet's unreliable, best-effort packet delivery and builds upon it a reliable, ordered byte stream. Applications can focus on their logic without worrying about network failures—TCP handles the recovery. This abstraction is TCP's greatest gift to application developers.

What's next:

Reliability ensures data arrives correctly; ordering ensures it arrives in the right sequence. The next page examines TCP's ordered delivery guarantee in detail—how sequence numbers enable reassembly, how the receive buffer manages out-of-order segments, and the implications of ordering guarantees for application design.

Reliable Delivery

The Promise of Reliability

In this page, we'll explore the mechanisms that make this possible—the elegant engineering that converts unreliable packet delivery into reliable stream transport.

What You Will Learn

The Reliability Problem

To appreciate TCP's reliability mechanisms, we must first understand what can go wrong in a packet-switched network:

Types of packet errors:

Network Layer Unreliability

•Packet Loss — Routers drop packets when queues overflow (congestion). Bit errors may cause checksum failures. Hardware failures can lose packets.
•Packet Corruption — Electromagnetic interference, cosmic rays, faulty hardware can flip bits. Corrupted packets may have invalid checksums or undetected errors.
•Packet Reordering — Different packets may take different paths with different delays. A packet sent first may arrive last.
•Packet Duplication — Retransmissions (at any layer) can cause the same data to arrive multiple times.
•Packet Delay — Network congestion, routing changes, or long paths can delay packets significantly, even indefinitely.

IP's position: "Not my problem"

But applications need reliability. They can't deal with missing data or corruption. So something must fill this gap—and that something is TCP.

The reliability requirements:

Complete delivery: Every byte sent must eventually arrive or the sender must know it failed
No corruption: Data must be checked and corrupted data rejected
No duplicates: The same data shouldn't be delivered multiple times
Correct ordering: Data must be delivered in the exact order it was sent

The End-to-End Argument

Sequence Numbers: The Foundation of Reliability

TCP's reliability begins with sequence numbers. Every byte in the data stream is assigned a unique 32-bit sequence number, providing the foundation for all other reliability mechanisms.

Byte-oriented numbering:

ISN = 1000

First segment:   Seq=1000, Len=500  →  Bytes 1000-1499
Second segment:  Seq=1500, Len=500  →  Bytes 1500-1999
Third segment:   Seq=2000, Len=300  →  Bytes 2000-2299

This byte-oriented approach enables TCP to:

Identify exactly which bytes are missing (gaps in sequence space)
Retransmit only lost portions, not entire messages
Handle segments of varying sizes efficiently

Sequence Number Functions:

Sequence numbers serve multiple critical purposes:

Uses of TCP Sequence Numbers
Function	How Sequence Numbers Help	Example
Gap Detection	Receiver identifies missing bytes by sequence gaps	Got 1000-1499, 2000-2499... where's 1500-1999?
Reordering	Receiver reassembles out-of-order segments correctly	Segments 3,1,2 arrive → assemble as 1,2,3
Duplicate Detection	Receiver discards bytes with already-received sequence numbers	Retransmission of bytes 1000-1499 ignored if already received
Acknowledgment	Receiver tells sender which bytes arrived via ACK	ACK=2000 means 'received all bytes before 2000'
Retransmission	Sender knows exactly which bytes to retransmit	No ACK for 1500-1999? Retransmit that range

sequence_space.txt

Visualization

TCP Sequence Space (32-bit, wraps at 2^32)
 
        0                           2^31                        2^32-1
        |---------------------------|---------------------------|
                                    ↻ wraps around to 0
 
For a connection with ISN=1000:
        
        1000        1500        2000        2500        3000
        |-----------|-----------|-----------|-----------|
        ↑           ↑           ↑           ↑
        |           |           |           Next to send (SND.NXT)
        |           |           Last sent
        |           Last acknowledged (SND.UNA)
        ISN (connection start)
 
Send Window (bytes sender can transmit):
        
        SND.UNA             SND.NXT                 SND.UNA + SND.WND
        |===================|~~~~~~~~~~~~~~~~~~~~~~~|
        |                   |                       |
        ACKed               Sent but               Can send
        (can discard)       unACKed                (but haven't yet)
 
Receive Window (bytes receiver expects):
 
        RCV.NXT                                     RCV.NXT + RCV.WND
        |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
        |                                            |
        Next expected                                Last acceptable
        (anything before is duplicate/old)           (anything after is too far ahead)

Sequence Number Wraparound

The Acknowledgment Mechanism

Sequence numbers would be useless without a way for the receiver to tell the sender what arrived. TCP's acknowledgment (ACK) mechanism closes this loop.

Cumulative Acknowledgments:

TCP uses cumulative ACKs: the ACK number indicates the next byte the receiver expects—meaning all bytes before that number have been received.

ACK = 2000  →  "I have received all bytes up to (but not including) 2000.
                I'm expecting byte 2000 next."

This cumulative approach has a major advantage: if an ACK gets lost, the next ACK still confirms everything. If ACK 2000 is lost, but ACK 2500 arrives, the sender knows bytes 0-2499 were received.

Converting Mermaid diagram...

Selective Acknowledgments (SACK):

SACK is a TCP option that addresses the cumulative ACK limitation. SACK allows the receiver to report non-contiguous blocks of received data:

ACK=1500
SACK blocks: (2000-2499), (3000-3499)

This tells the sender: "I'm missing 1500-1999 and 2500-2999, but I have the rest." The sender can now retransmit only the missing ranges, not everything from 1500 onwards.
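This minimal sketch shows how a sender can turn a cumulative ACK plus SACK blocks into a retransmission list. It assumes inclusive (first, last) byte ranges, matching the notation above. On the wire, a block's right edge is the sequence number just past its last byte. The sketch ignores wraparound and reports holes only up to the highest SACKed byte, because the sender cannot yet tell whether later data was lost. `missing_ranges` is a hypothetical name.

```python
def missing_ranges(ack: int, sack_blocks: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return inclusive (first, last) ranges not yet received, below the highest SACKed byte.

    ack:         cumulative ACK (next byte the receiver expects)
    sack_blocks: inclusive (first, last) ranges the receiver reported holding
    """
    holes = []
    next_expected = ack
    for first, last in sorted(sack_blocks):
        if first > next_expected:
            holes.append((next_expected, first - 1))  # gap before this block
        next_expected = max(next_expected, last + 1)  # skip past the block
    return holes


print(missing_ranges(1500, [(2000, 2499), (3000, 3499)]))
# [(1500, 1999), (2500, 2999)]
```

Sorting first and tracking `next_expected` with `max` lets the function tolerate overlapping or out-of-order blocks in the report.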

SACK format in TCP options:

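Field	Size	Value
Kind	1 byte	5
Length	1 byte	2 + 8 × n, where n is the number of blocks (at most 4; typically 3 when the timestamp option is also present)
Left Edge (per block)	32 bits	First sequence number in the block
Right Edge (per block)	32 bits	Sequence number just after the last byte in the block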

SACK must be negotiated during the handshake. Each side includes the SACK-Permitted option (kind 4, length 2) in its SYN or SYN-ACK. The receiver may use SACK blocks to report gaps precisely only if both sides sent it.
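The option layout in the table can be packed and parsed with Python's `struct` module. This minimal sketch assumes the caller already has the raw option bytes; finding options inside a TCP header is out of scope. `encode_sack` and `decode_sack` are hypothetical names. Block edges use the wire convention, where the right edge is one past the last byte, so the text's block 2000-2499 is encoded as (2000, 2500).

```python
import struct

SACK_PERMITTED_KIND = 4  # option sent in SYN / SYN-ACK to negotiate SACK
SACK_KIND = 5            # option carrying the SACK blocks
MAX_SACK_BLOCKS = 4      # 2 + 8 * 4 = 34 bytes fits in the 40-byte option space

# The SACK-Permitted option is just kind and length (2 bytes total).
SACK_PERMITTED = struct.pack("!BB", SACK_PERMITTED_KIND, 2)


def encode_sack(blocks: list[tuple[int, int]]) -> bytes:
    """Pack (left_edge, right_edge) pairs; right_edge is one past the block's last byte."""
    if not 1 <= len(blocks) <= MAX_SACK_BLOCKS:
        raise ValueError("a SACK option carries 1 to 4 blocks")
    body = b"".join(struct.pack("!II", left, right) for left, right in blocks)
    return struct.pack("!BB", SACK_KIND, 2 + len(body)) + body


def decode_sack(option: bytes) -> list[tuple[int, int]]:
    """Unpack a SACK option back into (left_edge, right_edge) pairs."""
    kind, length = struct.unpack_from("!BB", option)
    if kind != SACK_KIND or length != len(option) or (length - 2) % 8 != 0:
        raise ValueError("malformed SACK option")
    return [struct.unpack_from("!II", option, offset) for offset in range(2, length, 8)]


# The text's blocks 2000-2499 and 3000-3499 in wire form:
option = encode_sack([(2000, 2500), (3000, 3500)])
print(len(option), decode_sack(option))  # 18 [(2000, 2500), (3000, 3500)]
```

For a single block, `encode_sack([(2000, 2500)]).hex()` gives `050a000007d0000009c4`: kind 5, length 10, then the two 32-bit edges in network byte order.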

Delayed ACKs

Retransmission Strategies

1. Timeout-Based Retransmission:

The most fundamental mechanism: if an ACK doesn't arrive within the Retransmission Timeout (RTO), TCP assumes the segment was lost and retransmits.

Calculating RTO is tricky—networks have varying and dynamic delays. TCP estimates the Round-Trip Time (RTT) and sets RTO based on it:

RTO = SRTT + 4 × RTTVAR

Where:

SRTT (smoothed RTT) = exponentially weighted moving average of RTT samples
RTTVAR (RTT variation) = smoothed mean deviation of RTT samples from SRTT

This adaptive timeout adjusts to network conditions—fast networks get short RTOs, slow networks get longer ones.

rto_calculation.py
# TCP RTO calculation following RFC 6298 (except min_rto; see below)
 
class RTOEstimator:
    def __init__(self):
        self.srtt = None       # Smoothed RTT (None means no sample yet)
        self.rttvar = None     # RTT variance
        self.rto = 1.0         # Initial RTO = 1 second (common default)
        
        # Constants from RFC 6298
        self.alpha = 0.125     # SRTT smoothing factor (1/8)
        self.beta = 0.25       # RTTVAR smoothing factor (1/4)
        self.k = 4             # Variance multiplier
        self.min_rto = 0.2     # Minimum RTO: 200 ms as in Linux (RFC 6298 recommends 1 s)
        self.max_rto = 60.0    # Maximum RTO (60 seconds)
    
    def update(self, rtt_sample):
        """Update RTO estimate with new RTT measurement."""
        if self.srtt is None:
            # First measurement - initialize
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # Subsequent measurements - compute incrementally
            # RTTVAR = (1-β) × RTTVAR + β × |SRTT - R'|
            self.rttvar = (1 - self.beta) * self.rttvar + \
                          self.beta * abs(self.srtt - rtt_sample)
            
            # SRTT = (1-α) × SRTT + α × R'
            self.srtt = (1 - self.alpha) * self.srtt + \
                        self.alpha * rtt_sample
        
        # RTO = SRTT + 4 × RTTVAR
        self.rto = self.srtt + self.k * self.rttvar
        
        # Clamp to reasonable bounds
        self.rto = max(self.min_rto, min(self.max_rto, self.rto))
        
        return self.rto
    
    def timeout_occurred(self):
        """Handle RTO expiration - apply exponential backoff."""
        # Double RTO for each timeout (capped at max_rto)
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
 
# Example usage:
estimator = RTOEstimator()
print(f"Initial RTO: {estimator.rto}s")
 
# Receive RTT samples
samples = [0.05, 0.055, 0.048, 0.062, 0.051]  # about 53 ms average
for sample in samples:
    rto = estimator.update(sample)
    print(f"RTT: {sample*1000:.0f}ms → RTO: {rto*1000:.1f}ms")
 
# If timeout occurs, back off exponentially
print("\nTimeout occurred!")
print(f"New RTO: {estimator.timeout_occurred()*1000:.1f}ms")

2. Fast Retransmit:

Waiting for timeout can be slow—RTOs are typically hundreds of milliseconds. Fast Retransmit accelerates loss detection using duplicate ACKs.

When the receiver gets an out-of-order segment, it immediately sends an ACK that repeats the previous ACK number, which is the next in-order byte it is still waiting for. Each subsequent out-of-order segment triggers another duplicate ACK.

The sender interprets 3 duplicate ACKs (4 ACKs with the same number total) as strong evidence that the next expected segment was lost:

Duplicate ACK could mean: reordering, loss, or duplication
3 duplicate ACKs strongly suggest loss (segments are arriving, but one is missing)
Retransmit immediately without waiting for timeout
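The sender-side counting rule fits in a few lines. This is a simplification: it counts every ACK that repeats SND.UNA as a duplicate, whereas RFC 5681's full definition also requires, among other things, that the ACK carry no data and leave the advertised window unchanged. Wraparound is ignored, and `FastRetransmitDetector` is a hypothetical name.

```python
from typing import Optional

DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed before retransmitting


class FastRetransmitDetector:
    """Hypothetical sender-side duplicate-ACK counter (wraparound ignored)."""

    def __init__(self, snd_una: int) -> None:
        self.snd_una = snd_una  # oldest unacknowledged byte
        self.dup_acks = 0

    def on_ack(self, ack: int) -> Optional[int]:
        """Process an ACK; return the sequence number to retransmit, or None."""
        if ack > self.snd_una:
            # New data acknowledged: advance and reset the counter.
            self.snd_una = ack
            self.dup_acks = 0
            return None
        if ack == self.snd_una:
            self.dup_acks += 1
            if self.dup_acks == DUP_ACK_THRESHOLD:
                # Third duplicate: assume the segment starting at snd_una was lost.
                return self.snd_una
        return None  # an older ACK, or further duplicates after retransmitting


d = FastRetransmitDetector(snd_una=1000)
print([d.on_ack(a) for a in (1500, 1500, 1500, 1500, 1500)])
# [None, None, None, 1500, None]
```

The first ACK 1500 is new and advances SND.UNA. The three that follow are duplicates, and the third triggers exactly one retransmission.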


Why 3 Duplicate ACKs?

Duplicate Detection and Handling

Retransmissions mean the same data might arrive multiple times. TCP must detect and discard duplicates to deliver data exactly once.

Why duplicates occur:

Retransmission of non-lost segment: The original segment was just delayed; both it and the retransmission arrive
Spurious retransmission during recovery: A fast retransmit and a later timeout-driven retransmission can both resend the same segment
Lower-layer retransmissions: Link layers (Wi-Fi, etc.) retransmit frames on their own and can occasionally deliver the same packet twice
Route changes: Old copies of segments may arrive via old routes long after being retransmitted

How TCP detects duplicates:

Sequence numbers make detection straightforward:

Receiver state: RCV.NXT = 2000 (expecting byte 2000 next)

Incoming segment: Seq=1000, Len=500 (bytes 1000-1499)

Compare: Seq + Len = 1500 ≤ RCV.NXT = 2000 (the last byte, 1499, is below 2000)
Conclusion: All bytes in this segment are before RCV.NXT
           → Duplicate! Discard the data.
           → Still send ACK (ACK=2000) to confirm receipt

The receiver's RCV.NXT serves as a watermark. Any data with sequence numbers below this has already been received and delivered—it's a duplicate.
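Real sequence numbers are 32-bit values that wrap around to 0, so comparisons like the one above are done modulo 2^32: one number counts as "before" another if it lies less than half the sequence space behind it, the same idea as RFC 1982 serial-number arithmetic. This minimal sketch applies that rule to classify an incoming data segment against RCV.NXT and RCV.WND. The function names are hypothetical, and zero-length segments are not handled specially.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around
HALF = 2 ** 31


def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b in sequence space (modular comparison)."""
    return 0 < (b - a) % SEQ_MOD < HALF


def seq_le(a: int, b: int) -> bool:
    return a == b or seq_lt(a, b)


def classify(seq: int, length: int, rcv_nxt: int, rcv_wnd: int) -> str:
    """Classify a data segment relative to the receive window."""
    end = (seq + length) % SEQ_MOD            # one past the segment's last byte
    window_end = (rcv_nxt + rcv_wnd) % SEQ_MOD
    if seq_le(end, rcv_nxt):
        return "duplicate"                    # every byte is before RCV.NXT
    if not seq_lt(seq, window_end):
        return "beyond window"                # starts at or after the window end
    if seq == rcv_nxt:
        return "in order"
    if seq_lt(seq, rcv_nxt):
        return "partial duplicate"            # trim the already-received prefix
    return "out of order"                     # buffer it; send a duplicate ACK


print(classify(1000, 500, 2000, 65535))               # duplicate (the example above)
print(classify(2**32 - 300, 500, 2**32 - 100, 1000))  # partial duplicate, spanning the wrap
print(classify(50, 100, 2**32 - 100, 1000))           # out of order, after the wrap
```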

duplicate_handling.py
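# Simplified segment reception logic (ignores 32-bit sequence wraparound)
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Minimal stand-in for a received TCP segment (hypothetical helper type)."""
    sequence_number: int
    data: bytes


@dataclass
class ReceiveBuffer:
    """Hypothetical receive buffer used by receive_segment() below."""
    delivered: list = field(default_factory=list)     # in-order data for the application
    out_of_order: dict = field(default_factory=dict)  # seq -> data waiting for a gap to fill

    def append(self, data: bytes) -> None:
        self.delivered.append(data)

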
def receive_segment(segment, rcv_nxt, rcv_wnd, receive_buffer):
    """
    Process incoming TCP segment.
    Returns: (updated rcv_nxt, should_ack, ack_value)
    """
    seq = segment.sequence_number
    data = segment.data
    seg_len = len(data)
    seg_end = seq + seg_len  # Last byte + 1
    
    # Calculate receive window boundaries
    rcv_window_end = rcv_nxt + rcv_wnd
    
    # Case 1: Entirely duplicate (completely before RCV.NXT)
    if seg_end <= rcv_nxt:
        print(f"Duplicate segment: {seq}-{seg_end-1} (already received)")
        # Still ACK to help sender know current state
        return rcv_nxt, True, rcv_nxt
    
    # Case 2: Entirely beyond window (too far ahead)
    if seq >= rcv_window_end:
        print(f"Segment beyond window: {seq} > {rcv_window_end-1}")
        # ACK current position, but don't process
        return rcv_nxt, True, rcv_nxt
    
    # Case 3: Partially overlapping with already-received data
    if seq < rcv_nxt:
        # Trim the duplicate prefix
        trim_bytes = rcv_nxt - seq
        data = data[trim_bytes:]
        seq = rcv_nxt
        print(f"Trimmed {trim_bytes} duplicate bytes")
    
    # Case 4: Partially beyond window
    if seg_end > rcv_window_end:
        # Trim the out-of-window suffix
        trim_bytes = seg_end - rcv_window_end
        data = data[:-trim_bytes]
        print(f"Trimmed {trim_bytes} bytes beyond window")
    
    # Case 5: In-order segment (starts exactly at RCV.NXT)
    if seq == rcv_nxt:
        # Deliver immediately
        receive_buffer.append(data)
        rcv_nxt = seq + len(data)
        
        # Check if we can deliver buffered out-of-order segments
        while rcv_nxt in receive_buffer.out_of_order:
            buffered = receive_buffer.out_of_order.pop(rcv_nxt)
            receive_buffer.append(buffered)
            rcv_nxt += len(buffered)
        
        return rcv_nxt, True, rcv_nxt
    
    # Case 6: Out-of-order segment (seq > RCV.NXT)
    # Buffer it for later; send duplicate ACK
    receive_buffer.out_of_order[seq] = data
    print(f"Out-of-order: buffered {seq}-{seq+len(data)-1}")
    return rcv_nxt, True, rcv_nxt  # ACK still indicates gap

Partial Overlap Handling

Checksum: Integrity Verification

Reliable delivery also means integrity—the data received must exactly match the data sent. TCP uses a 16-bit checksum to detect corruption.

The TCP Checksum:

The TCP checksum covers:

A pseudo-header (includes IP source/destination addresses)
The entire TCP header
The TCP payload (data)

Pseudo-header format (IPv4):

Field	Size
Source IP Address	4 bytes
Destination IP Address	4 bytes
Zero (padding)	1 byte
Protocol (6 for TCP)	1 byte
TCP Length	2 bytes

Checksum calculation:

tcp_checksum.py
def ones_complement_checksum(data: bytes) -> int:
    """
    Calculate TCP/IP one's complement checksum.
    
    1. Sum all 16-bit words
    2. Add any carries back into the sum
    3. Take one's complement of result
    """
    # Pad to even length if necessary
    if len(data) % 2 == 1:
        data = data + b'\x00'
    
    # Sum all 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) + data[i + 1]
        total += word
    
    # Fold 32-bit sum to 16 bits (add carries)
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)
    
    # One's complement
    return ~total & 0xFFFF
 
 
def create_pseudo_header(src_ip, dst_ip, tcp_length):
    """Create TCP pseudo-header for checksum calculation."""
    import socket
    
    # Pack IP addresses (4 bytes each)
    src_bytes = socket.inet_aton(src_ip)
    dst_bytes = socket.inet_aton(dst_ip)
    
    # Pseudo-header: src_ip + dst_ip + zero + protocol + tcp_length
    pseudo = (
        src_bytes +          # 4 bytes: source IP
        dst_bytes +          # 4 bytes: dest IP
        b'\x00' +            # 1 byte: zero/reserved
        b'\x06' +            # 1 byte: protocol (6 = TCP)
        tcp_length.to_bytes(2, 'big')  # 2 bytes: TCP segment length
    )
    return pseudo
 
 
def compute_tcp_checksum(src_ip, dst_ip, tcp_header, tcp_data):
    """
    Compute TCP checksum including pseudo-header.
    
    Args:
        src_ip: Source IP address string
        dst_ip: Destination IP address string  
        tcp_header: TCP header bytes (checksum field should be zero)
        tcp_data: TCP payload bytes
    
    Returns:
        16-bit checksum value
    """
    tcp_length = len(tcp_header) + len(tcp_data)
    pseudo_header = create_pseudo_header(src_ip, dst_ip, tcp_length)
    
    # Concatenate pseudo-header + TCP header + TCP data
    checksum_data = pseudo_header + tcp_header + tcp_data
    
    return ones_complement_checksum(checksum_data)
 
 
# Example verification at receiver:
def verify_tcp_segment(src_ip, dst_ip, tcp_header, tcp_data):
    """Verify a received segment's checksum (tcp_header still holds the received checksum)."""
    # Sum pseudo-header + header + data with the received checksum left in place.
    # If nothing was corrupted, the one's complement sum is 0xFFFF, so its
    # complement (which ones_complement_checksum returns) is 0x0000.
    checksum_data = (
        create_pseudo_header(src_ip, dst_ip, len(tcp_header) + len(tcp_data)) +
        tcp_header + tcp_data
    )

    return ones_complement_checksum(checksum_data) == 0x0000

Checksum Limitations

The Reliability Contract

All the mechanisms we've discussed work together to provide TCP's reliability contract:

Every byte written to a TCP socket will be delivered to the application on the other end, in order, exactly once—or the connection will be terminated with an error.

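This is a binary guarantee about the byte stream. The receiver sees an in-order, gap-free prefix of what was sent. If TCP cannot deliver the rest, the connection fails with an explicit error. There is no silent data loss and no silently skipped or reordered bytes; corruption that slips past the 16-bit checksum is rare but not impossible.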

What "reliable" means:

TCP's Reliability Guarantees
Problem	TCP Solution	Guarantee
Packet loss	Sequence numbers + ACKs + retransmission	All data eventually arrives or connection fails
Packet corruption	Checksum verification	Corrupted data is discarded and retransmitted
Packet duplication	Sequence number tracking	Data is delivered exactly once
Packet reordering	Sequence-based reassembly	Data is delivered in send order
Connection failure	Timeout + probe mechanisms	Persistent failures are reported to the application

What "reliable" does NOT mean:

Guaranteed delivery speed: TCP will keep trying, but network conditions limit throughput
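Immediate failure detection: Detecting a dead peer can take many minutes while retransmissions back off. On an idle connection it may never happen unless keepalives are enabled: they are usually off by default, and the common default idle time before the first probe is 2 hours.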
Protection against malicious attacks: TCP can't prevent a determined attacker from disrupting communication
Delivery to the right application: TCP delivers to the socket; ensuring the right process handles it is an OS concern
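When faster dead-peer detection matters, an application can enable keepalives and shorten their timers per socket. This minimal sketch assumes Linux-style option names. Python's `socket` module exposes `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` only on platforms that support them, so the code skips any that are missing. The timer values are illustrative, not recommendations.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60, interval_s: int = 10, probes: int = 5) -> None:
    """Turn on keepalives and, where supported, shorten their timers.

    With these values, an idle connection to a dead peer is reported after
    roughly idle_s + interval_s * probes = 110 seconds instead of hours.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = {
        "TCP_KEEPIDLE": idle_s,       # idle time before the first probe
        "TCP_KEEPINTVL": interval_s,  # time between unanswered probes
        "TCP_KEEPCNT": probes,        # unanswered probes before giving up
    }
    for name, value in tuning.items():
        option = getattr(socket, name, None)  # absent on some platforms
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
```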

The application's role:

Reliability is a contract between TCP endpoints. The application must also cooperate:

Call recv() regularly: If the application is slow, the receive buffer fills, and the sender is throttled
Handle connection failures: When TCP reports an error, the application must decide how to respond
Use appropriate timeouts: For user-facing applications, a 60-second connection timeout may be too long
Consider application-level acknowledgments: For critical operations (like financial transactions), waiting for an application response may be necessary
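The last two points can be combined: bound how long a caller waits, and treat an operation as done only when the peer application replies. This minimal sketch runs over a loopback TCP connection. It assumes a made-up newline-terminated protocol in which the server answers `OK <request>`. `serve_one`, `send_with_app_ack`, and `run_exchange` are hypothetical names. A successful `sendall()` only means the data was handed to the local TCP stack.

```python
import socket
import threading


def recv_line(sock: socket.socket) -> bytes:
    """Read one newline-terminated message; a single recv() may return only part of it."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("peer closed the connection before a full message arrived")
        buf += chunk
    return bytes(buf)


def serve_one(listener: socket.socket) -> None:
    """Hypothetical server: handle one request, then send an application-level ACK."""
    conn, _ = listener.accept()
    with conn:
        request = recv_line(conn)
        # A real server would apply the operation durably *before* replying.
        conn.sendall(b"OK " + request)


def send_with_app_ack(sock: socket.socket, request: bytes, timeout_s: float = 5.0) -> bool:
    """Send a request and wait, up to timeout_s per socket operation, for the peer's reply."""
    sock.settimeout(timeout_s)   # operations raise socket.timeout instead of blocking forever
    sock.sendall(request)        # returns once the local TCP stack has the data
    reply = recv_line(sock)      # only this shows the peer application handled it
    return reply == b"OK " + request


def run_exchange(request: bytes) -> bool:
    """Demo: one request/reply over a loopback TCP connection."""
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_one, args=(listener,))
        worker.start()
        try:
            with socket.create_connection(("127.0.0.1", port)) as client:
                return send_with_app_ack(client, request)
        finally:
            worker.join()


print(run_exchange(b"transfer 100\n"))  # True
```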

TCP Doesn't Mean End-to-End Application Reliability

Performance Implications of Reliability

Reliability comes at a cost. Every mechanism that makes TCP reliable also affects performance. Understanding these trade-offs helps when tuning TCP or choosing between TCP and alternatives.

Latency costs:

Reliability-Induced Latency

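•Connection setup: The three-way handshake costs 1 RTT before the client can send data (its first data can ride on the final ACK), so the first request reaches the server 1.5 RTT after the client starts connecting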
•Head-of-line blocking: One lost segment delays delivery of all subsequent data
•Retransmission delay: Discovering loss via timeout adds RTO time (often hundreds of ms)
•Fast retransmit delay: Discovering loss via dup ACKs still requires 3 additional segments to arrive
•Delayed ACKs: A receiver may hold an ACK, hoping to piggyback it on outgoing data, for up to 500 ms (the RFC 1122 limit); many implementations use 200 ms or less
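A receiver's delayed-ACK decision can be sketched as a small policy object. This simplification assumes two rules from the TCP specifications: acknowledge at least every second full-sized segment, and acknowledge out-of-order data immediately. It adds a timer that must fire within 500 ms. The 200 ms default and the class name are illustrative assumptions. The caller models the timer by invoking `on_timer()`.

```python
from dataclasses import dataclass

MAX_ACK_DELAY_S = 0.5  # RFC 1122 upper bound on how long an ACK may be delayed


@dataclass
class DelayedAckPolicy:
    """Hypothetical receiver-side delayed-ACK policy."""
    mss: int                  # length of a full-sized segment
    ack_delay_s: float = 0.2  # illustrative timer value
    pending_full: int = 0     # full-sized segments received but not yet ACKed

    def __post_init__(self) -> None:
        if not 0 < self.ack_delay_s <= MAX_ACK_DELAY_S:
            raise ValueError("ACK delay must be positive and at most 500 ms")

    def on_segment(self, length: int, in_order: bool) -> str:
        """Return 'ack now' or 'delay' for a newly received data segment."""
        if not in_order:
            self.pending_full = 0
            return "ack now"  # an immediate duplicate ACK helps fast retransmit
        if length >= self.mss:
            self.pending_full += 1
        if self.pending_full >= 2:
            self.pending_full = 0
            return "ack now"  # at least every second full-sized segment
        return "delay"        # start/keep the timer; piggyback if data goes out first

    def on_timer(self) -> str:
        """The delay timer expired: send the pending ACK."""
        self.pending_full = 0
        return "ack now"


p = DelayedAckPolicy(mss=1460)
print([p.on_segment(1460, True) for _ in range(4)])
# ['delay', 'ack now', 'delay', 'ack now']
```

In a bulk transfer this halves the number of pure ACKs. For a lone small request, though, the ACK waits for the timer, which is the latency cost listed above.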

Throughput costs:

Mechanism	Cost
ACKs	Consume bandwidth in the reverse direction
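Retransmissions	Spurious retransmissions, and resending data the receiver already holds (e.g., after a timeout without SACK), waste bandwidth
Header overhead	20-byte minimum TCP header (up to 60 bytes with options) on every segment, carrying the sequence, ACK, window, and checksum fields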
Rate limiting	Congestion control may limit rate below available capacity
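The header and ACK rows in the table can be quantified with simple arithmetic. This minimal sketch assumes IPv4 with 20-byte IP and TCP headers, a 1500-byte MTU (1460-byte payload), and one pure ACK for every two data segments, as with delayed ACKs. The function names are hypothetical.

```python
IP_HEADER = 20   # IPv4 header without options, bytes
TCP_HEADER = 20  # TCP header without options, bytes


def header_overhead(payload: int, tcp_options: int = 0) -> float:
    """Fraction of each packet spent on IP + TCP headers."""
    headers = IP_HEADER + TCP_HEADER + tcp_options
    return headers / (headers + payload)


def ack_traffic_ratio(data_packet: int, segments_per_ack: int = 2) -> float:
    """Reverse-direction ACK bytes per forward data byte, for pure 40-byte ACKs."""
    ack_packet = IP_HEADER + TCP_HEADER
    return ack_packet / (data_packet * segments_per_ack)


print(f"{header_overhead(1460):.2%}")                  # 2.67%
print(f"{header_overhead(1448, tcp_options=12):.2%}")  # 3.47% with the 12-byte timestamp option
print(f"{ack_traffic_ratio(1500):.2%}")                # 1.33%
```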

Memory costs:

Send buffer: Keeps unacknowledged data for potential retransmission
Receive buffer: Buffers out-of-order segments pending gap fill
TCB state: Per-connection state consumes kernel memory
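The per-socket buffer sizes behind these memory costs can be inspected and requested through standard socket options. In this minimal sketch, the 256 KiB request is an illustrative value. The kernel may adjust it: Linux, for example, doubles the requested value to leave room for bookkeeping and caps it at a system-wide maximum.

```python
import socket

REQUESTED_RCVBUF = 256 * 1024  # illustrative request, bytes

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    default_snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    default_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    # Request a larger receive buffer before connecting: on many systems the
    # window scale factor is chosen during the handshake.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_RCVBUF)
    granted_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

print(f"send buffer: {default_snd} bytes, receive buffer: {default_rcv} bytes")
print(f"requested {REQUESTED_RCVBUF} bytes, kernel granted {granted_rcv} bytes")
```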

The Reliability-Performance Trade-off

Summary: The Mechanics of Reliability

TCP's reliability is not magic—it's a carefully engineered set of mechanisms working together. Let's summarize what we've learned:

Key Takeaways

•Sequence numbers uniquely identify every byte, enabling gap detection, reordering, and duplicate detection.
•Acknowledgments inform the sender what arrived. Cumulative ACKs are simple; SACK provides precise gap information.
•Retransmission via timeout (RTO) catches all losses; fast retransmit (3 dup ACKs) accelerates common cases.
•Duplicate detection uses sequence numbers to ensure data is delivered exactly once.
•Checksums detect corruption, causing corrupted segments to be discarded and retransmitted.
•The reliability contract is binary: complete delivery or explicit connection failure.
•Reliability has costs: latency, throughput overhead, memory, and complexity.

From Chaos to Order

What's next: