When you send an email, you expect it to arrive complete—not with missing paragraphs or scrambled sentences. When you download a file, you expect every byte to match the original—not a corrupted version missing chunks. When you submit a bank transaction, you expect it to either succeed completely or fail cleanly—not partially execute with uncertain results.
This expectation of reliability is so fundamental that we rarely think about it. Yet the underlying internet provides no such guarantee. IP packets can be lost to congestion, corrupted by transmission errors, duplicated by retransmissions, or arrive in random order. The network is chaos.
TCP transforms this chaos into order. It takes the internet's "best-effort" delivery and builds upon it a rock-solid guarantee: your data will arrive correctly and completely, or the connection will fail with an explicit error. There is no middle ground, no partial success, no silent data loss.
In this page, we'll explore the mechanisms that make this possible—the elegant engineering that converts unreliable packet delivery into reliable stream transport.
By the end of this page, you will understand how TCP achieves reliable delivery through sequence numbers, acknowledgments, retransmission mechanisms, duplicate detection, and checksum verification. You'll see how these mechanisms work together as a coordinated system, and understand the trade-offs TCP makes between reliability and performance.
To appreciate TCP's reliability mechanisms, we must first understand what can go wrong in a packet-switched network:
Types of packet errors: loss, corruption, duplication, and reordering.
IP's position: "Not my problem"
The Internet Protocol explicitly provides only best-effort delivery. RFC 791 is blunt about this: IP does not provide a reliable communication facility, with no acknowledgments, no retransmissions, and no flow control. This isn't a bug; it's a deliberate design decision that keeps the network layer simple and scalable.
But applications need reliability. They can't deal with missing data or corruption. So something must fill this gap—and that something is TCP.
The reliability requirements:
Reliability is implemented at the endpoints (TCP hosts) rather than in the network (routers) because endpoints can do it correctly and completely—the network cannot. A router that retransmits lost packets doesn't know if the ultimate destination received them. Only the final destination can confirm receipt. This is the essence of the end-to-end principle.
TCP's reliability begins with sequence numbers. Every byte in the data stream is assigned a unique 32-bit sequence number, providing the foundation for all other reliability mechanisms.
Byte-oriented numbering:
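Unlike protocols that number packets or messages, TCP numbers individual bytes. If a connection's Initial Sequence Number (ISN) is 1000 and the sender sends 500 bytes, those bytes are numbered 1000-1499. The next segment starts at sequence 1500. (Strictly speaking, the SYN itself consumes one sequence number, so the first data byte is ISN + 1; the example here treats the ISN as the first data byte to keep the arithmetic simple.)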
ISN = 1000
First segment: Seq=1000, Len=500 → Bytes 1000-1499
Second segment: Seq=1500, Len=500 → Bytes 1500-1999
Third segment: Seq=2000, Len=300 → Bytes 2000-2299
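The numbering above can be reproduced with a few lines of Python. This is a minimal sketch, not how a real stack is structured: `segment_stream`, `first_seq`, and `mss` are hypothetical names chosen for illustration, and the starting value 1000 is simply the example's round number.

```python
def segment_stream(data: bytes, first_seq: int, mss: int):
    """Split a byte stream into (seq, payload) pairs.

    Every byte gets its own sequence number; a segment's Seq field
    is the number of the first byte it carries.
    """
    segments = []
    for offset in range(0, len(data), mss):
        payload = data[offset:offset + mss]
        seq = (first_seq + offset) % 2**32  # sequence numbers are 32-bit
        segments.append((seq, payload))
    return segments


# 1300 bytes starting at sequence 1000, at most 500 bytes per segment
for seq, payload in segment_stream(bytes(1300), first_seq=1000, mss=500):
    last = seq + len(payload) - 1
    print(f"Seq={seq}, Len={len(payload)} -> bytes {seq}-{last}")
```

Because the numbering is per byte, the same stream can be cut at different boundaries on retransmission without changing which number any byte carries.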
This byte-oriented approach enables TCP to:
Sequence Number Functions:
Sequence numbers serve multiple critical purposes:
| Function | How Sequence Numbers Help | Example |
|---|---|---|
| Gap Detection | Receiver identifies missing bytes by sequence gaps | Got 1000-1499, 2000-2499... where's 1500-1999? |
| Reordering | Receiver reassembles out-of-order segments correctly | Segments 3,1,2 arrive → assemble as 1,2,3 |
| Duplicate Detection | Receiver discards bytes with already-received sequence numbers | Retransmission of bytes 1000-1499 ignored if already received |
| Acknowledgment | Receiver tells sender which bytes arrived via ACK | ACK=2000 means 'received all bytes before 2000' |
| Retransmission | Sender knows exactly which bytes to retransmit | No ACK for 1500-1999? Retransmit that range |
```
TCP Sequence Space (32-bit, wraps at 2^32)

0                          2^31                        2^32-1
|---------------------------|---------------------------|  ↻ wraps around to 0

For a connection with ISN=1000:

1000        1500        2000        2500        3000
|-----------|-----------|-----------|-----------|
↑           ↑           ↑           ↑
|           |           |           Next to send (SND.NXT)
|           |           Last sent
|           Last acknowledged (SND.UNA)
ISN (connection start)

Send Window (bytes sender can transmit):

           SND.UNA             SND.NXT                 SND.UNA + SND.WND
              |===================|~~~~~~~~~~~~~~~~~~~~~~~|
   ACKed      |     Sent but      |       Can send        |
(can discard) |      unACKed      |   (but haven't yet)   |

Receive Window (bytes receiver expects):

RCV.NXT                                      RCV.NXT + RCV.WND
|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
|                                            |
Next expected                                Last acceptable
(anything before is duplicate/old)           (anything after is too far ahead)
```

At 32 bits, sequence numbers wrap around after 4 GiB (2^32 bytes) of data. For very high-speed connections (10Gbps+), this can happen in seconds. The PAWS (Protection Against Wrapped Sequences) mechanism uses TCP timestamps to distinguish old wrapped segments from new ones. Without PAWS, wrapped sequence numbers could cause data corruption.
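Because the sequence space wraps, "before" and "after" cannot be decided with a plain `<`. The sketch below shows one common way to compare sequence numbers modulo 2^32, along with the arithmetic behind the "wraps in seconds" claim. The helper names (`seq_lt`, `seq_add`, `seconds_to_wrap`) are made up for this illustration, and the comparison assumes the two numbers are less than 2^31 apart.

```python
MOD = 2**32  # size of the TCP sequence space


def seq_add(seq: int, n: int) -> int:
    """Advance a sequence number by n bytes, wrapping at 2^32."""
    return (seq + n) % MOD


def seq_lt(a: int, b: int) -> bool:
    """True if a comes before b, treating the space as circular.

    Only meaningful when a and b are less than 2^31 apart.
    """
    return a != b and ((b - a) % MOD) < 2**31


def seconds_to_wrap(bits_per_second: float) -> float:
    """Time to send 2^32 bytes at a given line rate."""
    return MOD * 8 / bits_per_second


near_end = 4_294_967_000              # close to 2^32
print(seq_add(near_end, 500))         # wraps past zero
print(seq_lt(near_end, 100))          # True: 100 is "after" the wrap
print(seq_lt(100, near_end))          # False
print(f"{seconds_to_wrap(10e9):.2f} s at 10 Gbps")
print(f"{seconds_to_wrap(1e9):.2f} s at 1 Gbps")
```

At 10 Gbps the full space is consumed in roughly 3.4 seconds, which is why a timestamp-based check such as PAWS is needed on fast paths.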
Sequence numbers would be useless without a way for the receiver to tell the sender what arrived. TCP's acknowledgment (ACK) mechanism closes this loop.
Cumulative Acknowledgments:
TCP uses cumulative ACKs: the ACK number indicates the next byte the receiver expects—meaning all bytes before that number have been received.
ACK = 2000 → "I have received all bytes up to (but not including) 2000.
I'm expecting byte 2000 next."
This cumulative approach has a major advantage: if an ACK gets lost, the next ACK still confirms everything. If ACK 2000 is lost but ACK 2500 arrives, the sender knows every byte through 2499 was received.
However, cumulative ACKs have a drawback: they can't directly tell the sender which specific segments arrived after a gap. If the segment carrying bytes 1500-1999 is lost but 2000-2999 arrived, the receiver can only say ACK=1500 ("I'm still waiting for 1500").
Selective Acknowledgments (SACK):
SACK is a TCP option that addresses the cumulative ACK limitation. SACK allows the receiver to report non-contiguous blocks of received data:
ACK=1500
SACK blocks: (2000-2499), (3000-3499)
This tells the sender: "I'm missing 1500-1999 and 2500-2999, but I have the rest." The sender can now retransmit only the missing ranges, not everything from 1500 onwards.
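To make the relationship between the cumulative ACK and the SACK blocks concrete, here is a small sketch of how a receiver could derive both from the byte ranges it holds, and how a sender could turn them back into the list of holes. The function names are hypothetical, ranges are half-open `(start, end)` pairs (so the text's "2000-2499" is written `(2000, 2500)`), and sequence wraparound is ignored for simplicity.

```python
def ack_and_sack(rcv_nxt, received):
    """Return (ack, sack_blocks) for a receiver.

    rcv_nxt:  next in-order byte expected before considering `received`
    received: half-open (start, end) ranges held out of order, any order
    """
    ack = rcv_nxt
    blocks = []
    for start, end in sorted(received):
        if start <= ack:
            ack = max(ack, end)  # contiguous with in-order data
        elif blocks and start <= blocks[-1][1]:
            # overlaps or touches the previous block: merge them
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    return ack, blocks


def holes(ack, sack_blocks):
    """Ranges the sender can infer are missing below the highest SACKed byte."""
    gaps, cursor = [], ack
    for start, end in sack_blocks:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    return gaps


ack, blocks = ack_and_sack(1500, [(3000, 3500), (2000, 2500)])
print(ack, blocks)          # 1500 [(2000, 2500), (3000, 3500)]
print(holes(ack, blocks))   # [(1500, 2000), (2500, 3000)]
```

The sender-side `holes` result is exactly the "missing 1500-1999 and 2500-2999" conclusion from the example above.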
SACK format in TCP options:
| Kind | Length | Block 1 Start | Block 1 End | Block 2 Start | Block 2 End | ... |
|---|---|---|---|---|---|---|
| 5 | 2+8*n | 32-bit | 32-bit | 32-bit | 32-bit | ... |
SACK must be negotiated during the handshake (SACK-Permitted option in SYN). If both sides support SACK, the receiver can use it to report gaps precisely.
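As an illustration of the wire layout, the following sketch packs and unpacks a SACK option with Python's `struct` module. It assumes the layout defined in RFC 2018: one kind byte (5), one length byte equal to 2 + 8 per block, then a pair of 32-bit big-endian edges per block, where the right edge is the sequence number just past the last byte of the block. The function names are hypothetical.

```python
import struct

SACK_KIND = 5


def encode_sack_option(blocks):
    """Pack (left_edge, right_edge) pairs into a SACK option."""
    if not 1 <= len(blocks) <= 4:
        # 40 bytes of TCP option space leaves room for at most 4 blocks
        raise ValueError("a SACK option carries 1 to 4 blocks")
    length = 2 + 8 * len(blocks)
    option = struct.pack("!BB", SACK_KIND, length)
    for left, right in blocks:
        option += struct.pack("!II", left, right)
    return option


def decode_sack_option(option: bytes):
    """Unpack a SACK option back into a list of (left_edge, right_edge)."""
    kind, length = struct.unpack_from("!BB", option)
    if kind != SACK_KIND or length != len(option) or (length - 2) % 8:
        raise ValueError("not a well-formed SACK option")
    return [struct.unpack_from("!II", option, off) for off in range(2, length, 8)]


opt = encode_sack_option([(2000, 2500), (3000, 3500)])
print(len(opt), opt.hex())
print(decode_sack_option(opt))
```

Two blocks give an 18-byte option; when other options such as timestamps are present, fewer blocks fit.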
TCP doesn't send an ACK for every segment received. Delayed ACKs wait briefly (up to 500ms per RFC 5681, typically 40-200ms in practice depending on the operating system) hoping to piggyback the ACK on outgoing data. If no data is pending, the ACK is sent after the delay. RFC 5681 also says an ACK should be sent for at least every second full-sized segment, so the delay never covers more than one outstanding segment. This reduces ACK traffic but can hurt latency. Delaying is disabled for segments that arrive out of order: they trigger immediate ACKs to help the sender detect loss.
When data goes unacknowledged, TCP must retransmit. But when should it retransmit? Retransmitting too quickly wastes bandwidth on data that's merely delayed. Waiting too long leaves the receiver waiting unnecessarily. TCP uses multiple strategies to get this balance right.
1. Timeout-Based Retransmission:
The most fundamental mechanism: if an ACK doesn't arrive within the Retransmission Timeout (RTO), TCP assumes the segment was lost and retransmits.
Calculating RTO is tricky—networks have varying and dynamic delays. TCP estimates the Round-Trip Time (RTT) and sets RTO based on it:
RTO = SRTT + 4 × RTTVAR
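Where SRTT is the smoothed round-trip time (an exponentially weighted moving average of RTT samples) and RTTVAR is the smoothed RTT variation (a moving average of how far samples deviate from SRTT).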
This adaptive timeout adjusts to network conditions—fast networks get short RTOs, slow networks get longer ones.
```python
# TCP RTO calculation (follows the RFC 6298 update rules)
class RTOEstimator:
    def __init__(self):
        self.srtt = None     # Smoothed RTT (None means no sample yet)
        self.rttvar = None   # RTT variance
        self.rto = 1.0       # Initial RTO = 1 second (common default)

        # Constants from RFC 6298
        self.alpha = 0.125   # SRTT smoothing factor (1/8)
        self.beta = 0.25     # RTTVAR smoothing factor (1/4)
        self.k = 4           # Variance multiplier
        # Minimum RTO: 200ms as on Linux; RFC 6298 itself recommends a 1 s floor
        self.min_rto = 0.2
        self.max_rto = 60.0  # Maximum RTO (60 seconds)

    def update(self, rtt_sample):
        """Update RTO estimate with new RTT measurement."""
        if self.srtt is None:
            # First measurement - initialize
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # Subsequent measurements - compute incrementally
            # RTTVAR = (1-β) × RTTVAR + β × |SRTT - R'|
            # (must use the old SRTT, so update RTTVAR first)
            self.rttvar = (1 - self.beta) * self.rttvar + \
                self.beta * abs(self.srtt - rtt_sample)
            # SRTT = (1-α) × SRTT + α × R'
            self.srtt = (1 - self.alpha) * self.srtt + \
                self.alpha * rtt_sample

        # RTO = SRTT + 4 × RTTVAR
        self.rto = self.srtt + self.k * self.rttvar

        # Clamp to reasonable bounds
        self.rto = max(self.min_rto, min(self.max_rto, self.rto))
        return self.rto

    def timeout_occurred(self):
        """Handle RTO expiration - apply exponential backoff."""
        # Double RTO for each timeout (capped at max_rto)
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto


# Example usage:
estimator = RTOEstimator()
print(f"Initial RTO: {estimator.rto}s")

# Receive RTT samples
samples = [0.05, 0.055, 0.048, 0.062, 0.051]  # roughly 50ms
for sample in samples:
    rto = estimator.update(sample)
    print(f"RTT: {sample*1000:.0f}ms → RTO: {rto*1000:.1f}ms")

# If timeout occurs, back off exponentially
print("Timeout occurred!")
print(f"New RTO: {estimator.timeout_occurred()*1000:.1f}ms")
```

2. Fast Retransmit:
Waiting for timeout can be slow—RTOs are typically hundreds of milliseconds. Fast Retransmit accelerates loss detection using duplicate ACKs.
When the receiver gets an out-of-order segment, it immediately sends an ACK repeating the last in-order byte. Each subsequent out-of-order segment triggers another duplicate ACK.
The sender interprets 3 duplicate ACKs (4 ACKs with the same number total) as strong evidence that the next expected segment was lost, and retransmits that segment immediately instead of waiting for the RTO to expire.
Why not retransmit after 1 or 2 duplicate ACKs? Because minor packet reordering is common and not a sign of loss. A packet delayed by one or two positions causes 1-2 duplicate ACKs, then resolves. Three duplicate ACKs represent enough out-of-order segments that loss is likely. This threshold balances early detection against false positives.
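The duplicate-ACK rule is easy to express as a counter on the sender. The sketch below is a simplification: a real stack applies additional conditions before counting an ACK as a duplicate (for example, that it carries no data and that unacknowledged data is outstanding), and the class and method names here are hypothetical. Sequence wraparound is ignored.

```python
DUP_ACK_THRESHOLD = 3


class DupAckCounter:
    """Sender-side fast retransmit trigger (simplified)."""

    def __init__(self, snd_una):
        self.snd_una = snd_una  # oldest unacknowledged byte
        self.dup_acks = 0

    def on_ack(self, ack):
        """Return the sequence number to retransmit now, or None."""
        if ack > self.snd_una:
            # New data acknowledged: advance and reset the counter
            self.snd_una = ack
            self.dup_acks = 0
            return None
        if ack == self.snd_una:
            self.dup_acks += 1
            if self.dup_acks == DUP_ACK_THRESHOLD:
                return self.snd_una  # retransmit the segment starting here
        return None


# Segment starting at 1500 is lost; three later segments still arrive.
counter = DupAckCounter(snd_una=1000)
for ack in [1500, 1500, 1500, 1500]:
    print(ack, counter.on_ack(ack))
```

The first ACK=1500 is ordinary (it acknowledges bytes 1000-1499); the third repeat after it is the one that triggers the retransmission of 1500.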
Retransmissions mean the same data might arrive multiple times. TCP must detect and discard duplicates to deliver data exactly once.
Why duplicates occur:
How TCP detects duplicates:
Sequence numbers make detection straightforward:
Receiver state: RCV.NXT = 2000 (expecting byte 2000 next)
Incoming segment: Seq=1000, Len=500 (bytes 1000-1499)
Compare: 1000 ≤ 1499 < 2000
Conclusion: All bytes in this segment are before RCV.NXT
→ Duplicate! Discard the data.
→ Still send ACK (ACK=2000) to confirm receipt
The receiver's RCV.NXT serves as a watermark. Any data with sequence numbers below this has already been received and delivered—it's a duplicate.
```python
# Simplified segment reception logic
def receive_segment(segment, rcv_nxt, rcv_wnd, receive_buffer):
    """
    Process incoming TCP segment.
    Returns: (updated rcv_nxt, should_ack, ack_value)

    `segment` needs `.sequence_number` and `.data`; `receive_buffer` needs
    `.append(data)` and a dict `.out_of_order` keyed by sequence number.
    Sequence-number wraparound is ignored to keep the logic readable.
    """
    seq = segment.sequence_number
    data = segment.data
    seg_len = len(data)
    seg_end = seq + seg_len  # Last byte + 1

    # Calculate receive window boundaries
    rcv_window_end = rcv_nxt + rcv_wnd

    # Case 1: Entirely duplicate (completely before RCV.NXT)
    if seg_end <= rcv_nxt:
        print(f"Duplicate segment: {seq}-{seg_end-1} (already received)")
        # Still ACK to help sender know current state
        return rcv_nxt, True, rcv_nxt

    # Case 2: Entirely beyond window (too far ahead)
    if seq >= rcv_window_end:
        print(f"Segment beyond window: {seq} > {rcv_window_end-1}")
        # ACK current position, but don't process
        return rcv_nxt, True, rcv_nxt

    # Case 3: Partially overlapping with already-received data
    if seq < rcv_nxt:
        # Trim the duplicate prefix
        trim_bytes = rcv_nxt - seq
        data = data[trim_bytes:]
        seq = rcv_nxt
        print(f"Trimmed {trim_bytes} duplicate bytes")

    # Case 4: Partially beyond window
    if seg_end > rcv_window_end:
        # Trim the out-of-window suffix
        trim_bytes = seg_end - rcv_window_end
        data = data[:-trim_bytes]
        print(f"Trimmed {trim_bytes} bytes beyond window")

    # Case 5: In-order segment (starts exactly at RCV.NXT)
    if seq == rcv_nxt:
        # Deliver immediately
        receive_buffer.append(data)
        rcv_nxt = seq + len(data)

        # Check if we can deliver buffered out-of-order segments
        while rcv_nxt in receive_buffer.out_of_order:
            buffered = receive_buffer.out_of_order.pop(rcv_nxt)
            receive_buffer.append(buffered)
            rcv_nxt += len(buffered)

        return rcv_nxt, True, rcv_nxt

    # Case 6: Out-of-order segment (seq > RCV.NXT)
    # Buffer it for later; send duplicate ACK
    receive_buffer.out_of_order[seq] = data
    print(f"Out-of-order: buffered {seq}-{seq+len(data)-1}")
    return rcv_nxt, True, rcv_nxt  # ACK still indicates gap
```

Segments may partially overlap with already-received data, perhaps because the sender retransmitted more than necessary. TCP handles this by accepting only the new bytes and discarding duplicates. This is why sequence numbers track bytes, not segments: TCP can precisely identify which bytes are new.
Reliable delivery also means integrity—the data received must exactly match the data sent. TCP uses a 16-bit checksum to detect corruption.
The TCP Checksum:
The TCP checksum covers three things: a pseudo-header built from IP-layer fields, the TCP header, and the TCP payload.
The pseudo-header includes IP addresses because TCP wants to verify the segment arrived at the correct destination—not just that the TCP header is valid. Including IP addresses catches misrouted segments.
Pseudo-header format (IPv4):
| Field | Size |
|---|---|
| Source IP Address | 4 bytes |
| Destination IP Address | 4 bytes |
| Zero (padding) | 1 byte |
| Protocol (6 for TCP) | 1 byte |
| TCP Length | 2 bytes |
Checksum calculation:
The checksum is the 16-bit one's complement sum of all 16-bit words in the pseudo-header, TCP header, and data (with the checksum field set to zero during calculation). Odd-length data is padded with a zero byte.
```python
import socket


def ones_complement_checksum(data: bytes) -> int:
    """
    Calculate TCP/IP one's complement checksum.
    1. Sum all 16-bit words
    2. Add any carries back into the sum
    3. Take one's complement of result
    """
    # Pad to even length if necessary
    if len(data) % 2 == 1:
        data = data + b'\x00'

    # Sum all 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) + data[i + 1]
        total += word

    # Fold 32-bit sum to 16 bits (add carries)
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)

    # One's complement
    return ~total & 0xFFFF


def create_pseudo_header(src_ip, dst_ip, tcp_length):
    """Create TCP pseudo-header for checksum calculation."""
    # Pack IP addresses (4 bytes each)
    src_bytes = socket.inet_aton(src_ip)
    dst_bytes = socket.inet_aton(dst_ip)

    # Pseudo-header: src_ip + dst_ip + zero + protocol + tcp_length
    pseudo = (
        src_bytes +                      # 4 bytes: source IP
        dst_bytes +                      # 4 bytes: dest IP
        b'\x00' +                        # 1 byte: zero/reserved
        b'\x06' +                        # 1 byte: protocol (6 = TCP)
        tcp_length.to_bytes(2, 'big')    # 2 bytes: TCP segment length
    )
    return pseudo


def compute_tcp_checksum(src_ip, dst_ip, tcp_header, tcp_data):
    """
    Compute TCP checksum including pseudo-header.

    Args:
        src_ip: Source IP address string
        dst_ip: Destination IP address string
        tcp_header: TCP header bytes (checksum field should be zero)
        tcp_data: TCP payload bytes

    Returns:
        16-bit checksum value
    """
    tcp_length = len(tcp_header) + len(tcp_data)
    pseudo_header = create_pseudo_header(src_ip, dst_ip, tcp_length)

    # Concatenate pseudo-header + TCP header + TCP data
    checksum_data = pseudo_header + tcp_header + tcp_data
    return ones_complement_checksum(checksum_data)


# Example verification at receiver:
def verify_tcp_segment(src_ip, dst_ip, tcp_header, tcp_data):
    """Verify received segment's checksum."""
    # Include the received checksum (left in tcp_header) in the calculation.
    # For a valid segment the 16-bit sum is 0xFFFF (all ones), so the
    # complemented value returned by ones_complement_checksum is 0x0000.
    checksum_data = (
        create_pseudo_header(src_ip, dst_ip, len(tcp_header) + len(tcp_data)) +
        tcp_header +
        tcp_data
    )
    result = ones_complement_checksum(checksum_data)
    return result == 0x0000
```

The TCP checksum is a simple sum: it detects accidental bit errors well but provides no protection against malicious modification. It can also miss certain error patterns (for example, two 16-bit words swapped with each other, or offsetting errors that leave the sum unchanged). Modern links typically have additional error detection (CRCs), and applications requiring cryptographic integrity should use TLS or similar protocols.
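The limits of a plain sum are easy to demonstrate. The short, self-contained sketch below recomputes the same one's complement checksum over three made-up byte strings (the function name `internet_checksum` and the sample values are illustrative only): a single flipped bit changes the checksum, while swapping two 16-bit words does not, because addition is order-independent.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"\x12\x34\xab\xcd\x00\x01"
bit_flip = b"\x12\x35\xab\xcd\x00\x01"  # lowest bit of the first word flipped
swapped = b"\xab\xcd\x12\x34\x00\x01"   # first two 16-bit words exchanged

print(internet_checksum(original) == internet_checksum(bit_flip))  # False: detected
print(internet_checksum(original) == internet_checksum(swapped))   # True: missed
```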
All the mechanisms we've discussed work together to provide TCP's reliability contract:
Every byte written to a TCP socket will be delivered to the application on the other end, in order, exactly once—or the connection will be terminated with an error.
This is a binary guarantee: complete success or explicit failure. There's no partial delivery, no silent data loss, no undetected corruption.
What "reliable" means:
| Problem | TCP Solution | Guarantee |
|---|---|---|
| Packet loss | Sequence numbers + ACKs + retransmission | All data eventually arrives or connection fails |
| Packet corruption | Checksum verification | Corrupted data is discarded and retransmitted |
| Packet duplication | Sequence number tracking | Data is delivered exactly once |
| Packet reordering | Sequence-based reassembly | Data is delivered in send order |
| Connection failure | Timeout + probe mechanisms | Persistent failures are reported to the application |
What "reliable" does NOT mean:
The application's role:
Reliability is a contract between TCP endpoints. The application must also cooperate:
Call recv() regularly: If the application is slow, the receive buffer fills, and the sender is throttled.

TCP guarantees byte delivery to the remote TCP stack, not to the remote application. If the remote host receives the data, ACKs it, then crashes before the application reads it, the sending TCP has seen the data acknowledged, but the application never saw it. Critical systems need application-level acknowledgment on top of TCP.
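One way to picture an application-level acknowledgment is a tiny request/confirm exchange layered on the byte stream. The sketch below is only an illustration under several assumptions: the 4-byte length prefix, the `b"OK"` reply, and all function names are invented for this example, and `socket.socketpair()` stands in for a real TCP connection so the code runs without a network.

```python
import socket
import struct


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf += chunk
    return buf


def send_message(sock, payload: bytes):
    """Send one length-prefixed message."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def receive_and_ack(sock, handle):
    """Read one message, process it, and only then confirm to the peer."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    handle(payload)        # e.g. commit to durable storage
    sock.sendall(b"OK")    # application-level acknowledgment


def wait_for_app_ack(sock) -> bool:
    """True once the remote application has confirmed processing."""
    return recv_exact(sock, 2) == b"OK"


client, server = socket.socketpair()  # stand-in for a TCP connection
processed = []
send_message(client, b"transfer $100")
receive_and_ack(server, processed.append)
acked = wait_for_app_ack(client)
print(acked, processed)
client.close()
server.close()
```

The key design point is the ordering in `receive_and_ack`: the confirmation is sent after `handle` returns, so the sender learns about application processing rather than about TCP-level receipt.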
Reliability comes at a cost. Every mechanism that makes TCP reliable also affects performance. Understanding these trade-offs helps when tuning TCP or choosing between TCP and alternatives.
Latency costs:
Throughput costs:
| Mechanism | Cost |
|---|---|
| ACKs | Consume bandwidth in the reverse direction |
| Retransmissions | Consume bandwidth resending data, some of which the receiver may already hold (spurious timeouts, partial overlaps) |
| Header overhead | 20+ bytes per segment for reliability fields |
| Rate limiting | Congestion control may limit rate below available capacity |
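The header overhead row can be quantified with simple arithmetic. The sketch below assumes a minimal 20-byte TCP header and a 20-byte IPv4 header with no options, and ignores link-layer framing; `payload_efficiency` is a hypothetical helper name.

```python
def payload_efficiency(payload_bytes, tcp_header=20, ip_header=20):
    """Fraction of each IP packet that is application data."""
    return payload_bytes / (payload_bytes + tcp_header + ip_header)


for payload in (1, 100, 1460):
    print(f"{payload:>5}-byte payload: {payload_efficiency(payload):.1%} efficient")
```

A full 1460-byte segment is about 97% payload, while a single keystroke sent in its own segment is under 3% payload, which is why small writes are comparatively expensive.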
Memory costs:
This is why UDP exists and why protocols like QUIC were created. Applications with different requirements make different trade-offs: real-time voice and video tolerate some loss to keep latency low; file transfers require perfect reliability regardless of latency. TCP offers one specific trade-off: strong reliability with its associated costs.
TCP's reliability is not magic—it's a carefully engineered set of mechanisms working together. Let's summarize what we've learned:
TCP takes the internet's unreliable, best-effort packet delivery and builds upon it a reliable, ordered byte stream. Applications can focus on their logic without worrying about network failures—TCP handles the recovery. This abstraction is TCP's greatest gift to application developers.
What's next:
Reliability ensures data arrives correctly; ordering ensures it arrives in the right sequence. The next page examines TCP's ordered delivery guarantee in detail—how sequence numbers enable reassembly, how the receive buffer manages out-of-order segments, and the implications of ordering guarantees for application design.