TCP/IP is the protocol suite that powers the internet. Every web request, database query, file transfer, and real-time communication relies on TCP's ability to provide reliable, ordered, error-checked delivery over an unreliable network. Linux's TCP implementation is among the most refined and battle-tested code in existence: it handles billions of connections globally, from embedded IoT devices to hyperscale data centers serving millions of concurrent connections.
Understanding Linux's TCP implementation isn't just academic curiosity. Performance engineers tune TCP parameters daily. Network developers implement custom congestion control algorithms. System administrators debug connection timeouts and throughput issues. Security researchers analyze TCP vulnerabilities. Everyone building networked systems benefits from understanding how TCP actually works under the hood.
This page dives deep into the Linux TCP implementation—the data structures that represent connections, the state machine that governs connection lifecycle, the algorithms that control congestion, and the optimizations that enable modern high-performance networking.
By the end of this page, you will understand the Linux TCP stack architecture, including the tcp_sock structure, connection establishment and termination, the TCP state machine implementation, socket hash tables for connection lookup, congestion control framework, and key TCP optimizations. You'll see how decades of protocol research have been distilled into production-grade code.
The Linux TCP implementation centers on struct tcp_sock—a large structure that extends struct sock with hundreds of TCP-specific fields. This structure contains everything needed to manage a TCP connection: sequence numbers, window sizes, congestion control state, retransmission timers, and much more.
The TCP socket hierarchy:
TCP sockets use a layered structure design:
```
struct tcp_sock
  └── struct inet_connection_sock
        └── struct inet_sock
              └── struct sock
```
Each layer adds protocol-specific fields while inheriting the generic socket functionality from its parent.
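Because each structure embeds its parent as its first member, kernel code moves between these views with plain pointer casts (the real helpers are tcp_sk(), inet_csk(), and inet_sk()). The following is a self-contained toy model of that pattern; the struct layouts are deliberately simplified stand-ins, not the real kernel definitions, which appear below.

```c
/* Toy model of the socket layering (NOT the real kernel structs): each
 * level embeds its parent as the FIRST member, so a pointer to the outer
 * object is also a valid pointer to every inner layer. */
#include <stdio.h>

struct sock                 { int sk_state; };
struct inet_sock            { struct sock sk; unsigned int inet_saddr; };
struct inet_connection_sock { struct inet_sock icsk_inet; unsigned int icsk_rto; };
struct tcp_sock             { struct inet_connection_sock inet_conn; unsigned int snd_cwnd; };

/* The kernel's accessors (tcp_sk(), inet_csk(), inet_sk()) are essentially
 * casts like these, relying on that first-member layout. */
static struct tcp_sock *tcp_sk(struct sock *sk)
{
	return (struct tcp_sock *)sk;
}

static struct inet_connection_sock *inet_csk(struct sock *sk)
{
	return (struct inet_connection_sock *)sk;
}

int main(void)
{
	struct tcp_sock tp = { .snd_cwnd = 10 };
	struct sock *sk = &tp.inet_conn.icsk_inet.sk;  /* generic view */

	tcp_sk(sk)->snd_cwnd += 1;                     /* TCP-specific view */
	inet_csk(sk)->icsk_rto = 200;                  /* connection-level view */

	printf("cwnd=%u rto=%u\n", tp.snd_cwnd, tp.inet_conn.icsk_rto);
	return 0;
}
```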
```c
/**
 * struct tcp_sock - TCP socket state
 *
 * This is the primary structure for TCP connections.
 * It contains all state needed for TCP protocol operation.
 */
struct tcp_sock {
	/* Inherit inet_connection_sock (includes struct sock) */
	struct inet_connection_sock inet_conn;

	/* === Sequence Number Tracking === */

	/* Send sequence space */
	u32	snd_una;	/* First unacknowledged byte */
	u32	snd_nxt;	/* Next sequence to send */
	u32	snd_sml;	/* Last byte of small packet sent */
	u32	write_seq;	/* Tail of data in send queue */
	u32	pushed_seq;	/* Last pushed sequence */

	/* Receive sequence space */
	u32	rcv_nxt;	/* Next expected receive sequence */
	u32	copied_seq;	/* Bytes copied to user */
	u32	rcv_wup;	/* rcv_nxt at last window update */

	/* === Window Management === */
	u32	snd_wnd;	/* Send window (from receiver) */
	u32	max_window;	/* Maximum window ever seen */
	u32	rcv_wnd;	/* Current receive window */
	u32	window_clamp;	/* Maximum window to advertise */

	/* Negotiated options (RFC 1323 window scaling, timestamps,
	 * SACK-permitted) live in the embedded rx_opt structure:
	 *   rx_opt.snd_wscale - scale applied to windows we receive
	 *   rx_opt.rcv_wscale - scale applied to windows we advertise
	 *   rx_opt.rcv_tsval  - most recently received timestamp value
	 *   rx_opt.rcv_tsecr  - timestamp echo reply
	 */
	struct tcp_options_received rx_opt;

	/* === Congestion Control === */
	u32	snd_cwnd;	/* Congestion window (in packets) */
	u32	snd_cwnd_cnt;	/* Fractional cwnd growth counter */
	u32	snd_ssthresh;	/* Slow start threshold */
	u32	prior_cwnd;	/* Cwnd before loss/recovery */

	/* RTT estimation (Jacobson/Karels) */
	u32	srtt_us;	/* Smoothed RTT in microseconds */
	u32	mdev_us;	/* RTT mean deviation */
	u32	mdev_max_us;	/* Maximum mdev for RTO */
	u32	rttvar_us;	/* Smoothed RTT variance */
	u32	rtt_seq;	/* Seq when RTT sample taken */

	/* === Retransmission === */
	u32	retrans_out;	/* Segments currently retransmitted */
	u32	lost_out;	/* Segments assumed lost */
	u32	sacked_out;	/* SACK'd segments */
	struct sk_buff *retransmit_skb_hint;	/* Where to resume retransmitting */

	/* Timers (see inet_connection_sock) */

	/* === Selective ACK (SACK) === */
	struct tcp_sack_block recv_sack_cache[4];	/* SACK blocks */
	struct tcp_sack_block selective_acks[4];	/* Current SACK info */

	/* === Connection Options === */
	u16	mss_cache;	/* Cached effective MSS */
	u16	advmss;		/* Advertised MSS */

	/* Timestamps (RFC 1323) */
	u32	tsoffset;	/* Timestamp offset */

	/* === Pacing and Delivery Rate === */
	u64	tcp_mstamp;	/* Most recent transmit timestamp */
	u32	delivered;	/* Total delivered segments */
	u32	app_limited;	/* Application-limited flag */
	struct rate_sample rs;	/* Rate sample for BBR, etc. */

	/* === Congestion Control Plugin === */
	const struct tcp_congestion_ops *ca_ops;	/* CC algorithm */
	u32	ca_priv[16];	/* Private CC algorithm state */

	/* ... many more fields ... */
};

/**
 * struct inet_connection_sock - Connection-oriented inet socket
 *
 * Contains connection management state shared by TCP, SCTP, etc.
 */
struct inet_connection_sock {
	struct inet_sock	  icsk_inet;

	/* Accept queue for listening sockets */
	struct request_sock_queue icsk_accept_queue;

	/* Retransmit and other timers */
	struct timer_list	  icsk_retransmit_timer;
	struct timer_list	  icsk_delack_timer;

	/* Timer state */
	__u8			  icsk_retransmits;	/* Retransmit count */
	__u8			  icsk_pending;		/* Pending timer */
	__u8			  icsk_backoff;		/* Backoff multiplier */

	/* Connection establishment */
	__u8			  icsk_syn_retries;
	__u32			  icsk_rto;		/* Retransmit timeout */
	/* Delayed-ACK state lives in the icsk_ack sub-struct
	 * (icsk_ack.ato is the delayed-ACK timeout). */

	/* Maximum segment size */
	__u16			  icsk_pmtu_cookie;	/* Path MTU */

	/* ... more fields ... */
};
```

A struct tcp_sock is approximately 2KB on 64-bit systems.
For a server handling 1 million connections, this means ~2GB just for socket structures. This is why high-connection-count servers carefully tune memory limits and why protocols like QUIC (UDP-based) can sometimes offer memory advantages.
TCP's three-way handshake is the foundation of reliable connection establishment. Linux implements this through a sophisticated mechanism involving request sockets, SYN queues, and accept queues—designed to handle both normal connections and SYN flood attacks.
The three-way handshake in Linux:
1. Client: tcp_v4_connect() builds and sends the initial SYN segment.
2. Server: tcp_v4_rcv() → tcp_v4_do_rcv() → tcp_conn_request() handles the SYN, creates a request socket, and replies with SYN-ACK.
3. Client ACKs the SYN-ACK; on the server, tcp_check_req() promotes the request socket to a full socket and places it on the accept queue.
```c
/**
 * Client-side: Initiate TCP connection
 */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_sock *inet = inet_sk(sk);
	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
	struct rtable *rt;

	/* Validate and extract destination address */
	daddr = usin->sin_addr.s_addr;
	dport = usin->sin_port;

	/* Route lookup */
	rt = ip_route_connect(...);
	if (IS_ERR(rt))
		return PTR_ERR(rt);

	/* Select source address if not bound */
	if (!inet->inet_saddr)
		inet->inet_saddr = fl4.saddr;

	/* Set peer address */
	inet->inet_daddr = daddr;
	inet->inet_dport = dport;

	/* Choose initial sequence number (ISN) */
	if (!tp->write_seq)
		tp->write_seq = secure_tcp_seq(inet->inet_saddr,
					       inet->inet_daddr,
					       inet->inet_sport,
					       inet->inet_dport);

	/* Generate initial timestamp */
	tp->tsoffset = secure_tcp_ts_off(net, ...);

	/* Build and send SYN */
	err = tcp_connect(sk);
	return err;
}

/**
 * tcp_connect - Send SYN and start connection timer
 */
int tcp_connect(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;

	/* Allocate SYN packet */
	skb = sk_stream_alloc_skb(sk, 0, GFP_KERNEL, true);

	/* Set SYN flag */
	tcp_skb_pcount_set(skb, 1);
	tcp_skb_timestamp(sk, skb);

	/* Initialize send sequence */
	tp->snd_nxt = tp->write_seq;
	tp->pushed_seq = tp->write_seq;

	/* Send SYN */
	tcp_transmit_skb(sk, skb, 1, GFP_KERNEL);

	/* Start retransmit timer */
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
	return 0;
}

/**
 * Server-side: Handle incoming SYN
 *
 * This creates a "request socket" (mini-socket) to track
 * the half-open connection without consuming full socket memory.
 */
int tcp_conn_request(struct request_sock_ops *rsk_ops,
		     const struct tcp_request_sock_ops *af_ops,
		     struct sock *sk, struct sk_buff *skb)
{
	struct request_sock *req;
	struct tcp_request_sock *treq;

	/* Allocate request socket (much smaller than tcp_sock) */
	req = inet_reqsk_alloc(rsk_ops, sk, true);
	if (!req)
		goto drop;

	treq = tcp_rsk(req);

	/* Store connection parameters */
	inet_rsk(req)->ir_loc_addr = ip_hdr(skb)->daddr;
	inet_rsk(req)->ir_rmt_addr = ip_hdr(skb)->saddr;
	inet_rsk(req)->ir_rmt_port = tcp_hdr(skb)->source;

	/* Generate server ISN */
	treq->snt_isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);

	/* Store in SYN table or use SYN cookies */
	if (net->core.sysctl_somaxconn <= inet_csk_reqsk_queue_len(sk)) {
		/* Queue full - use SYN cookies if enabled */
		if (!net->ipv4.sysctl_tcp_syncookies)
			goto drop;
		want_cookie = true;
	}

	if (!want_cookie)
		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);

	/* Send SYN-ACK */
	af_ops->send_synack(sk, dst, fl, req, ...);
	return 0;

drop:
	kfree(req);
	return 0;
}

/**
 * Process ACK completing the three-way handshake
 */
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
			   struct request_sock *req)
{
	struct sock *child;
	struct tcp_sock *tp;

	/* Validate ACK sequence */
	if (!between(TCP_SKB_CB(skb)->ack_seq,
		     tcp_rsk(req)->snt_isn,
		     tcp_rsk(req)->snt_isn + 1 + req->mss))
		return NULL;

	/* Create full socket from request */
	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
	if (!child)
		return NULL;

	/* Move to accept queue */
	inet_csk_reqsk_queue_add(sk, req, child);

	/* Wake up accept() waiters */
	sk_data_ready(sk);

	return child;
}
```

When the SYN queue is full, Linux can use SYN cookies (net.ipv4.tcp_syncookies=1). Instead of storing state for each SYN, the server encodes connection information into the ISN itself. When the ACK arrives, the server reconstructs the connection from the sequence number.
This allows handling massive SYN floods without memory exhaustion.
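As a rough illustration of the idea (the kernel's real implementation in net/ipv4/syncookies.c uses a keyed cryptographic hash over the 4-tuple plus a slowly rotating counter and encodes an MSS table index), here is a toy, self-contained sketch of folding connection state into the ISN and validating it when the final ACK arrives. The hash and bit layout below are invented for illustration only.

```c
/* Toy SYN-cookie sketch: NOT the kernel algorithm, just the shape of it. */
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel's keyed hash over the 4-tuple. */
static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport,
                         uint32_t secret, uint32_t counter)
{
	uint32_t h = secret ^ counter;
	h = (h ^ saddr) * 2654435761u;
	h = (h ^ daddr) * 2654435761u;
	h = (h ^ (((uint32_t)sport << 16) | dport)) * 2654435761u;
	return h;
}

/* Encode: the ISN carries a MAC of the 4-tuple plus a 2-bit MSS index,
 * so no per-SYN state has to be stored on the server. */
static uint32_t make_cookie(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            uint32_t secret, uint32_t counter,
                            uint32_t mss_index)
{
	return (toy_hash(saddr, daddr, sport, dport, secret, counter) & ~0x3u)
	       | (mss_index & 0x3u);
}

/* Validate: recompute the hash from the ACK's 4-tuple and compare;
 * the low bits give back the MSS class chosen at SYN time. */
static int check_cookie(uint32_t cookie, uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t secret, uint32_t counter,
                        uint32_t *mss_index)
{
	uint32_t expect = toy_hash(saddr, daddr, sport, dport, secret, counter);

	if ((cookie & ~0x3u) != (expect & ~0x3u))
		return 0;
	*mss_index = cookie & 0x3u;
	return 1;
}

int main(void)
{
	uint32_t secret = 0xdeadbeef, counter = 42, mss_index = 0;
	uint32_t isn = make_cookie(0x0a000001, 0x0a000002, 40000, 443,
	                           secret, counter, 2);
	int ok = check_cookie(isn, 0x0a000001, 0x0a000002, 40000, 443,
	                      secret, counter, &mss_index);

	printf("valid=%d mss_index=%u\n", ok, mss_index);
	return 0;
}
```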
TCP connections follow a well-defined state machine specified in RFC 793. Linux implements this state machine through the sk->sk_state field and associated functions that handle state transitions.
TCP connection states:
| State | Value | Description | Normal Transition |
|---|---|---|---|
| ESTABLISHED | 1 | Connection active, data flows | After 3-way handshake |
| SYN_SENT | 2 | SYN sent, awaiting SYN-ACK | After connect() call |
| SYN_RECV | 3 | SYN-ACK sent, awaiting ACK | Server after SYN received |
| FIN_WAIT1 | 4 | FIN sent, awaiting ACK or FIN | Active close initiated |
| FIN_WAIT2 | 5 | Our FIN acknowledged, awaiting peer FIN | ACK received in FIN_WAIT1 |
| TIME_WAIT | 6 | Waiting for delayed segments to expire | All FINs exchanged |
| CLOSE | 7 | Socket is closed | Final state |
| CLOSE_WAIT | 8 | Peer sent FIN, awaiting local close | Passive close started |
| LAST_ACK | 9 | FIN sent after CLOSE_WAIT | close() after CLOSE_WAIT |
| LISTEN | 10 | Socket is listening for connections | After listen() call |
| CLOSING | 11 | Both sides sent FIN simultaneously | Rare simultaneous close |
```c
/**
 * TCP states (from include/net/tcp_states.h)
 */
enum {
	TCP_ESTABLISHED = 1,
	TCP_SYN_SENT,
	TCP_SYN_RECV,
	TCP_FIN_WAIT1,
	TCP_FIN_WAIT2,
	TCP_TIME_WAIT,
	TCP_CLOSE,
	TCP_CLOSE_WAIT,
	TCP_LAST_ACK,
	TCP_LISTEN,
	TCP_CLOSING,
	TCP_NEW_SYN_RECV,	/* Request socket state */

	TCP_MAX_STATES		/* Leave at end */
};

/**
 * tcp_set_state - Change socket state
 *
 * This function handles all state transitions, updating
 * hash tables and performing necessary cleanup.
 */
void tcp_set_state(struct sock *sk, int state)
{
	int oldstate = sk->sk_state;

	/* Handle transitions affecting hash tables */
	switch (state) {
	case TCP_ESTABLISHED:
		if (oldstate != TCP_ESTABLISHED)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;

	case TCP_CLOSE:
		/* Socket closing - cleanup timers */
		__tcp_clear_all_timers(sk);
		/* Fall through */
	case TCP_CLOSE_WAIT:
		/* Leaving established state */
		if (oldstate == TCP_ESTABLISHED)
			TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;
	}

	/* Update socket state */
	sk_state_store(sk, state);

	/* Trace state transition (for debugging) */
	trace_tcp_set_state(sk, oldstate, state);
}

/**
 * tcp_rcv_state_process - Main TCP receive state machine
 *
 * This is the heart of TCP packet processing where state
 * transitions occur based on incoming segments.
 */
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcphdr *th = tcp_hdr(skb);
	int queued = 0;

	switch (sk->sk_state) {
	case TCP_CLOSE:
		/* Closed socket received packet - send RST */
		goto discard;

	case TCP_LISTEN:
		/* Listening socket - handle new connections */
		if (th->syn) {
			/* New connection request */
			return tcp_conn_request(...);
		}
		goto discard;

	case TCP_SYN_SENT:
		/* Awaiting SYN-ACK from peer */
		queued = tcp_rcv_synsent_state_process(sk, skb, th);
		if (queued >= 0)
			return queued;
		break;

	case TCP_SYN_RECV:
		/* Server awaiting ACK to complete handshake */
		if (th->ack && acceptable_ack) {
			tcp_set_state(sk, TCP_ESTABLISHED);
			/* Connection established! */
		}
		break;
	}

	/* Common processing for established states */
	if (!after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
		/* Segment is in window */
		if (th->rst) {
			tcp_reset(sk);
			return 0;
		}
		if (th->fin)
			tcp_fin(sk);
	}

	return 0;
}

/**
 * tcp_close - Initiate connection termination
 *
 * Called when application calls close() on socket.
 */
void tcp_close(struct sock *sk, long timeout)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int state;

	lock_sock(sk);

	/* Discard unsent data if linger is off */
	if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
		tp->linger2 = 0;
		/* Send RST instead of proper close */
		tcp_set_state(sk, TCP_CLOSE);
		tcp_send_active_reset(sk, GFP_KERNEL);
		goto out;
	}

	/* Normal close - send FIN */
	state = sk->sk_state;
	if (state == TCP_ESTABLISHED) {
		/* Enter FIN_WAIT1 and send FIN */
		tcp_set_state(sk, TCP_FIN_WAIT1);
		tcp_send_fin(sk);
	} else if (state == TCP_CLOSE_WAIT) {
		/* Peer already sent FIN - enter LAST_ACK */
		tcp_set_state(sk, TCP_LAST_ACK);
		tcp_send_fin(sk);
	}

out:
	release_sock(sk);
}

/**
 * TIME_WAIT handling
 *
 * TIME_WAIT sockets use a special stripped-down structure
 * to reduce memory consumption (there can be many).
 */
struct inet_timewait_sock {
	struct sock_common	__tw_common;

	__be16			tw_dport;	/* Peer port */
	unsigned char		tw_substate;	/* State within TIME_WAIT */
	unsigned char		tw_timeout;	/* Remaining timeout */

	/* Bind bucket reference */
	struct inet_bind_bucket	*tw_bind;

	/* Hlist entries for lookup */
	struct hlist_node	tw_death_node;
	struct hlist_node	tw_bind_node;
	struct hlist_node	tw_hash_node;
};
```

TIME_WAIT lasts 2*MSL (typically 60 seconds).
High-traffic servers making many outbound connections can exhaust local ports due to TIME_WAIT accumulation. Solutions include SO_REUSEADDR, tcp_tw_reuse sysctl, connection pooling, or switching to persistent connections (HTTP/2 keepalive).
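On the server side, the usual fix is SO_REUSEADDR, which lets a restarted daemon rebind its listening port while old connections are still draining through TIME_WAIT; client-side port pressure is typically addressed with net.ipv4.tcp_tw_reuse or connection pooling. A minimal sketch of the bind-side fix (error handling trimmed):

```c
/* Illustrative sketch: reclaim a listening port that may still have
 * TIME_WAIT remnants from a previous run of the server. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int make_listener(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	/* Without SO_REUSEADDR, bind() fails with EADDRINUSE while old
	 * connections on this port sit in TIME_WAIT. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 128) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```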
When a TCP packet arrives, the kernel must quickly find the socket it belongs to. This is accomplished through hash tables indexed by connection 4-tuple (source IP, source port, destination IP, destination port). The lookup must be extremely fast—it occurs for every single packet.
Hash table organization:
Linux maintains multiple hash tables for different socket types:
| Hash Table | Contents | Lookup Key | Purpose |
|---|---|---|---|
| ehash | Established sockets | 4-tuple (full connection) | Packet → socket lookup |
| bhash | Bound sockets | Local port | Port allocation, bind conflict check |
| lhash | Listening sockets | Local port | SYN → listening socket lookup |
```c
/**
 * struct inet_hashinfo - TCP socket hash tables
 *
 * This structure holds all hash tables used for TCP socket lookup.
 * It's initialized at boot and sized based on system memory.
 */
struct inet_hashinfo {
	/* Established/TIME_WAIT connections hash */
	struct inet_ehash_bucket	*ehash;
	spinlock_t			*ehash_locks;
	unsigned int			ehash_mask;
	unsigned int			ehash_locks_mask;

	/* Bind hash (listening and bound sockets) */
	struct kmem_cache		*bind_bucket_cachep;
	struct inet_bind_hashbucket	*bhash;
	unsigned int			bhash_size;

	/* Listening hash (listen sockets only) */
	struct inet_listen_hashbucket	*listening_hash;
	unsigned int			lhash2_mask;

	/* ... */
};

/* Global TCP hash table instance */
struct inet_hashinfo tcp_hashinfo;

/**
 * inet_ehash_bucket - Hash bucket for established connections
 *
 * Each bucket is an RCU-protected list of sockets
 * with same hash value.
 */
struct inet_ehash_bucket {
	struct hlist_nulls_head chain;
};

/**
 * __inet_lookup_established - Find established socket
 *
 * This function is called for every incoming TCP packet
 * to find the socket that should receive it.
 */
struct sock *__inet_lookup_established(struct net *net,
				       struct inet_hashinfo *hashinfo,
				       const __be32 saddr, const __be16 sport,
				       const __be32 daddr, const u16 hnum,
				       const int dif, const int sdif)
{
	INET_ADDR_COOKIE(acookie, saddr, daddr);
	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
	struct sock *sk;
	const struct hlist_nulls_node *node;

	unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
	unsigned int slot = hash & hashinfo->ehash_mask;
	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];

begin:
	/* Lockless RCU traversal */
	sk_nulls_for_each_rcu(sk, node, &head->chain) {
		/* Quick rejection using combined port comparison */
		if (sk->sk_hash != hash)
			continue;

		if (likely(INET_MATCH(sk, net, acookie, saddr, daddr,
				      ports, dif, sdif)))
			return sk;	/* Found! */
	}

	/* Handle nulls marker for concurrent modification */
	if (get_nulls_value(node) != slot)
		goto begin;

	return NULL;
}

/**
 * INET_MATCH macro - Check if socket matches packet
 *
 * Optimized for common case with prefetching and
 * minimal memory accesses.
 */
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
	(((__sk)->sk_portpair == (__ports))		&&	\
	 ((__sk)->sk_addrpair == (__cookie))		&&	\
	 (!(__sk)->sk_bound_dev_if			||	\
	  ((__sk)->sk_bound_dev_if == (__dif))		||	\
	  ((__sk)->sk_bound_dev_if == (__sdif)))	&&	\
	 net_eq(sock_net(__sk), (__net)))

/**
 * __inet_lookup_listener - Find listening socket
 *
 * Called when SYN arrives to find the server socket
 * that should handle the new connection.
 */
struct sock *__inet_lookup_listener(struct net *net,
				    struct inet_hashinfo *hashinfo,
				    struct sk_buff *skb, int doff,
				    const __be32 saddr, __be16 sport,
				    const __be32 daddr,
				    const unsigned short hnum,
				    const int dif, const int sdif)
{
	struct inet_listen_hashbucket *ilb2;
	struct sock *result = NULL;
	unsigned int hash2;

	/* Hash lookup in listening table */
	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
	ilb2 = &hashinfo->lhash2[hash2 & hashinfo->lhash2_mask];

	/* Search for exact match (IP + port) */
	result = inet_lhash2_lookup(net, ilb2, skb, doff,
				    saddr, sport, daddr, hnum,
				    dif, sdif);
	if (result)
		return result;

	/* Try wildcard (0.0.0.0) listener */
	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
	ilb2 = &hashinfo->lhash2[hash2 & hashinfo->lhash2_mask];

	return inet_lhash2_lookup(net, ilb2, skb, doff,
				  saddr, sport, htonl(INADDR_ANY), hnum,
				  dif, sdif);
}

/**
 * Hash table sizing at boot
 *
 * The hash table size is calculated based on available memory.
 * Larger tables reduce collision probability for high-connection systems.
 */
void __init tcp_init(void)
{
	/* Calculate hash table size based on memory */
	nr_pages = totalram_pages();

	/* Target: one bucket per expected maximum connections */
	tcp_hashinfo.ehash_mask =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					17,	/* min 2^17 = 128K entries */
					0, NULL,
					&tcp_hashinfo.ehash_locks_mask,
					0,
					64 * 1024);	/* max entries */
}
```

Socket lookup uses RCU (Read-Copy-Update) for lock-free packet processing. The receiving CPU can look up sockets without acquiring any locks—only an RCU read-side critical section. This enables multi-gigabit packet rates on multi-core systems where locking would cause severe contention.
TCP congestion control prevents senders from overwhelming the network. Linux implements a pluggable congestion control framework where different algorithms can be loaded as modules and selected per-socket or system-wide.
Congestion control fundamentals:
| Algorithm | Type | Key Feature | Use Case |
|---|---|---|---|
| Reno | Loss-based | Classic AIMD | Baseline, historical reference |
| CUBIC | Loss-based | Cubic function for cwnd growth | Default on most systems |
| BBR | Model-based | Measures bandwidth, minimizes latency | Google infrastructure, low-latency |
| Vegas | Delay-based | RTT increase signals congestion | Low-loss environments |
| DCTCP | ECN-based | Uses ECN for early signaling | Data center networks |
| Westwood+ | BW estimation | Estimates available bandwidth | Wireless networks |
```c
/**
 * struct tcp_congestion_ops - Congestion control algorithm interface
 *
 * Each CC algorithm implements this interface to plug into
 * the TCP stack.
 */
struct tcp_congestion_ops {
	struct list_head	list;

	/* Unique name for algorithm selection */
	char			name[TCP_CA_NAME_MAX];
	struct module		*owner;

	/* Required: Called on each ACK */
	void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);

	/* Required: Set slow start threshold on loss */
	u32 (*ssthresh)(struct sock *sk);

	/* Optional: Called when connection established */
	void (*init)(struct sock *sk);

	/* Optional: Called when connection destroyed */
	void (*release)(struct sock *sk);

	/* Optional: RTT sample callback */
	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);

	/* Optional: ECN handling */
	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);

	/* Optional: Undo cwnd reduction */
	u32 (*undo_cwnd)(struct sock *sk);

	/* Optional: Get current state for ss/netstat */
	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
			   union tcp_cc_info *info);

	/* Flags indicating algorithm capabilities */
	u32			flags;
};

/**
 * CUBIC congestion control (default since Linux 2.6.19)
 *
 * Uses a cubic function to grow cwnd, providing faster
 * recovery to previous bandwidth than Reno's linear growth.
 */
static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct bictcp *ca = inet_csk_ca(sk);

	if (!tcp_is_cwnd_limited(sk))
		return;	/* Application-limited, don't grow */

	if (tcp_in_slow_start(tp)) {
		/* Slow start: exponential growth */
		u32 cnt = tcp_slow_start(tp, acked);
		if (!cnt)
			return;
		acked = cnt;
	}

	/* Congestion avoidance: cubic function */
	bictcp_update(ca, tp->snd_cwnd, acked);
	tcp_cong_avoid_ai(tp, ca->cnt, acked);
}

/**
 * bictcp_update - Compute cubic window
 *
 * W(t) = C*(t-K)^3 + Wmax
 *
 * Where K = cubic_root(Wmax*beta/C), beta = 0.3, C = 0.4
 */
static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
{
	u32 delta, bic_target, max_cnt;
	u64 offs, t;

	/* Calculate time since last loss */
	t = (u64)(tcp_time_stamp(tp) - ca->epoch_start);
	t += usecs_to_jiffies(ca->delay_min >> 3);
	t <<= BICTCP_HZ;

	/* Calculate cubic window target */
	offs = ca->bic_K - t;
	delta = (cube_rtt_scale * offs * offs * offs) >> 40;
	bic_target = ca->bic_origin_point + delta;

	/* Compute growth rate */
	if (bic_target > cwnd)
		ca->cnt = cwnd / (bic_target - cwnd);
	else
		ca->cnt = 100 * cwnd;	/* Very slow growth */
}

/**
 * BBR congestion control (Google's model-based algorithm)
 *
 * BBR tries to send at the bottleneck bandwidth while
 * maintaining minimum RTT, avoiding buffer bloat.
 */
static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
	struct bbr *bbr = inet_csk_ca(sk);
	u32 bw;

	/* Estimate maximum bandwidth */
	bw = bbr_max_bw(sk);

	/* Estimate minimum RTT (propagation delay) */
	if (rs->rtt_us > 0 && rs->rtt_us < bbr->min_rtt_us)
		bbr->min_rtt_us = rs->rtt_us;

	/* Calculate target: BDP = bandwidth * delay */
	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}
```

Selecting a congestion control algorithm:

```
/* System-wide default */
$ sysctl -w net.ipv4.tcp_congestion_control=bbr

/* Per-socket (application code) */
setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, "bbr", 4);

/* View available algorithms */
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr
```

CUBIC is loss-based: it fills buffers until packets drop. BBR is model-based: it estimates available bandwidth and maintains minimal queuing.
BBR can achieve much lower latency but may be unfair to CUBIC flows in some conditions. Data centers often use BBR or DCTCP; the internet at large still mostly uses CUBIC.
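Tying the per-socket selection together, here is a small, self-contained userspace sketch that requests BBR for a single socket and reads back what the kernel actually installed. It assumes the tcp_bbr module is available; otherwise setsockopt() reports an error and the socket keeps the system default.

```c
/* Hypothetical snippet: pick BBR for one socket and verify the kernel
 * accepted it. The system-wide default is left untouched. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char name[16] = {0};
	socklen_t len = sizeof(name);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Request BBR for this connection only. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr",
		       strlen("bbr")) < 0)
		perror("setsockopt(TCP_CONGESTION)"); /* e.g. ENOENT if bbr is not loaded */

	/* Read back the algorithm the kernel actually attached. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) == 0)
		printf("congestion control: %s\n", name);

	close(fd);
	return 0;
}
```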
Linux TCP includes numerous optimizations developed over decades to maximize throughput and minimize latency. Understanding these helps tune performance for specific workloads.
Key TCP optimizations:
```c
/**
 * TCP Fast Open - Send data with SYN
 *
 * TFO allows data to be sent in the SYN packet for repeat
 * connections, saving one RTT.
 */

/* Client-side: Send data with SYN */
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
			 int *copied, size_t size)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sockaddr *uaddr;
	int err, flags;

	/* MSG_FASTOPEN triggers TFO */
	if (!(msg->msg_flags & MSG_FASTOPEN))
		return 0;

	/* Get cookie from cache or request new one */
	if (!tcp_fastopen_cookie_check(sk, &req->cookie)) {
		/* No cached cookie, will request one */
		tp->fastopen_req->cookie.len = 0;
	}

	/* Build and send SYN with data */
	err = tcp_connect(sk);

	/* Copy data to send buffer */
	tcp_sendmsg_locked(sk, msg, size);

	return err;
}

/* Server-side: Accept data in SYN */
int tcp_fastopen_create_child(struct sock *sk, struct sk_buff *skb,
			      struct request_sock *req)
{
	struct sock *child;

	/* Validate TFO cookie */
	if (!tcp_fastopen_cookie_valid(&foc)) {
		/* Invalid cookie - fall back to normal handshake */
		return -1;
	}

	/* Create child socket immediately (before ACK) */
	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);

	/* Queue SYN data for application */
	if (skb->len > tcp_hdrlen(skb)) {
		struct sk_buff *data_skb = skb_clone(skb, GFP_ATOMIC);

		__skb_pull(data_skb, tcp_hdrlen(skb));
		skb_queue_tail(&child->sk_receive_queue, data_skb);
		child->sk_data_ready(child);
	}

	return 0;
}

/**
 * SACK processing
 *
 * SACK tells sender which segments receiver has, enabling
 * selective retransmission.
 */
void tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
			     u32 prior_snd_una)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_sack_block *sp;
	int num_sacks;

	/* Parse SACK blocks from ACK segment */
	sp = tcp_parse_options(ack_skb)->sack;
	num_sacks = tcp_parse_options(ack_skb)->num_sacks;

	for (i = 0; i < num_sacks; i++) {
		u32 start_seq = ntohl(sp[i].start_seq);
		u32 end_seq = ntohl(sp[i].end_seq);

		/* Mark segments in this range as SACKed */
		skb_queue_walk(&sk->sk_write_queue, skb) {
			if (between(TCP_SKB_CB(skb)->seq, start_seq, end_seq))
				TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
		}
	}

	/* Retransmit non-SACKed segments that are considered lost */
	tcp_xmit_retransmit_queue(sk);
}
```

Key sysctl tuning parameters:

```
# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3           # Both client and server

# Window scaling for high-BDP paths
net.ipv4.tcp_window_scaling = 1

# SACK
net.ipv4.tcp_sack = 1

# Receive buffer auto-tuning
net.ipv4.tcp_moderate_rcvbuf = 1

# Buffer sizes (auto-tuned between min/default/max)
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 65536 6291456
```

Different workloads need different tuning. Web servers benefit from TFO and small buffers. Bulk transfer (backups, replication) needs large buffers. Interactive applications (gaming, SSH) benefit from TCP_NODELAY and TCP_QUICKACK. Database traffic often uses TCP_QUICKACK to reduce commit latency.
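For the interactive-workload knobs just mentioned, the per-socket options look roughly like this. This is a hedged sketch, not a complete program: the fd is assumed to be an already-connected TCP socket, and the function name is invented for illustration.

```c
/* Illustrative per-socket latency tuning for interactive traffic. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void tune_for_interactive(int fd)
{
	int one = 1;

	/* Disable Nagle: small writes go out immediately instead of being
	 * coalesced while earlier data is still unacknowledged. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	/* Ask for immediate ACKs instead of delayed ACKs. TCP_QUICKACK is
	 * not permanent; the kernel may fall back to delayed ACKs, so
	 * latency-sensitive code often re-arms it after each receive. */
	setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```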
While TCP dominates connection-oriented communication, UDP is essential for DNS, streaming, gaming, VoIP, and is the foundation for QUIC/HTTP3. Linux's UDP implementation is simpler than TCP—no connection state, no reliability, no congestion control—but it still provides important features.
UDP socket structure:
```c
/**
 * struct udp_sock - UDP socket state
 *
 * Much simpler than tcp_sock - no connection tracking needed.
 */
struct udp_sock {
	struct inet_sock inet;

	int		 pending;	/* Pending message type */
	unsigned int	 corkflag;	/* UDP_CORK is set */
	__u8		 encap_type;	/* Encapsulation (ESP, GTP, etc.) */

	/* GRO (Generic Receive Offload) support */
	u16		 len;		/* Total pending length */
	u16		 gso_size;	/* GSO segment size */

	/* Receive queue memory */
	int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};

/**
 * udp_sendmsg - Send UDP datagram
 */
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
	struct inet_sock *inet = inet_sk(sk);
	struct udp_sock *up = udp_sk(sk);
	struct flowi4 *fl4;
	struct sk_buff *skb;
	int err;

	/* Get destination from message or connected socket */
	if (msg->msg_name) {
		/* sendto() with explicit destination */
		struct sockaddr_in *usin = msg->msg_name;

		daddr = usin->sin_addr.s_addr;
		dport = usin->sin_port;
	} else if (sk->sk_state == TCP_ESTABLISHED) {
		/* Connected UDP socket */
		daddr = inet->inet_daddr;
		dport = inet->inet_dport;
	} else {
		return -EDESTADDRREQ;
	}

	/* Route lookup */
	rt = ip_route_output_flow(net, &fl4, sk);

	/* Corked send - accumulate data */
	if (up->pending) {
		skb = ip_finish_skb(sk, &fl4);
	} else {
		/* Allocate skb and copy user data */
		skb = sock_alloc_send_skb(sk, len, msg->msg_flags, &err);
		copy_from_iter(skb_put(skb, len), len, &msg->msg_iter);
	}

	/* Add UDP header */
	udp_set_header(skb, inet->inet_sport, dport);

	/* Send via IP layer */
	err = udp_send_skb(skb, &fl4);
	return err;
}

/**
 * udp_rcv - Receive UDP datagram
 */
int udp_rcv(struct sk_buff *skb)
{
	struct sock *sk;
	struct udphdr *uh;
	__be32 saddr, daddr;

	/* Extract addresses and ports */
	uh = udp_hdr(skb);
	saddr = ip_hdr(skb)->saddr;
	daddr = ip_hdr(skb)->daddr;

	/* Validate checksum */
	if (udp_lib_checksum_complete(skb))
		goto csum_error;

	/* Look up socket (hash lookup by 4-tuple or 2-tuple) */
	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, &udp_table);
	if (sk) {
		/* Found socket - deliver */
		int ret = udp_queue_rcv_skb(sk, skb);

		return ret;
	}

	/* No socket - send ICMP port unreachable */
	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
	kfree_skb(skb);
	return 0;
}

/**
 * UDP optimizations
 */

/* UDP GRO - Coalesce related UDP packets */
struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb)
{
	/* Combine UDP packets with same flow into one skb */
	/* Reduces per-packet overhead for high-rate flows */
}

/* UDP GSO - Segment large UDP "super-packets" in software */
struct sk_buff *udp4_gso_segment(struct sk_buff *skb,
				 netdev_features_t features)
{
	/* Split large UDP message into MTU-sized segments */
	/* Application sends one large write, kernel segments */
}
```

QUIC (HTTP/3) implements TCP-like reliability over UDP. This enables user-space control over congestion and reliability, faster evolution than kernel TCP, and avoids head-of-line blocking. Linux's UDP optimizations (GRO, GSO, receive ring buffers) make high-performance QUIC implementations possible.
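The connected-UDP path in udp_sendmsg() above has a userspace counterpart worth knowing: calling connect() on a UDP socket records the peer once, lets the kernel match replies against the full 4-tuple, and routes ICMP errors back to the socket. A minimal sketch, using the documentation address 192.0.2.53 as a placeholder peer (not a real service):

```c
/* Minimal connected-UDP sketch; the peer address is a placeholder. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port   = htons(53),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	inet_pton(AF_INET, "192.0.2.53", &peer.sin_addr);

	/* connect() on UDP just records the peer in the socket; after this
	 * the kernel can match incoming datagrams on the full 4-tuple and
	 * send()/recv() work without a destination argument. */
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	const char probe[] = "ping";
	if (send(fd, probe, sizeof(probe) - 1, 0) < 0)
		perror("send"); /* ICMP errors can surface here, e.g. ECONNREFUSED */

	close(fd);
	return 0;
}
```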
Linux's TCP/IP implementation represents decades of protocol research translated into production code. Understanding these internals enables effective performance tuning, debugging, and system design for networked applications.
What's next:
With the TCP/IP stack understood, we'll trace the complete packet flow through Linux networking—from application write to wire transmission, and from NIC reception to application read. You'll see how all the pieces we've covered integrate into a cohesive packet processing pipeline.
You now understand the Linux TCP/IP implementation—the data structures, state machines, and algorithms that power internet communication. This knowledge is essential for network performance engineering, debugging connection issues, and building high-performance networked applications.