TCP/IP is the protocol suite that powers the internet. Every web request, database query, file transfer, and real-time communication relies on TCP's ability to provide reliable, ordered, error-checked delivery over an unreliable network. Linux's TCP implementation is among the most refined and battle-tested code in existence: it handles billions of connections globally, from embedded IoT devices to hyperscale data centers serving millions of concurrent connections.
Understanding Linux's TCP implementation isn't just academic curiosity. Performance engineers tune TCP parameters daily. Network developers implement custom congestion control algorithms. System administrators debug connection timeouts and throughput issues. Security researchers analyze TCP vulnerabilities. Everyone building networked systems benefits from understanding how TCP actually works under the hood.
This page dives deep into the Linux TCP implementation—the data structures that represent connections, the state machine that governs connection lifecycle, the algorithms that control congestion, and the optimizations that enable modern high-performance networking.
By the end of this page, you will understand the Linux TCP stack architecture, including the tcp_sock structure, connection establishment and termination, the TCP state machine implementation, socket hash tables for connection lookup, congestion control framework, and key TCP optimizations. You'll see how decades of protocol research have been distilled into production-grade code.
The Linux TCP implementation centers on struct tcp_sock—a large structure that extends struct sock with hundreds of TCP-specific fields. This structure contains everything needed to manage a TCP connection: sequence numbers, window sizes, congestion control state, retransmission timers, and much more.
The TCP socket hierarchy:
TCP sockets use a layered structure design:
```
struct tcp_sock
  └── struct inet_connection_sock
        └── struct inet_sock
              └── struct sock
```
Each layer adds protocol-specific fields while inheriting the generic socket functionality from its parent.
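Because each structure embeds its parent as its first member, kernel code moves between these views with plain pointer casts (the real helpers are tcp_sk(), inet_csk(), and inet_sk()). The following is a self-contained toy model of that pattern; the struct layouts are deliberately simplified stand-ins, not the real kernel definitions, which appear below.

```c
/* Toy model of the socket layering (NOT the real kernel structs): each
 * level embeds its parent as the FIRST member, so a pointer to the outer
 * object is also a valid pointer to every inner layer. */
#include <stdio.h>

struct sock                 { int sk_state; };
struct inet_sock            { struct sock sk; unsigned int inet_saddr; };
struct inet_connection_sock { struct inet_sock icsk_inet; unsigned int icsk_rto; };
struct tcp_sock             { struct inet_connection_sock inet_conn; unsigned int snd_cwnd; };

/* The kernel's accessors (tcp_sk(), inet_csk(), inet_sk()) are essentially
 * casts like these, relying on that first-member layout. */
static struct tcp_sock *tcp_sk(struct sock *sk)
{
	return (struct tcp_sock *)sk;
}

static struct inet_connection_sock *inet_csk(struct sock *sk)
{
	return (struct inet_connection_sock *)sk;
}

int main(void)
{
	struct tcp_sock tp = { .snd_cwnd = 10 };
	struct sock *sk = &tp.inet_conn.icsk_inet.sk;  /* generic view */

	tcp_sk(sk)->snd_cwnd += 1;                     /* TCP-specific view */
	inet_csk(sk)->icsk_rto = 200;                  /* connection-level view */

	printf("cwnd=%u rto=%u\n", tp.snd_cwnd, tp.inet_conn.icsk_rto);
	return 0;
}
```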
```c
/**
 * struct tcp_sock - TCP socket state
 *
 * This is the primary structure for TCP connections.
 * It contains all state needed for TCP protocol operation.
 */
struct tcp_sock {
	/* Inherit inet_connection_sock (includes struct sock) */
	struct inet_connection_sock inet_conn;

	/* === Sequence Number Tracking === */

	/* Send sequence space */
	u32	snd_una;	/* First unacknowledged byte */
	u32	snd_nxt;	/* Next sequence to send */
	u32	snd_sml;	/* Last byte of small packet sent */
	u32	write_seq;	/* Tail of data in send queue */
	u32	pushed_seq;	/* Last pushed sequence */

	/* Receive sequence space */
	u32	rcv_nxt;	/* Next expected receive sequence */
	u32	copied_seq;	/* Bytes copied to user */
	u32	rcv_wup;	/* rcv_nxt at last window update */

	/* === Window Management === */
	u32	snd_wnd;	/* Send window (from receiver) */
	u32	max_window;	/* Maximum window ever seen */
	u32	rcv_wnd;	/* Current receive window */
	u32	window_clamp;	/* Maximum window to advertise */

	/* Negotiated options (RFC 1323 window scaling, timestamps,
	 * SACK-permitted) live in the embedded rx_opt structure:
	 *   rx_opt.snd_wscale - scale applied to windows we receive
	 *   rx_opt.rcv_wscale - scale applied to windows we advertise
	 *   rx_opt.rcv_tsval  - most recently received timestamp value
	 *   rx_opt.rcv_tsecr  - timestamp echo reply
	 */
	struct tcp_options_received rx_opt;

	/* === Congestion Control === */
	u32	snd_cwnd;	/* Congestion window (in packets) */
	u32	snd_cwnd_cnt;	/* Fractional cwnd growth counter */
	u32	snd_ssthresh;	/* Slow start threshold */
	u32	prior_cwnd;	/* Cwnd before loss/recovery */

	/* RTT estimation (Jacobson/Karels) */
	u32	srtt_us;	/* Smoothed RTT in microseconds */
	u32	mdev_us;	/* RTT mean deviation */
	u32	mdev_max_us;	/* Maximum mdev for RTO */
	u32	rttvar_us;	/* Smoothed RTT variance */
	u32	rtt_seq;	/* Seq when RTT sample taken */

	/* === Retransmission === */
	u32	retrans_out;	/* Segments currently retransmitted */
	u32	lost_out;	/* Segments assumed lost */
	u32	sacked_out;	/* SACK'd segments */
	struct sk_buff *retransmit_skb_hint;	/* Where to resume retransmitting */

	/* Timers (see inet_connection_sock) */

	/* === Selective ACK (SACK) === */
	struct tcp_sack_block recv_sack_cache[4];	/* SACK blocks */
	struct tcp_sack_block selective_acks[4];	/* Current SACK info */

	/* === Connection Options === */
	u16	mss_cache;	/* Cached effective MSS */
	u16	advmss;		/* Advertised MSS */

	/* Timestamps (RFC 1323) */
	u32	tsoffset;	/* Timestamp offset */

	/* === Pacing and Delivery Rate === */
	u64	tcp_mstamp;	/* Most recent transmit timestamp */
	u32	delivered;	/* Total delivered segments */
	u32	app_limited;	/* Application-limited flag */
	struct rate_sample rs;	/* Rate sample for BBR, etc. */

	/* === Congestion Control Plugin === */
	const struct tcp_congestion_ops *ca_ops;	/* CC algorithm */
	u32	ca_priv[16];	/* Private CC algorithm state */

	/* ... many more fields ... */
};

/**
 * struct inet_connection_sock - Connection-oriented inet socket
 *
 * Contains connection management state shared by TCP, SCTP, etc.
 */
struct inet_connection_sock {
	struct inet_sock	  icsk_inet;

	/* Accept queue for listening sockets */
	struct request_sock_queue icsk_accept_queue;

	/* Retransmit and other timers */
	struct timer_list	  icsk_retransmit_timer;
	struct timer_list	  icsk_delack_timer;

	/* Timer state */
	__u8			  icsk_retransmits;	/* Retransmit count */
	__u8			  icsk_pending;		/* Pending timer */
	__u8			  icsk_backoff;		/* Backoff multiplier */

	/* Connection establishment */
	__u8			  icsk_syn_retries;
	__u32			  icsk_rto;		/* Retransmit timeout */
	/* Delayed-ACK state lives in the icsk_ack sub-struct
	 * (icsk_ack.ato is the delayed-ACK timeout). */

	/* Maximum segment size */
	__u16			  icsk_pmtu_cookie;	/* Path MTU */

	/* ... more fields ... */
};
```

A struct tcp_sock is approximately 2KB on 64-bit systems.
For a server handling 1 million connections, this means ~2GB just for socket structures. This is why high-connection-count servers carefully tune memory limits and why protocols like QUIC (UDP-based) can sometimes offer memory advantages.
TCP's three-way handshake is the foundation of reliable connection establishment. Linux implements this through a sophisticated mechanism involving request sockets, SYN queues, and accept queues—designed to handle both normal connections and SYN flood attacks.
The three-way handshake in Linux:
1. Client: tcp_v4_connect() builds and sends the initial SYN segment.
2. Server: tcp_v4_rcv() → tcp_v4_do_rcv() → tcp_conn_request() handles the SYN, creates a request socket, and replies with SYN-ACK.
3. Client ACKs the SYN-ACK; on the server, tcp_check_req() promotes the request socket to a full socket and places it on the accept queue.
```c
/**
 * Client-side: Initiate TCP connection
 */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct inet_sock *inet = inet_sk(sk);
	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
	struct rtable *rt;

	/* Validate and extract destination address */
	daddr = usin->sin_addr.s_addr;
	dport = usin->sin_port;

	/* Route lookup */
	rt = ip_route_connect(...);
	if (IS_ERR(rt))
		return PTR_ERR(rt);

	/* Select source address if not bound */
	if (!inet->inet_saddr)
		inet->inet_saddr = fl4.saddr;

	/* Set peer address */
	inet->inet_daddr = daddr;
	inet->inet_dport = dport;

	/* Choose initial sequence number (ISN) */
	if (!tp->write_seq)
		tp->write_seq = secure_tcp_seq(inet->inet_saddr,
					       inet->inet_daddr,
					       inet->inet_sport,
					       inet->inet_dport);

	/* Generate initial timestamp */
	tp->tsoffset = secure_tcp_ts_off(net, ...);

	/* Build and send SYN */
	err = tcp_connect(sk);
	return err;
}

/**
 * tcp_connect - Send SYN and start connection timer
 */
int tcp_connect(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;

	/* Allocate SYN packet */
	skb = sk_stream_alloc_skb(sk, 0, GFP_KERNEL, true);

	/* Set SYN flag */
	tcp_skb_pcount_set(skb, 1);
	tcp_skb_timestamp(sk, skb);

	/* Initialize send sequence */
	tp->snd_nxt = tp->write_seq;
	tp->pushed_seq = tp->write_seq;

	/* Send SYN */
	tcp_transmit_skb(sk, skb, 1, GFP_KERNEL);

	/* Start retransmit timer */
	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
	return 0;
}

/**
 * Server-side: Handle incoming SYN
 *
 * This creates a "request socket" (mini-socket) to track
 * the half-open connection without consuming full socket memory.
 */
int tcp_conn_request(struct request_sock_ops *rsk_ops,
		     const struct tcp_request_sock_ops *af_ops,
		     struct sock *sk, struct sk_buff *skb)
{
	struct request_sock *req;
	struct tcp_request_sock *treq;

	/* Allocate request socket (much smaller than tcp_sock) */
	req = inet_reqsk_alloc(rsk_ops, sk, true);
	if (!req)
		goto drop;

	treq = tcp_rsk(req);

	/* Store connection parameters */
	inet_rsk(req)->ir_loc_addr = ip_hdr(skb)->daddr;
	inet_rsk(req)->ir_rmt_addr = ip_hdr(skb)->saddr;
	inet_rsk(req)->ir_rmt_port = tcp_hdr(skb)->source;

	/* Generate server ISN */
	treq->snt_isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);

	/* Store in SYN table or use SYN cookies */
	if (net->core.sysctl_somaxconn <= inet_csk_reqsk_queue_len(sk)) {
		/* Queue full - use SYN cookies if enabled */
		if (!net->ipv4.sysctl_tcp_syncookies)
			goto drop;
		want_cookie = true;
	}

	if (!want_cookie)
		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);

	/* Send SYN-ACK */
	af_ops->send_synack(sk, dst, fl, req, ...);
	return 0;

drop:
	kfree(req);
	return 0;
}

/**
 * Process ACK completing the three-way handshake
 */
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
			   struct request_sock *req)
{
	struct sock *child;
	struct tcp_sock *tp;

	/* Validate ACK sequence */
	if (!between(TCP_SKB_CB(skb)->ack_seq,
		     tcp_rsk(req)->snt_isn,
		     tcp_rsk(req)->snt_isn + 1 + req->mss))
		return NULL;

	/* Create full socket from request */
	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
	if (!child)
		return NULL;

	/* Move to accept queue */
	inet_csk_reqsk_queue_add(sk, req, child);

	/* Wake up accept() waiters */
	sk_data_ready(sk);

	return child;
}
```

When the SYN queue is full, Linux can use SYN cookies (net.ipv4.tcp_syncookies=1). Instead of storing state for each SYN, the server encodes connection information into the ISN itself. When the ACK arrives, the server reconstructs the connection from the sequence number.
This allows handling massive SYN floods without memory exhaustion.
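As a rough illustration of the idea (the kernel's real implementation in net/ipv4/syncookies.c uses a keyed cryptographic hash over the 4-tuple plus a slowly rotating counter and encodes an MSS table index), here is a toy, self-contained sketch of folding connection state into the ISN and validating it when the final ACK arrives. The hash and bit layout below are invented for illustration only.

```c
/* Toy SYN-cookie sketch: NOT the kernel algorithm, just the shape of it. */
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel's keyed hash over the 4-tuple. */
static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport,
                         uint32_t secret, uint32_t counter)
{
	uint32_t h = secret ^ counter;
	h = (h ^ saddr) * 2654435761u;
	h = (h ^ daddr) * 2654435761u;
	h = (h ^ (((uint32_t)sport << 16) | dport)) * 2654435761u;
	return h;
}

/* Encode: the ISN carries a MAC of the 4-tuple plus a 2-bit MSS index,
 * so no per-SYN state has to be stored on the server. */
static uint32_t make_cookie(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport,
                            uint32_t secret, uint32_t counter,
                            uint32_t mss_index)
{
	return (toy_hash(saddr, daddr, sport, dport, secret, counter) & ~0x3u)
	       | (mss_index & 0x3u);
}

/* Validate: recompute the hash from the ACK's 4-tuple and compare;
 * the low bits give back the MSS class chosen at SYN time. */
static int check_cookie(uint32_t cookie, uint32_t saddr, uint32_t daddr,
                        uint16_t sport, uint16_t dport,
                        uint32_t secret, uint32_t counter,
                        uint32_t *mss_index)
{
	uint32_t expect = toy_hash(saddr, daddr, sport, dport, secret, counter);

	if ((cookie & ~0x3u) != (expect & ~0x3u))
		return 0;
	*mss_index = cookie & 0x3u;
	return 1;
}

int main(void)
{
	uint32_t secret = 0xdeadbeef, counter = 42, mss_index = 0;
	uint32_t isn = make_cookie(0x0a000001, 0x0a000002, 40000, 443,
	                           secret, counter, 2);
	int ok = check_cookie(isn, 0x0a000001, 0x0a000002, 40000, 443,
	                      secret, counter, &mss_index);

	printf("valid=%d mss_index=%u\n", ok, mss_index);
	return 0;
}
```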
TCP connections follow a well-defined state machine specified in RFC 793. Linux implements this state machine through the sk->sk_state field and associated functions that handle state transitions.
TCP connection states:
| State | Value | Description | Normal Transition |
|---|---|---|---|
| ESTABLISHED | 1 | Connection active, data flows | After 3-way handshake |
| SYN_SENT | 2 | SYN sent, awaiting SYN-ACK | After connect() call |
| SYN_RECV | 3 | SYN-ACK sent, awaiting ACK | Server after SYN received |
| FIN_WAIT1 | 4 | FIN sent, awaiting ACK or FIN | Active close initiated |
| FIN_WAIT2 | 5 | Our FIN acknowledged, awaiting peer FIN | ACK received in FIN_WAIT1 |
| TIME_WAIT | 6 | Waiting for delayed segments to expire | All FINs exchanged |
| CLOSE | 7 | Socket is closed | Final state |
| CLOSE_WAIT | 8 | Peer sent FIN, awaiting local close | Passive close started |
| LAST_ACK | 9 | FIN sent after CLOSE_WAIT | close() after CLOSE_WAIT |
| LISTEN | 10 | Socket is listening for connections | After listen() call |
| CLOSING | 11 | Both sides sent FIN simultaneously | Rare simultaneous close |
```c
/**
 * TCP states (from include/net/tcp_states.h)
 */
enum {
	TCP_ESTABLISHED = 1,
	TCP_SYN_SENT,
	TCP_SYN_RECV,
	TCP_FIN_WAIT1,
	TCP_FIN_WAIT2,
	TCP_TIME_WAIT,
	TCP_CLOSE,
	TCP_CLOSE_WAIT,
	TCP_LAST_ACK,
	TCP_LISTEN,
	TCP_CLOSING,
	TCP_NEW_SYN_RECV,	/* Request socket state */

	TCP_MAX_STATES		/* Leave at end */
};

/**
 * tcp_set_state - Change socket state
 *
 * This function handles all state transitions, updating
 * hash tables and performing necessary cleanup.
 */
void tcp_set_state(struct sock *sk, int state)
{
	int oldstate = sk->sk_state;

	/* Handle transitions affecting hash tables */
	switch (state) {
	case TCP_ESTABLISHED:
		if (oldstate != TCP_ESTABLISHED)
			TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;

	case TCP_CLOSE:
		/* Socket closing - cleanup timers */
		__tcp_clear_all_timers(sk);
		/* Fall through */
	case TCP_CLOSE_WAIT:
		/* Leaving established state */
		if (oldstate == TCP_ESTABLISHED)
			TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
		break;
	}

	/* Update socket state */
	sk_state_store(sk, state);

	/* Trace state transition (for debugging) */
	trace_tcp_set_state(sk, oldstate, state);
}

/**
 * tcp_rcv_state_process - Main TCP receive state machine
 *
 * This is the heart of TCP packet processing where state
 * transitions occur based on incoming segments.
 */
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcphdr *th = tcp_hdr(skb);
	int queued = 0;

	switch (sk->sk_state) {
	case TCP_CLOSE:
		/* Closed socket received packet - send RST */
		goto discard;

	case TCP_LISTEN:
		/* Listening socket - handle new connections */
		if (th->syn) {
			/* New connection request */
			return tcp_conn_request(...);
		}
		goto discard;

	case TCP_SYN_SENT:
		/* Awaiting SYN-ACK from peer */
		queued = tcp_rcv_synsent_state_process(sk, skb, th);
		if (queued >= 0)
			return queued;
		break;

	case TCP_SYN_RECV:
		/* Server awaiting ACK to complete handshake */
		if (th->ack && acceptable_ack) {
			tcp_set_state(sk, TCP_ESTABLISHED);
			/* Connection established! */
		}
		break;
	}

	/* Common processing for established states */
	if (!after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
		/* Segment is in window */
		if (th->rst) {
			tcp_reset(sk);
			return 0;
		}
		if (th->fin)
			tcp_fin(sk);
	}

	return 0;
}

/**
 * tcp_close - Initiate connection termination
 *
 * Called when application calls close() on socket.
 */
void tcp_close(struct sock *sk, long timeout)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int state;

	lock_sock(sk);

	/* Discard unsent data if linger is off */
	if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
		tp->linger2 = 0;
		/* Send RST instead of proper close */
		tcp_set_state(sk, TCP_CLOSE);
		tcp_send_active_reset(sk, GFP_KERNEL);
		goto out;
	}

	/* Normal close - send FIN */
	state = sk->sk_state;
	if (state == TCP_ESTABLISHED) {
		/* Enter FIN_WAIT1 and send FIN */
		tcp_set_state(sk, TCP_FIN_WAIT1);
		tcp_send_fin(sk);
	} else if (state == TCP_CLOSE_WAIT) {
		/* Peer already sent FIN - enter LAST_ACK */
		tcp_set_state(sk, TCP_LAST_ACK);
		tcp_send_fin(sk);
	}

out:
	release_sock(sk);
}

/**
 * TIME_WAIT handling
 *
 * TIME_WAIT sockets use a special stripped-down structure
 * to reduce memory consumption (there can be many).
 */
struct inet_timewait_sock {
	struct sock_common	__tw_common;

	__be16			tw_dport;	/* Peer port */
	unsigned char		tw_substate;	/* State within TIME_WAIT */
	unsigned char		tw_timeout;	/* Remaining timeout */

	/* Bind bucket reference */
	struct inet_bind_bucket	*tw_bind;

	/* Hlist entries for lookup */
	struct hlist_node	tw_death_node;
	struct hlist_node	tw_bind_node;
	struct hlist_node	tw_hash_node;
};
```

TIME_WAIT lasts 2*MSL (typically 60 seconds).
High-traffic servers making many outbound connections can exhaust local ports due to TIME_WAIT accumulation. Solutions include SO_REUSEADDR, tcp_tw_reuse sysctl, connection pooling, or switching to persistent connections (HTTP/2 keepalive).
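On the server side, the usual fix is SO_REUSEADDR, which lets a restarted daemon rebind its listening port while old connections are still draining through TIME_WAIT; client-side port pressure is typically addressed with net.ipv4.tcp_tw_reuse or connection pooling. A minimal sketch of the bind-side fix (error handling trimmed):

```c
/* Illustrative sketch: reclaim a listening port that may still have
 * TIME_WAIT remnants from a previous run of the server. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int make_listener(unsigned short port)
{
	int one = 1;
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(port),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	/* Without SO_REUSEADDR, bind() fails with EADDRINUSE while old
	 * connections on this port sit in TIME_WAIT. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 128) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```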
When a TCP packet arrives, the kernel must quickly find the socket it belongs to. This is accomplished through hash tables indexed by connection 4-tuple (source IP, source port, destination IP, destination port). The lookup must be extremely fast—it occurs for every single packet.
Hash table organization:
Linux maintains multiple hash tables for different socket types:
| Hash Table | Contents | Lookup Key | Purpose |
|---|---|---|---|
| ehash | Established sockets | 4-tuple (full connection) | Packet → socket lookup |
| bhash | Bound sockets | Local port | Port allocation, bind conflict check |
| lhash | Listening sockets | Local port | SYN → listening socket lookup |
```c
/**
 * struct inet_hashinfo - TCP socket hash tables
 *
 * This structure holds all hash tables used for TCP socket lookup.
 * It's initialized at boot and sized based on system memory.
 */
struct inet_hashinfo {
	/* Established/TIME_WAIT connections hash */
	struct inet_ehash_bucket	*ehash;
	spinlock_t			*ehash_locks;
	unsigned int			ehash_mask;
	unsigned int			ehash_locks_mask;

	/* Bind hash (listening and bound sockets) */
	struct kmem_cache		*bind_bucket_cachep;
	struct inet_bind_hashbucket	*bhash;
	unsigned int			bhash_size;

	/* Listening hash (listen sockets only) */
	struct inet_listen_hashbucket	*listening_hash;
	unsigned int			lhash2_mask;

	/* ... */
};

/* Global TCP hash table instance */
struct inet_hashinfo tcp_hashinfo;

/**
 * inet_ehash_bucket - Hash bucket for established connections
 *
 * Each bucket is an RCU-protected list of sockets
 * with same hash value.
 */
struct inet_ehash_bucket {
	struct hlist_nulls_head chain;
};

/**
 * __inet_lookup_established - Find established socket
 *
 * This function is called for every incoming TCP packet
 * to find the socket that should receive it.
 */
struct sock *__inet_lookup_established(struct net *net,
				       struct inet_hashinfo *hashinfo,
				       const __be32 saddr, const __be16 sport,
				       const __be32 daddr, const u16 hnum,
				       const int dif, const int sdif)
{
	INET_ADDR_COOKIE(acookie, saddr, daddr);
	const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
	struct sock *sk;
	const struct hlist_nulls_node *node;

	unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
	unsigned int slot = hash & hashinfo->ehash_mask;
	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];

begin:
	/* Lockless RCU traversal */
	sk_nulls_for_each_rcu(sk, node, &head->chain) {
		/* Quick rejection using combined port comparison */
		if (sk->sk_hash != hash)
			continue;

		if (likely(INET_MATCH(sk, net, acookie, saddr, daddr,
				      ports, dif, sdif)))
			return sk;	/* Found! */
	}

	/* Handle nulls marker for concurrent modification */
	if (get_nulls_value(node) != slot)
		goto begin;

	return NULL;
}

/**
 * INET_MATCH macro - Check if socket matches packet
 *
 * Optimized for common case with prefetching and
 * minimal memory accesses.
 */
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
	(((__sk)->sk_portpair == (__ports))		&&	\
	 ((__sk)->sk_addrpair == (__cookie))		&&	\
	 (!(__sk)->sk_bound_dev_if			||	\
	  ((__sk)->sk_bound_dev_if == (__dif))		||	\
	  ((__sk)->sk_bound_dev_if == (__sdif)))	&&	\
	 net_eq(sock_net(__sk), (__net)))

/**
 * __inet_lookup_listener - Find listening socket
 *
 * Called when SYN arrives to find the server socket
 * that should handle the new connection.
 */
struct sock *__inet_lookup_listener(struct net *net,
				    struct inet_hashinfo *hashinfo,
				    struct sk_buff *skb, int doff,
				    const __be32 saddr, __be16 sport,
				    const __be32 daddr,
				    const unsigned short hnum,
				    const int dif, const int sdif)
{
	struct inet_listen_hashbucket *ilb2;
	struct sock *result = NULL;
	unsigned int hash2;

	/* Hash lookup in listening table */
	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
	ilb2 = &hashinfo->lhash2[hash2 & hashinfo->lhash2_mask];

	/* Search for exact match (IP + port) */
	result = inet_lhash2_lookup(net, ilb2, skb, doff,
				    saddr, sport, daddr, hnum,
				    dif, sdif);
	if (result)
		return result;

	/* Try wildcard (0.0.0.0) listener */
	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
	ilb2 = &hashinfo->lhash2[hash2 & hashinfo->lhash2_mask];

	return inet_lhash2_lookup(net, ilb2, skb, doff,
				  saddr, sport, htonl(INADDR_ANY), hnum,
				  dif, sdif);
}

/**
 * Hash table sizing at boot
 *
 * The hash table size is calculated based on available memory.
 * Larger tables reduce collision probability for high-connection systems.
 */
void __init tcp_init(void)
{
	/* Calculate hash table size based on memory */
	nr_pages = totalram_pages();

	/* Target: one bucket per expected maximum connections */
	tcp_hashinfo.ehash_mask =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					17,	/* min 2^17 = 128K entries */
					0, NULL,
					&tcp_hashinfo.ehash_locks_mask,
					0,
					64 * 1024);	/* max entries */
}
```

Socket lookup uses RCU (Read-Copy-Update) for lock-free packet processing. The receiving CPU can look up sockets without acquiring any locks—only an RCU read-side critical section. This enables multi-gigabit packet rates on multi-core systems where locking would cause severe contention.
TCP congestion control prevents senders from overwhelming the network. Linux implements a pluggable congestion control framework where different algorithms can be loaded as modules and selected per-socket or system-wide.
Congestion control fundamentals:
| Algorithm | Type | Key Feature | Use Case |
|---|---|---|---|
| Reno | Loss-based | Classic AIMD | Baseline, historical reference |
| CUBIC | Loss-based | Cubic function for cwnd growth | Default on most systems |
| BBR | Model-based | Measures bandwidth, minimizes latency | Google infrastructure, low-latency |
| Vegas | Delay-based | RTT increase signals congestion | Low-loss environments |
| DCTCP | ECN-based | Uses ECN for early signaling | Data center networks |
| Westwood+ | BW estimation | Estimates available bandwidth | Wireless networks |
```c
/**
 * struct tcp_congestion_ops - Congestion control algorithm interface
 *
 * Each CC algorithm implements this interface to plug into
 * the TCP stack.
 */
struct tcp_congestion_ops {
	struct list_head	list;

	/* Unique name for algorithm selection */
	char			name[TCP_CA_NAME_MAX];
	struct module		*owner;

	/* Required: Called on each ACK */
	void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);

	/* Required: Set slow start threshold on loss */
	u32 (*ssthresh)(struct sock *sk);

	/* Optional: Called when connection established */
	void (*init)(struct sock *sk);

	/* Optional: Called when connection destroyed */
	void (*release)(struct sock *sk);

	/* Optional: RTT sample callback */
	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);

	/* Optional: ECN handling */
	void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);

	/* Optional: Undo cwnd reduction */
	u32 (*undo_cwnd)(struct sock *sk);

	/* Optional: Get current state for ss/netstat */
	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
			   union tcp_cc_info *info);

	/* Flags indicating algorithm capabilities */
	u32			flags;
};

/**
 * CUBIC congestion control (default since Linux 2.6.19)
 *
 * Uses a cubic function to grow cwnd, providing faster
 * recovery to previous bandwidth than Reno's linear growth.
 */
static void bictcp_cong_avoid(struct sock *sk, u32 ack, u32 acked)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct bictcp *ca = inet_csk_ca(sk);

	if (!tcp_is_cwnd_limited(sk))
		return;	/* Application-limited, don't grow */

	if (tcp_in_slow_start(tp)) {
		/* Slow start: exponential growth */
		u32 cnt = tcp_slow_start(tp, acked);
		if (!cnt)
			return;
		acked = cnt;
	}

	/* Congestion avoidance: cubic function */
	bictcp_update(ca, tp->snd_cwnd, acked);
	tcp_cong_avoid_ai(tp, ca->cnt, acked);
}

/**
 * bictcp_update - Compute cubic window
 *
 * W(t) = C*(t-K)^3 + Wmax
 *
 * Where K = cubic_root(Wmax*beta/C), beta = 0.3, C = 0.4
 */
static inline void bictcp_update(struct bictcp *ca, u32 cwnd, u32 acked)
{
	u32 delta, bic_target, max_cnt;
	u64 offs, t;

	/* Calculate time since last loss */
	t = (u64)(tcp_time_stamp(tp) - ca->epoch_start);
	t += usecs_to_jiffies(ca->delay_min >> 3);
	t <<= BICTCP_HZ;

	/* Calculate cubic window target */
	offs = ca->bic_K - t;
	delta = (cube_rtt_scale * offs * offs * offs) >> 40;
	bic_target = ca->bic_origin_point + delta;

	/* Compute growth rate */
	if (bic_target > cwnd)
		ca->cnt = cwnd / (bic_target - cwnd);
	else
		ca->cnt = 100 * cwnd;	/* Very slow growth */
}

/**
 * BBR congestion control (Google's model-based algorithm)
 *
 * BBR tries to send at the bottleneck bandwidth while
 * maintaining minimum RTT, avoiding buffer bloat.
 */
static void bbr_main(struct sock *sk, const struct rate_sample *rs)
{
	struct bbr *bbr = inet_csk_ca(sk);
	u32 bw;

	/* Estimate maximum bandwidth */
	bw = bbr_max_bw(sk);

	/* Estimate minimum RTT (propagation delay) */
	if (rs->rtt_us > 0 && rs->rtt_us < bbr->min_rtt_us)
		bbr->min_rtt_us = rs->rtt_us;

	/* Calculate target: BDP = bandwidth * delay */
	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
}
```

Selecting a congestion control algorithm:

```
/* System-wide default */
$ sysctl -w net.ipv4.tcp_congestion_control=bbr

/* Per-socket (application code) */
setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION, "bbr", 4);

/* View available algorithms */
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic bbr
```

CUBIC is loss-based: it fills buffers until packets drop. BBR is model-based: it estimates available bandwidth and maintains minimal queuing.
BBR can achieve much lower latency but may be unfair to CUBIC flows in some conditions. Data centers often use BBR or DCTCP; the internet at large still mostly uses CUBIC.
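Tying the per-socket selection together, here is a small, self-contained userspace sketch that requests BBR for a single socket and reads back what the kernel actually installed. It assumes the tcp_bbr module is available; otherwise setsockopt() reports an error and the socket keeps the system default.

```c
/* Hypothetical snippet: pick BBR for one socket and verify the kernel
 * accepted it. The system-wide default is left untouched. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char name[16] = {0};
	socklen_t len = sizeof(name);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Request BBR for this connection only. */
	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, "bbr",
		       strlen("bbr")) < 0)
		perror("setsockopt(TCP_CONGESTION)"); /* e.g. ENOENT if bbr is not loaded */

	/* Read back the algorithm the kernel actually attached. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, &len) == 0)
		printf("congestion control: %s\n", name);

	close(fd);
	return 0;
}
```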
Linux TCP includes numerous optimizations developed over decades to maximize throughput and minimize latency. Understanding these helps tune performance for specific workloads.
Key TCP optimizations:
```c
/**
 * TCP Fast Open - Send data with SYN
 *
 * TFO allows data to be sent in the SYN packet for repeat
 * connections, saving one RTT.
 */

/* Client-side: Send data with SYN */
int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
			 int *copied, size_t size)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct sockaddr *uaddr;
	int err, flags;

	/* MSG_FASTOPEN triggers TFO */
	if (!(msg->msg_flags & MSG_FASTOPEN))
		return 0;

	/* Get cookie from cache or request new one */
	if (!tcp_fastopen_cookie_check(sk, &req->cookie)) {
		/* No cached cookie, will request one */
		tp->fastopen_req->cookie.len = 0;
	}

	/* Build and send SYN with data */
	err = tcp_connect(sk);

	/* Copy data to send buffer */
	tcp_sendmsg_locked(sk, msg, size);

	return err;
}

/* Server-side: Accept data in SYN */
int tcp_fastopen_create_child(struct sock *sk, struct sk_buff *skb,
			      struct request_sock *req)
{
	struct sock *child;

	/* Validate TFO cookie */
	if (!tcp_fastopen_cookie_valid(&foc)) {
		/* Invalid cookie - fall back to normal handshake */
		return -1;
	}

	/* Create child socket immediately (before ACK) */
	child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);

	/* Queue SYN data for application */
	if (skb->len > tcp_hdrlen(skb)) {
		struct sk_buff *data_skb = skb_clone(skb, GFP_ATOMIC);

		__skb_pull(data_skb, tcp_hdrlen(skb));
		skb_queue_tail(&child->sk_receive_queue, data_skb);
		child->sk_data_ready(child);
	}

	return 0;
}

/**
 * SACK processing
 *
 * SACK tells sender which segments receiver has, enabling
 * selective retransmission.
 */
void tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
			     u32 prior_snd_una)
{
	struct tcp_sock *tp = tcp_sk(sk);
	struct tcp_sack_block *sp;
	int num_sacks;

	/* Parse SACK blocks from ACK segment */
	sp = tcp_parse_options(ack_skb)->sack;
	num_sacks = tcp_parse_options(ack_skb)->num_sacks;

	for (i = 0; i < num_sacks; i++) {
		u32 start_seq = ntohl(sp[i].start_seq);
		u32 end_seq = ntohl(sp[i].end_seq);

		/* Mark segments in this range as SACKed */
		skb_queue_walk(&sk->sk_write_queue, skb) {
			if (between(TCP_SKB_CB(skb)->seq, start_seq, end_seq))
				TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
		}
	}

	/* Retransmit non-SACKed segments that are considered lost */
	tcp_xmit_retransmit_queue(sk);
}
```

Key sysctl tuning parameters:

```
# Enable TCP Fast Open
net.ipv4.tcp_fastopen = 3           # Both client and server

# Window scaling for high-BDP paths
net.ipv4.tcp_window_scaling = 1

# SACK
net.ipv4.tcp_sack = 1

# Receive buffer auto-tuning
net.ipv4.tcp_moderate_rcvbuf = 1

# Buffer sizes (auto-tuned between min/default/max)
net.ipv4.tcp_rmem = 4096 87380 6291456
net.ipv4.tcp_wmem = 4096 65536 6291456
```

Different workloads need different tuning. Web servers benefit from TFO and small buffers. Bulk transfer (backups, replication) needs large buffers. Interactive applications (gaming, SSH) benefit from TCP_NODELAY and TCP_QUICKACK. Database traffic often uses TCP_QUICKACK to reduce commit latency.
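For the interactive-workload knobs just mentioned, the per-socket options look roughly like this. This is a hedged sketch, not a complete program: the fd is assumed to be an already-connected TCP socket, and the function name is invented for illustration.

```c
/* Illustrative per-socket latency tuning for interactive traffic. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void tune_for_interactive(int fd)
{
	int one = 1;

	/* Disable Nagle: small writes go out immediately instead of being
	 * coalesced while earlier data is still unacknowledged. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	/* Ask for immediate ACKs instead of delayed ACKs. TCP_QUICKACK is
	 * not permanent; the kernel may fall back to delayed ACKs, so
	 * latency-sensitive code often re-arms it after each receive. */
	setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```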
While TCP dominates connection-oriented communication, UDP is essential for DNS, streaming, gaming, VoIP, and is the foundation for QUIC/HTTP3. Linux's UDP implementation is simpler than TCP—no connection state, no reliability, no congestion control—but it still provides important features.
UDP socket structure:
```c
/**
 * struct udp_sock - UDP socket state
 *
 * Much simpler than tcp_sock - no connection tracking needed.
 */
struct udp_sock {
	struct inet_sock inet;

	int		 pending;	/* Pending message type */
	unsigned int	 corkflag;	/* UDP_CORK is set */
	__u8		 encap_type;	/* Encapsulation (ESP, GTP, etc.) */

	/* GRO (Generic Receive Offload) support */
	u16		 len;		/* Total pending length */
	u16		 gso_size;	/* GSO segment size */

	/* Receive queue memory */
	int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};

/**
 * udp_sendmsg - Send UDP datagram
 */
int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
{
	struct inet_sock *inet = inet_sk(sk);
	struct udp_sock *up = udp_sk(sk);
	struct flowi4 *fl4;
	struct sk_buff *skb;
	int err;

	/* Get destination from message or connected socket */
	if (msg->msg_name) {
		/* sendto() with explicit destination */
		struct sockaddr_in *usin = msg->msg_name;

		daddr = usin->sin_addr.s_addr;
		dport = usin->sin_port;
	} else if (sk->sk_state == TCP_ESTABLISHED) {
		/* Connected UDP socket */
		daddr = inet->inet_daddr;
		dport = inet->inet_dport;
	} else {
		return -EDESTADDRREQ;
	}

	/* Route lookup */
	rt = ip_route_output_flow(net, &fl4, sk);

	/* Corked send - accumulate data */
	if (up->pending) {
		skb = ip_finish_skb(sk, &fl4);
	} else {
		/* Allocate skb and copy user data */
		skb = sock_alloc_send_skb(sk, len, msg->msg_flags, &err);
		copy_from_iter(skb_put(skb, len), len, &msg->msg_iter);
	}

	/* Add UDP header */
	udp_set_header(skb, inet->inet_sport, dport);

	/* Send via IP layer */
	err = udp_send_skb(skb, &fl4);
	return err;
}

/**
 * udp_rcv - Receive UDP datagram
 */
int udp_rcv(struct sk_buff *skb)
{
	struct sock *sk;
	struct udphdr *uh;
	__be32 saddr, daddr;

	/* Extract addresses and ports */
	uh = udp_hdr(skb);
	saddr = ip_hdr(skb)->saddr;
	daddr = ip_hdr(skb)->daddr;

	/* Validate checksum */
	if (udp_lib_checksum_complete(skb))
		goto csum_error;

	/* Look up socket (hash lookup by 4-tuple or 2-tuple) */
	sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, &udp_table);
	if (sk) {
		/* Found socket - deliver */
		int ret = udp_queue_rcv_skb(sk, skb);

		return ret;
	}

	/* No socket - send ICMP port unreachable */
	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
	kfree_skb(skb);
	return 0;
}

/**
 * UDP optimizations
 */

/* UDP GRO - Coalesce related UDP packets */
struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb)
{
	/* Combine UDP packets with same flow into one skb */
	/* Reduces per-packet overhead for high-rate flows */
}

/* UDP GSO - Segment large UDP "super-packets" in software */
struct sk_buff *udp4_gso_segment(struct sk_buff *skb,
				 netdev_features_t features)
{
	/* Split large UDP message into MTU-sized segments */
	/* Application sends one large write, kernel segments */
}
```

QUIC (HTTP/3) implements TCP-like reliability over UDP. This enables user-space control over congestion and reliability, faster evolution than kernel TCP, and avoids head-of-line blocking. Linux's UDP optimizations (GRO, GSO, receive ring buffers) make high-performance QUIC implementations possible.
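The connected-UDP path in udp_sendmsg() above has a userspace counterpart worth knowing: calling connect() on a UDP socket records the peer once, lets the kernel match replies against the full 4-tuple, and routes ICMP errors back to the socket. A minimal sketch, using the documentation address 192.0.2.53 as a placeholder peer (not a real service):

```c
/* Minimal connected-UDP sketch; the peer address is a placeholder. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port   = htons(53),
	};
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	inet_pton(AF_INET, "192.0.2.53", &peer.sin_addr);

	/* connect() on UDP just records the peer in the socket; after this
	 * the kernel can match incoming datagrams on the full 4-tuple and
	 * send()/recv() work without a destination argument. */
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		close(fd);
		return 1;
	}

	const char probe[] = "ping";
	if (send(fd, probe, sizeof(probe) - 1, 0) < 0)
		perror("send"); /* ICMP errors can surface here, e.g. ECONNREFUSED */

	close(fd);
	return 0;
}
```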
Linux's TCP/IP implementation represents decades of protocol research translated into production code. Understanding these internals enables effective performance tuning, debugging, and system design for networked applications.
What's next:
With the TCP/IP stack understood, we'll trace the complete packet flow through Linux networking—from application write to wire transmission, and from NIC reception to application read. You'll see how all the pieces we've covered integrate into a cohesive packet processing pipeline.
You now understand the Linux TCP/IP implementation—the data structures, state machines, and algorithms that power internet communication. This knowledge is essential for network performance engineering, debugging connection issues, and building high-performance networked applications.