Every time you click a link, send a message, or query a database, packets traverse a carefully orchestrated path through the Linux kernel. Understanding this packet flow—from the moment data leaves your application until it hits the network wire, and from wire reception to application delivery—is essential for performance debugging, security analysis, and network engineering.
This page brings together everything we've covered: the socket layer, protocol stack, network devices, and TCP/IP implementation. We'll trace packets step-by-step through the kernel, annotating each function, data structure, and decision point. You'll see how interrupt handling, NAPI processing, netfilter hooks, and traffic control all integrate into a cohesive packet processing pipeline.
By the end, you'll have a mental model of Linux networking that enables you to reason about latency sources, identify bottlenecks, trace packets with tools like ftrace and bpftrace, and understand what actually happens when network I/O occurs.
By the end of this page, you will understand the complete transmit path (application to wire), the complete receive path (wire to application), interrupt and softirq processing for network I/O, the role of DMA and ring buffers, and how to trace and debug packet flow using kernel tools.
When an application calls write() or send() on a socket, data begins a journey down through multiple kernel subsystems before reaching the network interface. The transmit path can be divided into distinct phases, each with specific responsibilities.
The five phases of packet transmission:
| Phase | Context | Key Functions | Actions |
|---|---|---|---|
| 1. Socket layer | Process context | sys_sendto → sock_sendmsg | Validate, copy data to kernel |
| 2. Transport layer (TCP/UDP) | Process context | tcp_sendmsg, tcp_transmit_skb | Segment, add TCP/UDP header |
| 3. Network layer (IP) | Process context | ip_queue_xmit, ip_output | Route, add IP header, fragment |
| 4. Traffic control | Process context | dev_queue_xmit, qdisc | Shape, schedule, queue to device |
| 5. Device driver | Process/Softirq | ndo_start_xmit | DMA setup, NIC transmit |
Key insight: Process context transmission
Unlike the receive path (which uses softirqs), the transmit path typically runs entirely in the context of the sending process: the CPU cost of TCP, IP, and queueing work is charged to the caller, and data moves toward the device without waiting for deferred processing.
This design provides fairness—processes that send more data pay more CPU cost—and simplicity, as the sending thread naturally pushes its own traffic toward the wire.
Some transmit operations occur from softirq context: TCP timer expirations (retransmits), deferred packet completion handling, and traffic control scheduler triggers. These ensure packets are transmitted even when the originating process isn't running.
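For reference, the user-space side of the path we are about to trace is an ordinary socket call. Here is a minimal sketch; the 10.0.0.1:80 destination and the HTTP request string are placeholders, not part of the kernel code that follows.

```c
/* Minimal user-space sender: the send() below enters the kernel via
 * sys_sendto and starts the five-phase transmit path described above.
 * Destination address and payload are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(80),
    };
    inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    const char *req = "GET / HTTP/1.0\r\n\r\n";
    /* Phase 1 begins here: sys_sendto -> sock_sendmsg -> tcp_sendmsg */
    if (send(fd, req, strlen(req), 0) < 0)
        perror("send");

    close(fd);
    return 0;
}
```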
Let's trace a TCP send() call in detail, following the code path from system call entry to the transport layer.
The journey begins:
```c
/*
 * PHASE 1: System Call Entry
 *
 * User space: send(sockfd, buffer, len, flags)
 */

/* Syscall entry point */
SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
                unsigned int, flags, struct sockaddr __user *, addr,
                int, addr_len)
{
    return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}

int __sys_sendto(int fd, void __user *buff, size_t len, unsigned flags,
                 struct sockaddr __user *addr, int addr_len)
{
    struct socket *sock;
    struct sockaddr_storage address;
    struct msghdr msg;
    struct iovec iov;
    int err;

    /* 1. Look up socket from file descriptor */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        return err;

    /* 2. Build iovec for data transfer */
    iov.iov_base = buff;
    iov.iov_len = len;
    iov_iter_init(&msg.msg_iter, WRITE, &iov, 1, len);

    /* 3. Copy destination address from user space (if provided) */
    if (addr) {
        err = move_addr_to_kernel(addr, addr_len, &address);
        if (err < 0)
            goto out;
        msg.msg_name = (struct sockaddr *)&address;
        msg.msg_namelen = addr_len;
    }

    /* 4. Set flags */
    msg.msg_flags = flags;

    /* 5. Delegate to socket-specific sendmsg */
    err = sock_sendmsg(sock, &msg);

out:
    fput_light(sock->file, fput_needed);
    return err;
}

/**
 * sock_sendmsg - Generic socket send
 *
 * This calls the protocol-specific sendmsg operation
 * through the socket operations table.
 */
int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
    int err;

    /* Security hook - LSM (SELinux, AppArmor) checks */
    err = security_socket_sendmsg(sock, msg, msg_data_left(msg));
    if (err)
        return err;

    /* Call protocol sendmsg: sock->ops->sendmsg */
    /* For TCP: inet_sendmsg */
    return sock->ops->sendmsg(sock, msg, msg_data_left(msg));
}

/**
 * inet_sendmsg - INET family sendmsg wrapper
 */
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;

    /* Wait for socket to connect if needed */
    if (unlikely(inet_send_prepare(sk)))
        return -EAGAIN;

    /* Call protocol sendmsg: sk->sk_prot->sendmsg */
    /* For TCP: tcp_sendmsg */
    return sk->sk_prot->sendmsg(sk, msg, size);
}

/*
 * PHASE 2: TCP Processing
 */

/**
 * tcp_sendmsg - TCP send processing
 *
 * This is where data is copied from user space into
 * socket buffers and TCP segmentation begins.
 */
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int flags = msg->msg_flags;
    int copied = 0;
    int err;

    /* Lock the socket */
    lock_sock(sk);

    /* Check socket state */
    err = tcp_sendmsg_locked(sk, msg, size);

    release_sock(sk);
    return err;
}

int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int mss_now;
    int size_goal;
    int copied = 0;

    /* Calculate MSS and size goal */
    mss_now = tcp_send_mss(sk, &size_goal, msg->msg_flags);

    while (msg_data_left(msg)) {
        struct sk_buff *skb = tcp_write_queue_tail(sk);
        int copy;

        /* Allocate new skb if needed */
        if (!skb || (copy = size_goal - skb->len) <= 0) {
            skb = sk_stream_alloc_skb(sk, 0, GFP_KERNEL, first_skb);
            if (!skb)
                goto wait_for_memory;

            /* Add to write queue */
            tcp_add_write_queue_tail(sk, skb);
        }

        /* Copy data from user space */
        copy = min_t(int, copy, msg_data_left(msg));
        err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
        if (err)
            goto do_fault;

        copied += copy;
        tp->write_seq += copy;

        /* If segment is full, schedule transmission */
        if (skb->len >= mss_now)
            __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
    }

    /* Try to transmit pending segments */
    tcp_push(sk, msg->msg_flags, mss_now, ...);
    return copied;
}
```

For file transfers, sendfile() avoids copying data to user space. The kernel reads file pages directly into socket buffers. For the fastest path, MSG_ZEROCOPY uses the network card's scatter-gather DMA to transmit directly from user memory pages.
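To make those two options concrete from user space, here is a hedged sketch. Error handling and zerocopy completion reaping are simplified; SO_ZEROCOPY and MSG_ZEROCOPY require kernel 4.14+ and a reasonably recent libc, and the descriptors passed in are assumed to be valid.

```c
/* Sketch: sending a file with sendfile(), and opting a socket into
 * MSG_ZEROCOPY. Real code must also reap zerocopy completions from the
 * socket error queue (recvmsg with MSG_ERRQUEUE). */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

static void send_file(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    /* The kernel moves file pages into socket buffers directly;
     * the data never visits user space. */
    while (offset < st.st_size)
        if (sendfile(sock_fd, file_fd, &offset, st.st_size - offset) <= 0)
            break;
    close(file_fd);
}

static void send_zerocopy(int sock_fd, const void *buf, size_t len)
{
    int one = 1;
    /* Opt in once per socket... */
    setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    /* ...then request zero-copy per call; the user pages stay pinned
     * until the kernel signals completion on the error queue. */
    send(sock_fd, buf, len, MSG_ZEROCOPY);
}
```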
After TCP creates a segment, it must be wrapped in an IP header, routed to the correct interface, and handed to the device driver. This phase involves routing lookup, netfilter processing, and queueing discipline management.
```c
/**
 * tcp_transmit_skb - Build TCP header and pass to IP
 *
 * This is called when a segment is ready for transmission.
 * Handles the transition from transport to network layer.
 */
static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
                            int clone_it, gfp_t gfp_mask)
{
    struct inet_sock *inet = inet_sk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th;
    int tcp_header_size;

    /* Clone if skb is shared (for retransmissions) */
    if (clone_it) {
        skb = skb_clone(skb, gfp_mask);
        if (!skb)
            return -ENOBUFS;
    }

    /* Calculate header size with options */
    tcp_header_size = tcp_options_size + sizeof(struct tcphdr);

    /* Reserve space for TCP header */
    skb_push(skb, tcp_header_size);
    skb_reset_transport_header(skb);

    /* Build TCP header */
    th = (struct tcphdr *)skb->data;
    th->source  = inet->inet_sport;
    th->dest    = inet->inet_dport;
    th->seq     = htonl(tcb->seq);
    th->ack_seq = htonl(tp->rcv_nxt);
    th->doff    = tcp_header_size >> 2;
    th->res1    = 0;
    tcp_init_flags(th, tcb->tcp_flags);
    th->window  = htons(tcp_select_window(sk));
    th->check   = 0;
    th->urg_ptr = 0;

    /* Add TCP options (MSS, timestamps, SACK, etc.) */
    tcp_options_write((__be32 *)(th + 1), tp, &opts);

    /* Calculate checksum (often offloaded to NIC) */
    if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
        th->check = ~tcp_v4_check(skb->len, inet->inet_saddr,
                                  inet->inet_daddr, 0);
        skb->csum_start = skb_transport_header(skb) - skb->head;
        skb->csum_offset = offsetof(struct tcphdr, check);
    } else {
        th->check = tcp_v4_check(skb->len, inet->inet_saddr,
                                 inet->inet_daddr,
                                 csum_partial(th, tcp_header_size, skb->csum));
    }

    /* Pass to IP layer */
    err = ip_queue_xmit(sk, skb, &inet->cork.fl);
    return err;
}

/*
 * PHASE 3: Network Layer (IP)
 */

/**
 * ip_queue_xmit - Add IP header and route packet
 */
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
{
    struct inet_sock *inet = inet_sk(sk);
    struct ip_options_rcu *inet_opt;
    struct rtable *rt;
    struct iphdr *iph;

    /* 1. Route lookup (usually cached on socket) */
    rt = (struct rtable *)__sk_dst_check(sk, 0);
    if (!rt) {
        /* Cache miss - perform full route lookup */
        rt = ip_route_output_ports(sock_net(sk), fl4, sk, daddr, saddr,
                                   dport, sport, IPPROTO_TCP, ...);
        if (IS_ERR(rt))
            goto no_route;
        sk_setup_caps(sk, &rt->dst);    /* Cache route */
    }
    skb_dst_set_noref(skb, &rt->dst);

    /* 2. Reserve space for IP header */
    skb_push(skb, sizeof(struct iphdr) + inet_opt->opt.optlen);
    skb_reset_network_header(skb);

    /* 3. Build IP header */
    iph = ip_hdr(skb);
    *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
    iph->tot_len  = htons(skb->len);
    iph->id       = htons(ip_idents_reserve(rt_dst, 1));
    iph->frag_off = htons(IP_DF);          /* Usually Don't Fragment */
    iph->ttl      = ip_select_ttl(inet, &rt->dst);
    iph->protocol = sk->sk_protocol;       /* IPPROTO_TCP = 6 */
    iph->saddr    = fl4->saddr;
    iph->daddr    = fl4->daddr;

    /* 4. Pass to IP output */
    return ip_local_out(net, sk, skb);
}

/**
 * ip_local_out - IP output processing (with netfilter)
 */
int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    /* Traverse netfilter LOCAL_OUT hooks (iptables OUTPUT chain) */
    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, net, sk, skb,
                   NULL, skb_dst(skb)->dev, dst_output);
}

/**
 * ip_output - Final IP processing before sending
 */
int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_device *dev = skb_dst(skb)->dev;

    /* Calculate IP checksum */
    ip_send_check(ip_hdr(skb));

    /* Traverse netfilter POST_ROUTING hooks */
    return nf_hook(NFPROTO_IPV4, NF_INET_POST_ROUTING, net, sk, skb,
                   NULL, dev, ip_finish_output);
}

/**
 * ip_finish_output - Check MTU and fragment if needed
 */
static int ip_finish_output(struct net *net, struct sock *sk,
                            struct sk_buff *skb)
{
    unsigned int mtu = dst_mtu(skb_dst(skb));

    if (skb->len > mtu && !skb_gso(skb)) {
        /* Packet too large - need fragmentation */
        return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
    }

    return ip_finish_output2(net, sk, skb);
}

/**
 * ip_finish_output2 - Resolve L2 address and queue to device
 */
static int ip_finish_output2(struct net *net, struct sock *sk,
                             struct sk_buff *skb)
{
    struct neighbour *neigh;
    struct dst_entry *dst = skb_dst(skb);

    /* Get neighbour (ARP cache entry) */
    neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
    if (!neigh) {
        /* Need to resolve MAC address via ARP */
        return neigh_resolve_output(neigh, skb);
    }

    /* Add L2 (Ethernet) header and queue to device */
    return neigh_output(neigh, skb);
}

/*
 * PHASE 4: Traffic Control and Device Queue
 */

/**
 * dev_queue_xmit - Queue packet for transmission
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq;
    struct Qdisc *q;
    int rc;

    /* Select transmit queue (multi-queue NICs) */
    txq = netdev_core_pick_tx(dev, skb, NULL);

    /* Get queueing discipline */
    q = rcu_dereference_bh(txq->qdisc);

    if (q->enqueue) {
        /* Qdisc active - enqueue and potentially dequeue */
        rc = __dev_xmit_skb(skb, q, dev, txq);
    } else {
        /* No qdisc (rare) - direct transmit */
        rc = dev_hard_start_xmit(skb, dev, txq);
    }

    return rc;
}

/**
 * __dev_xmit_skb - Enqueue to qdisc and trigger transmission
 */
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);

    spin_lock(root_lock);

    /* Enqueue to qdisc */
    rc = q->enqueue(skb, q, &to_free);

    if (qdisc_run_begin(q)) {
        /* Dequeue and transmit */
        __qdisc_run(q);
    }

    spin_unlock(root_lock);
    return rc;
}

/*
 * PHASE 5: Device Driver
 */

/**
 * dev_hard_start_xmit - Call driver transmit function
 */
static int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                               struct netdev_queue *txq)
{
    const struct net_device_ops *ops = dev->netdev_ops;

    /* Call driver's ndo_start_xmit */
    rc = ops->ndo_start_xmit(skb, dev);

    if (rc == NETDEV_TX_OK) {
        /* Success - packet queued to hardware */
        txq->trans_start = jiffies;
    }

    return rc;
}

/**
 * Example: Intel e1000e driver transmit
 */
static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
                                    struct net_device *netdev)
{
    struct e1000_adapter *adapter = netdev_priv(netdev);
    struct e1000_ring *tx_ring = adapter->tx_ring;
    struct e1000_tx_desc *tx_desc;

    /* Set up TX descriptor with packet info */
    tx_desc = E1000_TX_DESC(tx_ring, i);
    tx_desc->buffer_addr = dma_map_single(adapter->dev, skb->data,
                                          skb->len, DMA_TO_DEVICE);
    tx_desc->length = cpu_to_le16(skb->len);
    tx_desc->cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS;

    /* Notify hardware of new descriptor */
    writel(i, adapter->hw.hw_addr + E1000_TDT);

    return NETDEV_TX_OK;
}
```

For large TCP sends, the kernel doesn't create individual MSS-sized segments. Instead, TCP Segmentation Offload (TSO) passes a "super-segment" to the NIC, which divides it into wire-sized frames. This reduces CPU overhead dramatically—one descriptor can represent 64KB of data instead of dozens for 1500-byte MTU.
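As a rough, self-contained illustration of that saving, the following sketch assumes a 1500-byte MTU and 20-byte IP and TCP headers with no options; these numbers are assumptions for a common case, not measurements.

```c
/* Back-of-the-envelope TSO illustration: how many wire frames come out
 * of one 64 KB GSO "super-segment" at MSS = 1500 - 20 - 20 = 1460. */
#include <stdio.h>

int main(void)
{
    int gso_size = 64 * 1024;       /* one TSO super-segment           */
    int mss      = 1500 - 20 - 20;  /* payload per wire frame = 1460   */
    int frames   = (gso_size + mss - 1) / mss;

    /* ~45 wire frames produced from a single descriptor handed to the NIC */
    printf("%d-byte GSO segment -> %d frames on the wire\n",
           gso_size, frames);
    return 0;
}
```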
The receive path is more complex than transmit because it must handle asynchronous packet arrival. When a packet arrives, the NIC generates a hardware interrupt, which triggers a carefully orchestrated processing pipeline involving NAPI, softirqs, and eventually the transport layer.
The five phases of packet reception:
| Phase | Context | Key Functions | Actions |
|---|---|---|---|
| 1. Hardware interrupt | Interrupt context | Driver IRQ handler | Mask IRQ, schedule NAPI |
| 2. NAPI poll | Softirq context | napi_poll, driver poll | Retrieve packets from ring |
| 3. Network core | Softirq context | netif_receive_skb | GRO, protocol dispatch |
| 4. IP layer | Softirq context | ip_rcv, ip_local_deliver | Validate, route, netfilter |
| 5. Transport layer | Softirq/Process | tcp_v4_rcv, tcp_data_queue | Deliver to socket queue |
Key insight: Softirq processing
Most receive processing happens in softirq context (NET_RX_SOFTIRQ). This is a bottom-half context that runs with hardware interrupts enabled, executes on the CPU that took the interrupt, and is bounded by budget and time limits so it cannot monopolize the processor.
The kernel's ksoftirqd threads handle overflow when softirq processing takes too long.
Modern NICs use RSS to distribute incoming packets across multiple RX queues based on flow hash. Each queue gets its own interrupt, typically pinned to different CPUs. This enables parallel receive processing—essential for high packet rates. Without RSS, a single CPU would bottleneck all receive processing.
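The sketch below is a toy model of that idea, not the real algorithm: actual NICs compute a Toeplitz hash over the flow 4-tuple and map it through an indirection table (viewable with `ethtool -x`). The point it illustrates is that every packet of a given flow hashes to the same queue, and therefore the same CPU.

```c
/* Toy model of RSS queue selection: hash the 4-tuple, map to a queue.
 * The hash function here is a stand-in, not the Toeplitz hash real
 * hardware uses. */
#include <stdint.h>
#include <stdio.h>

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x45d9f3b;     /* arbitrary mixing constant */
    h ^= h >> 16;
    return h;
}

int main(void)
{
    unsigned int nr_queues = 8;   /* e.g. one RX queue per CPU */
    uint32_t h = flow_hash(0x0a000001, 0x0a000002, 443, 51515);

    /* All packets of this flow land on the same RX queue/CPU. */
    printf("flow -> RX queue %u\n", h % nr_queues);
    return 0;
}
```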
Let's trace an incoming TCP packet from the moment it arrives at the network interface through to NAPI processing.
```c
/*
 * PHASE 1: Hardware Interrupt
 *
 * Packet arrives at NIC → DMA to ring buffer → IRQ raised
 */

/* Before packet arrival: Driver sets up receive ring */
static int e1000_alloc_rx_buffers(struct e1000_ring *rx_ring)
{
    struct e1000_rx_desc *rx_desc;
    struct e1000_buffer *buffer_info;
    struct sk_buff *skb;

    /* Pre-allocate skb for DMA receive */
    skb = netdev_alloc_skb_ip_align(netdev, rx_ring->buffer_len);

    /* Get DMA address for skb data */
    dma = dma_map_single(adapter->dev, skb->data,
                         rx_ring->buffer_len, DMA_FROM_DEVICE);

    /* Set up descriptor with DMA address */
    rx_desc->buffer_addr = cpu_to_le64(dma);
    rx_desc->status = 0;    /* Not yet filled */

    return 0;
}

/**
 * Hardware interrupt handler
 *
 * Called when NIC signals packet arrival (or TX completion).
 * Must execute quickly - just schedule deferred work.
 */
static irqreturn_t e1000_intr(int irq, void *data)
{
    struct e1000_adapter *adapter = data;
    u32 icr;

    /* Read interrupt cause register */
    icr = E1000_READ_REG(&adapter->hw, E1000_ICR);
    if (!icr)
        return IRQ_NONE;    /* Not our interrupt */

    /* Disable further interrupts (interrupt coalescing) */
    E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0);

    /* Schedule NAPI processing */
    if (napi_schedule_prep(&adapter->napi)) {
        __napi_schedule(&adapter->napi);
    }

    return IRQ_HANDLED;
}

/**
 * napi_schedule - Schedule NAPI processing
 *
 * Adds NAPI to per-CPU list and triggers NET_RX_SOFTIRQ
 */
void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;

    local_irq_save(flags);

    /* Add to per-CPU NAPI poll list */
    list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);

    /* Raise softirq to process later */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

    local_irq_restore(flags);
}

/*
 * PHASE 2: NAPI Poll (Softirq context)
 *
 * NET_RX_SOFTIRQ handler calls net_rx_action
 */

/**
 * net_rx_action - Softirq handler for receive processing
 */
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = this_cpu_ptr(&softnet_data);
    unsigned long time_limit = jiffies +
                               usecs_to_jiffies(netdev_budget_usecs);
    int budget = netdev_budget;

    local_irq_disable();

    /* Process each NAPI instance on this CPU */
    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work;

        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
        list_del_init(&n->poll_list);

        local_irq_enable();

        /* Call driver's NAPI poll function */
        work = n->poll(n, budget);

        if (work < budget) {
            /* Done - remove from polling, re-enable interrupts */
            napi_complete(n);
        } else {
            /* More work - stay on list for continued polling */
            list_add_tail(&n->poll_list, &sd->poll_list);
        }

        budget -= work;

        /* Time/budget limits prevent monopolizing CPU */
        if (budget <= 0 || time_after(jiffies, time_limit))
            break;

        local_irq_disable();
    }

    local_irq_enable();
}

/**
 * Driver NAPI poll function
 *
 * Called from softirq to retrieve packets from hardware.
 */
static int e1000_poll(struct napi_struct *napi, int budget)
{
    struct e1000_adapter *adapter = container_of(napi, ...);
    struct e1000_ring *rx_ring = adapter->rx_ring;
    int work_done = 0;

    /* Process completed receive descriptors */
    while (work_done < budget) {
        struct e1000_rx_desc *rx_desc;
        struct sk_buff *skb;
        int length;

        /* Get next descriptor */
        rx_desc = E1000_RX_DESC(rx_ring, i);

        /* Check if descriptor is complete (DD bit) */
        if (!(rx_desc->status & E1000_RXD_STAT_DD))
            break;    /* No more completed packets */

        /* Read descriptor contents */
        length = le16_to_cpu(rx_desc->length);

        /* Get pre-allocated skb */
        skb = buffer_info->skb;

        /* Unmap DMA buffer */
        dma_unmap_single(adapter->dev, buffer_info->dma,
                         rx_ring->buffer_len, DMA_FROM_DEVICE);

        /* Set up skb metadata */
        skb_put(skb, length);
        skb->protocol = eth_type_trans(skb, netdev);

        /* Check for hardware checksum validation */
        if (rx_desc->status & E1000_RXD_STAT_IPCS &&
            !(rx_desc->errors & E1000_RXD_ERR_IPE))
            skb->ip_summed = CHECKSUM_UNNECESSARY;

        /* Pass to networking stack (with GRO) */
        napi_gro_receive(napi, skb);

        /* Allocate new buffer for this descriptor */
        e1000_alloc_rx_buffer(rx_ring, i);

        work_done++;
    }

    if (work_done < budget) {
        /* All done - complete NAPI and re-enable IRQs */
        napi_complete(napi);
        E1000_WRITE_REG(&adapter->hw, E1000_IMS, adapter->rx_ring_irq);
    }

    return work_done;
}
```

For ultra-low-latency applications, Linux supports busy polling (SO_BUSY_POLL). Instead of waiting for interrupts, the application actively polls for packets. This eliminates interrupt overhead and context switch latency at the cost of CPU cycles. Used in high-frequency trading and real-time systems.
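A minimal sketch of opting a socket into busy polling follows. The 50 µs budget is an arbitrary example value; the related system-wide knobs are the net.core.busy_poll and net.core.busy_read sysctls.

```c
/* Enable per-socket busy polling: reads on this socket may spin on the
 * device queue for up to busy_usecs before sleeping. */
#include <sys/socket.h>

static int enable_busy_poll(int sock_fd)
{
    int busy_usecs = 50;    /* example budget, in microseconds */
    return setsockopt(sock_fd, SOL_SOCKET, SO_BUSY_POLL,
                      &busy_usecs, sizeof(busy_usecs));
}
```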
After NAPI retrieves packets from the NIC, they traverse the protocol stack—through IP and TCP processing—until reaching the destination socket's receive queue.
```c
/*
 * PHASE 3: Network Core Processing
 */

/**
 * napi_gro_receive - Entry point with GRO
 *
 * Generic Receive Offload combines related packets
 * to reduce per-packet overhead.
 */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    /* Try to merge with existing GRO flows */
    gro_result_t ret = dev_gro_receive(napi, skb);

    if (ret == GRO_NORMAL) {
        /* Couldn't merge - process individually */
        return netif_receive_skb_internal(skb);
    }

    /* Merged or held for future merging */
    return ret;
}

/**
 * netif_receive_skb_internal - Core receive processing
 */
static int netif_receive_skb_internal(struct sk_buff *skb)
{
    /* Dispatch to RPS (Receive Packet Steering) if enabled */
    if (static_key_enabled(&rps_needed)) {
        struct rps_dev_flow voidflow, *rflow = &voidflow;
        int cpu = get_rps_cpu(skb->dev, skb, rflow);

        if (cpu >= 0)
            return enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
    }

    return __netif_receive_skb(skb);
}

/**
 * __netif_receive_skb_core - Protocol dispatch
 */
static int __netif_receive_skb_core(struct sk_buff *skb)
{
    struct packet_type *ptype;
    __be16 type = skb->protocol;

    /* Deliver to raw sockets (tcpdump, etc.) */
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        deliver_skb(skb, ptype, orig_dev);
    }

    /* Deliver to protocol handler based on EtherType */
    /* type == 0x0800 (ETH_P_IP) → ip_rcv() */
    list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
        if (ptype->type == type)
            return ptype->func(skb, skb->dev, ptype, orig_dev);
    }

    /* No handler found - drop */
    kfree_skb(skb);
    return NET_RX_DROP;
}

/*
 * PHASE 4: IP Layer Processing
 */

/**
 * ip_rcv - IP receive processing
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    struct iphdr *iph;

    /* Validate IP header */
    iph = ip_hdr(skb);
    if (iph->ihl < 5)                       /* Header too short */
        goto drop;
    if (iph->version != 4)                  /* Not IPv4 */
        goto drop;
    if (skb->len < ntohs(iph->tot_len))     /* Truncated */
        goto drop;

    /* Validate checksum */
    if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
        goto csum_error;

    /* Traverse netfilter PREROUTING chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL, skb,
                   dev, NULL, ip_rcv_finish);

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

/**
 * ip_rcv_finish - Route lookup and dispatch
 */
static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    struct rtable *rt;

    /* Perform routing lookup */
    rt = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                              iph->tos, skb->dev);

    if (rt->rt_type == RTN_LOCAL) {
        /* Packet is for us - deliver locally */
        return ip_local_deliver(skb);
    } else {
        /* Packet needs forwarding */
        return ip_forward(skb);
    }
}

/**
 * ip_local_deliver - Deliver to transport layer
 */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Handle IP fragmentation (reassemble) */
    if (ip_is_fragment(ip_hdr(skb))) {
        skb = ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER);
        if (!skb)
            return 0;    /* Not yet complete */
    }

    /* Traverse netfilter INPUT chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, net, NULL, skb,
                   skb->dev, NULL, ip_local_deliver_finish);
}

/**
 * ip_local_deliver_finish - Dispatch to transport protocol
 */
static int ip_local_deliver_finish(struct net *net, struct sock *sk,
                                   struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    int protocol = iph->protocol;
    const struct net_protocol *ipprot;

    /* Remove IP header (advance data pointer) */
    __skb_pull(skb, ip_hdrlen(skb));
    skb_reset_transport_header(skb);

    /* Find transport protocol handler */
    ipprot = rcu_dereference(inet_protos[protocol]);

    /* Call transport handler */
    /* protocol = 6  → tcp_v4_rcv */
    /* protocol = 17 → udp_rcv */
    return ipprot->handler(skb);
}

/*
 * PHASE 5: Transport Layer (TCP)
 */

/**
 * tcp_v4_rcv - TCP receive entry point
 */
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th;
    struct sock *sk;

    /* Validate TCP header */
    th = tcp_hdr(skb);
    if (th->doff < sizeof(struct tcphdr) / 4)
        goto bad_packet;

    /* Validate TCP checksum */
    if (skb_csum_unnecessary(skb) == 0 && tcp_v4_checksum_error(skb))
        goto csum_error;

    /* Look up socket */
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest, sdif);
    if (!sk)
        goto no_tcp_socket;    /* Send RST */

    /* Process packet based on socket state */
    if (sk->sk_state == TCP_TIME_WAIT)
        return tcp_v4_timewait_process(sk, skb);

    if (sk->sk_state == TCP_NEW_SYN_RECV)
        return tcp_v4_do_rcv(sk, skb);

    /* Queue to socket or process immediately */
    if (!sock_owned_by_user(sk)) {
        /* Socket not locked - process immediately */
        ret = tcp_v4_do_rcv(sk, skb);
    } else {
        /* Socket locked by user - queue to backlog */
        if (unlikely(sk_add_backlog(sk, skb, ...)))
            goto discard_and_relse;
    }

    return 0;
}

/**
 * tcp_v4_do_rcv - TCP packet processing
 */
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    if (sk->sk_state == TCP_ESTABLISHED) {
        /* Fast path for established connections */
        struct tcp_sock *tp = tcp_sk(sk);

        /* tcp_rcv_established - optimized data path */
        return tcp_rcv_established(sk, skb);
    }

    /* Slow path - state machine processing */
    return tcp_rcv_state_process(sk, skb);
}

/**
 * tcp_data_queue - Queue data for application
 */
static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
        /* In-order segment - queue directly */
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;

        /* Wake up waiting reader */
        sk->sk_data_ready(sk);
    } else {
        /* Out-of-order - add to OOO queue */
        tcp_data_queue_ofo(sk, skb);
    }
}

/**
 * sock_def_readable - Wake up waiting processes
 */
static void sock_def_readable(struct sock *sk)
{
    struct socket_wq *wq = rcu_dereference(sk->sk_wq);

    /* Wake up processes waiting in recv() */
    if (wq && waitqueue_active(&wq->wait))
        wake_up_interruptible_sync_poll(&wq->wait,
                                        EPOLLIN | EPOLLRDNORM);

    /* Notify epoll/select waiters */
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
}
```

For TCP, Linux implements "early demux"—the socket lookup happens in ip_rcv_finish before full IP processing. If the socket is found, routing information is taken from the cached socket route instead of performing a full lookup. This accelerates receive processing for established connections.
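On the application side, the counterpart of that wake-up is an ordinary recv() loop: each call dequeues data that tcp_data_queue placed on sk_receive_queue, and an empty queue puts the task to sleep until sock_def_readable wakes it. A minimal sketch, with error handling omitted:

```c
/* Process-context end of the receive path: recv() copies queued skb
 * data to user space; a return of 0 means the peer closed. */
#include <stdio.h>
#include <sys/socket.h>

static void drain_socket(int sock_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = recv(sock_fd, buf, sizeof(buf), 0)) > 0)
        printf("received %zd bytes\n", n);
}
```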
Understanding packet flow conceptually is valuable, but engineers often need to trace actual packets through a running system. Linux provides powerful tools for packet tracing and network debugging.
Key tracing tools:
| Tool | Purpose | Use Case |
|---|---|---|
| tcpdump/wireshark | Packet capture | See actual packet contents |
| ss / netstat | Socket state | View connection status |
| bpftrace | Dynamic tracing | Trace kernel functions, custom analysis |
| perf | Performance analysis | Find hotspots in network stack |
| ftrace | Function tracing | Trace kernel function calls |
| dropwatch | Drop analysis | Find where packets are dropped |
| ethtool | NIC statistics | Hardware-level diagnostics |
```bash
# 1. Basic packet capture
sudo tcpdump -i eth0 -nn host 10.0.0.1 and port 80

# 2. See TCP state of all connections
ss -tnp
# State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   Process
# ESTAB  0       0       10.0.0.1:22         10.0.0.2:54321      users:(("sshd",pid=1234))

# 3. Trace with bpftrace - see every tcp_sendmsg call
sudo bpftrace -e 'kprobe:tcp_sendmsg { printf("tcp_sendmsg: pid=%d comm=%s size=%d\n", pid, comm, arg2); }'

# 4. Trace TCP state changes
sudo bpftrace -e 'tracepoint:tcp:tcp_set_state {
    printf("TCP state change: %d -> %d\n", args->oldstate, args->newstate);
}'

# 5. Measure time spent in tcp_recvmsg (per-call receive latency)
sudo bpftrace -e 'kprobe:tcp_recvmsg { @start[tid] = nsecs; }
kretprobe:tcp_recvmsg /@start[tid]/ {
    printf("tcp_recvmsg latency: %d us\n", (nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# 6. Find where packets are dropped
sudo dropwatch -l kas
# Kernel dropped socket receive queue: 1234
# Kernel net_rx_action: 567

# 7. perf for network stack profiling
sudo perf top -e cycles:k --call-graph dwarf
# Shows which kernel functions consume CPU

# 8. ftrace function graph for tcp_sendmsg
echo tcp_sendmsg > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace
# 3)  18.542 us  |  tcp_sendmsg();
# 3)             |  tcp_sendmsg_locked() {
# 3)   1.234 us  |    tcp_send_mss();
# 3)             |    __tcp_push_pending_frames() {
# ...

# 9. Network statistics
cat /proc/net/snmp
# Tcp: RtoAlgorithm RtoMin RtoMax ... InSegs OutSegs RetransSegs

# 10. Interface statistics
cat /proc/net/dev
# Inter-|   Receive           | Transmit
#  face |bytes packets errs   |bytes packets errs
```

Common debugging scenarios:
| Symptom | Investigation | Tools |
|---|---|---|
| Connection refused | Check if server is listening | ss -tlnp |
| Slow throughput | Check congestion window, RTT | ss -tnpi, /proc/net/tcp |
| Packet drops | Find drop location | dropwatch, ethtool -S |
| High latency | Identify queueing delays | bpftrace, tc -s |
| Retransmissions | Analyze loss patterns | ss -tnpi, tcpdump |
For ultimate control, XDP (eXpress Data Path) programs run at the driver level before normal stack processing. They can drop, redirect, or modify packets at line rate. Combined with eBPF, XDP enables custom network functions (load balancers, firewalls) running at millions of packets per second per core.
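As a taste of that model, here is a minimal XDP sketch (not drawn from this page's kernel excerpts) that drops IPv4 ICMP at the driver and passes everything else; the build and attach commands and the eth0 interface name are illustrative.

```c
/* Minimal XDP program: drop IPv4 ICMP, pass everything else.
 * Build (example):  clang -O2 -g -target bpf -c xdp_drop_icmp.c -o xdp_drop_icmp.o
 * Attach (example): ip link set dev eth0 xdpgeneric obj xdp_drop_icmp.o sec xdp */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)           /* verifier bounds check */
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Decision is made before any skb is allocated or the stack runs. */
    return iph->protocol == IPPROTO_ICMP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```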
Understanding the complete packet flow through Linux networking ties together all the concepts we've covered. This mental model is essential for performance optimization, security analysis, and debugging complex network issues.
Module Complete:
You've now completed the Linux Networking module. You understand the complete Linux networking subsystem—from the socket API through protocol implementation, namespace isolation, and packet flow. This knowledge positions you to debug production network problems, tune the stack for throughput and latency, and reason about container networking and security from first principles.
The Linux networking stack is among the most sophisticated subsystems in any operating system—and now you understand how it works.
You now understand Linux networking at a deep level—from the socket layer through TCP/IP protocol implementation, network namespaces for container isolation, and the complete packet flow through the kernel. This is systems engineering knowledge that separates expert engineers from the rest.