Every time you click a link, send a message, or query a database, packets traverse a carefully orchestrated path through the Linux kernel. Understanding this packet flow—from the moment data leaves your application until it hits the network wire, and from wire reception to application delivery—is essential for performance debugging, security analysis, and network engineering.
This page brings together everything we've covered: the socket layer, protocol stack, network devices, and TCP/IP implementation. We'll trace packets step-by-step through the kernel, annotating each function, data structure, and decision point. You'll see how interrupt handling, NAPI processing, netfilter hooks, and traffic control all integrate into a cohesive packet processing pipeline.
By the end, you'll have a mental model of Linux networking that enables you to reason about latency sources, identify bottlenecks, trace packets with tools like ftrace and bpftrace, and understand what actually happens when network I/O occurs.
By the end of this page, you will understand the complete transmit path (application to wire), the complete receive path (wire to application), interrupt and softirq processing for network I/O, the role of DMA and ring buffers, and how to trace and debug packet flow using kernel tools.
When an application calls write() or send() on a socket, data begins a journey down through multiple kernel subsystems before reaching the network interface. The transmit path can be divided into distinct phases, each with specific responsibilities.
The five phases of packet transmission:
| Phase | Context | Key Functions | Actions |
|---|---|---|---|
| 1. Socket layer | Process context | sys_sendto → sock_sendmsg | Validate, copy data to kernel |
| 2. Transport layer (TCP/UDP) | Process context | tcp_sendmsg, tcp_transmit_skb | Segment, add TCP/UDP header |
| 3. Network layer (IP) | Process context | ip_queue_xmit, ip_output | Route, add IP header, fragment |
| 4. Traffic control | Process context | dev_queue_xmit, qdisc | Shape, schedule, queue to device |
| 5. Device driver | Process/Softirq | ndo_start_xmit | DMA setup, NIC transmit |
Key insight: Process context transmission
Unlike the receive path (which uses softirqs), the transmit path typically runs entirely in the context of the sending process: the CPU cost of TCP, IP, and queueing work is charged to the caller, and data moves toward the device without waiting for deferred processing.
This design provides fairness—processes that send more data pay more CPU cost—and simplicity, as the sending thread naturally pushes its own traffic toward the wire.
Some transmit operations occur from softirq context: TCP timer expirations (retransmits), deferred packet completion handling, and traffic control scheduler triggers. These ensure packets are transmitted even when the originating process isn't running.
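For reference, the user-space side of the path we are about to trace is an ordinary socket call. Here is a minimal sketch; the 10.0.0.1:80 destination and the HTTP request string are placeholders, not part of the kernel code that follows.

```c
/* Minimal user-space sender: the send() below enters the kernel via
 * sys_sendto and starts the five-phase transmit path described above.
 * Destination address and payload are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {
        .sin_family = AF_INET,
        .sin_port   = htons(80),
    };
    inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("connect");
        return 1;
    }

    const char *req = "GET / HTTP/1.0\r\n\r\n";
    /* Phase 1 begins here: sys_sendto -> sock_sendmsg -> tcp_sendmsg */
    if (send(fd, req, strlen(req), 0) < 0)
        perror("send");

    close(fd);
    return 0;
}
```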
Let's trace a TCP send() call in detail, following the code path from system call entry to the transport layer.
The journey begins:
```c
/*
 * PHASE 1: System Call Entry
 *
 * User space: send(sockfd, buffer, len, flags)
 */

/* Syscall entry point */
SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
                unsigned int, flags, struct sockaddr __user *, addr,
                int, addr_len)
{
    return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}

int __sys_sendto(int fd, void __user *buff, size_t len, unsigned flags,
                 struct sockaddr __user *addr, int addr_len)
{
    struct socket *sock;
    struct sockaddr_storage address;
    struct msghdr msg;
    struct iovec iov;
    int err;

    /* 1. Look up socket from file descriptor */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (!sock)
        return err;

    /* 2. Build iovec for data transfer */
    iov.iov_base = buff;
    iov.iov_len = len;
    iov_iter_init(&msg.msg_iter, WRITE, &iov, 1, len);

    /* 3. Copy destination address from user space (if provided) */
    if (addr) {
        err = move_addr_to_kernel(addr, addr_len, &address);
        if (err < 0)
            goto out;
        msg.msg_name = (struct sockaddr *)&address;
        msg.msg_namelen = addr_len;
    }

    /* 4. Set flags */
    msg.msg_flags = flags;

    /* 5. Delegate to socket-specific sendmsg */
    err = sock_sendmsg(sock, &msg);

out:
    fput_light(sock->file, fput_needed);
    return err;
}

/**
 * sock_sendmsg - Generic socket send
 *
 * This calls the protocol-specific sendmsg operation
 * through the socket operations table.
 */
int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
    int err;

    /* Security hook - LSM (SELinux, AppArmor) checks */
    err = security_socket_sendmsg(sock, msg, msg_data_left(msg));
    if (err)
        return err;

    /* Call protocol sendmsg: sock->ops->sendmsg */
    /* For TCP: inet_sendmsg */
    return sock->ops->sendmsg(sock, msg, msg_data_left(msg));
}

/**
 * inet_sendmsg - INET family sendmsg wrapper
 */
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;

    /* Wait for socket to connect if needed */
    if (unlikely(inet_send_prepare(sk)))
        return -EAGAIN;

    /* Call protocol sendmsg: sk->sk_prot->sendmsg */
    /* For TCP: tcp_sendmsg */
    return sk->sk_prot->sendmsg(sk, msg, size);
}

/*
 * PHASE 2: TCP Processing
 */

/**
 * tcp_sendmsg - TCP send processing
 *
 * This is where data is copied from user space into
 * socket buffers and TCP segmentation begins.
 */
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    int flags = msg->msg_flags;
    int copied = 0;
    int err;

    /* Lock the socket */
    lock_sock(sk);

    /* Check socket state */
    err = tcp_sendmsg_locked(sk, msg, size);

    release_sock(sk);
    return err;
}

int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int mss_now;
    int size_goal;
    int copied = 0;

    /* Calculate MSS and size goal */
    mss_now = tcp_send_mss(sk, &size_goal, msg->msg_flags);

    while (msg_data_left(msg)) {
        struct sk_buff *skb = tcp_write_queue_tail(sk);
        int copy;

        /* Allocate new skb if needed */
        if (!skb || (copy = size_goal - skb->len) <= 0) {
            skb = sk_stream_alloc_skb(sk, 0, GFP_KERNEL, first_skb);
            if (!skb)
                goto wait_for_memory;

            /* Add to write queue */
            tcp_add_write_queue_tail(sk, skb);
        }

        /* Copy data from user space */
        copy = min_t(int, copy, msg_data_left(msg));
        err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
        if (err)
            goto do_fault;

        copied += copy;
        tp->write_seq += copy;

        /* If segment is full, schedule transmission */
        if (skb->len >= mss_now)
            __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
    }

    /* Try to transmit pending segments */
    tcp_push(sk, msg->msg_flags, mss_now, ...);
    return copied;
}
```

For file transfers, sendfile() avoids copying data to user space. The kernel reads file pages directly into socket buffers. For the fastest path, MSG_ZEROCOPY uses the network card's scatter-gather DMA to transmit directly from user memory pages.
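To make those two options concrete from user space, here is a hedged sketch. Error handling and zerocopy completion reaping are simplified; SO_ZEROCOPY and MSG_ZEROCOPY require kernel 4.14+ and a reasonably recent libc, and the descriptors passed in are assumed to be valid.

```c
/* Sketch: sending a file with sendfile(), and opting a socket into
 * MSG_ZEROCOPY. Real code must also reap zerocopy completions from the
 * socket error queue (recvmsg with MSG_ERRQUEUE). */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

static void send_file(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    /* The kernel moves file pages into socket buffers directly;
     * the data never visits user space. */
    while (offset < st.st_size)
        if (sendfile(sock_fd, file_fd, &offset, st.st_size - offset) <= 0)
            break;
    close(file_fd);
}

static void send_zerocopy(int sock_fd, const void *buf, size_t len)
{
    int one = 1;
    /* Opt in once per socket... */
    setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    /* ...then request zero-copy per call; the user pages stay pinned
     * until the kernel signals completion on the error queue. */
    send(sock_fd, buf, len, MSG_ZEROCOPY);
}
```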
After TCP creates a segment, it must be wrapped in an IP header, routed to the correct interface, and handed to the device driver. This phase involves routing lookup, netfilter processing, and queueing discipline management.
```c
/**
 * tcp_transmit_skb - Build TCP header and pass to IP
 *
 * This is called when a segment is ready for transmission.
 * Handles the transition from transport to network layer.
 */
static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
                            int clone_it, gfp_t gfp_mask)
{
    struct inet_sock *inet = inet_sk(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcphdr *th;
    int tcp_header_size;

    /* Clone if skb is shared (for retransmissions) */
    if (clone_it) {
        skb = skb_clone(skb, gfp_mask);
        if (!skb)
            return -ENOBUFS;
    }

    /* Calculate header size with options */
    tcp_header_size = tcp_options_size + sizeof(struct tcphdr);

    /* Reserve space for TCP header */
    skb_push(skb, tcp_header_size);
    skb_reset_transport_header(skb);

    /* Build TCP header */
    th = (struct tcphdr *)skb->data;
    th->source  = inet->inet_sport;
    th->dest    = inet->inet_dport;
    th->seq     = htonl(tcb->seq);
    th->ack_seq = htonl(tp->rcv_nxt);
    th->doff    = tcp_header_size >> 2;
    th->res1    = 0;
    tcp_init_flags(th, tcb->tcp_flags);
    th->window  = htons(tcp_select_window(sk));
    th->check   = 0;
    th->urg_ptr = 0;

    /* Add TCP options (MSS, timestamps, SACK, etc.) */
    tcp_options_write((__be32 *)(th + 1), tp, &opts);

    /* Calculate checksum (often offloaded to NIC) */
    if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
        th->check = ~tcp_v4_check(skb->len, inet->inet_saddr,
                                  inet->inet_daddr, 0);
        skb->csum_start = skb_transport_header(skb) - skb->head;
        skb->csum_offset = offsetof(struct tcphdr, check);
    } else {
        th->check = tcp_v4_check(skb->len, inet->inet_saddr,
                                 inet->inet_daddr,
                                 csum_partial(th, tcp_header_size, skb->csum));
    }

    /* Pass to IP layer */
    err = ip_queue_xmit(sk, skb, &inet->cork.fl);
    return err;
}

/*
 * PHASE 3: Network Layer (IP)
 */

/**
 * ip_queue_xmit - Add IP header and route packet
 */
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
{
    struct inet_sock *inet = inet_sk(sk);
    struct ip_options_rcu *inet_opt;
    struct rtable *rt;
    struct iphdr *iph;

    /* 1. Route lookup (usually cached on socket) */
    rt = (struct rtable *)__sk_dst_check(sk, 0);
    if (!rt) {
        /* Cache miss - perform full route lookup */
        rt = ip_route_output_ports(sock_net(sk), fl4, sk, daddr, saddr,
                                   dport, sport, IPPROTO_TCP, ...);
        if (IS_ERR(rt))
            goto no_route;
        sk_setup_caps(sk, &rt->dst);    /* Cache route */
    }
    skb_dst_set_noref(skb, &rt->dst);

    /* 2. Reserve space for IP header */
    skb_push(skb, sizeof(struct iphdr) + inet_opt->opt.optlen);
    skb_reset_network_header(skb);

    /* 3. Build IP header */
    iph = ip_hdr(skb);
    *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
    iph->tot_len  = htons(skb->len);
    iph->id       = htons(ip_idents_reserve(rt_dst, 1));
    iph->frag_off = htons(IP_DF);          /* Usually Don't Fragment */
    iph->ttl      = ip_select_ttl(inet, &rt->dst);
    iph->protocol = sk->sk_protocol;       /* IPPROTO_TCP = 6 */
    iph->saddr    = fl4->saddr;
    iph->daddr    = fl4->daddr;

    /* 4. Pass to IP output */
    return ip_local_out(net, sk, skb);
}

/**
 * ip_local_out - IP output processing (with netfilter)
 */
int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    /* Traverse netfilter LOCAL_OUT hooks (iptables OUTPUT chain) */
    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, net, sk, skb,
                   NULL, skb_dst(skb)->dev, dst_output);
}

/**
 * ip_output - Final IP processing before sending
 */
int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_device *dev = skb_dst(skb)->dev;

    /* Calculate IP checksum */
    ip_send_check(ip_hdr(skb));

    /* Traverse netfilter POST_ROUTING hooks */
    return nf_hook(NFPROTO_IPV4, NF_INET_POST_ROUTING, net, sk, skb,
                   NULL, dev, ip_finish_output);
}

/**
 * ip_finish_output - Check MTU and fragment if needed
 */
static int ip_finish_output(struct net *net, struct sock *sk,
                            struct sk_buff *skb)
{
    unsigned int mtu = dst_mtu(skb_dst(skb));

    if (skb->len > mtu && !skb_gso(skb)) {
        /* Packet too large - need fragmentation */
        return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
    }

    return ip_finish_output2(net, sk, skb);
}

/**
 * ip_finish_output2 - Resolve L2 address and queue to device
 */
static int ip_finish_output2(struct net *net, struct sock *sk,
                             struct sk_buff *skb)
{
    struct neighbour *neigh;
    struct dst_entry *dst = skb_dst(skb);

    /* Get neighbour (ARP cache entry) */
    neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
    if (!neigh) {
        /* Need to resolve MAC address via ARP */
        return neigh_resolve_output(neigh, skb);
    }

    /* Add L2 (Ethernet) header and queue to device */
    return neigh_output(neigh, skb);
}

/*
 * PHASE 4: Traffic Control and Device Queue
 */

/**
 * dev_queue_xmit - Queue packet for transmission
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct netdev_queue *txq;
    struct Qdisc *q;
    int rc;

    /* Select transmit queue (multi-queue NICs) */
    txq = netdev_core_pick_tx(dev, skb, NULL);

    /* Get queueing discipline */
    q = rcu_dereference_bh(txq->qdisc);

    if (q->enqueue) {
        /* Qdisc active - enqueue and potentially dequeue */
        rc = __dev_xmit_skb(skb, q, dev, txq);
    } else {
        /* No qdisc (rare) - direct transmit */
        rc = dev_hard_start_xmit(skb, dev, txq);
    }

    return rc;
}

/**
 * __dev_xmit_skb - Enqueue to qdisc and trigger transmission
 */
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);

    spin_lock(root_lock);

    /* Enqueue to qdisc */
    rc = q->enqueue(skb, q, &to_free);

    if (qdisc_run_begin(q)) {
        /* Dequeue and transmit */
        __qdisc_run(q);
    }

    spin_unlock(root_lock);
    return rc;
}

/*
 * PHASE 5: Device Driver
 */

/**
 * dev_hard_start_xmit - Call driver transmit function
 */
static int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                               struct netdev_queue *txq)
{
    const struct net_device_ops *ops = dev->netdev_ops;

    /* Call driver's ndo_start_xmit */
    rc = ops->ndo_start_xmit(skb, dev);

    if (rc == NETDEV_TX_OK) {
        /* Success - packet queued to hardware */
        txq->trans_start = jiffies;
    }

    return rc;
}

/**
 * Example: Intel e1000e driver transmit
 */
static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
                                    struct net_device *netdev)
{
    struct e1000_adapter *adapter = netdev_priv(netdev);
    struct e1000_ring *tx_ring = adapter->tx_ring;
    struct e1000_tx_desc *tx_desc;

    /* Set up TX descriptor with packet info */
    tx_desc = E1000_TX_DESC(tx_ring, i);
    tx_desc->buffer_addr = dma_map_single(adapter->dev, skb->data,
                                          skb->len, DMA_TO_DEVICE);
    tx_desc->length = cpu_to_le16(skb->len);
    tx_desc->cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS;

    /* Notify hardware of new descriptor */
    writel(i, adapter->hw.hw_addr + E1000_TDT);

    return NETDEV_TX_OK;
}
```

For large TCP sends, the kernel doesn't create individual MSS-sized segments. Instead, TCP Segmentation Offload (TSO) passes a "super-segment" to the NIC, which divides it into wire-sized frames. This reduces CPU overhead dramatically—one descriptor can represent 64KB of data instead of dozens for 1500-byte MTU.
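As a rough, self-contained illustration of that saving, the following sketch assumes a 1500-byte MTU and 20-byte IP and TCP headers with no options; these numbers are assumptions for a common case, not measurements.

```c
/* Back-of-the-envelope TSO illustration: how many wire frames come out
 * of one 64 KB GSO "super-segment" at MSS = 1500 - 20 - 20 = 1460. */
#include <stdio.h>

int main(void)
{
    int gso_size = 64 * 1024;       /* one TSO super-segment           */
    int mss      = 1500 - 20 - 20;  /* payload per wire frame = 1460   */
    int frames   = (gso_size + mss - 1) / mss;

    /* ~45 wire frames produced from a single descriptor handed to the NIC */
    printf("%d-byte GSO segment -> %d frames on the wire\n",
           gso_size, frames);
    return 0;
}
```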
The receive path is more complex than transmit because it must handle asynchronous packet arrival. When a packet arrives, the NIC generates a hardware interrupt, which triggers a carefully orchestrated processing pipeline involving NAPI, softirqs, and eventually the transport layer.
The five phases of packet reception:
| Phase | Context | Key Functions | Actions |
|---|---|---|---|
| 1. Hardware interrupt | Interrupt context | Driver IRQ handler | Mask IRQ, schedule NAPI |
| 2. NAPI poll | Softirq context | napi_poll, driver poll | Retrieve packets from ring |
| 3. Network core | Softirq context | netif_receive_skb | GRO, protocol dispatch |
| 4. IP layer | Softirq context | ip_rcv, ip_local_deliver | Validate, route, netfilter |
| 5. Transport layer | Softirq/Process | tcp_v4_rcv, tcp_data_queue | Deliver to socket queue |
Key insight: Softirq processing
Most receive processing happens in softirq context (NET_RX_SOFTIRQ). This is a bottom-half context that runs with hardware interrupts enabled, executes on the CPU that took the interrupt, and is bounded by budget and time limits so it cannot monopolize the processor.
The kernel's ksoftirqd threads handle overflow when softirq processing takes too long.
Modern NICs use RSS to distribute incoming packets across multiple RX queues based on flow hash. Each queue gets its own interrupt, typically pinned to different CPUs. This enables parallel receive processing—essential for high packet rates. Without RSS, a single CPU would bottleneck all receive processing.
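The sketch below is a toy model of that idea, not the real algorithm: actual NICs compute a Toeplitz hash over the flow 4-tuple and map it through an indirection table (viewable with `ethtool -x`). The point it illustrates is that every packet of a given flow hashes to the same queue, and therefore the same CPU.

```c
/* Toy model of RSS queue selection: hash the 4-tuple, map to a queue.
 * The hash function here is a stand-in, not the Toeplitz hash real
 * hardware uses. */
#include <stdint.h>
#include <stdio.h>

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
                          uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    h *= 0x45d9f3b;     /* arbitrary mixing constant */
    h ^= h >> 16;
    return h;
}

int main(void)
{
    unsigned int nr_queues = 8;   /* e.g. one RX queue per CPU */
    uint32_t h = flow_hash(0x0a000001, 0x0a000002, 443, 51515);

    /* All packets of this flow land on the same RX queue/CPU. */
    printf("flow -> RX queue %u\n", h % nr_queues);
    return 0;
}
```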
Let's trace an incoming TCP packet from the moment it arrives at the network interface through to NAPI processing.
```c
/*
 * PHASE 1: Hardware Interrupt
 *
 * Packet arrives at NIC → DMA to ring buffer → IRQ raised
 */

/* Before packet arrival: Driver sets up receive ring */
static int e1000_alloc_rx_buffers(struct e1000_ring *rx_ring)
{
    struct e1000_rx_desc *rx_desc;
    struct e1000_buffer *buffer_info;
    struct sk_buff *skb;

    /* Pre-allocate skb for DMA receive */
    skb = netdev_alloc_skb_ip_align(netdev, rx_ring->buffer_len);

    /* Get DMA address for skb data */
    dma = dma_map_single(adapter->dev, skb->data,
                         rx_ring->buffer_len, DMA_FROM_DEVICE);

    /* Set up descriptor with DMA address */
    rx_desc->buffer_addr = cpu_to_le64(dma);
    rx_desc->status = 0;    /* Not yet filled */

    return 0;
}

/**
 * Hardware interrupt handler
 *
 * Called when NIC signals packet arrival (or TX completion).
 * Must execute quickly - just schedule deferred work.
 */
static irqreturn_t e1000_intr(int irq, void *data)
{
    struct e1000_adapter *adapter = data;
    u32 icr;

    /* Read interrupt cause register */
    icr = E1000_READ_REG(&adapter->hw, E1000_ICR);
    if (!icr)
        return IRQ_NONE;    /* Not our interrupt */

    /* Disable further interrupts (interrupt coalescing) */
    E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0);

    /* Schedule NAPI processing */
    if (napi_schedule_prep(&adapter->napi)) {
        __napi_schedule(&adapter->napi);
    }

    return IRQ_HANDLED;
}

/**
 * napi_schedule - Schedule NAPI processing
 *
 * Adds NAPI to per-CPU list and triggers NET_RX_SOFTIRQ
 */
void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;

    local_irq_save(flags);

    /* Add to per-CPU NAPI poll list */
    list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);

    /* Raise softirq to process later */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);

    local_irq_restore(flags);
}

/*
 * PHASE 2: NAPI Poll (Softirq context)
 *
 * NET_RX_SOFTIRQ handler calls net_rx_action
 */

/**
 * net_rx_action - Softirq handler for receive processing
 */
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = this_cpu_ptr(&softnet_data);
    unsigned long time_limit = jiffies +
                               usecs_to_jiffies(netdev_budget_usecs);
    int budget = netdev_budget;

    local_irq_disable();

    /* Process each NAPI instance on this CPU */
    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        int work;

        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);
        list_del_init(&n->poll_list);

        local_irq_enable();

        /* Call driver's NAPI poll function */
        work = n->poll(n, budget);

        if (work < budget) {
            /* Done - remove from polling, re-enable interrupts */
            napi_complete(n);
        } else {
            /* More work - stay on list for continued polling */
            list_add_tail(&n->poll_list, &sd->poll_list);
        }

        budget -= work;

        /* Time/budget limits prevent monopolizing CPU */
        if (budget <= 0 || time_after(jiffies, time_limit))
            break;

        local_irq_disable();
    }

    local_irq_enable();
}

/**
 * Driver NAPI poll function
 *
 * Called from softirq to retrieve packets from hardware.
 */
static int e1000_poll(struct napi_struct *napi, int budget)
{
    struct e1000_adapter *adapter = container_of(napi, ...);
    struct e1000_ring *rx_ring = adapter->rx_ring;
    int work_done = 0;

    /* Process completed receive descriptors */
    while (work_done < budget) {
        struct e1000_rx_desc *rx_desc;
        struct sk_buff *skb;
        int length;

        /* Get next descriptor */
        rx_desc = E1000_RX_DESC(rx_ring, i);

        /* Check if descriptor is complete (DD bit) */
        if (!(rx_desc->status & E1000_RXD_STAT_DD))
            break;    /* No more completed packets */

        /* Read descriptor contents */
        length = le16_to_cpu(rx_desc->length);

        /* Get pre-allocated skb */
        skb = buffer_info->skb;

        /* Unmap DMA buffer */
        dma_unmap_single(adapter->dev, buffer_info->dma,
                         rx_ring->buffer_len, DMA_FROM_DEVICE);

        /* Set up skb metadata */
        skb_put(skb, length);
        skb->protocol = eth_type_trans(skb, netdev);

        /* Check for hardware checksum validation */
        if (rx_desc->status & E1000_RXD_STAT_IPCS &&
            !(rx_desc->errors & E1000_RXD_ERR_IPE))
            skb->ip_summed = CHECKSUM_UNNECESSARY;

        /* Pass to networking stack (with GRO) */
        napi_gro_receive(napi, skb);

        /* Allocate new buffer for this descriptor */
        e1000_alloc_rx_buffer(rx_ring, i);

        work_done++;
    }

    if (work_done < budget) {
        /* All done - complete NAPI and re-enable IRQs */
        napi_complete(napi);
        E1000_WRITE_REG(&adapter->hw, E1000_IMS, adapter->rx_ring_irq);
    }

    return work_done;
}
```

For ultra-low-latency applications, Linux supports busy polling (SO_BUSY_POLL). Instead of waiting for interrupts, the application actively polls for packets. This eliminates interrupt overhead and context switch latency at the cost of CPU cycles. Used in high-frequency trading and real-time systems.
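A minimal sketch of opting a socket into busy polling follows. The 50 µs budget is an arbitrary example value; the related system-wide knobs are the net.core.busy_poll and net.core.busy_read sysctls.

```c
/* Enable per-socket busy polling: reads on this socket may spin on the
 * device queue for up to busy_usecs before sleeping. */
#include <sys/socket.h>

static int enable_busy_poll(int sock_fd)
{
    int busy_usecs = 50;    /* example budget, in microseconds */
    return setsockopt(sock_fd, SOL_SOCKET, SO_BUSY_POLL,
                      &busy_usecs, sizeof(busy_usecs));
}
```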
After NAPI retrieves packets from the NIC, they traverse the protocol stack—through IP and TCP processing—until reaching the destination socket's receive queue.
```c
/*
 * PHASE 3: Network Core Processing
 */

/**
 * napi_gro_receive - Entry point with GRO
 *
 * Generic Receive Offload combines related packets
 * to reduce per-packet overhead.
 */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    /* Try to merge with existing GRO flows */
    gro_result_t ret = dev_gro_receive(napi, skb);

    if (ret == GRO_NORMAL) {
        /* Couldn't merge - process individually */
        return netif_receive_skb_internal(skb);
    }

    /* Merged or held for future merging */
    return ret;
}

/**
 * netif_receive_skb_internal - Core receive processing
 */
static int netif_receive_skb_internal(struct sk_buff *skb)
{
    /* Dispatch to RPS (Receive Packet Steering) if enabled */
    if (static_key_enabled(&rps_needed)) {
        struct rps_dev_flow voidflow, *rflow = &voidflow;
        int cpu = get_rps_cpu(skb->dev, skb, rflow);

        if (cpu >= 0)
            return enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
    }

    return __netif_receive_skb(skb);
}

/**
 * __netif_receive_skb_core - Protocol dispatch
 */
static int __netif_receive_skb_core(struct sk_buff *skb)
{
    struct packet_type *ptype;
    __be16 type = skb->protocol;

    /* Deliver to raw sockets (tcpdump, etc.) */
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        deliver_skb(skb, ptype, orig_dev);
    }

    /* Deliver to protocol handler based on EtherType */
    /* type == 0x0800 (ETH_P_IP) → ip_rcv() */
    list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) {
        if (ptype->type == type)
            return ptype->func(skb, skb->dev, ptype, orig_dev);
    }

    /* No handler found - drop */
    kfree_skb(skb);
    return NET_RX_DROP;
}

/*
 * PHASE 4: IP Layer Processing
 */

/**
 * ip_rcv - IP receive processing
 */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    struct iphdr *iph;

    /* Validate IP header */
    iph = ip_hdr(skb);
    if (iph->ihl < 5)                       /* Header too short */
        goto drop;
    if (iph->version != 4)                  /* Not IPv4 */
        goto drop;
    if (skb->len < ntohs(iph->tot_len))     /* Truncated */
        goto drop;

    /* Validate checksum */
    if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
        goto csum_error;

    /* Traverse netfilter PREROUTING chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL, skb,
                   dev, NULL, ip_rcv_finish);

drop:
    kfree_skb(skb);
    return NET_RX_DROP;
}

/**
 * ip_rcv_finish - Route lookup and dispatch
 */
static int ip_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    struct rtable *rt;

    /* Perform routing lookup */
    rt = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                              iph->tos, skb->dev);

    if (rt->rt_type == RTN_LOCAL) {
        /* Packet is for us - deliver locally */
        return ip_local_deliver(skb);
    } else {
        /* Packet needs forwarding */
        return ip_forward(skb);
    }
}

/**
 * ip_local_deliver - Deliver to transport layer
 */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Handle IP fragmentation (reassemble) */
    if (ip_is_fragment(ip_hdr(skb))) {
        skb = ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER);
        if (!skb)
            return 0;    /* Not yet complete */
    }

    /* Traverse netfilter INPUT chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, net, NULL, skb,
                   skb->dev, NULL, ip_local_deliver_finish);
}

/**
 * ip_local_deliver_finish - Dispatch to transport protocol
 */
static int ip_local_deliver_finish(struct net *net, struct sock *sk,
                                   struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    int protocol = iph->protocol;
    const struct net_protocol *ipprot;

    /* Remove IP header (advance data pointer) */
    __skb_pull(skb, ip_hdrlen(skb));
    skb_reset_transport_header(skb);

    /* Find transport protocol handler */
    ipprot = rcu_dereference(inet_protos[protocol]);

    /* Call transport handler */
    /* protocol = 6  → tcp_v4_rcv */
    /* protocol = 17 → udp_rcv */
    return ipprot->handler(skb);
}

/*
 * PHASE 5: Transport Layer (TCP)
 */

/**
 * tcp_v4_rcv - TCP receive entry point
 */
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th;
    struct sock *sk;

    /* Validate TCP header */
    th = tcp_hdr(skb);
    if (th->doff < sizeof(struct tcphdr) / 4)
        goto bad_packet;

    /* Validate TCP checksum */
    if (skb_csum_unnecessary(skb) == 0 && tcp_v4_checksum_error(skb))
        goto csum_error;

    /* Look up socket */
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest, sdif);
    if (!sk)
        goto no_tcp_socket;    /* Send RST */

    /* Process packet based on socket state */
    if (sk->sk_state == TCP_TIME_WAIT)
        return tcp_v4_timewait_process(sk, skb);

    if (sk->sk_state == TCP_NEW_SYN_RECV)
        return tcp_v4_do_rcv(sk, skb);

    /* Queue to socket or process immediately */
    if (!sock_owned_by_user(sk)) {
        /* Socket not locked - process immediately */
        ret = tcp_v4_do_rcv(sk, skb);
    } else {
        /* Socket locked by user - queue to backlog */
        if (unlikely(sk_add_backlog(sk, skb, ...)))
            goto discard_and_relse;
    }

    return 0;
}

/**
 * tcp_v4_do_rcv - TCP packet processing
 */
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    if (sk->sk_state == TCP_ESTABLISHED) {
        /* Fast path for established connections */
        struct tcp_sock *tp = tcp_sk(sk);

        /* tcp_rcv_established - optimized data path */
        return tcp_rcv_established(sk, skb);
    }

    /* Slow path - state machine processing */
    return tcp_rcv_state_process(sk, skb);
}

/**
 * tcp_data_queue - Queue data for application
 */
static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);

    if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
        /* In-order segment - queue directly */
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;

        /* Wake up waiting reader */
        sk->sk_data_ready(sk);
    } else {
        /* Out-of-order - add to OOO queue */
        tcp_data_queue_ofo(sk, skb);
    }
}

/**
 * sock_def_readable - Wake up waiting processes
 */
static void sock_def_readable(struct sock *sk)
{
    struct socket_wq *wq = rcu_dereference(sk->sk_wq);

    /* Wake up processes waiting in recv() */
    if (wq && waitqueue_active(&wq->wait))
        wake_up_interruptible_sync_poll(&wq->wait,
                                        EPOLLIN | EPOLLRDNORM);

    /* Notify epoll/select waiters */
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
}
```

For TCP, Linux implements "early demux"—the socket lookup happens in ip_rcv_finish before full IP processing. If the socket is found, routing information is taken from the cached socket route instead of performing a full lookup. This accelerates receive processing for established connections.
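On the application side, the counterpart of that wake-up is an ordinary recv() loop: each call dequeues data that tcp_data_queue placed on sk_receive_queue, and an empty queue puts the task to sleep until sock_def_readable wakes it. A minimal sketch, with error handling omitted:

```c
/* Process-context end of the receive path: recv() copies queued skb
 * data to user space; a return of 0 means the peer closed. */
#include <stdio.h>
#include <sys/socket.h>

static void drain_socket(int sock_fd)
{
    char buf[4096];
    ssize_t n;

    while ((n = recv(sock_fd, buf, sizeof(buf), 0)) > 0)
        printf("received %zd bytes\n", n);
}
```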
Understanding packet flow conceptually is valuable, but engineers often need to trace actual packets through a running system. Linux provides powerful tools for packet tracing and network debugging.
Key tracing tools:
| Tool | Purpose | Use Case |
|---|---|---|
| tcpdump/wireshark | Packet capture | See actual packet contents |
| ss / netstat | Socket state | View connection status |
| bpftrace | Dynamic tracing | Trace kernel functions, custom analysis |
| perf | Performance analysis | Find hotspots in network stack |
| ftrace | Function tracing | Trace kernel function calls |
| dropwatch | Drop analysis | Find where packets are dropped |
| ethtool | NIC statistics | Hardware-level diagnostics |
```bash
# 1. Basic packet capture
sudo tcpdump -i eth0 -nn host 10.0.0.1 and port 80

# 2. See TCP state of all connections
ss -tnp
# State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   Process
# ESTAB  0       0       10.0.0.1:22         10.0.0.2:54321      users:(("sshd",pid=1234))

# 3. Trace with bpftrace - see every tcp_sendmsg call
sudo bpftrace -e 'kprobe:tcp_sendmsg { printf("tcp_sendmsg: pid=%d comm=%s size=%d\n", pid, comm, arg2); }'

# 4. Trace TCP state changes
sudo bpftrace -e 'tracepoint:tcp:tcp_set_state {
    printf("TCP state change: %d -> %d\n", args->oldstate, args->newstate);
}'

# 5. Measure time spent in tcp_recvmsg (per-call receive latency)
sudo bpftrace -e 'kprobe:tcp_recvmsg { @start[tid] = nsecs; }
kretprobe:tcp_recvmsg /@start[tid]/ {
    printf("tcp_recvmsg latency: %d us\n", (nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

# 6. Find where packets are dropped
sudo dropwatch -l kas
# Kernel dropped socket receive queue: 1234
# Kernel net_rx_action: 567

# 7. perf for network stack profiling
sudo perf top -e cycles:k --call-graph dwarf
# Shows which kernel functions consume CPU

# 8. ftrace function graph for tcp_sendmsg
echo tcp_sendmsg > /sys/kernel/debug/tracing/set_graph_function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace
# 3)  18.542 us  |  tcp_sendmsg();
# 3)             |  tcp_sendmsg_locked() {
# 3)   1.234 us  |    tcp_send_mss();
# 3)             |    __tcp_push_pending_frames() {
# ...

# 9. Network statistics
cat /proc/net/snmp
# Tcp: RtoAlgorithm RtoMin RtoMax ... InSegs OutSegs RetransSegs

# 10. Interface statistics
cat /proc/net/dev
# Inter-|   Receive           | Transmit
#  face |bytes packets errs   |bytes packets errs
```

Common debugging scenarios:
| Symptom | Investigation | Tools |
|---|---|---|
| Connection refused | Check if server is listening | ss -tlnp |
| Slow throughput | Check congestion window, RTT | ss -tnpi, /proc/net/tcp |
| Packet drops | Find drop location | dropwatch, ethtool -S |
| High latency | Identify queueing delays | bpftrace, tc -s |
| Retransmissions | Analyze loss patterns | ss -tnpi, tcpdump |
For ultimate control, XDP (eXpress Data Path) programs run at the driver level before normal stack processing. They can drop, redirect, or modify packets at line rate. Combined with eBPF, XDP enables custom network functions (load balancers, firewalls) running at millions of packets per second per core.
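As a taste of that model, here is a minimal XDP sketch (not drawn from this page's kernel excerpts) that drops IPv4 ICMP at the driver and passes everything else; the build and attach commands and the eth0 interface name are illustrative.

```c
/* Minimal XDP program: drop IPv4 ICMP, pass everything else.
 * Build (example):  clang -O2 -g -target bpf -c xdp_drop_icmp.c -o xdp_drop_icmp.o
 * Attach (example): ip link set dev eth0 xdpgeneric obj xdp_drop_icmp.o sec xdp */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)           /* verifier bounds check */
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = (void *)(eth + 1);
    if ((void *)(iph + 1) > data_end)
        return XDP_PASS;

    /* Decision is made before any skb is allocated or the stack runs. */
    return iph->protocol == IPPROTO_ICMP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```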
Understanding the complete packet flow through Linux networking ties together all the concepts we've covered. This mental model is essential for performance optimization, security analysis, and debugging complex network issues.
Module Complete:
You've now completed the Linux Networking module. You understand the complete Linux networking subsystem—from the socket API through protocol implementation, namespace isolation, and packet flow. This knowledge positions you to debug production network problems, tune the stack for throughput and latency, and reason about container networking and security from first principles.
The Linux networking stack is among the most sophisticated subsystems in any operating system—and now you understand how it works.
You now understand Linux networking at a deep level—from the socket layer through TCP/IP protocol implementation, network namespaces for container isolation, and the complete packet flow through the kernel. This is systems engineering knowledge that separates expert engineers from the rest.