Beneath every network-capable application lies the Linux protocol stack—a meticulously organized set of layers that transform abstract communication requests into electrical signals traversing physical media. When your application calls send() on a TCP socket, a remarkably complex series of transformations occurs: data is copied, headers are prepended, checksums are calculated, fragmentation is performed, routing decisions are made, ARP resolution occurs, and finally packets are handed to a network device driver.
This multi-layered architecture isn't merely organizational tidiness—it's the result of decades of protocol design wisdom, embodied in implementations that handle billions of packets per second across the global internet. The Linux protocol stack implements this architecture with such efficiency that it powers everything from embedded IoT devices to the world's largest hyperscale data centers.
Understanding the protocol stack is fundamental to network performance engineering. Whether you're diagnosing latency issues, optimizing throughput, implementing custom protocols, or simply troubleshooting connectivity problems, knowledge of how packets flow through these layers transforms mysterious network behavior into understandable, debuggable systems.
By the end of this page, you will understand the Linux protocol stack architecture, including the layer hierarchy, the relationship between network layers and kernel code organization, the path packets take through the stack, key data structures at each layer, and how protocol handlers are registered and dispatched. You'll see how the OSI model maps to Linux implementation reality.
The Linux networking stack follows a layered model that corresponds to the OSI reference model, though the implementation combines some layers for efficiency. Each layer has distinct responsibilities and communicates with adjacent layers through well-defined interfaces.
The conceptual layers:
From top to bottom, the Linux networking stack consists of:

- The socket layer, exposing the BSD sockets API to applications
- The transport layer (TCP, UDP, SCTP, DCCP)
- The network layer (IPv4, IPv6, ICMP), including routing
- The link layer (Ethernet, WiFi, PPP), including queueing
- The device driver layer, which programs the hardware
Each layer adds (on transmit) or removes (on receive) its own headers, forming the classic networking "hourglass" where IP provides the narrow waist that all higher and lower protocols must pass through.
| Layer | OSI Layers | Key Structures | Primary Functions | Example Protocols |
|---|---|---|---|---|
| Socket | Session/Presentation | struct socket, struct sock | API abstraction, multiplexing | BSD sockets API |
| Transport | Transport (L4) | struct tcp_sock, struct udp_sock | End-to-end delivery, reliability | TCP, UDP, SCTP, DCCP |
| Network | Network (L3) | struct iphdr, struct rtable | Routing, fragmentation, addressing | IPv4, IPv6, ICMP |
| Link | Data Link (L2) | struct net_device, struct ethhdr | Framing, MAC addressing, queuing | Ethernet, WiFi, PPP |
| Device Driver | Physical (L1) | struct sk_buff, device registers | Hardware I/O, DMA, interrupts | Hardware-specific |
The sk_buff: The Universal Packet Container
Across all layers, packets are represented by struct sk_buff (socket buffer). This structure is perhaps the most important in Linux networking—it contains:

- Linked-list pointers for queue management
- References to the owning socket, the arrival timestamp, and the associated network device
- Pointers to the transport, network, and link-layer headers
- The data pointers (head, data, tail, end) that delimit the packet within its buffer
- Length, protocol, packet-type, checksum-status, and QoS priority metadata
The sk_buff is designed for efficient header manipulation: instead of copying data when adding/removing headers, pointers are adjusted. This "zero-copy" design is critical for high-speed networking.
```c
/**
 * struct sk_buff - Socket buffer
 *
 * This is THE central data structure in Linux networking.
 * Every packet flowing through the stack is wrapped in an sk_buff.
 */
struct sk_buff {
    /* Linked list pointers for queue management */
    struct sk_buff      *next;
    struct sk_buff      *prev;

    /* Associated socket (if any) */
    struct sock         *sk;

    /* Packet arrival timestamp */
    ktime_t             tstamp;

    /* Network device packet arrived on / will be sent out */
    struct net_device   *dev;

    /*
     * Pointers to protocol headers
     * These enable zero-copy layer traversal
     */
    union {
        struct tcphdr   *th;        /* TCP header */
        struct udphdr   *uh;        /* UDP header */
        struct icmphdr  *icmph;     /* ICMP header */
        unsigned char   *raw;       /* Raw pointer */
    } h;                            /* Transport layer header */

    union {
        struct iphdr    *iph;       /* IPv4 header */
        struct ipv6hdr  *ipv6h;     /* IPv6 header */
        struct arphdr   *arph;      /* ARP header */
        unsigned char   *raw;
    } nh;                           /* Network layer header */

    union {
        struct ethhdr   *ethernet;  /* Ethernet header */
        unsigned char   *raw;
    } mac;                          /* Link layer header */

    /* Actual data pointers */
    unsigned char       *head;      /* Start of allocated buffer */
    unsigned char       *data;      /* Start of actual data */
    unsigned char       *tail;      /* End of actual data */
    unsigned char       *end;       /* End of allocated buffer */

    /* Length fields */
    unsigned int        len;        /* Bytes of data */
    unsigned int        data_len;   /* Bytes in fragments */
    __u16               mac_len;    /* Length of link header */

    /* Protocol identification */
    __be16              protocol;   /* Packet protocol (ETH_P_IP, etc.) */

    /* Packet type (for us, broadcast, multicast, etc.) */
    __u8                pkt_type;

    /* Checksum status */
    __u8                ip_summed;

    /* Priority for QoS */
    __u32               priority;

    /* ... many more fields ... */
};
```

When adding headers, skb_push() moves the data pointer backward. When removing headers, skb_pull() moves it forward. The underlying buffer stays in place—only pointers change. This design is essential for performance: at 100 Gbps, copying headers would consume more CPU than modern systems can provide.
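To make the pointer arithmetic concrete, here is a minimal userspace sketch (not kernel code) that mimics the head/data/tail/end model and the skb_push()/skb_pull() operations. The toy_* names are illustrative assumptions, not kernel APIs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model of the sk_buff data area: one allocation, four pointers */
struct toy_skb {
    unsigned char *head;    /* start of allocation */
    unsigned char *data;    /* start of live data */
    unsigned char *tail;    /* end of live data */
    unsigned char *end;     /* end of allocation */
};

/* Mimic skb_reserve(): leave headroom for headers pushed later */
static void toy_reserve(struct toy_skb *skb, size_t len)
{
    skb->data += len;
    skb->tail += len;
}

/* Mimic skb_push(): grow data toward head (prepend a header) */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert(skb->data - len >= skb->head);   /* must have headroom */
    skb->data -= len;
    return skb->data;
}

/* Mimic skb_pull(): shrink data from the front (strip a header) */
static unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert(skb->data + len <= skb->tail);
    skb->data += len;
    return skb->data;
}

int main(void)
{
    unsigned char buf[256];
    struct toy_skb skb = { buf, buf, buf, buf + sizeof(buf) };

    toy_reserve(&skb, 64);          /* headroom for Ethernet + IP + TCP */

    /* "Payload" written at data; tail advances past it */
    memcpy(skb.data, "hello", 5);
    skb.tail = skb.data + 5;

    /* Transmit direction: prepend headers without copying the payload */
    memcpy(toy_push(&skb, 20), "TCP-header-20-bytes.", 20);
    memcpy(toy_push(&skb, 20), "IP-header--20-bytes.", 20);

    /* Receive direction: strip them again; the payload never moved */
    toy_pull(&skb, 20);
    toy_pull(&skb, 20);
    printf("payload: %.5s (offset %td into buffer)\n",
           skb.data, skb.data - skb.head);
    return 0;
}
```

In the kernel proper, skb_reserve() at allocation time creates the headroom that later skb_push() calls consume, which is why drivers reserve space for the full header stack up front.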
Understanding how packets flow from application to wire is essential for performance optimization and debugging. The transmit path involves multiple subsystems, each adding its contribution before the packet reaches the network device.
The complete transmit path (TCP example):
When an application calls send() on a TCP socket, the following sequence occurs:
1. tcp_sendmsg() copies data from user space into segments queued on the socket's sk_write_queue
2. tcp_transmit_skb() clones segments and passes them down
3. ip_queue_xmit() performs routing lookup, adds IP header, handles fragmentation if needed
4. dev_queue_xmit() passes the packet through the device's queueing discipline
5. ndo_start_xmit() programs the NIC for DMA transmission
```c
/*
 * Simplified view of the TCP transmit path
 * Each function call represents a layer transition
 */

/* User space */
send(sockfd, buffer, len, flags);

/* System call entry */
SYSCALL_DEFINE4(sendto, ...)
    → sock_sendmsg()
        → inet_sendmsg()
            → tcp_sendmsg()

/**
 * tcp_sendmsg - Copy data from user space and segment
 */
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct sk_buff *skb;
    int copy;

    /* Copy user data to socket buffer */
    while (size > 0) {
        /* Get or allocate segment */
        skb = tcp_write_queue_tail(sk);
        if (!skb || skb->len >= mss_now)
            skb = sk_stream_alloc_skb(...);

        /* Copy data from user space */
        copy = min_t(size_t, size, mss_now - skb->len);
        skb_copy_to_page_nocache(...);
        size -= copy;
    }

    /* Try to transmit queued segments */
    tcp_push(sk, flags, mss_now, ...);
    return copied;
}

/**
 * tcp_transmit_skb - Build TCP header and pass to IP
 */
int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, ...)
{
    struct tcphdr *th;

    /* Reserve space for TCP header */
    skb_push(skb, tcp_header_size);
    skb_reset_transport_header(skb);

    /* Build TCP header */
    th = (struct tcphdr *)skb->data;
    th->source  = inet->inet_sport;
    th->dest    = inet->inet_dport;
    th->seq     = htonl(tcb->seq);
    th->ack_seq = htonl(tp->rcv_nxt);
    th->doff    = tcp_header_size >> 2;
    th->window  = htons(tcp_select_window(sk));

    /* Calculate checksum (may be offloaded to NIC) */
    tcp_v4_send_check(sk, skb);

    /* Pass to IP layer */
    return ip_queue_xmit(sk, skb, fl4);
}

/**
 * ip_queue_xmit - Add IP header and route packet
 */
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb)
{
    struct rtable *rt;
    struct iphdr *iph;

    /* Get cached route or perform lookup */
    rt = ip_route_output_ports(...);

    /* Reserve space and add IP header */
    skb_push(skb, sizeof(struct iphdr));
    skb_reset_network_header(skb);

    /* Build IP header */
    iph = ip_hdr(skb);
    iph->version  = 4;
    iph->ihl      = 5;
    iph->tos      = inet->tos;
    iph->tot_len  = htons(skb->len);
    iph->id       = htons(ip_idents_reserve(...));
    iph->frag_off = htons(IP_DF);
    iph->ttl      = ip_select_ttl(sk, &rt->dst);
    iph->protocol = sk->sk_protocol;
    iph->saddr    = fl4->saddr;
    iph->daddr    = fl4->daddr;

    /* Continue down the stack */
    return ip_local_out(skb);
}

/**
 * ip_local_out - Handle netfilter and send to device
 */
int ip_local_out(struct sk_buff *skb)
{
    /* Traverse Netfilter OUTPUT chain */
    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, ... dst_output);
}

/**
 * dst_output → dev_queue_xmit - Queue to device
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct Qdisc *q;

    /* Get device's queueing discipline */
    q = rcu_dereference(dev->qdisc);

    /* Enqueue packet */
    if (q->enqueue)
        return __dev_xmit_skb(skb, q, dev, ...);

    /* Direct transmit for lockless/simple qdiscs */
    return dev_hard_start_xmit(skb, dev, ...);
}

/**
 * dev_hard_start_xmit - Call driver transmit
 */
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, ...)
{
    const struct net_device_ops *ops = dev->netdev_ops;

    /* Call driver's transmit function */
    return ops->ndo_start_xmit(skb, dev);
}
```

The entire transmit path typically runs in process context (the context of the calling application), which means it competes for CPU time with the application. For latency-sensitive workloads, this is actually beneficial—the sending thread directly pushes packets toward the wire. For throughput-oriented workloads, various optimizations (GSO, TSO, BQL) batch work to amortize overhead.
The receive path is architecturally more complex than transmit because it must handle asynchronous packet arrival from hardware. The kernel uses multiple mechanisms—hardware interrupts, software interrupts (softirq), and NAPI—to efficiently process incoming packets without overwhelming the CPU.
The complete receive path:
When a packet arrives at the network interface:
1. The NIC DMAs the frame into a receive ring and raises a hardware interrupt; the driver schedules NAPI polling
2. The NAPI poll function wraps the frame in an sk_buff and passes it up via napi_gro_receive()
3. netif_receive_skb() determines protocol, calls appropriate handler
4. ip_rcv() validates header, performs routing lookup, traverses netfilter INPUT chain
5. tcp_v4_rcv() or udp_rcv() finds associated socket, delivers to socket queue
6. Data is queued on sk_receive_queue and any sleeping reader is awakened
7. recv() copies data from kernel to user buffer
```c
/*
 * Packet reception flow from NIC to application
 * This is invoked asynchronously from hardware interrupts
 */

/* 1. Hardware interrupt handler (example: Intel e1000e driver) */
static irqreturn_t e1000_intr(int irq, void *dev_id)
{
    struct e1000_adapter *adapter = dev_id;

    /* Acknowledge and disable interrupts */
    E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0);

    /* Schedule NAPI processing */
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

/* 2. NAPI poll function (called from softirq) */
static int e1000_poll(struct napi_struct *napi, int budget)
{
    struct e1000_adapter *adapter = container_of(napi, ...);
    int work_done = 0;

    while (work_done < budget) {
        struct e1000_rx_desc *rx_desc;
        struct sk_buff *skb;

        /* Get next completed descriptor */
        rx_desc = E1000_RX_DESC(ring, i);
        if (!(rx_desc->status & E1000_RXD_STAT_DD))
            break;  /* No more completed packets */

        /* Allocate sk_buff and copy/map data */
        skb = e1000_alloc_rx_skb(adapter, rx_desc);

        /* Set protocol and device */
        skb->protocol = eth_type_trans(skb, netdev);

        /* Pass up the stack (with GRO) */
        napi_gro_receive(napi, skb);
        work_done++;
    }

    /* If we processed less than budget, we're done */
    if (work_done < budget) {
        napi_complete(napi);
        /* Re-enable interrupts */
        E1000_WRITE_REG(&adapter->hw, E1000_IMS, ...);
    }

    return work_done;
}

/* 3. Generic receive processing */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    /* Try to coalesce with existing GRO flows */
    gro_result_t ret = dev_gro_receive(napi, skb);

    if (ret == GRO_NORMAL)
        return netif_receive_skb(skb);
    return ret;
}

/* 4. Protocol dispatch */
int netif_receive_skb(struct sk_buff *skb)
{
    struct packet_type *ptype;
    __be16 type = skb->protocol;

    /* Deliver to protocol handlers registered for this type */
    list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 0xf], list) {
        if (ptype->type == type) {
            /* Call protocol handler: ip_rcv, arp_rcv, etc. */
            ptype->func(skb, skb->dev, ptype, ...);
        }
    }
    return NET_RX_SUCCESS;
}

/* 5. IP layer receive */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    struct iphdr *iph;

    /* Validate IP header */
    iph = ip_hdr(skb);
    if (iph->ihl < 5 || iph->version != 4)
        goto drop;
    if (ip_fast_csum((u8 *)iph, iph->ihl))
        goto drop;  /* Checksum failed */

    /* Traverse netfilter PREROUTING chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, ip_rcv_finish);
}

/* 6. IP routing and local delivery */
static int ip_rcv_finish(struct sk_buff *skb)
{
    struct rtable *rt;

    /* Perform routing lookup */
    rt = ip_route_input(skb, ...);

    /* If packet is for local delivery */
    if (rt->rt_type == RTN_LOCAL)
        return ip_local_deliver(skb);

    /* If packet needs forwarding */
    return ip_forward(skb);
}

/* 7. Transport layer dispatch */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Traverse netfilter INPUT chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    int protocol = iph->protocol;
    const struct net_protocol *ipprot;

    /* Find transport protocol handler */
    ipprot = rcu_dereference(inet_protos[protocol]);

    /* Call protocol handler: tcp_v4_rcv, udp_rcv, etc. */
    return ipprot->handler(skb);
}

/* 8. TCP receive processing */
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th = tcp_hdr(skb);
    struct sock *sk;

    /* Look up socket by 4-tuple */
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest, ...);
    if (!sk)
        goto no_tcp_socket;

    /* Deliver to socket */
    if (sk->sk_state == TCP_LISTEN)
        return tcp_v4_do_rcv(sk, skb);

    /* Queue to socket's receive queue */
    if (!sock_owned_by_user(sk)) {
        if (!tcp_prequeue(sk, skb))
            return tcp_v4_do_rcv(sk, skb);
    } else {
        /* Socket locked by user, use backlog */
        sk_add_backlog(sk, skb, ...);
    }
    return 0;
}
```

NAPI (New API) is crucial for high-speed networking. Without NAPI, each packet would generate a hardware interrupt, overwhelming the CPU at 10+ Gbps. NAPI allows the kernel to process packets in batches via polling, with interrupts only triggering the start of processing. This interrupt mitigation is what enables Linux to handle millions of packets per second.
The Linux networking stack is highly modular—protocols can be loaded as kernel modules at runtime. This flexibility is achieved through a registration system where protocols register handlers with the kernel, which then dispatches packets appropriately.
Three levels of protocol registration:
| Level | Registration Array | Key Field | Examples | Purpose |
|---|---|---|---|---|
| Address Family | net_families[] | family (AF_*) | AF_INET, AF_UNIX | Socket creation handling |
| L3 Protocol | ptype_base[] | EtherType | ETH_P_IP, ETH_P_ARP | Frame dispatch from NIC |
| L4 Protocol | inet_protos[] | IP protocol | IPPROTO_TCP, IPPROTO_UDP | IP packet dispatch |
```c
/**
 * L3 protocol registration - frame-level dispatch by EtherType
 *
 * struct packet_type defines handlers for specific EtherTypes
 * (e.g., 0x0800 for IPv4, 0x0806 for ARP)
 */
struct packet_type {
    __be16              type;   /* EtherType in network order */
    struct net_device   *dev;   /* NULL = all devices */
    int                 (*func)(struct sk_buff *, ...); /* Handler */
    void                (*id_match)(struct packet_type *, ...);
    struct list_head    list;
};

/* IPv4 packet type registration */
static struct packet_type ip_packet_type = {
    .type = cpu_to_be16(ETH_P_IP),  /* 0x0800 */
    .func = ip_rcv,                 /* Handler function */
};

void __init ip_init(void)
{
    /* Register to receive all IPv4 frames */
    dev_add_pack(&ip_packet_type);
    /* ... other initialization ... */
}

/**
 * L4 protocol registration - dispatch by IP protocol number
 *
 * struct net_protocol defines handlers for IP protocol numbers
 * (e.g., 6 for TCP, 17 for UDP, 1 for ICMP)
 */
struct net_protocol {
    int     (*handler)(struct sk_buff *skb);
    void    (*err_handler)(struct sk_buff *skb, u32 info);
    unsigned int    no_policy:1,
                    netns_ok:1;
};

/* TCP protocol registration */
static const struct net_protocol tcp_protocol = {
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

void __init tcp_init(void)
{
    /* Register TCP as IP protocol 6 */
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        panic("Failed to register TCP protocol");
    /* ... other initialization ... */
}

int inet_add_protocol(const struct net_protocol *prot, unsigned char num)
{
    /* Atomic registration in inet_protos array */
    return !cmpxchg((const struct net_protocol **)&inet_protos[num],
                    NULL, prot) ? 0 : -1;
}

/**
 * Address family registration - socket creation handling
 *
 * struct net_proto_family defines how sockets are created
 * for each address family
 */
static const struct net_proto_family inet_family_ops = {
    .family = AF_INET,
    .create = inet_create,
    .owner  = THIS_MODULE,
};

static int __init inet_init(void)
{
    struct inet_protosw *q;

    /* Register AF_INET socket family */
    (void)sock_register(&inet_family_ops);

    /* Register individual socket types */
    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; q++)
        inet_register_protosw(q);

    return 0;
}

/**
 * Protocol switch array for socket types within a family
 */
struct inet_protosw {
    struct list_head        list;
    unsigned short          type;       /* SOCK_STREAM, SOCK_DGRAM */
    unsigned short          protocol;   /* IPPROTO_TCP, IPPROTO_UDP */
    struct proto            *prot;      /* Protocol handler (tcp_prot, udp_prot) */
    const struct proto_ops  *ops;       /* Socket operations */
    unsigned char           flags;      /* INET_PROTOSW_PERMANENT, etc. */
};

/* TCP socket type registration */
static struct inet_protosw inetsw_array[] = {
    {
        .type     = SOCK_STREAM,
        .protocol = IPPROTO_TCP,
        .prot     = &tcp_prot,
        .ops      = &inet_stream_ops,
        .flags    = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
    },
    {
        .type     = SOCK_DGRAM,
        .protocol = IPPROTO_UDP,
        .prot     = &udp_prot,
        .ops      = &inet_dgram_ops,
        .flags    = INET_PROTOSW_PERMANENT,
    },
    /* ... SOCK_RAW, etc. ... */
};
```

How protocol dispatch works:
When a frame arrives at the NIC:
1. eth_type_trans() sets skb->protocol to the EtherType
2. netif_receive_skb() hashes the protocol and looks up ptype_base[hash]
3. packet_type handlers matching the EtherType are called
4. For IPv4, ip_rcv() is called, which extracts the IP protocol number
5. ip_local_deliver_finish() looks up inet_protos[protocol_num]
6. For TCP, tcp_v4_rcv() is called

This layered lookup enables each layer to handle its own demultiplexing independently.
Protocols can be compiled as kernel modules. When you first use a specific protocol (e.g., SCTP), the kernel may automatically load the module via kmod. The protocol registers its handlers during module initialization and unregisters during cleanup, enabling dynamic protocol support without kernel recompilation.
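As a hedged illustration of this registration pattern, here is a minimal sketch of an out-of-tree module that registers a packet_type handler for an experimental EtherType. The module name, handler, and EtherType choice are hypothetical (0x88b5 is an IEEE-reserved value for local experiments), though dev_add_pack()/dev_remove_pack() are the real registration calls shown above:

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical EtherType for demonstration only */
#define ETH_P_DEMO 0x88b5

static int demo_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt, struct net_device *orig_dev)
{
    pr_info("demo: %u byte frame on %s\n", skb->len, dev->name);
    kfree_skb(skb);     /* we consume the packet */
    return NET_RX_SUCCESS;
}

static struct packet_type demo_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_DEMO),
    .func = demo_rcv,   /* called from netif_receive_skb() dispatch */
};

static int __init demo_init(void)
{
    dev_add_pack(&demo_packet_type);    /* register in ptype_base */
    return 0;
}

static void __exit demo_exit(void)
{
    dev_remove_pack(&demo_packet_type); /* unregister on cleanup */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

Once loaded, every frame carrying that EtherType is dispatched to demo_rcv() by exactly the ptype_base lookup described in the list above; unloading the module removes the handler with no kernel recompilation.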
The network device abstraction (struct net_device) is the interface between protocol-independent networking code and hardware-specific drivers. Every network interface—Ethernet, WiFi, loopback, virtual devices—is represented by a net_device structure.
Key responsibilities of net_device:

- Device identification and configuration (name, ifindex, MTU, flags)
- Hardware (MAC) addressing
- Exposing driver callbacks through net_device_ops
- Advertising hardware offload capabilities via feature flags
- Managing transmit/receive queues and the attached qdisc
- Maintaining statistics and device state, including network namespace membership
```c
/**
 * struct net_device - Network device abstraction
 *
 * This is one of the largest structures in the kernel,
 * representing everything about a network interface.
 */
struct net_device {
    /* Device identification */
    char            name[IFNAMSIZ];     /* eth0, lo, etc. */
    int             ifindex;            /* Unique interface index */
    unsigned int    flags;              /* IFF_UP, IFF_LOOPBACK, etc. */
    unsigned int    mtu;                /* Maximum transmission unit */

    /* Hardware address */
    unsigned char   dev_addr[MAX_ADDR_LEN];
    unsigned char   addr_len;           /* Address length */
    unsigned short  type;               /* ARPHRD_ETHER, etc. */

    /* Device operations */
    const struct net_device_ops *netdev_ops;   /* Driver callbacks */
    const struct ethtool_ops    *ethtool_ops;  /* ethtool interface */

    /* Feature flags (hardware capabilities) */
    netdev_features_t   features;           /* Active features */
    netdev_features_t   hw_features;        /* Hardware capabilities */
    netdev_features_t   wanted_features;    /* User-requested features */

    /* Transmit queues */
    struct netdev_queue *_tx;               /* Transmit queue array */
    unsigned int        num_tx_queues;      /* Number of TX queues */
    unsigned int        real_num_tx_queues; /* Currently active TX queues */

    /* Queueing discipline */
    struct Qdisc        *qdisc;             /* Root qdisc */
    struct Qdisc        *qdisc_sleeping;    /* Qdisc when device dormant */

    /* Receive queues (for multi-queue NICs) */
    struct netdev_rx_queue *_rx;
    unsigned int        num_rx_queues;
    unsigned int        real_num_rx_queues;

    /* NAPI instances for receive processing */
    struct list_head    napi_list;

    /* Statistics */
    struct net_device_stats stats;          /* Basic statistics */
    atomic_long_t       rx_dropped;         /* Packets dropped on RX */
    atomic_long_t       tx_dropped;         /* Packets dropped on TX */

    /* State and links */
    unsigned int        state;              /* Device state bits */
    struct net          *nd_net;            /* Network namespace */

    /* ... many more fields ... */
};

/**
 * struct net_device_ops - Driver operation callbacks
 *
 * Drivers implement these functions to provide hardware-specific behavior
 */
struct net_device_ops {
    /* Device lifecycle */
    int     (*ndo_init)(struct net_device *dev);
    void    (*ndo_uninit)(struct net_device *dev);
    int     (*ndo_open)(struct net_device *dev);
    int     (*ndo_stop)(struct net_device *dev);

    /* Packet transmission - THE critical function */
    netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
                                  struct net_device *dev);

    /* Configuration */
    int     (*ndo_set_mac_address)(struct net_device *dev, void *addr);
    int     (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
    void    (*ndo_set_rx_mode)(struct net_device *dev); /* Multicast/promisc */

    /* Feature control */
    int     (*ndo_set_features)(struct net_device *dev,
                                netdev_features_t features);
    netdev_features_t (*ndo_fix_features)(struct net_device *dev,
                                          netdev_features_t features);

    /* Statistics */
    void    (*ndo_get_stats64)(struct net_device *dev,
                               struct rtnl_link_stats64 *stats);

    /* Queue management (for multi-queue devices) */
    u16     (*ndo_select_queue)(struct net_device *dev,
                                struct sk_buff *skb,
                                struct net_device *sb_dev);

    /* ... many more operations ... */
};

/**
 * Hardware feature flags
 * These indicate what the NIC can offload from the CPU
 */
#define NETIF_F_SG          /* Scatter/gather I/O */
#define NETIF_F_CSUM_MASK   /* TX checksum offload */
#define NETIF_F_RXCSUM      /* RX checksum verification */
#define NETIF_F_TSO         /* TCP segmentation offload */
#define NETIF_F_TSO6        /* TSO for IPv6 */
#define NETIF_F_GSO         /* Generic segmentation offload */
#define NETIF_F_GRO         /* Generic receive offload */
#define NETIF_F_LRO         /* Large receive offload */
#define NETIF_F_HIGHDMA     /* Can DMA to high memory */
#define NETIF_F_HW_VLAN_*   /* Hardware VLAN handling */

/* Example driver: check if hardware can offload checksums */
if (skb->ip_summed == CHECKSUM_PARTIAL) {
    if (dev->features & NETIF_F_HW_CSUM) {
        /* Hardware will compute checksum */
        /* Set up descriptor for hardware offload */
    } else {
        /* Must compute checksum in software */
        skb_checksum_help(skb);
    }
}
```

Modern NICs have multiple hardware transmit and receive queues. This enables parallel packet processing across CPU cores without lock contention. The kernel's XPS (Transmit Packet Steering) and RSS (Receive Side Scaling) distribute packets across queues based on flow hashes, enabling near-linear scaling with core count.
Hardware offloads:
Modern NICs can offload significant work from the CPU:
| Offload | Description | Performance Impact |
|---|---|---|
| TSO (TCP Segmentation Offload) | NIC splits large TCP segments | 60-90% CPU reduction for bulk TX |
| GSO (Generic Segmentation Offload) | Software TSO fallback | Batches work, reduces per-packet overhead |
| GRO (Generic Receive Offload) | Combines received packets | Reduces per-packet RX processing |
| Checksum Offload | NIC computes/verifies checksums | Saves CPU cycles per packet |
| RSS | NIC distributes RX across queues | Enables multi-core RX scaling |
Understanding which offloads your NIC supports (via ethtool -k) is crucial for network performance tuning.
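The same feature information that ethtool -k prints is exposed programmatically through the SIOCETHTOOL ioctl. Below is a hedged userspace sketch that queries one legacy feature flag (RX checksum offload) for an interface; the default interface name is an assumption, and modern kernels expose richer feature data via ETHTOOL_GFEATURES or the ethtool netlink API:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    const char *ifname = argc > 1 ? argv[1] : "eth0"; /* assumed name */
    struct ethtool_value eval = { .cmd = ETHTOOL_GRXCSUM };
    struct ifreq ifr;
    int fd;

    /* Any socket works as a handle for interface ioctls */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&eval;   /* kernel reads .cmd, fills .data */

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("%s: RX checksum offload is %s\n",
               ifname, eval.data ? "on" : "off");
    else
        perror("SIOCETHTOOL");

    close(fd);
    return 0;
}
```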
Between the protocol stack and the device driver lies the Traffic Control subsystem—a powerful framework for packet scheduling, shaping, and manipulation. Every packet passing through a network interface traverses a queueing discipline (qdisc) that determines when (and if) it's transmitted.
The tc architecture:
Traffic control is organized around three concepts:

- Qdiscs (queueing disciplines): algorithms that decide when, and in what order, queued packets are dequeued for transmission
- Classes: subdivisions within classful qdiscs that allow hierarchical bandwidth allocation
- Filters: rules that classify packets into classes
| Qdisc | Type | Purpose | Use Case |
|---|---|---|---|
| pfifo_fast | Classless | Priority FIFO (default) | General purpose, low overhead |
| fq (Fair Queue) | Classless | Per-flow fair queuing | Reducing bufferbloat |
| fq_codel | Classless | FQ + CoDel AQM | Low latency, bufferbloat control |
| htb | Classful | Hierarchical Token Bucket | Rate limiting, bandwidth allocation |
| tbf | Classless | Token Bucket Filter | Simple rate limiting |
| netem | Classless | Network emulator | Testing (delay, loss, jitter) |
| mq | Classful | Multi-queue wrapper | Multi-queue NIC support |
```c
/**
 * struct Qdisc - Queueing discipline structure
 *
 * Every network device has at least one qdisc that manages
 * the transmit queue.
 */
struct Qdisc {
    int             (*enqueue)(struct sk_buff *skb,
                               struct Qdisc *sch,
                               struct sk_buff **to_free);
    struct sk_buff *(*dequeue)(struct Qdisc *sch);

    unsigned int    flags;
    u32             limit;              /* Queue length limit */
    const struct Qdisc_ops *ops;        /* Qdisc operations */
    struct qdisc_size_table *stab;      /* Size table for GSO sizing */
    struct hlist_node hash;
    u32             handle;             /* Unique identifier */
    u32             parent;             /* Parent qdisc handle */

    struct netdev_queue *dev_queue;     /* Associated TX queue */

    struct net_rate_estimator *rate_est;    /* Rate estimator */
    struct gnet_stats_basic_packed bstats;  /* Basic stats */
    struct gnet_stats_queue qstats;         /* Queue stats */
    /* ... more fields ... */
};

/**
 * Packet enqueue flow through tc
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct Qdisc *q;

    /* Get qdisc for this device/queue */
    q = rcu_dereference(dev->qdisc);

    if (q->enqueue) {
        /* Enqueue through qdisc */
        return __dev_xmit_skb(skb, q, dev, txq);
    }

    /* No qdisc (unlikely) - direct transmit */
    return dev_hard_start_xmit(skb, dev, txq);
}

static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                          struct net_device *dev, struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);
    int rc;

    spin_lock(root_lock);

    /* Try to bypass qdisc and transmit directly if queue is empty */
    if ((q->flags & TCQ_F_CAN_BYPASS) && q->q.qlen == 0 &&
        qdisc_run_begin(q)) {
        /* Direct transmit path - skip qdisc enqueue/dequeue */
        rc = sch_direct_xmit(skb, q, dev, txq, root_lock, true);
    } else {
        /* Enqueue to qdisc */
        rc = q->enqueue(skb, q, &to_free);
        if (qdisc_run_begin(q)) {
            /* Attempt to dequeue and transmit */
            __qdisc_run(q);
        }
    }

    spin_unlock(root_lock);
    return rc;
}

/* Dequeue loop - transmit packets until queue empty or NIC full */
void __qdisc_run(struct Qdisc *q)
{
    int quota = dev_tx_weight;  /* Limit work per run */

    while (qdisc_restart(q, &packets)) {
        if (--quota <= 0 || need_resched()) {
            /* Requeue and schedule for later */
            __netif_schedule(q);
            break;
        }
    }
}

static inline int qdisc_restart(struct Qdisc *q, int *packets)
{
    struct sk_buff *skb;

    /* Dequeue next packet */
    skb = q->dequeue(q);
    if (!skb)
        return 0;

    /* Transmit the packet */
    return sch_direct_xmit(skb, q, ...);
}
```

Example tc configuration:
```bash
# View current qdisc
tc qdisc show dev eth0

# Replace default qdisc with fq_codel (for bufferbloat control)
tc qdisc replace dev eth0 root fq_codel

# Rate limit to 1 Gbps with htb
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit

# Add artificial latency for testing
# (each 'add ... root' assumes no root qdisc is installed yet;
#  use 'replace' or delete the existing root first)
tc qdisc add dev eth0 root netem delay 50ms 10ms
```
Understanding tc is essential for network performance tuning, implementing QoS, and debugging latency issues.
The default pfifo_fast qdisc can cause severe latency problems (bufferbloat) when queues fill up. Modern kernels recommend fq_codel, which uses Controlled Delay (CoDel) AQM to maintain low latency even under load. For servers handling varied workloads, fq_codel is almost always a better choice than the default.
Netfilter is the kernel's packet filtering and manipulation framework—the foundation for iptables, nftables, and connection tracking. It integrates into the protocol stack through hook points where packet processing can be intercepted.
Netfilter hook points:
Packets traverse different hooks depending on their path through the stack:

- NF_INET_PRE_ROUTING: all arriving packets, before the routing decision
- NF_INET_LOCAL_IN: packets routed to the local host
- NF_INET_FORWARD: packets routed through this host
- NF_INET_LOCAL_OUT: locally generated packets
- NF_INET_POST_ROUTING: all outgoing packets, after the routing decision
```c
/**
 * Netfilter hook invocation
 *
 * The NF_HOOK macro is used throughout the stack to invoke
 * registered netfilter handlers at each hook point.
 */
static inline int NF_HOOK(uint8_t pf, unsigned int hook, struct net *net,
                          struct sock *sk, struct sk_buff *skb,
                          struct net_device *in, struct net_device *out,
                          int (*okfn)(struct net *, struct sock *,
                                      struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, net, sk, skb, in, out, okfn, INT_MIN);
}

/* Example: IP receive path with netfilter hook */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    /* ... IP header validation ... */

    /* Invoke PREROUTING hook, then continue to ip_rcv_finish */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
                   net, NULL, skb, dev, NULL, ip_rcv_finish);
}

/* Example: IP local delivery with INPUT hook */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Handle fragmented packets */
    if (ip_is_fragment(ip_hdr(skb))) {
        skb = ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER);
        if (!skb)
            return 0;
    }

    /* Invoke INPUT hook, then continue to ip_local_deliver_finish */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
                   net, NULL, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

/**
 * Hook callback results
 */
#define NF_DROP     0   /* Drop the packet */
#define NF_ACCEPT   1   /* Accept, continue processing */
#define NF_STOLEN   2   /* Handler took ownership of packet */
#define NF_QUEUE    3   /* Queue to userspace (NFQUEUE) */
#define NF_REPEAT   4   /* Call this hook again */
#define NF_STOP     5   /* Stop processing, accept */

/**
 * Connection tracking integration
 *
 * Netfilter's conntrack module tracks connection state,
 * enabling stateful filtering and NAT.
 */

/* Possible connection states */
enum ip_conntrack_info {
    IP_CT_NEW,              /* First packet of new connection */
    IP_CT_ESTABLISHED,      /* Part of established connection */
    IP_CT_RELATED,          /* Related to established (e.g., FTP data) */
    IP_CT_RELATED_REPLY,    /* Reply to related packet */
    /* ... more states ... */
};

/* Each tracked connection has an entry */
struct nf_conn {
    /* Connection tuple (5-tuple) */
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];

    /* Connection timeout */
    unsigned long timeout;

    /* State bits */
    unsigned long status;

    /* Expected connections (for protocols with helper) */
    struct hlist_head expectations;

    /* NAT information */
    struct nf_conn_nat *nat;

    /* ... more fields ... */
};
```

While iptables remains widely used, nftables is the modern replacement offering better performance, simpler rule management, and a unified framework for IPv4, IPv6, and ARP filtering. Both use the same underlying Netfilter hooks, but nftables uses a more efficient bytecode-based rule evaluation engine.
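To show how these hooks are consumed, here is a hedged sketch of a minimal kernel module that registers its own handler at the NF_INET_LOCAL_IN hook and counts packets. The module and function names are hypothetical, but nf_register_net_hook()/nf_unregister_net_hook() and the nf_hookfn signature are the real modern registration API:

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

static atomic_t pkt_count = ATOMIC_INIT(0);

/* Hook callback: runs for every IPv4 packet destined to this host */
static unsigned int count_hook(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    atomic_inc(&pkt_count);
    /* NF_ACCEPT lets the packet continue up the stack;
     * returning NF_DROP here would silently discard it. */
    return NF_ACCEPT;
}

static const struct nf_hook_ops count_ops = {
    .hook     = count_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_IN,   /* same hook ip_local_deliver() traverses */
    .priority = NF_IP_PRI_FIRST,    /* run before iptables/nftables rules */
};

static int __init count_init(void)
{
    /* Register in the initial network namespace */
    return nf_register_net_hook(&init_net, &count_ops);
}

static void __exit count_exit(void)
{
    nf_unregister_net_hook(&init_net, &count_ops);
    pr_info("counted %d packets\n", atomic_read(&pkt_count));
}

module_init(count_init);
module_exit(count_exit);
MODULE_LICENSE("GPL");
```

The priority field determines ordering among handlers at the same hook, which is exactly how iptables chains, connection tracking, and NAT coexist on one hook point.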
The Linux protocol stack is a marvel of software engineering—decades of refinement have produced an implementation that scales from embedded devices to hyperscale data centers. Understanding its architecture is essential for anyone building or optimizing networked systems.
What's next:
With the protocol stack architecture understood, we'll explore network namespaces—the kernel feature that enables complete network stack isolation for containers and virtualization. You'll learn how Linux creates multiple independent networking environments on a single kernel.
You now understand the Linux protocol stack architecture—the layers, data structures, and processing paths that implement networking in the kernel. This knowledge is essential for network performance engineering, debugging complex networking issues, and understanding the behavior of networked applications.