Beneath every network-capable application lies the Linux protocol stack—a meticulously organized set of layers that transform abstract communication requests into electrical signals traversing physical media. When your application calls send() on a TCP socket, a remarkably complex series of transformations occurs: data is copied, headers are prepended, checksums are calculated, fragmentation is performed, routing decisions are made, ARP resolution occurs, and finally packets are handed to a network device driver.
This multi-layered architecture isn't merely organizational tidiness—it's the result of decades of protocol design wisdom, embodied in implementations that handle billions of packets per second across the global internet. The Linux protocol stack implements this architecture with such efficiency that it powers everything from embedded IoT devices to the world's largest hyperscale data centers.
Understanding the protocol stack is fundamental to network performance engineering. Whether you're diagnosing latency issues, optimizing throughput, implementing custom protocols, or simply troubleshooting connectivity problems, knowledge of how packets flow through these layers transforms mysterious network behavior into understandable, debuggable systems.
By the end of this page, you will understand the Linux protocol stack architecture, including the layer hierarchy, the relationship between network layers and kernel code organization, the path packets take through the stack, key data structures at each layer, and how protocol handlers are registered and dispatched. You'll see how the OSI model maps to Linux implementation reality.
The Linux networking stack follows a layered model that corresponds to the OSI reference model, though the implementation combines some layers for efficiency. Each layer has distinct responsibilities and communicates with adjacent layers through well-defined interfaces.
The conceptual layers:
From top to bottom, the Linux networking stack consists of:

- The socket layer, exposing the BSD sockets API to applications
- The transport layer (TCP, UDP, SCTP, DCCP)
- The network layer (IPv4, IPv6, ICMP), including routing
- The link layer (Ethernet, WiFi, PPP), including queueing
- The device driver layer, which programs the hardware
Each layer adds (on transmit) or removes (on receive) its own headers, forming the classic networking "hourglass" where IP provides the narrow waist that all higher and lower protocols must pass through.
| Layer | OSI Layers | Key Structures | Primary Functions | Example Protocols |
|---|---|---|---|---|
| Socket | Session/Presentation | struct socket, struct sock | API abstraction, multiplexing | BSD sockets API |
| Transport | Transport (L4) | struct tcp_sock, struct udp_sock | End-to-end delivery, reliability | TCP, UDP, SCTP, DCCP |
| Network | Network (L3) | struct iphdr, struct rtable | Routing, fragmentation, addressing | IPv4, IPv6, ICMP |
| Link | Data Link (L2) | struct net_device, struct ethhdr | Framing, MAC addressing, queuing | Ethernet, WiFi, PPP |
| Device Driver | Physical (L1) | struct sk_buff, device registers | Hardware I/O, DMA, interrupts | Hardware-specific |
The sk_buff: The Universal Packet Container
Across all layers, packets are represented by struct sk_buff (socket buffer). This structure is perhaps the most important in Linux networking—it contains:

- Linked-list pointers for queue management
- References to the owning socket, the arrival timestamp, and the associated network device
- Pointers to the transport, network, and link-layer headers
- The data pointers (head, data, tail, end) that delimit the packet within its buffer
- Length, protocol, packet-type, checksum-status, and QoS priority metadata
The sk_buff is designed for efficient header manipulation: instead of copying data when adding/removing headers, pointers are adjusted. This "zero-copy" design is critical for high-speed networking.
```c
/**
 * struct sk_buff - Socket buffer
 *
 * This is THE central data structure in Linux networking.
 * Every packet flowing through the stack is wrapped in an sk_buff.
 */
struct sk_buff {
    /* Linked list pointers for queue management */
    struct sk_buff      *next;
    struct sk_buff      *prev;

    /* Associated socket (if any) */
    struct sock         *sk;

    /* Packet arrival timestamp */
    ktime_t             tstamp;

    /* Network device packet arrived on / will be sent out */
    struct net_device   *dev;

    /*
     * Pointers to protocol headers
     * These enable zero-copy layer traversal
     */
    union {
        struct tcphdr   *th;        /* TCP header */
        struct udphdr   *uh;        /* UDP header */
        struct icmphdr  *icmph;     /* ICMP header */
        unsigned char   *raw;       /* Raw pointer */
    } h;                            /* Transport layer header */

    union {
        struct iphdr    *iph;       /* IPv4 header */
        struct ipv6hdr  *ipv6h;     /* IPv6 header */
        struct arphdr   *arph;      /* ARP header */
        unsigned char   *raw;
    } nh;                           /* Network layer header */

    union {
        struct ethhdr   *ethernet;  /* Ethernet header */
        unsigned char   *raw;
    } mac;                          /* Link layer header */

    /* Actual data pointers */
    unsigned char       *head;      /* Start of allocated buffer */
    unsigned char       *data;      /* Start of actual data */
    unsigned char       *tail;      /* End of actual data */
    unsigned char       *end;       /* End of allocated buffer */

    /* Length fields */
    unsigned int        len;        /* Bytes of data */
    unsigned int        data_len;   /* Bytes in fragments */
    __u16               mac_len;    /* Length of link header */

    /* Protocol identification */
    __be16              protocol;   /* Packet protocol (ETH_P_IP, etc.) */

    /* Packet type (for us, broadcast, multicast, etc.) */
    __u8                pkt_type;

    /* Checksum status */
    __u8                ip_summed;

    /* Priority for QoS */
    __u32               priority;

    /* ... many more fields ... */
};
```

When adding headers, skb_push() moves the data pointer backward. When removing headers, skb_pull() moves it forward. The underlying buffer stays in place—only pointers change. This design is essential for performance: at 100 Gbps, copying headers would consume more CPU than modern systems can provide.
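To make the pointer arithmetic concrete, here is a minimal userspace sketch (not kernel code) that mimics the head/data/tail/end model and the skb_push()/skb_pull() operations. The toy_* names are illustrative assumptions, not kernel APIs:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy model of the sk_buff data area: one allocation, four pointers */
struct toy_skb {
    unsigned char *head;    /* start of allocation */
    unsigned char *data;    /* start of live data */
    unsigned char *tail;    /* end of live data */
    unsigned char *end;     /* end of allocation */
};

/* Mimic skb_reserve(): leave headroom for headers pushed later */
static void toy_reserve(struct toy_skb *skb, size_t len)
{
    skb->data += len;
    skb->tail += len;
}

/* Mimic skb_push(): grow data toward head (prepend a header) */
static unsigned char *toy_push(struct toy_skb *skb, size_t len)
{
    assert(skb->data - len >= skb->head);   /* must have headroom */
    skb->data -= len;
    return skb->data;
}

/* Mimic skb_pull(): shrink data from the front (strip a header) */
static unsigned char *toy_pull(struct toy_skb *skb, size_t len)
{
    assert(skb->data + len <= skb->tail);
    skb->data += len;
    return skb->data;
}

int main(void)
{
    unsigned char buf[256];
    struct toy_skb skb = { buf, buf, buf, buf + sizeof(buf) };

    toy_reserve(&skb, 64);          /* headroom for Ethernet + IP + TCP */

    /* "Payload" written at data; tail advances past it */
    memcpy(skb.data, "hello", 5);
    skb.tail = skb.data + 5;

    /* Transmit direction: prepend headers without copying the payload */
    memcpy(toy_push(&skb, 20), "TCP-header-20-bytes.", 20);
    memcpy(toy_push(&skb, 20), "IP-header--20-bytes.", 20);

    /* Receive direction: strip them again; the payload never moved */
    toy_pull(&skb, 20);
    toy_pull(&skb, 20);
    printf("payload: %.5s (offset %td into buffer)\n",
           skb.data, skb.data - skb.head);
    return 0;
}
```

In the kernel proper, skb_reserve() at allocation time creates the headroom that later skb_push() calls consume, which is why drivers reserve space for the full header stack up front.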
Understanding how packets flow from application to wire is essential for performance optimization and debugging. The transmit path involves multiple subsystems, each adding its contribution before the packet reaches the network device.
The complete transmit path (TCP example):
When an application calls send() on a TCP socket, the following sequence occurs:
1. tcp_sendmsg() copies data from user space into segments queued on the socket's sk_write_queue
2. tcp_transmit_skb() clones segments and passes them down
3. ip_queue_xmit() performs routing lookup, adds IP header, handles fragmentation if needed
4. dev_queue_xmit() passes the packet through the device's queueing discipline
5. ndo_start_xmit() programs the NIC for DMA transmission
```c
/*
 * Simplified view of the TCP transmit path
 * Each function call represents a layer transition
 */

/* User space */
send(sockfd, buffer, len, flags);

/* System call entry */
SYSCALL_DEFINE4(sendto, ...)
    → sock_sendmsg()
        → inet_sendmsg()
            → tcp_sendmsg()

/**
 * tcp_sendmsg - Copy data from user space and segment
 */
int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
{
    struct sk_buff *skb;
    int copy;

    /* Copy user data to socket buffer */
    while (size > 0) {
        /* Get or allocate segment */
        skb = tcp_write_queue_tail(sk);
        if (!skb || skb->len >= mss_now)
            skb = sk_stream_alloc_skb(...);

        /* Copy data from user space */
        copy = min_t(size_t, size, mss_now - skb->len);
        skb_copy_to_page_nocache(...);
        size -= copy;
    }

    /* Try to transmit queued segments */
    tcp_push(sk, flags, mss_now, ...);
    return copied;
}

/**
 * tcp_transmit_skb - Build TCP header and pass to IP
 */
int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, ...)
{
    struct tcphdr *th;

    /* Reserve space for TCP header */
    skb_push(skb, tcp_header_size);
    skb_reset_transport_header(skb);

    /* Build TCP header */
    th = (struct tcphdr *)skb->data;
    th->source  = inet->inet_sport;
    th->dest    = inet->inet_dport;
    th->seq     = htonl(tcb->seq);
    th->ack_seq = htonl(tp->rcv_nxt);
    th->doff    = tcp_header_size >> 2;
    th->window  = htons(tcp_select_window(sk));

    /* Calculate checksum (may be offloaded to NIC) */
    tcp_v4_send_check(sk, skb);

    /* Pass to IP layer */
    return ip_queue_xmit(sk, skb, fl4);
}

/**
 * ip_queue_xmit - Add IP header and route packet
 */
int ip_queue_xmit(struct sock *sk, struct sk_buff *skb)
{
    struct rtable *rt;
    struct iphdr *iph;

    /* Get cached route or perform lookup */
    rt = ip_route_output_ports(...);

    /* Reserve space and add IP header */
    skb_push(skb, sizeof(struct iphdr));
    skb_reset_network_header(skb);

    /* Build IP header */
    iph = ip_hdr(skb);
    iph->version  = 4;
    iph->ihl      = 5;
    iph->tos      = inet->tos;
    iph->tot_len  = htons(skb->len);
    iph->id       = htons(ip_idents_reserve(...));
    iph->frag_off = htons(IP_DF);
    iph->ttl      = ip_select_ttl(sk, &rt->dst);
    iph->protocol = sk->sk_protocol;
    iph->saddr    = fl4->saddr;
    iph->daddr    = fl4->daddr;

    /* Continue down the stack */
    return ip_local_out(skb);
}

/**
 * ip_local_out - Handle netfilter and send to device
 */
int ip_local_out(struct sk_buff *skb)
{
    /* Traverse Netfilter OUTPUT chain */
    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, ... dst_output);
}

/**
 * dst_output → dev_queue_xmit - Queue to device
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct Qdisc *q;

    /* Get device's queueing discipline */
    q = rcu_dereference(dev->qdisc);

    /* Enqueue packet */
    if (q->enqueue)
        return __dev_xmit_skb(skb, q, dev, ...);

    /* Direct transmit for lockless/simple qdiscs */
    return dev_hard_start_xmit(skb, dev, ...);
}

/**
 * dev_hard_start_xmit - Call driver transmit
 */
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, ...)
{
    const struct net_device_ops *ops = dev->netdev_ops;

    /* Call driver's transmit function */
    return ops->ndo_start_xmit(skb, dev);
}
```

The entire transmit path typically runs in process context (the context of the calling application), which means it competes for CPU time with the application. For latency-sensitive workloads, this is actually beneficial—the sending thread directly pushes packets toward the wire. For throughput-oriented workloads, various optimizations (GSO, TSO, BQL) batch work to amortize overhead.
The receive path is architecturally more complex than transmit because it must handle asynchronous packet arrival from hardware. The kernel uses multiple mechanisms—hardware interrupts, software interrupts (softirq), and NAPI—to efficiently process incoming packets without overwhelming the CPU.
The complete receive path:
When a packet arrives at the network interface:
1. The NIC DMAs the frame into a receive ring and raises a hardware interrupt; the driver schedules NAPI polling
2. The NAPI poll function wraps the frame in an sk_buff and passes it up via napi_gro_receive()
3. netif_receive_skb() determines protocol, calls appropriate handler
4. ip_rcv() validates header, performs routing lookup, traverses netfilter INPUT chain
5. tcp_v4_rcv() or udp_rcv() finds associated socket, delivers to socket queue
6. Data is queued on sk_receive_queue and any sleeping reader is awakened
7. recv() copies data from kernel to user buffer
```c
/*
 * Packet reception flow from NIC to application
 * This is invoked asynchronously from hardware interrupts
 */

/* 1. Hardware interrupt handler (example: Intel e1000e driver) */
static irqreturn_t e1000_intr(int irq, void *dev_id)
{
    struct e1000_adapter *adapter = dev_id;

    /* Acknowledge and disable interrupts */
    E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0);

    /* Schedule NAPI processing */
    napi_schedule(&adapter->napi);

    return IRQ_HANDLED;
}

/* 2. NAPI poll function (called from softirq) */
static int e1000_poll(struct napi_struct *napi, int budget)
{
    struct e1000_adapter *adapter = container_of(napi, ...);
    int work_done = 0;

    while (work_done < budget) {
        struct e1000_rx_desc *rx_desc;
        struct sk_buff *skb;

        /* Get next completed descriptor */
        rx_desc = E1000_RX_DESC(ring, i);
        if (!(rx_desc->status & E1000_RXD_STAT_DD))
            break;  /* No more completed packets */

        /* Allocate sk_buff and copy/map data */
        skb = e1000_alloc_rx_skb(adapter, rx_desc);

        /* Set protocol and device */
        skb->protocol = eth_type_trans(skb, netdev);

        /* Pass up the stack (with GRO) */
        napi_gro_receive(napi, skb);
        work_done++;
    }

    /* If we processed less than budget, we're done */
    if (work_done < budget) {
        napi_complete(napi);
        /* Re-enable interrupts */
        E1000_WRITE_REG(&adapter->hw, E1000_IMS, ...);
    }

    return work_done;
}

/* 3. Generic receive processing */
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    /* Try to coalesce with existing GRO flows */
    gro_result_t ret = dev_gro_receive(napi, skb);

    if (ret == GRO_NORMAL)
        return netif_receive_skb(skb);
    return ret;
}

/* 4. Protocol dispatch */
int netif_receive_skb(struct sk_buff *skb)
{
    struct packet_type *ptype;
    __be16 type = skb->protocol;

    /* Deliver to protocol handlers registered for this type */
    list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 0xf], list) {
        if (ptype->type == type) {
            /* Call protocol handler: ip_rcv, arp_rcv, etc. */
            ptype->func(skb, skb->dev, ptype, ...);
        }
    }
    return NET_RX_SUCCESS;
}

/* 5. IP layer receive */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    struct iphdr *iph;

    /* Validate IP header */
    iph = ip_hdr(skb);
    if (iph->ihl < 5 || iph->version != 4)
        goto drop;
    if (ip_fast_csum((u8 *)iph, iph->ihl))
        goto drop;  /* Checksum failed */

    /* Traverse netfilter PREROUTING chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, ip_rcv_finish);
}

/* 6. IP routing and local delivery */
static int ip_rcv_finish(struct sk_buff *skb)
{
    struct rtable *rt;

    /* Perform routing lookup */
    rt = ip_route_input(skb, ...);

    /* If packet is for local delivery */
    if (rt->rt_type == RTN_LOCAL)
        return ip_local_deliver(skb);

    /* If packet needs forwarding */
    return ip_forward(skb);
}

/* 7. Transport layer dispatch */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Traverse netfilter INPUT chain */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);
    int protocol = iph->protocol;
    const struct net_protocol *ipprot;

    /* Find transport protocol handler */
    ipprot = rcu_dereference(inet_protos[protocol]);

    /* Call protocol handler: tcp_v4_rcv, udp_rcv, etc. */
    return ipprot->handler(skb);
}

/* 8. TCP receive processing */
int tcp_v4_rcv(struct sk_buff *skb)
{
    struct tcphdr *th = tcp_hdr(skb);
    struct sock *sk;

    /* Look up socket by 4-tuple */
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest, ...);
    if (!sk)
        goto no_tcp_socket;

    /* Deliver to socket */
    if (sk->sk_state == TCP_LISTEN)
        return tcp_v4_do_rcv(sk, skb);

    /* Queue to socket's receive queue */
    if (!sock_owned_by_user(sk)) {
        if (!tcp_prequeue(sk, skb))
            return tcp_v4_do_rcv(sk, skb);
    } else {
        /* Socket locked by user, use backlog */
        sk_add_backlog(sk, skb, ...);
    }
    return 0;
}
```

NAPI (New API) is crucial for high-speed networking. Without NAPI, each packet would generate a hardware interrupt, overwhelming the CPU at 10+ Gbps. NAPI allows the kernel to process packets in batches via polling, with interrupts only triggering the start of processing. This interrupt mitigation is what enables Linux to handle millions of packets per second.
The Linux networking stack is highly modular—protocols can be loaded as kernel modules at runtime. This flexibility is achieved through a registration system where protocols register handlers with the kernel, which then dispatches packets appropriately.
Three levels of protocol registration:
| Level | Registration Array | Key Field | Examples | Purpose |
|---|---|---|---|---|
| Address Family | net_families[] | family (AF_*) | AF_INET, AF_UNIX | Socket creation handling |
| L3 Protocol | ptype_base[] | EtherType | ETH_P_IP, ETH_P_ARP | Frame dispatch from NIC |
| L4 Protocol | inet_protos[] | IP protocol | IPPROTO_TCP, IPPROTO_UDP | IP packet dispatch |
```c
/**
 * L3 protocol registration - frame-level dispatch by EtherType
 *
 * struct packet_type defines handlers for specific EtherTypes
 * (e.g., 0x0800 for IPv4, 0x0806 for ARP)
 */
struct packet_type {
    __be16              type;   /* EtherType in network order */
    struct net_device   *dev;   /* NULL = all devices */
    int                 (*func)(struct sk_buff *, ...); /* Handler */
    void                (*id_match)(struct packet_type *, ...);
    struct list_head    list;
};

/* IPv4 packet type registration */
static struct packet_type ip_packet_type = {
    .type = cpu_to_be16(ETH_P_IP),  /* 0x0800 */
    .func = ip_rcv,                 /* Handler function */
};

void __init ip_init(void)
{
    /* Register to receive all IPv4 frames */
    dev_add_pack(&ip_packet_type);
    /* ... other initialization ... */
}

/**
 * L4 protocol registration - dispatch by IP protocol number
 *
 * struct net_protocol defines handlers for IP protocol numbers
 * (e.g., 6 for TCP, 17 for UDP, 1 for ICMP)
 */
struct net_protocol {
    int     (*handler)(struct sk_buff *skb);
    void    (*err_handler)(struct sk_buff *skb, u32 info);
    unsigned int    no_policy:1,
                    netns_ok:1;
};

/* TCP protocol registration */
static const struct net_protocol tcp_protocol = {
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

void __init tcp_init(void)
{
    /* Register TCP as IP protocol 6 */
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        panic("Failed to register TCP protocol");
    /* ... other initialization ... */
}

int inet_add_protocol(const struct net_protocol *prot, unsigned char num)
{
    /* Atomic registration in inet_protos array */
    return !cmpxchg((const struct net_protocol **)&inet_protos[num],
                    NULL, prot) ? 0 : -1;
}

/**
 * Address family registration - socket creation handling
 *
 * struct net_proto_family defines how sockets are created
 * for each address family
 */
static const struct net_proto_family inet_family_ops = {
    .family = AF_INET,
    .create = inet_create,
    .owner  = THIS_MODULE,
};

static int __init inet_init(void)
{
    struct inet_protosw *q;

    /* Register AF_INET socket family */
    (void)sock_register(&inet_family_ops);

    /* Register individual socket types */
    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; q++)
        inet_register_protosw(q);

    return 0;
}

/**
 * Protocol switch array for socket types within a family
 */
struct inet_protosw {
    struct list_head        list;
    unsigned short          type;       /* SOCK_STREAM, SOCK_DGRAM */
    unsigned short          protocol;   /* IPPROTO_TCP, IPPROTO_UDP */
    struct proto            *prot;      /* Protocol handler (tcp_prot, udp_prot) */
    const struct proto_ops  *ops;       /* Socket operations */
    unsigned char           flags;      /* INET_PROTOSW_PERMANENT, etc. */
};

/* TCP socket type registration */
static struct inet_protosw inetsw_array[] = {
    {
        .type     = SOCK_STREAM,
        .protocol = IPPROTO_TCP,
        .prot     = &tcp_prot,
        .ops      = &inet_stream_ops,
        .flags    = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
    },
    {
        .type     = SOCK_DGRAM,
        .protocol = IPPROTO_UDP,
        .prot     = &udp_prot,
        .ops      = &inet_dgram_ops,
        .flags    = INET_PROTOSW_PERMANENT,
    },
    /* ... SOCK_RAW, etc. ... */
};
```

How protocol dispatch works:
When a frame arrives at the NIC:
1. eth_type_trans() sets skb->protocol to the EtherType
2. netif_receive_skb() hashes the protocol and looks up ptype_base[hash]
3. packet_type handlers matching the EtherType are called
4. For IPv4, ip_rcv() is called, which extracts the IP protocol number
5. ip_local_deliver_finish() looks up inet_protos[protocol_num]
6. For TCP, tcp_v4_rcv() is called

This layered lookup enables each layer to handle its own demultiplexing independently.
Protocols can be compiled as kernel modules. When you first use a specific protocol (e.g., SCTP), the kernel may automatically load the module via kmod. The protocol registers its handlers during module initialization and unregisters during cleanup, enabling dynamic protocol support without kernel recompilation.
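As a hedged illustration of this registration pattern, here is a minimal sketch of an out-of-tree module that registers a packet_type handler for an experimental EtherType. The module name, handler, and EtherType choice are hypothetical (0x88b5 is an IEEE-reserved value for local experiments), though dev_add_pack()/dev_remove_pack() are the real registration calls shown above:

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical EtherType for demonstration only */
#define ETH_P_DEMO 0x88b5

static int demo_rcv(struct sk_buff *skb, struct net_device *dev,
                    struct packet_type *pt, struct net_device *orig_dev)
{
    pr_info("demo: %u byte frame on %s\n", skb->len, dev->name);
    kfree_skb(skb);     /* we consume the packet */
    return NET_RX_SUCCESS;
}

static struct packet_type demo_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_DEMO),
    .func = demo_rcv,   /* called from netif_receive_skb() dispatch */
};

static int __init demo_init(void)
{
    dev_add_pack(&demo_packet_type);    /* register in ptype_base */
    return 0;
}

static void __exit demo_exit(void)
{
    dev_remove_pack(&demo_packet_type); /* unregister on cleanup */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

Once loaded, every frame carrying that EtherType is dispatched to demo_rcv() by exactly the ptype_base lookup described in the list above; unloading the module removes the handler with no kernel recompilation.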
The network device abstraction (struct net_device) is the interface between protocol-independent networking code and hardware-specific drivers. Every network interface—Ethernet, WiFi, loopback, virtual devices—is represented by a net_device structure.
Key responsibilities of net_device:

- Device identification and configuration (name, ifindex, MTU, flags)
- Hardware (MAC) addressing
- Exposing driver callbacks through net_device_ops
- Advertising hardware offload capabilities via feature flags
- Managing transmit/receive queues and the attached qdisc
- Maintaining statistics and device state, including network namespace membership
```c
/**
 * struct net_device - Network device abstraction
 *
 * This is one of the largest structures in the kernel,
 * representing everything about a network interface.
 */
struct net_device {
    /* Device identification */
    char            name[IFNAMSIZ];     /* eth0, lo, etc. */
    int             ifindex;            /* Unique interface index */
    unsigned int    flags;              /* IFF_UP, IFF_LOOPBACK, etc. */
    unsigned int    mtu;                /* Maximum transmission unit */

    /* Hardware address */
    unsigned char   dev_addr[MAX_ADDR_LEN];
    unsigned char   addr_len;           /* Address length */
    unsigned short  type;               /* ARPHRD_ETHER, etc. */

    /* Device operations */
    const struct net_device_ops *netdev_ops;   /* Driver callbacks */
    const struct ethtool_ops    *ethtool_ops;  /* ethtool interface */

    /* Feature flags (hardware capabilities) */
    netdev_features_t   features;           /* Active features */
    netdev_features_t   hw_features;        /* Hardware capabilities */
    netdev_features_t   wanted_features;    /* User-requested features */

    /* Transmit queues */
    struct netdev_queue *_tx;               /* Transmit queue array */
    unsigned int        num_tx_queues;      /* Number of TX queues */
    unsigned int        real_num_tx_queues; /* Currently active TX queues */

    /* Queueing discipline */
    struct Qdisc        *qdisc;             /* Root qdisc */
    struct Qdisc        *qdisc_sleeping;    /* Qdisc when device dormant */

    /* Receive queues (for multi-queue NICs) */
    struct netdev_rx_queue *_rx;
    unsigned int        num_rx_queues;
    unsigned int        real_num_rx_queues;

    /* NAPI instances for receive processing */
    struct list_head    napi_list;

    /* Statistics */
    struct net_device_stats stats;          /* Basic statistics */
    atomic_long_t       rx_dropped;         /* Packets dropped on RX */
    atomic_long_t       tx_dropped;         /* Packets dropped on TX */

    /* State and links */
    unsigned int        state;              /* Device state bits */
    struct net          *nd_net;            /* Network namespace */

    /* ... many more fields ... */
};

/**
 * struct net_device_ops - Driver operation callbacks
 *
 * Drivers implement these functions to provide hardware-specific behavior
 */
struct net_device_ops {
    /* Device lifecycle */
    int     (*ndo_init)(struct net_device *dev);
    void    (*ndo_uninit)(struct net_device *dev);
    int     (*ndo_open)(struct net_device *dev);
    int     (*ndo_stop)(struct net_device *dev);

    /* Packet transmission - THE critical function */
    netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
                                  struct net_device *dev);

    /* Configuration */
    int     (*ndo_set_mac_address)(struct net_device *dev, void *addr);
    int     (*ndo_change_mtu)(struct net_device *dev, int new_mtu);
    void    (*ndo_set_rx_mode)(struct net_device *dev); /* Multicast/promisc */

    /* Feature control */
    int     (*ndo_set_features)(struct net_device *dev,
                                netdev_features_t features);
    netdev_features_t (*ndo_fix_features)(struct net_device *dev,
                                          netdev_features_t features);

    /* Statistics */
    void    (*ndo_get_stats64)(struct net_device *dev,
                               struct rtnl_link_stats64 *stats);

    /* Queue management (for multi-queue devices) */
    u16     (*ndo_select_queue)(struct net_device *dev,
                                struct sk_buff *skb,
                                struct net_device *sb_dev);

    /* ... many more operations ... */
};

/**
 * Hardware feature flags
 * These indicate what the NIC can offload from the CPU
 */
#define NETIF_F_SG          /* Scatter/gather I/O */
#define NETIF_F_CSUM_MASK   /* TX checksum offload */
#define NETIF_F_RXCSUM      /* RX checksum verification */
#define NETIF_F_TSO         /* TCP segmentation offload */
#define NETIF_F_TSO6        /* TSO for IPv6 */
#define NETIF_F_GSO         /* Generic segmentation offload */
#define NETIF_F_GRO         /* Generic receive offload */
#define NETIF_F_LRO         /* Large receive offload */
#define NETIF_F_HIGHDMA     /* Can DMA to high memory */
#define NETIF_F_HW_VLAN_*   /* Hardware VLAN handling */

/* Example driver: check if hardware can offload checksums */
if (skb->ip_summed == CHECKSUM_PARTIAL) {
    if (dev->features & NETIF_F_HW_CSUM) {
        /* Hardware will compute checksum */
        /* Set up descriptor for hardware offload */
    } else {
        /* Must compute checksum in software */
        skb_checksum_help(skb);
    }
}
```

Modern NICs have multiple hardware transmit and receive queues. This enables parallel packet processing across CPU cores without lock contention. The kernel's XPS (Transmit Packet Steering) and RSS (Receive Side Scaling) distribute packets across queues based on flow hashes, enabling near-linear scaling with core count.
Hardware offloads:
Modern NICs can offload significant work from the CPU:
| Offload | Description | Performance Impact |
|---|---|---|
| TSO (TCP Segmentation Offload) | NIC splits large TCP segments | 60-90% CPU reduction for bulk TX |
| GSO (Generic Segmentation Offload) | Software TSO fallback | Batches work, reduces per-packet overhead |
| GRO (Generic Receive Offload) | Combines received packets | Reduces per-packet RX processing |
| Checksum Offload | NIC computes/verifies checksums | Saves CPU cycles per packet |
| RSS | NIC distributes RX across queues | Enables multi-core RX scaling |
Understanding which offloads your NIC supports (via ethtool -k) is crucial for network performance tuning.
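The same feature information that ethtool -k prints is exposed programmatically through the SIOCETHTOOL ioctl. Below is a hedged userspace sketch that queries one legacy feature flag (RX checksum offload) for an interface; the default interface name is an assumption, and modern kernels expose richer feature data via ETHTOOL_GFEATURES or the ethtool netlink API:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
    const char *ifname = argc > 1 ? argv[1] : "eth0"; /* assumed name */
    struct ethtool_value eval = { .cmd = ETHTOOL_GRXCSUM };
    struct ifreq ifr;
    int fd;

    /* Any socket works as a handle for interface ioctls */
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&eval;   /* kernel reads .cmd, fills .data */

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("%s: RX checksum offload is %s\n",
               ifname, eval.data ? "on" : "off");
    else
        perror("SIOCETHTOOL");

    close(fd);
    return 0;
}
```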
Between the protocol stack and the device driver lies the Traffic Control subsystem—a powerful framework for packet scheduling, shaping, and manipulation. Every packet passing through a network interface traverses a queueing discipline (qdisc) that determines when (and if) it's transmitted.
The tc architecture:
Traffic control is organized around three concepts:

- Qdiscs (queueing disciplines): algorithms that decide when, and in what order, queued packets are dequeued for transmission
- Classes: subdivisions within classful qdiscs that allow hierarchical bandwidth allocation
- Filters: rules that classify packets into classes
| Qdisc | Type | Purpose | Use Case |
|---|---|---|---|
| pfifo_fast | Classless | Priority FIFO (default) | General purpose, low overhead |
| fq (Fair Queue) | Classless | Per-flow fair queuing | Reducing bufferbloat |
| fq_codel | Classless | FQ + CoDel AQM | Low latency, bufferbloat control |
| htb | Classful | Hierarchical Token Bucket | Rate limiting, bandwidth allocation |
| tbf | Classless | Token Bucket Filter | Simple rate limiting |
| netem | Classless | Network emulator | Testing (delay, loss, jitter) |
| mq | Classful | Multi-queue wrapper | Multi-queue NIC support |
```c
/**
 * struct Qdisc - Queueing discipline structure
 *
 * Every network device has at least one qdisc that manages
 * the transmit queue.
 */
struct Qdisc {
    int             (*enqueue)(struct sk_buff *skb,
                               struct Qdisc *sch,
                               struct sk_buff **to_free);
    struct sk_buff *(*dequeue)(struct Qdisc *sch);

    unsigned int    flags;
    u32             limit;              /* Queue length limit */
    const struct Qdisc_ops *ops;        /* Qdisc operations */
    struct qdisc_size_table *stab;      /* Size table for GSO sizing */
    struct hlist_node hash;
    u32             handle;             /* Unique identifier */
    u32             parent;             /* Parent qdisc handle */

    struct netdev_queue *dev_queue;     /* Associated TX queue */

    struct net_rate_estimator *rate_est;    /* Rate estimator */
    struct gnet_stats_basic_packed bstats;  /* Basic stats */
    struct gnet_stats_queue qstats;         /* Queue stats */
    /* ... more fields ... */
};

/**
 * Packet enqueue flow through tc
 */
int dev_queue_xmit(struct sk_buff *skb)
{
    struct net_device *dev = skb->dev;
    struct Qdisc *q;

    /* Get qdisc for this device/queue */
    q = rcu_dereference(dev->qdisc);

    if (q->enqueue) {
        /* Enqueue through qdisc */
        return __dev_xmit_skb(skb, q, dev, txq);
    }

    /* No qdisc (unlikely) - direct transmit */
    return dev_hard_start_xmit(skb, dev, txq);
}

static int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                          struct net_device *dev, struct netdev_queue *txq)
{
    spinlock_t *root_lock = qdisc_lock(q);
    int rc;

    spin_lock(root_lock);

    /* Try to bypass qdisc and transmit directly if queue is empty */
    if ((q->flags & TCQ_F_CAN_BYPASS) && q->q.qlen == 0 &&
        qdisc_run_begin(q)) {
        /* Direct transmit path - skip qdisc enqueue/dequeue */
        rc = sch_direct_xmit(skb, q, dev, txq, root_lock, true);
    } else {
        /* Enqueue to qdisc */
        rc = q->enqueue(skb, q, &to_free);
        if (qdisc_run_begin(q)) {
            /* Attempt to dequeue and transmit */
            __qdisc_run(q);
        }
    }

    spin_unlock(root_lock);
    return rc;
}

/* Dequeue loop - transmit packets until queue empty or NIC full */
void __qdisc_run(struct Qdisc *q)
{
    int quota = dev_tx_weight;  /* Limit work per run */

    while (qdisc_restart(q, &packets)) {
        if (--quota <= 0 || need_resched()) {
            /* Requeue and schedule for later */
            __netif_schedule(q);
            break;
        }
    }
}

static inline int qdisc_restart(struct Qdisc *q, int *packets)
{
    struct sk_buff *skb;

    /* Dequeue next packet */
    skb = q->dequeue(q);
    if (!skb)
        return 0;

    /* Transmit the packet */
    return sch_direct_xmit(skb, q, ...);
}
```

Example tc configuration:
```bash
# View current qdisc
tc qdisc show dev eth0

# Replace default qdisc with fq_codel (for bufferbloat control)
tc qdisc replace dev eth0 root fq_codel

# Rate limit to 1 Gbps with htb
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit

# Add artificial latency for testing
# (each 'add ... root' assumes no root qdisc is installed yet;
#  use 'replace' or delete the existing root first)
tc qdisc add dev eth0 root netem delay 50ms 10ms
```
Understanding tc is essential for network performance tuning, implementing QoS, and debugging latency issues.
The default pfifo_fast qdisc can cause severe latency problems (bufferbloat) when queues fill up. Modern kernels recommend fq_codel, which uses Controlled Delay (CoDel) AQM to maintain low latency even under load. For servers handling varied workloads, fq_codel is almost always a better choice than the default.
Netfilter is the kernel's packet filtering and manipulation framework—the foundation for iptables, nftables, and connection tracking. It integrates into the protocol stack through hook points where packet processing can be intercepted.
Netfilter hook points:
Packets traverse different hooks depending on their path through the stack:

- NF_INET_PRE_ROUTING: all arriving packets, before the routing decision
- NF_INET_LOCAL_IN: packets routed to the local host
- NF_INET_FORWARD: packets routed through this host
- NF_INET_LOCAL_OUT: locally generated packets
- NF_INET_POST_ROUTING: all outgoing packets, after the routing decision
```c
/**
 * Netfilter hook invocation
 *
 * The NF_HOOK macro is used throughout the stack to invoke
 * registered netfilter handlers at each hook point.
 */
static inline int NF_HOOK(uint8_t pf, unsigned int hook, struct net *net,
                          struct sock *sk, struct sk_buff *skb,
                          struct net_device *in, struct net_device *out,
                          int (*okfn)(struct net *, struct sock *,
                                      struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, net, sk, skb, in, out, okfn, INT_MIN);
}

/* Example: IP receive path with netfilter hook */
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    /* ... IP header validation ... */

    /* Invoke PREROUTING hook, then continue to ip_rcv_finish */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
                   net, NULL, skb, dev, NULL, ip_rcv_finish);
}

/* Example: IP local delivery with INPUT hook */
int ip_local_deliver(struct sk_buff *skb)
{
    /* Handle fragmented packets */
    if (ip_is_fragment(ip_hdr(skb))) {
        skb = ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER);
        if (!skb)
            return 0;
    }

    /* Invoke INPUT hook, then continue to ip_local_deliver_finish */
    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
                   net, NULL, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

/**
 * Hook callback results
 */
#define NF_DROP     0   /* Drop the packet */
#define NF_ACCEPT   1   /* Accept, continue processing */
#define NF_STOLEN   2   /* Handler took ownership of packet */
#define NF_QUEUE    3   /* Queue to userspace (NFQUEUE) */
#define NF_REPEAT   4   /* Call this hook again */
#define NF_STOP     5   /* Stop processing, accept */

/**
 * Connection tracking integration
 *
 * Netfilter's conntrack module tracks connection state,
 * enabling stateful filtering and NAT.
 */

/* Possible connection states */
enum ip_conntrack_info {
    IP_CT_NEW,              /* First packet of new connection */
    IP_CT_ESTABLISHED,      /* Part of established connection */
    IP_CT_RELATED,          /* Related to established (e.g., FTP data) */
    IP_CT_RELATED_REPLY,    /* Reply to related packet */
    /* ... more states ... */
};

/* Each tracked connection has an entry */
struct nf_conn {
    /* Connection tuple (5-tuple) */
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX];

    /* Connection timeout */
    unsigned long timeout;

    /* State bits */
    unsigned long status;

    /* Expected connections (for protocols with helper) */
    struct hlist_head expectations;

    /* NAT information */
    struct nf_conn_nat *nat;

    /* ... more fields ... */
};
```

While iptables remains widely used, nftables is the modern replacement offering better performance, simpler rule management, and a unified framework for IPv4, IPv6, and ARP filtering. Both use the same underlying Netfilter hooks, but nftables uses a more efficient bytecode-based rule evaluation engine.
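To show how these hooks are consumed, here is a hedged sketch of a minimal kernel module that registers its own handler at the NF_INET_LOCAL_IN hook and counts packets. The module and function names are hypothetical, but nf_register_net_hook()/nf_unregister_net_hook() and the nf_hookfn signature are the real modern registration API:

```c
#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

static atomic_t pkt_count = ATOMIC_INIT(0);

/* Hook callback: runs for every IPv4 packet destined to this host */
static unsigned int count_hook(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    atomic_inc(&pkt_count);
    /* NF_ACCEPT lets the packet continue up the stack;
     * returning NF_DROP here would silently discard it. */
    return NF_ACCEPT;
}

static const struct nf_hook_ops count_ops = {
    .hook     = count_hook,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_LOCAL_IN,   /* same hook ip_local_deliver() traverses */
    .priority = NF_IP_PRI_FIRST,    /* run before iptables/nftables rules */
};

static int __init count_init(void)
{
    /* Register in the initial network namespace */
    return nf_register_net_hook(&init_net, &count_ops);
}

static void __exit count_exit(void)
{
    nf_unregister_net_hook(&init_net, &count_ops);
    pr_info("counted %d packets\n", atomic_read(&pkt_count));
}

module_init(count_init);
module_exit(count_exit);
MODULE_LICENSE("GPL");
```

The priority field determines ordering among handlers at the same hook, which is exactly how iptables chains, connection tracking, and NAT coexist on one hook point.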
The Linux protocol stack is a marvel of software engineering—decades of refinement have produced an implementation that scales from embedded devices to hyperscale data centers. Understanding its architecture is essential for anyone building or optimizing networked systems.
What's next:
With the protocol stack architecture understood, we'll explore network namespaces—the kernel feature that enables complete network stack isolation for containers and virtualization. You'll learn how Linux creates multiple independent networking environments on a single kernel.
You now understand the Linux protocol stack architecture—the layers, data structures, and processing paths that implement networking in the kernel. This knowledge is essential for network performance engineering, debugging complex networking issues, and understanding the behavior of networked applications.