Imagine processing 10 million packets per second per CPU core. Imagine dropping DDoS traffic before it even creates a socket. Imagine implementing custom load balancing logic without modifying your application or adding proxies. Imagine replacing iptables rules with programs that are 10x faster.
This is eBPF networking.
eBPF has revolutionized Linux networking by enabling programmable packet processing directly in the kernel's data path. From XDP (eXpress Data Path) at the driver level to socket-level filtering, eBPF provides unprecedented flexibility and performance.
Cloudflare handles millions of requests per second using eBPF-based DDoS mitigation. Facebook's Katran load balancer serves billions of connections using XDP. Cilium replaces kube-proxy with eBPF-based Kubernetes networking. These aren't experiments—they're production systems serving real traffic at massive scale.
By the end of this page, you will understand XDP and its high-performance packet processing capabilities, TC (Traffic Control) eBPF for classification and shaping, socket-level eBPF for connection steering, and real-world networking applications like load balancing, firewalling, and container networking.
eBPF integrates at multiple points in the Linux networking stack, each offering different tradeoffs between flexibility and performance.
eBPF Attachment Points in the Network Path
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION │
│ (socket read/write) │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ SOCKET LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SO_ATTACH_BPF, SOCKMAP, SK_SKB, SK_MSG │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ TRANSPORT LAYER (TCP/UDP) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cgroup socket programs (connect, bind, etc.) │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ NETWORK LAYER (IP) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LWT (Lightweight Tunnels) │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ TRAFFIC CONTROL (TC) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ TC ingress/egress: BPF_PROG_TYPE_SCHED_CLS │ │
│ │ After netfilter, before qdisc │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ DRIVER / XDP │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ XDP: BPF_PROG_TYPE_XDP │ │
│ │ Before sk_buff creation, in driver receive path │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ NIC │
│ Hardware offload (XDP_FLAGS_HW_OFFLOAD) │
└─────────────────────────────────────────────────────────────────┘
| Hook Point | Program Type | Direction | Performance | Use Case |
|---|---|---|---|---|
| XDP | BPF_PROG_TYPE_XDP | Ingress only | Fastest (~10M pps) | DDoS, load balancing, filtering |
| TC | BPF_PROG_TYPE_SCHED_CLS | Ingress + Egress | Fast (~3M pps) | NAT, encapsulation, policing |
| Socket Filter | BPF_PROG_TYPE_SOCKET_FILTER | Per-socket | Good | Packet capture, filtering |
| Sockmap | BPF_PROG_TYPE_SK_SKB | Per-socket | Good | Proxying, redirection |
| cgroup/socket | BPF_PROG_TYPE_CGROUP_* | Per-cgroup | Good | Container networking policy |
| LWT | BPF_PROG_TYPE_LWT_* | Routing | Good | Tunneling, encapsulation |
Use XDP when: you need maximum performance, early packet drops (DDoS), or don't need sk_buff features. Use TC when: you need egress processing, protocol stack access, or more packet manipulation. Use socket-level BPF when: you need per-connection logic, application-layer proxying, or socket steering.
XDP (eXpress Data Path) is the fastest eBPF networking hook, processing packets at the earliest possible point—directly in the network driver, before the kernel creates sk_buff structures.
Why XDP is So Fast
XDP programs run in the NIC driver's receive path, before the kernel allocates an sk_buff or touches the protocol stack. A dropped packet therefore costs no memory allocation, no protocol processing, and no trip up the stack; drivers also process packets in batches straight from the DMA buffers, keeping caches warm and per-packet overhead minimal.
XDP Actions
| Action | Value | Description |
|---|---|---|
| XDP_ABORTED | 0 | Error; drop packet, trace |
| XDP_DROP | 1 | Silently drop packet |
| XDP_PASS | 2 | Pass to normal network stack |
| XDP_TX | 3 | Transmit back out same interface |
| XDP_REDIRECT | 4 | Redirect to another interface/program |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // ============================================// BASIC XDP PACKET FILTER// ============================================SEC("xdp")int xdp_filter(struct xdp_md *ctx) { // Get packet data boundaries void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; // Parse Ethernet header struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; // Only process IPv4 if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; // Parse IP header struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Drop ICMP (ping) if (ip->protocol == IPPROTO_ICMP) return XDP_DROP; // Drop specific source IP (example: 10.0.0.100) if (ip->saddr == bpf_htonl(0x0a000064)) return XDP_DROP; return XDP_PASS;} // ============================================// XDP LOAD BALANCER (Simple Round-Robin)// ============================================struct backend { __be32 ip; unsigned char mac[6];}; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, u32); __type(value, struct backend);} backends SEC(".maps"); struct { __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); __uint(max_entries, 1); __type(key, u32); __type(value, u32);} counter SEC(".maps"); SEC("xdp")int xdp_load_balancer(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Only load balance TCP to port 80 if (ip->protocol != IPPROTO_TCP) return XDP_PASS; struct tcphdr *tcp = (void *)ip + (ip->ihl * 4); if ((void *)(tcp + 1) > data_end) return XDP_DROP; if (tcp->dest != bpf_htons(80)) return XDP_PASS; // Round-robin backend selection u32 zero = 0; u32 *cnt = bpf_map_lookup_elem(&counter, &zero); if 
(!cnt) return XDP_PASS; u32 backend_idx = (*cnt) % 4; __sync_fetch_and_add(cnt, 1); struct backend *be = bpf_map_lookup_elem(&backends, &backend_idx); if (!be) return XDP_PASS; // Rewrite destination IP ip->daddr = be->ip; // Update checksum (simplified - real implementation more complex) ip->check = 0; ip->check = iph_csum(ip); // Rewrite destination MAC __builtin_memcpy(eth->h_dest, be->mac, 6); // TX back out the same interface return XDP_TX;} // ============================================// XDP DDoS MITIGATION// ============================================struct { __uint(type, BPF_MAP_TYPE_LRU_HASH); __uint(max_entries, 100000); __type(key, __be32); // Source IP __type(value, u64); // Packet count} rate_limit SEC(".maps"); const volatile u64 pps_limit = 1000; // Packets per second SEC("xdp")int xdp_ddos_mitigate(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Rate limit per source IP __be32 src = ip->saddr; u64 *count = bpf_map_lookup_elem(&rate_limit, &src); if (count) { if (*count > pps_limit) { // Exceeded rate limit - drop return XDP_DROP; } __sync_fetch_and_add(count, 1); } else { u64 one = 1; bpf_map_update_elem(&rate_limit, &src, &one, BPF_ANY); } return XDP_PASS;} char LICENSE[] SEC("license") = "GPL";XDP Modes
| Mode | Flag | Performance | Compatibility |
|---|---|---|---|
| Native (Driver) | XDP_FLAGS_DRV_MODE | Best | Driver must support |
| Generic (SKB) | XDP_FLAGS_SKB_MODE | Slower | Works everywhere |
| Hardware Offload | XDP_FLAGS_HW_OFFLOAD | Fastest | Limited NIC support |
Supported Drivers (Native XDP): Most modern NICs support native XDP: Intel ixgbe, i40e, ice; Mellanox mlx4, mlx5; Broadcom bnxt; Amazon ENA; Virtio-net; and many more.
```shell
# Attach XDP program in native mode
ip link set dev eth0 xdp obj xdp_prog.o sec xdp

# Attach in generic mode (fallback)
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp

# Check XDP program attached
ip link show eth0
# Output: ... xdp/id:42 ...

# Detach XDP program
ip link set dev eth0 xdp off
```
XDP programs cannot access the full networking stack. You can't make TCP connections, access routing tables directly, or perform complex L7 inspection. For these use cases, combine XDP with TC or upper-layer eBPF. Also, XDP only handles ingress packets—for egress processing, use TC.
TC (Traffic Control) is the Linux subsystem for queuing, scheduling, and classifying network traffic. TC BPF programs attach at this layer, providing both ingress and egress hooks with full sk_buff access, which allows richer packet manipulation than XDP at the cost of some performance.
TC vs XDP Tradeoffs
| Feature | XDP | TC BPF |
|---|---|---|
| Egress processing | ❌ | ✅ |
| sk_buff access | ❌ | ✅ |
| Performance | ~10M pps | ~3M pps |
| Packet modification | Basic | Full |
| Works with all NICs | Generic mode | ✅ |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // TC return values#define TC_ACT_OK 0 // Accept and continue#define TC_ACT_SHOT 2 // Drop packet#define TC_ACT_REDIRECT 7 // Redirect via bpf_redirect // ============================================// BASIC TC FILTER// ============================================SEC("tc")int tc_filter(struct __sk_buff *skb) { void *data_end = (void *)(long)skb->data_end; void *data = (void *)(long)skb->data; // Parse headers struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return TC_ACT_SHOT; if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return TC_ACT_SHOT; // TC-specific: access skb metadata // skb->ifindex, skb->priority, skb->mark, etc. // Log packet info bpf_printk("TC: len=%d, ifindex=%d, proto=%d", skb->len, skb->ifindex, ip->protocol); return TC_ACT_OK;} // ============================================// TC NAT (DNAT Example)// ============================================SEC("tc")int tc_dnat(struct __sk_buff *skb) { void *data_end = (void *)(long)skb->data_end; void *data = (void *)(long)skb->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return TC_ACT_SHOT; if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return TC_ACT_SHOT; // DNAT: rewrite destination IP // Original: 10.0.0.1 -> New: 192.168.1.100 if (ip->daddr == bpf_htonl(0x0a000001)) { __be32 new_ip = bpf_htonl(0xc0a80164); // 192.168.1.100 // Use helper to properly update checksums int ret = bpf_l3_csum_replace(skb, ETH_HLEN + offsetof(struct iphdr, check), ip->daddr, new_ip, 4); if (ret < 0) return TC_ACT_SHOT; // Update L4 checksum too (TCP/UDP) if (ip->protocol == IPPROTO_TCP) { ret = bpf_l4_csum_replace(skb, ETH_HLEN + sizeof(struct iphdr) + offsetof(struct tcphdr, check), ip->daddr, new_ip, BPF_F_PSEUDO_HDR | 4); } // Write new destination IP 
ret = bpf_skb_store_bytes(skb, ETH_HLEN + offsetof(struct iphdr, daddr), &new_ip, 4, 0); } return TC_ACT_OK;} // ============================================// TC ENCAPSULATION (VXLAN Example)// ============================================struct vxlanhdr { __be32 flags; __be32 vni;}; SEC("tc")int tc_vxlan_encap(struct __sk_buff *skb) { // Calculate new headers size int outer_hdr_size = sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr); // Expand headroom int ret = bpf_skb_adjust_room(skb, outer_hdr_size, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | BPF_F_ADJ_ROOM_ENCAP_L4_UDP); if (ret < 0) return TC_ACT_SHOT; // Now write outer headers... // (Implementation details omitted for brevity) return TC_ACT_OK;} // ============================================// TC REDIRECT (Container Networking)// ============================================struct { __uint(type, BPF_MAP_TYPE_DEVMAP); __uint(max_entries, 256); __type(key, u32); // Source ifindex __type(value, u32); // Destination ifindex} redirect_map SEC(".maps"); SEC("tc")int tc_redirect_ingress(struct __sk_buff *skb) { u32 ifindex = skb->ifindex; u32 *target = bpf_map_lookup_elem(&redirect_map, &ifindex); if (target) { return bpf_redirect(*target, 0); } return TC_ACT_OK;} char LICENSE[] SEC("license") = "GPL";12345678910111213141516171819202122
# Create clsact qdisc (required for TC BPF)tc qdisc add dev eth0 clsact # Attach TC BPF to ingresstc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc # Attach TC BPF to egresstc filter add dev eth0 egress bpf da obj tc_prog.o sec tc # List attached filterstc filter show dev eth0 ingresstc filter show dev eth0 egress # Remove filterstc filter del dev eth0 ingresstc filter del dev eth0 egress # Remove qdisctc qdisc del dev eth0 clsact # Debug: view TC BPF execution statscat /sys/kernel/debug/tracing/trace_pipe # for bpf_printk outputTC BPF excels at: NAT/masquerading, packet encapsulation (VXLAN, GENEVE), container networking (veth redirection), traffic policing, and packet marking. It's the primary technology behind Cilium's data path, replacing iptables with eBPF for Kubernetes networking.
eBPF also operates at the socket level, enabling powerful per-connection logic:
Socket BPF Program Types
| Type | Purpose | Example Use |
|---|---|---|
| BPF_PROG_TYPE_SOCKET_FILTER | Filter packets on a socket | tcpdump-style capture |
| BPF_PROG_TYPE_SK_SKB | Redirect between sockets | Kernel-level proxy |
| BPF_PROG_TYPE_SK_MSG | Filter/redirect socket messages | L7 proxying |
| BPF_PROG_TYPE_SOCK_OPS | Socket operation hooks | TCP tuning per socket |
| BPF_PROG_TYPE_CGROUP_SOCK | cgroup socket operations | Container networking |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // ============================================// SOCKMAP: Kernel-Space TCP Proxy// ============================================// SOCKMAP enables redirecting traffic between sockets// without going through user space - kernel-to-kernel copy struct { __uint(type, BPF_MAP_TYPE_SOCKHASH); __uint(max_entries, 65535); __type(key, struct sock_key); __type(value, u32); // Socket cookie} sock_hash SEC(".maps"); struct sock_key { __be32 local_ip; __be32 remote_ip; __be16 local_port; __be16 remote_port;}; // Called for each incoming packet on sockets in the sockmapSEC("sk_skb/stream_verdict")int stream_verdict(struct __sk_buff *skb) { struct sock_key key = {}; // Extract connection tuple key.local_ip = skb->local_ip4; key.remote_ip = skb->remote_ip4; key.local_port = skb->local_port; key.remote_port = bpf_htons(skb->remote_port >> 16); // Redirect to peer socket return bpf_sk_redirect_hash(skb, &sock_hash, &key, 0);} // ============================================// CGROUP/SOCK: Container Network Policy// ============================================SEC("cgroup/connect4")int cgroup_connect4(struct bpf_sock_addr *ctx) { // Called when a socket in this cgroup calls connect() // Block connections to specific IP (e.g., metadata service) // 169.254.169.254 = AWS metadata service if (ctx->user_ip4 == bpf_htonl(0xa9fea9fe)) { return 0; // Block } // Redirect connections (service mesh use case) // Intercept connections to 10.0.0.0/8 and redirect to local proxy __be32 orig_dst = ctx->user_ip4; if ((orig_dst & bpf_htonl(0xff000000)) == bpf_htonl(0x0a000000)) { ctx->user_ip4 = bpf_htonl(0x7f000001); // 127.0.0.1 ctx->user_port = bpf_htons(15001); // Proxy port } return 1; // Allow (modified)} SEC("cgroup/bind4")int cgroup_bind4(struct bpf_sock_addr *ctx) { // Called when a socket in this cgroup calls bind() // Prevent binding to privileged ports __be16 port = ctx->user_port; if (bpf_ntohs(port) < 1024) { return 
0; // Block } return 1; // Allow} // ============================================// SOCK_OPS: TCP Tuning// ============================================SEC("sockops")int sock_ops_handler(struct bpf_sock_ops *skops) { int op = skops->op; switch (op) { case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: // Connection established - tune TCP parameters // Enable TCP congestion control algorithm bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION, "bbr", sizeof("bbr")); // Set TCP keepalive int keepalive = 1; bpf_setsockopt(skops, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive)); // Add socket to sockmap for potential proxying struct sock_key key = {}; key.local_ip = skops->local_ip4; key.remote_ip = skops->remote_ip4; key.local_port = skops->local_port; key.remote_port = bpf_htons(skops->remote_port >> 16); bpf_sock_hash_update(skops, &sock_hash, &key, BPF_ANY); break; case BPF_SOCK_OPS_STATE_CB: // Socket state change - cleanup on close if (skops->args[1] == BPF_TCP_CLOSE) { // Remove from maps, cleanup... } break; } return 1;} char LICENSE[] SEC("license") = "GPL";SOCKMAP Proxy Architecture
SOCKMAP enables building high-performance proxies that operate entirely in kernel space:
┌──────────────────────────────────────────────────────────────┐
│ Traditional Proxy │
│ │
│ Client ──TCP──▶ Proxy ──TCP──▶ Server │
│ │ │
│ [User Space] │
│ read() / write() │
│ context switches │
│ memory copies │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SOCKMAP Proxy │
│ │
│ Client ──TCP──▶ Kernel ──TCP──▶ Server │
│ │ │
│ [Kernel Space] │
│ bpf_sk_redirect_hash() │
│ zero copies │
│ no user-space switch │
└──────────────────────────────────────────────────────────────┘
SOCKMAP proxies can achieve 2-3x higher throughput than traditional user-space proxies like Envoy for TCP passthrough workloads.
Cilium combines multiple eBPF types: TC BPF for L3/L4 policies, socket-level BPF for service mesh acceleration, and cgroup BPF for Kubernetes network policies. This layered approach provides defense in depth while maintaining high performance.
Let's examine how major companies and projects use eBPF for production networking.
Cloudflare: XDP for DDoS Mitigation
Cloudflare handles massive DDoS attacks (over 2 Tbps) using XDP at the edge. Their pipeline detects attacks, generates packet-matching rules from attack signatures, compiles those rules into XDP programs, and deploys them fleet-wide so malicious packets are dropped in the driver before consuming any stack resources.
Result: Multi-million PPS drop rates with minimal CPU overhead.
Facebook Katran: XDP Load Balancer
Katran is Facebook's L4 load balancer, open-sourced and handling billions of connections:
```c
// Simplified Katran-style load balancer concept
// Actual Katran: github.com/facebookincubator/katran

struct vip_meta {
    u32 flags;
    u32 num_backends;
};

struct real_definition {
    __be32 dst;   // Backend IP
    __u8 mac[6];  // Backend MAC
    u8 flags;
};

// VIP to metadata
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __be32);  // VIP address
    __type(value, struct vip_meta);
} vip_map SEC(".maps");

// VIP + hash -> backend
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 1024);
    __type(key, u32);  // VIP index
    __array(values, struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 256);
        __type(key, u32);
        __type(value, struct real_definition);
    });
} ch_rings SEC(".maps");  // Consistent hashing rings

SEC("xdp")
int xdp_balancer(struct xdp_md *ctx) {
    // 1. Parse packet headers
    // (omitted - standard parsing; `eth`, `ip`, and `tcp` below are the
    // validated header pointers that parsing would produce)

    // 2. Check if destination is a VIP
    struct vip_meta *vip = bpf_map_lookup_elem(&vip_map, &ip->daddr);
    if (!vip)
        return XDP_PASS;  // Not a VIP, continue normal processing

    // 3. Compute flow hash (5-tuple)
    u32 hash = jhash_3words(
        ip->saddr, ip->daddr,
        ((__u32)tcp->source << 16) | tcp->dest,
        0
    );

    // 4. Consistent hashing: select backend
    u32 vip_idx = /* VIP index from somewhere */;
    void *ring = bpf_map_lookup_elem(&ch_rings, &vip_idx);
    if (!ring)
        return XDP_DROP;

    u32 slot = hash % 256;  // Simplified
    struct real_definition *backend = bpf_map_lookup_elem(ring, &slot);
    if (!backend)
        return XDP_DROP;

    // 5. Rewrite packet: DNAT + MAC rewrite
    ip->daddr = backend->dst;
    // Update checksums...
    __builtin_memcpy(eth->h_dest, backend->mac, 6);

    // 6. Send to backend (via XDP_TX or bpf_redirect)
    return XDP_TX;
}
```
| Company/Project | Use Case | Technology | Scale |
|---|---|---|---|
| Cloudflare | DDoS mitigation | XDP | 2+ Tbps attack mitigation |
| Facebook/Meta | L4 load balancing (Katran) | XDP | Billions of connections |
| Cilium | Kubernetes CNI | XDP + TC + Socket | Millions of pods |
| Netflix | Network observability | TC BPF + tracing | Thousands of instances |
| Google GKE | Dataplane V2 (Cilium) | XDP + TC | Production GKE clusters |
| AWS | Firewall (WAF) | XDP | Edge protection |
Cilium: Kubernetes Networking
Cilium replaces kube-proxy and traditional CNI plugins with an eBPF-native data path:
┌──────────────────────────────────────────────────────────────┐
│ Traditional kube-proxy │
│ │
│ Pod A ──veth──▶ Host ──iptables──▶ Host ──veth──▶ Pod B │
│ │ │
│ Conntrack │
│ NAT tables │
│ O(N) rules │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Cilium (eBPF) │
│ │
│ Pod A ──veth──TC BPF──▶ bpf_redirect ──▶ veth──TC BPF──▶ Pod B │
│ │ │
│ BPF maps │
│ O(1) lookups │
│ No iptables │
└──────────────────────────────────────────────────────────────┘
Benefits: O(1) service lookups in BPF maps instead of O(N) iptables rule traversal, no conntrack or NAT table overhead on the pod-to-pod path, and direct veth-to-veth redirection via bpf_redirect that bypasses much of the host stack.
Production deployments at this scale have settled the question: eBPF networking works for performance-critical infrastructure. If you're building networking infrastructure today, eBPF should be a primary consideration.
Understanding eBPF networking performance helps you choose the right approach.
Packet Processing Rates
| Mode | Typical Performance | Notes |
|---|---|---|
| XDP Native | 10-15M pps per core | Driver-dependent |
| XDP Generic | 2-4M pps per core | Fallback mode |
| XDP Offload | 40-100M pps | NIC-dependent, limited programs |
| TC BPF | 2-5M pps per core | After sk_buff creation |
| iptables (comparison) | 0.5-2M pps | Rule-count dependent |
Memory Overhead
| Component | Memory Usage |
|---|---|
| eBPF program (typical) | 4-64 KB |
| BPF maps | Depends on max_entries × value_size |
| Per-CPU maps | max_entries × value_size × num_CPUs |
| Ring buffer | As configured (256KB-64MB typical) |
```shell
# Generate test traffic with pktgen (kernel module)
modprobe pktgen

# Configure pktgen (example for 10M pps test)
cd /proc/net/pktgen
echo "rem_device_all" > kpktgend_0
echo "add_device eth1" > kpktgend_0

echo "count 50000000" > eth1
echo "pkt_size 64" > eth1
echo "ratep 10000000" > eth1          # 10M pps
echo "dst 192.168.1.100" > eth1
echo "dst_mac 00:11:22:33:44:55" > eth1

echo "start" > pgctrl

# Check XDP statistics
ethtool -S eth0 | grep xdp

# BPF program statistics (enable with: sysctl -w kernel.bpf_stats_enabled=1)
bpftool --json prog show id 42 | jq '.run_cnt, .run_time_ns'
# Per-call overhead = run_time_ns / run_cnt (nanoseconds per invocation)

# Inspect map details
bpftool --json map show id 5 | jq

# Check drop stats
cat /sys/class/net/eth0/statistics/rx_dropped

# List XDP tracepoints (enable these for redirect/exception stats)
cat /sys/kernel/debug/tracing/events/xdp/*/enable
```
Optimization tips: mark helper functions __always_inline (older verifiers effectively required this anyway), and use bpftool prog statistics to identify hot paths worth optimizing.
Synthetic benchmarks (pps rates) are useful but don't tell the whole story. Measure latency distributions (p50, p99), tail latencies, and CPU utilization under realistic traffic patterns. XDP's advantage grows with packet volume—at low rates, the difference may be negligible.
Building production eBPF networking applications requires understanding the complete workflow and best practices.
Development Workflow
```c
// xdp_firewall.bpf.c - Simple XDP firewall

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// ETH_P_IP is a macro from if_ether.h, not present in vmlinux.h
#ifndef ETH_P_IP
#define ETH_P_IP 0x0800
#endif

// Block list: IPs to drop
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __be32);  // IP address
    __type(value, u64);   // Block count
} blocklist SEC(".maps");

// Statistics
struct stats {
    u64 passed;
    u64 dropped;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct stats);
} statistics SEC(".maps");

SEC("xdp")
int xdp_firewall(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    u32 zero = 0;
    struct stats *stats = bpf_map_lookup_elem(&statistics, &zero);
    if (!stats)
        return XDP_ABORTED;

    // Parse Ethernet
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) {
        stats->dropped++;
        return XDP_DROP;
    }

    // Only process IPv4
    if (eth->h_proto != bpf_htons(ETH_P_IP)) {
        stats->passed++;
        return XDP_PASS;
    }

    // Parse IP
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) {
        stats->dropped++;
        return XDP_DROP;
    }

    // Check blocklist
    u64 *count = bpf_map_lookup_elem(&blocklist, &ip->saddr);
    if (count) {
        // IP is blocked - increment counter and drop
        __sync_fetch_and_add(count, 1);
        stats->dropped++;
        return XDP_DROP;
    }

    stats->passed++;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```
Use network namespaces for isolated testing: ip netns add test; ip link add veth0 type veth peer name veth1; ip link set veth1 netns test. This creates a virtual network for testing without affecting production traffic. Combine with traffic generators like iperf3 or pktgen for load testing.
We've explored eBPF's transformative impact on Linux networking: XDP for driver-level packet processing, TC BPF for bidirectional classification and rewriting, socket-level programs for connection steering, and the production systems built on all three.
What's Next:
Now that you understand eBPF's networking capabilities, the final page explores security applications—how eBPF enables runtime security monitoring, threat detection, system call filtering, and security policy enforcement in modern infrastructure.
You now understand how eBPF has revolutionized Linux networking—from packet-level processing with XDP to application-layer proxying with SOCKMAP. In the next page, we'll see how eBPF is equally transformative for security monitoring and enforcement.