Imagine processing 10 million packets per second per CPU core. Imagine dropping DDoS traffic before it even creates a socket. Imagine implementing custom load balancing logic without modifying your application or adding proxies. Imagine replacing iptables rules with programs that are 10x faster.
This is eBPF networking.
eBPF has revolutionized Linux networking by enabling programmable packet processing directly in the kernel's data path. From XDP (eXpress Data Path) at the driver level to socket-level filtering, eBPF provides unprecedented flexibility and performance.
Cloudflare handles millions of requests per second using eBPF-based DDoS mitigation. Facebook's Katran load balancer serves billions of connections using XDP. Cilium replaces kube-proxy with eBPF-based Kubernetes networking. These aren't experiments—they're production systems serving real traffic at massive scale.
By the end of this page, you will understand XDP and its high-performance packet processing capabilities, TC (Traffic Control) eBPF for classification and shaping, socket-level eBPF for connection steering, and real-world networking applications like load balancing, firewalling, and container networking.
eBPF integrates at multiple points in the Linux networking stack, each offering different tradeoffs between flexibility and performance.
eBPF Attachment Points in the Network Path
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION │
│ (socket read/write) │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ SOCKET LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ SO_ATTACH_BPF, SOCKMAP, SK_SKB, SK_MSG │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ TRANSPORT LAYER (TCP/UDP) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Cgroup socket programs (connect, bind, etc.) │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ NETWORK LAYER (IP) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ LWT (Lightweight Tunnels) │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ TRAFFIC CONTROL (TC) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ TC ingress/egress: BPF_PROG_TYPE_SCHED_CLS │ │
│ │ After netfilter, before qdisc │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ DRIVER / XDP │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ XDP: BPF_PROG_TYPE_XDP │ │
│ │ Before sk_buff creation, in driver receive path │ │
│ └─────────────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────┼────────────────────────────────────┐
│ NIC │
│ Hardware offload (XDP_FLAGS_HW_OFFLOAD) │
└─────────────────────────────────────────────────────────────────┘
| Hook Point | Program Type | Direction | Performance | Use Case |
|---|---|---|---|---|
| XDP | BPF_PROG_TYPE_XDP | Ingress only | Fastest (~10M pps) | DDoS, load balancing, filtering |
| TC | BPF_PROG_TYPE_SCHED_CLS | Ingress + Egress | Fast (~3M pps) | NAT, encapsulation, policing |
| Socket Filter | BPF_PROG_TYPE_SOCKET_FILTER | Per-socket | Good | Packet capture, filtering |
| Sockmap | BPF_PROG_TYPE_SK_SKB | Per-socket | Good | Proxying, redirection |
| cgroup/socket | BPF_PROG_TYPE_CGROUP_* | Per-cgroup | Good | Container networking policy |
| LWT | BPF_PROG_TYPE_LWT_* | Routing | Good | Tunneling, encapsulation |
Use XDP when: you need maximum performance, early packet drops (DDoS), or don't need sk_buff features. Use TC when: you need egress processing, protocol stack access, or more packet manipulation. Use socket-level BPF when: you need per-connection logic, application-layer proxying, or socket steering.
XDP (eXpress Data Path) is the fastest eBPF networking hook, processing packets at the earliest possible point—directly in the network driver, before the kernel creates sk_buff structures.
Why XDP is So Fast
XDP programs run in the NIC driver's receive path, before the kernel allocates an sk_buff or touches the protocol stack. A dropped packet therefore costs no memory allocation, no protocol processing, and no trip up the stack; drivers also process packets in batches straight from the DMA buffers, keeping caches warm and per-packet overhead minimal.
XDP Actions
| Action | Value | Description |
|---|---|---|
| XDP_ABORTED | 0 | Error; drop packet, trace |
| XDP_DROP | 1 | Silently drop packet |
| XDP_PASS | 2 | Pass to normal network stack |
| XDP_TX | 3 | Transmit back out same interface |
| XDP_REDIRECT | 4 | Redirect to another interface/program |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // ============================================// BASIC XDP PACKET FILTER// ============================================SEC("xdp")int xdp_filter(struct xdp_md *ctx) { // Get packet data boundaries void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; // Parse Ethernet header struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; // Only process IPv4 if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; // Parse IP header struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Drop ICMP (ping) if (ip->protocol == IPPROTO_ICMP) return XDP_DROP; // Drop specific source IP (example: 10.0.0.100) if (ip->saddr == bpf_htonl(0x0a000064)) return XDP_DROP; return XDP_PASS;} // ============================================// XDP LOAD BALANCER (Simple Round-Robin)// ============================================struct backend { __be32 ip; unsigned char mac[6];}; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, u32); __type(value, struct backend);} backends SEC(".maps"); struct { __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); __uint(max_entries, 1); __type(key, u32); __type(value, u32);} counter SEC(".maps"); SEC("xdp")int xdp_load_balancer(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Only load balance TCP to port 80 if (ip->protocol != IPPROTO_TCP) return XDP_PASS; struct tcphdr *tcp = (void *)ip + (ip->ihl * 4); if ((void *)(tcp + 1) > data_end) return XDP_DROP; if (tcp->dest != bpf_htons(80)) return XDP_PASS; // Round-robin backend selection u32 zero = 0; u32 *cnt = bpf_map_lookup_elem(&counter, &zero); if 
(!cnt) return XDP_PASS; u32 backend_idx = (*cnt) % 4; __sync_fetch_and_add(cnt, 1); struct backend *be = bpf_map_lookup_elem(&backends, &backend_idx); if (!be) return XDP_PASS; // Rewrite destination IP ip->daddr = be->ip; // Update checksum (simplified - real implementation more complex) ip->check = 0; ip->check = iph_csum(ip); // Rewrite destination MAC __builtin_memcpy(eth->h_dest, be->mac, 6); // TX back out the same interface return XDP_TX;} // ============================================// XDP DDoS MITIGATION// ============================================struct { __uint(type, BPF_MAP_TYPE_LRU_HASH); __uint(max_entries, 100000); __type(key, __be32); // Source IP __type(value, u64); // Packet count} rate_limit SEC(".maps"); const volatile u64 pps_limit = 1000; // Packets per second SEC("xdp")int xdp_ddos_mitigate(struct xdp_md *ctx) { void *data_end = (void *)(long)ctx->data_end; void *data = (void *)(long)ctx->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return XDP_DROP; if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return XDP_DROP; // Rate limit per source IP __be32 src = ip->saddr; u64 *count = bpf_map_lookup_elem(&rate_limit, &src); if (count) { if (*count > pps_limit) { // Exceeded rate limit - drop return XDP_DROP; } __sync_fetch_and_add(count, 1); } else { u64 one = 1; bpf_map_update_elem(&rate_limit, &src, &one, BPF_ANY); } return XDP_PASS;} char LICENSE[] SEC("license") = "GPL";XDP Modes
| Mode | Flag | Performance | Compatibility |
|---|---|---|---|
| Native (Driver) | XDP_FLAGS_DRV_MODE | Best | Driver must support |
| Generic (SKB) | XDP_FLAGS_SKB_MODE | Slower | Works everywhere |
| Hardware Offload | XDP_FLAGS_HW_OFFLOAD | Fastest | Limited NIC support |
Supported Drivers (Native XDP): Most modern NICs support native XDP: Intel ixgbe, i40e, ice; Mellanox mlx4, mlx5; Broadcom bnxt; Amazon ENA; Virtio-net; and many more.
```shell
# Attach XDP program in native mode
ip link set dev eth0 xdp obj xdp_prog.o sec xdp

# Attach in generic mode (fallback)
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp

# Check XDP program attached
ip link show eth0
# Output: ... xdp/id:42 ...

# Detach XDP program
ip link set dev eth0 xdp off
```
XDP programs cannot access the full networking stack. You can't make TCP connections, access routing tables directly, or perform complex L7 inspection. For these use cases, combine XDP with TC or upper-layer eBPF. Also, XDP only handles ingress packets—for egress processing, use TC.
TC (Traffic Control) is the Linux subsystem for queuing, scheduling, and classifying network traffic. TC BPF programs attach at this layer, providing both ingress and egress hooks with full sk_buff access, which allows richer packet manipulation than XDP at the cost of some performance.
TC vs XDP Tradeoffs
| Feature | XDP | TC BPF |
|---|---|---|
| Egress processing | ❌ | ✅ |
| sk_buff access | ❌ | ✅ |
| Performance | ~10M pps | ~3M pps |
| Packet modification | Basic | Full |
| Works with all NICs | Generic mode | ✅ |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // TC return values#define TC_ACT_OK 0 // Accept and continue#define TC_ACT_SHOT 2 // Drop packet#define TC_ACT_REDIRECT 7 // Redirect via bpf_redirect // ============================================// BASIC TC FILTER// ============================================SEC("tc")int tc_filter(struct __sk_buff *skb) { void *data_end = (void *)(long)skb->data_end; void *data = (void *)(long)skb->data; // Parse headers struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return TC_ACT_SHOT; if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return TC_ACT_SHOT; // TC-specific: access skb metadata // skb->ifindex, skb->priority, skb->mark, etc. // Log packet info bpf_printk("TC: len=%d, ifindex=%d, proto=%d", skb->len, skb->ifindex, ip->protocol); return TC_ACT_OK;} // ============================================// TC NAT (DNAT Example)// ============================================SEC("tc")int tc_dnat(struct __sk_buff *skb) { void *data_end = (void *)(long)skb->data_end; void *data = (void *)(long)skb->data; struct ethhdr *eth = data; if ((void *)(eth + 1) > data_end) return TC_ACT_SHOT; if (eth->h_proto != bpf_htons(ETH_P_IP)) return TC_ACT_OK; struct iphdr *ip = (void *)(eth + 1); if ((void *)(ip + 1) > data_end) return TC_ACT_SHOT; // DNAT: rewrite destination IP // Original: 10.0.0.1 -> New: 192.168.1.100 if (ip->daddr == bpf_htonl(0x0a000001)) { __be32 new_ip = bpf_htonl(0xc0a80164); // 192.168.1.100 // Use helper to properly update checksums int ret = bpf_l3_csum_replace(skb, ETH_HLEN + offsetof(struct iphdr, check), ip->daddr, new_ip, 4); if (ret < 0) return TC_ACT_SHOT; // Update L4 checksum too (TCP/UDP) if (ip->protocol == IPPROTO_TCP) { ret = bpf_l4_csum_replace(skb, ETH_HLEN + sizeof(struct iphdr) + offsetof(struct tcphdr, check), ip->daddr, new_ip, BPF_F_PSEUDO_HDR | 4); } // Write new destination IP 
ret = bpf_skb_store_bytes(skb, ETH_HLEN + offsetof(struct iphdr, daddr), &new_ip, 4, 0); } return TC_ACT_OK;} // ============================================// TC ENCAPSULATION (VXLAN Example)// ============================================struct vxlanhdr { __be32 flags; __be32 vni;}; SEC("tc")int tc_vxlan_encap(struct __sk_buff *skb) { // Calculate new headers size int outer_hdr_size = sizeof(struct ethhdr) + sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr); // Expand headroom int ret = bpf_skb_adjust_room(skb, outer_hdr_size, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | BPF_F_ADJ_ROOM_ENCAP_L4_UDP); if (ret < 0) return TC_ACT_SHOT; // Now write outer headers... // (Implementation details omitted for brevity) return TC_ACT_OK;} // ============================================// TC REDIRECT (Container Networking)// ============================================struct { __uint(type, BPF_MAP_TYPE_DEVMAP); __uint(max_entries, 256); __type(key, u32); // Source ifindex __type(value, u32); // Destination ifindex} redirect_map SEC(".maps"); SEC("tc")int tc_redirect_ingress(struct __sk_buff *skb) { u32 ifindex = skb->ifindex; u32 *target = bpf_map_lookup_elem(&redirect_map, &ifindex); if (target) { return bpf_redirect(*target, 0); } return TC_ACT_OK;} char LICENSE[] SEC("license") = "GPL";12345678910111213141516171819202122
# Create clsact qdisc (required for TC BPF)tc qdisc add dev eth0 clsact # Attach TC BPF to ingresstc filter add dev eth0 ingress bpf da obj tc_prog.o sec tc # Attach TC BPF to egresstc filter add dev eth0 egress bpf da obj tc_prog.o sec tc # List attached filterstc filter show dev eth0 ingresstc filter show dev eth0 egress # Remove filterstc filter del dev eth0 ingresstc filter del dev eth0 egress # Remove qdisctc qdisc del dev eth0 clsact # Debug: view TC BPF execution statscat /sys/kernel/debug/tracing/trace_pipe # for bpf_printk outputTC BPF excels at: NAT/masquerading, packet encapsulation (VXLAN, GENEVE), container networking (veth redirection), traffic policing, and packet marking. It's the primary technology behind Cilium's data path, replacing iptables with eBPF for Kubernetes networking.
eBPF also operates at the socket level, enabling powerful per-connection logic:
Socket BPF Program Types
| Type | Purpose | Example Use |
|---|---|---|
| BPF_PROG_TYPE_SOCKET_FILTER | Filter packets on a socket | tcpdump-style capture |
| BPF_PROG_TYPE_SK_SKB | Redirect between sockets | Kernel-level proxy |
| BPF_PROG_TYPE_SK_MSG | Filter/redirect socket messages | L7 proxying |
| BPF_PROG_TYPE_SOCK_OPS | Socket operation hooks | TCP tuning per socket |
| BPF_PROG_TYPE_CGROUP_SOCK | cgroup socket operations | Container networking |
#include "vmlinux.h"#include <bpf/bpf_helpers.h>#include <bpf/bpf_endian.h> // ============================================// SOCKMAP: Kernel-Space TCP Proxy// ============================================// SOCKMAP enables redirecting traffic between sockets// without going through user space - kernel-to-kernel copy struct { __uint(type, BPF_MAP_TYPE_SOCKHASH); __uint(max_entries, 65535); __type(key, struct sock_key); __type(value, u32); // Socket cookie} sock_hash SEC(".maps"); struct sock_key { __be32 local_ip; __be32 remote_ip; __be16 local_port; __be16 remote_port;}; // Called for each incoming packet on sockets in the sockmapSEC("sk_skb/stream_verdict")int stream_verdict(struct __sk_buff *skb) { struct sock_key key = {}; // Extract connection tuple key.local_ip = skb->local_ip4; key.remote_ip = skb->remote_ip4; key.local_port = skb->local_port; key.remote_port = bpf_htons(skb->remote_port >> 16); // Redirect to peer socket return bpf_sk_redirect_hash(skb, &sock_hash, &key, 0);} // ============================================// CGROUP/SOCK: Container Network Policy// ============================================SEC("cgroup/connect4")int cgroup_connect4(struct bpf_sock_addr *ctx) { // Called when a socket in this cgroup calls connect() // Block connections to specific IP (e.g., metadata service) // 169.254.169.254 = AWS metadata service if (ctx->user_ip4 == bpf_htonl(0xa9fea9fe)) { return 0; // Block } // Redirect connections (service mesh use case) // Intercept connections to 10.0.0.0/8 and redirect to local proxy __be32 orig_dst = ctx->user_ip4; if ((orig_dst & bpf_htonl(0xff000000)) == bpf_htonl(0x0a000000)) { ctx->user_ip4 = bpf_htonl(0x7f000001); // 127.0.0.1 ctx->user_port = bpf_htons(15001); // Proxy port } return 1; // Allow (modified)} SEC("cgroup/bind4")int cgroup_bind4(struct bpf_sock_addr *ctx) { // Called when a socket in this cgroup calls bind() // Prevent binding to privileged ports __be16 port = ctx->user_port; if (bpf_ntohs(port) < 1024) { return 
0; // Block } return 1; // Allow} // ============================================// SOCK_OPS: TCP Tuning// ============================================SEC("sockops")int sock_ops_handler(struct bpf_sock_ops *skops) { int op = skops->op; switch (op) { case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: // Connection established - tune TCP parameters // Enable TCP congestion control algorithm bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION, "bbr", sizeof("bbr")); // Set TCP keepalive int keepalive = 1; bpf_setsockopt(skops, SOL_SOCKET, SO_KEEPALIVE, &keepalive, sizeof(keepalive)); // Add socket to sockmap for potential proxying struct sock_key key = {}; key.local_ip = skops->local_ip4; key.remote_ip = skops->remote_ip4; key.local_port = skops->local_port; key.remote_port = bpf_htons(skops->remote_port >> 16); bpf_sock_hash_update(skops, &sock_hash, &key, BPF_ANY); break; case BPF_SOCK_OPS_STATE_CB: // Socket state change - cleanup on close if (skops->args[1] == BPF_TCP_CLOSE) { // Remove from maps, cleanup... } break; } return 1;} char LICENSE[] SEC("license") = "GPL";SOCKMAP Proxy Architecture
SOCKMAP enables building high-performance proxies that operate entirely in kernel space:
┌──────────────────────────────────────────────────────────────┐
│ Traditional Proxy │
│ │
│ Client ──TCP──▶ Proxy ──TCP──▶ Server │
│ │ │
│ [User Space] │
│ read() / write() │
│ context switches │
│ memory copies │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ SOCKMAP Proxy │
│ │
│ Client ──TCP──▶ Kernel ──TCP──▶ Server │
│ │ │
│ [Kernel Space] │
│ bpf_sk_redirect_hash() │
│ zero copies │
│ no user-space switch │
└──────────────────────────────────────────────────────────────┘
SOCKMAP proxies can achieve 2-3x higher throughput than traditional user-space proxies like Envoy for TCP passthrough workloads.
Cilium combines multiple eBPF types: TC BPF for L3/L4 policies, socket-level BPF for service mesh acceleration, and cgroup BPF for Kubernetes network policies. This layered approach provides defense in depth while maintaining high performance.
Let's examine how major companies and projects use eBPF for production networking.
Cloudflare: XDP for DDoS Mitigation
Cloudflare handles massive DDoS attacks (over 2 Tbps) using XDP at the edge. Their pipeline detects attacks, generates packet-matching rules from attack signatures, compiles those rules into XDP programs, and deploys them fleet-wide so malicious packets are dropped in the driver before consuming any stack resources.
Result: Multi-million PPS drop rates with minimal CPU overhead.
Facebook Katran: XDP Load Balancer
Katran is Facebook's L4 load balancer, open-sourced and handling billions of connections:
```c
// Simplified Katran-style load balancer concept
// Actual Katran: github.com/facebookincubator/katran

struct vip_meta {
    u32 flags;
    u32 num_backends;
};

struct real_definition {
    __be32 dst;   // Backend IP
    __u8 mac[6];  // Backend MAC
    u8 flags;
};

// VIP to metadata
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __be32);  // VIP address
    __type(value, struct vip_meta);
} vip_map SEC(".maps");

// VIP + hash -> backend
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
    __uint(max_entries, 1024);
    __type(key, u32);  // VIP index
    __array(values, struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 256);
        __type(key, u32);
        __type(value, struct real_definition);
    });
} ch_rings SEC(".maps");  // Consistent hashing rings

SEC("xdp")
int xdp_balancer(struct xdp_md *ctx) {
    // 1. Parse packet headers
    // (omitted - standard parsing; `eth`, `ip`, and `tcp` below are the
    // validated header pointers that parsing would produce)

    // 2. Check if destination is a VIP
    struct vip_meta *vip = bpf_map_lookup_elem(&vip_map, &ip->daddr);
    if (!vip)
        return XDP_PASS;  // Not a VIP, continue normal processing

    // 3. Compute flow hash (5-tuple)
    u32 hash = jhash_3words(
        ip->saddr, ip->daddr,
        ((__u32)tcp->source << 16) | tcp->dest,
        0
    );

    // 4. Consistent hashing: select backend
    u32 vip_idx = /* VIP index from somewhere */;
    void *ring = bpf_map_lookup_elem(&ch_rings, &vip_idx);
    if (!ring)
        return XDP_DROP;

    u32 slot = hash % 256;  // Simplified
    struct real_definition *backend = bpf_map_lookup_elem(ring, &slot);
    if (!backend)
        return XDP_DROP;

    // 5. Rewrite packet: DNAT + MAC rewrite
    ip->daddr = backend->dst;
    // Update checksums...
    __builtin_memcpy(eth->h_dest, backend->mac, 6);

    // 6. Send to backend (via XDP_TX or bpf_redirect)
    return XDP_TX;
}
```
| Company/Project | Use Case | Technology | Scale |
|---|---|---|---|
| Cloudflare | DDoS mitigation | XDP | 2+ Tbps attack mitigation |
| Facebook/Meta | L4 load balancing (Katran) | XDP | Billions of connections |
| Cilium | Kubernetes CNI | XDP + TC + Socket | Millions of pods |
| Netflix | Network observability | TC BPF + tracing | Thousands of instances |
| Google GKE | Dataplane V2 (Cilium) | XDP + TC | Production GKE clusters |
| AWS | Firewall (WAF) | XDP | Edge protection |
Cilium: Kubernetes Networking
Cilium replaces kube-proxy and traditional CNI plugins with an eBPF-native data path:
┌──────────────────────────────────────────────────────────────┐
│ Traditional kube-proxy │
│ │
│ Pod A ──veth──▶ Host ──iptables──▶ Host ──veth──▶ Pod B │
│ │ │
│ Conntrack │
│ NAT tables │
│ O(N) rules │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Cilium (eBPF) │
│ │
│ Pod A ──veth──TC BPF──▶ bpf_redirect ──▶ veth──TC BPF──▶ Pod B │
│ │ │
│ BPF maps │
│ O(1) lookups │
│ No iptables │
└──────────────────────────────────────────────────────────────┘
Benefits: O(1) service lookups in BPF maps instead of O(N) iptables rule traversal, no conntrack or NAT table overhead on the pod-to-pod path, and direct veth-to-veth redirection via bpf_redirect that bypasses much of the host stack.
Production deployments at this scale have settled the question: eBPF networking works for performance-critical infrastructure. If you're building networking infrastructure today, eBPF should be a primary consideration.
Understanding eBPF networking performance helps you choose the right approach.
Packet Processing Rates
| Mode | Typical Performance | Notes |
|---|---|---|
| XDP Native | 10-15M pps per core | Driver-dependent |
| XDP Generic | 2-4M pps per core | Fallback mode |
| XDP Offload | 40-100M pps | NIC-dependent, limited programs |
| TC BPF | 2-5M pps per core | After sk_buff creation |
| iptables (comparison) | 0.5-2M pps | Rule-count dependent |
Memory Overhead
| Component | Memory Usage |
|---|---|
| eBPF program (typical) | 4-64 KB |
| BPF maps | Depends on max_entries × value_size |
| Per-CPU maps | max_entries × value_size × num_CPUs |
| Ring buffer | As configured (256KB-64MB typical) |
```shell
# Generate test traffic with pktgen (kernel module)
modprobe pktgen

# Configure pktgen (example for 10M pps test)
cd /proc/net/pktgen
echo "rem_device_all" > kpktgend_0
echo "add_device eth1" > kpktgend_0

echo "count 50000000" > eth1
echo "pkt_size 64" > eth1
echo "ratep 10000000" > eth1          # 10M pps
echo "dst 192.168.1.100" > eth1
echo "dst_mac 00:11:22:33:44:55" > eth1

echo "start" > pgctrl

# Check XDP statistics
ethtool -S eth0 | grep xdp

# BPF program statistics (enable with: sysctl -w kernel.bpf_stats_enabled=1)
bpftool --json prog show id 42 | jq '.run_cnt, .run_time_ns'
# Per-call overhead = run_time_ns / run_cnt (nanoseconds per invocation)

# Inspect map details
bpftool --json map show id 5 | jq

# Check drop stats
cat /sys/class/net/eth0/statistics/rx_dropped

# List XDP tracepoints (enable these for redirect/exception stats)
cat /sys/kernel/debug/tracing/events/xdp/*/enable
```
Optimization tips: mark helper functions __always_inline (older verifiers effectively required this anyway), and use bpftool prog statistics to identify hot paths worth optimizing.
Synthetic benchmarks (pps rates) are useful but don't tell the whole story. Measure latency distributions (p50, p99), tail latencies, and CPU utilization under realistic traffic patterns. XDP's advantage grows with packet volume—at low rates, the difference may be negligible.
Building production eBPF networking applications requires understanding the complete workflow and best practices.
Development Workflow
```c
// xdp_firewall.bpf.c - Simple XDP firewall

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// ETH_P_IP is a macro from if_ether.h, not present in vmlinux.h
#ifndef ETH_P_IP
#define ETH_P_IP 0x0800
#endif

// Block list: IPs to drop
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __be32);  // IP address
    __type(value, u64);   // Block count
} blocklist SEC(".maps");

// Statistics
struct stats {
    u64 passed;
    u64 dropped;
};

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, struct stats);
} statistics SEC(".maps");

SEC("xdp")
int xdp_firewall(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    u32 zero = 0;
    struct stats *stats = bpf_map_lookup_elem(&statistics, &zero);
    if (!stats)
        return XDP_ABORTED;

    // Parse Ethernet
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) {
        stats->dropped++;
        return XDP_DROP;
    }

    // Only process IPv4
    if (eth->h_proto != bpf_htons(ETH_P_IP)) {
        stats->passed++;
        return XDP_PASS;
    }

    // Parse IP
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) {
        stats->dropped++;
        return XDP_DROP;
    }

    // Check blocklist
    u64 *count = bpf_map_lookup_elem(&blocklist, &ip->saddr);
    if (count) {
        // IP is blocked - increment counter and drop
        __sync_fetch_and_add(count, 1);
        stats->dropped++;
        return XDP_DROP;
    }

    stats->passed++;
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```
Use network namespaces for isolated testing: ip netns add test; ip link add veth0 type veth peer name veth1; ip link set veth1 netns test. This creates a virtual network for testing without affecting production traffic. Combine with traffic generators like iperf3 or pktgen for load testing.
We've explored eBPF's transformative impact on Linux networking: XDP for driver-level packet processing, TC BPF for bidirectional classification and rewriting, socket-level programs for connection steering, and the production systems built on all three.
What's Next:
Now that you understand eBPF's networking capabilities, the final page explores security applications—how eBPF enables runtime security monitoring, threat detection, system call filtering, and security policy enforcement in modern infrastructure.
You now understand how eBPF has revolutionized Linux networking—from packet-level processing with XDP to application-layer proxying with SOCKMAP. In the next page, we'll see how eBPF is equally transformative for security monitoring and enforcement.