Imagine if you could safely inject custom code directly into the Linux kernel without recompiling, without loading kernel modules, and without risking system stability. Imagine observing every system call, every network packet, every file operation—all with near-zero overhead. Imagine implementing custom networking logic, security policies, or performance monitoring tools that run at kernel speeds.
This is eBPF.
eBPF (extended Berkeley Packet Filter) represents the most significant advancement in Linux kernel programmability since loadable kernel modules. It has fundamentally transformed how we approach observability, networking, and security in modern infrastructure. Companies like Netflix, Facebook, Google, and Cloudflare have built critical infrastructure components using eBPF, and it has become the foundation for tools like Cilium, Falco, and bpftrace.
By the end of this page, you will understand eBPF's architecture, the virtual machine that executes eBPF programs, the verification process that ensures safety, and why eBPF represents a paradigm shift in kernel programmability. You'll gain the foundational knowledge needed to understand how modern observability and networking tools operate at the kernel level.
To understand eBPF, we must first understand its predecessor: the Berkeley Packet Filter (BPF). The evolution from classic BPF to eBPF represents a transformation from a specialized packet filtering mechanism to a general-purpose in-kernel virtual machine.
Classic BPF: The Origin (1992)
BPF was introduced in 1992 by Steven McCanne and Van Jacobson in their seminal paper "The BSD Packet Filter: A New Architecture for User-level Packet Capture." Its purpose was simple but revolutionary: provide an efficient way to filter network packets in-kernel, avoiding the expensive copying of unwanted packets to user space.
The classic BPF architecture consisted of a small, register-based virtual machine: two 32-bit registers (an accumulator A and an index register X), 16 scratch memory slots, and a deliberately restricted instruction set built for one job—deciding whether to accept or reject a packet. That engine still powers the filter expressions used by tcpdump and libpcap today. The table below contrasts classic BPF with its extended successor:

| Feature | Classic BPF (cBPF) | Extended BPF (eBPF) |
|---|---|---|
| Register count | 2 registers (A, X), 32-bit | 11 registers (R0-R10), 64-bit |
| Instruction width | 32 bits | 64 bits |
| Stack | 16 memory slots | 512-byte stack |
| Maps (data structures) | None | Hash maps, arrays, ring buffers, etc. |
| Helper functions | None | 200+ kernel helper functions |
| Tail calls | Not supported | Supported (program chaining) |
| Verification | Basic safety checks | Comprehensive static analysis |
| Use cases | Packet filtering only | Tracing, networking, security, etc. |
| Attachment points | Socket filters only | 100+ attachment points (kprobes, tracepoints, XDP, etc.) |
The eBPF Revolution (2014)
In 2014, Alexei Starovoitov and Daniel Borkmann introduced eBPF as a major extension of classic BPF. The key insight was that the concepts behind BPF—verified, safe, sandboxed execution within the kernel—could be generalized far beyond packet filtering.
The eBPF transformation included:
Extended register set: From 2 registers to 11 64-bit registers (R0-R10), enabling more complex programs and matching modern CPU architectures.
Richer instruction set: Added modern instructions, including 64-bit arithmetic, function calls, and memory operations.
BPF maps: Introduced kernel-side data structures (hash tables, arrays, etc.) that can be shared between eBPF programs and user space (a minimal map sketch follows this list).
Helper functions: Provided access to kernel functionality through a growing set of helper functions.
Multiple attachment points: Extended beyond sockets to kprobes, tracepoints, XDP, cgroups, and more.
Advanced verification: Implemented a sophisticated static analyzer to ensure program safety.
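To make maps concrete, the sketch below declares a hash map and updates it from an eBPF program using the standard libbpf map-definition style and helper functions. The map name (counts), the per-PID counting logic, and the choice of tracepoint are illustrative assumptions, not taken from any specific tool.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Hypothetical example: count syscalls per process in a hash map.
// User space can read this map via libbpf or the bpf() syscall.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    // PID (tgid)
    __type(value, u64);  // number of syscalls observed
} counts SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int count_syscalls(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 one = 1, *val;

    val = bpf_map_lookup_elem(&counts, &pid);
    if (val)
        __sync_fetch_and_add(val, 1);                      // bump existing entry
    else
        bpf_map_update_elem(&counts, &pid, &one, BPF_ANY); // create entry

    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```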
The "e" in eBPF stands for "extended," but the technology has evolved so far beyond its origins that many now simply refer to it as "BPF." The Linux kernel internally uses the term "BPF" for the modern implementation, with classic BPF often referred to as "cBPF." When you see references to BPF in modern contexts, it almost always means eBPF.
At the heart of eBPF is a sophisticated virtual machine that executes eBPF bytecode within the kernel. This VM provides the foundation for safe, efficient in-kernel programmability.
Register Architecture
The eBPF VM is a register-based machine with 11 64-bit registers:
| Register | Purpose | Calling Convention |
|---|---|---|
| R0 | Return value from functions/helpers | Return value |
| R1-R5 | Function arguments | Arguments to helpers/functions |
| R6-R9 | Callee-saved registers | Preserved across calls |
| R10 | Read-only frame pointer | Points to 512-byte stack |
This architecture closely mirrors the x86-64 and ARM64 calling conventions, enabling efficient JIT compilation to native code.
```
// Example: eBPF program structure showing register usage
// R1 contains the context (e.g., struct pt_regs* for kprobes)
// R10 is the frame pointer

// Function prologue - allocate stack space
r1 = *(u64 *)(r1 + 0)        // Load first argument from context
r6 = r1                      // Save to callee-saved register

// Call a helper function
// Arguments go in R1-R5, result comes back in R0
r1 = r6                      // First argument
r2 = 16                      // Second argument (size)
call bpf_probe_read_kernel   // Helper function call

// R0 now contains the return value
if r0 != 0 goto error        // Check for errors

// Access stack using frame pointer (R10)
*(u64 *)(r10 - 8) = r0       // Store result on stack
r0 = *(u64 *)(r10 - 8)       // Load from stack

exit:
  r0 = 0                     // Return success
  exit

error:
  r0 = 1                     // Return error
  exit
```

Instruction Set
The eBPF instruction set uses a fixed 64-bit instruction format:
┌──────────┬─────────┬─────────┬────────────┬───────────────┐
│ opcode │ dst_reg │ src_reg │ offset │ immediate │
│ 8 bits │ 4 bits │ 4 bits │ 16 bits │ 32 bits │
└──────────┴─────────┴─────────┴────────────┴───────────────┘
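In C terms, this layout corresponds to struct bpf_insn from the kernel UAPI header <linux/bpf.h>. The snippet below shows that structure and, purely for illustration, hand-encodes the two-instruction program "r0 = 0; exit"; real programs are emitted by Clang rather than written this way.

```c
#include <linux/bpf.h>   // defines struct bpf_insn and the opcode constants

/* The UAPI definition mirrors the diagram above:
 *
 *   struct bpf_insn {
 *       __u8  code;        // opcode
 *       __u8  dst_reg:4;   // destination register
 *       __u8  src_reg:4;   // source register
 *       __s16 off;         // signed offset
 *       __s32 imm;         // signed immediate constant
 *   };
 */

// Hand-encoded "r0 = 0; exit" (illustrative only):
struct bpf_insn prog[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = 0, .imm = 0 }, // r0 = 0
    { .code = BPF_JMP | BPF_EXIT },                                  // return r0
};
```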
The instruction classes include:
| Class | Encoding | Description |
|---|---|---|
| BPF_LD | 0x00 | Load operations (legacy) |
| BPF_LDX | 0x01 | Load from memory |
| BPF_ST | 0x02 | Store immediate |
| BPF_STX | 0x03 | Store from register |
| BPF_ALU | 0x04 | 32-bit arithmetic |
| BPF_JMP | 0x05 | Jump operations |
| BPF_JMP32 | 0x06 | 32-bit jump operations |
| BPF_ALU64 | 0x07 | 64-bit arithmetic |
The eBPF instruction set was deliberately designed to be simple enough for verification while expressive enough for complex programs. The choice of a register-based architecture (vs. stack-based) enables efficient JIT compilation and makes static analysis tractable. The instruction set avoids complex addressing modes found in x86, trading some flexibility for verifiability.
The eBPF verifier is the critical component that makes in-kernel programmability safe. Before any eBPF program can execute, it must pass the verifier's comprehensive static analysis. This is not a simple syntax check—it's a deep simulation of all possible execution paths.
Why Verification is Essential
Kernel code runs with full system privileges. A bug in kernel code can:

- Crash the entire machine (kernel panic)
- Corrupt kernel memory or on-disk data
- Leak sensitive kernel or user data
- Hang the system in an infinite loop
- Open exploitable security holes
The verifier ensures that eBPF programs cannot cause these problems, even when written by untrusted users.
How the Verifier Works
The verifier performs abstract interpretation, simulating program execution with abstract register states rather than concrete values.
Step 1: Control Flow Graph Construction
The verifier first builds a DAG (Directed Acyclic Graph) representation of the program, identifying all basic blocks and control flow edges.
Step 2: State Tracking
For each register, the verifier tracks:

- Its type (e.g., SCALAR_VALUE, PTR_TO_CTX, PTR_TO_MAP_VALUE, PTR_TO_STACK)
- Its possible value range (signed and unsigned minimum/maximum bounds)
- Pointer offset and alignment information
- Whether it has been initialized (reads of uninitialized registers or stack slots are rejected)
Step 3: Path Exploration
The verifier explores every possible execution path using depth-first search. At conditional branches, it forks the state and continues with both branches. States are merged at join points, taking the union of possible values.
```c
// Consider this eBPF program
SEC("kprobe/sys_open")
int trace_open(struct pt_regs *ctx) {
    void *ptr = ctx;   // source pointer for the read (example)
    u64 len;

    // Verifier tracks: R1 = PTR_TO_CTX

    len = bpf_get_current_pid_tgid() >> 32;
    // Verifier tracks: len is SCALAR, range [0, 2^32 - 1]

    if (len > 100) {
        // In this branch: len range is [101, 2^32 - 1]
        len = 100;
        // After assignment: len range is [100, 100]
    }
    // After join: len range is [0, 100]

    char buf[256];

    // This is SAFE because len <= 100 < 256
    bpf_probe_read_kernel(buf, len, ptr);

    return 0;
}

// But this would be REJECTED:
SEC("kprobe/sys_open")
int unsafe_trace(struct pt_regs *ctx) {
    u64 len = bpf_get_current_pid_tgid() >> 32;
    // Verifier tracks: len is SCALAR, range [0, 2^32 - 1]

    char buf[256];

    // REJECTED! len could be > 256, causing buffer overflow
    bpf_probe_read_kernel(buf, len, ctx);
    // Error: "R2 unbounded memory access, use 'var &= const'"

    return 0;
}
```

Verifier Complexity Limits
To prevent denial-of-service attacks where malicious programs cause the verifier to consume excessive resources, strict limits are enforced:
| Limit | Value | Purpose |
|---|---|---|
| Max instructions | 1 million | Limits program size |
| Max verified instructions | 1 million | Limits verification time |
| Max stack depth | 512 bytes | Limits stack usage |
| Max tail calls | 33 | Prevents infinite recursion via tail calls |
| BPF-to-BPF call depth | 8 | Limits function nesting |
| Loops (bounded) | Kernel 5.3+ | Back-edges must have provable bounds |
The verifier is sophisticated but not omniscient. Complex programs can hit verification limits even when semantically safe. Developers often need to refactor code, add explicit bounds checks, or restructure control flow to satisfy the verifier. This is the price of running untrusted code in kernel space safely.
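As a concrete illustration of such a refactor, one common fix for the rejected example above is to bound the scalar explicitly before using it as a size, for instance by masking it, which is exactly what the verifier's error message hints at. A minimal sketch, assuming the same includes as the earlier examples:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("kprobe/sys_open")
int bounded_trace(struct pt_regs *ctx) {
    u64 len = bpf_get_current_pid_tgid() >> 32;
    char buf[256];

    // Bound the scalar so the verifier can prove the access is safe:
    // after the mask, len is in [0, 255], which always fits in buf[256].
    len &= 0xff;

    bpf_probe_read_kernel(buf, len, ctx);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```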
Once an eBPF program passes verification, it's compiled to native machine code using a Just-In-Time (JIT) compiler. This transforms the eBPF bytecode into x86-64, ARM64, or other architecture-specific instructions, eliminating interpretation overhead.
JIT Compilation Process
```
// eBPF bytecode (simplified)
r0 = 0                      // BPF_MOV64_IMM R0, 0
r1 = *(u64 *)(r10 - 8)      // BPF_LDX_MEM DW R1, R10, -8
r0 = r0 + r1                // BPF_ALU64 ADD R0, R1
exit                        // BPF_EXIT

// Corresponding x86-64 JIT output
// (Generated by the Linux kernel's x86 BPF JIT)

push %rbp                   // Function prologue
mov  %rsp, %rbp
sub  $0x200, %rsp           // Allocate 512-byte BPF stack

xor  %eax, %eax             // r0 = 0 (x86: eax is low 32 bits of rax)
mov  -0x8(%rbp), %rdi       // r1 = *(u64 *)(r10 - 8)
                            // rbp maps to R10 (frame pointer)
add  %rdi, %rax             // r0 += r1

add  $0x200, %rsp           // Function epilogue
pop  %rbp
retq                        // exit (return with value in rax)

// Register mapping on x86-64:
// R0  -> rax (return value)
// R1  -> rdi (1st arg, also temp)
// R2  -> rsi (2nd arg)
// R3  -> rdx (3rd arg)
// R4  -> rcx (4th arg)
// R5  -> r8  (5th arg)
// R6  -> rbx (callee-saved)
// R7  -> r13 (callee-saved)
// R8  -> r14 (callee-saved)
// R9  -> r15 (callee-saved)
// R10 -> rbp (frame pointer)
```

JIT Hardening
BPF JIT includes several security hardening features:
Constant blinding: Immediate values are XORed with random values and unblinded at runtime, preventing JIT spray attacks.
Image randomization: JIT'd code is placed at randomized addresses within kernel memory.
Retpoline support: Indirect jumps use retpolines on CPUs vulnerable to Spectre variant 2.
Read-only executable memory: JIT images are marked as read-only and executable, preventing code modification.
You can check if BPF JIT is enabled with: sysctl net.core.bpf_jit_enable. Value 0 = disabled (interpreted), 1 = enabled, 2 = enabled with debug output. For production systems handling high-throughput eBPF programs (like XDP), JIT should always be enabled for acceptable performance.
eBPF's versatility comes from its diverse program types and attachment points. Each program type has a specific purpose, context, and set of available helper functions. The attachment point determines when and where the program executes.
Program Type Categories
eBPF programs can be broadly categorized by their domain:
| Category | Program Type | Attachment Point | Primary Use Case |
|---|---|---|---|
| Networking | BPF_PROG_TYPE_XDP | Network driver (ingress) | High-performance packet processing, DDoS mitigation |
| Networking | BPF_PROG_TYPE_SCHED_CLS | TC (Traffic Control) | Packet classification, container networking |
| Networking | BPF_PROG_TYPE_SOCKET_FILTER | Socket | Packet filtering (classic use case) |
| Networking | BPF_PROG_TYPE_SK_SKB | Sockmap | Socket-level proxy, load balancing |
| Tracing | BPF_PROG_TYPE_KPROBE | Kernel function entry/exit | Dynamic function tracing |
| Tracing | BPF_PROG_TYPE_TRACEPOINT | Static tracepoints | Stable kernel event tracing |
| Tracing | BPF_PROG_TYPE_RAW_TRACEPOINT | Raw tracepoints | Low-overhead tracepoint access |
| Tracing | BPF_PROG_TYPE_PERF_EVENT | Perf events | Performance monitoring, sampling |
| Security | BPF_PROG_TYPE_LSM | LSM hooks | Security policy enforcement |
| Security | BPF_PROG_TYPE_CGROUP_* | cgroup events | Per-container resource control |
| Extensibility | BPF_PROG_TYPE_STRUCT_OPS | Kernel struct_ops | Custom kernel subsystem implementations |
Context Structures
Each program type receives a specific context structure as its input (passed in R1). The context provides access to relevant data for that hook point:
| Program Type | Context Structure | Key Fields |
|---|---|---|
| XDP | struct xdp_md | data, data_end, data_meta, ingress_ifindex |
| Socket Filter | struct __sk_buff | data, data_end, protocol, len, ifindex |
| kprobe | struct pt_regs | CPU registers at probe point |
| Tracepoint | Tracepoint-specific | Varies per tracepoint |
| LSM | Hook-specific | Security-relevant data for the hook |
```c
// Example 1: XDP program (networking)
SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx) {
    // Context gives us packet data pointers
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    struct ethhdr *eth = data;

    // Bounds check required by verifier
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    // Drop all UDP packets (example)
    if (eth->h_proto == htons(ETH_P_IP)) {
        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_DROP;

        if (ip->protocol == IPPROTO_UDP)
            return XDP_DROP;   // Drop UDP
    }

    return XDP_PASS;           // Allow other packets
}

// Example 2: Kprobe program (tracing)
SEC("kprobe/do_sys_openat2")
int trace_openat(struct pt_regs *ctx) {
    // Context gives us CPU registers at function entry
    // On x86-64, function arguments are in rdi, rsi, rdx, rcx, r8, r9
    int dirfd = PT_REGS_PARM1(ctx);
    const char *pathname = (const char *)PT_REGS_PARM2(ctx);

    u32 pid = bpf_get_current_pid_tgid() >> 32;

    // Log to trace output
    bpf_printk("PID %d opening file at dirfd %d\n", pid, dirfd);

    return 0;
}

// Example 3: Tracepoint program (stable tracing)
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_sys_enter_openat(struct trace_event_raw_sys_enter *ctx) {
    // Tracepoint context has structured access to syscall args
    int dirfd = ctx->args[0];
    const char *pathname = (const char *)ctx->args[1];
    int flags = ctx->args[2];

    bpf_printk("openat: dirfd=%d, flags=0x%x\n", dirfd, flags);

    return 0;
}
```

The choice of program type depends on your goal. For observability, prefer tracepoints (stable ABI) over kprobes (unstable). For networking, XDP provides the best performance for early packet drops, while TC offers more flexibility. For security, LSM programs integrate with the Linux Security Module framework. Understanding these tradeoffs is essential for effective eBPF development.
The eBPF ecosystem has matured significantly, offering multiple layers of tooling from low-level to high-level abstractions. Understanding this ecosystem is crucial for choosing the right tool for your use case.
The eBPF Toolchain Stack
┌─────────────────────────────────────────────────────────────────┐
│ High-Level Tools: bpftrace, BCC, kubectl-trace │
├─────────────────────────────────────────────────────────────────┤
│ eBPF Frameworks: libbpf, ebpf-go, aya (Rust), libbpf-rs │
├─────────────────────────────────────────────────────────────────┤
│ Compiler: Clang/LLVM (C → eBPF bytecode) │
├─────────────────────────────────────────────────────────────────┤
│ BTF (BPF Type Format): CO-RE (Compile Once, Run Everywhere) │
├─────────────────────────────────────────────────────────────────┤
│ Kernel: bpf() syscall, verifier, JIT, maps, helpers │
└─────────────────────────────────────────────────────────────────┘
eBPF programs are typically written in a restricted subset of C and compiled with Clang/LLVM using its dedicated bpf target.

CO-RE: Solving the Portability Problem
Historically, eBPF programs that accessed kernel structures were tied to specific kernel versions—the offsets of struct fields could change between releases. CO-RE (Compile Once, Run Everywhere) solves this by:

- Describing kernel types with BTF (BPF Type Format), which the kernel exposes for its own structures
- Having Clang record a relocation for every struct field access at compile time
- Having libbpf rewrite those accesses at load time to match the layouts of the running kernel
This enables distributing pre-compiled eBPF programs that work across different kernel versions without recompilation.
```c
// vmlinux.h is generated from kernel BTF
// Contains all kernel type definitions
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>   // CO-RE helpers

SEC("kprobe/do_sys_openat2")
int trace_open(struct pt_regs *ctx) {
    struct task_struct *task;
    struct file *file;
    char comm[16];
    pid_t pid;

    // Get current task_struct
    task = (struct task_struct *)bpf_get_current_task();

    // CO-RE: Read pid field - works across kernel versions
    // BPF_CORE_READ handles offset relocation automatically
    pid = BPF_CORE_READ(task, pid);

    // Read task->comm (process name)
    BPF_CORE_READ_STR_INTO(&comm, task, comm);

    // CO-RE field existence check
    // Some fields may not exist in all kernel versions
    if (bpf_core_field_exists(task->loginuid)) {
        // Access loginuid only if it exists
        kuid_t loginuid = BPF_CORE_READ(task, loginuid);
    }

    bpf_printk("PID %d (%s) opening file\n", pid, comm);
    return 0;
}

// libbpf skeleton usage in user space:
// 1. Generate skeleton: bpftool gen skeleton program.bpf.o > program.skel.h
// 2. Load and attach:
//      struct program_bpf *skel = program_bpf__open_and_load();
//      program_bpf__attach(skel);
// 3. Cleanup:
//      program_bpf__destroy(skel);
```

For new eBPF projects, the recommended stack is libbpf + CO-RE + skeleton generation. This provides the best combination of performance, portability, and maintainability. BCC is useful for prototyping but carries significant runtime dependencies. bpftrace excels at interactive exploration and one-off queries.
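To round out the workflow, here is a minimal sketch of the user-space side driving the skeleton shown in the comments above. The skeleton name program_bpf follows from an object file named program.bpf.o and is an assumption for illustration; error handling is kept deliberately simple.

```c
// user-space loader (minimal sketch, assuming a bpftool-generated program.skel.h)
#include <stdio.h>
#include <unistd.h>
#include "program.skel.h"

int main(void) {
    struct program_bpf *skel;
    int err;

    // Open the embedded BPF object, run the verifier, and load maps/programs
    skel = program_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to open and load BPF skeleton\n");
        return 1;
    }

    // Attach to the hook points declared via SEC(...) in the BPF source
    err = program_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF programs: %d\n", err);
        program_bpf__destroy(skel);
        return 1;
    }

    // bpf_printk() output is visible via /sys/kernel/debug/tracing/trace_pipe
    printf("eBPF program attached; collecting events\n");
    for (int i = 0; i < 60; i++)
        sleep(1);   // keep the program attached while events are collected

    program_bpf__destroy(skel);
    return 0;
}
```

Build the loader by compiling it alongside the generated skeleton and linking against libbpf (e.g. -lbpf).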
eBPF is not just a Linux feature—it represents a fundamental shift in how we think about kernel extensibility. Let's examine the paradigm shift and why major technology companies have bet heavily on eBPF.
The Traditional Problem
Before eBPF, extending kernel functionality required:
Kernel modification: Changing kernel source code, recompiling, and rebooting. Impractical for most organizations.
Loadable kernel modules (LKMs): More flexible but still dangerous—a buggy module can crash the system, and modules have full kernel privileges.
User-space solutions: Safe but slow—data must cross the kernel/user boundary, causing context switches and memory copies.
None of these options provided safe, efficient, dynamic kernel extensibility.
| Approach | Safety | Performance | Dynamic | Ease of Use |
|---|---|---|---|---|
| Kernel modification | Low (full privileges) | Best | No (requires reboot) | Hard (kernel dev skills) |
| Kernel modules | Low (full privileges) | Best | Yes (loadable) | Medium (kernel dev skills) |
| User-space | High (isolated) | Poor (context switches) | Yes | Easy |
| eBPF | High (verified) | Near-native | Yes (loadable) | Medium-Hard |
The eBPF Paradigm
eBPF provides a fourth option: sandboxed, verified, JIT-compiled code that runs in kernel space with near-native performance. This unlocks capabilities that were previously impractical:
1. Ubiquitous Observability
With eBPF, you can observe anything happening in the kernel without modifying applications, restarting services, or incurring significant overhead. This has revolutionized debugging and monitoring in production environments.
2. High-Performance Networking
XDP (eXpress Data Path) enables packet processing at millions of packets per second per core, directly in the network driver. This powers DDoS mitigation, load balancing, and packet filtering at line rate.
3. Dynamic Security Policies
LSM programs enable runtime security policy enforcement without rebuilding kernels. This powers runtime security tools like Falco and Tetragon.
4. Custom Kernel Behavior
struct_ops allows implementing custom TCP congestion control, schedulers, and other kernel subsystems as eBPF programs—without kernel modifications.
eBPF is evolving rapidly. Recent developments include: eBPF for Windows, user-space eBPF runtimes (for testing and portability), Rust-based eBPF development (Aya), and the Linux kernel's sched_ext for eBPF-based CPU schedulers. Understanding eBPF now positions you for the future of systems programming.
We've covered the foundations of eBPF. Let's consolidate the key concepts:

- eBPF grew out of classic BPF (1992) and was generalized in 2014 into a safe, general-purpose in-kernel virtual machine.
- The VM is register-based: 11 64-bit registers, a 512-byte stack, and a fixed 64-bit instruction format designed for verifiability and efficient JIT compilation.
- The verifier statically analyzes every possible execution path, tracking register types and value ranges, before a program is allowed to run.
- Verified programs are JIT-compiled to native machine code, with hardening such as constant blinding and read-only images.
- Program types and attachment points (XDP, TC, socket filters, kprobes, tracepoints, LSM, struct_ops) determine where a program runs and what context it receives.
- The modern toolchain builds on Clang/LLVM, BTF, CO-RE, and libbpf, enabling portable, pre-compiled programs.
What's Next:
Now that you understand eBPF's architecture, its safety guarantees, and its place in the Linux ecosystem, the next page dives into eBPF programs themselves: how they're structured, how they interact with the kernel through helper functions, how they communicate via maps, and the development workflow from C source to code running in the kernel.