What if you could answer questions like "which process is calling this syscall?" or "what's the read latency distribution?" without modifying applications, restarting services, or adding significant overhead?
eBPF makes this possible. By attaching programs to kernel functions, tracepoints, and performance counters, eBPF provides unprecedented visibility into system behavior. This capability has revolutionized debugging, performance analysis, and security monitoring in production environments.
Netflix, Facebook, and Google use eBPF-based observability to debug issues in real-time across millions of servers. Tools like bpftrace, perf, and commercial solutions like Datadog's agent leverage eBPF to provide insights that were previously impossible or impractical to obtain.
By the end of this page, you will understand the eBPF tracing capabilities (kprobes, tracepoints, USDT), learn to use bpftrace for ad-hoc analysis, understand profiling and flamegraph generation, and appreciate how production observability tools leverage these primitives.
Tracing is the process of recording events as they occur in a system. Unlike logging (where applications explicitly emit messages), tracing instruments the system to capture events automatically. eBPF enables tracing at multiple levels:
Tracing Sources in Linux
| Source | Description | Stability | Performance |
|---|---|---|---|
| kprobes | Dynamic instrumentation of any kernel function | Unstable (functions can change) | Good |
| kretprobes | Function return tracing (captures return value) | Unstable | Good |
| tracepoints | Static, pre-defined kernel instrumentation points | Stable API | Best |
| raw_tracepoints | Lower-overhead tracepoint access | Stable | Better than regular tracepoints |
| fentry/fexit | BTF-enabled function tracing (kernel 5.5+) | Unstable (but typed) | Best |
| USDT | User-space statically defined tracing | Application-defined | Minimal |
| uprobes | Dynamic user-space function instrumentation | Unstable | Good |
| perf events | Hardware/software performance counters | Stable | Counter-dependent |
Understanding the Tracing Landscape
┌───────────────────────────────────────────────────────────────────┐
│ USER SPACE │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ Application │ │ Library │ │ Runtime (JVM) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────────┘ │
│ │ uprobes │ uprobes │ USDT probes │
└─────────┼──────────────────┼───────────────────┼────────────────┘
│ │ │
═══════════════════════════════════════════════════════════════════
│ │ │
┌─────────┼──────────────────┼───────────────────┼────────────────┐
│ ▼ ▼ ▼ KERNEL │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ System Calls (syscall tracepoints) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ VFS, Scheduler, Memory (kprobes/tracepoints) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Drivers (kprobes, device tracepoints) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Hardware (perf events, PMU counters) │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
In production, prefer stable interfaces. A tracepoint-based tool will work across kernel upgrades, while a kprobe-based tool might break. For example, tracing sys_enter_openat via tracepoint is stable; tracing do_sys_openat2 via kprobe might fail if the kernel refactors that function.
Kprobes (kernel probes) enable dynamic instrumentation of almost any kernel function. With kprobes, you can place breakpoints at function entry points, specific instructions, or function returns (kretprobes) without modifying the kernel.
How Kprobes Work
1. You register a probe on a target kernel function (e.g., vfs_read)
2. The kernel saves the original instruction at the probe address and replaces it with a breakpoint instruction (int3 on x86)
3. When execution hits the breakpoint, the kprobe handler runs your eBPF program
4. The saved instruction is then executed and normal control flow resumes

This mechanism enables tracing almost any of the roughly 50,000 kernel functions.
// ============================================
// KPROBE: Trace Function Entry
// ============================================
// SEC name format: kprobe/<function_name>
SEC("kprobe/vfs_read")
int BPF_KPROBE(trace_vfs_read, struct file *file, char __user *buf,
               size_t count, loff_t *pos)
{
    // BPF_KPROBE macro handles architecture-specific argument extraction
    // Arguments match the kernel function signature
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_printk("vfs_read: pid=%d, count=%lu", pid, count);
    return 0;
}

// ============================================
// KRETPROBE: Trace Function Return
// ============================================
// SEC name format: kretprobe/<function_name>
SEC("kretprobe/vfs_read")
int BPF_KRETPROBE(trace_vfs_read_ret, ssize_t ret)
{
    // BPF_KRETPROBE provides the return value
    // 'ret' contains the value returned by vfs_read
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    if (ret < 0) {
        bpf_printk("vfs_read failed: pid=%d, err=%ld", pid, ret);
    } else {
        bpf_printk("vfs_read success: pid=%d, bytes=%ld", pid, ret);
    }
    return 0;
}

// ============================================
// LATENCY TRACING PATTERN
// ============================================
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u64);    // PID + TID
    __type(value, u64);  // Start timestamp
} start_times SEC(".maps");

SEC("kprobe/vfs_read")
int trace_read_entry(struct pt_regs *ctx)
{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&start_times, &pid_tgid, &ts, BPF_ANY);
    return 0;
}

SEC("kretprobe/vfs_read")
int trace_read_return(struct pt_regs *ctx)
{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 *start_ts = bpf_map_lookup_elem(&start_times, &pid_tgid);
    if (start_ts) {
        u64 duration_ns = bpf_ktime_get_ns() - *start_ts;
        u64 duration_us = duration_ns / 1000;
        // Only log slow reads (> 1ms)
        if (duration_us > 1000) {
            bpf_printk("slow vfs_read: %llu us", duration_us);
        }
        bpf_map_delete_elem(&start_times, &pid_tgid);
    }
    return 0;
}

Finding Kprobeable Functions
# List all available kprobe points
cat /sys/kernel/debug/tracing/available_filter_functions | head -20
# Search for specific functions
cat /sys/kernel/debug/tracing/available_filter_functions | grep vfs_
# Check if a function exists in current kernel
grep -w "do_sys_openat2" /proc/kallsyms
Kprobe Limitations:
- Inlined functions cannot be probed (there is no call site to instrument)
- Some functions are blacklisted from probing (e.g., the kprobe machinery itself)
- Function names and signatures can change between kernel versions
- Per-invocation overhead is higher than tracepoints
Avoid kprobing hot paths in production unless necessary. Kprobes introduce overhead per invocation. Tracing vfs_read on a busy file server generates millions of probe hits per second. Use filtering (by PID, cgroup, etc.) to reduce overhead, and prefer tracepoints when available.
Tracepoints are static instrumentation points compiled into the kernel. Unlike kprobes, which are dynamic, tracepoints are:
- Stable: their names and field layouts form a maintained interface across kernel versions
- Explicit: placed by kernel developers at semantically meaningful events
- Cheap: a disabled tracepoint costs little more than a no-op
The kernel contains over 1,000 tracepoints covering syscalls, scheduler, memory, networking, block I/O, and more.
# List all available tracepoints
cat /sys/kernel/debug/tracing/available_events

# View tracepoint format (shows available fields)
cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_openat/format

# Example output:
# name: sys_enter_openat
# ID: 614
# format:
#   field:long __syscall_nr;     offset:8;  size:8; signed:0;
#   field:int dfd;               offset:16; size:8; signed:0;
#   field:const char * filename; offset:24; size:8; signed:0;
#   field:int flags;             offset:32; size:8; signed:0;
#   field:umode_t mode;          offset:40; size:8; signed:0;

# Key tracepoint categories:
# syscalls/sys_enter_*  - System call entry
# syscalls/sys_exit_*   - System call exit
# sched/*               - Scheduler events
# block/*               - Block I/O
# net/*                 - Networking
# irq/*                 - Interrupts
# timer/*               - Timer events
# kmem/*                - Memory allocation

Starting with kernel 5.5+, fentry/fexit probes offer the best of both worlds: they attach to specific kernel functions like kprobes, but use BTF for type-safe argument access and have lower overhead (no int3 trap). Use fentry when available and the target function doesn't have a tracepoint.
bpftrace is a high-level tracing language for Linux, inspired by DTrace and AWK. It compiles one-liners and scripts into eBPF programs, making eBPF accessible for ad-hoc analysis without writing C code.
bpftrace Architecture
┌───────────────────────────────────────────────────────────┐
│ bpftrace script │
│ kprobe:vfs_read { @bytes = hist(arg2); } │
└─────────────────────────┬─────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ bpftrace compiler │
│ (parser → AST → LLVM IR) │
└─────────────────────────┬─────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ eBPF bytecode │
│ (loaded via libbpf/bpf() syscall) │
└─────────────────────────┬─────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Kernel │
│ verifier → JIT → attach to kprobe/tracepoint/etc. │
└───────────────────────────────────────────────────────────┘
# ============================================
# PROCESS TRACING
# ============================================

# Trace new process execution
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s called execve", comm); }'

# Count syscalls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace process creation with args (only first 64 bytes)
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
  printf("%-6d %-16s ", pid, comm);
  join(args->argv);
}'

# ============================================
# FILE SYSTEM TRACING
# ============================================

# Trace file opens
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s", comm, str(args->filename)); }'

# Count reads by file
bpftrace -e 'kprobe:vfs_read { @[str(((struct file *)arg0)->f_path.dentry->d_name.name)] = count(); }'

# Histogram of read sizes
bpftrace -e 'kprobe:vfs_read { @bytes = hist(arg2); }'

# Read latency histogram
bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
  kretprobe:vfs_read /@start[tid]/ {
    @us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }'

# ============================================
# SCHEDULER TRACING
# ============================================

# Trace context switches for a specific PID
bpftrace -e 'tracepoint:sched:sched_switch
  /args->prev_pid == 1234 || args->next_pid == 1234/ {
    printf("%-8d %-16s -> %-16s", nsecs, args->prev_comm, args->next_comm);
  }'

# Measure runqueue latency (time task waits to run)
bpftrace -e 'tracepoint:sched:sched_wakeup { @queuetime[args->pid] = nsecs; }
  tracepoint:sched:sched_switch /args->prev_pid == 0 && @queuetime[args->next_pid]/ {
    @us = hist((nsecs - @queuetime[args->next_pid]) / 1000);
    delete(@queuetime[args->next_pid]);
  }'

# Off-CPU time by stack
bpftrace -e 'tracepoint:sched:sched_switch {
  @blocked[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch /args->prev_pid == 0 && @blocked[args->next_pid]/ {
  @us[kstack] = sum((nsecs - @blocked[args->next_pid]) / 1000);
  delete(@blocked[args->next_pid]);
}'

# ============================================
# NETWORK TRACING
# ============================================

# TCP retransmits
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { printf("retransmit: %s:%d -> %s:%d", ntop(args->saddr), args->sport, ntop(args->daddr), args->dport); }'

# Count TCP connections by destination
bpftrace -e 'kprobe:tcp_v4_connect { @connects[ntop(((struct sockaddr_in *)arg1)->sin_addr.s_addr)] = count(); }'

# Socket accept latency
bpftrace -e 'kprobe:inet_csk_accept { @start[tid] = nsecs; }
  kretprobe:inet_csk_accept /@start[tid]/ {
    @accept_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
  }'

bpftrace Syntax Quick Reference
| Element | Description | Example |
|---|---|---|
| probe | Attachment point | `kprobe:vfs_read`, `tracepoint:syscalls:sys_enter_open` |
| filter | Conditional execution | `/pid == 1234/`, `/comm == "nginx"/` |
| action | Code to execute | `{ printf("%d\n", pid); }` |
| @map | Aggregation map | `@counts[comm] = count();` |
| $var | Scalar variable | `$ts = nsecs;` |
| arg0-argN | kprobe arguments | `arg0` (first function argument) |
| args->field | Tracepoint args | `args->filename` |
| tid, pid | Thread/Process ID | Built-in variables |
| comm | Process name | Built-in variable |
| nsecs | Nanosecond timestamp | Built-in variable |
| kstack, ustack | Stack traces | For flamegraphs |
bpftrace excels at ad-hoc debugging. When a production issue occurs, you can write a one-liner in seconds to answer questions like 'which process is calling this syscall?' or 'what's the read latency distribution?' It's the kernel equivalent of adding print statements—but without modifying or restarting anything.
CPU Profiling identifies where a program spends its CPU time. eBPF enables efficient profiling by:
- Sampling stack traces in-kernel at a fixed frequency (e.g., 99 Hz) via perf events
- Aggregating sample counts in BPF maps instead of copying every sample to user space
- Exporting only the summarized stack counts when the session ends
Flamegraphs are the standard visualization for profiling data. They show:
- Width: the proportion of samples containing a function (wider = more time)
- Height: stack depth, with callers below and callees above
- Merging: identical stacks are combined, and the x-axis is sorted alphabetically, not chronologically
# ============================================
# ON-CPU PROFILING (CPU sampling)
# ============================================

# Profile all processes at 99 Hz for 10 seconds
# Using bpftrace:
bpftrace -e 'profile:hz:99 { @[kstack, ustack, comm] = count(); }' > stacks.txt

# Using perf with eBPF-enabled stacks:
perf record -F 99 -a -g -- sleep 10
perf script > stacks.txt

# Generate flamegraph from stacks
# Requires: https://github.com/brendangregg/FlameGraph
./stackcollapse-bpftrace.pl stacks.txt | ./flamegraph.pl > profile.svg

# ============================================
# OFF-CPU ANALYSIS (Blocking time)
# ============================================

# Track time spent blocked/sleeping
# This shows WHERE processes are waiting (I/O, locks, etc.)

bpftrace -e 'tracepoint:sched:sched_switch {
  if (args->prev_state == 1 || args->prev_state == 2) {
    @blocked[args->prev_pid, kstack] = nsecs;
  }
}

tracepoint:sched:sched_switch /args->prev_pid == 0 && @blocked[args->next_pid, kstack]/ {
  @offcpu_us[@blocked[args->next_pid, kstack]] =
    sum((nsecs - @blocked[args->next_pid, kstack]) / 1000);
  delete(@blocked[args->next_pid, kstack]);
}' > offcpu_stacks.txt

# ============================================
# FUNCTION DURATION PROFILING
# ============================================

# Profile time spent in specific functions
bpftrace -e 'kprobe:ext4_file_write_iter { @start[tid] = nsecs; }
kretprobe:ext4_file_write_iter /@start[tid]/ {
  @duration_us[kstack] = sum((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
END { print(@duration_us); }'

# ============================================
# DIFFERENTIAL PROFILING
# ============================================

# Compare before/after a change:
# 1. Profile baseline
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > before.txt

# 2. Make change, profile again
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }' > after.txt

# 3. Generate differential flamegraph
./difffolded.pl before.txt after.txt | ./flamegraph.pl > diff.svg

Reading Flamegraphs
┌────────────────────────────────────────────────────────────────────┐
│ do_sys_open │ ← Wide = lots of time
├───────────────────────────────────┬────────────────────────────────┤
│ do_filp_open │ security_file_open │
├──────────────────┬────────────────┼────────────────────────────────┤
│ path_openat │ alloc_fd │ selinux_* │
├────────┬─────────┼────────────────┤ │
│ lookup │ create │ │ │
└────────┴─────────┴────────────────┴────────────────────────────────┘
△ Narrow = less time
Flamegraph Analysis Tips:
Look for wide plateaus — These are functions where significant time is spent without calling other functions (CPU-bound work or leaf functions)
Compare widths — If function_a is twice as wide as function_b, it consumes twice the CPU time
Follow the hot path — Start from the widest root and follow the widest child at each level
Ignore narrow towers — Deeply nested but narrow stacks contribute little to overall time
On-CPU flamegraphs show where CPU time is spent—useful for CPU-bound workloads. Off-CPU flamegraphs show where time is spent blocked (waiting for I/O, locks, etc.)—essential for I/O-bound or latency-sensitive workloads. For complete analysis, generate both and compare.
eBPF has enabled a new generation of observability tools that provide deep system visibility with minimal overhead. Let's examine the patterns and tools used in production environments.
The eBPF Observability Stack
┌──────────────────────────────────────────────────────────────────┐
│ Visualization │
│ Grafana, Jaeger, custom dashboards │
└────────────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────────────┼─────────────────────────────────┐
│ Backends │
│ Prometheus, Elasticsearch, ClickHouse, Parca │
└────────────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────────────┼─────────────────────────────────┐
│ eBPF Agents │
│ Pixie, Parca, Tetragon, Datadog Agent, Cilium Hubble │
└────────────────────────────────┬─────────────────────────────────┘
│
┌────────────────────────────────┼─────────────────────────────────┐
│ eBPF Programs │
│ Ring buffers → user space → export │
└────────────────────────────────┬─────────────────────────────────┘
│
└────────────────────────────────┴─────────────────────────────────┘
Linux Kernel
| Tool | Focus Area | Key Capabilities |
|---|---|---|
| Cilium Hubble | Network observability | L3/L4/L7 flow visibility, service maps, DNS visibility |
| Pixie | Application performance | Auto-instrumented traces, flamegraphs, service topology |
| Parca | Continuous profiling | Always-on profiling, differential analysis |
| Tetragon | Security observability | Process execution, file access, network tracing |
| Falco | Runtime security | Syscall-based threat detection, rule engine |
| bcc tools | Ad-hoc analysis | 50+ readymade tools (execsnoop, opensnoop, etc.) |
| Datadog Agent | Full-stack observability | eBPF-enhanced APM, network monitoring, security |
// ============================================
// PATTERN 1: Efficient Event Streaming with Ring Buffer
// ============================================
// Ring buffers are the modern way to stream events to user space
// - Single buffer shared across CPUs
// - Lock-free for single producer (BPF program)
// - Notification coalescing reduces wakeups

struct event {
    u64 timestamp;
    u32 pid;
    u32 tid;
    char comm[16];
    char filename[256];
    s64 retval;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1024 * 1024);  // 1 MB
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_exit_openat")
int trace_openat_exit(struct trace_event_raw_sys_exit *ctx)
{
    struct event *e;

    // Reserve space atomically
    e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e)
        return 0;  // Buffer full, drop (handle gracefully)

    // Fill event
    e->timestamp = bpf_ktime_get_ns();
    e->pid = bpf_get_current_pid_tgid() >> 32;
    e->tid = bpf_get_current_pid_tgid();
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    e->retval = ctx->ret;

    // Submit (makes visible to user space)
    bpf_ringbuf_submit(e, 0);
    return 0;
}

// ============================================
// PATTERN 2: In-Kernel Aggregation
// ============================================
// Aggregate in BPF maps to reduce user-space load
// Only send summaries, not individual events

struct latency_key {
    char comm[16];
    u8 bucket;  // Latency bucket (log2)
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, struct latency_key);
    __type(value, u64);  // Count
} latency_histogram SEC(".maps");

static __always_inline u8 log2_bucket(u64 value)
{
    // Returns 0-63, representing 2^N to 2^(N+1) range
    u8 bucket = 0;
    while (value > 1 && bucket < 63) {
        value >>= 1;
        bucket++;
    }
    return bucket;
}

SEC("kretprobe/vfs_read")
int trace_read_latency(struct pt_regs *ctx)
{
    u64 *start_ts, latency_ns;
    struct latency_key key = {};
    u64 *count;

    // Get start timestamp from map (set in kprobe)
    // ... (lookup and calculate latency)
    latency_ns = /* calculated */;

    bpf_get_current_comm(&key.comm, sizeof(key.comm));
    key.bucket = log2_bucket(latency_ns / 1000);  // µs buckets

    // Increment counter
    count = bpf_map_lookup_elem(&latency_histogram, &key);
    if (count) {
        __sync_fetch_and_add(count, 1);
    } else {
        u64 one = 1;
        bpf_map_update_elem(&latency_histogram, &key, &one, BPF_ANY);
    }
    return 0;
}

// ============================================
// PATTERN 3: Cgroup Filtering for Containers
// ============================================
// In containerized environments, filter events by cgroup

const volatile u64 target_cgroupid = 0;  // Set by user space

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_container_opens(struct trace_event_raw_sys_enter *ctx)
{
    u64 cgid = bpf_get_current_cgroup_id();

    // Skip if not target container
    if (target_cgroupid && cgid != target_cgroupid)
        return 0;

    // Process event for this container
    // ...
    return 0;
}

Use bpftool prog to check program run counts and run times. Aim for <1% CPU overhead.

Facebook runs eBPF programs on millions of servers with negligible overhead. The key is efficient program design: aggregate in-kernel, filter early, sample when necessary, and use modern APIs like ring buffers. Well-designed eBPF observability adds <0.5% CPU overhead even on busy systems.
Debugging eBPF programs presents unique challenges: you can't use GDB, printf debugging has limitations, and verifier errors can be cryptic. Here are the essential debugging techniques.
Debugging Toolkit
| Tool/Technique | Purpose | When to Use |
|---|---|---|
| `bpf_printk()` | Kernel log output | Quick debugging |
| `bpftool prog show` | List loaded programs | Verify loading |
| `bpftool prog dump` | Disassemble programs | Understand JIT output |
| `bpftool map dump` | Inspect map contents | Verify data flow |
| Verifier output | Understand rejections | Fix verification errors |
| BTF (CO-RE) | Type-safe access | Portable programs |
# ============================================
# INSPECT LOADED PROGRAMS
# ============================================

# List all loaded BPF programs
bpftool prog show

# Example output:
# 42: kprobe  name trace_openat  tag 7a8e3f04b9b3d57a  gpl
#     loaded_at 2024-01-15T10:30:00+0000  uid 0
#     bytes_xlated 392  jited 224  memlock 4096B
#     map_ids 5,6

# Show detailed program info
bpftool prog show id 42 --pretty

# Disassemble BPF bytecode
bpftool prog dump xlated id 42

# Show JIT'd native code
bpftool prog dump jited id 42

# ============================================
# INSPECT MAPS
# ============================================

# List all BPF maps
bpftool map show

# Dump map contents
bpftool map dump id 5

# Dump in JSON format for parsing
bpftool map dump id 5 --json | jq

# Look up specific key
bpftool map lookup id 5 key 0x00 0x00 0x00 0x01

# ============================================
# READ DEBUG OUTPUT
# ============================================

# bpf_printk() writes to trace_pipe
# Run in separate terminal:
cat /sys/kernel/debug/tracing/trace_pipe

# Or use bpftool:
# (requires kernel 5.9+)
bpftool prog tracelog

# ============================================
# VERIFIER DEBUGGING
# ============================================

# Get verbose verifier output
# In code, use:
# LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_level = 1);

# Or set environment variable:
LIBBPF_LOG_LEVEL=debug ./my_loader

# The verifier will print something like:
# func#0 @0
# 0: (b7) r1 = 0
# 1: (63) *(u32 *)(r10 -4) = r1
# ...
# 12: (bf) r2 = r1
# R1 !read_ok   <-- Error: R1 might be NULL
# processed 12 insns (limit 1000000)

# ============================================
# CHECK ATTACHMENT STATUS
# ============================================

# List kprobes
cat /sys/kernel/debug/kprobes/list

# List tracepoints in use
cat /sys/kernel/debug/tracing/enabled_events

# BPF links
bpftool link show

Common Verifier Errors and Solutions
| Error Message | Cause | Solution |
|---|---|---|
| `R1 !read_ok` | Reading potentially NULL pointer | Add NULL check before access |
| `unbounded access` | Array index not bounded | Add bounds check `if (idx < MAX)` |
| `invalid mem access` | Wrong offset/type for context | Check context structure fields |
| `back-edge from insn X` | Unbounded loop detected | Add bounded loop or refactor |
| `bpf_xxx: unknown func` | Helper not available | Check kernel version, license |
| `variable stack access` | Stack access with non-const offset | Use constant array indices |
// ============================================
// ERROR: R1 !read_ok (potentially NULL pointer)
// ============================================

// BAD: Verifier doesn't know lookup can succeed
u64 *value = bpf_map_lookup_elem(&my_map, &key);
*value += 1;  // ERROR: value might be NULL

// GOOD: Always check map lookup results
u64 *value = bpf_map_lookup_elem(&my_map, &key);
if (value) {
    *value += 1;
}

// ============================================
// ERROR: unbounded memory access
// ============================================

// BAD: len could be any value
SEC("kprobe/vfs_read")
int bad_trace(struct pt_regs *ctx)
{
    size_t len = PT_REGS_PARM3(ctx);
    char buf[256];
    bpf_probe_read_kernel(buf, len, some_ptr);  // ERROR
}

// GOOD: Bound the length
SEC("kprobe/vfs_read")
int good_trace(struct pt_regs *ctx)
{
    size_t len = PT_REGS_PARM3(ctx);
    char buf[256];
    if (len > sizeof(buf))
        len = sizeof(buf);
    bpf_probe_read_kernel(buf, len, some_ptr);  // OK
}

// ALTERNATIVE: Use bitwise AND for power-of-2 sizes
size_t len = PT_REGS_PARM3(ctx) & (sizeof(buf) - 1);

// ============================================
// ERROR: back-edge from insn (loops)
// ============================================

// BAD: Unbounded loop
for (int i = 0; i < count; i++) {  // ERROR: count is variable
    // ...
}

// GOOD: Bounded loop (kernel 5.3+)
#pragma unroll
for (int i = 0; i < 16; i++) {  // OK: constant bound
    if (i >= count)
        break;
    // ...
}

// ALTERNATIVE: Use bpf_loop() helper (kernel 5.17+)
static int loop_callback(u32 index, void *ctx)
{
    // ... process item
    return 0;  // Return 0 to continue, 1 to break
}
bpf_loop(count, loop_callback, &my_ctx, 0);

Start simple: write a minimal program that attaches and prints one thing. Gradually add complexity. When the verifier rejects, read the full output carefully: it tells you exactly which instruction failed and why. Use bpf_printk() liberally during development, then remove for production.
We've explored eBPF's transformative tracing and observability capabilities. Let's consolidate the key concepts:
- kprobes and uprobes provide dynamic instrumentation but can break across kernel versions; tracepoints and USDT are stable interfaces
- bpftrace compiles one-liners into eBPF programs, making ad-hoc kernel analysis fast
- On-CPU flamegraphs show where CPU time goes; off-CPU flamegraphs show where time is spent blocked
- Production tools keep overhead low by aggregating in-kernel, filtering early, and streaming via ring buffers
- Debug with bpf_printk(), bpftool, and careful reading of verifier output
What's Next:
You now have the foundational knowledge to use eBPF for system observability, from ad-hoc debugging with bpftrace to production monitoring with efficient in-kernel aggregation. The next page explores networking use cases: how eBPF is equally transformative for packet processing, load balancing, and network security through technologies like XDP, TC, and socket-level eBPF.