Sandboxing

Level: Advanced

Duration: 90 mins

Topic: Protection Mechanisms

seccomp

The Kernel's System Call Firewall

In the previous page, we explored the concepts and strategies behind system call filtering. Now we dive deep into seccomp (secure computing mode)—Linux's kernel facility for implementing these filters.

Seccomp represents one of the most important security innovations in the Linux kernel. Introduced progressively between 2005 and 2012, it has become the foundation of sandboxing in Chrome, Docker, Android, systemd, and countless other security-critical systems. Understanding seccomp is essential for anyone building or analyzing sandboxed systems on Linux.

Seccomp operates at the system call boundary, making decisions about whether to allow, deny, or otherwise handle each syscall before it executes. Because it runs in kernel context, seccomp provides guarantees that user-space monitoring cannot: there is no race condition between the check and the syscall execution.

What You Will Learn

By the end of this page, you will understand seccomp's architecture, the BPF programming model used for filters, how to write and install seccomp filters, the seccomp user notification mechanism for complex policies, and practical patterns used in production systems.

seccomp Architecture

Seccomp operates at a critical point in the kernel: the system call entry path. When a user-space process invokes a system call, the kernel's entry code checks whether the calling thread has seccomp filters installed and runs them before dispatching to the syscall's implementation.

Execution Flow:

User Space                    Kernel Space
─────────────────────────────────────────────────────
                              
process → syscall →   ──────► syscall entry
                              │
                              ▼
                              seccomp filter check
                              │
                    ┌─────────┴──────────┐
                    │                    │
                    ▼                    ▼
             ALLOW ───► syscall      KILL ───► terminate
                        handler           │
                        │                 ▼
                        ▼              ERRNO ───► return error
                    ◄── return        TRAP ────► SIGSYS signal
                                      NOTIFY ──► supervisor

Key Architectural Properties:

Filter Runs in Kernel Context: The BPF filter executes in kernel space during syscall entry. This is critical—user space cannot race with or manipulate the filter execution.
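Filters Are Inherited: Child processes created with fork() or clone() inherit their parent's seccomp filters, and filters are preserved across execve(). A newly installed filter applies to the calling thread and to every thread or process it creates afterwards; other threads that already exist are covered only if the filter is installed with SECCOMP_FILTER_FLAG_TSYNC.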
Filters Are Append-Only: You can add more restrictive filters but cannot remove filters. This prevents an attacker who gains code execution from disabling the sandbox.
Filters Stack: Multiple filters can be installed. The kernel evaluates all of them on every syscall, and the most restrictive result wins (the action with the highest precedence in the table below).

seccomp Action Priority (Lowest Value Wins)
Action	Value	Behavior	Used For
SECCOMP_RET_KILL_PROCESS	0x80000000	Kill entire process	Highly dangerous syscalls
SECCOMP_RET_KILL_THREAD	0x00000000	Kill calling thread	Less disruptive than process kill
SECCOMP_RET_TRAP	0x00030000	Send SIGSYS signal	User-space emulation
SECCOMP_RET_ERRNO	0x00050000	Return errno value	Graceful denial
SECCOMP_RET_USER_NOTIF	0x7fc00000	Notify supervisor	Broker/policy server
SECCOMP_RET_TRACE	0x7ff00000	Ptrace notification	Debugging
SECCOMP_RET_LOG	0x7ffc0000	Log and allow	Policy development
SECCOMP_RET_ALLOW	0x7fff0000	Allow syscall	Permitted operations

Priority Semantics

When multiple filters are stacked, the kernel compares the action values as signed 32-bit integers and takes the lowest. KILL_PROCESS (0x80000000, negative when read as signed) beats KILL_THREAD (0x00000000), which beats ERRNO (0x00050000), which beats ALLOW (0x7fff0000). This ensures that once a filter denies a syscall, no other filter in the stack can allow it.
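The precedence rule can be mirrored in a few lines of user-space C. This is a minimal sketch, assuming the constants from <linux/seccomp.h> (kernel headers 4.14 or newer); combine_actions is a hypothetical helper written for illustration, not a kernel or library API.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Hypothetical helper: pick the result the kernel would act on when
 * several stacked filters each return a value for the same syscall.
 * The action bits are compared as a signed 32-bit number, which is why
 * SECCOMP_RET_KILL_PROCESS (0x80000000, negative when signed) outranks
 * SECCOMP_RET_KILL_THREAD (0x00000000). */
static uint32_t combine_actions(const uint32_t *results, size_t n)
{
    uint32_t winner = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        int32_t cur  = (int32_t)(results[i] & SECCOMP_RET_ACTION_FULL);
        int32_t best = (int32_t)(winner & SECCOMP_RET_ACTION_FULL);
        if (cur < best)
            winner = results[i];   /* keep the data bits (e.g. the errno) */
    }
    return winner;
}
```

Feeding it the results of three filters that return ALLOW, ERRNO and KILL_PROCESS yields KILL_PROCESS, regardless of the order in which the filters were installed.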

seccomp Modes: Strict and Filter

Seccomp has two distinct modes of operation, reflecting its evolution over time:

Mode 1: Strict Mode (Original seccomp)

The original seccomp (2005) enforces a fixed policy that permits only four syscalls: read, write, _exit, and sigreturn. Any other syscall causes the kernel to terminate the caller with SIGKILL.

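#include <stdio.h>
#include <stdlib.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

void enable_strict_seccomp(void) {
    // Enable strict seccomp mode
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl");
        exit(1);
    }
    
    // From here, only read/write/exit/sigreturn work.
    // Any other syscall = instant death. That includes exit_group,
    // which glibc's exit() and _exit() use, so terminate with
    // syscall(SYS_exit, status) instead.
}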
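The effect of strict mode is easiest to observe from a parent process. The following is a small sketch, assuming a Linux host where strict mode is available; run_strict_child is a hypothetical helper, and getppid is simply a convenient syscall that is not on the strict allowlist.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical demo helper. Forks a child that enters strict mode and
 * then either exits cleanly or attempts a forbidden syscall.
 * Returns 0 if the child exited with status 0, the signal number if it
 * was killed, or -1 on any other outcome. */
static int run_strict_child(int attempt_forbidden)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);                 /* strict mode unavailable */
        if (attempt_forbidden)
            syscall(SYS_getppid);     /* not on the strict allowlist */
        /* Use the raw exit syscall: glibc's _exit() invokes
         * exit_group, which strict mode does not permit. */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

Calling run_strict_child(0) should report a clean exit, while run_strict_child(1) should report that the child was killed by SIGKILL.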

Use Cases for Strict Mode:

Extremely simple compute tasks (no file opening, no memory allocation after setup)
CPUjail-style sandboxes for untrusted computation
Historical interest—mostly superseded by filter mode

Limitations:

Too restrictive for most applications (can't even allocate memory)
No customization—fixed policy
Can't communicate except through pre-opened file descriptors

Mode 2: Filter Mode (seccomp-bpf)

Filter mode (2012, Linux 3.5) extends seccomp with BPF (Berkeley Packet Filter) programs, allowing flexible, programmable policies. The filter can inspect the syscall number, the architecture, and the raw argument values (but not the memory that pointer arguments refer to), and returns one of the action codes.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

// glibc provides no seccomp() wrapper, so define a thin one
static int seccomp(unsigned int operation, unsigned int flags, void *args) {
    return (int)syscall(SYS_seccomp, operation, flags, args);
}

void enable_seccomp_filter(struct sock_fprog *prog) {
    // Needed unless the caller has CAP_SYS_ADMIN: prevents privilege
    // escalation through exec of setuid binaries
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl NO_NEW_PRIVS");
        exit(1);
    }
    
    // Install seccomp filter
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog) != 0) {
        perror("prctl SECCOMP_MODE_FILTER");
        exit(1);
    }
}

// Alternative: use the seccomp() syscall (Linux 3.17+), which also accepts flags
void enable_seccomp_syscall(struct sock_fprog *prog) {
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl NO_NEW_PRIVS");
        exit(1);
    }
    
    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, prog) != 0) {
        perror("seccomp");
        exit(1);
    }
}

NO_NEW_PRIVS Required

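Installing a seccomp filter requires either CAP_SYS_ADMIN (in the caller's user namespace) or that the calling thread has set PR_SET_NO_NEW_PRIVS. Without this rule, an attacker could install a filter and then exec a setuid binary. The binary would run with elevated privileges under the attacker's filter, and a filter that makes a call such as setuid() report success without executing could trick it into continuing with privileges it meant to drop.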

BPF Filter Programming

Seccomp uses classic BPF (cBPF) for filter programs—a simple bytecode virtual machine originally designed for packet filtering. A BPF program consists of instructions that load values, perform arithmetic, and make conditional jumps.

The seccomp_data Structure:

The BPF filter can access syscall information through the seccomp_data structure:

struct seccomp_data {
    int   nr;         // Syscall number
    __u32 arch;       // AUDIT_ARCH_ value (architecture)
    __u64 instruction_pointer;  // CPU instruction pointer at the time of the syscall
    __u64 args[6];    // Syscall arguments
};
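One consequence of this layout is worth spelling out: classic BPF loads 32 bits at a time, while each entry of args is 64 bits wide, so checking a full argument takes two loads. The sketch below computes the two offsets; arg_lo_offset and arg_hi_offset are hypothetical helper names, and the endianness test relies on the predefined macros of GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Hypothetical helpers: byte offsets of the low and high 32-bit halves
 * of syscall argument `i` inside struct seccomp_data. */
static uint32_t arg_base_offset(unsigned int i)
{
    return (uint32_t)(offsetof(struct seccomp_data, args)
                      + i * sizeof(uint64_t));
}

static uint32_t arg_lo_offset(unsigned int i)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return arg_base_offset(i);        /* low word is stored first */
#else
    return arg_base_offset(i) + 4;    /* low word is stored second */
#endif
}

static uint32_t arg_hi_offset(unsigned int i)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return arg_base_offset(i) + 4;
#else
    return arg_base_offset(i);
#endif
}
```

A filter that compares only the low word of an argument leaves the high word unchecked, which matters for arguments that are genuinely 64-bit.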

BPF Instructions:

BPF uses a small instruction set:

Instruction Type	Purpose
`BPF_LD`	Load value into accumulator
`BPF_LDX`	Load value into index register
`BPF_ST`	Store accumulator to memory
`BPF_ALU`	Arithmetic/logic operations
`BPF_JMP`	Conditional/unconditional jump
`BPF_RET`	Return value (action)
`BPF_MISC`	Miscellaneous (register copy)

Example: Minimal Allowlist Filter:

#include <stddef.h>            // offsetof
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/syscall.h>

struct sock_filter filter[] = {
    // Load architecture
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 
             offsetof(struct seccomp_data, arch)),
    // Verify architecture is x86_64
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    // Kill if wrong architecture
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    
    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    
    // Allow read (syscall 0)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Allow write (syscall 1)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Allow exit_group (syscall 231)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Default: kill
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

Understanding the BPF_STMT and BPF_JUMP Macros:

// BPF_STMT: unconditional instruction
// BPF_STMT(code, k)
// code = operation type | size | mode
// k = immediate value

BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr))
// BPF_LD: load operation
// BPF_W: word (32-bit) size
// BPF_ABS: absolute addressing from filter data (seccomp_data)
// Result: load seccomp_data.nr into accumulator

// BPF_JUMP: conditional jump
// BPF_JUMP(code, k, jt, jf)
// code = operation type | comparison
// k = comparison value
// jt = instructions to skip if true
// jf = instructions to skip if false

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1)
// BPF_JEQ: jump if equal
// BPF_K: compare with immediate value k
// If accumulator == __NR_read: skip 0 instructions (execute next)
// If accumulator != __NR_read: skip 1 instruction
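To see how the jt and jf offsets steer execution, it can help to step through the allowlist filter outside the kernel. The sketch below is a deliberately tiny user-space evaluator, assuming only the three instruction forms that filter uses; eval_filter and demo_filter are hypothetical names, and this is not how the kernel's BPF engine is implemented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* The allowlist filter from the example above, repeated so this sketch
 * is self-contained. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* Hypothetical evaluator covering only LD|W|ABS, JMP|JEQ|K and RET|K.
 * Anything else, or a program that runs off the end, counts as a kill. */
static uint32_t eval_filter(const struct sock_filter *prog, size_t len,
                            const struct seccomp_data *data)
{
    uint32_t acc = 0;   /* the BPF accumulator register */
    size_t pc = 0;      /* index of the instruction being executed */

    while (pc < len) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            /* Loads must be word-aligned and inside seccomp_data */
            if (insn->k % 4 != 0 || insn->k > sizeof(*data) - 4)
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)data + insn->k, sizeof(acc));
            pc += 1;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* jt/jf count instructions to skip after the jump itself */
            pc += 1 + (acc == insn->k ? insn->jt : insn->jf);
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;
}
```

Evaluating the filter with arch set to AUDIT_ARCH_X86_64 and nr set to __NR_read follows the path load, jump over the kill, load, match, return ALLOW; any syscall not on the list falls through every comparison to the final kill.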

Use libseccomp for Complex Filters

Writing raw BPF is error-prone and architecture-specific. The libseccomp library provides a high-level API that generates the BPF for each architecture it supports. Production code should use libseccomp unless you have specific reasons for raw BPF.

libseccomp: The High-Level API

libseccomp provides a high-level, architecture-independent API for creating seccomp filters. It handles the complexities of BPF generation, architecture differences, and syscall number translation.

Basic Usage:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <seccomp.h>

void setup_seccomp(void) {
    // Create filter context with default action KILL_PROCESS
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (ctx == NULL) {
        fprintf(stderr, "seccomp_init failed\n");
        exit(1);
    }
    
    // Allow specific syscalls (each call returns 0 or a negative errno;
    // the checks are omitted here for brevity)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    
    // Load filter into kernel. libseccomp reports errors as negative
    // errno return values, so perror() would not show the real cause.
    int rc = seccomp_load(ctx);
    if (rc != 0) {
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
        seccomp_release(ctx);
        exit(1);
    }
    
    // Release context (filter is now in kernel)
    seccomp_release(ctx);
}

Argument Filtering with libseccomp:

libseccomp makes argument filtering straightforward:

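#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <seccomp.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>

void setup_with_arg_filtering(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (ctx == NULL)
        exit(1);
    
    // Return values of seccomp_rule_add() are not checked, for brevity
    
    // Allow socket() only for AF_UNIX (argument 0 == 1)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 1,
                     SCMP_A0(SCMP_CMP_EQ, AF_UNIX));
    
    // Deny mprotect() when PROT_EXEC and PROT_WRITE are both requested
    // (arg2 has both bits set)
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, 
                            PROT_EXEC | PROT_WRITE, 
                            PROT_EXEC | PROT_WRITE));
    // Allow every other mprotect(): at least one of the two bits is clear.
    // The allow rules are conditional too, so the policy does not depend on
    // how libseccomp resolves an unconditional rule against a conditional one.
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_WRITE, 0));
    
    // Allow ioctl() only for specific commands
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1,
                     SCMP_A1(SCMP_CMP_EQ, TCGETS));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1,
                     SCMP_A1(SCMP_CMP_EQ, FIONREAD));
    
    // Deny openat() with O_CREAT (prevent file creation), allow it otherwise
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(openat), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, O_CREAT, O_CREAT));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, O_CREAT, 0));
    
    if (seccomp_load(ctx) != 0) {
        seccomp_release(ctx);
        exit(1);
    }
    seccomp_release(ctx);
}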

Comparison Operators:

libseccomp Comparison Operators
Operator	C Macro	Description
Not equal	SCMP_CMP_NE	arg != value
Less than	SCMP_CMP_LT	arg < value
Less or equal	SCMP_CMP_LE	arg <= value
Equal	SCMP_CMP_EQ	arg == value
Greater or equal	SCMP_CMP_GE	arg >= value
Greater than	SCMP_CMP_GT	arg > value
Masked equal	SCMP_CMP_MASKED_EQ	(arg & mask) == value
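The masked comparison is the least obvious operator in the table, so here is the same predicate in plain C. This is an illustrative sketch only: masked_eq and is_wx_request are hypothetical helpers, not part of libseccomp.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper mirroring the SCMP_CMP_MASKED_EQ predicate:
 * the comparison is true when (arg & mask) == value. */
static bool masked_eq(uint64_t arg, uint64_t mask, uint64_t value)
{
    return (arg & mask) == value;
}

/* True when an mprotect() `prot` argument asks for memory that is both
 * writable and executable, the case the mprotect rule above rejects. */
static bool is_wx_request(uint64_t prot)
{
    return masked_eq(prot, PROT_WRITE | PROT_EXEC, PROT_WRITE | PROT_EXEC);
}
```

Using the same constant for mask and value tests that all of the masked bits are set; using a value of 0 tests that none of them are.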

Architecture Handling:

libseccomp automatically handles architecture differences:

// Enable support for additional architectures
seccomp_arch_add(ctx, SCMP_ARCH_X86);      // 32-bit x86
seccomp_arch_add(ctx, SCMP_ARCH_X32);      // x32 ABI

// Or remove architectures (stricter)
seccomp_arch_remove(ctx, SCMP_ARCH_NATIVE); // Remove native first
seccomp_arch_add(ctx, SCMP_ARCH_X86_64);    // Add only x86_64
// Now 32-bit and x32 syscalls will be blocked

Export Filters for Inspection

libseccomp can export generated BPF for debugging: seccomp_export_bpf(ctx, fd) writes raw BPF, and seccomp_export_pfc(ctx, fd) writes a human-readable pseudo-filter code format.

seccomp User Notification

SECCOMP_RET_USER_NOTIF (Linux 5.0+) enables a powerful broker pattern where a supervisor process handles blocked syscalls on behalf of the sandboxed process. The supervisor can inspect the syscall, validate it against policy, and either perform the operation or deny it.

Architecture:

┌─────────────────┐                    ┌─────────────────┐
│  Sandboxed      │                    │   Supervisor    │
│  Process        │                    │   Process       │
├─────────────────┤                    ├─────────────────┤
│                 │                    │                 │
│  syscall(open)  │──► blocked by ─────│  notif_recv()  │
│      ▼          │    seccomp         │      ▼          │
│   [blocked]     │                    │  validate path  │
│                 │                    │      ▼          │
│                 │                    │  open() if OK   │
│                 │                    │      ▼          │
│  [resumes]  ◄───│◄─ notif_send() ────│  send fd back   │
│   with fd       │                    │                 │
└─────────────────┘                    └─────────────────┘

Setting Up User Notification:

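#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

// glibc provides no seccomp() wrapper, so define a thin one
static int seccomp(unsigned int operation, unsigned int flags, void *args) {
    return (int)syscall(SYS_seccomp, operation, flags, args);
}

int setup_notify_supervisor(void) {
    struct sock_filter filter[] = {
        // ... architecture check ...
        
        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        
        // Send openat to supervisor
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        
        // Allow other syscalls
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    
    // Needed unless the caller has CAP_SYS_ADMIN
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl NO_NEW_PRIVS");
        exit(1);
    }
    
    // Install filter and get notification fd
    int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER,
                            SECCOMP_FILTER_FLAG_NEW_LISTENER,
                            &prog);
    if (notify_fd < 0) {
        perror("seccomp");
        exit(1);
    }
    
    // The fd is created in the process that installs the filter, i.e. the
    // one being sandboxed. Hand it to the supervisor (for example over a
    // Unix socket with SCM_RIGHTS) before running untrusted code.
    return notify_fd;
}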

Supervisor Event Loop:

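// Assumes <errno.h>, <fcntl.h>, <limits.h>, <stdio.h>, <stdlib.h>,
// <string.h>, <unistd.h>, <sys/ioctl.h>, <sys/syscall.h>,
// <linux/seccomp.h> and a seccomp() syscall wrapper.
// read_process_memory() and is_path_allowed() are application-defined.
void supervisor_loop(int notify_fd) {
    struct seccomp_notif *req = NULL;
    struct seccomp_notif_resp *resp = NULL;
    struct seccomp_notif_sizes sizes;
    
    // Get sizes for allocation
    if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) != 0) {
        perror("GET_NOTIF_SIZES");
        return;
    }
    req = malloc(sizes.seccomp_notif);
    resp = malloc(sizes.seccomp_notif_resp);
    if (req == NULL || resp == NULL) {
        free(req);
        free(resp);
        return;
    }
    
    while (1) {
        memset(req, 0, sizes.seccomp_notif);
        memset(resp, 0, sizes.seccomp_notif_resp);
        
        // Receive notification (blocks until syscall intercepted)
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) != 0) {
            if (errno == EINTR || errno == ENOENT)  // Interrupted, or target died
                continue;
            perror("NOTIF_RECV");
            break;
        }
        
        // req->id: unique notification ID
        // req->pid: pid of sandboxed process
        // req->data.nr: syscall number
        // req->data.args[]: syscall arguments
        
        resp->id = req->id;
        resp->flags = 0;
        
        if (req->data.nr == __NR_openat) {
            // Read pathname from sandboxed process memory
            char path[PATH_MAX];
            int dirfd = (int)req->data.args[0];
            
            if (dirfd != AT_FDCWD) {
                // Any other dirfd refers to the target's fd table, not
                // ours. Relative paths also resolve against the
                // supervisor's working directory; a real broker must
                // account for that.
                resp->error = -EACCES;
            } else if (read_process_memory(req->pid, req->data.args[1], 
                                           path, sizeof(path)) < 0 ||
                       // Still the same target after the read?
                       ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID,
                             &req->id) != 0) {
                resp->error = -EACCES;
            } else if (is_path_allowed(path)) {
                // Perform open on behalf of sandbox
                int fd = openat(AT_FDCWD, path,
                                (int)req->data.args[2],
                                (mode_t)req->data.args[3]);
                if (fd >= 0) {
                    // Install the fd in the target (Linux 5.9+). The
                    // ioctl returns the fd number allocated there, which
                    // becomes the syscall's return value.
                    struct seccomp_notif_addfd addfd = {
                        .id = req->id,
                        .srcfd = (__u32)fd,
                    };
                    int target_fd = ioctl(notify_fd,
                                          SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
                    if (target_fd >= 0)
                        resp->val = target_fd;
                    else
                        resp->error = -errno;
                    close(fd);  // Supervisor's copy is no longer needed
                } else {
                    resp->error = -errno;
                }
            } else {
                resp->error = -EACCES;
            }
        } else {
            // The filter only forwards openat; refuse anything else
            resp->error = -ENOSYS;
        }
        
        // Send response (unblocks sandboxed process)
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) != 0) {
            perror("NOTIF_SEND");
        }
    }
    
    free(req);
    free(resp);
}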

TOCTOU in User Notification

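User notification has an inherent TOCTOU race: between the supervisor reading the sandboxed process's memory and acting on it, another thread in the sandbox can modify that memory. For this reason the supervisor should perform the operation itself using its own copy of the arguments, and should not rely on SECCOMP_USER_NOTIF_FLAG_CONTINUE to enforce a security policy. A second race concerns the target's identity: after reading the target's memory, call SECCOMP_IOCTL_NOTIF_ID_VALID to verify that the notification is still pending and the PID has not been recycled, and discard the data if the check fails.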

Filter Flags and Options

The seccomp() system call accepts various flags that modify filter behavior:

SECCOMP_FILTER_FLAG_TSYNC:

Synchronize the filter across all threads in the thread group. Without this, each thread needs to install the filter individually:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, &prog);
// Now all existing threads have the filter

SECCOMP_FILTER_FLAG_LOG:

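Log every filter return action except SECCOMP_RET_ALLOW (Linux 4.14+). An administrator can override this by excluding specific actions in /proc/sys/kernel/seccomp/actions_logged:

seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_LOG, &prog);
// Records go to the audit log, or typically to the kernel log (dmesg)
// when no audit daemon is running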

SECCOMP_FILTER_FLAG_SPEC_ALLOW:

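Disable the Speculative Store Bypass mitigation that the kernel may otherwise enable for a task installing a seccomp filter. This can improve performance but reduces protection against that class of speculative-execution attack:

// Only use if performance is critical and you understand the speculative-execution risks
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_SPEC_ALLOW, &prog);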

seccomp Filter Flags
Flag	Value	Purpose
SECCOMP_FILTER_FLAG_TSYNC	1	Sync filter to all threads
SECCOMP_FILTER_FLAG_LOG	2	Enable logging
SECCOMP_FILTER_FLAG_SPEC_ALLOW	4	Disable Speculative Store Bypass mitigation
SECCOMP_FILTER_FLAG_NEW_LISTENER	8	Return notification fd
SECCOMP_FILTER_FLAG_TSYNC_ESRCH	16	Fail with ESRCH (instead of returning a thread ID) if sync fails
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	32	Target ignores non-fatal signals once the supervisor has received the notification

seccomp Attribute Operations:

Recent kernels support query operations alongside filter installation:

struct seccomp_notif_sizes sizes;

// Get required allocation sizes for notification structures
seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);
printf("notif: %hu, notif_resp: %hu, data: %hu\n",
       sizes.seccomp_notif, sizes.seccomp_notif_resp, 
       sizes.seccomp_data);

// Test whether the running kernel supports a given action (Linux 4.14+).
// The argument is an input: the action to ask about.
__u32 action = SECCOMP_RET_USER_NOTIF;
if (seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
    // Action is supported; otherwise the call fails with EOPNOTSUPP
}

TSYNC for Multi-Threaded Applications

For multi-threaded applications, always use SECCOMP_FILTER_FLAG_TSYNC when installing the filter. Otherwise, there's a race window where some threads might execute syscalls before the filter is installed. TSYNC atomically applies the filter to all threads.

Common Patterns and Best Practices

Production seccomp implementations follow established patterns that balance security with reliability:

Pattern: Privileged Setup, Then Sandbox:

Perform all privileged operations before installing the filter:

int main() {
    // Phase 1: Privileged setup
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    bind(listen_fd, ...);  // Bind to port 80
    listen(listen_fd, 100);
    
    // Open config files, allocate resources, etc.
    config_t *cfg = load_config("/etc/myservice.conf");
    
    // Phase 2: Drop privileges
    drop_root_privileges();
    
    // Phase 3: Install seccomp filter
    install_seccomp_filter();
    
    // Phase 4: Run main loop (now sandboxed)
    while (1) {
        int conn = accept(listen_fd, ...);
        handle_connection(conn);  // Cannot open files, bind ports, etc.
    }
}

Pattern: Fail-Open Development, Fail-Closed Production:

#include <stdbool.h>
#include <seccomp.h>

int install_filter(bool strict_mode) {
    scmp_filter_ctx ctx = seccomp_init(
        strict_mode ? SCMP_ACT_KILL_PROCESS : SCMP_ACT_LOG
    );
    if (ctx == NULL)
        return -1;
    // ... rules ...
    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}

// Development: blocked syscalls are logged, not killed
//     install_filter(false);
// Production: blocked syscalls terminate process
//     install_filter(true);

seccomp Best Practices

• Always validate architecture: make it the first check in every filter; prevents bypass through the 32-bit compatibility ABI.
• Use NO_NEW_PRIVS: required for unprivileged install; prevents setuid-based escape.
• Use allowlists: default deny with explicit allow is more secure than a denylist.
• Use libseccomp: manual BPF is error-prone and non-portable.
• Use TSYNC: ensure all existing threads get the filter at the same time.
• Test both paths: verify allowed syscalls work AND blocked syscalls fail.
• Log during development: use the LOG action to discover needed syscalls.
• Handle x32 ABI: x32 calls report the x86_64 architecture, so either support them explicitly or block them by syscall number.
• Audit filter size: large filters have performance overhead (though usually modest).
• Version control filters: filter changes should be reviewed like code.
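The x32 item deserves a concrete form, because the architecture check alone does not catch it: x32 syscalls report AUDIT_ARCH_X86_64 and are distinguished only by bit 30 of the syscall number. The fragment below is a sketch of one way to reject them in a raw filter. DEMO_X32_SYSCALL_BIT, reject_x32 and is_x32_syscall are illustrative names; the kernel headers call the constant __X32_SYSCALL_BIT on x86.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Bit 30 marks x32 syscall numbers. Redefined under a local name so the
 * sketch builds on any architecture. */
#define DEMO_X32_SYSCALL_BIT 0x40000000u

/* Filter fragment: load the syscall number and kill the process if it
 * lies in the x32 range. Intended to sit right after the arch check. */
static const struct sock_filter reject_x32[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* nr >= bit 30: fall through to the kill; otherwise skip it */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, DEMO_X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* allowlist checks continue here, nr still in the accumulator */
};

/* The same test in plain C, for reference. */
static bool is_x32_syscall(uint32_t nr)
{
    return nr >= DEMO_X32_SYSCALL_BIT;
}
```

An exact-match allowlist already rejects x32 numbers implicitly, since none of them equal a native syscall number; the explicit check matters most for denylists and range comparisons.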

Pattern: Sandbox Entry Function:

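// Needs _GNU_SOURCE and <sched.h> for unshare(), plus <stdio.h>,
// <unistd.h> and <sys/prctl.h>. drop_all_capabilities(),
// close_fds_above() and install_seccomp_filter() are application-defined.
int enter_sandbox(void (*sandboxed_main)(void *), void *arg) {
    // 1. Validate we're in a clean state
    if (getuid() == 0) {
        fprintf(stderr, "Must not be root\n");
        return -1;
    }
    
    // 2. Set up additional isolation. An unprivileged caller must create
    //    a user namespace (CLONE_NEWUSER) to be allowed the others.
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC) != 0) {
        perror("unshare");
        // Continue anyway; seccomp is the primary protection
    }
    
    // 3. Set NO_NEW_PRIVS
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("PR_SET_NO_NEW_PRIVS");
        return -1;
    }
    
    // 4. Drop capabilities
    drop_all_capabilities();
    
    // 5. Close unnecessary file descriptors (before the filter, so the
    //    cleanup does not need extra syscalls on the allowlist)
    close_fds_above(2);
    
    // 6. Install seccomp filter
    if (install_seccomp_filter() != 0) {
        return -1;
    }
    
    // 7. Call sandboxed code
    sandboxed_main(arg);
    
    return 0;
}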

Performance Considerations

Seccomp filter performance is generally excellent, but understanding the performance characteristics helps in designing efficient sandboxes.

Filter Execution Overhead:

Seccomp filters add overhead to every system call. The overhead depends on:

Filter size: more instructions mean more execution time
Filter structure: a linear scan of syscall numbers is slower than a binary search
Speculative-execution mitigations: hardening the kernel may enable for seccomp-confined tasks adds overhead
Number of stacked filters: every installed filter is evaluated on each syscall

Typical Overhead:

Scenario	Overhead per Syscall
No seccomp	Baseline
Small filter (20 instructions)	~50-100 ns
Medium filter (100 instructions)	~200-400 ns
Large filter (500+ instructions)	~500-1000 ns

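For perspective, a cheap syscall takes roughly 100-500 ns, so the overhead ranges from negligible to doubling the cost of the call. Treat the figures above as rough orders of magnitude; actual numbers depend on the CPU, the kernel version, whether the BPF JIT is enabled, and which mitigations are active.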

Optimization Techniques:

1. Put Common Syscalls First:

In a hand-written filter, check frequently called syscalls early so that they leave the filter after a few instructions. With libseccomp, the order of seccomp_rule_add() calls does not determine where a syscall lands in the generated filter; give hot syscalls a priority hint instead:

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);

// Higher priority (0-255) places the syscall earlier in the generated filter
seccomp_syscall_priority(ctx, SCMP_SYS(read), 255);
seccomp_syscall_priority(ctx, SCMP_SYS(write), 255);
seccomp_syscall_priority(ctx, SCMP_SYS(futex), 200);

2. Use Binary Search Tree Structure:

libseccomp can generate binary search trees instead of linear lists:

seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2);
// Filter structure will be optimized for O(log n) lookup

3. Minimize Argument Checks:

Argument filtering is more expensive than syscall number checking. Only filter arguments when necessary for security.

Performance Is Usually Not a Problem

For most applications, seccomp overhead is negligible compared to actual syscall work and application logic. Profile before optimizing. Syscall-heavy micro-benchmarks show worst-case overhead; real applications see much less impact.

Summary: seccomp

We have explored Linux's seccomp facility in depth—from architecture to practical implementation patterns. Let's consolidate the key insights:

Key Takeaways

• seccomp executes in kernel context: the filter sees only the syscall number and raw argument values, so user space cannot race with the decision.
• BPF provides flexible filtering: inspect syscall numbers, arguments, and architecture; return various actions.
• libseccomp simplifies development: architecture-aware, generates the BPF for you, handles edge cases.
• User notification enables broker patterns: a supervisor can perform operations on behalf of the sandbox.
• Filters are inherited and append-only: children inherit filters, and a filter can't be removed once installed.
• Multiple actions for different scenarios: KILL for dangerous, ERRNO for graceful denial, TRAP for emulation.
• Architecture validation is critical: always check arch, and handle x32, to prevent bypass through another ABI.
• Performance is generally excellent: low per-syscall overhead; optimize for frequent syscalls if needed.

What's Next:

We've covered process sandboxing and system call filtering in detail. The final page in this module explores container isolation—how container technologies like Docker and Kubernetes combine the mechanisms we've studied (namespaces, cgroups, seccomp, capabilities) to create practical, scalable isolation for modern workloads.

Page Complete

You now understand seccomp's architecture, the BPF programming model for filters, how to use libseccomp for practical filter development, user notification for complex policies, and the patterns used in production sandboxes. You can design, implement, and analyze seccomp-based system call filtering.

4 / 5

Loading learning content...

Operating SystemsProtection Mechanisms

Sandboxing

LevelAdvanced

Duration90 mins

TopicProtection Mechanisms

4 / 5

seccomp

The Kernel's System Call Firewall

What You Will Learn

seccomp Architecture

Execution Flow:

User Space                    Kernel Space
─────────────────────────────────────────────────────
                              
process → syscall →   ──────► syscall entry
                              │
                              ▼
                              seccomp filter check
                              │
                    ┌─────────┴──────────┐
                    │                    │
                    ▼                    ▼
             ALLOW ───► syscall      KILL ───► terminate
                        handler           │
                        │                 ▼
                        ▼              ERRNO ───► return error
                    ◄── return        TRAP ────► SIGSYS signal
                                      NOTIFY ──► supervisor

Key Architectural Properties:

Filter Runs in Kernel Context: The BPF filter executes in kernel space during syscall entry. This is critical—user space cannot race with or manipulate the filter execution.
Filters Are Inherited: Child processes inherit their parent's seccomp filters. Once a filter is installed, it applies to all current and future threads.
Filters Are Append-Only: You can add more restrictive filters but cannot remove filters. This prevents an attacker who gains code execution from disabling the sandbox.
Filters Stack: Multiple filters can be installed. The kernel evaluates all filters, and the most restrictive result wins (lowest precedence value).

seccomp Action Priority (Lowest Value Wins)
Action	Value	Behavior	Used For
SECCOMP_RET_KILL_PROCESS	0x80000000	Kill entire process	Highly dangerous syscalls
SECCOMP_RET_KILL_THREAD	0x00000000	Kill calling thread	Less disruptive than process kill
SECCOMP_RET_TRAP	0x00030000	Send SIGSYS signal	User-space emulation
SECCOMP_RET_ERRNO	0x00050000	Return errno value	Graceful denial
SECCOMP_RET_USER_NOTIF	0x7fc00000	Notify supervisor	Broker/policy server
SECCOMP_RET_TRACE	0x7ff00000	Ptrace notification	Debugging
SECCOMP_RET_LOG	0x7ffc0000	Log and allow	Policy development
SECCOMP_RET_ALLOW	0x7fff0000	Allow syscall	Permitted operations

Priority Semantics

seccomp Modes: Strict and Filter

Seccomp has two distinct modes of operation, reflecting its evolution over time:

Mode 1: Strict Mode (Original seccomp)

The original seccomp (2005) provides an extremely simple, fixed policy: only four syscalls are allowed: read, write, exit, and sigreturn. Any other syscall immediately terminates the process.

#include <linux/seccomp.h>
#include <sys/prctl.h>

void enable_strict_seccomp() {
    // Enable strict seccomp mode
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
        perror("prctl");
        exit(1);
    }
    
    // From here, only read/write/exit/sigreturn work
    // Any other syscall = instant death
}

Use Cases for Strict Mode:

Extremely simple compute tasks (no file opening, no memory allocation after setup)
CPUjail-style sandboxes for untrusted computation
Historical interest—mostly superseded by filter mode

Limitations:

Too restrictive for most applications (can't even allocate memory)
No customization—fixed policy
Can't communicate except through pre-opened file descriptors

Mode 2: Filter Mode (seccomp-bpf)

#include <linux/seccomp.h>
#include <linux/filter.h>
#include <sys/prctl.h>

void enable_seccomp_filter(struct sock_fprog *prog) {
    // Required: prevent privilege escalation through exec of setuid binaries
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl NO_NEW_PRIVS");
        exit(1);
    }
    
    // Install seccomp filter
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog) != 0) {
        perror("prctl SECCOMP_MODE_FILTER");
        exit(1);
    }
}

// Alternative: use seccomp() syscall directly
void enable_seccomp_syscall(struct sock_fprog *prog): {
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl NO_NEW_PRIVS");
        exit(1);
    }
    
    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, prog) != 0) {
        perror("seccomp");
        exit(1);
    }
}

NO_NEW_PRIVS Required

BPF Filter Programming

The seccomp_data Structure:

The BPF filter can access syscall information through the seccomp_data structure:

struct seccomp_data {
    int   nr;         // Syscall number
    __u32 arch;       // AUDIT_ARCH_ value (architecture)
    __u64 instruction_pointer;  // Return address
    __u64 args[6];    // Syscall arguments
};

BPF Instructions:

BPF uses a small instruction set:

Instruction Type	Purpose
`BPF_LD`	Load value into accumulator
`BPF_LDX`	Load value into index register
`BPF_ST`	Store accumulator to memory
`BPF_ALU`	Arithmetic/logic operations
`BPF_JMP`	Conditional/unconditional jump
`BPF_RET`	Return value (action)
`BPF_MISC`	Miscellaneous (register copy)

Example: Minimal Allowlist Filter:

#include <stddef.h>          // offsetof
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/syscall.h>

struct sock_filter filter[] = {
    // Load architecture
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, 
             offsetof(struct seccomp_data, arch)),
    // Verify architecture is x86_64
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    // Kill if wrong architecture
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    
    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),
    
    // Allow read (syscall 0)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Allow write (syscall 1)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Allow exit_group (syscall 231)
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    
    // Default: kill
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

Understanding the BPF_STMT and BPF_JUMP Macros:

// BPF_STMT: unconditional instruction
// BPF_STMT(code, k)
// code = operation type | size | mode
// k = immediate value

BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr))
// BPF_LD: load operation
// BPF_W: word (32-bit) size
// BPF_ABS: absolute addressing from filter data (seccomp_data)
// Result: load seccomp_data.nr into accumulator

// BPF_JUMP: conditional jump
// BPF_JUMP(code, k, jt, jf)
// code = operation type | comparison
// k = comparison value
// jt = instructions to skip if true
// jf = instructions to skip if false

BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1)
// BPF_JEQ: jump if equal
// BPF_K: compare with immediate value k
// If accumulator == __NR_read: skip 0 instructions (execute next)
// If accumulator != __NR_read: skip 1 instruction
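To see how the jt and jf skip counts play out, it can help to step through a filter in user space. The sketch below is a toy interpreter written for this page, not a kernel interface: simulate_filter and action_for are hypothetical names, and only the three opcodes used by the allowlist example are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Interprets 32-bit absolute loads, JEQ against an immediate, and RET. */
static uint32_t simulate_filter(const struct sock_filter *f, size_t len,
                                const struct seccomp_data *data)
{
    uint32_t acc = 0;                       /* the BPF accumulator */
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *ins = &f[pc];
        switch (ins->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            /* k is a byte offset into struct seccomp_data */
            if (ins->k > sizeof(*data) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)data + ins->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Skip jt or jf instructions; the loop's pc++ then steps
             * to the instruction after the skipped ones. */
            pc += (acc == ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_RET | BPF_K:
            return ins->k;
        default:
            return SECCOMP_RET_KILL_PROCESS; /* opcode not modelled here */
        }
    }
    return SECCOMP_RET_KILL_PROCESS;         /* ran off the end */
}

/* Action a one-syscall allowlist yields for syscall number nr. */
static uint32_t action_for(int nr)
{
    const struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    };
    struct seccomp_data data = { .nr = nr };
    return simulate_filter(filter, sizeof(filter) / sizeof(filter[0]), &data);
}
```

With this filter, action_for(__NR_read) takes the jt path (skip 0) into the ALLOW return, while any other number takes the jf path (skip 1) into the KILL return.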

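Use libseccomp for Complex Filters

Hand-written BPF is error-prone and tied to one architecture's syscall numbers. For anything beyond a handful of rules, generate the filter with libseccomp.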

libseccomp: The High-Level API

Basic Usage:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <seccomp.h>

void setup_seccomp() {
    // Create filter context with default action KILL_PROCESS
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (ctx == NULL) {
        fprintf(stderr, "seccomp_init failed\n");
        exit(1);
    }
    
    // Allow specific syscalls
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(munmap), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    
    // Load filter into kernel.
    // libseccomp returns a negative errno value rather than setting errno.
    int rc = seccomp_load(ctx);
    if (rc != 0) {
        fprintf(stderr, "seccomp_load: %s\n", strerror(-rc));
        seccomp_release(ctx);
        exit(1);
    }
    
    // Release context (filter is now in kernel)
    seccomp_release(ctx);
}

Argument Filtering with libseccomp:

libseccomp makes argument filtering straightforward:

#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <seccomp.h>

void setup_with_arg_filtering() {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (ctx == NULL)
        exit(1);
    
    // Allow socket() only for AF_UNIX (argument 0 == 1)
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 1,
                     SCMP_A0(SCMP_CMP_EQ, AF_UNIX));
    
    // Deny mprotect() when PROT_EXEC and PROT_WRITE are both set.
    // A rule with no argument comparisons matches every call to the syscall,
    // so each allowed case carries its own comparison instead.
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, 
                            PROT_EXEC | PROT_WRITE, 
                            PROT_EXEC | PROT_WRITE));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_EXEC, 0));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mprotect), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, PROT_WRITE, 0));
    
    // Allow ioctl() only for specific commands
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1,
                     SCMP_A1(SCMP_CMP_EQ, TCGETS));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1,
                     SCMP_A1(SCMP_CMP_EQ, FIONREAD));
    
    // Block openat with O_CREAT (prevent file creation), allow it without
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(openat), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, O_CREAT, O_CREAT));
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 1,
                     SCMP_A2(SCMP_CMP_MASKED_EQ, O_CREAT, 0));
    
    if (seccomp_load(ctx) != 0)
        exit(1);
    seccomp_release(ctx);
}

Comparison Operators:

libseccomp Comparison Operators
Operator	C Macro	Description
Not equal	SCMP_CMP_NE	arg != value
Less than	SCMP_CMP_LT	arg < value
Less or equal	SCMP_CMP_LE	arg <= value
Equal	SCMP_CMP_EQ	arg == value
Greater or equal	SCMP_CMP_GE	arg >= value
Greater than	SCMP_CMP_GT	arg > value
Masked equal	SCMP_CMP_MASKED_EQ	(arg & mask) == value
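
Masked equality is the operator that most often needs a second look. The sketch below mirrors its arithmetic in plain C so the matching condition can be checked without loading a filter; masked_eq and is_writable_and_executable are illustrative names, not part of libseccomp.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>

/* User-space mirror of SCMP_CMP_MASKED_EQ: (arg & mask) == value */
static bool masked_eq(uint64_t arg, uint64_t mask, uint64_t value)
{
    return (arg & mask) == value;
}

/* True when a prot argument requests write and execute together. */
static bool is_writable_and_executable(uint64_t prot)
{
    const uint64_t wx = PROT_WRITE | PROT_EXEC;
    return masked_eq(prot, wx, wx);
}
```

Note that the value must be given under the same mask: a mask of PROT_WRITE | PROT_EXEC with a value of PROT_EXEC matches "executable but not writable", which is a different condition from "executable".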

Architecture Handling:

libseccomp automatically handles architecture differences:

// Enable support for additional architectures
seccomp_arch_add(ctx, SCMP_ARCH_X86);      // 32-bit x86
seccomp_arch_add(ctx, SCMP_ARCH_X32);      // x32 ABI

// Or remove architectures (stricter)
seccomp_arch_remove(ctx, SCMP_ARCH_NATIVE); // Remove native first
seccomp_arch_add(ctx, SCMP_ARCH_X86_64);    // Add only x86_64
// Now 32-bit and x32 syscalls will be blocked

Export Filters for Inspection

libseccomp can export generated BPF for debugging: seccomp_export_bpf(ctx, fd) writes raw BPF, and seccomp_export_pfc(ctx, fd) writes a human-readable pseudo-filter code format.

seccomp User Notification

Architecture:

┌─────────────────┐                    ┌─────────────────┐
│  Sandboxed      │                    │   Supervisor    │
│  Process        │                    │   Process       │
├─────────────────┤                    ├─────────────────┤
│                 │                    │                 │
│  syscall(open)  │──► blocked by ─────│  notif_recv()  │
│      ▼          │    seccomp         │      ▼          │
│   [blocked]     │                    │  validate path  │
│                 │                    │      ▼          │
│                 │                    │  open() if OK   │
│                 │                    │      ▼          │
│  [resumes]  ◄───│◄─ notif_send() ────│  send fd back   │
│   with fd       │                    │                 │
└─────────────────┘                    └─────────────────┘

Setting Up User Notification:

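#include <stddef.h>          // offsetof
#include <stdio.h>
#include <stdlib.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>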

int setup_notify_supervisor() {
    struct sock_filter filter[] = {
        // ... architecture check ...
        
        // Load syscall number
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        
        // Send openat to supervisor
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
        
        // Allow other syscalls
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    
    // Install filter and get notification fd
    int notify_fd = seccomp(SECCOMP_SET_MODE_FILTER,
                            SECCOMP_FILTER_FLAG_NEW_LISTENER,
                            &prog);
    if (notify_fd < 0) {
        perror("seccomp");
        exit(1);
    }
    
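    // The fd is created in the process that installed the filter. Hand it
    // to the supervisor, for example by inheriting it across fork() or by
    // sending it over a UNIX socket with SCM_RIGHTS.
    return notify_fd;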
}
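
Because the notification fd is created in the process that installs the filter, it usually has to be handed to the supervisor. One common way is SCM_RIGHTS over a UNIX-domain socket. The following is a minimal sketch under that assumption; send_fd and recv_fd are hypothetical helper names and error handling is reduced to a -1 return.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected UNIX socket. 0 on success. */
static int send_fd(int sock, int fd)
{
    char dummy = 'x';                      /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;              /* forces correct alignment */
    } u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd or -1. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = u.buf,
                          .msg_controllen = sizeof(u.buf) };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

A typical arrangement creates a socketpair() before fork(): the child installs the filter and calls send_fd() with the notification fd, and the parent calls recv_fd() and then enters its event loop.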

Supervisor Event Loop:

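// read_process_memory() and is_path_allowed() are application-defined helpers.
void supervisor_loop(int notify_fd) {
    struct seccomp_notif *req = NULL;
    struct seccomp_notif_resp *resp = NULL;
    struct seccomp_notif_sizes sizes;
    
    // Get sizes for allocation
    if (seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes) != 0) {
        perror("SECCOMP_GET_NOTIF_SIZES");
        return;
    }
    req = malloc(sizes.seccomp_notif);
    resp = malloc(sizes.seccomp_notif_resp);
    if (req == NULL || resp == NULL) {
        perror("malloc");
        free(req);
        free(resp);
        return;
    }
    
    while (1) {
        memset(req, 0, sizes.seccomp_notif);
        memset(resp, 0, sizes.seccomp_notif_resp);
        
        // Receive notification (blocks until syscall intercepted)
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req) != 0) {
            if (errno == EINTR || errno == ENOENT)  // Interrupted, or target died
                continue;
            perror("NOTIF_RECV");
            break;
        }
        
        // req->id: unique notification ID
        // req->pid: pid of sandboxed process
        // req->data.nr: syscall number
        // req->data.args[]: syscall arguments
        
        resp->id = req->id;
        resp->flags = 0;
        
        if (req->data.nr == __NR_openat) {
            // Read pathname from sandboxed process memory
            char path[PATH_MAX];
            int ok = read_process_memory(req->pid, req->data.args[1],
                                         path, sizeof(path)) >= 0;
            // Confirm the target is still blocked in this syscall, so that
            // req->pid was not recycled while its memory was being read
            if (ok && ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &req->id) != 0)
                ok = 0;
            
            if (!ok) {
                resp->error = -EACCES;
            } else if (is_path_allowed(path)) {
                // Perform open on behalf of sandbox. The target's dirfd
                // (args[0]) has no meaning in this process, so the policy
                // should accept absolute paths only.
                int fd = openat(AT_FDCWD, path,
                                (int)req->data.args[2], (mode_t)req->data.args[3]);
                if (fd >= 0) {
                    // Install a duplicate of fd in the target (Linux 5.9+).
                    // The ioctl returns the fd number allocated in the target.
                    struct seccomp_notif_addfd addfd = {
                        .id = req->id,
                        .srcfd = fd,
                    };
                    int target_fd = ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
                    if (target_fd >= 0)
                        resp->val = target_fd;  // Becomes the target's openat() result
                    else
                        resp->error = -errno;
                    close(fd);
                } else {
                    resp->error = -errno;
                }
            } else {
                resp->error = -EACCES;
            }
        }
        
        // Send response (unblocks sandboxed process)
        if (ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) != 0) {
            perror("NOTIF_SEND");
        }
    }
    
    free(req);
    free(resp);
}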

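TOCTOU in User Notification

The supervisor inspects pointer arguments by reading the target's memory after the syscall has been intercepted. Another thread in the target can rewrite that memory between the supervisor's check and any later use, so a supervisor must act only on its own copy of the data and must never approve a pointer-based syscall and then let the kernel re-read the target's memory.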

Filter Flags and Options

The seccomp() system call accepts various flags that modify filter behavior:

SECCOMP_FILTER_FLAG_TSYNC:

Synchronize the filter across all threads in the thread group. Without this, each thread needs to install the filter individually:

prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC, &prog);
// Now all existing threads have the filter

SECCOMP_FILTER_FLAG_LOG:

Log every filter return action except SECCOMP_RET_ALLOW. An administrator can suppress logging of specific actions through /proc/sys/kernel/seccomp/actions_logged:

seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_LOG, &prog);
// Logged actions go to the audit log, or typically to the kernel log (dmesg) when auditd is not running

SECCOMP_FILTER_FLAG_SPEC_ALLOW:

Disable the Speculative Store Bypass mitigation that the kernel would otherwise apply to the filtered task. This can improve performance but weakens protection against that class of Spectre attack:

// Only use if performance is critical and you understand Spectre risks
seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_SPEC_ALLOW, &prog);

seccomp Filter Flags
Flag	Value	Purpose
SECCOMP_FILTER_FLAG_TSYNC	1	Sync filter to all threads
SECCOMP_FILTER_FLAG_LOG	2	Enable logging
SECCOMP_FILTER_FLAG_SPEC_ALLOW	4	Disable Spectre mitigations
SECCOMP_FILTER_FLAG_NEW_LISTENER	8	Return notification fd
SECCOMP_FILTER_FLAG_TSYNC_ESRCH	16	Return ESRCH instead of a thread ID if sync fails
SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV	32	Target waits killably once the supervisor has received the notification
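
Not every combination of these flags is accepted. In particular, the kernel rejects TSYNC together with NEW_LISTENER unless TSYNC_ESRCH is also set, because the return value would otherwise be ambiguous between a thread ID and a file descriptor. The helper below is a small sketch of a pre-flight check built on that rule; filter_flags_plausible is a made-up name, and the fallback #defines simply repeat the values from the table for older headers.

```c
#include <stdbool.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif
#ifndef SECCOMP_FILTER_FLAG_TSYNC_ESRCH
#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
#endif
#ifndef SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV
#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
#endif

/* Returns false for flag sets that use unknown bits or that combine
 * TSYNC and NEW_LISTENER without TSYNC_ESRCH. */
static bool filter_flags_plausible(unsigned long flags)
{
    const unsigned long known =
        SECCOMP_FILTER_FLAG_TSYNC | SECCOMP_FILTER_FLAG_LOG |
        SECCOMP_FILTER_FLAG_SPEC_ALLOW | SECCOMP_FILTER_FLAG_NEW_LISTENER |
        SECCOMP_FILTER_FLAG_TSYNC_ESRCH |
        SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV;

    if (flags & ~known)
        return false;
    if ((flags & SECCOMP_FILTER_FLAG_TSYNC) &&
        (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) &&
        !(flags & SECCOMP_FILTER_FLAG_TSYNC_ESRCH))
        return false;
    return true;
}
```

This check is advisory only. The running kernel remains the authority and may reject flags it does not implement.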

seccomp Attribute Operations:

Recent kernels also let a process query what the seccomp implementation supports:

struct seccomp_notif_sizes sizes;

// Get required allocation sizes for notification structures
seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);
printf("notif: %hu, notif_resp: %hu, data: %hu\n",
       sizes.seccomp_notif, sizes.seccomp_notif_resp, 
       sizes.seccomp_data);

// Test whether the running kernel supports a given action.
// The argument is a pointer to the action value being asked about.
__u32 action = SECCOMP_RET_USER_NOTIF;
if (seccomp(SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0) {
    // Supported; an unsupported action fails with EOPNOTSUPP
}
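
A program that wants to fall back gracefully on older kernels can wrap this probe in a helper. The sketch below assumes the raw syscall() interface, since glibc has no seccomp() wrapper; action_available is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Returns 1 if the kernel supports the action, 0 if it reports
 * EOPNOTSUPP, and -1 if the probe itself could not be performed
 * (for example, the operation is unknown to an old kernel). */
static int action_available(uint32_t action)
{
    if (syscall(SYS_seccomp, SECCOMP_GET_ACTION_AVAIL, 0, &action) == 0)
        return 1;
    return errno == EOPNOTSUPP ? 0 : -1;
}
```

A caller might prefer SECCOMP_RET_KILL_PROCESS when action_available() reports it and fall back to SECCOMP_RET_KILL_THREAD otherwise.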

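TSYNC for Multi-Threaded Applications

A filter installed without TSYNC applies only to the calling thread and to threads it creates afterwards. In a process that is already multi-threaded, pass SECCOMP_FILTER_FLAG_TSYNC so that no existing thread is left unfiltered.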

Common Patterns and Best Practices

Production seccomp implementations follow established patterns that balance security with reliability:

Pattern: Privileged Setup, Then Sandbox:

Perform all privileged operations before installing the filter:

int main() {
    // Phase 1: Privileged setup
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    bind(listen_fd, ...);  // Bind to port 80
    listen(listen_fd, 100);
    
    // Open config files, allocate resources, etc.
    config_t *cfg = load_config("/etc/myservice.conf");
    
    // Phase 2: Drop privileges
    drop_root_privileges();
    
    // Phase 3: Install seccomp filter
    install_seccomp_filter();
    
    // Phase 4: Run main loop (now sandboxed)
    while (1) {
        int conn = accept(listen_fd, ...);
        handle_connection(conn);  // Cannot open files, bind ports, etc.
    }
}

Pattern: Fail-Open Development, Fail-Closed Production:

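void install_filter(bool strict_mode) {
    scmp_filter_ctx ctx = seccomp_init(
        strict_mode ? SCMP_ACT_KILL_PROCESS : SCMP_ACT_LOG
    );
    if (ctx == NULL)
        exit(1);
    // ... rules ...
    if (seccomp_load(ctx) != 0)
        exit(1);
    seccomp_release(ctx);  // The loaded filter stays in the kernel
}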

// Development: blocked syscalls are logged, not killed
install_filter(false);
// Production: blocked syscalls terminate process
install_filter(true);

seccomp Best Practices

•Always validate architecture — First check in every filter; prevents x32/32-bit bypass.
•Use NO_NEW_PRIVS — Required for unprivileged install; prevents setuid-based escape.
•Use allowlists — Default deny with explicit allow is more secure than denylist.
•Use libseccomp — Manual BPF is error-prone and non-portable.
•Use TSYNC — Ensure all threads get the filter simultaneously.
•Test both paths — Verify allowed syscalls work AND blocked syscalls fail.
•Log during development — Use LOG action to discover needed syscalls.
•Handle x32 ABI — Either support it explicitly or block it entirely.
•Audit filter size — Large filters have performance overhead (though usually modest).
•Version control filters — Filter changes should be reviewed like code.

Pattern: Sandbox Entry Function:

int enter_sandbox(void (*sandboxed_main)(void *), void *arg) {
    // 1. Validate we're in a clean state
    if (getuid() == 0) {
        fprintf(stderr, "Must not be root\n");
        return -1;
    }
    
    // 2. Set up additional isolation
    if (unshare(CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWIPC) != 0) {
        perror("unshare");
        // Continue anyway—seccomp is the primary protection
    }
    
    // 3. Set NO_NEW_PRIVS
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("PR_SET_NO_NEW_PRIVS");
        return -1;
    }
    
    // 4. Drop capabilities
    drop_all_capabilities();
    
    // 5. Install seccomp filter
    if (install_seccomp_filter() != 0) {
        return -1;
    }
    
    // 6. Close unnecessary file descriptors
    close_fds_above(2);
    
    // 7. Call sandboxed code
    sandboxed_main(arg);
    
    return 0;
}

Performance Considerations

Seccomp filter performance is generally excellent, but understanding the performance characteristics helps in designing efficient sandboxes.

Filter Execution Overhead:

Seccomp filters add overhead to every system call. The overhead depends on:

Filter size — More instructions means more execution time
Filter structure — Linear scans are slower than binary trees
Spectre mitigations — Speculation controls that the kernel may enable for seccomp tasks add overhead
Number of stacked filters — All filters are evaluated

Typical Overhead:

Scenario	Overhead per Syscall
No seccomp	Baseline
Small filter (20 instructions)	~50-100 ns
Medium filter (100 instructions)	~200-400 ns
Large filter (500+ instructions)	~500-1000 ns

For perspective, a typical syscall takes 100-500 ns, so overhead ranges from negligible to doubling syscall time.

Optimization Techniques:

1. Put Common Syscalls First:

In hand-written BPF, place the checks for frequently called syscalls early so the filter exits sooner. libseccomp orders the generated filter itself rather than following the order in which rules were added, so use priority hints instead:

// Rule order here does not decide placement in the generated filter
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(reboot), 0);  // Rarely called
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);    // Called constantly
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);

// Priority hints (0-255): higher-priority syscalls are checked earlier
seccomp_syscall_priority(ctx, SCMP_SYS(read), 255);
seccomp_syscall_priority(ctx, SCMP_SYS(write), 250);
seccomp_syscall_priority(ctx, SCMP_SYS(futex), 245);

2. Use Binary Search Tree Structure:

libseccomp can generate binary search trees instead of linear lists:

seccomp_attr_set(ctx, SCMP_FLTATR_CTL_OPTIMIZE, 2);
// Filter structure will be optimized for O(log n) lookup

3. Minimize Argument Checks:

Argument filtering is more expensive than syscall number checking. Only filter arguments when necessary for security.

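Performance Is Usually Not a Problem

For most workloads the per-syscall cost of a filter is small compared with the work done inside the syscall itself. Measure before optimizing.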

Summary: seccomp

We have explored Linux's seccomp facility in depth—from architecture to practical implementation patterns. Let's consolidate the key insights:

Key Takeaways

•seccomp executes in kernel context — No TOCTOU races; filter decisions are made atomically with syscall dispatch.
•BPF provides flexible filtering — Inspect syscall numbers, arguments, and architecture; return various actions.
•libseccomp simplifies development — Architecture-independent, generates correct BPF, handles edge cases.
•User notification enables broker patterns — Supervisor can perform operations on behalf of sandbox.
•Filters are inherited and append-only — Children inherit filters; can't remove once installed.
•Multiple actions for different scenarios — KILL for dangerous, ERRNO for graceful denial, TRAP for emulation.
•Architecture validation is critical — Always check arch to prevent x32/32-bit bypass.
•Performance is generally excellent — Low per-syscall overhead; optimize for frequent syscalls if needed.
