Operating Systems: Protection Mechanisms

Sandboxing

Level: Advanced

Duration: 90 mins

Topic: Protection Mechanisms

3 / 5

System Call Filtering

The Last Line of Defense

We've explored how to isolate processes through namespaces, restrict their file system view, drop credentials, and limit resources. Yet one critical attack surface remains: the system call interface.

Every interaction between a sandboxed process and the kernel occurs through system calls. Each system call is a gateway that, if misused or exploited, can potentially compromise the sandbox. System calls handle file operations, network communication, process management, memory mapping, and hundreds of other kernel functions. The kernel's attack surface is enormous, and every syscall is a potential entry point for exploitation.

System call filtering addresses this by restricting which system calls a process can invoke and how those calls can be parameterized. Even if an attacker gains arbitrary code execution within the sandboxed process, they cannot invoke dangerous system calls—dramatically limiting what they can do.

What You Will Learn

By the end of this page, you will understand why system call filtering is essential for robust sandboxing, how to design effective syscall filtering policies, the different approaches to syscall filtering, and the security implications of various policy choices. You will be able to analyze and design syscall filter policies.

The System Call Attack Surface

The system call interface represents the boundary between user space and kernel space. Every time a process needs kernel services—opening files, creating sockets, mapping memory, spawning processes—it must cross this boundary through a system call. This makes the syscall interface the most critical security boundary in the system.

Syscall Attack Categories:

System calls can enable attacks in several ways:

System Call Attack Categories
Category	Example Syscalls	Attack	Risk Level
Kernel vulnerabilities	ioctl, futex, bpf, perf_event_open	Trigger kernel bugs to escape sandbox	Critical
Privilege escalation	setuid, setgid, setgroups, capset	Regain dropped privileges	Critical
Namespace escape	setns, unshare, mount	Escape namespace isolation	Critical
File system access	open, openat, read, write, unlink	Access files outside sandbox	High
Network access	socket, connect, bind, sendto	Exfiltrate data, attack network	High
Process control	fork, execve, clone, kill	Spawn processes, affect other processes	Medium
Information leak	getdents, readlink, stat	Discover system layout	Medium
Resource exhaustion	fork, mmap, socket	Denial of service	Medium

The Kernel Bug Problem:

The most serious concern is kernel vulnerabilities. The Linux kernel comprises tens of millions of lines of code, most of it complex C. Despite extensive review, bugs exist. Each system call handler is a potential vulnerability:

ioctl — Historically the source of many vulnerabilities due to its generic interface
futex — Complex synchronization primitive with a history of bugs
bpf — Powerful programmable interface; bugs enable arbitrary kernel code execution
ptrace — Process debugging interface with complex security implications
perf_event_open — Performance monitoring; requires careful permission checking

By restricting which syscalls a process can invoke, we reduce the kernel attack surface. If a vulnerability exists in the bpf syscall handler, and our sandbox blocks bpf, the vulnerability is not reachable from our sandbox.

Defense in Depth Against Kernel Bugs

System call filtering is one of the few practical defenses against kernel vulnerabilities in sandbox escape. Other mechanisms (namespaces, capabilities) rely on the kernel functioning correctly. If the kernel has a bug, those mechanisms can be bypassed. Syscall filtering prevents the vulnerable code from being reached in the first place.

Quantifying the Attack Surface:

Linux has approximately 450 system calls (the exact count depends on architecture and kernel version). A typical sandboxed application (like a browser renderer) needs only a small fraction:

Application Type	Syscalls Needed	Attack Surface Reduction
Browser renderer	~50-80	82-89%
Image decoder	~20-30	93-96%
PDF parser	~30-50	89-93%
Network service	~70-100	78-84%

By allowing only the required syscalls, we dramatically reduce exposure to kernel bugs and limit the capabilities available to an attacker who achieves code execution.
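The reduction figures are plain arithmetic on syscall counts. A minimal sketch, assuming the approximate total of 450 quoted above; the function names are illustrative and not part of any library:

```c
#include <stdio.h>

/* Percentage of the syscall table that becomes unreachable when only
 * `allowed` of `total` syscalls pass the filter. Illustrative helper. */
double surface_reduction_percent(int total, int allowed) {
    if (total <= 0 || allowed < 0 || allowed > total)
        return 0.0;  /* reject nonsensical input */
    return 100.0 * (double)(total - allowed) / (double)total;
}

/* Print the reduction range for an application that needs between
 * `lo` and `hi` syscalls out of `total`. */
void print_reduction_range(const char *name, int total, int lo, int hi) {
    printf("%-18s %.1f%% to %.1f%%\n", name,
           surface_reduction_percent(total, hi),
           surface_reduction_percent(total, lo));
}
```

For example, `print_reduction_range("Image decoder", 450, 20, 30)` reports the range for the second table row. A raw count treats every syscall as equally risky, which is not true (a single ioctl handler can hide more code than dozens of simple calls), so the percentage is only a rough indicator.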

System Call Filtering Approaches

There are several technical approaches to filtering system calls, each with different characteristics:

1. Ptrace-Based Filtering:

The oldest approach uses ptrace() to intercept system calls in the traced process. A monitor process runs alongside the sandboxed process, receiving notifications for each syscall:

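// Monitor process (x86-64; ORIG_RAX comes from <sys/reg.h>).
// Simplified: signal-delivery stops are not distinguished from syscall stops.
int status;
int at_entry = 1;  // syscall stops alternate: entry, exit, entry, ...
while (1) {
    // Resume the child until its next syscall entry or exit stop
    ptrace(PTRACE_SYSCALL, child, 0, 0);
    waitpid(child, &status, 0);
    if (WIFEXITED(status) || WIFSIGNALED(status))
        break;  // child is gone

    if (at_entry) {
        // Get syscall number
        long nr = ptrace(PTRACE_PEEKUSER, child,
                         ORIG_RAX * sizeof(long), 0);

        // Check policy
        if (!is_allowed(nr)) {
            // Block syscall by changing it to -1 (invalid);
            // the kernel then fails it with ENOSYS
            ptrace(PTRACE_POKEUSER, child,
                   ORIG_RAX * sizeof(long), -1);
        }
    }
    at_entry = !at_entry;
}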

Ptrace Advantages

•Available on most Unix-like systems
•Can inspect/modify arguments
•Can emulate syscalls entirely
•Portable across kernel versions
•Needs no dedicated filtering support in the kernel (only ptrace itself)

Ptrace Disadvantages

•Extremely high overhead on syscall-heavy workloads (2-10x slowdown)
•TOCTOU races between check and execution
•Complex, error-prone implementation
•Monitor is another process to attack
•Not composable with debugging

2. Kernel-Based Filtering (seccomp):

Modern Linux provides seccomp (secure computing mode), which moves syscall filtering into the kernel. The filter runs in kernel context, eliminating race conditions and reducing overhead dramatically.

Original seccomp (mode 1):

The original seccomp (2005) was extremely restrictive: it allowed only read, write, exit, and sigreturn. Any other syscall terminated the process. This was too restrictive for most applications.

seccomp-bpf (mode 2):

Seccomp-bpf (2012) extended seccomp with BPF (Berkeley Packet Filter) programs, allowing flexible, programmable filters. The filter inspects the syscall number and arguments and returns an action:

// Actions that seccomp filter can return:
#define SECCOMP_RET_KILL_PROCESS  0x80000000  // Kill entire process
#define SECCOMP_RET_KILL_THREAD   0x00000000  // Kill calling thread
#define SECCOMP_RET_TRAP          0x00030000  // Send SIGSYS signal
#define SECCOMP_RET_ERRNO         0x00050000  // Return errno value
#define SECCOMP_RET_USER_NOTIF    0x7fc00000  // Notify userspace
#define SECCOMP_RET_TRACE         0x7ff00000  // Notify ptrace tracer
#define SECCOMP_RET_LOG           0x7ffc0000  // Log and allow
#define SECCOMP_RET_ALLOW         0x7fff0000  // Allow syscall
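A filter's 32-bit return value packs the action into the upper 16 bits and optional data (such as an errno) into the lower 16. As a small sketch of how the two parts separate, using the `SECCOMP_RET_ACTION_FULL` and `SECCOMP_RET_DATA` masks from `<linux/seccomp.h>` (the helper names are made up for illustration):

```c
#include <stdint.h>
#include <linux/seccomp.h>

/* Action part of a filter return value (upper 16 bits). */
uint32_t seccomp_action(uint32_t ret) {
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Data part of a filter return value (lower 16 bits), e.g. the errno. */
uint16_t seccomp_data_bits(uint32_t ret) {
    return (uint16_t)(ret & SECCOMP_RET_DATA);
}

/* Human-readable name of the action, for logging or debugging. */
const char *seccomp_action_name(uint32_t ret) {
    switch (seccomp_action(ret)) {
    case SECCOMP_RET_KILL_PROCESS: return "KILL_PROCESS";
    case SECCOMP_RET_KILL_THREAD:  return "KILL_THREAD";
    case SECCOMP_RET_TRAP:         return "TRAP";
    case SECCOMP_RET_ERRNO:        return "ERRNO";
    case SECCOMP_RET_USER_NOTIF:   return "USER_NOTIF";
    case SECCOMP_RET_TRACE:        return "TRACE";
    case SECCOMP_RET_LOG:          return "LOG";
    case SECCOMP_RET_ALLOW:        return "ALLOW";
    default:                       return "unknown";
    }
}
```

For instance, `SECCOMP_RET_ERRNO | 13` decodes to the action `ERRNO` with data 13.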

3. Mandatory Access Control (SELinux, AppArmor):

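MAC systems restrict many of the same operations as part of their broader policy. SELinux and AppArmor do not filter by syscall number; they hook the kernel's internal operations on files, sockets, capabilities, and other objects, so a rule applies no matter which syscall requests the operation. This is typically coarser-grained than seccomp and does not reduce the set of syscall handlers a process can reach.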

4. Syscall Interposition via Library:

Library-based approaches (LD_PRELOAD) intercept library calls before they become syscalls. This is easy to implement but easy to bypass (direct syscall instructions, statically linked binaries).

Comparison Summary:

Syscall Filtering Approach Comparison
Approach	Overhead	Security	Flexibility	Best Use
Ptrace	Very High (2-10x)	TOCTOU vulnerable	Full emulation possible	Legacy/compatibility
Seccomp-bpf	Very Low (1-2%)	Kernel-enforced	Per-syscall, arg inspection	Production sandboxing
MAC (SELinux)	Low (5-10%)	Kernel-enforced	Coarse-grained	System-wide policy
Library interposition	Low	Bypassable	Library-level control	Debugging only

Policy Design Strategies

Designing an effective syscall filter policy requires balancing security (blocking as much as possible) with functionality (allowing the application to work). There are two fundamental approaches:

Allowlist (Whitelist) Approach:

Start with everything blocked and explicitly allow only required syscalls:

# Pseudo-policy: default deny, explicit allow
default: DENY
allow: read, write, close, fstat, mmap, munmap, ...

Advantages:

Minimizes attack surface by default
New syscalls are blocked automatically
Forces explicit justification for each allowed syscall

Denylist (Blacklist) Approach:

Start with everything allowed and explicitly block dangerous syscalls:

# Pseudo-policy: default allow, explicit deny  
default: ALLOW
deny: mount, umount, reboot, kexec, ptrace, ...

Advantages:

Easier to implement initially
Less likely to break functionality

Always Prefer Allowlists

Security best practice strongly favors allowlists. Denylists are prone to omissions—if you forget to block a dangerous syscall, you have a vulnerability. Allowlists fail safe: if you forget to allow a syscall, the application breaks (visible) rather than being insecure (invisible). New kernel syscalls are automatically blocked.
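The fail-safe difference shows up as soon as a syscall appears that neither list mentions. A minimal sketch in plain C (no kernel involvement; the `policy` struct and the sample lists are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* A policy names some syscalls explicitly; everything else gets the
 * opposite verdict. All names here are illustrative. */
struct policy {
    const long *listed;          /* syscalls named by the policy */
    size_t      count;
    bool        listed_allowed;  /* true = allowlist, false = denylist */
};

/* Returns true when the policy permits syscall number nr. */
bool policy_permits(const struct policy *p, long nr) {
    for (size_t i = 0; i < p->count; i++)
        if (p->listed[i] == nr)
            return p->listed_allowed;
    return !p->listed_allowed;   /* unlisted: the policy's default */
}

static const long allowed[] = { SYS_read, SYS_write, SYS_close, SYS_exit_group };
static const long denied[]  = { SYS_mount, SYS_reboot, SYS_ptrace };

const struct policy allowlist = { allowed, sizeof allowed / sizeof allowed[0], true };
const struct policy denylist  = { denied,  sizeof denied  / sizeof denied[0],  false };
```

Here `bpf` was forgotten in both lists: `policy_permits(&denylist, SYS_bpf)` is true (a silent hole), while `policy_permits(&allowlist, SYS_bpf)` is false. The same holds for any syscall number a future kernel adds.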

Developing an Allowlist Policy:

Creating an allowlist requires understanding what syscalls the application actually needs. Several approaches help:

1. Strace Analysis:

Run the application under strace to observe syscall usage:

# Record all syscalls during normal operation
strace -c -f program 2>&1 | tee syscalls.log

# Output shows syscall counts:
# % time     calls  syscall
# 25.00       100  read
# 20.00        80  write
# 15.00        60  mmap
# ...

2. Seccomp Log Mode:

Use SECCOMP_RET_LOG for syscalls you're unsure about—the kernel logs but doesn't block:

// Log a syscall you are unsure about during development (Linux 4.14+).
// __NR_unknown_syscall is a placeholder for a real __NR_* constant.
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unknown_syscall, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),

3. Iterative Refinement:

Start with a minimal set, run the application, observe what fails, add needed syscalls, repeat:

Create minimal allowlist (exit, read, write)
Run application
Observe failure (SIGSYS or EPERM)
Identify blocked syscall
Evaluate if syscall is necessary or indicates unwanted behavior
Add to allowlist if legitimate, or redesign if not
Repeat until application works correctly
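The first steps of this loop could look roughly like the sketch below: a starting allowlist (read, write, exit_group) installed in a forked child, which then probes one syscall that is not on the list. It assumes x86-64 or ARM64 Linux with seccomp filtering available; the function names and the choice of getpid as the probe are illustrative only.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Two instructions: return ALLOW if the accumulator equals nr,
 * otherwise fall through to the next check. */
#define ALLOW(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Install the starting allowlist. Returns 0 on success, -1 on failure. */
int install_minimal_allowlist(void) {
    struct sock_filter insns[] = {
        /* Reject foreign architectures before looking at the number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW(__NR_read),
        ALLOW(__NR_write),
        ALLOW(__NR_exit_group),
        /* Default: fail with EPERM so the missing syscall is easy to spot */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Fork a child, sandbox it, and report what happened to getpid.
 * Returns 0 if it was rejected with EPERM, 1 if it was not rejected,
 * 2 if the filter could not be installed, -1 on fork/wait failure. */
int probe_getpid_blocked(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_minimal_allowlist() != 0)
            _exit(2);
        long r = syscall(SYS_getpid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A result of 0 corresponds to the "observe failure" step: getpid is the blocked syscall, and the next decision is whether it belongs on the allowlist. Because x32 syscall numbers never match the listed constants, this default-deny filter rejects them as well.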

Common Syscall Groups:

Syscalls can be grouped by function to help policy design:

Syscall Groups and Sandboxing Considerations
Group	Examples	Typical Policy
Basic I/O	read, write, close	Usually allowed
Memory	mmap, mprotect, brk, munmap	Usually allowed, may restrict RWX mappings
File metadata	fstat, stat, lstat	Usually allowed, may restrict paths
File open	open, openat	Restrict or broker through monitor
Directory	getdents, readdir	May restrict to hide system layout
Process	getpid, gettid, getuid	Usually allowed (harmless)
Signals	rt_sigaction, sigaltstack	Usually allowed for error handling
Time	clock_gettime, gettimeofday	Usually allowed
Network	socket, connect, sendto	Block or broker for most sandboxes
Process control	fork, clone, execve	Usually blocked
Privilege	setuid, setgid, setgroups	Always blocked
Namespace	setns, unshare, mount	Always blocked in sandbox
Debug	ptrace	Always blocked
Modules	init_module, finit_module	Always blocked

Argument-Based Filtering

Simple syscall number filtering isn't always sufficient. Many syscalls have vastly different security implications depending on their arguments. Argument-based filtering allows finer-grained policies that inspect syscall arguments.

Example: ioctl Filtering:

The ioctl syscall is generic—thousands of different operations share one syscall number, distinguished by the request code argument:

// Dangerous ioctl requests (examples)
ioctl(fd, TIOCSTI, &ch);      // Inject the byte in ch into the terminal's input queue
ioctl(fd, SIOCGIFCONF, &ifc); // Enumerate network interfaces (struct ifconf)

// Benign ioctl requests
ioctl(fd, TCGETS, &termios);  // Get terminal attributes
ioctl(fd, FIONREAD, &count);  // Get bytes available to read

A naive policy blocking all ioctl would break many applications. A sophisticated policy allows only specific, audited ioctl requests:

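// BPF filter: allow ioctl only for specific requests.
// Assumes the accumulator already holds seccomp_data.nr;
// SKIP_IOCTL_CHECK is the length of the ioctl block below (6 instructions).
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, SKIP_IOCTL_CHECK),
// Load second argument (ioctl request, low 32 bits on little-endian)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),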
// Allow TCGETS
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TCGETS, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Allow FIONREAD  
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FIONREAD, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Block other ioctl
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

Example: mmap/mprotect Filtering:

The mmap and mprotect syscalls create memory mappings with specified protections. Allowing PROT_EXEC with PROT_WRITE enables JIT compilation but is dangerous (attackers can write shellcode then execute it):

// Restrict mprotect: disallow creating WX memory.
// SKIP is the number of instructions below (4); mmap needs an equivalent check.
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mprotect, 0, SKIP),
// Load third argument (prot flags)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
// Check if both WRITE and EXEC are set
BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PROT_WRITE | PROT_EXEC),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PROT_WRITE | PROT_EXEC, 0, 1),
// Block WX mappings
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
// The accumulator no longer holds the syscall number at this point;
// reload seccomp_data.nr before any further syscall comparisons.

Example: socket Type Filtering:

The socket syscall can create various socket types. Restricting to specific types limits network capability:

// Only allow AF_UNIX sockets (local IPC only)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, SKIP),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

64-bit Argument Handling

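On 64-bit systems, syscall arguments are 64-bit, but classic BPF can only load 32-bit values. struct seccomp_data stores every argument as a 64-bit value on all architectures, so a filter that compares a pointer-sized or long-sized argument must load and check both 32-bit halves. Checking only the low half lets an attacker pass a value whose upper bits differ from what the policy expected. On little-endian machines such as x86-64, the low half sits at the offset of args[n] and the high half 4 bytes after it.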
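One way to express the two-half comparison is a pair of offset macros plus an instruction sequence that allows the call only when both halves match. This is a sketch under the assumption that the accumulator may be overwritten at this point in the filter; the macro and function names are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Byte offsets of the two 32-bit halves of args[n] in struct seccomp_data.
 * Which half comes first depends on the machine's byte order. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI(n) (ARG_LO(n) + 4)
#else
#define ARG_HI(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_LO(n) (ARG_HI(n) + 4)
#endif

/* Five instructions: return ALLOW only when args[n] equals the full 64-bit
 * constant v; on any mismatch, continue after the sequence. */
#define ALLOW_IF_ARG64_EQ(n, v) \
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(n)), \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)((uint64_t)(v) >> 32), 0, 3), \
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(n)), \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)(v), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Plain-C mirror of the two strategies, to show what a low-half-only
 * comparison would wrongly accept. */
int low_half_matches(uint64_t arg, uint32_t expected) {
    return (uint32_t)arg == expected;
}

int full_value_matches(uint64_t arg, uint64_t expected) {
    return arg == expected;
}
```

With these helpers, the value 0x100000001 passes `low_half_matches(..., 1)` but fails `full_value_matches(..., 1)`. For arguments the kernel itself truncates to 32 bits (an `int` flag, for example), the low half is all that matters.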

TOCTOU Considerations:

A seccomp-bpf filter sees only the raw argument values at the time of the syscall. It cannot dereference pointers, so for an argument such as a filename it sees an address, not the string. Any mechanism that does read the pointed-to memory before the kernel uses it (a ptrace monitor, or a supervisor handling SECCOMP_RET_USER_NOTIF) faces a TOCTOU (time-of-check-to-time-of-use) vulnerability:

Monitor reads the pointed-to memory and sees a legitimate filename
Another thread modifies the pointed-to memory
Kernel uses the modified (malicious) filename

For this reason, seccomp-bpf policies should filter only on numeric arguments, never on the content of pointed-to memory. To filter on file paths, use a broker pattern: the sandboxed process sends the path to a broker via IPC, and the broker validates the path and performs the operation itself.
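On the broker side, validation works on the broker's own copy of the path, so the sandboxed process cannot change it after the check. A deliberately simple sketch of such a check follows; the function name and the rule (absolute path below a fixed root, no ".." components) are assumptions for illustration, and `root` is expected to have no trailing slash:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Broker-side check on a path requested by the sandboxed process.
 * Accepts only paths below `root` that contain no ".." component. */
bool broker_path_ok(const char *root, const char *path) {
    size_t n = strlen(root);
    if (n == 0 || strncmp(path, root, n) != 0 || path[n] != '/')
        return false;                      /* not below root */
    for (const char *p = path + n; *p; ) {
        while (*p == '/')
            p++;                           /* skip separators */
        size_t len = strcspn(p, "/");      /* length of this component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;                  /* refuse parent-directory hops */
        p += len;
    }
    return true;
}
```

A string check like this does not account for symbolic links. A production broker would more likely resolve the path with kernel help (for example `openat2` with `RESOLVE_BENEATH` on Linux 5.6 and later) and pass the resulting file descriptor back to the sandbox.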

Handling Blocked System Calls

When a syscall is blocked, the filter must decide what happens. Seccomp-bpf provides several options, each with different implications:

SECCOMP_RET_KILL_PROCESS:

Terminate the entire process immediately, as if killed by a SIGSYS signal that cannot be caught (available since Linux 4.14). This is the most secure option: any attempt to use a blocked syscall terminates the sandbox:

BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

Pros: No opportunity for attacker to try alternatives, no information leak about what's blocked
Cons: Harsh—any bug or missed syscall kills the application

SECCOMP_RET_KILL_THREAD:

Terminate only the calling thread. Less disruptive than process kill, but can leave the process in an inconsistent state.

SECCOMP_RET_ERRNO:

Return an error value to the caller as if the syscall failed:

// Return EPERM (Operation not permitted)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

// Or return ENOSYS (Function not implemented)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),

EPERM: Convention for "you don't have permission to do this"
ENOSYS: Convention for "this syscall doesn't exist"—can make fingerprinting harder

Pros: Application can handle the error gracefully, better user experience
Cons: Attacker can probe to map what's blocked, might find workarounds

SECCOMP_RET_TRAP:

Deliver a SIGSYS signal to the calling thread without executing the syscall. The process can install a handler to deal with the blocked syscall:

// In seccomp filter:
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),

// In application:
void sigsys_handler(int sig, siginfo_t *info, void *ctx) {
    // info->si_syscall contains blocked syscall number
    // info->si_arch contains architecture
    // Can emulate the syscall or take other action
}

struct sigaction sa = {
    .sa_sigaction = sigsys_handler,
    .sa_flags = SA_SIGINFO,
};
sigaction(SIGSYS, &sa, NULL);

Use case: User-space emulation of blocked syscalls. The handler can implement a sandboxed version of the functionality.

SECCOMP_RET_USER_NOTIF:

Notify a supervisor process via a notification file descriptor. The supervisor can inspect the syscall, make a policy decision, and optionally respond on behalf of the sandboxed process:

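// In the sandboxed process: install the filter and get a notification fd.
// glibc has no seccomp() wrapper, so call it through syscall(). The fd must
// then be handed to the supervisor (for example over a Unix socket).
int notify_fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

// Supervisor event loop (size the buffers via SECCOMP_GET_NOTIF_SIZES)
struct seccomp_notif *req = ...;
struct seccomp_notif_resp *resp = ...;

while (1) {
    memset(req, 0, sizeof(*req));  // the kernel rejects a non-zeroed buffer
    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
    
    // Inspect req->data.nr (syscall number), req->data.args
    // Make policy decision
    
    resp->id = req->id;
    resp->val = result;  // Return value to sandbox
    resp->error = 0;     // Or a negated errno (e.g. -EPERM) if failing
    resp->flags = 0;
    
    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}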

Use case: Broker pattern—supervisor opens files, creates sockets on behalf of sandbox.

Combining Approaches

Production sandboxes often combine approaches: ALLOW for clearly safe syscalls, ERRNO for syscalls that might be called but aren't needed, USER_NOTIF for syscalls that need broker assistance, and KILL for syscalls that indicate an attack in progress (like ptrace or kexec).
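Such a tiered policy can be thought of as a mapping from syscall number to return value. The sketch below expresses that mapping in plain C rather than BPF; the particular assignments (and the ENOSYS default) are example choices, not a recommended policy:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Seccomp return value an example combined policy would assign to nr. */
uint32_t tiered_action(long nr) {
    switch (nr) {
    /* Clearly safe: allow outright */
    case SYS_read:
    case SYS_write:
    case SYS_close:
    case SYS_exit_group:
        return SECCOMP_RET_ALLOW;

    /* Needs a policy decision per call: hand to the broker */
    case SYS_openat:
    case SYS_socket:
        return SECCOMP_RET_USER_NOTIF;

    /* No legitimate use inside the sandbox: treat as an attack */
    case SYS_ptrace:
    case SYS_kexec_load:
        return SECCOMP_RET_KILL_PROCESS;

    /* Everything else: fail softly so the application can carry on */
    default:
        return SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA);
    }
}
```

In a real filter the same tiers become runs of BPF comparisons, each ending in the corresponding return instruction, with the default action as the final instruction.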

Architecture and Compatibility Considerations

System call filtering must account for CPU architecture-specific issues that attackers can exploit:

Multi-Architecture Systems:

On x86-64 Linux, 32-bit programs can run under compatibility mode, and they use different syscall numbers. A filter designed for 64-bit syscalls might not cover 32-bit syscalls, creating an escape path.

// Always check architecture in seccomp filter!
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // Kill if not x86_64

The x32 ABI Attack:

Linux x86-64 supports an "x32" ABI: 32-bit pointers with 64-bit registers. The x32 syscall numbers have bit 30 set (__X32_SYSCALL_BIT, 0x40000000). x32 calls report the same AUDIT_ARCH_X86_64 value as native 64-bit calls, so the architecture check alone does not catch them. An attacker might invoke x32 syscalls to bypass filters designed for regular x86-64:

// Block x32 ABI (bit 30 set in syscall number)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __X32_SYSCALL_BIT, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // Kill x32 syscalls

Critical: Always Validate Architecture

Every seccomp filter MUST start by checking the architecture, and on x86-64 it must also deal with x32 syscall numbers. Without these checks, attackers on x86-64 can use 32-bit or x32 syscalls to invoke different syscall numbers that your filter doesn't cover. This is a well-documented way to bypass an otherwise sound filter.
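The two checks answer different questions, which the following plain-C sketch makes explicit. It classifies a call from the `arch` and `nr` fields a filter would see; the enum and function names are illustrative, and the fallback definition of `__X32_SYSCALL_BIT` is there only for builds where the x86 header is not pulled in:

```c
#include <stdint.h>
#include <linux/audit.h>

#ifndef __X32_SYSCALL_BIT
#define __X32_SYSCALL_BIT 0x40000000  /* normally from <asm/unistd.h> on x86 */
#endif

enum abi { ABI_X86_64, ABI_X32, ABI_I386, ABI_OTHER };

/* Classify a syscall from the arch and nr fields of struct seccomp_data. */
enum abi classify_abi(uint32_t arch, int nr) {
    if (arch == AUDIT_ARCH_I386)
        return ABI_I386;             /* 32-bit compatibility entry point */
    if (arch != AUDIT_ARCH_X86_64)
        return ABI_OTHER;
    /* x32 shares AUDIT_ARCH_X86_64; only bit 30 of nr tells it apart */
    return (nr & __X32_SYSCALL_BIT) ? ABI_X32 : ABI_X86_64;
}
```

A filter written against the native x86-64 syscall table should continue only for `ABI_X86_64` and apply its kill action to the other three cases.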

Syscall Number Portability:

Syscall numbers differ between architectures. A filter written with x86-64 numbers won't work on ARM64. Use architecture-specific headers or build filters at runtime:

// Use syscall number constants from headers
#include <sys/syscall.h>

// SYS_open is defined differently per architecture
// On x86-64: SYS_open = 2
// On ARM64: SYS_openat = 56 (no SYS_open)
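Because legacy syscalls are simply absent from some architectures' headers, a portable policy source usually guards them with the preprocessor. A small sketch, with an illustrative table and helper name:

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Syscalls that can open a file by path on the architecture being compiled
 * for. Legacy entries appear only where the headers define them. */
static const long open_family[] = {
#ifdef SYS_open
    SYS_open,       /* legacy; absent on ARM64 */
#endif
#ifdef SYS_creat
    SYS_creat,      /* legacy; absent on ARM64 */
#endif
    SYS_openat,
#ifdef SYS_openat2
    SYS_openat2,    /* only with Linux 5.6+ headers */
#endif
};

const size_t open_family_count = sizeof(open_family) / sizeof(open_family[0]);

/* Returns 1 if nr is one of the path-opening syscalls listed above. */
int is_open_family(long nr) {
    for (size_t i = 0; i < open_family_count; i++)
        if (open_family[i] == nr)
            return 1;
    return 0;
}
```

The same source then compiles on x86-64 (four entries with recent headers) and ARM64 (two entries) without hard-coded numbers.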

Modern vs. Legacy Syscalls:

Some syscalls have been superseded by newer versions:

open → openat (relative paths)
stat → fstatat
access → faccessat

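The old versions might still work, so a denylist must name every variant, including newer ones such as openat2. (An allowlist rejects unlisted variants automatically.)

// Block both open and openat (accumulator holds the syscall number)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 1, 0),    // open: jump to the ERRNO return
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),  // openat: fall through; else skip it
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),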

Library Wrapper Assumptions:

C library functions might use different syscalls than expected:

printf → write (expected)
malloc → brk or mmap (implementation detail)
sleep → nanosleep or clock_nanosleep

Test on target systems to ensure the actual syscalls are allowed.

Real-World Syscall Policies

Examining real-world syscall policies helps understand how production systems balance security and functionality.

Chrome Renderer Sandbox:

Chrome's renderer sandbox is one of the most restrictive production sandboxes:

# Chrome renderer allowed syscalls (simplified)
exit_group, exit, read, write, close, fstat, mmap, mprotect, munmap,
brk, rt_sigaction, rt_sigprocmask, rt_sigreturn, pread64, pwrite64,
lseek, futex, poll, recvmsg, sendmsg, socketpair, shutdown,
prctl, clock_gettime, gettimeofday, clone, sigaltstack,
get_robust_list, set_robust_list, restart_syscall, getrandom,
madvise, fcntl, nanosleep, getpid, gettid, tgkill, ...

# Notably NOT allowed:
open, openat, socket, connect, bind, fork, execve, ptrace, mount,
chmod, chown, link, unlink, rename, mkdir, rmdir, ...

The renderer cannot open files, create network connections, or spawn processes. All such operations go through IPC to the browser process, which validates and performs them.

Docker Default Seccomp Profile:

Docker's default seccomp profile is structured as an allowlist (its default action returns an error), but a broad one: in effect it blocks approximately 50 syscalls from over 300 available. It's less restrictive than browser sandboxes because containers need broader functionality:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
  "syscalls": [
    {
      "names": ["accept", "accept4", "access", ...],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW",
      "args": [{"index": 0, "value": 0, "op": "SCMP_CMP_EQ"}]
    }
  ]
}

Blocked by default in Docker: acct, add_key, bpf, clock_adjtime, clock_settime, create_module, delete_module, finit_module, get_kernel_syms, get_mempolicy, init_module, ioperm, iopl, kcmp, kexec_file_load, kexec_load, keyctl, lookup_dcookie, mbind, mount, move_pages, name_to_handle_at, nfsservctl, open_by_handle_at, perf_event_open, personality, pivot_root, process_vm_readv, process_vm_writev, ptrace, query_module, quotactl, reboot, request_key, set_mempolicy, setns, settimeofday, stime, swapoff, swapon, sysfs, _sysctl, umount, umount2, unshare, uselib, userfaultfd, ustat, vm86, vm86old

Policy Restrictiveness Comparison
System	Allowed Syscalls	Approach	Compatibility
Chrome Renderer	~60-80	Strict allowlist	Breaks if needs change
Chrome GPU Process	~100-120	Allowlist + broker	Needs GPU driver syscalls
Docker default	~250	Broad allowlist (default deny)	Wide compatibility
systemd services	Variable	Configurable per-service	Admin-controlled
Firejail default	~200	Denylist	Desktop app compat

Security vs Compatibility Trade-off

More restrictive policies provide better security but risk breaking applications. Browser sandboxes can be very strict because they control both the sandbox and the sandboxed code. General-purpose container sandboxes must be more permissive to support arbitrary applications.

Testing and Debugging Syscall Filters

Testing syscall filters is critical—both to ensure security and to verify applications still work.

Testing for Security:

Negative testing — Verify that blocked syscalls actually fail:

void test_blocked_syscalls() {
    // These should fail with EPERM (or process gets killed)
    assert(syscall(__NR_ptrace, 0, 0, 0, 0) == -1);
    assert(errno == EPERM);
    
    assert(syscall(__NR_mount, 0, 0, 0, 0, 0) == -1);
    assert(errno == EPERM);
    
    // Test that the process survived (if using ERRNO action)
    printf("Blocked syscalls correctly rejected\n");
}

Architecture coverage — Test that x32 and 32-bit syscalls are handled:

// Try x32 syscall (should fail/kill)
syscall(__NR_write | 0x40000000, 1, "test", 4);

Argument bypass attempts — Test that argument filtering works:

// If ioctl is allowed only for TCGETS:
assert(ioctl(0, TCGETS, &termios) >= 0);  // Should work
assert(ioctl(0, TIOCSTI, &c) == -1);       // Should fail
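Filter logic can also be exercised without installing anything, by interpreting the BPF instructions against synthetic `struct seccomp_data` values. The sketch below handles only the instruction subset used on this page and is not a full classic-BPF implementation; `dry_run`, `dry_run_ioctl`, and the trailing stand-in instruction are assumptions made for the example. It expects a little-endian machine, as the ioctl filter above does.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Interpret a filter in user space. Supports 32-bit absolute loads, AND,
 * JA/JEQ/JGE/JSET with constants, and RET with a constant; anything else
 * is reported as a kill because its result is unknown here. */
uint32_t dry_run(const struct sock_filter *f, size_t len,
                 const struct seccomp_data *sd) {
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *i = &f[pc];
        switch (i->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (i->k > sizeof(*sd) - 4 || (i->k & 3))
                return SECCOMP_RET_KILL_PROCESS;  /* out-of-range load */
            memcpy(&acc, (const char *)sd + i->k, 4);
            break;
        case BPF_ALU | BPF_AND | BPF_K:  acc &= i->k; break;
        case BPF_JMP | BPF_JA:           pc += i->k; break;
        case BPF_JMP | BPF_JEQ | BPF_K:  pc += (acc == i->k) ? i->jt : i->jf; break;
        case BPF_JMP | BPF_JGE | BPF_K:  pc += (acc >= i->k) ? i->jt : i->jf; break;
        case BPF_JMP | BPF_JSET | BPF_K: pc += (acc & i->k) ? i->jt : i->jf; break;
        case BPF_RET | BPF_K:            return i->k;
        default:                         return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;  /* ran off the end of the program */
}

/* The ioctl policy from the argument-filtering section, offsets filled in. */
static const struct sock_filter ioctl_policy[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 6),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TCGETS, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FIONREAD, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* Stand-in for the rest of the policy (non-ioctl syscalls) */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Verdict of the policy for ioctl(fd 0, request, ...). */
uint32_t dry_run_ioctl(unsigned long request) {
    struct seccomp_data sd = { .nr = __NR_ioctl, .args = { 0, request } };
    return dry_run(ioctl_policy,
                   sizeof(ioctl_policy) / sizeof(ioctl_policy[0]), &sd);
}
```

A dry run catches jump-offset mistakes early, but it complements rather than replaces testing against the real kernel, which is the only authority on how a filter behaves.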

Debugging Filter Issues:

Use LOG action during development:

// During development, log instead of kill
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),

The kernel audit log records each logged syscall (abridged; code=0x7ffc0000 is SECCOMP_RET_LOG):

audit: type=1326 ... pid=1234 ... arch=c000003e syscall=257 ... code=0x7ffc0000

strace for syscall tracing:

# Trace selected syscalls, following child processes
strace -e trace=open,openat,socket -f ./sandboxed_app

BPF program inspection:

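// Dump the first seccomp filter of a ptrace-stopped tracee (Linux 4.4+;
// the tracer needs CAP_SYS_ADMIN and must not itself be under seccomp).
// SO_GET_FILTER applies to socket filters, not seccomp filters.
long n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);  // instruction count
struct sock_filter *insns = calloc(n, sizeof(*insns));
ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);          // copy instructions
// Then disassemble BPF instructions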

seccomp-tools for analysis:

# Install seccomp-tools (Ruby gem)
gem install seccomp-tools

# Run the program and dump the seccomp filter it installs
seccomp-tools dump ./sandboxed_app

# Disassemble BPF bytecode
seccomp-tools disasm filter.bpf

Incremental Development

Start with SECCOMP_RET_LOG to discover what syscalls your application uses, then switch to ERRNO to verify it handles rejection gracefully, and finally switch to KILL_PROCESS for production. This progression makes debugging much easier.

Summary: System Call Filtering

We have explored system call filtering as a critical layer of sandbox security. Let's consolidate the key insights:

Key Takeaways

•Syscalls are the attack surface — Every syscall is a potential kernel vulnerability entry point; filtering reduces this surface.
•Allowlists over denylists — Always use allowlists; they fail safe and automatically block new syscalls.
•Kernel-based filtering is essential — Seccomp-bpf provides low-overhead, race-free filtering in kernel context.
•Argument filtering enables precision — Filter not just syscall numbers but also arguments for fine-grained control.
•Architecture handling is critical — Always validate architecture to prevent x32/32-bit ABI bypass attacks.
•Action choice affects security/usability — KILL is safest; ERRNO allows graceful handling; USER_NOTIF enables broker pattern.
•Test both functionality and security — Verify allowed syscalls work AND blocked syscalls are actually blocked.
•Real-world policies balance trade-offs — Strictness depends on how much you control the sandboxed code.

What's Next:

With the conceptual and practical foundations of syscall filtering covered, the next page will focus on seccomp in detail—Linux's syscall filtering framework. You'll learn the BPF programming model, writing and installing filters, and advanced techniques used in production sandboxes.

Page Complete

You now understand how system call filtering protects sandboxed processes by restricting their access to kernel functionality. You can reason about policy design, understand the trade-offs between different filtering approaches, and appreciate the critical role of syscall filtering in modern sandboxing.

3 / 5

Loading learning content...

Operating SystemsProtection Mechanisms

Sandboxing

LevelAdvanced

Duration90 mins

TopicProtection Mechanisms

3 / 5

System Call Filtering

The Last Line of Defense

What You Will Learn

The System Call Attack Surface

Syscall Attack Categories:

System calls can enable attacks in several ways:

System Call Attack Categories
Category	Example Syscalls	Attack	Risk Level
Kernel vulnerabilities	ioctl, futex, bpf, perf_event_open	Trigger kernel bugs to escape sandbox	Critical
Privilege escalation	setuid, setgid, setgroups, capset	Regain dropped privileges	Critical
Namespace escape	setns, unshare, mount	Escape namespace isolation	Critical
File system access	open, openat, read, write, unlink	Access files outside sandbox	High
Network access	socket, connect, bind, sendto	Exfiltrate data, attack network	High
Process control	fork, execve, clone, kill	Spawn processes, affect other processes	Medium
Information leak	getdents, readlink, stat	Discover system layout	Medium
Resource exhaustion	fork, mmap, socket	Denial of service	Medium

The Kernel Bug Problem:

ioctl — Historically the source of many vulnerabilities due to its generic interface
futex — Complex synchronization primitive with a history of bugs
bpf — Powerful programmable interface; bugs enable arbitrary kernel code execution
ptrace — Process debugging interface with complex security implications
perf_event_open — Performance monitoring; requires careful permission checking

Defense in Depth Against Kernel Bugs

Quantifying the Attack Surface:

Linux has approximately 450 system calls. A typical sandboxed application (like a browser renderer) needs only a small fraction:

Application Type	Syscalls Needed	Attack Surface Reduction
Browser renderer	~50-80	80-85%
Image decoder	~20-30	93-95%
PDF parser	~30-50	88-93%
Network service	~70-100	75-85%

By allowing only the required syscalls, we dramatically reduce exposure to kernel bugs and limit the capabilities available to an attacker who achieves code execution.

System Call Filtering Approaches

There are several technical approaches to filtering system calls, each with different characteristics:

1. Ptrace-Based Filtering:

The oldest approach uses ptrace() to intercept system calls in the traced process. A monitor process runs alongside the sandboxed process, receiving notifications for each syscall:

// Monitor process
while (1) {
    // Wait for syscall entry
    ptrace(PTRACE_SYSCALL, child, 0, 0);
    waitpid(child, &status, 0);
    
    // Get syscall number
    long syscall = ptrace(PTRACE_PEEKUSER, child, 
                          ORIG_RAX * sizeof(long), 0);
    
    // Check policy
    if (!is_allowed(syscall)) {
        // Block syscall by changing it to -1 (invalid)
        ptrace(PTRACE_POKEUSER, child, 
               ORIG_RAX * sizeof(long), -1);
    }
}

Ptrace Advantages

•Works on all Unix systems
•Can inspect/modify arguments
•Can emulate syscalls entirely
•Portable across kernel versions
•No kernel support required

Ptrace Disadvantages

•Extremely high overhead (2-10x slowdown)
•TOCTOU races between check and execution
•Complex, error-prone implementation
•Monitor is another process to attack
•Not composable with debugging

2. Kernel-Based Filtering (seccomp):

Original seccomp (mode 1):

seccomp-bpf (mode 2):

Seccomp-bpf (2012) extended seccomp with BPF (Berkeley Packet Filter) programs, allowing flexible, programmable filters. The filter inspects the syscall number and arguments and returns an action:

// Actions that seccomp filter can return:
#define SECCOMP_RET_KILL_PROCESS  0x80000000  // Kill entire process
#define SECCOMP_RET_KILL_THREAD   0x00000000  // Kill calling thread
#define SECCOMP_RET_TRAP          0x00030000  // Send SIGSYS signal
#define SECCOMP_RET_ERRNO         0x00050000  // Return errno value
#define SECCOMP_RET_USER_NOTIF    0x7fc00000  // Notify userspace
#define SECCOMP_RET_TRACE         0x7ff00000  // Notify ptrace tracer
#define SECCOMP_RET_LOG           0x7ffc0000  // Log and allow
#define SECCOMP_RET_ALLOW         0x7fff0000  // Allow syscall

3. Mandatory Access Control (SELinux, AppArmor):

MAC systems can also filter syscalls as part of their broader policy. SELinux policies can restrict which syscalls a domain can use, though this is typically coarser-grained than seccomp.

4. Syscall Interposition via Library:

Library-based approaches (LD_PRELOAD) intercept library calls before they become syscalls. This is easy to implement but easy to bypass (direct syscall instructions, statically linked binaries).

Comparison Summary:

Syscall Filtering Approach Comparison
Approach	Overhead	Security	Flexibility	Best Use
Ptrace	Very High (2-10x)	TOCTOU vulnerable	Full emulation possible	Legacy/compatibility
Seccomp-bpf	Very Low (1-2%)	Kernel-enforced	Per-syscall, arg inspection	Production sandboxing
MAC (SELinux)	Low (5-10%)	Kernel-enforced	Coarse-grained	System-wide policy
Library interposition	Low	Bypassable	Library-level control	Debugging only

Policy Design Strategies

Designing an effective syscall filter policy requires balancing security (blocking as much as possible) with functionality (allowing the application to work). There are two fundamental approaches:

Allowlist (Whitelist) Approach:

Start with everything blocked and explicitly allow only required syscalls:

# Pseudo-policy: default deny, explicit allow
default: DENY
allow: read, write, close, fstat, mmap, munmap, ...

Advantages:

Minimizes attack surface by default
New syscalls are blocked automatically
Forces explicit justification for each allowed syscall

Denylist (Blacklist) Approach:

Start with everything allowed and explicitly block dangerous syscalls:

# Pseudo-policy: default allow, explicit deny  
default: ALLOW
deny: mount, umount, reboot, kexec, ptrace, ...

Advantages:

Easier to implement initially
Less likely to break functionality
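
The difference between the two approaches shows up most clearly when a syscall appears that the policy author never considered. The sketch below models both policies as plain lookups in user space. The lists are hypothetical and far shorter than a real policy, and the function names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Hypothetical example policies; real lists would be far longer. */
static const long allowlist[] = { SYS_read, SYS_write, SYS_close, SYS_exit_group };
static const long denylist[]  = { SYS_mount, SYS_reboot, SYS_ptrace };

static bool contains(const long *list, size_t n, long nr)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == nr)
            return true;
    return false;
}

/* Default deny: only listed syscalls pass. */
static bool allowlist_permits(long nr)
{
    return contains(allowlist, sizeof allowlist / sizeof allowlist[0], nr);
}

/* Default allow: everything passes unless it is listed. */
static bool denylist_permits(long nr)
{
    return !contains(denylist, sizeof denylist / sizeof denylist[0], nr);
}
```

A syscall number that neither list mentions (for instance, one added by a future kernel) is rejected by allowlist_permits and accepted by denylist_permits.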

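Always Prefer Allowlists

An allowlist fails safe: a syscall added by a newer kernel, or one the policy author forgot to consider, stays blocked until someone justifies allowing it. A denylist fails open in both cases.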

Developing an Allowlist Policy:

Creating an allowlist requires understanding what syscalls the application actually needs. Several approaches help:

1. Strace Analysis:

Run the application under strace to observe syscall usage:

# Record all syscalls during normal operation
strace -c -f program 2>&1 | tee syscalls.log

# Output shows syscall counts:
# % time     calls  syscall
# 25.00       100  read
# 20.00        80  write
# 15.00        60  mmap
# ...

2. Seccomp Log Mode:

Use SECCOMP_RET_LOG for syscalls you're unsure about—the kernel logs but doesn't block:

// Log unknown syscalls during development
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unknown_syscall, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),

3. Iterative Refinement:

Start with a minimal set, run the application, observe what fails, add needed syscalls, repeat:

Create minimal allowlist (exit, read, write)
Run application
Observe failure (SIGSYS or EPERM)
Identify blocked syscall
Evaluate if syscall is necessary or indicates unwanted behavior
Add to allowlist if legitimate, or redesign if not
Repeat until application works correctly

Common Syscall Groups:

Syscalls can be grouped by function to help policy design:

Syscall Groups and Sandboxing Considerations
Group	Examples	Typical Policy
Basic I/O	read, write, close	Usually allowed
Memory	mmap, mprotect, brk, munmap	Usually allowed, may restrict RWX mappings
File metadata	fstat, stat, lstat	Usually allowed, may restrict paths
File open	open, openat	Restrict or broker through monitor
Directory	getdents, readdir	May restrict to hide system layout
Process	getpid, gettid, getuid	Usually allowed (harmless)
Signals	rt_sigaction, sigaltstack	Usually allowed for error handling
Time	clock_gettime, gettimeofday	Usually allowed
Network	socket, connect, sendto	Block or broker for most sandboxes
Process control	fork, clone, execve	Usually blocked
Privilege	setuid, setgid, setgroups	Always blocked
Namespace	setns, unshare, mount	Always blocked in sandbox
Debug	ptrace	Always blocked
Modules	init_module, finit_module	Always blocked

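Argument-Based Filtering

Allowing or denying whole syscalls is often too coarse. Seccomp-bpf filters can also inspect the argument values in struct seccomp_data, although they cannot dereference pointers.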

Example: ioctl Filtering:

The ioctl syscall is generic—thousands of different operations share one syscall number, distinguished by the request code argument:

// Dangerous ioctl requests (examples)
ioctl(fd, TIOCSTI, "\n");     // Inject input into terminal (takes a pointer to the character)
ioctl(fd, SIOCGIFCONF, buf);  // Enumerate network interfaces

// Benign ioctl requests
ioctl(fd, TCGETS, &termios);  // Get terminal attributes
ioctl(fd, FIONREAD, &count);  // Get bytes available to read

A naive policy blocking all ioctl would break many applications. A sophisticated policy allows only specific, audited ioctl requests:

// BPF filter: allow ioctl only for specific requests
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, SKIP_IOCTL_CHECK),
// Load second argument (ioctl request)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
// Allow TCGETS
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TCGETS, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Allow FIONREAD  
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FIONREAD, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Block other ioctl
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

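Example: mmap/mprotect Filtering:

Memory that is both writable and executable makes code injection easier, so many policies reject such requests. The check below covers mprotect; mmap takes its prot flags in the same (third) argument position: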

// Restrict mprotect: disallow creating WX memory
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mprotect, 0, SKIP),
// Load third argument (prot flags)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
// Check if both WRITE and EXEC are set
BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PROT_WRITE | PROT_EXEC),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PROT_WRITE | PROT_EXEC, 0, 1),
// Block WX mappings
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

Example: socket Type Filtering:

The socket syscall can create various socket types. Restricting to specific types limits network capability:

// Only allow AF_UNIX sockets (local IPC only)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, SKIP),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

64-bit Argument Handling
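
Classic BPF loads 32-bit words, while each entry in seccomp_data.args is 64 bits wide. A filter that compares only the low half of an argument ignores whatever the caller placed in the upper 32 bits, so exact matches need two loads. The sketch below shows one way to write that check by hand. The macros ARG_LO and ARG_HI and the value WANT are hypothetical names chosen for this example:

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Byte offsets of the low and high 32-bit halves of syscall argument i. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG_LO(i) (offsetof(struct seccomp_data, args[i]))
#define ARG_HI(i) (offsetof(struct seccomp_data, args[i]) + sizeof(uint32_t))
#else
#define ARG_LO(i) (offsetof(struct seccomp_data, args[i]) + sizeof(uint32_t))
#define ARG_HI(i) (offsetof(struct seccomp_data, args[i]))
#endif

/* Hypothetical 64-bit value that argument 1 must equal exactly. */
#define WANT 0x0000000100000002ULL

/* Fragment: allow only if args[1] == WANT, checking both halves. */
static struct sock_filter arg64_check[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(1)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)(WANT >> 32), 0, 3),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(1)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)WANT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};
```

Higher-level libraries such as libseccomp generate this kind of two-part comparison for you, which is one reason hand-written BPF is usually reserved for small or performance-critical filters.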

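TOCTOU Considerations:

Seccomp-bpf sees only raw argument values and cannot follow pointers, which is what keeps it free of time-of-check-to-time-of-use races. Any mechanism that does inspect pointed-to memory (a ptrace monitor or a user-notification supervisor) faces this race:

Filter checks pointer and sees legitimate filename
Another thread modifies the pointed-to memory
Kernel uses modified (malicious) filename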

Handling Blocked System Calls

When a syscall is blocked, the filter must decide what happens. Seccomp-bpf provides several options, each with different implications:

SECCOMP_RET_KILL_PROCESS:

Terminate the entire process immediately with SIGSYS. This is the most secure option—any attempt to use a blocked syscall terminates the sandbox:

BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

Pros: No opportunity for attacker to try alternatives, no information leak about what's blocked
Cons: Harsh—any bug or missed syscall kills the application

SECCOMP_RET_KILL_THREAD:

Terminate only the calling thread. Less disruptive than process kill, but can leave the process in an inconsistent state.

SECCOMP_RET_ERRNO:

Return an error value to the caller as if the syscall failed:

// Return EPERM (Operation not permitted)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),

// Or return ENOSYS (Function not implemented)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),

EPERM: Convention for "you don't have permission to do this"
ENOSYS: Convention for "this syscall doesn't exist"—can make fingerprinting harder

Pros: Application can handle the error gracefully, better user experience
Cons: Attacker can probe to map what's blocked, might find workarounds

SECCOMP_RET_TRAP:

Send SIGSYS signal to the process, which can install a handler to deal with the blocked syscall:

// In seccomp filter:
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),

// In application:
void sigsys_handler(int sig, siginfo_t *info, void *ctx) {
    // info->si_syscall contains blocked syscall number
    // info->si_arch contains architecture
    // Can emulate the syscall or take other action
}

struct sigaction sa = {
    .sa_sigaction = sigsys_handler,
    .sa_flags = SA_SIGINFO,
};
sigaction(SIGSYS, &sa, NULL);

Use case: User-space emulation of blocked syscalls. The handler can implement a sandboxed version of the functionality.

SECCOMP_RET_USER_NOTIF:

Notify a supervisor process via a notification file descriptor. The supervisor can inspect the syscall, make a policy decision, and optionally respond on behalf of the sandboxed process:

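// In the sandboxed process: install the filter and obtain a notification fd.
// glibc has no seccomp() wrapper, so invoke the syscall directly.
int notify_fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
// Hand notify_fd to the supervisor (e.g., over a Unix socket with SCM_RIGHTS).

// Supervisor event loop
// (buffer sizes should come from SECCOMP_GET_NOTIF_SIZES)
struct seccomp_notif *req = ...;
struct seccomp_notif_resp *resp = ...;

while (1) {
    memset(req, 0, sizeof(*req));  // Kernel rejects a non-zeroed buffer
    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);

    // Inspect req->data.nr (syscall number), req->data.args
    // Make policy decision

    resp->id = req->id;
    resp->val = result;  // Return value to sandbox
    resp->error = 0;     // Or a negated errno (e.g., -EPERM) to fail the call
    resp->flags = 0;

    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}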

Use case: Broker pattern—supervisor opens files, creates sockets on behalf of sandbox.

Combining Approaches

Architecture and Compatibility Considerations

System call filtering must account for CPU architecture-specific issues that attackers can exploit:

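Multi-Architecture Systems:

A kernel may accept syscalls from more than one ABI (for example, 32-bit int 0x80 calls on an x86-64 kernel), and each ABI numbers its syscalls differently. A filter that compares only the syscall number can therefore be bypassed by switching ABI, so check seccomp_data.arch first:

// Always check architecture in seccomp filter!
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // Kill if not x86_64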

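The x32 ABI Attack:

On x86-64, the x32 ABI reports the same AUDIT_ARCH_X86_64 value as native 64-bit code but sets bit 30 (__X32_SYSCALL_BIT, 0x40000000) in the syscall number. The architecture check alone does not catch it, so a denylist filter must reject these numbers explicitly:

// Block x32 ABI (bit 30 set in syscall number)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __X32_SYSCALL_BIT, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),  // Kill x32 syscalls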

Critical: Always Validate Architecture

Syscall Number Portability:

Syscall numbers differ between architectures. A filter written with x86-64 numbers won't work on ARM64. Use architecture-specific headers or build filters at runtime:

// Use syscall number constants from headers
#include <sys/syscall.h>

// SYS_open is defined differently per architecture
// On x86-64: SYS_open = 2
// On ARM64: SYS_openat = 56 (no SYS_open)
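
One way to avoid hard-coding numbers is to let the preprocessor include only the syscalls that the target architecture defines. The sketch below collects the path-opening family this way. The array name open_family is illustrative and the list is not exhaustive (open_by_handle_at, for example, is left out):

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Every "open a file by path" syscall that this architecture defines. */
static const long open_family[] = {
#ifdef SYS_open
    SYS_open,        /* legacy; absent on ARM64 and other newer ports */
#endif
#ifdef SYS_openat
    SYS_openat,
#endif
#ifdef SYS_openat2
    SYS_openat2,     /* only with sufficiently recent kernel headers */
#endif
#ifdef SYS_creat
    SYS_creat,       /* legacy; equivalent to open with O_CREAT|O_WRONLY|O_TRUNC */
#endif
};

static const size_t open_family_len = sizeof open_family / sizeof open_family[0];
```

A filter builder can then loop over open_family and emit one comparison per entry, so the same source produces a correct filter on both x86-64 and ARM64.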

Modern vs. Legacy Syscalls:

Some syscalls have been superseded by newer versions:

open → openat (relative paths)
stat → fstatat
access → faccessat

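Legacy syscalls remain available on architectures that define them, so a filter must cover both the old and the new version:

// Block both open and openat (accumulator holds the syscall number)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 1, 0),    // open: jump to the RET below
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),  // openat: fall through; otherwise skip the RET
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),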

Library Wrapper Assumptions:

C library functions might use different syscalls than expected:

printf → write (expected)
malloc → brk or mmap (implementation detail)
sleep → nanosleep or clock_nanosleep

Test on target systems to ensure the actual syscalls are allowed.

Real-World Syscall Policies

Examining real-world syscall policies helps understand how production systems balance security and functionality.

Chrome Renderer Sandbox:

Chrome's renderer sandbox is one of the most restrictive production sandboxes:

# Chrome renderer allowed syscalls (simplified)
exit_group, exit, read, write, close, fstat, mmap, mprotect, munmap,
brk, rt_sigaction, rt_sigprocmask, rt_sigreturn, pread64, pwrite64,
lseek, futex, poll, recvmsg, sendmsg, socketpair, shutdown,
prctl, clock_gettime, gettimeofday, clone, sigaltstack,
get_robust_list, set_robust_list, restart_syscall, getrandom,
madvise, fcntl, nanosleep, getpid, gettid, tgkill, ...

# Notably NOT allowed:
open, openat, socket, connect, bind, fork, execve, ptrace, mount,
chmod, chown, link, unlink, rename, mkdir, rmdir, ...

The renderer cannot open files, create network connections, or spawn processes. All such operations go through IPC to the browser process, which validates and performs them.

Docker Default Seccomp Profile:

Docker's default seccomp profile is an allowlist with a default action of SCMP_ACT_ERRNO. It permits most of the 300+ available syscalls and leaves approximately 50 blocked, which makes it less restrictive than browser sandboxes because containers need broader functionality:

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
  "syscalls": [
    {
      "names": ["accept", "accept4", "access", ...],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["personality"],
      "action": "SCMP_ACT_ALLOW",
      "args": [{"index": 0, "value": 0, "op": "SCMP_CMP_EQ"}]
    }
  ]
}

Policy Restrictiveness Comparison
System	Allowed Syscalls	Approach	Compatibility
Chrome Renderer	~60-80	Strict allowlist	Breaks if needs change
Chrome GPU Process	~100-120	Allowlist + broker	Needs GPU driver syscalls
Docker default	~250	Broad allowlist (default deny)	Wide compatibility
systemd services	Variable	Configurable per-service	Admin-controlled
Firejail default	~200	Denylist	Desktop app compat

Security vs Compatibility Trade-off

Testing and Debugging Syscall Filters

Testing syscall filters is critical—both to ensure security and to verify applications still work.

Testing for Security:

Negative testing — Verify that blocked syscalls actually fail:

void test_blocked_syscalls() {
    // These should fail with EPERM (or process gets killed)
    assert(syscall(__NR_ptrace, 0, 0, 0, 0) == -1);
    assert(errno == EPERM);
    
    assert(syscall(__NR_mount, 0, 0, 0, 0, 0) == -1);
    assert(errno == EPERM);
    
    // Test that the process survived (if using ERRNO action)
    printf("Blocked syscalls correctly rejected\n");
}
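
The assertions above only work when the filter returns an error. If the blocked action is KILL_PROCESS or KILL_THREAD, the test process itself would die. A common workaround is to run each probe in a child and check how it terminated. The following is a minimal sketch of that pattern. The helper name dies_with_sigsys is illustrative, and the two probes are stand-ins (a real test would install the filter and issue a blocked syscall inside the probe):

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run probe() in a child; report whether the child died from SIGSYS. */
static bool dies_with_sigsys(void (*probe)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return false;            /* fork failed: treat as "not verified" */
    if (pid == 0) {
        probe();                 /* expected to trigger the filter */
        _exit(0);                /* reached only if nothing killed us */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return false;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS;
}

/* Stand-in probes for demonstration only. */
static void probe_raise_sigsys(void) { raise(SIGSYS); }
static void probe_noop(void) { }
```

Because the parent never installs the filter, one test binary can exercise many blocked syscalls in sequence and report every failure instead of stopping at the first.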

Architecture coverage — Test that x32 and 32-bit syscalls are handled:

// Try x32 syscall (should fail/kill)
syscall(__NR_write | 0x40000000, 1, "test", 4);

Argument bypass attempts — Test that argument filtering works:

// If ioctl is allowed only for TCGETS:
assert(ioctl(0, TCGETS, &termios) >= 0);  // Should work
assert(ioctl(0, TIOCSTI, &c) == -1);       // Should fail
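
Argument logic can also be unit-tested without installing anything, by evaluating the filter in user space against hand-built seccomp_data values. The sketch below is a deliberately tiny interpreter that handles only the opcodes used on this page. It is not a substitute for the kernel's verifier, it assumes a little-endian machine, and the names eval_filter and wx_filter are illustrative:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/syscall.h>

/* Evaluate a filter against sd. Unsupported opcodes and running off the
 * end are reported as KILL_PROCESS so that gaps fail closed. */
static uint32_t eval_filter(const struct sock_filter *f, size_t len,
                            const struct seccomp_data *sd)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &f[pc];
        switch (in->code) {
        case BPF_LD | BPF_W | BPF_ABS:          /* load a 32-bit word */
            if (in->k > sizeof *sd - sizeof acc)
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)sd + in->k, sizeof acc);
            break;
        case BPF_ALU | BPF_AND | BPF_K:
            acc &= in->k;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:         /* offsets are relative to pc + 1 */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case BPF_JMP | BPF_JGE | BPF_K:
            pc += (acc >= in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:
            return in->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;
}

/* Sample filter under test: deny mprotect() requests for W+X memory. */
static const struct sock_filter wx_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mprotect, 0, 4),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
    BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PROT_WRITE | PROT_EXEC),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PROT_WRITE | PROT_EXEC, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define WX_FILTER_LEN (sizeof wx_filter / sizeof wx_filter[0])
```

Feeding eval_filter a seccomp_data with nr set to SYS_mprotect and args[2] set to PROT_READ | PROT_WRITE | PROT_EXEC should yield the ERRNO action, while the same call with only PROT_READ | PROT_WRITE should yield ALLOW.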

Debugging Filter Issues:

Use LOG action during development:

// During development, log instead of kill
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),

Kernel logs syscalls that hit the LOG action (abbreviated record; the code field carries the SECCOMP_RET_LOG value):

audit: seccomp status=log syscall=257 pid=1234 code=0x7ffc0000

strace for syscall tracing:

# Trace selected syscalls, following child processes (-f)
strace -e trace=open,openat,socket -f ./sandboxed_app

BPF program inspection:

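// Dump a tracee's seccomp filter for inspection.
// Requires CAP_SYS_ADMIN and an attached tracer; index 0 is the most recently installed filter.
// The first call returns the instruction count; the second fills the buffer.
long n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
struct sock_filter *insns = calloc(n, sizeof(*insns));
ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
// Then disassemble BPF instructions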

seccomp-tools for analysis:

# Install seccomp-tools (Ruby gem)
gem install seccomp-tools

# Run the program and dump the seccomp filter it installs
seccomp-tools dump ./sandboxed_app

# Disassemble BPF bytecode
seccomp-tools disasm filter.bpf

Incremental Development

Summary: System Call Filtering

We have explored system call filtering as a critical layer of sandbox security. Let's consolidate the key insights:

Key Takeaways

•Syscalls are the attack surface — Every syscall is a potential kernel vulnerability entry point; filtering reduces this surface.
•Allowlists over denylists — Always use allowlists; they fail safe and automatically block new syscalls.
•Kernel-based filtering is essential — Seccomp-bpf provides low-overhead, race-free filtering in kernel context.
•Argument filtering enables precision — Filter not just syscall numbers but also arguments for fine-grained control.
•Architecture handling is critical — Always validate architecture to prevent x32/32-bit ABI bypass attacks.
•Action choice affects security/usability — KILL is safest; ERRNO allows graceful handling; USER_NOTIF enables broker pattern.
•Test both functionality and security — Verify allowed syscalls work AND blocked syscalls are actually blocked.
•Real-world policies balance trade-offs — Strictness depends on how much you control the sandboxed code.

