We've explored how to isolate processes through namespaces, restrict their file system view, drop credentials, and limit resources. Yet one critical attack surface remains: the system call interface.
Every interaction between a sandboxed process and the kernel occurs through system calls. Each system call is a gateway that, if misused or exploited, can potentially compromise the sandbox. System calls handle file operations, network communication, process management, memory mapping, and hundreds of other kernel functions. The kernel's attack surface is enormous, and every syscall is a potential entry point for exploitation.
System call filtering addresses this by restricting which system calls a process can invoke and how those calls can be parameterized. Even if an attacker gains arbitrary code execution within the sandboxed process, they cannot invoke dangerous system calls—dramatically limiting what they can do.
By the end of this page, you will understand why system call filtering is essential for robust sandboxing, how to design effective syscall filtering policies, the different approaches to syscall filtering, and the security implications of various policy choices. You will be able to analyze and design syscall filter policies.
The system call interface represents the boundary between user space and kernel space. Every time a process needs kernel services—opening files, creating sockets, mapping memory, spawning processes—it must cross this boundary through a system call. This makes the syscall interface the most critical security boundary in the system.
Syscall Attack Categories:
System calls can enable attacks in several ways:
| Category | Example Syscalls | Attack | Risk Level |
|---|---|---|---|
| Kernel vulnerabilities | ioctl, futex, bpf, perf_event_open | Trigger kernel bugs to escape sandbox | Critical |
| Privilege escalation | setuid, setgid, setgroups, capset | Regain dropped privileges | Critical |
| Namespace escape | setns, unshare, mount | Escape namespace isolation | Critical |
| File system access | open, openat, read, write, unlink | Access files outside sandbox | High |
| Network access | socket, connect, bind, sendto | Exfiltrate data, attack network | High |
| Process control | fork, execve, clone, kill | Spawn processes, affect other processes | Medium |
| Information leak | getdents, readlink, stat | Discover system layout | Medium |
| Resource exhaustion | fork, mmap, socket | Denial of service | Medium |
The Kernel Bug Problem:
The most serious concern is kernel vulnerabilities. The kernel is tens of millions of lines of complex C code, and despite extensive review, bugs exist. Each system call handler is a potential vulnerability.
By restricting which syscalls a process can invoke, we reduce the kernel attack surface. If a vulnerability exists in the bpf syscall handler, and our sandbox blocks bpf, the vulnerability is not reachable from our sandbox.
System call filtering is one of the few practical defenses against kernel vulnerabilities in sandbox escape. Other mechanisms (namespaces, capabilities) rely on the kernel functioning correctly. If the kernel has a bug, those mechanisms can be bypassed. Syscall filtering prevents the vulnerable code from being reached in the first place.
Quantifying the Attack Surface:
Linux has approximately 450 system calls. A typical sandboxed application (like a browser renderer) needs only a small fraction:
| Application Type | Syscalls Needed | Attack Surface Reduction |
|---|---|---|
| Browser renderer | ~50-80 | 82-89% |
| Image decoder | ~20-30 | 93-96% |
| PDF parser | ~30-50 | 89-93% |
| Network service | ~70-100 | 78-84% |
By allowing only the required syscalls, we dramatically reduce exposure to kernel bugs and limit the capabilities available to an attacker who achieves code execution.
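The reduction figures are simple arithmetic on syscall counts. The sketch below assumes the approximate total of 450 quoted above; the function name `surface_reduction_pct` is illustrative. It counts syscalls only, so it says nothing about how much kernel code sits behind each one.

```c
#include <assert.h>

/* Percentage of syscalls made unreachable when only `allowed` of
 * `total` syscalls pass the filter (illustrative helper). */
double surface_reduction_pct(int total, int allowed) {
    assert(total > 0 && allowed >= 0 && allowed <= total);
    return 100.0 * (double)(total - allowed) / (double)total;
}
```

For example, `surface_reduction_pct(450, 45)` evaluates to 90.0: allowing a tenth of the syscalls removes nine tenths of the entry points.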
There are several technical approaches to filtering system calls, each with different characteristics:
1. Ptrace-Based Filtering:
The oldest approach uses ptrace() to intercept system calls in the traced process. A monitor process runs alongside the sandboxed process, receiving notifications for each syscall:
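// Monitor process (x86-64); the child has already called PTRACE_TRACEME
int status;
int at_entry = 1;  // PTRACE_SYSCALL stops at both syscall entry and exit
while (1) {
    // Resume the child until its next syscall stop
    ptrace(PTRACE_SYSCALL, child, 0, 0);
    waitpid(child, &status, 0);
    if (WIFEXITED(status) || WIFSIGNALED(status))
        break;  // Child is gone; stop monitoring
    // Signal-delivery stops are ignored here for brevity
    if (at_entry) {
        // Get syscall number
        long nr = ptrace(PTRACE_PEEKUSER, child,
                         ORIG_RAX * sizeof(long), 0);
        // Check policy
        if (!is_allowed(nr)) {
            // Block syscall by changing it to -1 (invalid)
            ptrace(PTRACE_POKEUSER, child,
                   ORIG_RAX * sizeof(long), -1);
        }
    }
    at_entry = !at_entry;
}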
2. Kernel-Based Filtering (seccomp):
Modern Linux provides seccomp (secure computing mode), which moves syscall filtering into the kernel. The filter runs in kernel context, eliminating race conditions and reducing overhead dramatically.
Original seccomp (mode 1):
The original seccomp (2005) was extremely restrictive: it allowed only read, write, exit, and sigreturn. Any other syscall terminated the process. This was too restrictive for most applications.
seccomp-bpf (mode 2):
Seccomp-bpf (2012) extended seccomp with BPF (Berkeley Packet Filter) programs, allowing flexible, programmable filters. The filter inspects the syscall number and arguments and returns an action:
// Actions that seccomp filter can return:
#define SECCOMP_RET_KILL_PROCESS 0x80000000 // Kill entire process
#define SECCOMP_RET_KILL_THREAD 0x00000000 // Kill calling thread
#define SECCOMP_RET_TRAP 0x00030000 // Send SIGSYS signal
#define SECCOMP_RET_ERRNO 0x00050000 // Return errno value
#define SECCOMP_RET_USER_NOTIF 0x7fc00000 // Notify userspace
#define SECCOMP_RET_TRACE 0x7ff00000 // Notify ptrace tracer
#define SECCOMP_RET_LOG 0x7ffc0000 // Log and allow
#define SECCOMP_RET_ALLOW 0x7fff0000 // Allow syscall
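To show how a filter and its return actions fit together, here is a minimal sketch of a complete seccomp-bpf filter. It is an illustration under stated assumptions, not a production policy: the helper name `demo_errno_filter` and the `DEMO_AUDIT_ARCH` macro are hypothetical, only x86-64 and ARM64 are handled, and the filter allows everything except `getppid`, which it fails with EPERM. The filter is installed in a forked child so the caller stays unfiltered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "extend DEMO_AUDIT_ARCH for this architecture"
#endif

/* Returns the errno the filtered child saw from getppid(),
 * 0 if the call succeeded, or -1 if the filter could not be installed. */
int demo_errno_filter(void) {
    struct sock_filter insns[] = {
        /* Kill the process if the syscall comes from an unexpected architecture */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load the syscall number; fail getppid with EPERM, allow the rest */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Unprivileged processes must set no_new_privs before installing a filter */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(255);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(255);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit(r == -1 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

On a kernel with seccomp filtering available, `demo_errno_filter()` should return `EPERM`. A return of -1 means the environment refused to install the filter.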
3. Mandatory Access Control (SELinux, AppArmor):
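MAC systems can also constrain what syscalls achieve as part of their broader policy. SELinux and AppArmor do not match on syscall numbers; they mediate the operations that syscalls perform on files, sockets, and other kernel objects, so a policy restricts syscalls only indirectly and at a coarser grain than seccomp.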
4. Syscall Interposition via Library:
Library-based approaches (LD_PRELOAD) intercept library calls before they become syscalls. This is easy to implement but easy to bypass (direct syscall instructions, statically linked binaries).
Comparison Summary:
| Approach | Overhead | Security | Flexibility | Best Use |
|---|---|---|---|---|
| Ptrace | Very High (2-10x) | TOCTOU vulnerable | Full emulation possible | Legacy/compatibility |
| Seccomp-bpf | Very Low (1-2%) | Kernel-enforced | Per-syscall, arg inspection | Production sandboxing |
| MAC (SELinux) | Low (5-10%) | Kernel-enforced | Coarse-grained | System-wide policy |
| Library interposition | Low | Bypassable | Library-level control | Debugging only |
Designing an effective syscall filter policy requires balancing security (blocking as much as possible) with functionality (allowing the application to work). There are two fundamental approaches:
Allowlist (Whitelist) Approach:
Start with everything blocked and explicitly allow only required syscalls:
# Pseudo-policy: default deny, explicit allow
default: DENY
allow: read, write, close, fstat, mmap, munmap, ...
Denylist (Blacklist) Approach:
Start with everything allowed and explicitly block dangerous syscalls:
# Pseudo-policy: default allow, explicit deny
default: ALLOW
deny: mount, umount, reboot, kexec, ptrace, ...
Security best practice strongly favors allowlists. Denylists are prone to omissions—if you forget to block a dangerous syscall, you have a vulnerability. Allowlists fail safe: if you forget to allow a syscall, the application breaks (visible) rather than being insecure (invisible). New kernel syscalls are automatically blocked.
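The fail-safe difference can be sketched in a few lines of C. The lists and function names below (`allowlist_permits`, `denylist_permits`) are illustrative, and the number 100000 stands in for a syscall added by some future kernel that neither policy author knew about.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Illustrative policies, deliberately tiny */
static const long kAllowed[] = { SYS_read, SYS_write, SYS_close,
                                 SYS_mmap, SYS_munmap, SYS_exit_group };
static const long kDenied[]  = { SYS_mount, SYS_umount2, SYS_reboot,
                                 SYS_kexec_load, SYS_ptrace };

static bool in_list(const long *list, size_t n, long nr) {
    for (size_t i = 0; i < n; i++)
        if (list[i] == nr)
            return true;
    return false;
}

/* Default deny: only listed syscalls pass */
bool allowlist_permits(long nr) {
    return in_list(kAllowed, sizeof(kAllowed) / sizeof(kAllowed[0]), nr);
}

/* Default allow: everything passes unless listed */
bool denylist_permits(long nr) {
    return !in_list(kDenied, sizeof(kDenied) / sizeof(kDenied[0]), nr);
}
```

Both policies reject `SYS_ptrace`, but for the unknown number 100000 the allowlist answers false while the denylist answers true, which is exactly the omission risk described above.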
Developing an Allowlist Policy:
Creating an allowlist requires understanding what syscalls the application actually needs. Several approaches help:
1. Strace Analysis:
Run the application under strace to observe syscall usage:
# Record all syscalls during normal operation
strace -c -f program 2>&1 | tee syscalls.log
# Output shows syscall counts:
# % time calls syscall
# 25.00 100 read
# 20.00 80 write
# 15.00 60 mmap
# ...
2. Seccomp Log Mode:
Use SECCOMP_RET_LOG for syscalls you're unsure about—the kernel logs but doesn't block:
// Log unknown syscalls during development
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_unknown_syscall, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),
3. Iterative Refinement:
Start with a minimal set, run the application, observe what fails, add the needed syscalls, and repeat until the application runs cleanly.
Common Syscall Groups:
Syscalls can be grouped by function to help policy design:
| Group | Examples | Typical Policy |
|---|---|---|
| Basic I/O | read, write, close | Usually allowed |
| Memory | mmap, mprotect, brk, munmap | Usually allowed, may restrict RWX mappings |
| File metadata | fstat, stat, lstat | Usually allowed, may restrict paths |
| File open | open, openat | Restrict or broker through monitor |
| Directory | getdents, readdir | May restrict to hide system layout |
| Process | getpid, gettid, getuid | Usually allowed (harmless) |
| Signals | rt_sigaction, sigaltstack | Usually allowed for error handling |
| Time | clock_gettime, gettimeofday | Usually allowed |
| Network | socket, connect, sendto | Block or broker for most sandboxes |
| Process control | fork, clone, execve | Usually blocked |
| Privilege | setuid, setgid, setgroups | Always blocked |
| Namespace | setns, unshare, mount | Always blocked in sandbox |
| Debug | ptrace | Always blocked |
| Modules | init_module, finit_module | Always blocked |
Simple syscall number filtering isn't always sufficient. Many syscalls have vastly different security implications depending on their arguments. Argument-based filtering allows finer-grained policies that inspect syscall arguments.
Example: ioctl Filtering:
The ioctl syscall is generic—thousands of different operations share one syscall number, distinguished by the request code argument:
// Dangerous ioctl requests (examples)
ioctl(fd, TIOCSTI, &c); // Inject the character c into the terminal's input queue
ioctl(fd, SIOCGIFCONF, buf); // Enumerate network interfaces
// Benign ioctl requests
ioctl(fd, TCGETS, &termios); // Get terminal attributes
ioctl(fd, FIONREAD, &count); // Get bytes available to read
A naive policy blocking all ioctl would break many applications. A sophisticated policy allows only specific, audited ioctl requests:
// BPF filter: allow ioctl only for specific requests
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, SKIP_IOCTL_CHECK),
// Load second argument (ioctl request)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
// Allow TCGETS
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TCGETS, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Allow FIONREAD
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FIONREAD, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
// Block other ioctl
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
Example: mmap/mprotect Filtering:
The mmap and mprotect syscalls create memory mappings with specified protections. Allowing PROT_EXEC with PROT_WRITE enables JIT compilation but is dangerous (attackers can write shellcode then execute it):
// Restrict mprotect: disallow creating WX memory
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mprotect, 0, SKIP),
// Load third argument (prot flags)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[2])),
// Check if both WRITE and EXEC are set
BPF_STMT(BPF_ALU | BPF_AND | BPF_K, PROT_WRITE | PROT_EXEC),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PROT_WRITE | PROT_EXEC, 0, 1),
// Block WX mappings
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
Example: socket Type Filtering:
The socket syscall can create various socket types. Restricting to specific types limits network capability:
// Only allow AF_UNIX sockets (local IPC only)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, SKIP),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
On 64-bit systems, syscall arguments are 64-bit, but classic BPF can only load 32-bit values. struct seccomp_data stores every argument as a 64-bit value on all architectures, so a check on a 64-bit argument must load and compare both the low and the high 32 bits; otherwise an attacker can smuggle a different value past the filter in the upper bits.
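A small sketch of the two-halves rule, assuming a little-endian target (x86-64, typical ARM64). The macro and function names are illustrative. Whether the upper half matters in practice depends on how the kernel interprets the argument: it matters for values the kernel reads as a full 64-bit quantity, such as sizes and pointers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Byte offsets of the two 32-bit halves of args[n] inside
 * struct seccomp_data, assuming little-endian layout. */
#define ARG_LO_OFFSET(n) (offsetof(struct seccomp_data, args) + 8 * (n))
#define ARG_HI_OFFSET(n) (ARG_LO_OFFSET(n) + 4)

/* What a filter sees if it loads only the low word */
bool low_half_matches(uint64_t arg, uint32_t want) {
    return (uint32_t)arg == want;
}

/* What a filter sees if it also requires the high word to be zero */
bool both_halves_match(uint64_t arg, uint32_t want) {
    return (uint32_t)arg == want && (uint32_t)(arg >> 32) == 0;
}
```

With the crafted value `0x100000001`, `low_half_matches(v, 1)` is true while `both_halves_match(v, 1)` is false. In a real filter the same logic becomes two `BPF_LD | BPF_W | BPF_ABS` loads at `ARG_LO_OFFSET(n)` and `ARG_HI_OFFSET(n)`, each followed by its own comparison.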
TOCTOU Considerations:
Seccomp-bpf filters see only the raw argument values passed at syscall entry. For arguments that are pointers to user memory (like filenames), the filter cannot dereference the pointer, and any monitor that reads the pointed-to memory itself faces a TOCTOU (time-of-check-to-time-of-use) race: another thread can rewrite the memory after the check but before the kernel copies it in.
For this reason, seccomp-bpf can only safely filter on numeric arguments, not on the content of pointed-to memory. To filter on file paths, use a broker pattern: the sandboxed process sends the path to a broker via IPC, and the broker validates it and performs the operation on the sandbox's behalf.
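The broker's validation step might look like the following sketch. The function name `broker_path_allowed` and the `/sandbox/data` root in the usage note are hypothetical. The check runs on the broker's own copy of the path, so the sandboxed process cannot change it after validation. It is purely lexical and does not resolve symlinks; a real broker would also need to handle those, for example by opening relative to a directory file descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Accept only paths that lie under `root` and contain no ".." component.
 * `root` is expected without a trailing slash. */
bool broker_path_allowed(const char *root, const char *path) {
    assert(root != NULL && path != NULL);
    size_t rlen = strlen(root);
    if (strncmp(path, root, rlen) != 0)
        return false;
    /* Reject siblings such as "/sandbox/database" for root "/sandbox/data" */
    if (path[rlen] != '/' && path[rlen] != '\0')
        return false;
    /* Walk the components and reject any that is exactly ".." */
    for (const char *p = path; *p != '\0'; ) {
        const char *slash = strchr(p, '/');
        size_t len = slash ? (size_t)(slash - p) : strlen(p);
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;
        p += len;
        if (*p == '/')
            p++;
    }
    return true;
}
```

With root `/sandbox/data`, the path `/sandbox/data/a.txt` is accepted, while `/sandbox/data/../etc/passwd` and `/sandbox/database` are rejected.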
When a syscall is blocked, the filter must decide what happens. Seccomp-bpf provides several options, each with different implications:
SECCOMP_RET_KILL_PROCESS:
Terminate the entire process immediately with SIGSYS. This is the most secure option—any attempt to use a blocked syscall terminates the sandbox:
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
SECCOMP_RET_KILL_THREAD:
Terminate only the calling thread. Less disruptive than process kill, but can leave the process in an inconsistent state.
SECCOMP_RET_ERRNO:
Return an error value to the caller as if the syscall failed:
// Return EPERM (Operation not permitted)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
// Or return ENOSYS (Function not implemented)
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
Pros: Application can handle the error gracefully, better user experience
Cons: Attacker can probe to map what's blocked, might find workarounds
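The errno travels in the low 16 bits of the filter's 32-bit return value, with the action in the upper bits. A minimal sketch of packing and unpacking that value, using the kernel header constants; the helper names are illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Build the filter return value for "fail with this errno" */
uint32_t errno_ret(int err) {
    assert(err >= 0 && err <= 0xffff);
    return SECCOMP_RET_ERRNO | ((uint32_t)err & SECCOMP_RET_DATA);
}

/* True if the return value selects the ERRNO action */
int ret_is_errno(uint32_t ret) {
    return (ret & SECCOMP_RET_ACTION_FULL) == SECCOMP_RET_ERRNO;
}

/* Extract the errno carried in the data bits */
int ret_errno_value(uint32_t ret) {
    return (int)(ret & SECCOMP_RET_DATA);
}
```

On Linux, where EPERM is 1, `errno_ret(EPERM)` is `0x00050001`.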
SECCOMP_RET_TRAP:
Send SIGSYS signal to the process, which can install a handler to deal with the blocked syscall:
// In seccomp filter:
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
// In application:
void sigsys_handler(int sig, siginfo_t *info, void *ctx) {
// info->si_syscall contains blocked syscall number
// info->si_arch contains architecture
// Can emulate the syscall or take other action
}
struct sigaction sa = {
.sa_sigaction = sigsys_handler,
.sa_flags = SA_SIGINFO,
};
sigaction(SIGSYS, &sa, NULL);
Use case: User-space emulation of blocked syscalls. The handler can implement a sandboxed version of the functionality.
SECCOMP_RET_USER_NOTIF:
Notify a supervisor process via a notification file descriptor. The supervisor can inspect the syscall, make a policy decision, and optionally respond on behalf of the sandboxed process:
// Create notification fd (glibc has no seccomp() wrapper, so call syscall() directly)
int notify_fd = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
// Supervisor event loop
// Buffers must be at least the sizes reported by SECCOMP_GET_NOTIF_SIZES
struct seccomp_notif *req = ...;
struct seccomp_notif_resp *resp = ...;
while (1) {
    memset(req, 0, sizeof(*req)); // The kernel rejects a non-zeroed request buffer
    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, req);
    // Inspect req->data.nr (syscall number), req->data.args
    // Make policy decision
    resp->id = req->id;
    resp->val = result; // Return value to sandbox
    resp->error = 0; // Or a negated errno (e.g. -EPERM) if failing
    resp->flags = 0;
    ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, resp);
}
Use case: Broker pattern—supervisor opens files, creates sockets on behalf of sandbox.
Production sandboxes often combine approaches: ALLOW for clearly safe syscalls, ERRNO for syscalls that might be called but aren't needed, USER_NOTIF for syscalls that need broker assistance, and KILL for syscalls that indicate an attack in progress (like ptrace or kexec_load).
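One way to picture such a combined policy is a function that maps each syscall number to a tier. This is only a sketch: the `enum tier` type, the `choose_tier` name, and the assignment of particular syscalls to tiers are illustrative assumptions, not taken from any real sandbox.

```c
#include <sys/syscall.h>

enum tier { TIER_ALLOW, TIER_ERRNO, TIER_NOTIFY, TIER_KILL };

/* Map a syscall number to the action tier a combined policy would use */
enum tier choose_tier(long nr) {
    switch (nr) {
    case SYS_read:
    case SYS_write:
    case SYS_close:
    case SYS_exit_group:
        return TIER_ALLOW;   /* clearly safe */
    case SYS_openat:
    case SYS_socket:
        return TIER_NOTIFY;  /* needs broker assistance */
    case SYS_ptrace:
    case SYS_kexec_load:
        return TIER_KILL;    /* indicates an attack in progress */
    default:
        return TIER_ERRNO;   /* not needed: fail without killing */
    }
}
```

The default branch is the important design choice: anything not explicitly classified fails with an error rather than being allowed.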
System call filtering must account for CPU architecture-specific issues that attackers can exploit:
Multi-Architecture Systems:
On x86-64 Linux, 32-bit programs can run under compatibility mode, and they use different syscall numbers. A filter designed for 64-bit syscalls might not cover 32-bit syscalls, creating an escape path.
// Always check architecture in seccomp filter!
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS), // Kill the whole process if not x86_64
The x32 ABI Attack:
Linux x86-64 supports an "x32" ABI: 32-bit pointers with 64-bit registers. x32 syscall numbers have bit 30 set (__X32_SYSCALL_BIT, 0x40000000), and x32 calls report the same AUDIT_ARCH_X86_64 architecture value as native calls, so the architecture check alone does not catch them. An attacker might invoke x32 syscalls to bypass filters designed for regular x86-64:
// Block x32 ABI (bit 30 set in syscall number)
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __X32_SYSCALL_BIT, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS), // Kill x32 syscalls
Every seccomp filter MUST start by checking the architecture and, on x86-64, rejecting x32 syscall numbers. Without these checks, attackers on x86-64 can use 32-bit or x32 syscalls to reach syscall numbers that your filter doesn't cover. Omitting them is a well-known class of seccomp filter bypass.
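The two checks can be expressed as one classification over the `arch` and `nr` fields a filter sees. The sketch below is plain C rather than BPF so the logic is easy to follow; `classify_abi` and the `enum abi` names are illustrative.

```c
#include <stdint.h>
#include <linux/audit.h>

#define X32_BIT 0x40000000u /* same value as __X32_SYSCALL_BIT */

enum abi { ABI_X86_64, ABI_X32, ABI_I386, ABI_OTHER };

/* Decide which ABI a syscall came from, given seccomp_data.arch and .nr */
enum abi classify_abi(uint32_t arch, uint32_t nr) {
    if (arch == AUDIT_ARCH_X86_64)
        /* x32 shares the x86-64 arch value; only bit 30 of nr tells them apart */
        return (nr & X32_BIT) ? ABI_X32 : ABI_X86_64;
    if (arch == AUDIT_ARCH_I386)
        return ABI_I386;
    return ABI_OTHER;
}
```

A filter written for native x86-64 numbers should proceed to its syscall table only when the result is `ABI_X86_64` and reject every other case.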
Syscall Number Portability:
Syscall numbers differ between architectures. A filter written with x86-64 numbers won't work on ARM64. Use architecture-specific headers or build filters at runtime:
// Use syscall number constants from headers
#include <sys/syscall.h>
// SYS_open is defined differently per architecture
// On x86-64: SYS_open = 2
// On ARM64: SYS_openat = 56 (no SYS_open)
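One hedged way to keep a policy portable is to guard each constant with the preprocessor, so the list is built only from syscalls the target architecture defines. The function name `open_family` is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>

/* Fill `out` with the file-opening syscalls that exist on this
 * architecture and return how many were written (at most `cap`). */
size_t open_family(long *out, size_t cap) {
    assert(out != NULL || cap == 0);
    size_t n = 0;
#ifdef SYS_open
    if (n < cap) out[n++] = SYS_open;      /* legacy; absent on ARM64 */
#endif
#ifdef SYS_openat
    if (n < cap) out[n++] = SYS_openat;
#endif
#ifdef SYS_openat2
    if (n < cap) out[n++] = SYS_openat2;   /* only with recent kernel headers */
#endif
    return n;
}
```

A filter generator can loop over the returned numbers and emit one comparison per entry, which avoids hard-coding x86-64 values.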
Modern vs. Legacy Syscalls:
Some syscalls have been superseded by newer versions:
- open → openat (relative paths)
- stat → fstatat
- access → faccessat
The old versions might still work, so filters must block both:
// Block both open and openat (syscall number already loaded)
#ifdef __NR_open // Absent on newer architectures such as ARM64
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_open, 1, 0),
#endif
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
Library Wrapper Assumptions:
C library functions might use different syscalls than expected:
- printf → write (expected)
- malloc → brk or mmap (implementation detail)
- sleep → nanosleep or clock_nanosleep
Test on target systems to ensure the actual syscalls are allowed.
Examining real-world syscall policies helps understand how production systems balance security and functionality.
Chrome Renderer Sandbox:
Chrome's renderer sandbox is one of the most restrictive production sandboxes:
# Chrome renderer allowed syscalls (simplified)
exit_group, exit, read, write, close, fstat, mmap, mprotect, munmap,
brk, rt_sigaction, rt_sigprocmask, rt_sigreturn, pread64, pwrite64,
lseek, futex, poll, recvmsg, sendmsg, socketpair, shutdown,
prctl, clock_gettime, gettimeofday, clone, sigaltstack,
get_robust_list, set_robust_list, restart_syscall, getrandom,
madvise, fcntl, nanosleep, getpid, gettid, tgkill, ...
# Notably NOT allowed:
open, openat, socket, connect, bind, fork, execve, ptrace, mount,
chmod, chown, link, unlink, rename, mkdir, rmdir, ...
The renderer cannot open files, create network connections, or spawn processes. All such operations go through IPC to the browser process, which validates and performs them.
Docker Default Seccomp Profile:
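Docker's default seccomp profile blocks approximately 50 syscalls from over 300 available. Although the net effect resembles a denylist, the profile itself is written as an allowlist: the default action returns an error, and an explicit list of syscalls is allowed. It's less restrictive than browser sandboxes because containers need broader functionality: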
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
"syscalls": [
{
"names": ["accept", "accept4", "access", ...],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["personality"],
"action": "SCMP_ACT_ALLOW",
"args": [{"index": 0, "value": 0, "op": "SCMP_CMP_EQ"}]
}
]
}
Blocked by default in Docker: acct, add_key, bpf, clock_adjtime, clock_settime, create_module, delete_module, finit_module, get_kernel_syms, get_mempolicy, init_module, ioperm, iopl, kcmp, kexec_file_load, kexec_load, keyctl, lookup_dcookie, mbind, mount, move_pages, name_to_handle_at, nfsservctl, open_by_handle_at, perf_event_open, personality, pivot_root, process_vm_readv, process_vm_writev, ptrace, query_module, quotactl, reboot, request_key, set_mempolicy, setns, settimeofday, stime, swapoff, swapon, sysfs, _sysctl, umount, umount2, unshare, uselib, userfaultfd, ustat, vm86, vm86old
| System | Allowed Syscalls | Approach | Compatibility |
|---|---|---|---|
| Chrome Renderer | ~60-80 | Strict allowlist | Breaks if needs change |
| Chrome GPU Process | ~100-120 | Allowlist + broker | Needs GPU driver syscalls |
| Docker default | ~250 | Broad allowlist (default deny) | Wide compatibility |
| systemd services | Variable | Configurable per-service | Admin-controlled |
| Firejail default | ~200 | Denylist | Desktop app compat |
More restrictive policies provide better security but risk breaking applications. Browser sandboxes can be very strict because they control both the sandbox and the sandboxed code. General-purpose container sandboxes must be more permissive to support arbitrary applications.
Testing syscall filters is critical—both to ensure security and to verify applications still work.
Testing for Security:
void test_blocked_syscalls() {
// These should fail with EPERM (or process gets killed)
assert(syscall(__NR_ptrace, 0, 0, 0, 0) == -1);
assert(errno == EPERM);
assert(syscall(__NR_mount, 0, 0, 0, 0, 0) == -1);
assert(errno == EPERM);
// Test that the process survived (if using ERRNO action)
printf("Blocked syscalls correctly rejected\n");
}
// Try x32 syscall (should fail or kill, so run it in a child process)
syscall(__NR_write | 0x40000000, 1, "test", 4);
// If ioctl is allowed only for TCGETS (stdin must be a terminal):
struct termios termios;
char c = 'x';
assert(ioctl(0, TCGETS, &termios) >= 0); // Should work
assert(ioctl(0, TIOCSTI, &c) == -1); // Should fail
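Because a KILL action would take the test runner down with it, tests like these are usually safer in a child process. The sketch below shows one possible harness; the name `probe_syscall` and its return convention are assumptions for illustration. The child inherits whatever seccomp filter the parent has installed, and the syscall is issued with all arguments zero so that a call which slips through fails harmlessly in most cases.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run syscall `nr` with zeroed arguments in a forked child.
 * Returns 0 if it succeeded, the errno (1..255) if it failed,
 * -signal if the child was killed, or -1000 on harness failure. */
int probe_syscall(long nr) {
    pid_t pid = fork();
    if (pid < 0)
        return -1000;
    if (pid == 0) {
        errno = 0;
        long r = syscall(nr, 0, 0, 0, 0, 0, 0);
        /* Exit status carries only 8 bits, enough for common errno values */
        _exit(r == -1 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1000;
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return WEXITSTATUS(status);
}
```

Under a filter that kills on ptrace, `probe_syscall(SYS_ptrace)` would be expected to return `-SIGSYS`; under an ERRNO policy it would return `EPERM`. With no filter installed, `probe_syscall(SYS_getpid)` returns 0.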
Debugging Filter Issues:
// During development, log instead of kill
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_LOG),
The kernel logs each syscall that hits the LOG action (record simplified):
audit: seccomp status=log syscall=257 pid=1234 code=0x7ffc0000
# Trace with seccomp information
strace -e trace=open,openat,socket -f ./sandboxed_app
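// Dump a seccomp filter for inspection (tracer must be attached and hold CAP_SYS_ADMIN)
// With a NULL buffer the call returns the instruction count of filter 0
long count = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
struct sock_filter *insns = calloc(count, sizeof(*insns));
ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
// Then disassemble BPF instructions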
# Install seccomp-tools (Ruby gem)
gem install seccomp-tools
# Run the program and dump the seccomp filter it installs
seccomp-tools dump ./sandboxed_app
# Disassemble BPF bytecode
seccomp-tools disasm filter.bpf
Start with SECCOMP_RET_LOG to discover what syscalls your application uses, then switch to ERRNO to verify it handles rejection gracefully, and finally switch to KILL_PROCESS for production. This progression makes debugging much easier.
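We have explored system call filtering as a critical layer of sandbox security. The key insights: filtering shrinks the kernel attack surface reachable from the sandbox, default-deny allowlists fail safe where denylists fail open, argument checks are reliable only for numeric values, and every filter must validate the architecture before trusting a syscall number.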
What's Next:
With the conceptual and practical foundations of syscall filtering covered, the next page will focus on seccomp in detail—Linux's syscall filtering framework. You'll learn the BPF programming model, writing and installing filters, and advanced techniques used in production sandboxes.
You now understand how system call filtering protects sandboxed processes by restricting their access to kernel functionality. You can reason about policy design, understand the trade-offs between different filtering approaches, and appreciate the critical role of syscall filtering in modern sandboxing.