In the previous page, we established the conceptual foundations of sandboxing. Now we descend from theory to practice: how do operating systems actually confine processes?
A process, from the operating system's perspective, is an abstraction that bundles code execution with system resource access. Processes have memory mappings, file descriptors, network connections, and permissions. A sandboxed process is one where the operating system systematically restricts this access—creating a confined environment where the process can execute but cannot interact freely with the rest of the system.
Process sandboxing represents the most common and practical form of sandboxing in modern systems. It operates at the kernel level, providing strong guarantees with relatively low overhead. Understanding these mechanisms is essential for anyone building secure systems or trying to understand how browsers, containers, and security-critical applications protect themselves.
By the end of this page, you will understand the key operating system primitives for process sandboxing: namespaces, chroot, pivot_root, resource isolation, privilege dropping, and the overall architecture of a sandboxed process. You will be able to design and reason about process-level sandboxing strategies.
Before we can sandbox a process, we must understand what resources a process can access and what attacks become possible through each resource. A process's attack surface consists of every channel through which it can interact with the system or other processes.
The Anatomy of a Process:
A Unix/Linux process consists of several components, each representing potential attack vectors:
| Component | Description | Potential Attack | Sandboxing Goal |
|---|---|---|---|
| Address Space | Virtual memory containing code and data | Memory corruption exploits, ROP | Prevent access to other processes' memory |
| File Descriptors | Handles to files, sockets, pipes | Read/write sensitive files, exfiltrate data | Restrict to minimal required descriptors |
| Credentials | UID, GID, supplementary groups | Access files/resources as privileged user | Drop to unprivileged credentials |
| Capabilities | Fine-grained privilege tokens | Escalate privileges, perform admin operations | Remove all unnecessary capabilities |
| Namespace Memberships | Views of system resources (PIDs, network) | Interact with other processes, access network | Create isolated namespaces |
| System Call Interface | Gateway to kernel functionality | Exploit kernel vulnerabilities | Filter to minimal syscall set |
| Environment Variables | Process configuration data | Inject malicious configuration | Sanitize or restrict environment |
| Signal Handlers | Asynchronous notification mechanism | Interrupt execution, trigger handlers | Limit signal delivery |
The Principle of Least Privilege:
Process sandboxing is the practical application of the principle of least privilege: every process should have only the minimum privileges necessary to perform its function. This minimizes the damage possible if the process is compromised.
Consider a web browser's renderer process. Its function is to:
For this function, the renderer process does not need:
A properly sandboxed renderer strips all these unnecessary capabilities, so even if an attacker exploits a vulnerability in the JavaScript engine, they remain confined.
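The correct approach is default-deny rather than a blocklist. A process may start with full privileges, but its target state should be defined as the short list of things it needs, with everything else dropped before it touches untrusted input. Do NOT try to sandbox by denying specific things, because you will miss something. Instead, strip everything and add back only what's needed.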
One of the oldest and most fundamental sandboxing techniques is file system isolation—restricting what parts of the file system a process can see and access. On Unix systems, this is achieved through chroot and its more powerful successor, pivot_root.
The chroot System Call:
The chroot(path) system call changes the root directory of the calling process to path. After a chroot, all file path resolution starts from the new root. The process cannot access files outside the new root using normal path traversal.
// Change root directory to /sandbox
if (chroot("/sandbox") != 0) {
    perror("chroot failed");
    exit(1);
}
// Change to the new root
if (chdir("/") != 0) {
    perror("chdir failed");
    exit(1);
}
Limitations of chroot:
While chroot provides file system isolation, it has significant limitations:
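- A process that is still root inside the jail can break out with a second chroot, for example: chroot("new_jail"); chdir("../../..");
- A process that holds an open file descriptor for a directory outside the jail can fchdir() to that descriptor and escape.
pivot_root: A Stronger Alternative: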
The pivot_root(new_root, put_old) system call is more robust than chroot. Instead of just changing the root reference, it actually moves the old root to a subdirectory and makes the new root the actual system root. This allows the old root to be unmounted entirely, making escape much harder.
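// Set up a private mount namespace first, so the changes stay local
// (return values should be checked in real code; omitted here for brevity)
unshare(CLONE_NEWNS);
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
// Set up the new root: pivot_root needs new_root to be a mount point,
// so bind-mount the directory onto itself
mount("/sandbox", "/sandbox", NULL, MS_BIND | MS_REC, NULL);
// Pivot to the new root (glibc provides no pivot_root wrapper, so use syscall())
mkdir("/sandbox/old_root", 0755);
syscall(SYS_pivot_root, "/sandbox", "/sandbox/old_root");
chdir("/");
// Unmount the old root so the host file system is no longer reachable
umount2("/old_root", MNT_DETACH);
rmdir("/old_root");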
pivot_root changes the root mount of the calling process's mount namespace. Without a private mount namespace, it would change the file system view of every process sharing the host's namespace. This is why containers always create a mount namespace before setting up their root file system.
Best Practices for File System Isolation:
- Make the sandbox root read-only where possible, for example with mount -o ro,remount /.
Linux namespaces are kernel features that partition system resources so that different sets of processes see different views of those resources. Namespaces are the foundation of container technologies and provide powerful, fine-grained isolation.
Available Namespace Types:
Linux provides several namespace types, each isolating a different aspect of the system:
| Namespace | Clone Flag | Introduced | Isolates |
|---|---|---|---|
| Mount | CLONE_NEWNS | Linux 2.4.19 (2002) | Mount points, filesystem view |
| UTS | CLONE_NEWUTS | Linux 2.6.19 (2006) | Hostname and domain name |
| IPC | CLONE_NEWIPC | Linux 2.6.19 (2006) | System V IPC, POSIX message queues |
| PID | CLONE_NEWPID | Linux 2.6.24 (2008) | Process IDs |
| Network | CLONE_NEWNET | Linux 2.6.29 (2009) | Network devices, stacks, ports |
| User | CLONE_NEWUSER | Linux 3.8 (2013) | User and group IDs, capabilities |
| Cgroup | CLONE_NEWCGROUP | Linux 4.6 (2016) | Cgroup root directory |
| Time | CLONE_NEWTIME | Linux 5.6 (2020) | Offsets of the monotonic and boot-time clocks (not the wall clock) |
Creating and Entering Namespaces:
Namespaces can be created and entered using several mechanisms:
// Method 1: clone() - create new process in new namespaces
int flags = CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | SIGCHLD;
pid_t pid = clone(child_func, stack, flags, arg);
// Method 2: unshare() - move current process to new namespaces
// (exception: with CLONE_NEWPID the caller stays where it is and only
// children created afterwards enter the new PID namespace)
unshare(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET);
// Method 3: setns() - enter existing namespace via fd
int ns_fd = open("/proc/1234/ns/net", O_RDONLY);
setns(ns_fd, CLONE_NEWNET);
close(ns_fd);
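To check which namespaces a process actually belongs to, the symbolic links under /proc/self/ns can be read. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names namespace_id and same_namespace are hypothetical and chosen only for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the identity of one of the calling process's namespaces.
 * `type` is a name under /proc/self/ns, such as "pid", "net" or "mnt".
 * On success the link target (a string of the form "pid:[<inode>]")
 * is stored in buf, NUL-terminated, and 0 is returned; -1 on error. */
int namespace_id(const char *type, char *buf, size_t len)
{
    char path[64];
    int n = snprintf(path, sizeof path, "/proc/self/ns/%s", type);
    if (n < 0 || (size_t)n >= sizeof path || len == 0)
        return -1;

    ssize_t r = readlink(path, buf, len - 1);
    if (r < 0)
        return -1;
    buf[r] = '\0';
    return 0;
}

/* Two processes share a namespace exactly when these strings are equal. */
int same_namespace(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Comparing the string obtained before and after a call such as unshare(CLONE_NEWNET) is a simple way to confirm that the namespace really changed.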
PID Namespace Deep Dive:
The PID namespace deserves special attention because it profoundly affects how processes perceive the system. In a new PID namespace:
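pid_t pid = fork();
if (pid == 0) {
    // Child: create new PID namespace (applies to children created from now on)
    unshare(CLONE_NEWPID);
    pid_t inner_pid = fork();
    if (inner_pid == 0) {
        // This process is PID 1 in the new namespace
        printf("My PID: %d\n", getpid()); // Prints: My PID: 1
        // Cannot see other system processes
        // Cannot send signals to processes outside namespace
        // Act as init: reap zombies until no children remain
        // (needs <errno.h>; the original endless loop would spin on ECHILD)
        int status;
        while (wait(&status) > 0 || errno == EINTR) {
            // each successful wait() collects one terminated child
        }
        exit(0);
    }
    exit(0);
}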
PID 1 in a namespace has special responsibilities: it must reap zombie processes and handle signals appropriately. If PID 1 exits or crashes, all processes in the namespace are killed with SIGKILL. Proper PID 1 handling is critical for container stability.
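The reaping duty of PID 1 can be sketched as a loop around waitpid(). This is a minimal illustration rather than a complete init: it ignores signal forwarding, and the function names reap_all_children and demo_reap are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Block until every child of the calling process has exited, collecting
 * each exit status so that no zombie is left behind.
 * Returns the number of children reaped. */
int reap_all_children(void)
{
    int reaped = 0;
    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, 0);
        if (pid > 0) {
            reaped++;       /* one child (or adopted orphan) collected */
        } else if (errno == EINTR) {
            continue;       /* interrupted by a signal: retry */
        } else {
            break;          /* ECHILD: no children remain */
        }
    }
    return reaped;
}

/* Demonstration helper: start n short-lived children, then reap them.
 * Returns the number reaped, or -1 if fork() fails. */
int demo_reap(int n)
{
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            reap_all_children();    /* clean up the ones already started */
            return -1;
        }
        if (pid == 0)
            _exit(0);               /* child exits immediately */
    }
    return reap_all_children();
}
```

A long-running init would typically call a non-blocking variant (WNOHANG) from a SIGCHLD handler or event loop instead of blocking until all children are gone.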
Network Namespace Deep Dive:
The network namespace creates a completely isolated network stack:
Connecting a network namespace to the outside world requires explicit configuration using virtual ethernet pairs (veth), bridges, or NAT rules.
# Create new network namespace
ip netns add sandbox
# Create veth pair connecting namespace to host
ip link add veth0 type veth peer name veth1
ip link set veth1 netns sandbox
# Configure host side
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
# Configure namespace side (run in namespace)
ip netns exec sandbox ip addr add 10.0.0.2/24 dev veth1
ip netns exec sandbox ip link set veth1 up
ip netns exec sandbox ip link set lo up
# Now processes in 'sandbox' can reach 10.0.0.1
User namespaces are perhaps the most powerful namespace type because they enable unprivileged sandboxing. Before user namespaces, creating namespaces and sandboxes required root privileges—which meant you needed privileges to drop privileges. User namespaces solve this paradox.
How User Namespaces Work:
A user namespace provides a separate mapping of user and group IDs. A process can be root (UID 0) inside the namespace while being an ordinary unprivileged user outside:
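// Unprivileged user creates user namespace
if (unshare(CLONE_NEWUSER) != 0) {
    perror("unshare");
    exit(1);
}
// The process now holds a full capability set inside the namespace but is
// still unprivileged outside. Until uid_map is written, getuid() reports
// the overflow UID; once UID 0 is mapped, it reports 0.
printf("UID inside namespace: %d\n", getuid()); // 65534 (nobody) before mapping
printf("Effective capabilities: ...full set...\n");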
UID/GID Mapping:
User namespaces require explicit UID/GID mappings to be configured. These mappings are written to /proc/[pid]/uid_map and /proc/[pid]/gid_map:
# Format: <id-inside-ns> <id-outside-ns> <range>
# Map UID 0 inside to UID 1000 outside (count=1)
echo "0 1000 1" > /proc/self/uid_map
# Map GID 0 inside to GID 1000 outside
# Note: must write 'deny' to /proc/self/setgroups first
echo "deny" > /proc/self/setgroups
echo "0 1000 1" > /proc/self/gid_map
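The mapping format can be made concrete with a small translation routine. The sketch below parses one map line and converts an ID seen inside the namespace to the ID the host sees; the struct and function names are hypothetical, and real map files may contain several lines, which this sketch does not handle.

```c
#include <stdio.h>

/* One line of /proc/[pid]/uid_map or /proc/[pid]/gid_map. */
struct id_map {
    unsigned long inside;   /* first ID as seen inside the namespace */
    unsigned long outside;  /* first ID as seen outside the namespace */
    unsigned long count;    /* number of consecutive IDs mapped */
};

/* Parse "<id-inside-ns> <id-outside-ns> <range>". Returns 0 on success. */
int parse_id_map(const char *line, struct id_map *m)
{
    int fields = sscanf(line, "%lu %lu %lu",
                        &m->inside, &m->outside, &m->count);
    return fields == 3 ? 0 : -1;
}

/* Translate an ID used inside the namespace to the ID the host sees.
 * Returns -1 when the ID falls outside the mapped range. */
long map_to_outside(const struct id_map *m, unsigned long id)
{
    if (id < m->inside || id - m->inside >= m->count)
        return -1;
    return (long)(m->outside + (id - m->inside));
}
```

With the line "0 1000 1", only UID 0 inside the namespace is mapped, and it corresponds to UID 1000 on the host; every other inside ID has no host counterpart.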
Implications:
With user namespaces, an unprivileged user can:
However, this "root" is fake root—the kernel still enforces that operations against host resources use the mapped (unprivileged) UID.
User namespaces enable 'rootless containers': Docker, Podman, and other container runtimes can run without root privileges. The container engine runs as an unprivileged user, creates a user namespace, and inside that namespace creates all other namespaces. This dramatically improves security by eliminating the privileged container runtime.
Traditional Unix has a binary privilege model: either you're root (UID 0) with full privileges, or you're not root with limited privileges. This is problematic because many programs need only a single privileged operation (e.g., binding to port 80) but receive all root privileges.
Linux capabilities break root privileges into distinct units that can be granted independently. Instead of giving a process full root access, you grant only the specific capabilities it needs.
Capability Sets:
Each process has several capability sets: permitted, effective, inheritable, bounding, and ambient.
| Capability | Allows | Sandboxing Notes |
|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Often the only capability needed by web servers |
| CAP_NET_RAW | Use raw sockets | Needed for ping, packet capture; dangerous |
| CAP_SYS_ADMIN | Many privileged operations | The 'new root'; avoid granting |
| CAP_SYS_PTRACE | Use ptrace() | Allows debugging any process; escape risk |
| CAP_DAC_OVERRIDE | Bypass file permission checks | Read/write any file; avoid |
| CAP_CHOWN | Change file ownership | Can chown any file to any user |
| CAP_SETUID | Set UID | Can become any user; avoid |
| CAP_SYS_CHROOT | Use chroot() | Needed for chroot; escape risk |
| CAP_NET_ADMIN | Network configuration | IP config, routing, firewall rules |
| CAP_KILL | Send signals to any process | Can kill processes of other users |
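A process's current capability sets can be inspected without extra libraries by reading /proc/self/status, where they appear as hexadecimal bitmasks in the CapInh, CapPrm, CapEff, CapBnd, and CapAmb fields. The following is a minimal sketch assuming Linux with /proc mounted and the kernel headers installed; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <linux/capability.h>   /* CAP_* numbers */

/* Read one capability set of the calling process from /proc/self/status.
 * `field` is one of "CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb".
 * Returns 0 and stores the bitmask in *mask, or -1 if not found. */
int read_cap_mask(const char *field, unsigned long long *mask)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    size_t flen = strlen(field);
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
            if (sscanf(line + flen + 1, "%llx", mask) == 1)
                rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}

/* Test whether capability number `cap` (e.g. CAP_NET_BIND_SERVICE) is set. */
int mask_has_cap(unsigned long long mask, int cap)
{
    return (int)((mask >> cap) & 1ULL);
}
```

After a sandbox has dropped its privileges, reading "CapEff" and "CapBnd" this way should show only the bits that were deliberately kept.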
Dropping Capabilities for Sandboxing:
The key sandboxing operation is dropping capabilities. A process should start with necessary capabilities and drop them before handling untrusted input:
#include <sys/capability.h>
#include <sys/prctl.h>
void drop_capabilities(void) {
    // Empty the bounding set first: PR_CAPBSET_DROP requires CAP_SETPCAP,
    // which is lost once the capability sets below are cleared
    for (int cap = 0; cap <= CAP_LAST_CAP; cap++) {
        prctl(PR_CAPBSET_DROP, cap, 0, 0, 0);
    }
    // Get current capabilities
    cap_t caps = cap_get_proc();
    // Clear all capabilities
    cap_clear(caps);
    // Optionally keep specific capabilities
    // cap_value_t keep[] = { CAP_NET_BIND_SERVICE };
    // cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET);
    // cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET);
    // Apply
    cap_set_proc(caps);
    cap_free(caps);
    // Prevent regaining capabilities
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}
CAP_SYS_ADMIN has become a catch-all for privileged operations that don't fit elsewhere. It allows mounting filesystems, creating namespaces, many ioctl operations, and more. A process with CAP_SYS_ADMIN has almost as much power as root. Well-designed sandboxes never grant it.
NO_NEW_PRIVS: Sealing the Sandbox:
The PR_SET_NO_NEW_PRIVS prctl prevents a process from gaining new privileges through execve(). Without this, a sandboxed process could execute a setuid binary and escape the sandbox:
// Enable no_new_privs - cannot be disabled once set
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
// Now execve() of setuid binary runs without privilege elevation
// This prevents sandbox escape via setuid binaries
This flag is inherited by children and cannot be cleared, making it a critical part of the sandbox.
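The one-way nature of the flag can be observed directly with prctl(). The following is a minimal sketch; the function names are hypothetical, and it relies on the kernel rejecting any argument other than 1 for PR_SET_NO_NEW_PRIVS.

```c
#include <sys/prctl.h>

/* Set no_new_privs for the calling thread and confirm the kernel recorded it.
 * Returns 1 when the flag is set, 0 when it is not, -1 on error. */
int seal_no_new_privs(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}

/* Attempt to clear the flag. The kernel accepts only the value 1,
 * so this is expected to return -1 and leave the flag set. */
int try_clear_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

Calling seal_no_new_privs() early in sandbox setup also matters for the next layer: an unprivileged process must have this flag set before it is allowed to install a seccomp filter.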
Ambient Capabilities:
Ambient capabilities (since Linux 4.3) address a usability issue: how to run a non-setuid program with specific privileges. They allow capabilities to be preserved across execve() for programs that don't have file capabilities set, enabling capability-based privilege without setuid.
Beyond capabilities, process credentials (UID, GID) provide another layer of isolation. By switching to dedicated sandboxed identities, processes are isolated by traditional Unix permission mechanisms.
Dedicated Sandbox Users:
A common pattern is to create dedicated user accounts for sandboxed services:
# Create sandbox user with restricted shell
useradd --system --shell /usr/sbin/nologin \
    --home-dir /var/empty --no-create-home \
    sandbox_user
# Service starts as root, then drops to sandbox_user
This approach provides:
Dropping Privileges:
The privilege drop sequence must be performed carefully to avoid race conditions and ensure complete privilege separation:
// Requires _GNU_SOURCE with <unistd.h> (setresuid, setresgid) and <grp.h> (setgroups)
void drop_to_sandbox_user(uid_t uid, gid_t gid) {
    // 1. Clear supplementary groups first
    if (setgroups(0, NULL) != 0) {
        perror("setgroups");
        exit(1);
    }
    // 2. Set GID before UID (can't change GID after dropping root UID)
    if (setresgid(gid, gid, gid) != 0) {
        perror("setresgid");
        exit(1);
    }
    // 3. Set UID (this drops root)
    if (setresuid(uid, uid, uid) != 0) {
        perror("setresuid");
        exit(1);
    }
    // 4. Verify privilege drop succeeded
    if (getuid() != uid || geteuid() != uid) {
        fprintf(stderr, "UID drop failed\n");
        exit(1);
    }
    if (getgid() != gid || getegid() != gid) {
        fprintf(stderr, "GID drop failed\n");
        exit(1);
    }
}
Always set GID before UID. Changing to an arbitrary GID requires CAP_SETGID, which a process loses when it switches all of its UIDs away from root, so setresgid() fails after the UID drop. Getting the order wrong leaves the sandbox with the original (often privileged) group.
Nobody User Anti-Pattern:
Historically, services would drop to the "nobody" user (typically UID 65534). This is now considered an anti-pattern:
Modern best practice is one dedicated user per service, providing true isolation.
Dynamic User Allocation:
Systemd provides dynamic user allocation via DynamicUser=yes in unit files. At service start, systemd allocates a unique UID/GID from a pool, runs the service as that identity, and reclaims the ID when the service stops. This provides per-service identity isolation without explicit user management.
Sandboxing isn't just about preventing unauthorized access—it's also about preventing resource abuse. A sandboxed process should not be able to consume unlimited CPU, memory, disk I/O, or network bandwidth, causing denial-of-service for the host system.
Traditional Resource Limits (rlimits):
The setrlimit() system call provides per-process resource limits:
#include <sys/resource.h>
void set_resource_limits(void) {
    struct rlimit rl;
    // Limit address space to 256MB
    rl.rlim_cur = rl.rlim_max = 256 * 1024 * 1024;
    setrlimit(RLIMIT_AS, &rl);
    // Limit maximum file size to 10MB
    rl.rlim_cur = rl.rlim_max = 10 * 1024 * 1024;
    setrlimit(RLIMIT_FSIZE, &rl);
    // Limit number of open files to 64
    rl.rlim_cur = rl.rlim_max = 64;
    setrlimit(RLIMIT_NOFILE, &rl);
    // Limit number of processes to 1 (prevent fork bombs).
    // Note: the kernel counts all processes of the real UID, so this
    // only works as intended with a dedicated sandbox user.
    rl.rlim_cur = rl.rlim_max = 1;
    setrlimit(RLIMIT_NPROC, &rl);
    // No core dumps
    rl.rlim_cur = rl.rlim_max = 0;
    setrlimit(RLIMIT_CORE, &rl);
}
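To see a limit being enforced, the following sketch lowers RLIMIT_NOFILE in a child process and opens files until the kernel refuses. It is an illustration only; the function name check_nofile_limit is hypothetical, and the child is used so that the caller's own limits stay untouched.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* In a child process, cap RLIMIT_NOFILE at `limit` descriptors and keep
 * opening /dev/null until open() fails.
 * Returns 0 if the failure was EMFILE (limit enforced), 1 otherwise,
 * and -1 if the child could not be created or did not exit normally. */
int check_nofile_limit(rlim_t limit)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl = { .rlim_cur = limit, .rlim_max = limit };
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            _exit(1);
        /* limit + 1 attempts are always enough to hit the ceiling */
        for (rlim_t i = 0; i <= limit; i++) {
            if (open("/dev/null", O_RDONLY) < 0)
                _exit(errno == EMFILE ? 0 : 1);
        }
        _exit(1);   /* never hit the limit: not enforced */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because both the soft and the hard limit are lowered, the child cannot raise the limit again unless it holds CAP_SYS_RESOURCE.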
Control Groups (cgroups):
cgroups provide more sophisticated resource control at the process group level. Unlike rlimits, cgroups can:
cgroup v2 Controllers:
| Controller | Resource | Key Settings |
|---|---|---|
| cpu | CPU time | cpu.max (quota period), cpu.weight (priority) |
| memory | Memory usage | memory.max, memory.high, memory.swap.max |
| io | Block I/O | io.max (BPS/IOPS limits), io.weight |
| pids | Process count | pids.max (fork bomb protection) |
| cpuset | CPU/memory node affinity | cpuset.cpus, cpuset.mems |
| rdma | RDMA resources | rdma.max (HCA handles, objects) |
Setting cgroup Limits:
# Enable the controllers for child cgroups (a no-op if already enabled)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
# Create cgroup for sandbox
mkdir /sys/fs/cgroup/sandbox
# Set memory limit to 256MB
echo "268435456" > /sys/fs/cgroup/sandbox/memory.max
# Set CPU limit to 50% of one core
echo "50000 100000" > /sys/fs/cgroup/sandbox/cpu.max
# Set max processes to 10
echo "10" > /sys/fs/cgroup/sandbox/pids.max
# Add current process to cgroup
echo $$ > /sys/fs/cgroup/sandbox/cgroup.procs
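The numbers written to cpu.max and memory.max follow from simple arithmetic: cpu.max takes a quota and a period in microseconds, and memory.max takes a byte count. A small sketch of that arithmetic, with hypothetical helper names, might look like this:

```c
#include <stdio.h>

/* Format a cpu.max value ("<quota> <period>", both in microseconds) that
 * allows `percent` of one CPU core per period. Values above 100 grant
 * more than one core. Returns 0 on success, -1 on bad input or if the
 * buffer is too small. */
int format_cpu_max(unsigned percent, unsigned period_us, char *buf, size_t len)
{
    if (percent == 0 || period_us == 0)
        return -1;
    unsigned long long quota = (unsigned long long)period_us * percent / 100;
    if (quota == 0)
        return -1;
    int n = snprintf(buf, len, "%llu %u", quota, period_us);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Convert mebibytes to the byte count expected by memory.max. */
unsigned long long mib_to_bytes(unsigned mib)
{
    return (unsigned long long)mib * 1024ULL * 1024ULL;
}
```

For example, 50 percent with a 100000 microsecond period yields the string "50000 100000", and 256 MiB yields 268435456, matching the values in the shell commands.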
Memory Limit Behavior:
When a process exceeds its memory limit:
- memory.max: Hard limit; process allocation fails or OOM killer triggered
- memory.high: Throttle point; process is throttled but not killed
- memory.low: Best-effort protection; memory not reclaimed if possible
- memory.min: Guaranteed minimum; absolute protection from reclaim
The cgroup namespace (CLONE_NEWCGROUP) allows a sandboxed process to see its cgroup as the root cgroup. This prevents the sandbox from observing host cgroup structure or manipulating host cgroups. Combine with appropriate cgroup placement for defense in depth.
Let's examine how these components combine to create a robust process sandbox. This example demonstrates a layered approach that's conceptually similar to what browsers and containers use.
Sandbox Setup Sequence:
The order of operations matters critically. Here's the typical sequence:
Conceptual Code Outline:
void setup_sandbox_and_exec(const char *program, char *const argv[]) {
    // Phase 1: Namespace creation
    unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID |
            CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS);
    // Need to fork after CLONE_NEWPID: only the child enters the new PID namespace
    pid_t pid = fork();
    if (pid < 0) _exit(1);   // fork failed
    if (pid != 0) exit(0);   // Parent exits, child continues
    // Phase 2: Namespace configuration
    setup_uid_gid_mappings();
    setup_mount_namespace("/sandbox/rootfs");
    setup_network_namespace();
    // Phase 3: Privilege restriction
    chdir("/");
    drop_supplementary_groups();
    drop_gid(SANDBOX_GID);
    drop_uid(SANDBOX_UID);
    drop_capabilities();
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    // Phase 4: System call filtering
    install_seccomp_filter();
    // Phase 5: File descriptor cleanup
    close_fds_except(STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO);
    // Phase 6: Execute target
    execv(program, argv);
    _exit(127);  // execv failed
}
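The outline leaves its helper functions undefined. As one illustration, the descriptor cleanup step could be approached by listing /proc/self/fd and closing everything from a given number upward. This is a sketch under the assumption that /proc is mounted; close_fds_from is a hypothetical name and not the close_fds_except helper from the outline.

```c
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every open descriptor numbered `lowfd` or above, found by
 * listing /proc/self/fd. Returns the number of descriptors closed,
 * or -1 if the directory cannot be opened. */
int close_fds_from(int lowfd)
{
    DIR *d = opendir("/proc/self/fd");
    if (!d)
        return -1;

    int self = dirfd(d);    /* the descriptor used for the listing itself */
    int closed = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        char *end;
        long fd = strtol(e->d_name, &end, 10);
        if (end == e->d_name || *end != '\0')
            continue;       /* skips "." and ".." */
        if (fd < lowfd || fd == self)
            continue;
        if (close((int)fd) == 0)
            closed++;
    }
    closedir(d);            /* also releases the listing descriptor */
    return closed;
}
```

Calling close_fds_from(3) just before execv() would leave only standard input, output, and error open. On recent kernels the close_range() system call offers the same effect without depending on /proc.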
Each layer provides independent protection. If an attacker bypasses seccomp, they still face namespace isolation. If they escape the namespace, capabilities are still restricted. If they somehow regain capabilities, rlimits still constrain resource use. Because an attacker must defeat every layer rather than just one, this layering makes a complete sandbox escape far harder.
We have explored the operating system mechanisms for confining processes within controlled environments. Let's consolidate the key insights:
What's Next:
Process sandboxing restricts the process's environment, but the process still has access to the system call interface—the gateway to kernel functionality. The next page explores system call filtering: how to restrict not just what resources a process can see, but what operations it can perform.
You now understand the mechanisms for sandboxing processes at the operating system level: namespaces for resource virtualization, chroot/pivot_root for filesystem isolation, capabilities for privilege control, credentials for identity separation, and cgroups for resource limiting. The next page will cover system call filtering with seccomp.