Throughout this module, we've explored the fundamental building blocks of sandboxing: the sandbox concept, process isolation, system call filtering, and seccomp. Now we bring these mechanisms together to examine container isolation—perhaps the most impactful application of sandboxing technology in modern computing.
Containers have revolutionized software deployment. They package applications with their dependencies, run isolated from other containers and the host, and enable efficient resource utilization. But containers are not virtual machines—they share a kernel with the host. This kernel sharing is both their strength (efficiency, performance) and their challenge (a shared kernel is a shared attack surface).
Understanding container isolation is essential for anyone working with modern infrastructure. Containers are everywhere: microservices, CI/CD pipelines, cloud functions, edge computing. Their security properties directly impact the security of countless production systems.
By the end of this page, you will understand how containers achieve isolation by combining Linux kernel primitives, the security architecture of container runtimes, defense in depth strategies for container environments, the differences between container security profiles, and emerging technologies for stronger container isolation.
A container is an isolated execution environment that provides a consistent runtime for applications. Unlike virtual machines, which emulate hardware and run a complete guest kernel, containers share the host kernel while maintaining isolation at the resource and namespace level.
Container vs. VM Isolation:
| Aspect | Virtual Machine | Container |
|---|---|---|
| Kernel | Separate guest kernel | Shared host kernel |
| Isolation boundary | Hypervisor (hardware-based) | Kernel namespaces/cgroups |
| Overhead | Full OS memory, boot time | Minimal (MB, milliseconds) |
| Performance | Near-native (with VT-x) | Native |
| Security level | Strong (hardware-enforced) | Medium (kernel-enforced) |
| Density | Tens per host | Hundreds per host |
| Attack surface | Hypervisor escape | Kernel vulnerability |
The Container Abstraction Stack:
┌─────────────────────────────────────────────────────┐
│                  User Application                   │
├─────────────────────────────────────────────────────┤
│                  Container Runtime                  │
│         (containerd, CRI-O, Docker Engine)          │
├─────────────────────────────────────────────────────┤
│                     OCI Runtime                     │
│     (runc, crun, gVisor runsc, Kata Containers)     │
├─────────────────────────────────────────────────────┤
│                    Linux Kernel                     │
│ Namespaces | cgroups | seccomp | Capabilities | LSM │
├─────────────────────────────────────────────────────┤
│                      Hardware                       │
└─────────────────────────────────────────────────────┘
Container runtimes manage container lifecycle (create, start, stop, delete). OCI runtimes actually execute containers according to the Open Container Initiative specification. The OCI runtime interfaces directly with kernel isolation primitives.
There is no 'container' syscall or kernel object. Containers are a userspace abstraction built by combining multiple kernel features: namespaces for resource isolation, cgroups for resource limits, seccomp for syscall filtering, capabilities for privilege management, and LSMs for mandatory access control.
Containers typically use all available namespace types to create comprehensive isolation. Each namespace isolates a different aspect of the system view:
Standard Container Namespaces:
| Namespace | What's Isolated | Container Benefit |
|---|---|---|
| Mount (mnt) | File system mounts | Own root filesystem, no host visibility |
| PID | Process IDs | Own PID 1, can't see host processes |
| Network (net) | Network stack | Own interfaces, ports, routes |
| IPC | IPC mechanisms | Own shared memory, semaphores |
| UTS | Hostname and domain | Own hostname |
| User | UID/GID mappings | Unprivileged containers, root remapping |
| Cgroup | Cgroup root view | Own cgroup hierarchy view |
| Time | System clocks | Own time view (Linux 5.6+) |
Creating a Container's Namespace Set:
When a container starts, the runtime creates a new set of namespaces:
// Simplified container namespace creation
#define _GNU_SOURCE
#include <sched.h>   // clone(), CLONE_* flags
#include <signal.h>  // SIGCHLD

int container_flags = CLONE_NEWNS      // Mount namespace
                    | CLONE_NEWPID     // PID namespace
                    | CLONE_NEWNET     // Network namespace
                    | CLONE_NEWUTS     // UTS namespace
                    | CLONE_NEWIPC     // IPC namespace
                    | CLONE_NEWUSER    // User namespace (if rootless)
                    | CLONE_NEWCGROUP; // Cgroup namespace

// SIGCHLD in the flags lets the runtime wait() on the container's init
int pid = clone(container_init, stack_top, container_flags | SIGCHLD, arg);
Network Namespace Configuration:
Containers need network connectivity despite having isolated network stacks. Common approaches:
# Create virtual ethernet pair
ip link add veth-host type veth peer name veth-container
# Move one end to container's network namespace
ip link set veth-container netns container_pid
# Configure host side (connect to bridge)
ip link set veth-host up
ip link set veth-host master docker0
# Configure container side
nsenter -t $container_pid -n ip addr add 172.17.0.2/16 dev veth-container
nsenter -t $container_pid -n ip link set veth-container up
nsenter -t $container_pid -n ip route add default via 172.17.0.1
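The nsenter invocations above work by joining the container's network namespace. For illustration, here is a minimal C sketch of the same mechanism (a hypothetical helper, not taken from any runtime): open the namespace file under /proc and call setns(2).
// Hypothetical sketch of what `nsenter -t <pid> -n` does: join another
// process's network namespace via setns(2).
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int enter_net_ns(pid_t target)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/ns/net", (int)target);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = setns(fd, CLONE_NEWNET); // from here on, we see the container's interfaces
    close(fd);
    return rc;
}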
User Namespace and Rootless Containers:
User namespaces enable rootless containers—containers that run without any host privileges. The container's root is mapped to an unprivileged host user:
# Container root (UID 0) is mapped to host UID 100000
$ cat /proc/$container_pid/uid_map
0 100000 65536
# Inside container: I'm root!
root@container# id
uid=0(root) gid=0(root) groups=0(root)
# But on host: just UID 100000
$ ls -l /proc/$container_pid/exe
-rwxr-xr-x 1 100000 100000 ... /usr/bin/myapp
Rootless containers provide an additional security layer: even if an attacker escapes the container, they end up as an unprivileged host user.
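To illustrate how a runtime establishes such a mapping, here is a minimal sketch: after creating the child with CLONE_NEWUSER, the parent writes the UID and GID maps through /proc. The helper names and the 100000 base are assumptions for illustration; writing a 65536-wide range like this requires appropriate privilege or the setuid newuidmap/newgidmap helpers.
// Hypothetical sketch: map container root (UID/GID 0) to host 100000.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void write_map(pid_t pid, const char *file, const char *mapping)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, mapping, strlen(mapping));
        close(fd);
    }
}

// Called by the parent after clone(... | CLONE_NEWUSER ...):
void map_container_root(pid_t child)
{
    // "deny" must be written to setgroups before an unprivileged
    // process is allowed to set gid_map.
    write_map(child, "setgroups", "deny");
    write_map(child, "uid_map", "0 100000 65536\n");
    write_map(child, "gid_map", "0 100000 65536\n");
}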
Always use user namespaces (rootless mode) when possible. Container escapes are less damaging when the escaped process has no host privileges. The tradeoff is slightly reduced compatibility (some operations need real root), but security benefits usually outweigh this.
While namespaces isolate visibility of resources, cgroups (control groups) isolate consumption of resources. Every container runs in its own cgroup with configured limits.
Container cgroup Configuration:
# Container's cgroup (cgroup v2)
/sys/fs/cgroup/system.slice/docker-abc123def.scope/
# Memory limit: 512MB
echo "536870912" > memory.max
# CPU limit: 50% of one core
echo "50000 100000" > cpu.max
# Max 100 processes
echo "100" > pids.max
# I/O rate limits
echo "8:0 wbps=10485760" > io.max # 10 MB/s write to device 8:0
Resource Protection Properties:
| Controller | What's Limited | Protection Provided |
|---|---|---|
| memory | RAM, swap usage | Prevents container from exhausting host memory |
| cpu | CPU time | Ensures fair CPU sharing between containers |
| pids | Process count | Prevents fork bombs affecting host |
| io | Block I/O bandwidth | Prevents I/O starvation |
| cpuset | CPU affinity | Pin containers to specific CPUs |
| memory.oom.group | OOM grouping | Kill entire container on OOM, not just one process |
Memory Limit Behavior:
When a container exceeds its memory limit, the kernel's OOM (Out of Memory) killer is invoked. cgroup v2 provides more predictable behavior:
# Set memory limit
echo "536870912" > /sys/fs/cgroup/container_cgroup/memory.max
# When limit exceeded:
# - If memory.oom.group = 1: all processes in cgroup killed
# - If memory.oom.group = 0: kernel picks individual processes
# For containers, prefer killing all (consistent state)
echo "1" > memory.oom.group
CPU Throttling:
# Format: "quota period" in microseconds
# This gives 50% of one CPU (50ms every 100ms)
echo "50000 100000" > cpu.max
# For 2 full CPUs worth:
echo "200000 100000" > cpu.max
# For a soft limit (proportional share under contention, no hard cap)
echo "150" > cpu.weight # Default is 100; higher weight = larger share
Containers without resource limits can consume unlimited host resources. A single misbehaving container can take all available memory, triggering the host OOM killer. Always configure memory.max, pids.max, and cpu.max for production containers.
Container runtimes apply seccomp profiles to restrict syscalls available to containerized processes. Docker, Kubernetes, and other platforms support configurable seccomp policies.
Docker Default seccomp Profile:
Docker's default profile blocks approximately 44 syscalls (as of Docker 20.10):
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"archMap": [
{
"architecture": "SCMP_ARCH_X86_64",
"subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
}
],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "adjtimex", "alarm",
"bind", "brk", "capget", "capset", "chdir", ...
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Blocked by Default:
acct, add_key, bpf, clock_adjtime, clock_settime, clone (with CLONE_NEWUSER),
create_module, delete_module, finit_module, get_kernel_syms, get_mempolicy,
init_module, ioperm, iopl, kcmp, kexec_file_load, kexec_load, keyctl,
lookup_dcookie, mbind, mount, move_pages, name_to_handle_at, nfsservctl,
open_by_handle_at, perf_event_open, personality, pivot_root, process_vm_readv,
process_vm_writev, ptrace, query_module, quotactl, reboot, request_key,
set_mempolicy, setns, settimeofday, stime, swapon, swapoff, sysfs, _sysctl,
umount, umount2, unshare, uselib, userfaultfd, ustat, vm86, vm86old
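The same allow-list pattern can be reproduced with libseccomp. The sketch below is deliberately tiny (a handful of syscalls rather than Docker's several hundred), and the allow-list is an assumption chosen for illustration; link with -lseccomp.
// Sketch of a container-style allow-list filter with libseccomp.
// Blocked syscalls fail with EPERM, like Docker's SCMP_ACT_ERRNO default.
#include <errno.h>
#include <seccomp.h>
#include <stddef.h>

int install_container_filter(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (!ctx)
        return -1;

    // Illustrative allow-list; a real profile allows far more.
    int allowed[] = {
        SCMP_SYS(read),  SCMP_SYS(write), SCMP_SYS(brk),
        SCMP_SYS(mmap),  SCMP_SYS(futex), SCMP_SYS(exit_group),
    };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0) < 0) {
            seccomp_release(ctx);
            return -1;
        }
    }

    int rc = seccomp_load(ctx); // filter applies to this process and its children
    seccomp_release(ctx);
    return rc;
}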
Kubernetes seccomp Profiles:
Kubernetes supports seccomp through security contexts:
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
seccompProfile:
type: RuntimeDefault # Use container runtime's default
containers:
- name: app
image: myapp:latest
Profile Types:
| Type | Description |
|---|---|
| RuntimeDefault | Use the container runtime's default profile |
| Unconfined | No seccomp filtering (dangerous!) |
| Localhost | Use a custom profile from the node's filesystem |
Custom Profile Example:
apiVersion: v1
kind: Pod
metadata:
name: custom-seccomp-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/strict.json
containers:
- name: app
image: myapp:latest
// /var/lib/kubelet/seccomp/profiles/strict.json
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "exit_group", "futex", "mmap"],
"action": "SCMP_ACT_ALLOW"
}
]
}
Tools like 'strace' and Sysdig's 'oci-seccomp-bpf-hook' can trace an application and generate a minimal seccomp profile based on actual syscall usage. Run your application under tracing, then use the generated profile in production.
Linux capabilities provide fine-grained privilege control. Container runtimes drop most capabilities by default, granting only what's needed for typical containerized applications.
Docker Default Capabilities:
Granted by default:
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD,
CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP,
CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE
Notably NOT granted:
CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE,
CAP_SYS_TIME, CAP_SYS_RESOURCE, CAP_SYS_BOOT, CAP_MAC_ADMIN, ...
Modifying Capabilities:
# Drop all capabilities
docker run --cap-drop=ALL myimage
# Drop all, add specific ones back
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Add a capability (dangerous!)
docker run --cap-add=SYS_PTRACE myimage
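Application code can enforce the same discipline on itself with libcap. Below is a minimal sketch (a hypothetical helper, link with -lcap) that keeps only CAP_NET_BIND_SERVICE in the permitted and effective sets; note it does not shrink the bounding set, which a real runtime would also do.
// Hypothetical sketch: drop to a minimal capability set with libcap,
// mirroring "--cap-drop=ALL --cap-add=NET_BIND_SERVICE".
#include <sys/capability.h>

int drop_to_net_bind(void)
{
    cap_value_t keep[] = { CAP_NET_BIND_SERVICE };
    cap_t caps = cap_init(); // all three flag sets start empty
    if (!caps)
        return -1;

    // Raise only the capability we actually need.
    cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET);
    cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET);

    int rc = cap_set_proc(caps); // reducing our own caps needs no privilege
    cap_free(caps);
    return rc;
}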
Kubernetes Capability Configuration:
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
image: myapp:latest
securityContext:
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
Dangerous Capabilities:
| Capability | Risk | Escape Method |
|---|---|---|
| CAP_SYS_ADMIN | Very High | Mount filesystems, access /proc, many kernel features |
| CAP_SYS_PTRACE | High | Ptrace host processes if PID namespace shared |
| CAP_NET_ADMIN | High | Reconfigure network, potential escape via network |
| CAP_SYS_RAWIO | High | Direct hardware/memory access |
| CAP_SYS_MODULE | Critical | Load kernel modules (instant escape) |
| CAP_DAC_READ_SEARCH | Medium | Read any file (bypass permissions) |
Never use 'docker run --privileged' in production. This grants ALL capabilities, disables seccomp, and provides full device access. A privileged container can trivially escape to the host. If you think you need --privileged, you almost always need to redesign your approach.
Mandatory Access Control (MAC) systems provide an additional security layer by enforcing policies that even root cannot bypass. Docker and Kubernetes support both AppArmor (Ubuntu/Debian default) and SELinux (RHEL/CentOS default).
AppArmor for Containers:
Docker applies a default AppArmor profile (docker-default) that, among other restrictions, denies mount operations and writes to sensitive /proc and /sys paths:
# View Docker's default profile
cat /etc/apparmor.d/docker-default
# Use custom profile
docker run --security-opt apparmor=my-custom-profile myimage
# Disable AppArmor (not recommended)
docker run --security-opt apparmor=unconfined myimage
Custom AppArmor Profile:
#include <tunables/global>
profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
network inet tcp,
network inet udp,
/usr/sbin/nginx mr,
/var/log/nginx/* w,
/etc/nginx/** r,
/var/www/** r,
# Deny dangerous operations
deny mount,
deny ptrace,
deny @{PROC}/* w,
}
SELinux for Containers:
SELinux provides type enforcement, labeling all processes and files with security contexts:
# View container's SELinux context
ps -eZ | grep container
# system_u:system_r:container_t:s0:c123,c456 ... my-container
# Custom SELinux label
docker run --security-opt label=type:custom_container_t myimage
# Disable SELinux confinement (dangerous)
docker run --security-opt label=disable myimage
SELinux Container Policy:
The container-selinux policy runs container processes as container_t and labels container content container_file_t. Each container also receives a unique pair of MCS categories (the c123,c456 above), so even two containers sharing the container_t type cannot access each other's files.
Kubernetes SELinux Options:
apiVersion: v1
kind: Pod
spec:
securityContext:
seLinuxOptions:
level: "s0:c123,c456" # MCS categories for isolation
containers:
- name: app
securityContext:
seLinuxOptions:
type: container_t
AppArmor/SELinux provides another layer that must be bypassed for full escape. Even if an attacker escapes namespace isolation, MAC policies still restrict what they can do on the host. This defense-in-depth approach makes exploitation significantly harder.
Despite all isolation mechanisms, container escapes remain possible. Understanding escape vectors helps in designing secure container deployments.
Common Escape Vectors:
| Vector | Description | Mitigation |
|---|---|---|
| Kernel vulnerabilities | Exploit bugs in syscall handlers | Seccomp filtering, kernel updates |
| Privileged containers | --privileged or CAP_SYS_ADMIN | Never use privileged mode |
| Host mount exposure | Sensitive host paths mounted | Minimize mounts, read-only when possible |
| Docker socket mount | /var/run/docker.sock exposed | Never mount Docker socket |
| Shared namespaces | Host PID/network namespace shared | Use dedicated namespaces |
| Writable /proc or /sys | Kernel interfaces exposed | Mount read-only; mask sensitive paths |
| Exposed device files | /dev access to host devices | Device whitelist, no raw access |
Case Study: CVE-2019-5736 (runc vulnerability)
A container escape vulnerability in runc allowed a malicious container to overwrite the host runc binary. When an administrator exec'd into a compromised container (for example, docker exec), the container process could obtain a handle to the runc binary via /proc/self/exe and overwrite it; the next container operation then executed attacker-controlled code as root on the host.
Mitigations:
- Update runc: the fix re-executes runc from a sealed in-memory copy, so the host binary can no longer be targeted
- Run rootless containers: with user namespaces, container root lacks the host privileges needed to overwrite a root-owned binary
- Keep SELinux/AppArmor enforcing; on some distributions the default SELinux policy blocked this attack outright
Case Study: Privileged Container Escape
# If container is privileged, escape is trivial:
docker run --privileged -it ubuntu bash
# Inside the privileged container: full device access means the
# host's root disk can simply be mounted and chrooted into
fdisk -l                 # identify the host disk, e.g. /dev/sda1
mkdir /host
mount /dev/sda1 /host
chroot /host
# Now have root shell on host!
# Or load a kernel module (code then runs in the host kernel):
insmod /host/path/to/malicious.ko
# (Abusing the cgroup v1 release_agent/notify_on_release mechanism
# is another classic privileged-container escape.)
For workloads requiring stronger isolation than standard containers provide, several technologies offer enhanced security at the cost of some overhead or compatibility.
gVisor (Google):
gVisor implements a user-space kernel that intercepts container syscalls. The container talks to gVisor, which emulates kernel behavior without actually invoking the host kernel for most operations.
┌────────────────────────┐
│      Application       │
├────────────────────────┤
│     gVisor Sentry      │ ← User-space "kernel"
│  (syscall emulation)   │
├────────────────────────┤
│      gVisor Gofer      │ ← File system broker
├────────────────────────┤
│      Host Kernel       │ ← Sees only a small syscall subset
└────────────────────────┘
Security Benefits:
- Host kernel attack surface shrinks dramatically: application syscalls terminate in the Sentry, which seccomp confines to a small set of host syscalls
- The Sentry is written in Go, eliminating most memory-corruption bug classes
Trade-offs:
- Syscall- and I/O-intensive workloads pay a measurable performance penalty
- Syscall coverage is incomplete, so some applications will not run unmodified
Kata Containers:
Kata Containers runs each container in a lightweight virtual machine, providing hardware-level isolation:
┌────────────────┐  ┌────────────────┐
│  Container A   │  │  Container B   │
├────────────────┤  ├────────────────┤
│  Guest Kernel  │  │  Guest Kernel  │
├────────────────┤  ├────────────────┤
│    QEMU/FC     │  │    QEMU/FC     │ ← Micro-VM
├────────────────┴──┴────────────────┤
│            Host Kernel             │
└────────────────────────────────────┘
Security Benefits:
- Hardware-enforced isolation: an escape requires a hypervisor bug, not merely a kernel bug
- Each container gets its own guest kernel, so host kernel vulnerabilities are far harder to reach
Trade-offs:
- Higher memory footprint and slower startup than namespace-based containers
- Some host integrations (device passthrough, certain volume types) are more complex
Firecracker (AWS):
Firecracker is a micro-VM engine used by AWS Lambda and AWS Fargate. Built on KVM and written in Rust, it pares the device model down to a few virtio devices, keeping per-VM overhead low enough to pack thousands of micro-VMs onto one host with boot times in the low hundreds of milliseconds.
Isolation Technology Comparison:
| Technology | Isolation | Overhead | Compatibility | Use Case |
|---|---|---|---|---|
| Standard containers | Medium (shared kernel) | Lowest | Highest | Most workloads |
| gVisor | High (user-space kernel) | Medium | Good | Untrusted code, multi-tenant |
| Kata Containers | Very High (VM) | Higher | High | Strong isolation required |
| Firecracker | Very High (micro-VM) | Low-Medium | Good | Serverless, FaaS |
Standard containers are sufficient for many workloads where containers run trusted code. For multi-tenant environments running untrusted code, gVisor or Kata provide stronger isolation. Firecracker excels for serverless platforms needing both security and density.
We have explored how containers combine kernel isolation primitives to create practical sandboxing at scale. Let's consolidate the key insights:
- Containers are a userspace abstraction: namespaces isolate visibility, cgroups isolate consumption, seccomp filters syscalls, capabilities bound privileges, and LSMs add mandatory access control
- The shared host kernel is the fundamental trust boundary; every hardening layer exists to protect it
- Rootless containers, minimal capabilities, seccomp profiles, and resource limits together form defense in depth
- Escapes come from kernel bugs, excess privilege (--privileged, CAP_SYS_ADMIN), and careless mounts, so configuration discipline matters as much as the mechanisms themselves
- When a shared kernel is unacceptable, gVisor, Kata Containers, and Firecracker trade overhead for stronger boundaries
Module Complete:
You have now completed the Sandboxing module. You understand:
- The sandbox concept and why isolation matters
- Process isolation through namespaces and resource limits
- System call filtering and seccomp
- How container isolation composes these primitives into practical, deployable sandboxes
This knowledge is essential for building, deploying, and securing modern applications. Sandboxing is not optional in today's threat environment: it is a fundamental requirement for any system handling untrusted input or running untrusted code. You can now design, implement, and evaluate sandboxing solutions for a range of security requirements.