Containers share a kernel. This fundamental architectural choice enables their efficiency—fast startup, low overhead, high density—but raises critical security questions: How isolated are containers from each other? How isolated are they from the host? What happens if a container is compromised?
The answer is nuanced. Container isolation is built from multiple overlapping mechanisms—namespaces, cgroups, capabilities, seccomp, and mandatory access controls—each addressing different attack vectors. Understanding how these layers work together, and where gaps remain, is essential for deploying containers securely.
By the end of this page, you will understand the complete container isolation stack, how Linux capabilities partition root privileges, how seccomp filters system calls, how AppArmor and SELinux provide mandatory access control, common container escape vectors, and enhanced isolation mechanisms like gVisor and Kata Containers for high-security environments.
Container isolation is not a single mechanism but a layered defense combining multiple kernel features. Each layer addresses different isolation concerns:
Layer 1: Namespaces — What Can Be Seen
Namespaces partition global resources (process IDs, mount points, network interfaces, hostnames, IPC objects, user IDs) so each container sees only its own view of the system.
Layer 2: cgroups — What Can Be Consumed
Cgroups limit resource consumption (CPU time, memory, block I/O, process counts) so one container cannot starve the others.
Layer 3: Capabilities — What Can Be Done
Capabilities divide traditional root power into granular permissions.
Layer 4: Seccomp — What System Calls Can Be Made
Seccomp filters restrict which kernel interfaces a container can access.
Layer 5: LSMs (AppArmor/SELinux) — What Files and Operations Are Allowed
Mandatory access controls provide fine-grained object-level restrictions.
Layer 6: Read-Only / Immutable Configurations
Read-only root filesystems, no-new-privileges flags, and similar hardening options limit what even a compromised process can change.
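For example, Docker exposes this layer through run-time flags (a minimal sketch; the image name is a placeholder):
# Read-only root filesystem, a writable tmpfs for /tmp, and no privilege escalation
docker run --read-only --tmpfs /tmp --security-opt no-new-privileges myimage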
Each layer acts as a filter: an operation must pass through every layer to succeed.
This defense-in-depth means a vulnerability in one layer may be blocked by another. Breaking out of a container typically requires bypassing multiple security mechanisms.
Container escapes usually chain multiple weaknesses: a capability that shouldn't have been granted, combined with a missing seccomp rule and an exposed namespace gap. Robust security requires configuring all layers correctly, not relying on any single mechanism.
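A quick way to see this layering in practice (assuming a stock Docker install with its default capability, seccomp, and AppArmor settings):
# Try to mount a tmpfs inside an unprivileged container
docker run --rm alpine sh -c 'mount -t tmpfs none /mnt'
# Expected to fail, e.g.: mount: permission denied (are you root?)
# The attempt is denied independently by the dropped CAP_SYS_ADMIN, the default
# seccomp profile, and the default AppArmor profile.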
Traditional UNIX has a binary privilege model: you're either root (uid 0, can do everything) or not root (restricted). This is problematic—many programs need only one specific privilege (e.g., binding to port 80) but running as root grants all privileges.
Capabilities split root's omnipotent power into ~40 distinct capabilities. A process can have specific capabilities without having full root access.
Key Capabilities
| Capability | Allows | Container Default | Security Impact |
|---|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Usually granted | Low (useful for web servers) |
| CAP_NET_ADMIN | Network configuration | Usually dropped | Medium (can reconfigure network) |
| CAP_SYS_ADMIN | Broad admin powers (mount, etc.) | Usually dropped | Critical (near-root, escape risk) |
| CAP_SYS_PTRACE | Trace/debug processes | Usually dropped | High (process injection risk) |
| CAP_CHOWN | Change file ownership | Usually granted | Medium (can change container files) |
| CAP_SETUID/SETGID | Change UID/GID | Usually granted | Medium (privilege escalation) |
| CAP_DAC_OVERRIDE | Bypass file permissions | Usually granted | High (bypasses file permission checks) |
| CAP_MKNOD | Create device files | Usually granted | High (device access, constrained by the device cgroup) |
Default Docker Capabilities
Docker grants a default set of capabilities—enough for most applications to work without granting dangerous powers:
# Default capabilities granted (Docker/containerd)
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER,
CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID,
CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE,
CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE
# Notably NOT granted by default:
CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE,
CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_SYS_TIME
Manipulating Capabilities
# Drop all capabilities (most restrictive)
docker run --cap-drop=ALL myimage
# Drop all, then add back only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Add a specific dangerous capability (avoid if possible)
docker run --cap-add=SYS_ADMIN myimage # Dangerous!
# Check capabilities of a process
cat /proc/self/status | grep Cap
# CapInh: 0000000000000000
# CapPrm: 00000000a80425fb
# CapEff: 00000000a80425fb
# CapBnd: 00000000a80425fb
# CapAmb: 0000000000000000
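The capsh utility from libcap can decode these hex masks into capability names; for example, the CapEff value above corresponds to Docker's default set:
capsh --decode=00000000a80425fb
# 0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
#   cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
#   cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap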
CAP_SYS_ADMIN is the 'garbage can' capability—it grants ~30+ different privileges including mounting filesystems, loading BPF programs, accessing /proc/kcore, and more. Granting CAP_SYS_ADMIN to a container essentially defeats namespace isolation. Never grant it unless absolutely required and other isolation layers (seccomp, LSM) are very strict.
User Namespaces and Capabilities
Capabilities are scoped to user namespaces. A process can have CAP_SYS_ADMIN in its own user namespace without having it in the initial (host) user namespace.
Container with user namespace mapping:
Container root (uid 0) → Host uid 100000
Capabilities:
- CAP_SYS_ADMIN in container's user namespace: YES
- CAP_SYS_ADMIN in host's user namespace: NO
Result:
- Can mount filesystems within container's mount namespace
- Cannot access host's /proc/kcore or load kernel modules
This is why rootless containers (with user namespaces) are more secure—the container's capabilities don't translate to host privileges.
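You can verify the mapping from inside a container. A minimal check, assuming the daemon is configured for user-namespace remapping (rootless Docker and Podman behave similarly; the 100000 base is just an example value):
docker run --rm alpine cat /proc/self/uid_map
#          0     100000      65536
# Columns: uid inside the namespace, uid on the host, size of the mapped range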
Seccomp (Secure Computing Mode) restricts which system calls a process can make. Since all kernel interactions go through syscalls, seccomp is a powerful attack surface reduction mechanism.
How Seccomp Works
Seccomp installs a BPF (Berkeley Packet Filter) program in the kernel that intercepts every syscall. For each call, the filter decides whether to allow it, deny it with an error code, trap it, log it, kill the offending thread or process, or hand it to a user-space supervisor (seccomp notify).
Default Container Seccomp Profile
Docker and containerd ship with a default profile blocking ~50 dangerous syscalls while allowing the ~300+ needed for typical applications:
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
"syscalls": [
{
"names": ["accept", "bind", "clone", "read", "write", ...],
"action": "SCMP_ACT_ALLOW"
}
]
}
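One visible effect of the default profile (assuming a stock Docker install): unshare() is gated behind CAP_SYS_ADMIN, so creating a new user namespace fails inside the container even though most host kernels allow unprivileged processes to do so:
# Denied by the default seccomp profile
docker run --rm alpine unshare -U true
# Expected to fail with: Operation not permitted
# With seccomp disabled (debugging only!), the same call succeeds on most hosts
docker run --rm --security-opt seccomp=unconfined alpine unshare -U true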
Custom Seccomp Profiles
For high-security environments, create restrictive custom profiles:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "lseek", "mmap", "mprotect",
        "munmap", "brk", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "ioctl", "pread64", "pwrite64", "access", "pipe", "select",
        "sched_yield", "dup", "dup2", "nanosleep", "getpid", "getuid",
        "socket", "connect", "accept", "sendto", "recvfrom", "exit",
        "exit_group", "futex", "epoll_wait", "clock_gettime", "openat",
        "newfstatat"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2114060288,
          "op": "SCMP_CMP_MASKED_EQ"
        }
      ],
      "comment": "Allow clone but only for threads, not new namespaces"
    }
  ]
}
Generating Seccomp Profiles
Profile generators such as the OCI seccomp BPF hook or the Kubernetes Security Profiles Operator, as well as plain syscall tracing, can build profiles automatically by recording application behavior:
# Record syscalls made during application runtime
strace -c -f ./my-application
# Use recorded syscalls to build allowlist
# Or use seccomp-notify for dynamic analysis
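A rough, hypothetical pipeline for turning a trace into an allowlist (file names are placeholders):
# Trace every process and thread, writing one file per PID
strace -ff -qq -o trace ./my-application
# Extract the unique syscall names that start each trace line
grep -hoE '^[a-z0-9_]+\(' trace.* | tr -d '(' | sort -u > syscall-allowlist.txt
# Paste the resulting names into the "names" array of a seccomp profile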
Running with Seccomp
# Use tight custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage
# Disable seccomp (dangerous, for debugging only)
docker run --security-opt seccomp=unconfined myimage
# Check if seccomp is enabled
grep Seccomp /proc/self/status
# Seccomp: 2 (2 = filter mode enabled)
Seccomp filtering has minimal overhead—the BPF program runs in the kernel at near-native speed. The cost is typically <1% for syscall-heavy workloads. The security benefit far outweighs this minimal cost.
Linux Security Modules (LSMs) provide mandatory access control—security policies enforced by the kernel regardless of user identity or capabilities. The two most common LSMs for containers are AppArmor and SELinux.
AppArmor: Profile-Based Access Control
AppArmor confines programs using per-program profiles that specify what files, capabilities, and network access are permitted:
# Example AppArmor profile for nginx container
#include <tunables/global>
profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
# File access
/usr/sbin/nginx mr,
/etc/nginx/** r,
/var/log/nginx/** w,
/var/www/html/** r,
/run/nginx.pid w,
# Network access
network inet tcp,
network inet udp,
# Capabilities allowed
capability net_bind_service,
capability setgid,
capability setuid,
# Denials (explicit)
deny /etc/shadow r,
deny /proc/*/mem rw,
}
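To use a custom profile like this one, load it into the kernel on the host before starting the container (the file path below is an assumption):
# Load (or reload) the profile
sudo apparmor_parser -r /etc/apparmor.d/docker-nginx
# Run the container confined by it
docker run --security-opt apparmor=docker-nginx nginx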
Docker automatically applies a default AppArmor profile (docker-default) that denies mounting, blocks writes to /proc/sys, /sys, and other sensitive kernel interfaces, and restricts ptrace, while permitting normal application behavior.
SELinux: Label-Based Access Control
SELinux assigns labels (contexts) to subjects (processes) and objects (files, ports, etc.), then enforces policies based on label relationships:
# View SELinux context of a container process
ps -eZ | grep container
# system_u:system_r:container_t:s0:c123,c456 12345 ? nginx
# View SELinux context of files
ls -Z /var/www/html
# system_u:object_r:container_file_t:s0 index.html
# Container process (container_t) can access container files (container_file_t)
# but cannot access system files (admin_home_t, etc.)
SELinux Contexts for Containers:
- container_t: Standard container process type
- container_file_t: Container-accessible file type
- svirt_sandbox_file_t: Sandbox-mode file type

SELinux is more granular than AppArmor but more complex to configure. RHEL/CentOS systems use SELinux; Ubuntu/Debian typically use AppArmor.
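On SELinux hosts, the most common friction point is bind-mounted volumes: files keep their host labels, so container_t cannot touch them until they are relabeled. Docker and Podman handle this with the :z / :Z volume options (the host path below is a placeholder):
# :Z applies a private label (MCS categories unique to this container)
docker run -v /data/web:/var/www/html:Z nginx
# :z applies a shared label that multiple containers can access
docker run -v /data/web:/var/www/html:z nginx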
Running with LSMs
# Use custom AppArmor profile
docker run --security-opt apparmor=my-custom-profile myimage
# Disable AppArmor (dangerous)
docker run --security-opt apparmor=unconfined myimage
# Use custom SELinux label
docker run --security-opt label=type:my_custom_t myimage
# Disable SELinux for container
docker run --security-opt label=disable myimage
LSMs provide the last line of defense—even if a container exploits a capability vulnerability or namespace escape, LSM policies can prevent accessing sensitive host resources.
Understanding how containers can be escaped helps defend against attacks. Container escapes typically exploit misconfigurations or kernel vulnerabilities.
Privileged Containers
--privileged disables most security mechanisms:
# Escape from privileged container
docker run --privileged -it alpine
# Inside container:
mount /dev/sda1 /mnt # Mount host disk
chroot /mnt # Access host filesystem as root
Never use --privileged in production. If you need specific privileges, grant only those capabilities.
Dangerous Volume Mounts
Mounting sensitive host paths enables escape:
# Docker socket mount (container can control Docker → escape)
docker run -v /var/run/docker.sock:/var/run/docker.sock myimage
# Inside: docker run --privileged ... → full host access
# Host filesystem mount
docker run -v /:/host myimage
# Inside: access /host/etc/shadow, /host/root/.ssh, etc.
# /proc mount
docker run -v /proc:/host-proc myimage
# Inside: access host /proc (kernel memory, process info)
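A quick way to audit a running container for risky mounts (the container name is a placeholder):
docker inspect -f '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' mycontainer
# Flag anything that mounts /, /var/run/docker.sock, /proc, /sys, or /etc from the host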
Kernel Vulnerabilities
Since containers share the host kernel, a single kernel vulnerability can be exploited from inside any container to attack the host. Dirty COW (CVE-2016-5195) and Dirty Pipe (CVE-2022-0847) are well-known examples that were exploitable from within containers.
Mitigation: keep the host kernel patched, shrink the reachable attack surface with seccomp and dropped capabilities, and move untrusted workloads onto stronger isolation (gVisor, Kata Containers) so that a kernel bug is harder to reach in the first place.
Detecting Container Escapes
Monitor for suspicious activity:
# Watch for clone() calls that create new PID namespaces (CLONE_NEWPID = 0x20000000)
auditctl -a exit,always -F arch=b64 -S clone -F 'a0&0x20000000'
# Monitor container capability usage
auditctl -a exit,always -F arch=b64 -S capset
# Detect namespace changes
auditctl -a exit,always -F arch=b64 -S unshare
auditctl -a exit,always -F arch=b64 -S setns
Tools like Falco, Sysdig, or Tracee can detect anomalous container behavior in real-time.
Defense-in-depth means assuming containers WILL be compromised and limiting blast radius. Use network policies, run as non-root, minimize capabilities, apply strict seccomp/LSM profiles, and segment sensitive workloads to separate nodes or stronger isolation mechanisms.
For high-security environments where kernel-sharing is unacceptable, enhanced isolation mechanisms provide stronger boundaries.
gVisor: User-Space Kernel
gVisor provides a user-space kernel (the Sentry) that intercepts container syscalls and implements them itself, passing only a narrow set of calls through to the host kernel:
Traditional Container:
Container Process → syscall → Host Kernel
gVisor Container:
Container Process → syscall → gVisor Sentry → limited host syscalls
How gVisor Works: the Sentry, written in Go, reimplements the Linux syscall interface in user space; syscalls are intercepted via the ptrace or KVM platform, and a separate Gofer process mediates filesystem access. The host kernel only ever sees the small, fixed set of syscalls the Sentry itself makes.
Trade-offs: a drastically reduced host-kernel attack surface, at the cost of per-syscall overhead (roughly 5-50% depending on workload) and incomplete compatibility (roughly 80% of applications run unmodified).
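Trying it out is straightforward once runsc is installed and registered with Docker (the runtime name "runsc" is the conventional one):
docker run --rm --runtime=runsc alpine dmesg
# Prints gVisor's own user-space kernel boot messages instead of the host's
docker run --rm --runtime=runsc alpine uname -r
# Reports the Sentry's emulated kernel version, not the host kernel's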
Kata Containers: MicroVMs
Kata Containers run each container inside a lightweight virtual machine, providing hardware-level isolation:
Kata Container:
Container Process → Guest Kernel → VM/Hypervisor → Host Kernel
How Kata Works: each pod (or container) gets its own lightweight VM with a dedicated guest kernel, booted by a minimal hypervisor (QEMU, Cloud Hypervisor, or Firecracker); the Kata runtime plugs into containerd or CRI-O through the standard OCI interface, so images and workflows are unchanged.
Trade-offs: hardware-level isolation and excellent compatibility (a real Linux guest kernel), at the cost of slower startup (~200-500ms), per-VM memory overhead (roughly 30-50MB), and a requirement for virtualization support on the host.
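A similar smoke test, assuming the Kata runtime is installed and registered with Docker or containerd under the name kata-runtime:
docker run --rm --runtime=kata-runtime alpine uname -r
# Reports the guest VM's kernel version, which differs from the host's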
Firecracker: AWS's MicroVM
Firecracker is a purpose-built VMM (Virtual Machine Monitor) developed at AWS for serverless workloads: it strips the device model down to the minimum needed to boot Linux, starts a microVM in roughly 125ms with under 5MB of overhead, and underpins AWS Lambda and Fargate. It can also act as the hypervisor behind Kata Containers. The table below compares the isolation options:
| Mechanism | Isolation Level | Startup | Overhead (CPU or memory) | Compatibility | Use Case |
|---|---|---|---|---|---|
| Standard Container | Process (namespace/cgroup) | ~100ms | ~1-5% | Excellent | General workloads |
| gVisor (runsc) | User-space kernel | ~150ms | 5-50% | Good (~80%) | Untrusted code, CI/CD |
| Kata Containers | VM (hypervisor) | ~200-500ms | 30-50MB/VM | Excellent | Multi-tenant, security-critical |
| Firecracker | MicroVM (minimal) | ~125ms | <5MB/VM | Excellent | Serverless, Lambda-style |
Choosing an Isolation Mechanism
Kubernetes supports multiple runtimes simultaneously via RuntimeClass:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-pod  # example name
spec:
runtimeClassName: gvisor # Use gVisor for this pod
containers:
- name: untrusted-workload
image: user-code:latest
Applying the defense-in-depth principle, here are actionable security practices for container deployments:
- Pin images by digest: myimage@sha256:abc... not myimage:latest
- Run as a non-root user: runAsNonRoot: true, runAsUser: 1000
- Drop all capabilities and add back only what's needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE
- Make the root filesystem read-only: readOnlyRootFilesystem: true
- Disallow privilege escalation: allowPrivilegeEscalation: false
- Never run with the seccomp or AppArmor profile set to unconfined
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
    runAsGroup: 10000
    fsGroup: 10000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    resources:
      limits:
        memory: "256Mi"
        cpu: "500m"
      requests:
        memory: "128Mi"
        cpu: "250m"
  volumes:
  - name: tmp
    emptyDir: {} # Writable /tmp without host access

We've explored container isolation from multiple angles—the layered security stack, individual mechanisms, attack vectors, and enhanced isolation options.
Module Conclusion: Namespaces and cgroups
This module has covered the fundamental Linux kernel primitives that enable containerization: namespaces, which control what a process can see; cgroups, which control what it can consume; and the security layers (capabilities, seccomp, LSMs) built on top of them.
Together, namespaces and cgroups form the foundation upon which all container technologies are built. Whether you're debugging container networking issues, tuning resource limits for production workloads, or securing multi-tenant infrastructure, this knowledge is essential.
Containers are not magic—they're a clever application of kernel primitives. Understanding those primitives empowers you to use containers effectively, debug them confidently, and secure them rigorously.
Congratulations! You've completed the Namespaces and cgroups module. You now understand the kernel primitives that enable containerization, how to configure resource limits, and how to secure containers through defense in depth. This knowledge forms the foundation for working with any container technology.