Throughout this module, we've explored the fundamental building blocks of sandboxing: the sandbox concept, process isolation, system call filtering, and seccomp. Now we bring these mechanisms together to examine container isolation—perhaps the most impactful application of sandboxing technology in modern computing.
Containers have revolutionized software deployment. They package applications with their dependencies, run isolated from other containers and the host, and enable efficient resource utilization. But containers are not virtual machines—they share a kernel with the host. This kernel sharing is both their strength (efficiency, performance) and their challenge (a shared kernel is a shared attack surface).
Understanding container isolation is essential for anyone working with modern infrastructure. Containers are everywhere: microservices, CI/CD pipelines, cloud functions, edge computing. Their security properties directly impact the security of countless production systems.
By the end of this page, you will understand how containers achieve isolation by combining Linux kernel primitives, the security architecture of container runtimes, defense in depth strategies for container environments, the differences between container security profiles, and emerging technologies for stronger container isolation.
A container is an isolated execution environment that provides a consistent runtime for applications. Unlike virtual machines, which emulate hardware and run a complete guest kernel, containers share the host kernel while maintaining isolation at the resource and namespace level.
Container vs. VM Isolation:
| Aspect | Virtual Machine | Container |
|---|---|---|
| Kernel | Separate guest kernel | Shared host kernel |
| Isolation boundary | Hypervisor (hardware-based) | Kernel namespaces/cgroups |
| Overhead | Full OS memory, boot time | Minimal (MB, milliseconds) |
| Performance | Near-native (with VT-x) | Native |
| Security level | Strong (hardware-enforced) | Medium (kernel-enforced) |
| Density | Tens per host | Hundreds per host |
| Attack surface | Hypervisor escape | Kernel vulnerability |
The Container Abstraction Stack:
┌─────────────────────────────────────────────────────┐
│                  User Application                   │
├─────────────────────────────────────────────────────┤
│                  Container Runtime                  │
│         (containerd, CRI-O, Docker Engine)          │
├─────────────────────────────────────────────────────┤
│                     OCI Runtime                     │
│     (runc, crun, gVisor runsc, Kata Containers)     │
├─────────────────────────────────────────────────────┤
│                    Linux Kernel                     │
│ Namespaces | cgroups | seccomp | Capabilities | LSM │
├─────────────────────────────────────────────────────┤
│                      Hardware                       │
└─────────────────────────────────────────────────────┘
Container runtimes manage container lifecycle (create, start, stop, delete). OCI runtimes actually execute containers according to the Open Container Initiative specification. The OCI runtime interfaces directly with kernel isolation primitives.
There is no 'container' syscall or kernel object. Containers are a userspace abstraction built by combining multiple kernel features: namespaces for resource isolation, cgroups for resource limits, seccomp for syscall filtering, capabilities for privilege management, and LSMs for mandatory access control.
Containers typically use all available namespace types to create comprehensive isolation. Each namespace isolates a different aspect of the system view:
Standard Container Namespaces:
| Namespace | What's Isolated | Container Benefit |
|---|---|---|
| Mount (mnt) | File system mounts | Own root filesystem, no host visibility |
| PID | Process IDs | Own PID 1, can't see host processes |
| Network (net) | Network stack | Own interfaces, ports, routes |
| IPC | IPC mechanisms | Own shared memory, semaphores |
| UTS | Hostname and domain | Own hostname |
| User | UID/GID mappings | Unprivileged containers, root remapping |
| Cgroup | Cgroup root view | Own cgroup hierarchy view |
| Time | System clocks | Own time view (Linux 5.6+) |
Creating a Container's Namespace Set:
When a container starts, the runtime creates a new set of namespaces:
// Simplified container namespace creation
#define _GNU_SOURCE
#include <sched.h>   // clone(), CLONE_* flags
#include <signal.h>  // SIGCHLD

int container_flags = CLONE_NEWNS      // Mount namespace
                    | CLONE_NEWPID     // PID namespace
                    | CLONE_NEWNET     // Network namespace
                    | CLONE_NEWUTS     // UTS namespace
                    | CLONE_NEWIPC     // IPC namespace
                    | CLONE_NEWUSER    // User namespace (if rootless)
                    | CLONE_NEWCGROUP; // Cgroup namespace

// SIGCHLD in the flags lets the runtime wait() on the container's init
int pid = clone(container_init, stack_top, container_flags | SIGCHLD, arg);
Network Namespace Configuration:
Containers need network connectivity despite having isolated network stacks. Common approaches:
# Create virtual ethernet pair
ip link add veth-host type veth peer name veth-container
# Move one end to container's network namespace
ip link set veth-container netns container_pid
# Configure host side (connect to bridge)
ip link set veth-host up
ip link set veth-host master docker0
# Configure container side
nsenter -t $container_pid -n ip addr add 172.17.0.2/16 dev veth-container
nsenter -t $container_pid -n ip link set veth-container up
nsenter -t $container_pid -n ip route add default via 172.17.0.1
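The nsenter invocations above work by joining the container's network namespace. For illustration, here is a minimal C sketch of the same mechanism (a hypothetical helper, not taken from any runtime): open the namespace file under /proc and call setns(2).
// Hypothetical sketch of what `nsenter -t <pid> -n` does: join another
// process's network namespace via setns(2).
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int enter_net_ns(pid_t target)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/ns/net", (int)target);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = setns(fd, CLONE_NEWNET); // from here on, we see the container's interfaces
    close(fd);
    return rc;
}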
User Namespace and Rootless Containers:
User namespaces enable rootless containers—containers that run without any host privileges. The container's root is mapped to an unprivileged host user:
# Container root (UID 0) is mapped to host UID 100000
$ cat /proc/$container_pid/uid_map
0 100000 65536
# Inside container: I'm root!
root@container# id
uid=0(root) gid=0(root) groups=0(root)
# But on host: just UID 100000
$ ls -l /proc/$container_pid/exe
-rwxr-xr-x 1 100000 100000 ... /usr/bin/myapp
Rootless containers provide an additional security layer: even if an attacker escapes the container, they end up as an unprivileged host user.
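To illustrate how a runtime establishes such a mapping, here is a minimal sketch: after creating the child with CLONE_NEWUSER, the parent writes the UID and GID maps through /proc. The helper names and the 100000 base are assumptions for illustration; writing a 65536-wide range like this requires appropriate privilege or the setuid newuidmap/newgidmap helpers.
// Hypothetical sketch: map container root (UID/GID 0) to host 100000.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void write_map(pid_t pid, const char *file, const char *mapping)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        write(fd, mapping, strlen(mapping));
        close(fd);
    }
}

// Called by the parent after clone(... | CLONE_NEWUSER ...):
void map_container_root(pid_t child)
{
    // "deny" must be written to setgroups before an unprivileged
    // process is allowed to set gid_map.
    write_map(child, "setgroups", "deny");
    write_map(child, "uid_map", "0 100000 65536\n");
    write_map(child, "gid_map", "0 100000 65536\n");
}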
Always use user namespaces (rootless mode) when possible. Container escapes are less damaging when the escaped process has no host privileges. The tradeoff is slightly reduced compatibility (some operations need real root), but security benefits usually outweigh this.
While namespaces isolate visibility of resources, cgroups (control groups) isolate consumption of resources. Every container runs in its own cgroup with configured limits.
Container cgroup Configuration:
# Container's cgroup (cgroup v2)
/sys/fs/cgroup/system.slice/docker-abc123def.scope/
# Memory limit: 512MB
echo "536870912" > memory.max
# CPU limit: 50% of one core
echo "50000 100000" > cpu.max
# Max 100 processes
echo "100" > pids.max
# I/O rate limits
echo "8:0 wbps=10485760" > io.max # 10 MB/s write to device 8:0
Resource Protection Properties:
| Controller | What's Limited | Protection Provided |
|---|---|---|
| memory | RAM, swap usage | Prevents container from exhausting host memory |
| cpu | CPU time | Ensures fair CPU sharing between containers |
| pids | Process count | Prevents fork bombs affecting host |
| io | Block I/O bandwidth | Prevents I/O starvation |
| cpuset | CPU affinity | Pin containers to specific CPUs |
| memory.oom.group | OOM grouping | Kill entire container on OOM, not just one process |
Memory Limit Behavior:
When a container exceeds its memory limit, the kernel's OOM (Out of Memory) killer is invoked. cgroup v2 provides more predictable behavior:
# Set memory limit
echo "536870912" > /sys/fs/cgroup/container_cgroup/memory.max
# When limit exceeded:
# - If memory.oom.group = 1: all processes in cgroup killed
# - If memory.oom.group = 0: kernel picks individual processes
# For containers, prefer killing all (consistent state)
echo "1" > memory.oom.group
CPU Throttling:
# Format: "quota period" in microseconds
# This gives 50% of one CPU (50ms every 100ms)
echo "50000 100000" > cpu.max
# For 2 full CPUs worth:
echo "200000 100000" > cpu.max
# For a soft limit (proportional share under contention, no hard cap)
echo "150" > cpu.weight # Default is 100; higher weight = larger share
Containers without resource limits can consume unlimited host resources. A single misbehaving container can take all available memory, triggering the host OOM killer. Always configure memory.max, pids.max, and cpu.max for production containers.
Container runtimes apply seccomp profiles to restrict syscalls available to containerized processes. Docker, Kubernetes, and other platforms support configurable seccomp policies.
Docker Default seccomp Profile:
Docker's default profile blocks approximately 44 syscalls (as of Docker 20.10):
{
"defaultAction": "SCMP_ACT_ERRNO",
"defaultErrnoRet": 1,
"archMap": [
{
"architecture": "SCMP_ARCH_X86_64",
"subArchitectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"]
}
],
"syscalls": [
{
"names": [
"accept", "accept4", "access", "adjtimex", "alarm",
"bind", "brk", "capget", "capset", "chdir", ...
],
"action": "SCMP_ACT_ALLOW"
}
]
}
Blocked by Default:
acct, add_key, bpf, clock_adjtime, clock_settime, clone (with CLONE_NEWUSER),
create_module, delete_module, finit_module, get_kernel_syms, get_mempolicy,
init_module, ioperm, iopl, kcmp, kexec_file_load, kexec_load, keyctl,
lookup_dcookie, mbind, mount, move_pages, name_to_handle_at, nfsservctl,
open_by_handle_at, perf_event_open, personality, pivot_root, process_vm_readv,
process_vm_writev, ptrace, query_module, quotactl, reboot, request_key,
set_mempolicy, setns, settimeofday, stime, swapon, swapoff, sysfs, _sysctl,
umount, umount2, unshare, uselib, userfaultfd, ustat, vm86, vm86old
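The same allow-list pattern can be reproduced with libseccomp. The sketch below is deliberately tiny (a handful of syscalls rather than Docker's several hundred), and the allow-list is an assumption chosen for illustration; link with -lseccomp.
// Sketch of a container-style allow-list filter with libseccomp.
// Blocked syscalls fail with EPERM, like Docker's SCMP_ACT_ERRNO default.
#include <errno.h>
#include <seccomp.h>
#include <stddef.h>

int install_container_filter(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));
    if (!ctx)
        return -1;

    // Illustrative allow-list; a real profile allows far more.
    int allowed[] = {
        SCMP_SYS(read),  SCMP_SYS(write), SCMP_SYS(brk),
        SCMP_SYS(mmap),  SCMP_SYS(futex), SCMP_SYS(exit_group),
    };
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (seccomp_rule_add(ctx, SCMP_ACT_ALLOW, allowed[i], 0) < 0) {
            seccomp_release(ctx);
            return -1;
        }
    }

    int rc = seccomp_load(ctx); // filter applies to this process and its children
    seccomp_release(ctx);
    return rc;
}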
Kubernetes seccomp Profiles:
Kubernetes supports seccomp through security contexts:
apiVersion: v1
kind: Pod
metadata:
name: secure-pod
spec:
securityContext:
seccompProfile:
type: RuntimeDefault # Use container runtime's default
containers:
- name: app
image: myapp:latest
Profile Types:
| Type | Description |
|---|---|
| RuntimeDefault | Use the container runtime's default profile |
| Unconfined | No seccomp filtering (dangerous!) |
| Localhost | Use a custom profile from the node's filesystem |
Custom Profile Example:
apiVersion: v1
kind: Pod
metadata:
name: custom-seccomp-pod
spec:
securityContext:
seccompProfile:
type: Localhost
localhostProfile: profiles/strict.json
containers:
- name: app
image: myapp:latest
// /var/lib/kubelet/seccomp/profiles/strict.json
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "exit_group", "futex", "mmap"],
"action": "SCMP_ACT_ALLOW"
}
]
}
Tools like 'strace' and Sysdig's 'oci-seccomp-bpf-hook' can trace an application and generate a minimal seccomp profile based on actual syscall usage. Run your application under tracing, then use the generated profile in production.
Linux capabilities provide fine-grained privilege control. Container runtimes drop most capabilities by default, granting only what's needed for typical containerized applications.
Docker Default Capabilities:
Granted by default:
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_MKNOD,
CAP_NET_RAW, CAP_SETGID, CAP_SETUID, CAP_SETFCAP, CAP_SETPCAP,
CAP_NET_BIND_SERVICE, CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE
Notably NOT granted:
CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_SYS_MODULE,
CAP_SYS_TIME, CAP_SYS_RESOURCE, CAP_SYS_BOOT, CAP_MAC_ADMIN, ...
Modifying Capabilities:
# Drop all capabilities
docker run --cap-drop=ALL myimage
# Drop all, add specific ones back
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Add a capability (dangerous!)
docker run --cap-add=SYS_PTRACE myimage
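Application code can enforce the same discipline on itself with libcap. Below is a minimal sketch (a hypothetical helper, link with -lcap) that keeps only CAP_NET_BIND_SERVICE in the permitted and effective sets; note it does not shrink the bounding set, which a real runtime would also do.
// Hypothetical sketch: drop to a minimal capability set with libcap,
// mirroring "--cap-drop=ALL --cap-add=NET_BIND_SERVICE".
#include <sys/capability.h>

int drop_to_net_bind(void)
{
    cap_value_t keep[] = { CAP_NET_BIND_SERVICE };
    cap_t caps = cap_init(); // all three flag sets start empty
    if (!caps)
        return -1;

    // Raise only the capability we actually need.
    cap_set_flag(caps, CAP_PERMITTED, 1, keep, CAP_SET);
    cap_set_flag(caps, CAP_EFFECTIVE, 1, keep, CAP_SET);

    int rc = cap_set_proc(caps); // reducing our own caps needs no privilege
    cap_free(caps);
    return rc;
}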
Kubernetes Capability Configuration:
apiVersion: v1
kind: Pod
spec:
containers:
- name: app
image: myapp:latest
securityContext:
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
Dangerous Capabilities:
| Capability | Risk | Escape Method |
|---|---|---|
| CAP_SYS_ADMIN | Very High | Mount filesystems, access /proc, many kernel features |
| CAP_SYS_PTRACE | High | Ptrace host processes if PID namespace shared |
| CAP_NET_ADMIN | High | Reconfigure network, potential escape via network |
| CAP_SYS_RAWIO | High | Direct hardware/memory access |
| CAP_SYS_MODULE | Critical | Load kernel modules (instant escape) |
| CAP_DAC_READ_SEARCH | Medium | Read any file (bypass permissions) |
Never use 'docker run --privileged' in production. This grants ALL capabilities, disables seccomp, and provides full device access. A privileged container can trivially escape to the host. If you think you need --privileged, you almost always need to redesign your approach.
Mandatory Access Control (MAC) systems provide an additional security layer by enforcing policies that even root cannot bypass. Docker and Kubernetes support both AppArmor (Ubuntu/Debian default) and SELinux (RHEL/CentOS default).
AppArmor for Containers:
Docker applies a default AppArmor profile (docker-default) that, among other restrictions, denies mount operations and writes to sensitive /proc and /sys paths:
# View Docker's default profile
cat /etc/apparmor.d/docker-default
# Use custom profile
docker run --security-opt apparmor=my-custom-profile myimage
# Disable AppArmor (not recommended)
docker run --security-opt apparmor=unconfined myimage
Custom AppArmor Profile:
#include <tunables/global>
profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
network inet tcp,
network inet udp,
/usr/sbin/nginx mr,
/var/log/nginx/* w,
/etc/nginx/** r,
/var/www/** r,
# Deny dangerous operations
deny mount,
deny ptrace,
deny @{PROC}/* w,
}
SELinux for Containers:
SELinux provides type enforcement, labeling all processes and files with security contexts:
# View container's SELinux context
ps -eZ | grep container
# system_u:system_r:container_t:s0:c123,c456 ... my-container
# Custom SELinux label
docker run --security-opt label=type:custom_container_t myimage
# Disable SELinux confinement (dangerous)
docker run --security-opt label=disable myimage
SELinux Container Policy:
The container-selinux policy runs container processes as container_t and labels container content container_file_t. Each container also receives a unique pair of MCS categories (the c123,c456 above), so even two containers sharing the container_t type cannot access each other's files.
Kubernetes SELinux Options:
apiVersion: v1
kind: Pod
spec:
securityContext:
seLinuxOptions:
level: "s0:c123,c456" # MCS categories for isolation
containers:
- name: app
securityContext:
seLinuxOptions:
type: container_t
AppArmor/SELinux provides another layer that must be bypassed for full escape. Even if an attacker escapes namespace isolation, MAC policies still restrict what they can do on the host. This defense-in-depth approach makes exploitation significantly harder.
Despite all isolation mechanisms, container escapes remain possible. Understanding escape vectors helps in designing secure container deployments.
Common Escape Vectors:
| Vector | Description | Mitigation |
|---|---|---|
| Kernel vulnerabilities | Exploit bugs in syscall handlers | Seccomp filtering, kernel updates |
| Privileged containers | --privileged or CAP_SYS_ADMIN | Never use privileged mode |
| Host mount exposure | Sensitive host paths mounted | Minimize mounts, read-only when possible |
| Docker socket mount | /var/run/docker.sock exposed | Never mount Docker socket |
| Shared namespaces | Host PID/network namespace shared | Use dedicated namespaces |
| Writable /proc or /sys | Kernel interfaces exposed | Mount read-only; mask sensitive paths |
| Exposed device files | /dev access to host devices | Device whitelist, no raw access |
Case Study: CVE-2019-5736 (runc vulnerability)
A container escape vulnerability in runc allowed a malicious container to overwrite the host runc binary. When an administrator exec'd into a compromised container (for example, docker exec), the container process could obtain a handle to the runc binary via /proc/self/exe and overwrite it; the next container operation then executed attacker-controlled code as root on the host.
Mitigations:
- Update runc: the fix re-executes runc from a sealed in-memory copy, so the host binary can no longer be targeted
- Run rootless containers: with user namespaces, container root lacks the host privileges needed to overwrite a root-owned binary
- Keep SELinux/AppArmor enforcing; on some distributions the default SELinux policy blocked this attack outright
Case Study: Privileged Container Escape
# If container is privileged, escape is trivial:
docker run --privileged -it ubuntu bash
# Inside the privileged container: full device access means the
# host's root disk can simply be mounted and chrooted into
fdisk -l                 # identify the host disk, e.g. /dev/sda1
mkdir /host
mount /dev/sda1 /host
chroot /host
# Now have root shell on host!
# Or load a kernel module (code then runs in the host kernel):
insmod /host/path/to/malicious.ko
# (Abusing the cgroup v1 release_agent/notify_on_release mechanism
# is another classic privileged-container escape.)
For workloads requiring stronger isolation than standard containers provide, several technologies offer enhanced security at the cost of some overhead or compatibility.
gVisor (Google):
gVisor implements a user-space kernel that intercepts container syscalls. The container talks to gVisor, which emulates kernel behavior without actually invoking the host kernel for most operations.
┌────────────────────────┐
│      Application       │
├────────────────────────┤
│     gVisor Sentry      │ ← User-space "kernel"
│  (syscall emulation)   │
├────────────────────────┤
│      gVisor Gofer      │ ← File system broker
├────────────────────────┤
│      Host Kernel       │ ← Sees only a small syscall subset
└────────────────────────┘
Security Benefits:
- Host kernel attack surface shrinks dramatically: application syscalls terminate in the Sentry, which seccomp confines to a small set of host syscalls
- The Sentry is written in Go, eliminating most memory-corruption bug classes
Trade-offs:
- Syscall- and I/O-intensive workloads pay a measurable performance penalty
- Syscall coverage is incomplete, so some applications will not run unmodified
Kata Containers:
Kata Containers runs each container in a lightweight virtual machine, providing hardware-level isolation:
┌────────────────┐  ┌────────────────┐
│  Container A   │  │  Container B   │
├────────────────┤  ├────────────────┤
│  Guest Kernel  │  │  Guest Kernel  │
├────────────────┤  ├────────────────┤
│    QEMU/FC     │  │    QEMU/FC     │ ← Micro-VM
├────────────────┴──┴────────────────┤
│            Host Kernel             │
└────────────────────────────────────┘
Security Benefits:
- Hardware-enforced isolation: an escape requires a hypervisor bug, not merely a kernel bug
- Each container gets its own guest kernel, so host kernel vulnerabilities are far harder to reach
Trade-offs:
- Higher memory footprint and slower startup than namespace-based containers
- Some host integrations (device passthrough, certain volume types) are more complex
Firecracker (AWS):
Firecracker is a micro-VM engine used by AWS Lambda and AWS Fargate. Built on KVM and written in Rust, it pares the device model down to a few virtio devices, keeping per-VM overhead low enough to pack thousands of micro-VMs onto one host with boot times in the low hundreds of milliseconds.
Isolation Technology Comparison:
| Technology | Isolation | Overhead | Compatibility | Use Case |
|---|---|---|---|---|
| Standard containers | Medium (shared kernel) | Lowest | Highest | Most workloads |
| gVisor | High (user-space kernel) | Medium | Good | Untrusted code, multi-tenant |
| Kata Containers | Very High (VM) | Higher | High | Strong isolation required |
| Firecracker | Very High (micro-VM) | Low-Medium | Good | Serverless, FaaS |
Standard containers are sufficient for many workloads where containers run trusted code. For multi-tenant environments running untrusted code, gVisor or Kata provide stronger isolation. Firecracker excels for serverless platforms needing both security and density.
We have explored how containers combine kernel isolation primitives to create practical sandboxing at scale. Let's consolidate the key insights:
- Containers are a userspace abstraction: namespaces isolate visibility, cgroups isolate consumption, seccomp filters syscalls, capabilities bound privileges, and LSMs add mandatory access control
- The shared host kernel is the fundamental trust boundary; every hardening layer exists to protect it
- Rootless containers, minimal capabilities, seccomp profiles, and resource limits together form defense in depth
- Escapes come from kernel bugs, excess privilege (--privileged, CAP_SYS_ADMIN), and careless mounts, so configuration discipline matters as much as the mechanisms themselves
- When a shared kernel is unacceptable, gVisor, Kata Containers, and Firecracker trade overhead for stronger boundaries
Module Complete:
You have now completed the Sandboxing module. You understand:
- The sandbox concept and why isolation matters
- Process isolation through namespaces and resource limits
- System call filtering and seccomp
- How container isolation composes these primitives into practical, deployable sandboxes
This knowledge is essential for building, deploying, and securing modern applications. Sandboxing is not optional in today's threat environment: it is a fundamental requirement for any system handling untrusted input or running untrusted code. You can now design, implement, and evaluate sandboxing solutions for a range of security requirements.