Containers share a kernel. This fundamental architectural choice enables their efficiency—fast startup, low overhead, high density—but raises critical security questions: How isolated are containers from each other? How isolated are they from the host? What happens if a container is compromised?
The answer is nuanced. Container isolation is built from multiple overlapping mechanisms—namespaces, cgroups, capabilities, seccomp, and mandatory access controls—each addressing different attack vectors. Understanding how these layers work together, and where gaps remain, is essential for deploying containers securely.
By the end of this page, you will understand the complete container isolation stack, how Linux capabilities partition root privileges, how seccomp filters system calls, how AppArmor and SELinux provide mandatory access control, common container escape vectors, and enhanced isolation mechanisms like gVisor and Kata Containers for high-security environments.
Container isolation is not a single mechanism but a layered defense combining multiple kernel features. Each layer addresses different isolation concerns:
Layer 1: Namespaces — What Can Be Seen
Namespaces partition global resources (process IDs, mount points, network interfaces, hostnames, IPC objects, user IDs) so each container sees only its own view of the system.
Layer 2: cgroups — What Can Be Consumed
Cgroups limit resource consumption (CPU time, memory, block I/O, process counts) so one container cannot starve the others.
Layer 3: Capabilities — What Can Be Done
Capabilities divide traditional root power into granular permissions.
Layer 4: Seccomp — What System Calls Can Be Made
Seccomp filters restrict which kernel interfaces a container can access.
Layer 5: LSMs (AppArmor/SELinux) — What Files and Operations Are Allowed
Mandatory access controls provide fine-grained object-level restrictions.
Layer 6: Read-Only / Immutable Configurations
Read-only root filesystems, no-new-privileges flags, and similar hardening options limit what even a compromised process can change.
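For example, Docker exposes this layer through run-time flags (a minimal sketch; the image name is a placeholder):
# Read-only root filesystem, a writable tmpfs for /tmp, and no privilege escalation
docker run --read-only --tmpfs /tmp --security-opt no-new-privileges myimage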
Each layer acts as a filter: an operation must pass through every layer to succeed.
This defense-in-depth means a vulnerability in one layer may be blocked by another. Breaking out of a container typically requires bypassing multiple security mechanisms.
Container escapes usually chain multiple weaknesses: a capability that shouldn't have been granted, combined with a missing seccomp rule and an exposed namespace gap. Robust security requires configuring all layers correctly, not relying on any single mechanism.
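A quick way to see this layering in practice (assuming a stock Docker install with its default capability, seccomp, and AppArmor settings):
# Try to mount a tmpfs inside an unprivileged container
docker run --rm alpine sh -c 'mount -t tmpfs none /mnt'
# Expected to fail, e.g.: mount: permission denied (are you root?)
# The attempt is denied independently by the dropped CAP_SYS_ADMIN, the default
# seccomp profile, and the default AppArmor profile.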
Traditional UNIX has a binary privilege model: you're either root (uid 0, can do everything) or not root (restricted). This is problematic—many programs need only one specific privilege (e.g., binding to port 80) but running as root grants all privileges.
Capabilities split root's omnipotent power into ~40 distinct capabilities. A process can have specific capabilities without having full root access.
Key Capabilities
| Capability | Allows | Container Default | Security Impact |
|---|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Usually granted | Low (useful for web servers) |
| CAP_NET_ADMIN | Network configuration | Usually dropped | Medium (can reconfigure network) |
| CAP_SYS_ADMIN | Broad admin powers (mount, etc.) | Usually dropped | Critical (near-root, escape risk) |
| CAP_SYS_PTRACE | Trace/debug processes | Usually dropped | High (process injection risk) |
| CAP_CHOWN | Change file ownership | Usually granted | Medium (can change container files) |
| CAP_SETUID/SETGID | Change UID/GID | Usually granted | Medium (privilege escalation) |
| CAP_DAC_OVERRIDE | Bypass file permissions | Usually granted | High (bypasses file permission checks) |
| CAP_MKNOD | Create device files | Usually granted | High (device access, constrained by the device cgroup) |
Default Docker Capabilities
Docker grants a default set of capabilities—enough for most applications to work without granting dangerous powers:
# Default capabilities granted (Docker/containerd)
CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER,
CAP_MKNOD, CAP_NET_RAW, CAP_SETGID, CAP_SETUID,
CAP_SETFCAP, CAP_SETPCAP, CAP_NET_BIND_SERVICE,
CAP_SYS_CHROOT, CAP_KILL, CAP_AUDIT_WRITE
# Notably NOT granted by default:
CAP_SYS_ADMIN, CAP_NET_ADMIN, CAP_SYS_PTRACE,
CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_SYS_TIME
Manipulating Capabilities
# Drop all capabilities (most restrictive)
docker run --cap-drop=ALL myimage
# Drop all, then add back only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myimage
# Add a specific dangerous capability (avoid if possible)
docker run --cap-add=SYS_ADMIN myimage # Dangerous!
# Check capabilities of a process
cat /proc/self/status | grep Cap
# CapInh: 0000000000000000
# CapPrm: 00000000a80425fb
# CapEff: 00000000a80425fb
# CapBnd: 00000000a80425fb
# CapAmb: 0000000000000000
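The capsh utility from libcap can decode these hex masks into capability names; for example, the CapEff value above corresponds to Docker's default set:
capsh --decode=00000000a80425fb
# 0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
#   cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
#   cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap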
CAP_SYS_ADMIN is the 'garbage can' capability—it grants ~30+ different privileges including mounting filesystems, loading BPF programs, accessing /proc/kcore, and more. Granting CAP_SYS_ADMIN to a container essentially defeats namespace isolation. Never grant it unless absolutely required and other isolation layers (seccomp, LSM) are very strict.
User Namespaces and Capabilities
Capabilities are scoped to user namespaces. A process can have CAP_SYS_ADMIN in its own user namespace without having it in the initial (host) user namespace.
Container with user namespace mapping:
Container root (uid 0) → Host uid 100000
Capabilities:
- CAP_SYS_ADMIN in container's user namespace: YES
- CAP_SYS_ADMIN in host's user namespace: NO
Result:
- Can mount filesystems within container's mount namespace
- Cannot access host's /proc/kcore or load kernel modules
This is why rootless containers (with user namespaces) are more secure—the container's capabilities don't translate to host privileges.
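You can verify the mapping from inside a container. A minimal check, assuming the daemon is configured for user-namespace remapping (rootless Docker and Podman behave similarly; the 100000 base is just an example value):
docker run --rm alpine cat /proc/self/uid_map
#          0     100000      65536
# Columns: uid inside the namespace, uid on the host, size of the mapped range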
Seccomp (Secure Computing Mode) restricts which system calls a process can make. Since all kernel interactions go through syscalls, seccomp is a powerful attack surface reduction mechanism.
How Seccomp Works
Seccomp installs a BPF (Berkeley Packet Filter) program in the kernel that intercepts every syscall. For each call, the filter decides whether to allow it, deny it with an error code, trap it, log it, kill the offending thread or process, or hand it to a user-space supervisor (seccomp notify).
Default Container Seccomp Profile
Docker and containerd ship with a default profile blocking ~50 dangerous syscalls while allowing the ~300+ needed for typical applications:
{
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86"],
"syscalls": [
{
"names": ["accept", "bind", "clone", "read", "write", ...],
"action": "SCMP_ACT_ALLOW"
}
]
}
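One visible effect of the default profile (assuming a stock Docker install): unshare() is gated behind CAP_SYS_ADMIN, so creating a new user namespace fails inside the container even though most host kernels allow unprivileged processes to do so:
# Denied by the default seccomp profile
docker run --rm alpine unshare -U true
# Expected to fail with: Operation not permitted
# With seccomp disabled (debugging only!), the same call succeeds on most hosts
docker run --rm --security-opt seccomp=unconfined alpine unshare -U true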
Custom Seccomp Profiles
For high-security environments, create restrictive custom profiles:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "syscalls": [
    {
      "names": [
        "read", "write", "close", "fstat", "lseek", "mmap", "mprotect",
        "munmap", "brk", "rt_sigaction", "rt_sigprocmask", "rt_sigreturn",
        "ioctl", "pread64", "pwrite64", "access", "pipe", "select",
        "sched_yield", "dup", "dup2", "nanosleep", "getpid", "getuid",
        "socket", "connect", "accept", "sendto", "recvfrom", "exit",
        "exit_group", "futex", "epoll_wait", "clock_gettime", "openat",
        "newfstatat"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2114060288,
          "op": "SCMP_CMP_MASKED_EQ"
        }
      ],
      "comment": "Allow clone but only for threads, not new namespaces"
    }
  ]
}
Generating Seccomp Profiles
Profile generators such as the OCI seccomp BPF hook or the Kubernetes Security Profiles Operator, as well as plain syscall tracing, can build profiles automatically by recording application behavior:
# Record syscalls made during application runtime
strace -c -f ./my-application
# Use recorded syscalls to build allowlist
# Or use seccomp-notify for dynamic analysis
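A rough, hypothetical pipeline for turning a trace into an allowlist (file names are placeholders):
# Trace every process and thread, writing one file per PID
strace -ff -qq -o trace ./my-application
# Extract the unique syscall names that start each trace line
grep -hoE '^[a-z0-9_]+\(' trace.* | tr -d '(' | sort -u > syscall-allowlist.txt
# Paste the resulting names into the "names" array of a seccomp profile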
Running with Seccomp
# Use tight custom profile
docker run --security-opt seccomp=custom-seccomp.json myimage
# Disable seccomp (dangerous, for debugging only)
docker run --security-opt seccomp=unconfined myimage
# Check if seccomp is enabled
grep Seccomp /proc/self/status
# Seccomp: 2 (2 = filter mode enabled)
Seccomp filtering has minimal overhead—the BPF program runs in the kernel at near-native speed. The cost is typically <1% for syscall-heavy workloads. The security benefit far outweighs this minimal cost.
Linux Security Modules (LSMs) provide mandatory access control—security policies enforced by the kernel regardless of user identity or capabilities. The two most common LSMs for containers are AppArmor and SELinux.
AppArmor: Profile-Based Access Control
AppArmor confines programs using per-program profiles that specify what files, capabilities, and network access are permitted:
# Example AppArmor profile for nginx container
#include <tunables/global>
profile docker-nginx flags=(attach_disconnected,mediate_deleted) {
#include <abstractions/base>
#include <abstractions/nameservice>
# File access
/usr/sbin/nginx mr,
/etc/nginx/** r,
/var/log/nginx/** w,
/var/www/html/** r,
/run/nginx.pid w,
# Network access
network inet tcp,
network inet udp,
# Capabilities allowed
capability net_bind_service,
capability setgid,
capability setuid,
# Denials (explicit)
deny /etc/shadow r,
deny /proc/*/mem rw,
}
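To use a custom profile like this one, load it into the kernel on the host before starting the container (the file path below is an assumption):
# Load (or reload) the profile
sudo apparmor_parser -r /etc/apparmor.d/docker-nginx
# Run the container confined by it
docker run --security-opt apparmor=docker-nginx nginx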
Docker automatically applies a default AppArmor profile (docker-default) that denies mounting, blocks writes to /proc/sys, /sys, and other sensitive kernel interfaces, and restricts ptrace, while permitting normal application behavior.
SELinux: Label-Based Access Control
SELinux assigns labels (contexts) to subjects (processes) and objects (files, ports, etc.), then enforces policies based on label relationships:
# View SELinux context of a container process
ps -eZ | grep container
# system_u:system_r:container_t:s0:c123,c456 12345 ? nginx
# View SELinux context of files
ls -Z /var/www/html
# system_u:object_r:container_file_t:s0 index.html
# Container process (container_t) can access container files (container_file_t)
# but cannot access system files (admin_home_t, etc.)
SELinux Contexts for Containers:
- container_t: Standard container process type
- container_file_t: Container-accessible file type
- svirt_sandbox_file_t: Sandbox-mode file type

SELinux is more granular than AppArmor but more complex to configure. RHEL/CentOS systems use SELinux; Ubuntu/Debian typically use AppArmor.
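On SELinux hosts, the most common friction point is bind-mounted volumes: files keep their host labels, so container_t cannot touch them until they are relabeled. Docker and Podman handle this with the :z / :Z volume options (the host path below is a placeholder):
# :Z applies a private label (MCS categories unique to this container)
docker run -v /data/web:/var/www/html:Z nginx
# :z applies a shared label that multiple containers can access
docker run -v /data/web:/var/www/html:z nginx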
Running with LSMs
# Use custom AppArmor profile
docker run --security-opt apparmor=my-custom-profile myimage
# Disable AppArmor (dangerous)
docker run --security-opt apparmor=unconfined myimage
# Use custom SELinux label
docker run --security-opt label=type:my_custom_t myimage
# Disable SELinux for container
docker run --security-opt label=disable myimage
LSMs provide the last line of defense—even if a container exploits a capability vulnerability or namespace escape, LSM policies can prevent accessing sensitive host resources.
Understanding how containers can be escaped helps defend against attacks. Container escapes typically exploit misconfigurations or kernel vulnerabilities.
Privileged Containers
--privileged disables most security mechanisms:
# Escape from privileged container
docker run --privileged -it alpine
# Inside container:
mount /dev/sda1 /mnt # Mount host disk
chroot /mnt # Access host filesystem as root
Never use --privileged in production. If you need specific privileges, grant only those capabilities.
Dangerous Volume Mounts
Mounting sensitive host paths enables escape:
# Docker socket mount (container can control Docker → escape)
docker run -v /var/run/docker.sock:/var/run/docker.sock myimage
# Inside: docker run --privileged ... → full host access
# Host filesystem mount
docker run -v /:/host myimage
# Inside: access /host/etc/shadow, /host/root/.ssh, etc.
# /proc mount
docker run -v /proc:/host-proc myimage
# Inside: access host /proc (kernel memory, process info)
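A quick way to audit a running container for risky mounts (the container name is a placeholder):
docker inspect -f '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' mycontainer
# Flag anything that mounts /, /var/run/docker.sock, /proc, /sys, or /etc from the host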
Kernel Vulnerabilities
Since containers share the host kernel, a single kernel vulnerability can be exploited from inside any container to attack the host. Dirty COW (CVE-2016-5195) and Dirty Pipe (CVE-2022-0847) are well-known examples that were exploitable from within containers.
Mitigation: keep the host kernel patched, shrink the reachable attack surface with seccomp and dropped capabilities, and move untrusted workloads onto stronger isolation (gVisor, Kata Containers) so that a kernel bug is harder to reach in the first place.
Detecting Container Escapes
Monitor for suspicious activity:
# Watch for clone() calls that create new PID namespaces (CLONE_NEWPID = 0x20000000)
auditctl -a exit,always -F arch=b64 -S clone -F 'a0&0x20000000'
# Monitor container capability usage
auditctl -a exit,always -F arch=b64 -S capset
# Detect namespace changes
auditctl -a exit,always -F arch=b64 -S unshare
auditctl -a exit,always -F arch=b64 -S setns
Tools like Falco, Sysdig, or Tracee can detect anomalous container behavior in real-time.
Defense-in-depth means assuming containers WILL be compromised and limiting blast radius. Use network policies, run as non-root, minimize capabilities, apply strict seccomp/LSM profiles, and segment sensitive workloads to separate nodes or stronger isolation mechanisms.
For high-security environments where kernel-sharing is unacceptable, enhanced isolation mechanisms provide stronger boundaries.
gVisor: User-Space Kernel
gVisor provides a user-space kernel (the Sentry) that intercepts container syscalls and implements them itself, passing only a narrow set of calls through to the host kernel:
Traditional Container:
Container Process → syscall → Host Kernel
gVisor Container:
Container Process → syscall → gVisor Sentry → limited host syscalls
How gVisor Works: the Sentry, written in Go, reimplements the Linux syscall interface in user space; syscalls are intercepted via the ptrace or KVM platform, and a separate Gofer process mediates filesystem access. The host kernel only ever sees the small, fixed set of syscalls the Sentry itself makes.
Trade-offs: a drastically reduced host-kernel attack surface, at the cost of per-syscall overhead (roughly 5-50% depending on workload) and incomplete compatibility (roughly 80% of applications run unmodified).
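Trying it out is straightforward once runsc is installed and registered with Docker (the runtime name "runsc" is the conventional one):
docker run --rm --runtime=runsc alpine dmesg
# Prints gVisor's own user-space kernel boot messages instead of the host's
docker run --rm --runtime=runsc alpine uname -r
# Reports the Sentry's emulated kernel version, not the host kernel's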
Kata Containers: MicroVMs
Kata Containers run each container inside a lightweight virtual machine, providing hardware-level isolation:
Kata Container:
Container Process → Guest Kernel → VM/Hypervisor → Host Kernel
How Kata Works: each pod (or container) gets its own lightweight VM with a dedicated guest kernel, booted by a minimal hypervisor (QEMU, Cloud Hypervisor, or Firecracker); the Kata runtime plugs into containerd or CRI-O through the standard OCI interface, so images and workflows are unchanged.
Trade-offs: hardware-level isolation and excellent compatibility (a real Linux guest kernel), at the cost of slower startup (~200-500ms), per-VM memory overhead (roughly 30-50MB), and a requirement for virtualization support on the host.
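A similar smoke test, assuming the Kata runtime is installed and registered with Docker or containerd under the name kata-runtime:
docker run --rm --runtime=kata-runtime alpine uname -r
# Reports the guest VM's kernel version, which differs from the host's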
Firecracker: AWS's MicroVM
Firecracker is a purpose-built VMM (Virtual Machine Monitor) developed at AWS for serverless workloads: it strips the device model down to the minimum needed to boot Linux, starts a microVM in roughly 125ms with under 5MB of overhead, and underpins AWS Lambda and Fargate. It can also act as the hypervisor behind Kata Containers. The table below compares the isolation options:
| Mechanism | Isolation Level | Startup | Overhead (CPU or memory) | Compatibility | Use Case |
|---|---|---|---|---|---|
| Standard Container | Process (namespace/cgroup) | ~100ms | ~1-5% | Excellent | General workloads |
| gVisor (runsc) | User-space kernel | ~150ms | 5-50% | Good (~80%) | Untrusted code, CI/CD |
| Kata Containers | VM (hypervisor) | ~200-500ms | 30-50MB/VM | Excellent | Multi-tenant, security-critical |
| Firecracker | MicroVM (minimal) | ~125ms | <5MB/VM | Excellent | Serverless, Lambda-style |
Choosing an Isolation Mechanism
Kubernetes supports multiple runtimes simultaneously via RuntimeClass:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-pod  # example name
spec:
runtimeClassName: gvisor # Use gVisor for this pod
containers:
- name: untrusted-workload
image: user-code:latest
Applying the defense-in-depth principle, here are actionable security practices for container deployments:
- Pin images by digest: myimage@sha256:abc... not myimage:latest
- Run as a non-root user: runAsNonRoot: true, runAsUser: 1000
- Drop all capabilities and add back only what's needed: --cap-drop=ALL --cap-add=NET_BIND_SERVICE
- Make the root filesystem read-only: readOnlyRootFilesystem: true
- Disallow privilege escalation: allowPrivilegeEscalation: false
- Never run with the seccomp or AppArmor profile set to unconfined
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10000
    runAsGroup: 10000
    fsGroup: 10000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp@sha256:abc123...
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE
    volumeMounts:
    - name: tmp
      mountPath: /tmp
    resources:
      limits:
        memory: "256Mi"
        cpu: "500m"
      requests:
        memory: "128Mi"
        cpu: "250m"
  volumes:
  - name: tmp
    emptyDir: {} # Writable /tmp without host access

We've explored container isolation from multiple angles—the layered security stack, individual mechanisms, attack vectors, and enhanced isolation options.
Module Conclusion: Namespaces and cgroups
This module has covered the fundamental Linux kernel primitives that enable containerization: namespaces, which control what a process can see; cgroups, which control what it can consume; and the security layers (capabilities, seccomp, LSMs) built on top of them.
Together, namespaces and cgroups form the foundation upon which all container technologies are built. Whether you're debugging container networking issues, tuning resource limits for production workloads, or securing multi-tenant infrastructure, this knowledge is essential.
Containers are not magic—they're a clever application of kernel primitives. Understanding those primitives empowers you to use containers effectively, debug them confidently, and secure them rigorously.
Congratulations! You've completed the Namespaces and cgroups module. You now understand the kernel primitives that enable containerization, how to configure resource limits, and how to secure containers through defense in depth. This knowledge forms the foundation for working with any container technology.