Virtual machines virtualize hardware—they create the illusion of complete computers with their own CPUs, memory, and devices. But what if we don't need separate kernels? What if we just want isolated environments for applications, sharing a single operating system kernel?
OS-level virtualization, most commonly known as containerization, provides exactly this. Instead of virtualizing hardware, containers virtualize the operating system, creating isolated user-space environments that share the host kernel. This approach is lighter, faster, and more efficient than full virtualization—at the cost of some flexibility and isolation guarantees.
Understanding containers is essential for modern software engineering. From Docker and Kubernetes to cloud-native architectures and microservices, containerization has reshaped how we build, deploy, and scale applications.
By the end of this page, you will understand the fundamental mechanisms of OS-level virtualization (namespaces and cgroups), how containers differ from virtual machines, the architecture of container runtimes like Docker and containerd, security implications and hardening techniques, and when to choose containers vs VMs.
The fundamental difference between containers and virtual machines lies in what they virtualize:
Virtual Machines: virtualize hardware. Each VM runs its own kernel and a complete guest operating system on top of emulated or hardware-assisted virtual devices.
Containers: virtualize the operating system. Each container is a set of ordinary processes on the host, isolated by namespaces and constrained by cgroups, all sharing the host kernel.
| Characteristic | Virtual Machines | Containers |
|---|---|---|
| Isolation unit | Complete OS + kernel | Process(es) with isolated namespaces |
| Boot time | Seconds to minutes | Milliseconds to seconds |
| Resource overhead | ~1-2GB per VM (OS) | ~10-100MB per container (app only) |
| Density | 10s of VMs per host | 100s-1000s of containers per host |
| Kernel | Each VM has its own kernel | Shared host kernel |
| Guest OS | Any OS the hypervisor supports | Must be compatible with host kernel |
| Isolation strength | Hardware-enforced (strong) | Kernel-enforced (weaker) |
| Live migration | Supported (with overhead) | Complex, usually stateless restart |
| Snapshot/checkpoint | Mature, well-supported | Possible but less common |
When to Use Each:
Choose Virtual Machines when: you need to run different operating systems or kernel versions, you need strong hardware-enforced isolation between untrusted tenants, or your workload depends on kernel modules and low-level OS customization.
Choose Containers when: your workloads can share a common kernel, you need fast startup and high density, or you are packaging, deploying, and scaling applications (especially microservices and CI/CD workloads).
Modern architectures often combine both: VMs provide the isolation boundary (multi-tenant cloud), while containers inside VMs provide lightweight deployment and scaling. Kubernetes clusters typically run in VMs, with containers orchestrated inside.
Namespaces are a Linux kernel feature that partition system resources so that a set of processes sees only an isolated subset. Namespaces are the primary mechanism for container isolation—they make a container believe it's running on its own system.
Available Namespace Types:
| Namespace | Isolates | Effect in Container |
|---|---|---|
| PID | Process IDs | Container sees its own PID 1 (init), can't see host processes |
| NET | Network stack | Own network interfaces, routing tables, firewall rules |
| MNT | Mount points | Own filesystem view, mount different images |
| UTS | Hostname, domain | Container can have its own hostname |
| IPC | IPC resources | Isolated shared memory, semaphores, message queues |
| USER | User/Group IDs | UID 0 in container can map to unprivileged user on host |
| CGROUP | Cgroup root | Container sees its cgroup as root, can't escape limits |
| TIME | System clocks | (Linux 5.6+) Isolated CLOCK_MONOTONIC, CLOCK_BOOTTIME |
How Namespaces Work:
Namespaces are created via the clone() system call with specific flags, or existing processes can join namespaces via setns(). The unshare() call allows a process to disassociate from its current namespaces.
// Create a child process with new namespaces
clone(child_func, stack,
CLONE_NEWPID | // New PID namespace
CLONE_NEWNET | // New network namespace
CLONE_NEWNS | // New mount namespace
CLONE_NEWUTS | // New UTS namespace
CLONE_NEWIPC, // New IPC namespace
args);
PID Namespace in Detail:
In a new PID namespace:
- The first process created becomes PID 1 and acts as the namespace's init: it must reap orphaned children, and if it exits, the kernel terminates every other process in the namespace.
- Processes inside the namespace cannot see or signal processes outside it, while the host still sees the container's processes under their host PIDs.
Host PID namespace: Container PID namespace:
┌─────────────────────┐ ┌─────────────────────┐
│ PID 1 (systemd) │ │ PID 1 (container │
│ PID 1234 (docker) │────▶│ init) │
│ PID 1235 (containerd)│ │ PID 2 (app) │
│ PID 1240 (container │ │ PID 3 (worker) │
│ entrypoint) │────▶│ │
└─────────────────────┘ └─────────────────────┘
│ │
└── PID 1240 on host = PID 1 in container
Network Namespace in Detail:
Network namespaces provide complete network stack isolation: each namespace has its own network interfaces, IP addresses, routing tables, firewall (iptables/nftables) rules, and port space. A port bound in one namespace does not conflict with the same port in another.
Containers typically get a veth pair—a virtual network cable with one end in the container and one end on the host. The host end connects to a bridge (like docker0), enabling container networking.
#!/bin/bash
# Demonstrate namespace isolation

# Create a new PID and mount namespace, run a shell
sudo unshare --pid --mount --fork /bin/bash

# Inside the new namespace:
# mount -t proc proc /proc   # Mount new proc filesystem
# ps aux                     # Only see processes in this namespace
# echo $$                    # PID 1 in this namespace

# Network namespace example
sudo ip netns add mycontainer
sudo ip netns exec mycontainer ip link list
# lo: <LOOPBACK> ...
# Only loopback exists, completely isolated network

# Create a veth pair connecting namespace to host
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth1 netns mycontainer
sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec mycontainer ip link set veth1 up
sudo ip addr add 10.0.0.1/24 dev veth0
sudo ip link set veth0 up

# Now mycontainer can communicate via the 10.0.0.0/24 network

While namespaces provide isolation (what a process can see), cgroups (control groups) provide resource control (how much a process can use). Cgroups allow you to:
- Limit how much CPU time, memory, block I/O, and how many processes a group may consume
- Prioritize some groups over others when resources are contended
- Account for resource usage per group
- Freeze or kill an entire group of processes together
Cgroup Hierarchy:
Cgroups are organized in a hierarchical filesystem (typically mounted at /sys/fs/cgroup). Each cgroup directory contains files that control and report on the group's resources.
Cgroups v1 vs v2:
Linux has two cgroup versions:
Cgroups v1 (legacy): each controller (cpu, memory, blkio, pids, ...) has its own independent hierarchy. Flexible, but rules can conflict and management is complex.
Cgroups v2 (unified): all controllers attach to a single unified hierarchy with one consistent interface. It is the default on most modern distributions and is what the example below uses.
Setting Resource Limits:
#!/bin/bash
# Cgroups v2 example: Create a cgroup with resource limits

# Create a cgroup for a container
mkdir /sys/fs/cgroup/mycontainer

# Limit memory to 512MB (hard limit)
echo "536870912" > /sys/fs/cgroup/mycontainer/memory.max

# Limit to 50% of one CPU (50000 out of 100000 period)
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max

# Limit to 100 processes
echo "100" > /sys/fs/cgroup/mycontainer/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs

# Now this shell and its children are limited!
# stress --vm 1 --vm-bytes 600M   # Will be OOM killed

# View current usage
cat /sys/fs/cgroup/mycontainer/memory.current
cat /sys/fs/cgroup/mycontainer/cpu.stat

When a container exceeds its memory limit, the kernel's OOM (Out of Memory) Killer terminates processes. By default, it kills individual processes in the offending cgroup. Configure memory.oom.group=1 to kill the entire cgroup at once, preventing partial container states.
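Following on from the script above, the group-kill behavior is a single additional write to the same cgroup directory (a small sketch, assuming the mycontainer cgroup created earlier):

# Kill every process in the cgroup together when its memory limit is exceeded
echo 1 > /sys/fs/cgroup/mycontainer/memory.oom.group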
Containers need a filesystem—the base operating system libraries, tools, and application code. Container images provide this filesystem in a portable, versioned format. Layered filesystems enable efficient storage and sharing.
Container Images:
An image is a read-only template containing:
- A root filesystem: base OS userland (libraries, shells, package managers), language runtimes, dependencies, and your application code
- Metadata: environment variables, exposed ports, the default command/entrypoint, and the ordered list of layers
Images are built in layers. Each layer represents a set of filesystem changes (adding files, modifying files, deleting files). Layers are stacked to create the final filesystem view.
The Layering Model: each step in an image build (for example, each instruction in a Dockerfile) typically produces a new layer containing only the files that step added, changed, or deleted. Layers are identified by content digests, so identical layers are stored once and shared across images and containers.
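You can see this layering directly. A quick sketch (the python:3.12-slim tag is illustrative; any locally available image works):

# Show an image's layers, newest first, with the build step that created each
docker history python:3.12-slim

# The same layer digests appear in the image metadata
docker image inspect --format '{{json .RootFS.Layers}}' python:3.12-slim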
Union Filesystems:
Layering is implemented using union filesystems (or overlay filesystems). These present multiple directories as a single unified view:
OverlayFS (Linux's built-in union filesystem):
┌─────────────────────────────────────┐
│ Merged View (what container sees) │
├─────────────────────────────────────┤
│ Upper (R/W) │ Lower 4 │ Lower 3 │
│ ┌──────┐ │ (app) │ (deps) │
│ │ logs │ ├─────────┼────────────┤
│ └──────┘ │ Lower 2 │ Lower 1 │
│ │ (python)│ (ubuntu) │
└─────────────────────────────────────┘
Copy-on-Write (CoW):
When a container modifies a file from a lower layer:
1. The file is copied up into the container's writable upper layer (the "copy-up").
2. The write is applied to that copy.
3. The merged view now shows the modified copy; the original in the lower layer is untouched and still shared. Deletions use "whiteout" entries in the upper layer to hide lower-layer files.
This is efficient: unchanged files are shared among all containers using the same image. Only modified files consume additional space.
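You can reproduce this behavior by hand with OverlayFS. The sketch below uses hypothetical directories lower1 and lower2 (read-only layers), upper (writable layer), and work (OverlayFS scratch space):

# Assemble two read-only layers and a writable layer into one merged view
mkdir -p lower1 lower2 upper work merged
echo "from layer 1" > lower1/a.txt
echo "from layer 2" > lower2/b.txt

sudo mount -t overlay overlay \
  -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged

ls merged/              # a.txt and b.txt appear side by side
echo "edit" >> merged/a.txt
ls upper/               # a.txt was copied up; lower1/a.txt is unchanged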
Benefits of Layering:
- Storage efficiency: layers common to many images (a base OS, a language runtime) are stored on disk once
- Faster pulls and pushes: only layers not already present are transferred
- Build caching: unchanged build steps reuse their existing layers instead of being rebuilt
Container runtimes are the software that actually creates and runs containers. The ecosystem has evolved from monolithic tools to layered, standardized components.
The Container Runtime Stack: an orchestrator or CLI (Kubernetes, Docker) talks to a high-level runtime (containerd, CRI-O) that manages images and container lifecycles, which in turn invokes a low-level OCI runtime (runc, crun) that actually creates the namespaces and cgroups and starts the container process.
Low-Level Runtimes (OCI Runtimes):
These implement the OCI (Open Container Initiative) runtime specification:
runc — The reference implementation of the OCI runtime specification. Originally extracted from Docker, now maintained under the Open Container Initiative. It is the component that actually calls clone(), sets up namespaces and cgroups, and execs the container process.
crun — A lightweight alternative written in C. Smaller, faster startup, memory-efficient.
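To make the OCI layer concrete, here is a rough sketch of running a container directly with runc, bypassing the Docker daemon (it borrows Docker only to export a root filesystem, and the container ID demo is arbitrary):

# Build an OCI bundle: a root filesystem plus a config.json
mkdir -p bundle/rootfs
docker export "$(docker create alpine)" | tar -C bundle/rootfs -xf -

cd bundle
runc spec            # generate a default config.json
sudo runc run demo   # create namespaces/cgroups and start the container
sudo runc list       # list containers known to runc (from another shell)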
Alternative Runtimes:
gVisor — Google's container runtime with an intercepting kernel (application kernel). Catches system calls and implements them in a user-space kernel, providing stronger isolation without VM overhead.
Kata Containers — Runs each container in a lightweight VM. Combines container UX with VM isolation. Uses a minimal guest kernel and hypervisor.
High-Level Runtimes:
containerd — A graduated CNCF project. Manages the complete container lifecycle: image pull, storage, container execution, networking. Used by Docker and Kubernetes.
CRI-O — Built specifically for Kubernetes. Implements the Container Runtime Interface (CRI) without extra features. Minimal and focused.
Docker Engine — The original container platform. Now uses containerd underneath. Provides the familiar docker CLI, Compose, and Swarm orchestration.
| Runtime | Type | Primary Use Case | Key Feature |
|---|---|---|---|
| runc | OCI (low-level) | Standard container execution | Reference implementation |
| crun | OCI (low-level) | Resource-constrained environments | Fast, small footprint |
| gVisor | OCI (low-level) | Multi-tenant security | User-space kernel isolation |
| Kata Containers | OCI (low-level) | Strong isolation needs | VM-based containers |
| containerd | High-level daemon | Container management | Kubernetes default, mature |
| CRI-O | High-level daemon | Kubernetes only | Minimal, Kubernetes-focused |
| Docker Engine | Full platform | Developer experience | Familiar tooling, Compose |
Containers share the host kernel, making security critical. A kernel vulnerability affects all containers. Multiple defense layers are employed to harden container isolation:
Security Mechanisms:
- Namespaces and cgroups: the baseline isolation and resource limits described above
- Linux capabilities: root's power split into discrete privileges that can be dropped individually
- Seccomp: filters which system calls the container may make
- LSMs (AppArmor, SELinux): mandatory access control policies over files, networking, and capabilities
- User namespaces: map root inside the container to an unprivileged user on the host
- Read-only root filesystems and no-new-privileges: shrink what a compromised process can do
Seccomp Profiles:
Seccomp filters system calls before they execute:
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "open", "close", ...],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["mount", "umount", "reboot", "kexec_load"],
"action": "SCMP_ACT_ERRNO"
}
]
}
Docker's default seccomp profile blocks ~44 system calls, including those that could facilitate container escape or system damage.
Capability Dropping: Linux splits root's privileges into roughly 40 distinct capabilities (CAP_NET_ADMIN, CAP_SYS_ADMIN, and so on). Containers should drop everything they don't explicitly need:
# Run container with minimal capabilities:
#   --cap-drop=ALL                        drop all capabilities
#   --cap-add=NET_BIND_SERVICE            only add what's needed
#   --read-only                           read-only root filesystem
#   --security-opt no-new-privileges      prevent privilege escalation
#   --security-opt seccomp=default.json   apply a seccomp profile
#   --user 1000:1000                      run as a non-root user
docker run --rm -it \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --read-only \
  --security-opt=no-new-privileges:true \
  --security-opt=seccomp=default.json \
  --user 1000:1000 \
  myimage:latest

# View capabilities of a running container
docker inspect --format='{{.HostConfig.CapAdd}}' container_name
docker inspect --format='{{.HostConfig.CapDrop}}' container_name

Running containers with --privileged disables almost all isolation. The container has full access to the host's devices, can mount filesystems, and can escape easily. Never use --privileged in production unless absolutely necessary (and even then, consider alternatives like specific capability grants or device cgroup rules).
Running one container is simple. Running hundreds across multiple hosts, with networking, storage, scaling, and failure recovery, requires orchestration.
Kubernetes: The Industry Standard
Kubernetes (K8s) has become the de facto standard for container orchestration. It provides:
- Scheduling: placing containers onto nodes based on resource requests and constraints
- Self-healing: restarting failed containers and replacing unhealthy Pods
- Service discovery and load balancing across replicas
- Rolling updates and rollbacks of application versions
- Horizontal autoscaling based on observed metrics
- Declarative configuration: you describe the desired state, and controllers continuously reconcile toward it
Kubernetes Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ API Server │ │ Scheduler │ │ Controller │ │
│ │ (kube-apiserver) │ │ │ Manager │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ etcd (distributed key-value store) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
(API calls)
│
┌─────────────────────────────────────────────────────────────┐
│ Worker Nodes │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │
│ │ │ kubelet │ │ │ │ kubelet │ │ │ │ kubelet │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ container │ │ │ │ container │ │ │ │ container │ │ │
│ │ │ runtime │ │ │ │ runtime │ │ │ │ runtime │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ [pod][pod] │ │ │ │ [pod][pod] │ │ │ │ [pod][pod] │ │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Core Kubernetes Concepts:
| Resource | Description | Example Use |
|---|---|---|
| Pod | Smallest deployable unit (1+ containers) | Run an application container |
| Deployment | Manages replicated Pods with updates | Stateless applications |
| StatefulSet | Pods with stable identity and storage | Databases, stateful apps |
| Service | Stable network endpoint for Pods | Load balancing, discovery |
| ConfigMap | Configuration data as key-value pairs | Application configuration |
| Secret | Sensitive data (base64-encoded; optionally encrypted at rest) | Passwords, API keys |
| Namespace | Virtual cluster for multi-tenancy | Team/environment isolation |
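The sketch below shows imperative kubectl equivalents for a few of these concepts; it assumes a working cluster, and the names web and nginx:1.27 are illustrative:

# A Deployment manages replicated Pods of a stateless application
kubectl create deployment web --image=nginx:1.27 --replicas=3

# A Service gives the Pods a stable, load-balanced endpoint
kubectl expose deployment web --port=80

# Observe the resulting resources, then change the desired state
kubectl get deployments,pods,services
kubectl scale deployment web --replicas=5   # controllers reconcile to 5 Pods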
While Kubernetes dominates, alternatives exist: Docker Swarm (simpler, integrated with Docker), HashiCorp Nomad (lightweight, multi-workload), Amazon ECS (AWS-native). For learning containers, start with Docker alone; add orchestration when you need multi-container, multi-host deployments.
Containers need network connectivity to serve requests and communicate with each other. Container networking leverages Linux network namespaces and virtual networking.
Common Networking Modes:
| Mode | Description | Use Case |
|---|---|---|
| Bridge | Containers connect to a virtual bridge (docker0), NAT to host | Default mode, most common |
| Host | Container uses host network namespace directly | Maximum performance, no isolation |
| None | Container has only loopback, no external network | Security-isolated workloads |
| Overlay | Multi-host networking via encapsulation | Kubernetes, Swarm clusters |
| Macvlan | Container gets its own MAC address on physical network | Legacy apps needing real network presence |
Bridge Networking in Detail:
┌─────────────────────────────────────────────────────────┐
│ Host Network Namespace │
│ │
│ ┌─────────────────────┐ │
│ │ Physical Interface │ │
│ │ eth0 (192.168.1.10) │◄─────── Internet/LAN │
│ └─────────────────────┘ │
│ │ │
│ │ NAT (iptables MASQUERADE) │
│ │ │
│ ┌─────────────────────┐ │
│ │ Docker Bridge │ │
│ │ docker0 (172.17.0.1)│ │
│ └─────────────────────┘ │
│ │ │ │ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │veth0 │ │veth1 │ │veth2 │ (veth host ends) │
└───┴───────┴─┴───────┴─┴───────┴──────────────────────────┘
│ │ │
┌───────────┐ ┌───────────┐ ┌───────────┐
│Container 1│ │Container 2│ │Container 3│
│eth0: │ │eth0: │ │eth0: │
│172.17.0.2 │ │172.17.0.3 │ │172.17.0.4 │
└───────────┘ └───────────┘ └───────────┘
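To see this wiring on a Docker host, something like the following sketch works (the container name, image, and published port 8080 are illustrative):

# Start a container that publishes a port
docker run -d --name web -p 8080:80 nginx

# The bridge and the veth host ends attached to it
ip addr show docker0
bridge link

# The container's address on the bridge network
docker inspect --format '{{.NetworkSettings.IPAddress}}' web

# NAT rules that masquerade outbound traffic and forward the published port
sudo iptables -t nat -L POSTROUTING -n
sudo iptables -t nat -L DOCKER -n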
Container Network Interface (CNI):
CNI is the standard plugin interface for container networking, used by Kubernetes and other orchestrators. CNI plugins handle:
- Allocating an IP address for the container (IPAM)
- Creating and configuring the container's network interface (typically a veth pair)
- Setting up routes, and optionally network policy and encryption
- Cleaning all of this up when the container is deleted
Popular CNI plugins: Calico, Flannel, Weave, Cilium (eBPF-based).
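A CNI network is described by a small JSON config that the runtime hands to the plugin. The sketch below writes a minimal config for the reference bridge plugin; the network name, subnet, and the conventional /etc/cni/net.d location are assumptions for illustration:

# Minimal CNI config: bridge plugin with host-local IPAM
sudo tee /etc/cni/net.d/10-mynet.conf > /dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}
EOF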
Service Mesh (Advanced):
For complex microservices, a service mesh (like Istio, Linkerd) adds a sidecar proxy alongside each service instance. The mesh provides:
- Mutual TLS between services without application changes
- Fine-grained traffic management: routing, traffic splitting, retries, and timeouts
- Uniform observability: per-request metrics, distributed tracing, and access logs
We've explored OS-level virtualization and containerization in depth. To consolidate the essential concepts:
- Containers are ordinary processes isolated by namespaces (what they can see) and constrained by cgroups (what they can use), sharing the host kernel.
- Images are layered, copy-on-write filesystems that make containers portable and cheap to store and distribute.
- The runtime stack is layered and standardized: high-level runtimes (containerd, CRI-O) manage lifecycles, while OCI runtimes (runc, crun, gVisor, Kata) create the isolation.
- Container security relies on defense in depth: capabilities, seccomp, LSMs, user namespaces, and minimal privileges.
- Orchestrators like Kubernetes handle scheduling, networking, scaling, and recovery across many hosts.
- Choose VMs for strong isolation and kernel diversity, containers for density and speed, and combine them when you need both.
Module Complete:
You've now completed your exploration of virtual machines—from Type 1 hypervisors running directly on hardware, through Type 2 hypervisors on desktop systems, to the hardware extensions that make virtualization efficient, and finally to OS-level virtualization with containers. This comprehensive understanding of virtualization technologies prepares you to make informed decisions about isolation, performance, and architecture in your systems work.
You now understand OS-level virtualization, from the kernel mechanisms (namespaces, cgroups) to the container ecosystem (runtimes, images, orchestration). This knowledge is essential for modern software deployment and cloud-native architecture.