Virtual machines virtualize hardware—they create the illusion of complete computers with their own CPUs, memory, and devices. But what if we don't need separate kernels? What if we just want isolated environments for applications, sharing a single operating system kernel?
OS-level virtualization, most commonly known as containerization, provides exactly this. Instead of virtualizing hardware, containers virtualize the operating system, creating isolated user-space environments that share the host kernel. This approach is lighter, faster, and more efficient than full virtualization—at the cost of some flexibility and isolation guarantees.
Understanding containers is essential for modern software engineering. From Docker and Kubernetes to cloud-native architectures and microservices, containerization has reshaped how we build, deploy, and scale applications.
By the end of this page, you will understand the fundamental mechanisms of OS-level virtualization (namespaces and cgroups), how containers differ from virtual machines, the architecture of container runtimes like Docker and containerd, security implications and hardening techniques, and when to choose containers vs VMs.
The fundamental difference between containers and virtual machines lies in what they virtualize:
Virtual Machines: virtualize hardware. Each VM runs its own kernel and a complete guest operating system on top of emulated or hardware-assisted virtual devices.
Containers: virtualize the operating system. Each container is a set of ordinary processes on the host, isolated by namespaces and constrained by cgroups, all sharing the host kernel.
| Characteristic | Virtual Machines | Containers |
|---|---|---|
| Isolation unit | Complete OS + kernel | Process(es) with isolated namespaces |
| Boot time | Seconds to minutes | Milliseconds to seconds |
| Resource overhead | ~1-2GB per VM (OS) | ~10-100MB per container (app only) |
| Density | 10s of VMs per host | 100s-1000s of containers per host |
| Kernel | Each VM has its own kernel | Shared host kernel |
| Guest OS | Any OS the hypervisor supports | Must be compatible with host kernel |
| Isolation strength | Hardware-enforced (strong) | Kernel-enforced (weaker) |
| Live migration | Supported (with overhead) | Complex, usually stateless restart |
| Snapshot/checkpoint | Mature, well-supported | Possible but less common |
When to Use Each:
Choose Virtual Machines when: you need to run different operating systems or kernel versions, you need strong hardware-enforced isolation between untrusted tenants, or your workload depends on kernel modules and low-level OS customization.
Choose Containers when: your workloads can share a common kernel, you need fast startup and high density, or you are packaging, deploying, and scaling applications (especially microservices and CI/CD workloads).
Modern architectures often combine both: VMs provide the isolation boundary (multi-tenant cloud), while containers inside VMs provide lightweight deployment and scaling. Kubernetes clusters typically run in VMs, with containers orchestrated inside.
Namespaces are a Linux kernel feature that partition system resources so that a set of processes sees only an isolated subset. Namespaces are the primary mechanism for container isolation—they make a container believe it's running on its own system.
Available Namespace Types:
| Namespace | Isolates | Effect in Container |
|---|---|---|
| PID | Process IDs | Container sees its own PID 1 (init), can't see host processes |
| NET | Network stack | Own network interfaces, routing tables, firewall rules |
| MNT | Mount points | Own filesystem view, mount different images |
| UTS | Hostname, domain | Container can have its own hostname |
| IPC | IPC resources | Isolated shared memory, semaphores, message queues |
| USER | User/Group IDs | UID 0 in container can map to unprivileged user on host |
| CGROUP | Cgroup root | Container sees its cgroup as root, can't escape limits |
| TIME | System clocks | (Linux 5.6+) Isolated CLOCK_MONOTONIC, CLOCK_BOOTTIME |
How Namespaces Work:
Namespaces are created via the clone() system call with specific flags, or existing processes can join namespaces via setns(). The unshare() call allows a process to disassociate from its current namespaces.
// Create a child process with new namespaces
clone(child_func, stack,
CLONE_NEWPID | // New PID namespace
CLONE_NEWNET | // New network namespace
CLONE_NEWNS | // New mount namespace
CLONE_NEWUTS | // New UTS namespace
CLONE_NEWIPC, // New IPC namespace
args);
PID Namespace in Detail:
In a new PID namespace:
- The first process created becomes PID 1 and acts as the namespace's init: it must reap orphaned children, and if it exits, the kernel terminates every other process in the namespace.
- Processes inside the namespace cannot see or signal processes outside it, while the host still sees the container's processes under their host PIDs.
Host PID namespace: Container PID namespace:
┌─────────────────────┐ ┌─────────────────────┐
│ PID 1 (systemd) │ │ PID 1 (container │
│ PID 1234 (docker) │────▶│ init) │
│ PID 1235 (containerd)│ │ PID 2 (app) │
│ PID 1240 (container │ │ PID 3 (worker) │
│ entrypoint) │────▶│ │
└─────────────────────┘ └─────────────────────┘
│ │
└── PID 1240 on host = PID 1 in container
Network Namespace in Detail:
Network namespaces provide complete network stack isolation: each namespace has its own network interfaces, IP addresses, routing tables, firewall (iptables/nftables) rules, and port space. A port bound in one namespace does not conflict with the same port in another.
Containers typically get a veth pair—a virtual network cable with one end in the container and one end on the host. The host end connects to a bridge (like docker0), enabling container networking.
#!/bin/bash
# Demonstrate namespace isolation

# Create a new PID and mount namespace, run a shell
sudo unshare --pid --mount --fork /bin/bash

# Inside the new namespace:
# mount -t proc proc /proc   # Mount new proc filesystem
# ps aux                     # Only see processes in this namespace
# echo $$                    # PID 1 in this namespace

# Network namespace example
sudo ip netns add mycontainer
sudo ip netns exec mycontainer ip link list
# lo: <LOOPBACK> ...
# Only loopback exists, completely isolated network

# Create a veth pair connecting namespace to host
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth1 netns mycontainer
sudo ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec mycontainer ip link set veth1 up
sudo ip addr add 10.0.0.1/24 dev veth0
sudo ip link set veth0 up

# Now mycontainer can communicate via the 10.0.0.0/24 network

While namespaces provide isolation (what a process can see), cgroups (control groups) provide resource control (how much a process can use). Cgroups allow you to:
- Limit how much CPU time, memory, block I/O, and how many processes a group may consume
- Prioritize some groups over others when resources are contended
- Account for resource usage per group
- Freeze or kill an entire group of processes together
Cgroup Hierarchy:
Cgroups are organized in a hierarchical filesystem (typically mounted at /sys/fs/cgroup). Each cgroup directory contains files that control and report on the group's resources.
Cgroups v1 vs v2:
Linux has two cgroup versions:
Cgroups v1 (legacy): each controller (cpu, memory, blkio, pids, ...) has its own independent hierarchy. Flexible, but rules can conflict and management is complex.
Cgroups v2 (unified): all controllers attach to a single unified hierarchy with one consistent interface. It is the default on most modern distributions and is what the example below uses.
Setting Resource Limits:
#!/bin/bash
# Cgroups v2 example: Create a cgroup with resource limits

# Create a cgroup for a container
mkdir /sys/fs/cgroup/mycontainer

# Limit memory to 512MB (hard limit)
echo "536870912" > /sys/fs/cgroup/mycontainer/memory.max

# Limit to 50% of one CPU (50000 out of 100000 period)
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max

# Limit to 100 processes
echo "100" > /sys/fs/cgroup/mycontainer/pids.max

# Add current shell to the cgroup
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs

# Now this shell and its children are limited!
# stress --vm 1 --vm-bytes 600M   # Will be OOM killed

# View current usage
cat /sys/fs/cgroup/mycontainer/memory.current
cat /sys/fs/cgroup/mycontainer/cpu.stat

When a container exceeds its memory limit, the kernel's OOM (Out of Memory) Killer terminates processes. By default, it kills individual processes in the offending cgroup. Configure memory.oom.group=1 to kill the entire cgroup at once, preventing partial container states.
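Following on from the script above, the group-kill behavior is a single additional write to the same cgroup directory (a small sketch, assuming the mycontainer cgroup created earlier):

# Kill every process in the cgroup together when its memory limit is exceeded
echo 1 > /sys/fs/cgroup/mycontainer/memory.oom.group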
Containers need a filesystem—the base operating system libraries, tools, and application code. Container images provide this filesystem in a portable, versioned format. Layered filesystems enable efficient storage and sharing.
Container Images:
An image is a read-only template containing:
- A root filesystem: base OS userland (libraries, shells, package managers), language runtimes, dependencies, and your application code
- Metadata: environment variables, exposed ports, the default command/entrypoint, and the ordered list of layers
Images are built in layers. Each layer represents a set of filesystem changes (adding files, modifying files, deleting files). Layers are stacked to create the final filesystem view.
The Layering Model: each step in an image build (for example, each instruction in a Dockerfile) typically produces a new layer containing only the files that step added, changed, or deleted. Layers are identified by content digests, so identical layers are stored once and shared across images and containers.
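You can see this layering directly. A quick sketch (the python:3.12-slim tag is illustrative; any locally available image works):

# Show an image's layers, newest first, with the build step that created each
docker history python:3.12-slim

# The same layer digests appear in the image metadata
docker image inspect --format '{{json .RootFS.Layers}}' python:3.12-slim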
Union Filesystems:
Layering is implemented using union filesystems (or overlay filesystems). These present multiple directories as a single unified view:
OverlayFS (Linux's built-in union filesystem):
┌─────────────────────────────────────┐
│ Merged View (what container sees) │
├─────────────────────────────────────┤
│ Upper (R/W) │ Lower 4 │ Lower 3 │
│ ┌──────┐ │ (app) │ (deps) │
│ │ logs │ ├─────────┼────────────┤
│ └──────┘ │ Lower 2 │ Lower 1 │
│ │ (python)│ (ubuntu) │
└─────────────────────────────────────┘
Copy-on-Write (CoW):
When a container modifies a file from a lower layer:
1. The file is copied up into the container's writable upper layer (the "copy-up").
2. The write is applied to that copy.
3. The merged view now shows the modified copy; the original in the lower layer is untouched and still shared. Deletions use "whiteout" entries in the upper layer to hide lower-layer files.
This is efficient: unchanged files are shared among all containers using the same image. Only modified files consume additional space.
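You can reproduce this behavior by hand with OverlayFS. The sketch below uses hypothetical directories lower1 and lower2 (read-only layers), upper (writable layer), and work (OverlayFS scratch space):

# Assemble two read-only layers and a writable layer into one merged view
mkdir -p lower1 lower2 upper work merged
echo "from layer 1" > lower1/a.txt
echo "from layer 2" > lower2/b.txt

sudo mount -t overlay overlay \
  -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged

ls merged/              # a.txt and b.txt appear side by side
echo "edit" >> merged/a.txt
ls upper/               # a.txt was copied up; lower1/a.txt is unchanged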
Benefits of Layering:
- Storage efficiency: layers common to many images (a base OS, a language runtime) are stored on disk once
- Faster pulls and pushes: only layers not already present are transferred
- Build caching: unchanged build steps reuse their existing layers instead of being rebuilt
Container runtimes are the software that actually creates and runs containers. The ecosystem has evolved from monolithic tools to layered, standardized components.
The Container Runtime Stack: an orchestrator or CLI (Kubernetes, Docker) talks to a high-level runtime (containerd, CRI-O) that manages images and container lifecycles, which in turn invokes a low-level OCI runtime (runc, crun) that actually creates the namespaces and cgroups and starts the container process.
Low-Level Runtimes (OCI Runtimes):
These implement the OCI (Open Container Initiative) runtime specification:
runc — The reference implementation of the OCI runtime specification. Originally extracted from Docker, now maintained under the Open Container Initiative. It is the component that actually calls clone(), sets up namespaces and cgroups, and execs the container process.
crun — A lightweight alternative written in C. Smaller, faster startup, memory-efficient.
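To make the OCI layer concrete, here is a rough sketch of running a container directly with runc, bypassing the Docker daemon (it borrows Docker only to export a root filesystem, and the container ID demo is arbitrary):

# Build an OCI bundle: a root filesystem plus a config.json
mkdir -p bundle/rootfs
docker export "$(docker create alpine)" | tar -C bundle/rootfs -xf -

cd bundle
runc spec            # generate a default config.json
sudo runc run demo   # create namespaces/cgroups and start the container
sudo runc list       # list containers known to runc (from another shell)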
Alternative Runtimes:
gVisor — Google's container runtime with an intercepting kernel (application kernel). Catches system calls and implements them in a user-space kernel, providing stronger isolation without VM overhead.
Kata Containers — Runs each container in a lightweight VM. Combines container UX with VM isolation. Uses a minimal guest kernel and hypervisor.
High-Level Runtimes:
containerd — A graduated CNCF project. Manages the complete container lifecycle: image pull, storage, container execution, networking. Used by Docker and Kubernetes.
CRI-O — Built specifically for Kubernetes. Implements the Container Runtime Interface (CRI) without extra features. Minimal and focused.
Docker Engine — The original container platform. Now uses containerd underneath. Provides the familiar docker CLI, Compose, and Swarm orchestration.
| Runtime | Type | Primary Use Case | Key Feature |
|---|---|---|---|
| runc | OCI (low-level) | Standard container execution | Reference implementation |
| crun | OCI (low-level) | Resource-constrained environments | Fast, small footprint |
| gVisor | OCI (low-level) | Multi-tenant security | User-space kernel isolation |
| Kata Containers | OCI (low-level) | Strong isolation needs | VM-based containers |
| containerd | High-level daemon | Container management | Kubernetes default, mature |
| CRI-O | High-level daemon | Kubernetes only | Minimal, Kubernetes-focused |
| Docker Engine | Full platform | Developer experience | Familiar tooling, Compose |
Containers share the host kernel, making security critical. A kernel vulnerability affects all containers. Multiple defense layers are employed to harden container isolation:
Security Mechanisms:
- Namespaces and cgroups: the baseline isolation and resource limits described above
- Linux capabilities: root's power split into discrete privileges that can be dropped individually
- Seccomp: filters which system calls the container may make
- LSMs (AppArmor, SELinux): mandatory access control policies over files, networking, and capabilities
- User namespaces: map root inside the container to an unprivileged user on the host
- Read-only root filesystems and no-new-privileges: shrink what a compromised process can do
Seccomp Profiles:
Seccomp filters system calls before they execute:
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
{
"names": ["read", "write", "open", "close", ...],
"action": "SCMP_ACT_ALLOW"
},
{
"names": ["mount", "umount", "reboot", "kexec_load"],
"action": "SCMP_ACT_ERRNO"
}
]
}
Docker's default seccomp profile blocks ~44 system calls, including those that could facilitate container escape or system damage.
Capability Dropping: Linux splits root's privileges into roughly 40 distinct capabilities (CAP_NET_ADMIN, CAP_SYS_ADMIN, and so on). Containers should drop everything they don't explicitly need:
# Run container with minimal capabilities:
#   --cap-drop=ALL                        drop all capabilities
#   --cap-add=NET_BIND_SERVICE            only add what's needed
#   --read-only                           read-only root filesystem
#   --security-opt no-new-privileges      prevent privilege escalation
#   --security-opt seccomp=default.json   apply a seccomp profile
#   --user 1000:1000                      run as a non-root user
docker run --rm -it \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --read-only \
  --security-opt=no-new-privileges:true \
  --security-opt=seccomp=default.json \
  --user 1000:1000 \
  myimage:latest

# View capabilities of a running container
docker inspect --format='{{.HostConfig.CapAdd}}' container_name
docker inspect --format='{{.HostConfig.CapDrop}}' container_name

Running containers with --privileged disables almost all isolation. The container has full access to the host's devices, can mount filesystems, and can escape easily. Never use --privileged in production unless absolutely necessary (and even then, consider alternatives like specific capability grants or device cgroup rules).
Running one container is simple. Running hundreds across multiple hosts, with networking, storage, scaling, and failure recovery, requires orchestration.
Kubernetes: The Industry Standard
Kubernetes (K8s) has become the de facto standard for container orchestration. It provides:
- Scheduling: placing containers onto nodes based on resource requests and constraints
- Self-healing: restarting failed containers and replacing unhealthy Pods
- Service discovery and load balancing across replicas
- Rolling updates and rollbacks of application versions
- Horizontal autoscaling based on observed metrics
- Declarative configuration: you describe the desired state, and controllers continuously reconcile toward it
Kubernetes Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Control Plane │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ API Server │ │ Scheduler │ │ Controller │ │
│ │ (kube-apiserver) │ │ │ Manager │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ etcd (distributed key-value store) │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
(API calls)
│
┌─────────────────────────────────────────────────────────────┐
│ Worker Nodes │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │
│ │ │ kubelet │ │ │ │ kubelet │ │ │ │ kubelet │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ container │ │ │ │ container │ │ │ │ container │ │ │
│ │ │ runtime │ │ │ │ runtime │ │ │ │ runtime │ │ │
│ │ ├─────────────┤ │ │ ├─────────────┤ │ │ ├─────────────┤ │ │
│ │ │ [pod][pod] │ │ │ │ [pod][pod] │ │ │ │ [pod][pod] │ │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Core Kubernetes Concepts:
| Resource | Description | Example Use |
|---|---|---|
| Pod | Smallest deployable unit (1+ containers) | Run an application container |
| Deployment | Manages replicated Pods with updates | Stateless applications |
| StatefulSet | Pods with stable identity and storage | Databases, stateful apps |
| Service | Stable network endpoint for Pods | Load balancing, discovery |
| ConfigMap | Configuration data as key-value pairs | Application configuration |
| Secret | Sensitive data (base64-encoded; optionally encrypted at rest) | Passwords, API keys |
| Namespace | Virtual cluster for multi-tenancy | Team/environment isolation |
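The sketch below shows imperative kubectl equivalents for a few of these concepts; it assumes a working cluster, and the names web and nginx:1.27 are illustrative:

# A Deployment manages replicated Pods of a stateless application
kubectl create deployment web --image=nginx:1.27 --replicas=3

# A Service gives the Pods a stable, load-balanced endpoint
kubectl expose deployment web --port=80

# Observe the resulting resources, then change the desired state
kubectl get deployments,pods,services
kubectl scale deployment web --replicas=5   # controllers reconcile to 5 Pods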
While Kubernetes dominates, alternatives exist: Docker Swarm (simpler, integrated with Docker), HashiCorp Nomad (lightweight, multi-workload), Amazon ECS (AWS-native). For learning containers, start with Docker alone; add orchestration when you need multi-container, multi-host deployments.
Containers need network connectivity to serve requests and communicate with each other. Container networking leverages Linux network namespaces and virtual networking.
Common Networking Modes:
| Mode | Description | Use Case |
|---|---|---|
| Bridge | Containers connect to a virtual bridge (docker0), NAT to host | Default mode, most common |
| Host | Container uses host network namespace directly | Maximum performance, no isolation |
| None | Container has only loopback, no external network | Security-isolated workloads |
| Overlay | Multi-host networking via encapsulation | Kubernetes, Swarm clusters |
| Macvlan | Container gets its own MAC address on physical network | Legacy apps needing real network presence |
Bridge Networking in Detail:
┌─────────────────────────────────────────────────────────┐
│ Host Network Namespace │
│ │
│ ┌─────────────────────┐ │
│ │ Physical Interface │ │
│ │ eth0 (192.168.1.10) │◄─────── Internet/LAN │
│ └─────────────────────┘ │
│ │ │
│ │ NAT (iptables MASQUERADE) │
│ │ │
│ ┌─────────────────────┐ │
│ │ Docker Bridge │ │
│ │ docker0 (172.17.0.1)│ │
│ └─────────────────────┘ │
│ │ │ │ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │veth0 │ │veth1 │ │veth2 │ (veth host ends) │
└───┴───────┴─┴───────┴─┴───────┴──────────────────────────┘
│ │ │
┌───────────┐ ┌───────────┐ ┌───────────┐
│Container 1│ │Container 2│ │Container 3│
│eth0: │ │eth0: │ │eth0: │
│172.17.0.2 │ │172.17.0.3 │ │172.17.0.4 │
└───────────┘ └───────────┘ └───────────┘
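To see this wiring on a Docker host, something like the following sketch works (the container name, image, and published port 8080 are illustrative):

# Start a container that publishes a port
docker run -d --name web -p 8080:80 nginx

# The bridge and the veth host ends attached to it
ip addr show docker0
bridge link

# The container's address on the bridge network
docker inspect --format '{{.NetworkSettings.IPAddress}}' web

# NAT rules that masquerade outbound traffic and forward the published port
sudo iptables -t nat -L POSTROUTING -n
sudo iptables -t nat -L DOCKER -n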
Container Network Interface (CNI):
CNI is the standard plugin interface for container networking, used by Kubernetes and other orchestrators. CNI plugins handle:
- Allocating an IP address for the container (IPAM)
- Creating and configuring the container's network interface (typically a veth pair)
- Setting up routes, and optionally network policy and encryption
- Cleaning all of this up when the container is deleted
Popular CNI plugins: Calico, Flannel, Weave, Cilium (eBPF-based).
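A CNI network is described by a small JSON config that the runtime hands to the plugin. The sketch below writes a minimal config for the reference bridge plugin; the network name, subnet, and the conventional /etc/cni/net.d location are assumptions for illustration:

# Minimal CNI config: bridge plugin with host-local IPAM
sudo tee /etc/cni/net.d/10-mynet.conf > /dev/null <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}
EOF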
Service Mesh (Advanced):
For complex microservices, a service mesh (like Istio, Linkerd) adds a sidecar proxy alongside each service instance. The mesh provides:
- Mutual TLS between services without application changes
- Fine-grained traffic management: routing, traffic splitting, retries, and timeouts
- Uniform observability: per-request metrics, distributed tracing, and access logs
We've explored OS-level virtualization and containerization in depth. To consolidate the essential concepts:
- Containers are ordinary processes isolated by namespaces (what they can see) and constrained by cgroups (what they can use), sharing the host kernel.
- Images are layered, copy-on-write filesystems that make containers portable and cheap to store and distribute.
- The runtime stack is layered and standardized: high-level runtimes (containerd, CRI-O) manage lifecycles, while OCI runtimes (runc, crun, gVisor, Kata) create the isolation.
- Container security relies on defense in depth: capabilities, seccomp, LSMs, user namespaces, and minimal privileges.
- Orchestrators like Kubernetes handle scheduling, networking, scaling, and recovery across many hosts.
- Choose VMs for strong isolation and kernel diversity, containers for density and speed, and combine them when you need both.
Module Complete:
You've now completed your exploration of virtual machines—from Type 1 hypervisors running directly on hardware, through Type 2 hypervisors on desktop systems, to the hardware extensions that make virtualization efficient, and finally to OS-level virtualization with containers. This comprehensive understanding of virtualization technologies prepares you to make informed decisions about isolation, performance, and architecture in your systems work.
You now understand OS-level virtualization, from the kernel mechanisms (namespaces, cgroups) to the container ecosystem (runtimes, images, orchestration). This knowledge is essential for modern software deployment and cloud-native architecture.