Of the eight Linux namespace types, three form the essential foundation for practical containerization: PID namespaces, network namespaces, and mount namespaces. These three isolate the most critical system resources—processes, networking, and filesystems—that determine what a container can see, access, and modify.
Understanding these namespaces in depth reveals how container runtimes like Docker, containerd, and CRI-O construct the isolated environments that power modern cloud infrastructure. Each namespace type has unique characteristics, hierarchy rules, and interaction patterns that directly impact container behavior.
By the end of this page, you will understand the internal mechanics of PID, network, and mount namespaces in precise detail. You'll learn about PID namespace hierarchies, network namespace connectivity via virtual interfaces, and mount propagation semantics. This knowledge enables you to debug container issues, implement custom runtime features, and understand why containers behave the way they do.
PID namespaces provide the foundational illusion that each container has its own complete process tree, starting from PID 1. This isolation prevents containers from seeing, signaling, or interfering with processes outside their namespace.
The PID 1 Problem
In UNIX tradition, PID 1 is special—it's the init process, responsible for reaping orphaned processes, handling system-wide signals, and starting and supervising other processes.
When Docker runs a container, the container's main process becomes PID 1 inside the PID namespace. This has profound implications that catch many developers off guard.
PID 1 has special signal handling semantics in the kernel. Unlike other processes, PID 1 only receives signals for which it has explicitly installed handlers. A SIGTERM sent to a PID 1 with no handler is silently ignored (even SIGKILL is ignored when sent from within the same PID namespace; only an ancestor namespace can forcibly kill it). This is why docker stop sometimes times out—the container process ignores SIGTERM because it never expected to be PID 1.
PID Namespace Hierarchy
PID namespaces form a strict parent-child hierarchy. This is unique among namespace types—most other namespaces are flat collections.
When process A in namespace N creates a child PID namespace N', then:

- Processes in N' are visible from N (under different, N-local PIDs)
- Processes in N are invisible from inside N'
- The first process created in N' becomes that namespace's PID 1
This hierarchy is enforced: you cannot join a PID namespace that is an ancestor of your current one—that would break the isolation model.
Dual PID Visibility
A key insight is that processes in nested PID namespaces have multiple PIDs simultaneously—one in each ancestor namespace including their own. The kernel maintains this mapping internally:
```bash
# From host namespace
$ ps aux | grep nginx
root  2001  nginx: master process   # Host PID
root  2002  nginx: worker process   # Host PID

# From inside container
$ ps aux
PID  CMD
1    nginx: master process          # Container PID
2    nginx: worker process          # Container PID
```
The process has PID 2001 in the host namespace and PID 1 in the container namespace. Both are valid; which one you see depends on which namespace you're observing from.
The /proc Interface
Each PID namespace maintains its own view of /proc. When a container mounts its own /proc, it only shows processes within its PID namespace. This is why ps inside a container only shows container processes—it's reading a namespace-scoped /proc.
Zombie Reaping
When a process's parent dies, it becomes orphaned and is reparented to PID 1 of its PID namespace (not the host's PID 1). If the container's PID 1 doesn't properly implement wait() for children, zombies accumulate within the container. This is why init systems like tini or dumb-init exist—they handle zombie reaping for containers whose main process wasn't designed to be PID 1.
```c
// Simple init process that properly reaps zombies.
// Used as PID 1 in containers to wrap the real application.
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void sigchld_handler(int sig) {
    (void)sig;   // No-op: we only need SIGCHLD to be deliverable
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    // Install a SIGCHLD handler so the signal is actually delivered
    struct sigaction sa;
    sa.sa_handler = sigchld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigaction(SIGCHLD, &sa, NULL);

    // Block SIGCHLD; sigsuspend() below unblocks it atomically,
    // avoiding the lost-wakeup race of a plain pause()
    sigset_t block, orig;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &orig);

    // Fork and exec the real application
    pid_t child = fork();
    if (child == 0) {
        sigprocmask(SIG_SETMASK, &orig, NULL);
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    // Parent: reap every terminated child; when the main child exits,
    // propagate its exit status
    for (;;) {
        int status;
        pid_t pid;
        // Reap all available zombie children
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            if (pid == child) {
                if (WIFSIGNALED(status))
                    return 128 + WTERMSIG(status);
                return WEXITSTATUS(status);
            }
        }
        sigsuspend(&orig);   // sleep until the next SIGCHLD
    }
}
```

Network namespaces provide complete network stack isolation. Each network namespace has its own:

- Network interfaces (physical or virtual)
- IP addresses and routing tables
- Netfilter (iptables/nftables) rules
- Port number space and sockets
- /proc/net and /sys/class/net views
A new network namespace starts nearly empty—it contains only an unconfigured loopback interface (lo). To be useful, network devices must be created or moved into it, and routing must be configured.
Network namespaces isolate at layer 3 and above. A container has its own IP addresses, routes, and firewall—it's as if it had a completely separate network stack. This is more thorough than just IP aliasing or port translation; it's true stack virtualization.
Virtual Ethernet Pairs (veth)
The primary mechanism for connecting network namespaces is the veth pair—two virtual Ethernet interfaces that act as a bidirectional pipe. Packets transmitted on one interface appear as received on the other.
Veth pairs are created as a linked pair, then one end is moved to a different namespace:
```bash
# Create veth pair: veth0 and veth1
ip link add veth0 type veth peer name veth1

# Move veth1 to container's network namespace (by PID)
ip link set veth1 netns $CONTAINER_PID

# Configure host end
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

# Configure container end (run inside the namespace, e.g. via nsenter)
nsenter -t $CONTAINER_PID -n ip addr add 10.0.0.2/24 dev veth1
nsenter -t $CONTAINER_PID -n ip link set veth1 up
nsenter -t $CONTAINER_PID -n ip route add default via 10.0.0.1
```
Now traffic from the container (10.0.0.2) reaches the host (10.0.0.1) through the veth pair. The host can NAT this traffic to the external network, providing internet access.
Bridge Networking
Docker's default networking (the bridge mode) uses a software bridge (docker0) in the host namespace. All container veth endpoints connect to this bridge, enabling:

- Container-to-container communication on the same host
- Outbound internet access via NAT (iptables masquerading)
- Inbound access through published ports (DNAT rules)
The bridge acts as a virtual switch at layer 2. Containers on the same bridge can communicate using their bridge-assigned IPs without NAT.
Host Networking Mode
When a container uses --network=host, it shares the host's network namespace entirely. No isolation exists—the container sees all host interfaces and can bind to any port. This is faster (no veth overhead) but sacrifices isolation.
None Networking Mode
With --network=none, the container gets only an unconfigured loopback interface. It has no network connectivity. This is useful for:

- Batch jobs that need no network access
- Security-sensitive workloads where network access must be impossible
- Containers whose networking is configured by an external tool (such as a CNI plugin)
| Mode | Network Namespace | Interfaces | Performance | Isolation |
|---|---|---|---|---|
| bridge (default) | Separate per container | veth pair + bridge | Good (small overhead) | Full network isolation |
| host | Shared with host | All host interfaces | Native (no overhead) | None (shared stack) |
| none | Separate, empty | Only lo (unconfigured) | N/A (no networking) | Complete (no network) |
| container:<id> | Shared with another container | Shared with target | Good (shared namespace) | Shared with target container |
| macvlan | Separate per container | Virtual MAC on host NIC | Near native | Layer 2 separation |
CNI (Container Network Interface)
Kubernetes and other orchestrators use CNI plugins to configure container networking. The CNI specification defines a simple interface: the runtime invokes a plugin binary with a command (ADD, DEL, CHECK, VERSION) in the CNI_COMMAND environment variable, passes the network configuration as JSON on stdin, and reads the result (assigned IPs, routes, DNS) as JSON from stdout.
Popular CNI plugins (Calico, Cilium, Flannel) implement various networking models—overlay networks, BGP peering, eBPF-based routing—all using the same fundamental namespace primitives.
Mount namespaces isolate the filesystem mount table—the kernel data structure that maps directory paths to mounted filesystems. Each mount namespace has its own independent mount table, enabling containers to have entirely different filesystem views than the host.
The Mount Table
Every mount namespace maintains a complete, independent mount table. When a process mounts a filesystem, it only affects processes in the same mount namespace. This enables:
- Container-specific views of /proc, /sys, and /dev with appropriate visibility
- Private temporary filesystems (tmpfs) for container scratch space
```bash
#!/bin/bash
# Demonstrate mount namespace isolation

# Create a new mount namespace
unshare --mount /bin/bash << 'INNER_SHELL'
echo "Inside new mount namespace"

# This mount is only visible in this namespace
mkdir -p /tmp/isolated-demo
mount -t tmpfs tmpfs /tmp/isolated-demo
echo "secret-data" > /tmp/isolated-demo/secret.txt

echo "File exists here:"
cat /tmp/isolated-demo/secret.txt

# Check mount from inside
mount | grep isolated-demo
# Output: tmpfs on /tmp/isolated-demo type tmpfs (rw,relatime)
INNER_SHELL

# Back in original namespace
echo "Outside the mount namespace:"
cat /tmp/isolated-demo/secret.txt 2>/dev/null || echo "File not accessible!"
mount | grep isolated-demo || echo "Mount not visible here!"
```

Mount Propagation
Mount namespaces have a critical feature called mount propagation that controls how mount/unmount events propagate between namespace instances. This is essential for scenarios where the host's mounts should (or should not) be visible inside containers.
There are four propagation types:

- shared: mount and unmount events propagate to and from peer mounts
- slave: events propagate from the master to the slave, but not back
- private: no propagation in either direction (the typical container default)
- unbindable: private, and additionally cannot be used as the source of a bind mount
These propagation semantics enable sophisticated mount configurations:
| Propagation | Host Mounts Visible in Container? | Container Mounts Visible on Host? | Use Case |
|---|---|---|---|
| shared | Yes (after container start) | Yes | Shared filesystem pools |
| slave | Yes (after container start) | No | Dynamic host mounts (USB, NFS) |
| private | No (only initial mounts) | No | Full isolation (typical container) |
| unbindable | No | No | Security-sensitive mounts |
Container Root Filesystem
Container images provide a root filesystem that becomes the container's /. The container runtime uses mount namespaces and overlay filesystems to achieve this:

1. Mount the image's layers as an overlay filesystem
2. Use pivot_root() or chroot() to change root to the overlay mount
3. Mount container-specific /proc, /sys, /dev

OverlayFS: The Container Filesystem
OverlayFS is a union filesystem that overlays multiple directory trees, presenting a unified view. For containers:

- lowerdir: the read-only image layers (possibly several, stacked)
- upperdir: a writable, per-container layer
- workdir: an empty scratch directory OverlayFS uses internally
- merged: the unified mount point that becomes the container's root

When a container reads a file, OverlayFS searches from top to bottom: the upperdir first, then each lowerdir in order, returning the first match.

When a container writes: a file that exists only in a lowerdir is first copied up to the upperdir (copy-on-write), and the write then modifies the upper copy; deletions are recorded as whiteout entries in the upperdir.

This architecture enables:

- Image layers shared read-only between all containers using the same image
- Near-instant container start (no copying of the image)
- A small per-container footprint (only the upperdir diverges)
The first write to a large file triggers a full copy from lowerdir to upperdir. For write-heavy workloads on large files (databases, log files), use volumes that bypass OverlayFS entirely to avoid copy-on-write overhead.
To solidify our understanding, let's walk through creating a minimal container using the namespace primitives directly. This is essentially what container runtimes do, stripped to essentials.
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>   /* makedev() */
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

// Container rootfs path (e.g., extracted alpine rootfs)
const char *rootfs = "./rootfs";

void setup_mounts() {
    // Make all mounts private to prevent propagation
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

    // Bind mount the rootfs to itself (prepare for pivot_root)
    mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL);

    // Create directory for pivot_root's old root
    char old_root[256];
    snprintf(old_root, sizeof(old_root), "%s/.old_root", rootfs);
    mkdir(old_root, 0755);

    // Change root to container rootfs
    if (syscall(SYS_pivot_root, rootfs, old_root) == -1) {
        perror("pivot_root");
        exit(1);
    }

    // Change working directory to new root
    chdir("/");

    // Unmount old root
    umount2("/.old_root", MNT_DETACH);
    rmdir("/.old_root");

    // Mount essential filesystems
    mount("proc", "/proc", "proc", 0, NULL);
    mount("sysfs", "/sys", "sysfs", 0, NULL);
    mount("tmpfs", "/tmp", "tmpfs", 0, NULL);

    // Mount minimal /dev
    mount("tmpfs", "/dev", "tmpfs", MS_NOSUID | MS_STRICTATIME, "mode=755");
    mknod("/dev/null", S_IFCHR | 0666, makedev(1, 3));
    mknod("/dev/zero", S_IFCHR | 0666, makedev(1, 5));
    mknod("/dev/random", S_IFCHR | 0666, makedev(1, 8));
    mknod("/dev/urandom", S_IFCHR | 0666, makedev(1, 9));
}

int container_main(void *arg) {
    char **argv = (char **)arg;

    // Set container hostname (in the new UTS namespace)
    sethostname("container", 9);

    // Setup container filesystem
    setup_mounts();

    printf("Container started (PID %d inside namespace)\n", getpid());
    printf("Hostname: ");
    fflush(stdout);
    system("hostname");

    // Execute the specified command
    execvp(argv[0], argv);
    perror("execvp");
    return 1;
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <command> [args...]\n", argv[0]);
        return 1;
    }

    printf("Starting container with namespaces...\n");

    // Create child process in new namespaces
    int flags = CLONE_NEWPID |  // New PID namespace
                CLONE_NEWNS  |  // New mount namespace
                CLONE_NEWUTS |  // New UTS namespace (hostname)
                CLONE_NEWNET |  // New network namespace
                CLONE_NEWIPC |  // New IPC namespace
                SIGCHLD;

    pid_t child_pid = clone(container_main, child_stack + STACK_SIZE,
                            flags, &argv[1]);
    if (child_pid == -1) {
        perror("clone");
        return 1;
    }

    printf("Container PID from host perspective: %d\n", child_pid);

    // Wait for container to exit
    int status;
    waitpid(child_pid, &status, 0);
    printf("Container exited with status %d\n", WEXITSTATUS(status));
    return WEXITSTATUS(status);
}
```

What This Code Does
Creates namespaces: Uses clone() with namespace flags to create PID, mount, UTS, network, and IPC namespaces
Sets up mount namespace:

- Makes existing mounts private and bind-mounts the container rootfs
- Uses pivot_root() to change the root to the container rootfs
- Mounts /proc, /sys, /tmp with appropriate filesystem types
- Creates a minimal /dev with essential device nodes

Configures hostname: Sets a container-specific hostname in the UTS namespace
Executes command: Runs the specified command inside the isolated environment
This is a simplified version of what Docker or runc does. A production container runtime adds:

- User namespaces with UID/GID mapping
- cgroups for resource limits
- seccomp filters and capability dropping for security hardening
- Network setup (veth pairs, bridges, port mapping)
- Image management (pulling, layer extraction, OverlayFS assembly)
- Error handling, lifecycle management, and an API
The three namespaces we've studied interact in subtle ways. Understanding these interactions is crucial for debugging container issues.
PID and Mount Namespace Interaction
The /proc filesystem is PID-namespace aware. When you mount proc inside a mount namespace that's also in a different PID namespace, the mounted /proc shows only processes in that PID namespace.
```bash
# This won't work as expected:
unshare --mount /bin/bash
mount -t proc proc /proc
ps aux   # Still shows host processes!

# You need both namespaces:
unshare --mount --pid --fork /bin/bash
mount -t proc proc /proc
ps aux   # Now shows only container processes
```
The --fork is required with --pid because the calling process is already assigned a PID. Only new processes can become PID 1 in the new namespace.
If ps inside a container shows host processes, check: (1) Is /proc mounted inside the container? (2) Was /proc mounted from within the container's PID namespace? A /proc mounted before PID namespace creation shows the old PID namespace's processes.
Network and Mount Namespace Interaction
/proc/net and /sys/class/net reflect the network namespace of the viewing process, not the mount namespace. This means:
```bash
# Inside a container that shares the host mount namespace but has a separate netns
ls /sys/class/net   # Shows container interfaces, not host
cat /proc/net/tcp   # Shows container's TCP connections
```
The kernel dynamically generates these entries based on the network namespace, regardless of mount configuration.
Signals Across PID Namespaces
Processes can only signal processes in the same PID namespace or descendant namespaces. A container cannot kill host processes—not just because of permissions, but because from the container's perspective, those PIDs don't exist.
However, the host can signal container processes using their host-visible PIDs:
```bash
# From host, kill a process inside the container
kill -SIGTERM 2001   # Using the host PID, not container PID 1
```
Network Namespace and Loopback
Each network namespace has its own loopback interface (lo). A common issue: the loopback isn't automatically brought up in new namespaces.
```bash
unshare --net /bin/bash
ip link show      # lo exists but is DOWN
ping 127.0.0.1    # Network unreachable!

# Must manually bring up loopback:
ip link set lo up
ping 127.0.0.1    # Now works
```
Container runtimes handle this automatically, but custom namespace usage requires explicit loopback configuration.
Namespaces are designed to be lightweight, but they're not free. Understanding their performance characteristics helps in designing efficient container architectures.
PID Namespace Overhead
PID namespaces add minimal runtime overhead:

- PID lookups translate through the namespace hierarchy (a few extra pointer dereferences)
- fork()/clone() allocates one PID entry per namespace level
- Signal delivery checks namespace membership
In practice, PID namespace overhead is negligible—a few nanoseconds per operation. The hierarchy is typically shallow (1-3 levels), and the kernel optimizes lookups.
Network Namespace Overhead
Network namespaces introduce measurable overhead through veth pairs:

- Each packet crosses the veth pair, adding an extra traversal of the network stack
- Bridge forwarding adds layer 2 lookup work
- NAT (iptables/conntrack) adds per-connection tracking cost
For network-intensive workloads, options to reduce overhead:

- Host networking (--network=host)
- macvlan interfaces attached directly to the host NIC

| Metric | Host Native | Bridge (veth) | macvlan | Host Mode |
|---|---|---|---|---|
| Latency (μs) | Baseline | +10-20 | +5-10 | Baseline |
| Throughput (%) | 100% | 90-95% | 95-99% | 100% |
| CPU Overhead | Baseline | 5-15% | 2-5% | Baseline |
| Isolation | None | Full | L2 only | None |
Mount Namespace Overhead
Mount namespaces themselves add negligible overhead—mount lookups traverse the same VFS structures. The overhead comes from what you mount:

- OverlayFS adds a lookup cost per layer and a one-time copy-up cost on first write
- Deep layer stacks increase metadata overhead; volumes bypass these costs entirely
Memory Overhead
Namespaces consume kernel memory:

- A network namespace is the heaviest: it allocates its own interface tables, routing tables, and netfilter state
- A mount namespace copies its parent's entire mount table
- PID, UTS, and IPC namespaces are comparatively tiny
For thousands of containers, this adds up. Shared namespaces (e.g., multiple containers sharing a network namespace) reduce memory usage.
PID 1 Considerations
The containerized PID 1 receives special treatment:

- Unhandled signals are ignored rather than triggering default actions
- It inherits orphaned descendants and must reap them
- If it exits, the kernel kills every other process in the namespace
Opting for a proper init process (tini, dumb-init) adds minimal overhead but ensures correct signal handling and zombie reaping.
For most workloads, namespace overhead is negligible. Focus optimization on: (1) Minimize OverlayFS layer count for I/O-intensive workloads, (2) Consider host networking for network-intensive workloads requiring lowest latency, (3) Share namespaces between related containers (pods) to reduce overhead and enable efficient communication.
We've explored the three most critical namespace types for containerization in depth. Let's consolidate the key takeaways:

- PID namespaces form a strict hierarchy; a process has one PID per ancestor namespace, and the namespace's PID 1 has special signal and zombie-reaping semantics
- Network namespaces virtualize the entire stack; veth pairs and bridges connect them, and the networking mode trades isolation against performance
- Mount namespaces give each container its own mount table; propagation modes and OverlayFS complete the container filesystem picture
What's next:
Namespaces provide isolation—they limit what containers can see. But seeing resources and consuming them are different concerns. The next page introduces cgroups (control groups), the Linux kernel feature that limits how much of shared resources (CPU, memory, I/O) each container can consume. Together, namespaces and cgroups form the complete container resource model.
You now understand PID, network, and mount namespaces in depth—how they work internally, how they interact, and how container runtimes use them to create isolated environments. Next, we'll explore cgroups for resource control and limiting.