Linux Namespaces and Control Groups

Level: Advanced

Duration: 90 mins

Topic: Namespaces and cgroups

Linux Namespaces

The Foundation of Container Isolation

When Docker runs a container, or when Kubernetes orchestrates thousands of workloads across a cluster, the fundamental isolation that keeps each container's processes, filesystems, and network interfaces separate from the host and from each other comes from a surprisingly elegant kernel feature: Linux namespaces.

Namespaces are not containers themselves—they are the building blocks from which containers are constructed. Understanding namespaces means understanding the atomic units of isolation that the Linux kernel provides, and how containerization technologies compose these primitives to create the sandboxed environments we rely on in modern infrastructure.

What You Will Learn

By the end of this page, you will understand what namespaces are, why they exist, the historical context that led to their development, the different types of namespaces Linux provides, and how the kernel implements namespace isolation at the process level. You will gain the foundational knowledge required to understand containerization from first principles.

The Namespace Concept

At its core, a namespace is a kernel mechanism that partitions system resources so that one set of processes sees one set of resources while another set of processes sees a different set. Each namespace type isolates a specific global system resource, creating the illusion that processes within the namespace have their own isolated instance of that resource.

Consider how a traditional UNIX system works: all processes share a single view of the system. Every process sees the same process tree, the same network interfaces, the same mounted filesystems, the same hostname. This shared view is both powerful (enabling easy inter-process communication and resource sharing) and limiting (making true isolation impossible without heavyweight virtualization).

Namespaces change this fundamental assumption. Instead of a single global namespace for each resource type, the kernel can maintain multiple instances of that namespace. Processes are assigned to specific namespace instances, and their view of the system is confined to what that namespace exposes.

The Namespace Abstraction

Think of namespaces as creating parallel universes within the kernel. Each universe has its own version of certain system resources. Processes living in one universe cannot see or affect processes in another universe—from their perspective, their universe is the entire system.

Key properties of namespaces:

Isolation: Processes in different instances of a namespace cannot see each other's resources for that namespace type
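Hierarchical: PID and user namespaces nest. A parent PID namespace can see processes in its child namespaces but not vice versa, and privileges held in a user namespace extend over its descendants; the other namespace types have no parent-child relationship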
Composable: A process belongs to exactly one instance of each namespace type, and different namespace types operate independently
Lightweight: Unlike full virtualization, namespaces are implemented within the same kernel and share most kernel data structures, making them extremely efficient
Progressive adoption: New processes inherit their parent's namespace memberships by default, but can be moved to new namespaces at creation time or, for some types, during execution

Historical Context and Evolution

Linux namespaces did not emerge fully formed. They evolved over nearly two decades, with each namespace type addressing specific isolation requirements that became apparent as the industry's understanding of containerization matured.

The Pre-Namespace Era (Before 2002)

Before namespaces, process isolation on Linux was limited to traditional UNIX mechanisms:

Users and groups: Basic access control but no visibility isolation
chroot: Filesystem isolation, but easily escapable and incomplete
Separate machines: True isolation required dedicated hardware or heavyweight VMs

The need for lightweight isolation became pressing as hosting providers sought better multi-tenancy, and as security researchers demonstrated the limitations of chroot for containment.

Timeline of Linux Namespace Evolution
Year	Kernel Version	Namespace Type	Purpose
2002	2.4.19	Mount (mnt)	Filesystem isolation, extending chroot concept
2006	2.6.19	UTS	Hostname and domain name isolation
2006	2.6.19	IPC	System V IPC and POSIX message queue isolation
2008	2.6.24	PID	Process ID isolation, virtualized PID trees
2009	2.6.29	Network (net)	Network stack isolation, virtual interfaces
2013	3.8	User	User and group ID mapping isolation
2016	4.6	Cgroup	Control group hierarchy isolation
2020	5.6	Time	System time virtualization (per-namespace clocks)

The mount namespace (2002) came first, inspired by the Plan 9 operating system's concept of per-process filesystem views. It allowed different processes to see different filesystem layouts, a more robust form of isolation than chroot because it cannot be circumvented by simply navigating directory structures.

The containerization wave (2006-2009) brought rapid development. The UTS, IPC, and PID namespaces emerged as engineers at IBM, the OpenVZ project, and other organizations worked to bring container support into the mainline kernel, which LXC later built on. Network namespaces completed the picture for practical containerization.

User namespaces (2013) represented a paradigm shift—they enabled unprivileged users to create and administer containers by mapping UID/GID ranges, fundamentally changing the security model.

Recent additions (2016-2020) reflect ongoing refinement. Cgroup namespaces improve container nesting, while time namespaces enable scenarios requiring time manipulation without host privileges.

Ongoing Development

Namespace development continues. Proposals for additional namespace types (such as syslog namespaces) are regularly discussed in the Linux kernel community. The architecture is intentionally extensible, allowing new isolation boundaries to be added as requirements emerge.

The Eight Namespace Types

Modern Linux (kernel 5.6+) supports eight distinct namespace types, each isolating a specific system resource. Understanding each type is essential for comprehending how containers achieve comprehensive isolation.

Mount Namespace (CLONE_NEWNS)

The mount namespace isolates the set of filesystem mount points seen by processes. When a process is in its own mount namespace, mounting or unmounting filesystems affects only processes in that namespace—the host and other containers remain unaffected.

This is foundational for containers: it enables each container to have its own root filesystem, with its own /proc, /sys, and any other mount points, completely independent of the host's mount table.

Mount Namespace Capabilities

•Independent mount trees — Each namespace has its own /proc/mounts
•Mount propagation control — Shared, slave, private, and unbindable propagation modes
•Filesystem overlay — OverlayFS and bind mounts for image layering
•Security boundary — Prevents container mount operations from affecting host

UTS Namespace (CLONE_NEWUTS)

Named after the 'UNIX Time-sharing System', the UTS namespace isolates the system's hostname and NIS (Network Information Service) domain name. Each UTS namespace has its own hostname and domainname, allowing containers to have distinct identities without affecting each other or the host.

While seemingly simple, hostname isolation is critical for applications that use hostname for identity, logging, and service discovery.

IPC Namespace (CLONE_NEWIPC)

The IPC namespace isolates System V IPC objects (semaphores, message queues, shared memory segments) and POSIX message queues. Processes in different IPC namespaces cannot access each other's IPC resources, even if they know the identifiers.

This prevents information leakage between containers through IPC channels and ensures that IPC identifier collisions between containers cannot occur.

PID Namespace (CLONE_NEWPID)

The PID namespace virtualizes process IDs. Each PID namespace has its own independent set of PIDs starting from 1. The first process in a new PID namespace becomes PID 1 within that namespace—the init process for that container.

Critically, PID namespaces are hierarchical. A process is visible in its own PID namespace and all ancestor namespaces (with different PIDs in each). This allows the host to manage container processes while containers cannot see or signal processes outside their namespace.

Network Namespace (CLONE_NEWNET)

The network namespace provides complete network stack isolation. Each network namespace has its own:

Network devices (eth0, lo, etc.)
IP addresses and routing tables
Port numbers (two containers can both bind to port 80)
Firewall rules (iptables/nftables)
/proc/net contents

Virtual ethernet pairs (veth) connect namespaces, enabling controlled network communication between containers and with the host.

User Namespace (CLONE_NEWUSER)

The user namespace is perhaps the most powerful and complex. It isolates user and group IDs, enabling a process to have different UIDs inside and outside the namespace. A process can be root (UID 0) inside its user namespace while being an unprivileged user on the host.

This capability enables:

Rootless containers (containers run by non-root users)
Improved security (container root has no host privileges)
Safe delegation of namespace creation to unprivileged users

Cgroup Namespace (CLONE_NEWCGROUP)

The cgroup namespace virtualizes the view of the cgroup hierarchy. A process in a cgroup namespace sees its current cgroup as the root of the hierarchy, rather than seeing the full host cgroup tree. This prevents containers from discovering information about other containers or the host through /proc/self/cgroup.

Time Namespace (CLONE_NEWTIME)

The newest addition, time namespaces give processes their own offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. This supports scenarios such as container checkpoint/restore and migration, where these clocks must stay consistent even though the destination host booted at a different time. CLOCK_REALTIME (wall-clock time) is not virtualized, so a time namespace cannot give a container a different date.

Namespace Types Summary
Namespace	Clone Flag	Isolates	Primary Use Case
Mount	CLONE_NEWNS	Filesystem mount points	Container root filesystem
UTS	CLONE_NEWUTS	Hostname, domain name	Container identity
IPC	CLONE_NEWIPC	IPC objects, message queues	IPC isolation
PID	CLONE_NEWPID	Process IDs	Process tree isolation
Network	CLONE_NEWNET	Network stack	Network isolation
User	CLONE_NEWUSER	User/Group IDs	Privilege separation
Cgroup	CLONE_NEWCGROUP	Cgroup hierarchy view	Container nesting
Time	CLONE_NEWTIME	Monotonic and boot-time clocks	Time virtualization

Kernel Implementation Architecture

Understanding how the kernel implements namespaces illuminates both their power and their limitations. The implementation is remarkably elegant, built around a few key data structures and system calls.

The nsproxy Structure

At the heart of namespace implementation is the nsproxy structure. Every process (represented by a task_struct) has a pointer to an nsproxy, which in turn contains pointers to the actual namespace objects the process belongs to.

include/linux/nsproxy.h (simplified)
C
struct nsproxy {
    atomic_t count;                    // Reference count (refcount_t in recent kernels)
    struct uts_namespace *uts_ns;      // UTS namespace
    struct ipc_namespace *ipc_ns;      // IPC namespace
    struct mnt_namespace *mnt_ns;      // Mount namespace
    struct pid_namespace *pid_ns_for_children;  // PID namespace for children
    struct net *net_ns;                // Network namespace
    struct time_namespace *time_ns;    // Time namespace
    struct time_namespace *time_ns_for_children;  // Time namespace for children
    struct cgroup_namespace *cgroup_ns; // Cgroup namespace
};

Reference counting ensures namespaces persist as long as any process uses them. When the last process exits a namespace (and no external references remain, such as bind-mounted namespace files), the namespace is destroyed and its resources released.

Namespace inheritance works through the nsproxy. When a process forks, the child typically shares the parent's nsproxy—both point to the same namespace instances. Only when a process explicitly creates a new namespace (via clone() or unshare()) does the kernel allocate a new namespace object and potentially a new nsproxy.

The User Namespace Exception

Notice that user_namespace is not in nsproxy. Instead, it's stored directly in the process's credentials (struct cred). This reflects user namespaces' special role in the permission model—they affect privilege checks throughout the kernel, not just resource visibility.

User Namespace Ownership

Every namespace (except user namespaces themselves) is owned by a user namespace. This ownership determines which user namespace's privilege rules apply when operating on that namespace. A process with CAP_SYS_ADMIN in the user namespace that owns a mount namespace can mount filesystems in that mount namespace, even if it lacks that capability in other user namespaces.

Namespace Objects

Each namespace type has its own kernel structure containing the isolated resources. For example:

struct pid_namespace contains the PID allocator (an IDR in current kernels; older kernels used a bitmap), a pointer to the namespace's init process, and a link to the parent PID namespace
struct net contains routing tables, netfilter rules, and network device lists
struct mnt_namespace contains the mount tree root and mount ID allocator

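These structures are allocated when the namespace is created. How they are populated depends on the type: a new mount namespace starts as a copy of the creator's mount tree and a new UTS namespace copies the creator's hostname and domain name, whereas a new network namespace starts with only a loopback device and a new IPC namespace starts empty.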

The Namespace Filesystem (/proc/[pid]/ns/)

The kernel exposes namespace membership through a special filesystem. For each process, /proc/[pid]/ns/ contains symbolic links representing the namespaces that process belongs to. These links have several remarkable properties:

Namespace File Properties

•Unique inode per namespace — Each namespace instance has a unique inode number, enabling comparison (same inode = same namespace)
•Bind mountable — Can be bind-mounted elsewhere, keeping the namespace alive even without processes
•Openable — Opening the file gives a file descriptor that can be passed to setns() to join that namespace
•Magic symlink — The link target format (e.g., net:[4026531840]) encodes the namespace type and inode

System Calls for Namespace Operations

Three primary system calls govern namespace creation and manipulation: clone(), unshare(), and setns(). Understanding these is essential for anyone working with containers or implementing namespace-aware applications.

clone() — Create Process in New Namespaces

The clone() system call creates a new process (like fork()) but with fine-grained control over what is shared between parent and child. Namespace flags determine which new namespaces the child should enter.

namespace-clone-example.c
C
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE 65536

// Run as root: these flags require CAP_SYS_ADMIN because no user
// namespace is created alongside them.
static int child_func(void *arg) {
    (void)arg;  // Unused
    // This child is now in:
    // - A new UTS namespace (can set its own hostname)
    // - A new PID namespace (sees itself as PID 1)
    // - A new mount namespace (mount ops are isolated)

    if (sethostname("container", 9) == -1)
        perror("sethostname");
    printf("Child PID (inside namespace): %d\n", getpid());
    // Will print: Child PID (inside namespace): 1
    fflush(stdout);  // Flush explicitly in case stdout is not a terminal

    // Keep running to demonstrate isolation
    sleep(60);
    return 0;
}

int main(void) {
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        exit(1);
    }

    // Create child in new namespaces
    int flags = CLONE_NEWUTS |    // New UTS namespace
                CLONE_NEWPID |    // New PID namespace
                CLONE_NEWNS  |    // New mount namespace
                SIGCHLD;          // Send SIGCHLD on exit

    pid_t child_pid = clone(child_func,
                            stack + STACK_SIZE,  // Stack grows downward
                            flags,
                            NULL);

    if (child_pid == -1) {
        perror("clone");
        free(stack);
        exit(1);
    }

    printf("Child PID (from parent's view): %d\n", child_pid);
    // Will print: Child PID (from parent's view): <actual PID>

    waitpid(child_pid, NULL, 0);
    free(stack);
    return 0;
}

unshare() — Disassociate from Namespaces

The unshare() system call allows an existing process to create new namespaces and move itself into them. This is useful when you want to isolate the current process rather than creating a new one.

namespace-unshare-example.c
C
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    printf("Before unshare - hostname: ");
    fflush(stdout);  // Flush so the label appears before the command's output
    system("hostname");

    // Create new UTS and mount namespaces for this process
    // (requires CAP_SYS_ADMIN, so run as root)
    if (unshare(CLONE_NEWUTS | CLONE_NEWNS) == -1) {
        perror("unshare");
        return 1;
    }

    // Now isolated - changing hostname only affects this process
    // and any children it creates
    if (sethostname("isolated", 8) == -1) {
        perror("sethostname");
        return 1;
    }
    printf("After unshare - hostname: ");
    fflush(stdout);
    system("hostname");  // Prints: isolated

    // Parent shell still has original hostname!
    return 0;
}

setns() — Join Existing Namespaces

The setns() system call moves a process into an existing namespace, specified by a file descriptor. This is how tools like docker exec work—they open the namespace files of a running container and use setns() to join them.

namespace-setns-example.c
C
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <container-pid>\n", argv[0]);
        return 1;
    }

    char ns_path[256];
    int target_pid = atoi(argv[1]);

    // Open the target container's network namespace
    snprintf(ns_path, sizeof(ns_path),
             "/proc/%d/ns/net", target_pid);
    int ns_fd = open(ns_path, O_RDONLY | O_CLOEXEC);

    if (ns_fd == -1) {
        perror("open namespace");
        return 1;
    }

    // Join the network namespace (requires CAP_SYS_ADMIN)
    if (setns(ns_fd, CLONE_NEWNET) == -1) {
        perror("setns");
        close(ns_fd);
        return 1;
    }

    close(ns_fd);

    // Now running in the container's network namespace
    // We see the container's network interfaces, routes, etc.
    // execlp() searches PATH, since the location of ip varies by distribution
    execlp("ip", "ip", "addr", (char *)NULL);

    // Only reached if exec failed
    perror("execlp");
    return 1;
}

The nsenter Command

The nsenter utility wraps setns() for command-line use. For example, nsenter -t <pid> -n ip addr joins the network namespace of process <pid> and runs ip addr there. This is invaluable for debugging containers.

Namespace Lifecycle and Persistence

Namespaces have a defined lifecycle governed by reference counting. Understanding this lifecycle is crucial for implementing robust containerization and avoiding resource leaks.

Default Lifecycle (Process-Bound)

By default, a namespace's lifetime is tied to the processes within it:

Creation: A new namespace is created when a process calls clone() with namespace flags or unshare()
Usage: The namespace exists while at least one process belongs to it
Destruction: When the last process exits the namespace, the kernel destroys it and releases resources

This means that a container's namespaces naturally clean up when all container processes terminate—no explicit garbage collection needed.

Persistent Namespaces (Bind-Mount Trick)

Sometimes you need a namespace to survive even when no processes are using it—for example, to prepare a network namespace before launching a container. This is achieved by bind mounting the namespace file:

# Create a new network namespace without any process
ip netns add my_namespace

# This actually does:
# 1. unshare(CLONE_NEWNET) in a helper process
# 2. Bind mounts /proc/self/ns/net to /var/run/netns/my_namespace
# 3. Helper process exits, but namespace persists due to bind mount

The bind mount holds a reference to the namespace object, preventing destruction even with zero processes. Removing the bind mount releases this reference.

File Descriptor References

Opening a namespace file (e.g., open("/proc/1234/ns/net", O_RDONLY)) also holds a reference. This is used by orchestrators to:

Hold namespaces while reconfiguring them
Pass namespace references between processes over Unix domain sockets (SCM_RIGHTS)
Inspect namespace properties without joining them

Namespace Resource Leaks

Be careful when bind-mounting or holding file descriptors to namespaces. A forgotten bind mount will prevent namespace cleanup, leading to resource leaks. This is a common source of 'phantom' network interfaces or mount points that survive container deletion.

Orphaned Namespace Handling

PID namespaces have special destruction semantics. When the init process (PID 1) of a PID namespace dies, the kernel sends SIGKILL to all remaining processes in that namespace. This ensures containers are fully terminated when their init dies:

Container init process (PID 1 inside, PID 12345 outside) crashes
Kernel iterates through all processes in that PID namespace
Each process receives SIGKILL
Once all processes exit, the namespace is destroyed

This reaping behavior prevents orphaned zombie containers.

Security Considerations

Namespaces are a powerful security mechanism, but they have limitations and gotchas that practitioners must understand. Security depends on properly configuring multiple namespace types together, combined with other kernel features.

What Namespaces Protect Against

When properly configured, namespaces prevent containerized processes from:

Seeing host processes (PID namespace)
Accessing host network interfaces (network namespace)
Modifying host filesystem mounts (mount namespace)
Signaling host processes (PID namespace)
Using host IPC mechanisms (IPC namespace)
Binding to host ports (network namespace)

Common Security Pitfalls

•Shared namespaces: Running containers with --network=host or --pid=host negates isolation for that resource type
•Privileged containers: Using --privileged grants all capabilities and access to all host devices, defeating namespace isolation
•Excessive capabilities: Capabilities like CAP_SYS_ADMIN can allow namespace escape depending on configuration
•Kernel vulnerabilities: Namespaces are implemented in the kernel; kernel bugs can enable escapes
•/proc and /sys exposure: Improperly mounted procfs/sysfs can leak host information or allow writes

User Namespaces as a Security Boundary

User namespaces fundamentally change the security model. Without user namespaces, UID 0 inside a container is UID 0 on the host, so a process that escapes the container has root access. With user namespaces:

Container root (UID 0) maps to an unprivileged host user (e.g., UID 100000)
Kernel permission checks use the mapped UID
Escaping the container yields unprivileged host access

Capability Scoping

Capabilities (the split-up components of traditional root power) are scoped to user namespaces. A process can hold CAP_NET_ADMIN in the user namespace that owns its network namespace (allowing it to configure networking inside the container) without holding that capability in the initial user namespace (preventing host network configuration).

This scoping enables fine-grained privilege delegation:

Grant CAP_SYS_ADMIN only within a user namespace
Allow mount operations only within a mount namespace
Permit network configuration only within a network namespace

Namespaces Are Not Virtual Machines

Containers share the host kernel. Kernel vulnerabilities can break namespace isolation. For strong security boundaries (e.g., multi-tenant cloud hosting), combine namespaces with additional layers: seccomp filters, AppArmor/SELinux profiles, or run containers inside lightweight VMs (like Kata Containers or Firecracker).

Summary: Linux Namespaces

We've covered substantial ground in understanding Linux namespaces. Let's consolidate the key takeaways:

Key Takeaways

•Namespaces are isolation primitives — They partition global system resources so processes see isolated views of the system
•Eight namespace types exist — Mount, UTS, IPC, PID, Network, User, Cgroup, and Time, each isolating a specific resource
•Namespaces are composable — A process belongs to one instance of each namespace type; containers combine multiple namespaces
•Three system calls manage namespaces — clone() creates processes in new namespaces, unshare() moves the current process, setns() joins existing namespaces
•Namespaces are reference-counted — They persist while processes or bind mounts reference them, then are automatically destroyed
•User namespaces enable rootless containers — UID mapping allows unprivileged users to create and administer isolated environments
•Security requires defense in depth — Namespaces must be combined with capabilities, seccomp, and LSMs for robust isolation

What's next:

Now that we understand the namespace concept and the eight namespace types at a high level, the next page dives deep into the three most critical namespaces for practical containerization: PID namespaces, network namespaces, and mount namespaces. We'll explore their internal mechanics, hierarchical structure, and how they interact to create the isolation that containers depend upon.

Page Complete

You now understand the foundational concept of Linux namespaces—the kernel primitive that enables containerization. You know the eight namespace types, how they're implemented in the kernel, and the system calls used to create and manage them. Next, we'll explore PID, network, and mount namespaces in depth.

1 / 5

Loading learning content...

Operating SystemsNamespaces and cgroups

Linux Namespaces and Control Groups

LevelAdvanced

Duration90 mins

TopicNamespaces and cgroups

1 / 5

Linux Namespaces

The Foundation of Container Isolation

What You Will Learn

The Namespace Concept

The Namespace Abstraction

Key properties of namespaces:

Isolation: Processes in different instances of a namespace cannot see each other's resources for that namespace type
Hierarchical: Namespaces can form hierarchies, with parent namespaces having visibility into child namespaces but not vice versa
Composable: A process belongs to exactly one instance of each namespace type, and different namespace types operate independently
Lightweight: Unlike full virtualization, namespaces are implemented within the same kernel and share most kernel data structures, making them extremely efficient
Progressive adoption: New processes inherit their parent's namespace memberships by default, but can be moved to new namespaces at creation time or, for some types, during execution

Historical Context and Evolution

The Pre-Namespace Era (Before 2002)

Before namespaces, process isolation on Linux was limited to traditional UNIX mechanisms:

Users and groups: Basic access control but no visibility isolation
chroot: Filesystem isolation, but easily escapable and incomplete
Separate machines: True isolation required dedicated hardware or heavyweight VMs

The need for lightweight isolation became pressing as hosting providers sought better multi-tenancy, and as security researchers demonstrated the limitations of chroot for containment.

Timeline of Linux Namespace Evolution
Year	Kernel Version	Namespace Type	Purpose
2002	2.4.19	Mount (mnt)	Filesystem isolation, extending chroot concept
2006	2.6.19	UTS	Hostname and domain name isolation
2006	2.6.19	IPC	System V IPC and POSIX message queue isolation
2006	2.6.24	PID	Process ID isolation, virtualized PID trees
2009	2.6.29	Network (net)	Network stack isolation, virtual interfaces
2013	3.8	User	User and group ID mapping isolation
2016	4.6	Cgroup	Control group hierarchy isolation
2020	5.6	Time	System time virtualization (per-namespace clocks)

User namespaces (2013) represented a paradigm shift—they enabled unprivileged users to create and administer containers by mapping UID/GID ranges, fundamentally changing the security model.

Recent additions (2016-2020) reflect ongoing refinement. Cgroup namespaces improve container nesting, while time namespaces enable scenarios requiring time manipulation without host privileges.

Ongoing Development

The Eight Namespace Types

Mount Namespace (CLONE_NEWNS)

Mount Namespace Capabilities

•Independent mount trees — Each namespace has its own /proc/mounts
•Mount propagation control — Shared, slave, private, and unbindable propagation modes
•Filesystem overlay — OverlayFS and bind mounts for image layering
•Security boundary — Prevents container mount operations from affecting host

UTS Namespace (CLONE_NEWUTS)

While seemingly simple, hostname isolation is critical for applications that use hostname for identity, logging, and service discovery.

IPC Namespace (CLONE_NEWIPC)

This prevents information leakage between containers through IPC channels and ensures that IPC identifier collisions between containers cannot occur.

PID Namespace (CLONE_NEWPID)

Network Namespace (CLONE_NEWNET)

The network namespace provides complete network stack isolation. Each network namespace has its own:

Network devices (eth0, lo, etc.)
IP addresses and routing tables
Port numbers (two containers can both bind to port 80)
Firewall rules (iptables/nftables)
/proc/net contents

Virtual ethernet pairs (veth) connect namespaces, enabling controlled network communication between containers and with the host.

User Namespace (CLONE_NEWUSER)

This capability enables:

Rootless containers (containers run by non-root users)
Improved security (container root has no host privileges)
Safe delegation of namespace creation to unprivileged users

Cgroup Namespace (CLONE_NEWCGROUP)

Time Namespace (CLONE_NEWTIME)

Namespace Types Summary
Namespace	Clone Flag	Isolates	Primary Use Case
Mount	CLONE_NEWNS	Filesystem mount points	Container root filesystem
UTS	CLONE_NEWUTS	Hostname, domain name	Container identity
IPC	CLONE_NEWIPC	IPC objects, message queues	IPC isolation
PID	CLONE_NEWPID	Process IDs	Process tree isolation
Network	CLONE_NEWNET	Network stack	Network isolation
User	CLONE_NEWUSER	User/Group IDs	Privilege separation
Cgroup	CLONE_NEWCGROUP	Cgroup hierarchy view	Container nesting
Time	CLONE_NEWTIME	System clocks	Time virtualization

Kernel Implementation Architecture

The nsproxy Structure

kernel/nsproxy.h (simplified)
C
1
2
3
4
5
6
7
8
9
10
11
struct nsproxy {
    atomic_t count;                    // Reference count
    struct uts_namespace *uts_ns;      // UTS namespace
    struct ipc_namespace *ipc_ns;      // IPC namespace  
    struct mnt_namespace *mnt_ns;      // Mount namespace
    struct pid_namespace *pid_ns_for_children;  // PID namespace for children
    struct net *net_ns;                // Network namespace
    struct time_namespace *time_ns;    // Time namespace
    struct time_namespace *time_ns_for_children;
    struct cgroup_namespace *cgroup_ns; // Cgroup namespace
};

The User Namespace Exception

User Namespace Ownership

Namespace Objects

Each namespace type has its own kernel structure containing the isolated resources. For example:

struct pid_namespace contains the PID allocation bitmap, init process pointer, and link to parent PID namespace
struct net contains routing tables, netfilter rules, and network device lists
struct mnt_namespace contains the mount tree root and mount ID allocator

These structures are created during namespace creation and populated with either empty/default resources (for truly new namespaces) or copies of the parent's resources (for most namespace types).

The Namespace Filesystem (/proc/[pid]/ns/)

Namespace File Properties

•Unique inode per namespace — Each namespace instance has a unique inode number, enabling comparison (same inode = same namespace)
•Bind mountable — Can be bind-mounted elsewhere, keeping the namespace alive even without processes
•Openable — Opening the file gives a file descriptor that can be passed to setns() to join that namespace
•Magic symlink — The link target format (e.g., net:[4026531840]) encodes the namespace type and inode
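
The inode comparison and the magic symlink format described above can be checked from user space. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names `ns_path`, `same_namespace` and `ns_link` are hypothetical. It compares both the device ID and the inode number, which is the comparison the namespaces(7) manual page describes.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<type>" into buf. */
static void ns_path(pid_t pid, const char *type, char *buf, size_t len)
{
    assert(type != NULL && buf != NULL);
    snprintf(buf, len, "/proc/%d/ns/%s", (int)pid, type);
}

/* Returns 1 if both processes are in the same namespace of the given type,
 * 0 if they are not, and -1 if either namespace file cannot be examined. */
static int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path_a[64], path_b[64];
    struct stat st_a, st_b;

    ns_path(a, type, path_a, sizeof path_a);
    ns_path(b, type, path_b, sizeof path_b);
    if (stat(path_a, &st_a) == -1 || stat(path_b, &st_b) == -1)
        return -1;
    return st_a.st_dev == st_b.st_dev && st_a.st_ino == st_b.st_ino;
}

/* Copy the symlink target (for example "net:[4026531840]") into buf.
 * Returns 0 on success, -1 on failure. */
static int ns_link(pid_t pid, const char *type, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    assert(len > 0);
    ns_path(pid, type, path, sizeof path);
    n = readlink(path, buf, len - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';   /* readlink() does not NUL-terminate */
    return 0;
}
```

For example, `same_namespace(getpid(), 1, "net")` reports whether the caller shares a network namespace with PID 1. Reading another process's namespace files is subject to the usual ptrace-style access checks, so it may fail for processes owned by other users.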

System Calls for Namespace Operations

clone() — Create Process in New Namespaces

namespace-clone-example.c
C
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static int child_func(void *arg) {
    (void)arg;
    // This child is now in:
    // - A new UTS namespace (can set its own hostname)
    // - A new PID namespace (sees itself as PID 1)
    // - A new mount namespace (it starts as a copy of the parent's mounts;
    //   changes stay private once propagation is set to MS_PRIVATE)

    if (sethostname("container", 9) == -1)
        perror("sethostname");
    printf("Child PID (inside namespace): %d\n", getpid());
    // Prints: Child PID (inside namespace): 1

    // Keep running to demonstrate isolation
    sleep(60);
    return 0;
}

int main(void) {
    // Creating these namespaces requires CAP_SYS_ADMIN (for example, run as root)
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc");
        exit(1);
    }

    // Create child in new namespaces
    int flags = CLONE_NEWUTS |    // New UTS namespace
                CLONE_NEWPID |    // New PID namespace
                CLONE_NEWNS  |    // New mount namespace
                SIGCHLD;          // Send SIGCHLD to the parent on exit

    pid_t child_pid = clone(child_func,
                            stack + STACK_SIZE,  // Stack grows downward
                            flags,
                            NULL);

    if (child_pid == -1) {
        perror("clone");
        free(stack);
        exit(1);
    }

    printf("Child PID (from parent's view): %d\n", child_pid);
    // Prints: Child PID (from parent's view): <actual PID>

    waitpid(child_pid, NULL, 0);
    free(stack);
    return 0;
}

unshare() — Disassociate from Namespaces

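The unshare() system call lets an existing process create new namespaces and move itself into them, which is useful when you want to isolate the current process rather than create a new one. PID and time namespaces are the exception: after unshare(CLONE_NEWPID) or unshare(CLONE_NEWTIME) the caller stays in its current namespace, and only children it creates afterward enter the new one.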

namespace-unshare-example.c
C
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    printf("Before unshare - hostname: ");
    fflush(stdout);  // Flush so the label appears before the child's output
    system("hostname");

    // Create new UTS and mount namespaces for this process (needs CAP_SYS_ADMIN)
    if (unshare(CLONE_NEWUTS | CLONE_NEWNS) == -1) {
        perror("unshare");
        return 1;
    }

    // Now isolated: the new hostname is visible only to this process and its children
    if (sethostname("isolated", 8) == -1) {
        perror("sethostname");
        return 1;
    }
    printf("After unshare - hostname: ");
    fflush(stdout);
    system("hostname");  // Prints: isolated

    // The parent shell still sees the original hostname
    return 0;
}

setns() — Join Existing Namespaces

namespace-setns-example.c
C
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <container-pid>\n", argv[0]);
        return 1;
    }

    char ns_path[256];
    int target_pid = atoi(argv[1]);

    // Open the target container's network namespace
    snprintf(ns_path, sizeof(ns_path),
             "/proc/%d/ns/net", target_pid);
    int ns_fd = open(ns_path, O_RDONLY);

    if (ns_fd == -1) {
        perror("open namespace");
        return 1;
    }

    // Join the network namespace (needs CAP_SYS_ADMIN)
    if (setns(ns_fd, CLONE_NEWNET) == -1) {
        perror("setns");
        close(ns_fd);
        return 1;
    }

    close(ns_fd);

    // Now running in the container's network namespace
    // We see the container's network interfaces, routes, etc.
    // execlp() searches PATH, since the location of ip varies by distribution
    execlp("ip", "ip", "addr", (char *)NULL);

    // Only reached if exec failed
    perror("execlp");
    return 1;
}

The nsenter Command

Namespace Lifecycle and Persistence

Namespaces have a defined lifecycle governed by reference counting. Understanding this lifecycle is crucial for implementing robust containerization and avoiding resource leaks.

Default Lifecycle (Process-Bound)

By default, a namespace's lifetime is tied to the processes within it:

Creation: A new namespace is created when a process calls clone() with namespace flags or unshare()
Usage: The namespace exists while at least one process belongs to it
Destruction: When the last process leaves the namespace and no bind mount or open file descriptor still references it, the kernel destroys it and releases its resources

This means that a container's namespaces naturally clean up when all container processes terminate—no explicit garbage collection needed.


Persistent Namespaces (Bind-Mount Trick)

# Create a new network namespace without any process
ip netns add my_namespace

# This actually does:
# 1. unshare(CLONE_NEWNET) in a helper process
# 2. Bind mounts /proc/self/ns/net to /var/run/netns/my_namespace
# 3. Helper process exits, but namespace persists due to bind mount

The bind mount holds a reference to the namespace object, preventing destruction even with zero processes. Removing the bind mount releases this reference.

File Descriptor References

Opening a namespace file (e.g., open("/proc/1234/ns/net", O_RDONLY)) also holds a reference. This is used by orchestrators to:

Hold namespaces while reconfiguring them
Pass namespace references between processes via socket SCM_RIGHTS
Inspect namespace properties without joining them
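
A descriptor-based reference can be sketched as follows. The example assumes Linux 4.11 or later, where the `NS_GET_NSTYPE` ioctl from `<linux/nsfs.h>` is available; the helper names `ns_open` and `ns_type` are hypothetical. It takes the `CLONE_NEW*` constants from the kernel UAPI header `<linux/sched.h>` so that no feature-test macro is needed.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/nsfs.h>    /* NS_GET_NSTYPE (Linux 4.11+) */
#include <linux/sched.h>   /* CLONE_NEW* constants from the kernel UAPI */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open /proc/<pid>/ns/<type>. While the returned descriptor stays open,
 * the namespace cannot be destroyed, even if every process in it exits.
 * Returns the descriptor, or -1 on failure. */
static int ns_open(pid_t pid, const char *type)
{
    char path[64];

    assert(type != NULL);
    snprintf(path, sizeof path, "/proc/%d/ns/%s", (int)pid, type);
    return open(path, O_RDONLY);
}

/* Ask the kernel which kind of namespace the descriptor refers to,
 * without joining it. Returns a CLONE_NEW* value, or -1 on failure. */
static int ns_type(int ns_fd)
{
    return ioctl(ns_fd, NS_GET_NSTYPE);
}
```

A caller might use `ns_open(pid, "net")` to pin a container's network namespace, confirm with `ns_type(fd) == CLONE_NEWNET` that the descriptor is what it expects, and later pass the same descriptor to setns() or send it to another process over a UNIX socket.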

Namespace Resource Leaks

Orphaned Namespace Handling

Container init process (PID 1 inside, PID 12345 outside) crashes
Kernel iterates through all processes in that PID namespace
Each process receives SIGKILL
Once all processes exit, the namespace is destroyed

This reaping behavior prevents orphaned zombie containers.

Security Considerations

What Namespaces Protect Against

When properly configured, namespaces prevent containerized processes from:

Seeing host processes (PID namespace)
Accessing host network interfaces (network namespace)
Modifying host filesystem mounts (mount namespace)
Signaling host processes (PID namespace)
Using host IPC mechanisms (IPC namespace)
Binding to host ports (network namespace)

Common Security Pitfalls

•Shared namespaces: Running containers with --network=host or --pid=host negates isolation for that resource type
•Privileged containers: Using --privileged grants all capabilities and access to all host devices, defeating namespace isolation
•Excessive capabilities: Capabilities like CAP_SYS_ADMIN can allow namespace escape depending on configuration
•Kernel vulnerabilities: Namespaces are implemented in the kernel; kernel bugs can enable escapes
•/proc and /sys exposure: Improperly mounted procfs/sysfs can leak host information or allow writes

User Namespaces as a Security Boundary

Container root (UID 0) maps to an unprivileged host user (e.g., UID 100000)
Kernel permission checks on resources outside the namespace (for example, host files) use the mapped host UID
Escaping the container yields unprivileged host access
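
The arithmetic behind this mapping is simple enough to sketch. Each line of /proc/[pid]/uid_map holds three numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The helper names below are hypothetical, and the range length of 65536 in the usage note is an assumed, commonly seen value rather than anything the kernel requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One line of a uid_map file: inside-start, outside-start, range length. */
struct id_map_entry {
    unsigned long inside;
    unsigned long outside;
    unsigned long count;
};

/* Parse one uid_map line such as "0 100000 65536".
 * Returns 0 on success, -1 if the line is malformed. */
static int parse_id_map_line(const char *line, struct id_map_entry *entry)
{
    assert(line != NULL && entry != NULL);
    if (sscanf(line, "%lu %lu %lu",
               &entry->inside, &entry->outside, &entry->count) != 3)
        return -1;
    return 0;
}

/* Translate a UID as seen inside the namespace into the UID seen outside.
 * Returns 0 and stores the result, or -1 if no range covers the UID. */
static int map_uid_to_host(const struct id_map_entry *map, size_t n,
                           unsigned long inside_uid, unsigned long *host_uid)
{
    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (inside_uid >= map[i].inside &&
            inside_uid - map[i].inside < map[i].count) {
            *host_uid = map[i].outside + (inside_uid - map[i].inside);
            return 0;
        }
    }
    return -1;
}
```

With the single entry "0 100000 65536", container UID 0 translates to host UID 100000 and container UID 1000 to host UID 101000, while UID 65536 falls outside the range and has no mapping.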

Capability Scoping

This scoping enables fine-grained privilege delegation:

Grant CAP_SYS_ADMIN only within a user namespace
Allow mount operations only within a mount namespace
Permit network configuration only within a network namespace

Namespaces Are Not Virtual Machines

Summary: Linux Namespaces

We've covered substantial ground in understanding Linux namespaces. Let's consolidate the key takeaways:

Key Takeaways

•Namespaces are isolation primitives — They partition global system resources so processes see isolated views of the system
•Eight namespace types exist — Mount, UTS, IPC, PID, Network, User, Cgroup, and Time, each isolating a specific resource
•Namespaces are composable — A process belongs to one instance of each namespace type; containers combine multiple namespaces
•Three system calls manage namespaces — clone() creates processes in new namespaces, unshare() moves the current process, setns() joins existing namespaces
•Namespaces are reference-counted — They persist while processes or bind mounts reference them, then are automatically destroyed
•User namespaces enable rootless containers — UID mapping allows unprivileged users to create and administer isolated environments
•Security requires defense in depth — Namespaces must be combined with capabilities, seccomp, and LSMs for robust isolation
