While namespaces answer the question "What can a process see?", cgroups (control groups) answer an equally critical question: "How much can a process consume?"
Without cgroups, a containerized process could monopolize CPU time, exhaust system memory, or saturate disk I/O—bringing down not just itself, but the entire host and all other containers. Cgroups provide the resource accounting and limiting mechanisms that make multi-tenant container hosting safe and predictable.
Cgroups are not a container-specific feature—they're a fundamental kernel mechanism for resource management. But containers are their killer application, and understanding cgroups is essential for anyone operating containerized infrastructure.
By the end of this page, you will understand what cgroups are, how they evolved from v1 to v2, the hierarchical structure they form, and the controllers that govern different resource types. You'll learn how to create and configure cgroups, set resource limits, and understand how container runtimes use cgroups to enforce resource constraints.
A control group (cgroup) is a kernel mechanism that organizes processes into hierarchical groups and applies resource management policies to those groups. Unlike namespaces (which provide isolation), cgroups provide:

- Limiting: hard caps on how much of a resource a group may consume
- Prioritization: relative weights that decide allocation under contention
- Accounting: per-group measurement of resource usage
- Control: operations on the whole group, such as freezing and resuming
Every process in the system belongs to exactly one cgroup for each controller (resource type). The cgroups form a hierarchy—a tree structure where child cgroups inherit properties from their parents and can have additional constraints applied.
Think of cgroups as organizational units in a company's budget system. Departments (cgroups) are organized hierarchically. Each department has a budget allocation (resource limits). Employees (processes) belong to departments. The sum of department budgets cannot exceed the parent organization's budget (hierarchical limits).
Controllers: The Resource Managers
Cgroups themselves are just organizational structures—the actual resource management is performed by controllers (also called subsystems). Each controller manages a specific resource type: cpu for processor time, cpuset for core and NUMA placement, memory for RAM, io for block devices, pids for process counts, and several more covered in the controller summary table later on this page.
A cgroup can have multiple controllers attached, providing comprehensive resource management from a single hierarchy.
The Hierarchy Structure
Cgroups form a tree where:

- The root cgroup contains every process by default
- Child cgroups are created as subdirectories and inherit their parents' constraints
- A process is placed in a cgroup by writing its PID to that cgroup's cgroup.procs file

cgroup v1 vs cgroup v2

Linux has two cgroup implementations: the original cgroup v1 (introduced in kernel 2.6.24, 2008) and the redesigned cgroup v2 (introduced in kernel 4.5, 2016, maturing through the 5.x series). Understanding both is necessary because many production systems still use v1, while v2 is the future.
cgroup v1: Multiple Hierarchies
In v1, each controller has its own independent hierarchy. A process belongs to one cgroup in the CPU hierarchy, a potentially different cgroup in the memory hierarchy, and another in the I/O hierarchy.
```
/sys/fs/cgroup/
├── cpu/                      # CPU controller hierarchy
│   ├── docker/
│   │   ├── container-a/
│   │   │   ├── cpu.cfs_period_us
│   │   │   ├── cpu.cfs_quota_us
│   │   │   └── tasks
│   │   └── container-b/
│   └── system.slice/
├── memory/                   # Separate memory hierarchy
│   ├── docker/
│   │   ├── container-a/
│   │   │   ├── memory.limit_in_bytes
│   │   │   └── tasks
│   │   └── container-b/
│   └── system.slice/
├── blkio/                    # Separate block I/O hierarchy
│   └── docker/
└── pids/                     # Separate PID limit hierarchy
    └── docker/
```

This multi-hierarchy approach has significant problems:
- A process can sit at /cpu/docker/container-a in one hierarchy but at /memory/system.slice/sshd in another—confusing and error-prone

cgroup v2: Unified Hierarchy
Cgroup v2 mandates a single, unified hierarchy. All controllers share the same tree structure, and a process's position in the hierarchy determines its constraints across all controllers.
```
/sys/fs/cgroup/                # Unified hierarchy root
├── cgroup.controllers         # Available controllers
├── cgroup.subtree_control     # Controllers enabled for subtree
├── docker/
│   ├── container-a/
│   │   ├── cgroup.controllers
│   │   ├── cpu.max            # CPU limit (replaces quota/period)
│   │   ├── memory.max         # Memory limit
│   │   ├── io.max             # I/O limit
│   │   ├── pids.max           # PID limit
│   │   └── cgroup.procs       # Member processes
│   └── container-b/
│       ├── cpu.max
│       ├── memory.max
│       └── ...
└── system.slice/
    └── sshd.service/
```

Key cgroup v2 Improvements
- Pressure Stall Information (PSI): cpu.pressure, memory.pressure, and io.pressure provide standardized pressure metrics
- Thread granularity: cgroup.type can be set to threaded for per-thread control

| Aspect | cgroup v1 | cgroup v2 |
|---|---|---|
| Hierarchy | Multiple (per-controller) | Single (unified) |
| Mount point | /sys/fs/cgroup/<controller> | /sys/fs/cgroup |
| Process placement | Can differ per controller | Same for all controllers |
| CPU limit file | cpu.cfs_quota_us + cpu.cfs_period_us | cpu.max (quota period) |
| Memory limit file | memory.limit_in_bytes | memory.max |
| I/O limit file | blkio.throttle.* | io.max |
| Pressure metrics | Not standardized | PSI (cpu/memory/io.pressure) |
| Delegation | Complex, error-prone | Well-defined rules |
Some systems run in 'hybrid' mode with v2 for some controllers and v1 for others. This is a transitional configuration that should be avoided for new deployments. A controller can only be attached to one hierarchy version at a time.
Cgroups are managed through a pseudo-filesystem mounted at /sys/fs/cgroup. Creating a cgroup is as simple as creating a directory; configuring it involves writing to files in that directory.
Creating a cgroup (v2)
```bash
#!/bin/bash
# Create a cgroup v2 with CPU and memory limits

# 1. Create the cgroup by making a directory
mkdir -p /sys/fs/cgroup/my_container

# 2. Enable controllers for this cgroup's children
# (Controllers must be enabled at each level of the hierarchy)
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control

# 3. Configure CPU limit: 50% of one CPU (50000 microseconds per 100000)
echo "50000 100000" > /sys/fs/cgroup/my_container/cpu.max

# 4. Configure memory limit: 512MB hard limit
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/my_container/memory.max

# 5. Configure memory soft limit (for reclaim under pressure)
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/my_container/memory.high

# 6. Configure PID limit: max 100 processes
echo 100 > /sys/fs/cgroup/my_container/pids.max

# 7. Move a process into the cgroup
echo $$ > /sys/fs/cgroup/my_container/cgroup.procs

# 8. Verify membership
cat /proc/$$/cgroup
# Output: 0::/my_container
```

The subtree_control Mechanism
In cgroup v2, controllers must be explicitly enabled for a cgroup's subtree before child cgroups can use them. This is done via the cgroup.subtree_control file:
# At the root, enable controllers
echo "+cpu +memory" > /sys/fs/cgroup/cgroup.subtree_control
# Now children can use cpu and memory controllers
mkdir /sys/fs/cgroup/child
echo "max 100000" > /sys/fs/cgroup/child/cpu.max # Works!
This explicit enablement keeps controller overhead out of subtrees that don't need it and makes delegation boundaries explicit: child cgroups can only use controllers their parent has enabled.
Moving Processes Between cgroups
Processes are moved by writing their PID to the target cgroup's cgroup.procs file:
# Move process 1234 to my_container cgroup
echo 1234 > /sys/fs/cgroup/my_container/cgroup.procs
# Move all threads of a process (cgroup v2)
echo 1234 > /sys/fs/cgroup/my_container/cgroup.procs
# (All threads move together by default in v2)
# Check which cgroup a process is in
cat /proc/1234/cgroup
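On a pure cgroup v2 system, /proc/PID/cgroup contains a single line of the form 0::&lt;path&gt;. A small helper can pull out just the path; this is a sketch, and the function name is ours, not a standard tool:

```shell
#!/bin/bash
# Parse the cgroup v2 entry ("0::<path>") out of /proc/<pid>/cgroup content.
# The function name and structure are illustrative.
cgroup_v2_path() {
  # $1: the contents of /proc/<pid>/cgroup
  local line
  line=$(printf '%s\n' "$1" | grep '^0::')   # v2 entries use hierarchy ID 0
  printf '%s\n' "${line#0::}"                # strip the "0::" prefix
}

# Example: where does the current shell live?
cgroup_v2_path "$(cat /proc/$$/cgroup)"
```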
The No Internal Processes Rule (v2)
Cgroup v2 enforces the "no internal processes" rule: a cgroup with configured controllers cannot have both processes AND child cgroups. Processes must be in leaf cgroups.
/sys/fs/cgroup/
└── parent/ # Has subtree_control configured
├── cgroup.procs # Must be empty if controllers enabled for subtree
├── child-a/
│ └── cgroup.procs # Processes go here (leaf)
└── child-b/
└── cgroup.procs # Or here (leaf)
This rule simplifies resource distribution calculations and prevents ambiguous accounting scenarios.
Removing cgroups
Cgroups are removed by removing their directory, but only if empty:
# This fails if cgroup has processes or children
rmdir /sys/fs/cgroup/my_container
# First, move all processes out
for pid in $(cat /sys/fs/cgroup/my_container/cgroup.procs); do
echo $pid > /sys/fs/cgroup/cgroup.procs
done
# Then remove
rmdir /sys/fs/cgroup/my_container
The CPU controller manages how much CPU time processes in a cgroup receive. It supports both limiting (hard caps) and weighting (relative shares).
CPU Limiting (CFS Bandwidth)
The Completely Fair Scheduler (CFS) implements bandwidth limiting via quota and period:

- period: the length of the scheduling window, in microseconds (100000, i.e. 100ms, by default)
- quota: how much CPU time the cgroup may consume within each period
In cgroup v2, these are combined in cpu.max:
```bash
# cpu.max format: "$quota $period" (microseconds)

# Limit to 50% of one CPU (50ms every 100ms)
echo "50000 100000" > /sys/fs/cgroup/container/cpu.max

# Limit to 200% (2 full CPUs worth)
echo "200000 100000" > /sys/fs/cgroup/container/cpu.max

# Limit to 25% of one CPU
echo "25000 100000" > /sys/fs/cgroup/container/cpu.max

# No limit (default)
echo "max 100000" > /sys/fs/cgroup/container/cpu.max

# Check current CPU limit
cat /sys/fs/cgroup/container/cpu.max
# Output: 50000 100000
```

CPU Weighting (Shares)
When multiple cgroups compete for CPU, the cpu.weight (v2) or cpu.shares (v1) determines relative allocation:
# Default weight is 100
# Double the default weight = 2x relative CPU access
echo 200 > /sys/fs/cgroup/container-a/cpu.weight
echo 100 > /sys/fs/cgroup/container-b/cpu.weight
# container-a gets 2/3 of contested CPU time
# container-b gets 1/3 of contested CPU time
Important: weights only matter when CPU is contested. If container-b is idle, container-a gets all available CPU regardless of weights.
Limits (cpu.max) are hard caps—the kernel throttles processes that exceed their quota. Weights (cpu.weight) only affect distribution when CPU is contested—an idle-weighted container costs nothing. Use limits for billing and isolation guarantees; use weights for fair sharing.
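The quota/period arithmetic behind cpu.max is easy to script. The sketch below (helper names are ours) converts a CPU count such as 1.5, or a Kubernetes-style millicore quantity such as 500m, into the string you would write to cpu.max:

```shell
#!/bin/bash
# Convert a desired CPU share into a cgroup v2 "cpu.max" value.
# quota = cpus * period; period defaults to 100000 microseconds (100ms).
# Helper names are illustrative, not part of any standard tool.
PERIOD_US=100000

cpu_max_from_cpus() {
  # $1: number of CPUs, possibly fractional, e.g. "1.5"
  local quota
  quota=$(awk -v c="$1" -v p="$PERIOD_US" 'BEGIN { printf "%d", c * p }')
  echo "$quota $PERIOD_US"
}

cpu_max_from_millicores() {
  # $1: Kubernetes-style quantity, e.g. "500m" (0.5 CPU) or "2" (2 CPUs)
  local q=$1
  case $q in
    *m) cpu_max_from_cpus "$(awk -v m="${q%m}" 'BEGIN { print m / 1000 }')" ;;
    *)  cpu_max_from_cpus "$q" ;;
  esac
}

cpu_max_from_cpus 1.5          # -> 150000 100000
cpu_max_from_millicores 500m   # -> 50000 100000
```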
CPU Throttling Mechanics
When a cgroup exhausts its quota before the period ends, the kernel throttles all runnable processes in that cgroup. Throttled processes are paused until the next period begins and quota is replenished.
Throttling is visible in cgroup statistics:
cat /sys/fs/cgroup/container/cpu.stat
# usage_usec 1234567890 # Total CPU time used
# user_usec 1000000000 # User-space CPU time
# system_usec 234567890 # Kernel CPU time
# nr_periods 12345 # Number of periods elapsed
# nr_throttled 234 # Number of times throttled
# throttled_usec 5678900 # Total time spent throttled
High nr_throttled or throttled_usec values indicate the cgroup's CPU limit is too low for its workload—it's regularly being paused mid-computation.
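One way to put a number on that is the fraction of elapsed periods that ended in throttling. A minimal sketch (the function name is ours) that computes it from a cpu.stat dump:

```shell
#!/bin/bash
# Compute the percentage of CFS periods in which a cgroup was throttled,
# from the text of its cpu.stat file. Function name is illustrative.
throttle_pct() {
  # $1: contents of cpu.stat
  printf '%s\n' "$1" | awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f%%\n", 100 * throttled / periods
      else             print "0.0%"
    }'
}

# Example with the sample numbers above:
stat="nr_periods 12345
nr_throttled 234"
throttle_pct "$stat"   # -> 1.9%
```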
The cpuset Controller
The cpuset controller constrains which CPUs (cores) and which memory nodes (NUMA) a cgroup can use:
# Pin cgroup to CPUs 0-1 (cores 0 and 1 only)
echo "0-1" > /sys/fs/cgroup/container/cpuset.cpus
# Pin to NUMA node 0 for memory allocation
echo "0" > /sys/fs/cgroup/container/cpuset.mems
Cpuset is critical for latency-sensitive workloads that must avoid core migrations, for NUMA-aware applications that need local memory, and for dedicating cores to specific containers.
Kubernetes CPU Requests and Limits
Kubernetes' CPU configuration maps directly to cgroups:
resources:
requests:
cpu: "500m" # Converts to cpu.weight proportional share
limits:
cpu: "2" # Converts to cpu.max: 200000 100000 (2 cores)
- requests.cpu affects scheduling decisions and sets cpu.shares (v1) / cpu.weight (v2)
- limits.cpu sets the cpu.max quota for hard throttling

The memory controller is arguably the most critical for container stability. It tracks memory usage, enforces limits, and handles out-of-memory conditions for cgroups.
Memory Accounting
The memory controller accounts for:

- Anonymous memory (heap and stack allocations)
- Page cache (file-backed pages)
- Kernel memory (slab allocations, kernel stacks)
- Socket buffers and shared memory
In cgroup v2, comprehensive accounting is available via memory.stat:
```bash
$ cat /sys/fs/cgroup/container/memory.stat
anon 52428800              # Anonymous memory (50 MB)
file 26214400              # Page cache (25 MB)
kernel 4194304             # Kernel memory (4 MB)
sock 8192                  # Socket buffers
shmem 0                    # Shared memory
file_mapped 10485760       # Mapped file pages
file_dirty 4096            # Dirty file pages
file_writeback 0           # Pages being written back
slab 2097152               # Slab allocator
pgfault 1234567            # Page faults
pgmajfault 42              # Major page faults (disk reads)
workingset_refault 12345   # Refaults from working set
workingset_activate 6789   # Working set activations
oom_kill 0                 # OOM kills in this cgroup
```

Memory Limits
Cgroup v2 provides four limit and protection levels:
```bash
# Hard limit: 1GB - OOM if exceeded
echo $((1024 * 1024 * 1024)) > /sys/fs/cgroup/container/memory.max

# Soft limit: 768MB - Aggressive reclaim above this
echo $((768 * 1024 * 1024)) > /sys/fs/cgroup/container/memory.high

# Protected minimum: 256MB - Won't be reclaimed below this
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/container/memory.min

# Soft minimum: 512MB - Protected unless system under severe pressure
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/container/memory.low

# Check current memory usage
cat /sys/fs/cgroup/container/memory.current
# Output: 52428800 (50 MB in bytes)
```

The OOM Killer
When a cgroup reaches its memory.max and cannot reclaim enough memory, the kernel invokes the OOM killer within that cgroup. The OOM killer selects and terminates processes to free memory, considering each candidate's memory footprint (its oom_score) and any oom_score_adj adjustment set by the administrator.
In containerized environments, OOM kills are confined to the offending cgroup—they don't kill processes in other containers or the host. This is essential for isolation.
OOM events are logged and visible:
# Watch for OOM events
cat /sys/fs/cgroup/container/memory.events
# low 0 # Times memory dropped below memory.low
# high 42 # Times memory exceeded memory.high
# max 3 # Times memory hit memory.max
# oom 1 # OOM events
# oom_kill 1 # Processes killed by OOM
# Check cgroup OOM kill count
cat /sys/fs/cgroup/container/memory.stat | grep oom_kill
If the container's main process (PID 1 inside the container) is OOM-killed, the entire container dies. Memory limits should allow headroom, and critical applications should monitor memory.events for early warning of memory pressure.
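A monitoring script can poll memory.events and raise the alarm before the max and oom_kill counters move. A minimal sketch (the function name is ours; field meanings follow the cgroup v2 documentation):

```shell
#!/bin/bash
# Summarize a cgroup's memory.events contents, flagging signs of pressure.
# The function is an illustrative sketch, not a standard tool.
memory_events_summary() {
  # $1: contents of memory.events
  printf '%s\n' "$1" | awk '
    $1 == "high"     { high = $2 }
    $1 == "max"      { max = $2 }
    $1 == "oom_kill" { kills = $2 }
    END {
      if (kills > 0)     print "CRITICAL: " kills " OOM kill(s)"
      else if (max > 0)  print "WARNING: hit memory.max " max " time(s)"
      else if (high > 0) print "NOTICE: exceeded memory.high " high " time(s)"
      else               print "OK"
    }'
}

# Example with the sample counters above:
events="low 0
high 42
max 3
oom 1
oom_kill 1"
memory_events_summary "$events"   # -> CRITICAL: 1 OOM kill(s)
```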
Memory Reclaim Under Pressure
Before invoking OOM, the kernel attempts reclaim:

- Dropping clean page cache (cheap—pages can be re-read from disk)
- Writing back dirty pages so they too can be reclaimed
- Swapping out anonymous memory, if swap is available to the cgroup
When approaching memory.high, the kernel aggressively reclaims, which throttles the cgroup. Applications see increased latency as the kernel works to free memory. Monitoring memory.high events helps identify workloads that need more memory or optimization.
Swap Behavior
Cgroup v2's memory.swap.max controls swap usage per-cgroup:
# Allow up to 512MB swap
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/container/memory.swap.max
# Disable swap for this cgroup (common for containers)
echo 0 > /sys/fs/cgroup/container/memory.swap.max
# Check current swap usage
cat /sys/fs/cgroup/container/memory.swap.current
Containers typically disable swap to keep latency predictable (no transparent slowdown from swapping), to make OOM behavior deterministic, and to keep memory accounting honest against the configured limit.
Beyond CPU and memory, cgroups provide controllers for I/O, process count, devices, and more.
I/O Controller
The I/O controller (cgroup v2's io or v1's blkio) manages block device I/O bandwidth and IOPS:
# Limit I/O to 10 MB/s read, 5 MB/s write for device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/container/io.max
# Limit IOPS: 100 read IOPS, 50 write IOPS
echo "8:0 riops=100 wiops=50" > /sys/fs/cgroup/container/io.max
# Combine bandwidth and IOPS limits
echo "8:0 rbps=10485760 wbps=5242880 riops=1000 wiops=500" > /sys/fs/cgroup/container/io.max
# Check I/O statistics
cat /sys/fs/cgroup/container/io.stat
# 8:0 rbytes=1234567 wbytes=7654321 rios=100 wios=50 dbytes=0 dios=0
Device identifiers (8:0) are major:minor numbers. Find them with:
ls -la /dev/sda # Shows major:minor
cat /sys/block/sda/dev # Shows major:minor
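Building the io.max line can be scripted as well. This sketch (the helper name is ours) takes a device's major:minor plus read/write limits in MB/s and emits the string you would write:

```shell
#!/bin/bash
# Build a cgroup v2 io.max configuration line from human-friendly MB/s values.
# Helper name is illustrative; io.max expects bytes per second.
io_max_line() {
  # $1: device major:minor (e.g. "8:0")
  # $2: read limit in MB/s, $3: write limit in MB/s
  local rbps=$(( $2 * 1024 * 1024 ))
  local wbps=$(( $3 * 1024 * 1024 ))
  echo "$1 rbps=$rbps wbps=$wbps"
}

io_max_line 8:0 10 5   # -> 8:0 rbps=10485760 wbps=5242880
# Would be applied with something like:
#   io_max_line 8:0 10 5 > /sys/fs/cgroup/container/io.max
```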
I/O Weight (Proportional Sharing)
Like CPU weights, I/O weights determine relative bandwidth when devices are contested:
# Default weight is 100, range is 1-10000
echo "8:0 200" > /sys/fs/cgroup/container-a/io.weight
echo "8:0 100" > /sys/fs/cgroup/container-b/io.weight
# container-a gets 2x the I/O bandwidth of container-b when contested
PIDs Controller
The PIDs controller limits the number of processes (and threads) in a cgroup, preventing fork bombs:
# Limit to 100 processes
echo 100 > /sys/fs/cgroup/container/pids.max
# Check current process count
cat /sys/fs/cgroup/container/pids.current
# Output: 42
# Check if limit was ever hit
cat /sys/fs/cgroup/container/pids.events
# max 0 (number of times fork failed due to limit)
The PIDs controller is essential for multi-tenant systems. Without it, a malicious or buggy container could fork-bomb the entire host. Kubernetes and Docker set pids.max by default.
Devices Controller
The devices controller (v1, being replaced by eBPF in v2) whitelists which devices a cgroup can access:
# In cgroup v1:
# Deny all devices by default
echo "a" > /sys/fs/cgroup/devices/container/devices.deny
# Allow /dev/null (char 1:3)
echo "c 1:3 rwm" > /sys/fs/cgroup/devices/container/devices.allow
# Allow /dev/urandom (char 1:9)
echo "c 1:9 r" > /sys/fs/cgroup/devices/container/devices.allow
Format: [type] [major:minor] [permissions]
- Type: c (char), b (block), or a (all)
- Major:minor: device numbers (* acts as a wildcard)
- Permissions: any of r (read), w (write), m (mknod)

Freezer Controller
The freezer controller suspends and resumes all processes in a cgroup:
# Freeze (suspend) all processes
echo 1 > /sys/fs/cgroup/container/cgroup.freeze
# Resume all processes
echo 0 > /sys/fs/cgroup/container/cgroup.freeze
# Check frozen state
cat /sys/fs/cgroup/container/cgroup.freeze
Used for container pause/unpause (docker pause uses the freezer), taking consistent snapshots, and checkpoint/restore workflows such as CRIU.
| Controller | Controls | Key Files (v2) | Primary Use |
|---|---|---|---|
| cpu | CPU time | cpu.max, cpu.weight | Limit/share CPU |
| cpuset | CPU/NUMA affinity | cpuset.cpus, cpuset.mems | Pin to cores |
| memory | Memory usage | memory.max, memory.high | Limit RAM |
| io | Block I/O | io.max, io.weight | Limit disk I/O |
| pids | Process count | pids.max | Prevent fork bomb |
| devices | Device access | (v1: devices.allow) | Device whitelist |
| freezer | Process execution | cgroup.freeze | Suspend/resume |
| hugetlb | Huge pages | hugetlb.*.max | Limit huge pages |
Container runtimes (Docker, containerd, CRI-O, Podman) abstract cgroup management, but understanding the mapping helps with troubleshooting and optimization.
Docker cgroup Management
When you run a Docker container with resource limits:
docker run -d \
  --name myapp \
  --cpus="1.5" \
  --memory="512m" \
  --memory-swap="512m" \
  --pids-limit=100 \
  nginx
Docker creates a cgroup (the path depends on cgroup driver) and sets:
| Docker Flag | cgroup v2 File | Value |
|---|---|---|
--cpus=1.5 | cpu.max | 150000 100000 |
--memory=512m | memory.max | 536870912 |
--memory-swap=512m | memory.swap.max | 0 (same as mem = no swap) |
--pids-limit=100 | pids.max | 100 |
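The size conversion Docker performs for --memory can be reproduced in a few lines. A sketch (the function name is ours) mapping k/m/g suffixes to the byte values that end up in memory.max:

```shell
#!/bin/bash
# Convert Docker-style size suffixes (k/m/g) into the byte value
# written to memory.max. Function name is illustrative.
to_bytes() {
  local v=$1
  case $v in
    *k|*K) echo $(( ${v%?} * 1024 )) ;;
    *m|*M) echo $(( ${v%?} * 1024 * 1024 )) ;;
    *g|*G) echo $(( ${v%?} * 1024 * 1024 * 1024 )) ;;
    *)     echo "$v" ;;   # bare number: already bytes
  esac
}

to_bytes 512m   # -> 536870912  (matches the memory.max value in the table)
to_bytes 1g     # -> 1073741824
```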
Finding a Container's cgroup
# Get container's cgroup path
docker inspect myapp --format '{{.HostConfig.CgroupParent}}'
# Or check via the container's main PID
PID=$(docker inspect myapp --format '{{.State.Pid}}')
cat /proc/$PID/cgroup
# For cgroup v2, typically:
# 0::/system.slice/docker-<container-id>.scope
# Navigate to cgroup
cd /sys/fs/cgroup/system.slice/docker-$(docker inspect myapp -f '{{.Id}}').scope
ls
# cpu.max cpu.stat memory.current memory.max pids.current ...
Kubernetes cgroup Structure
Kubernetes organizes cgroups hierarchically:
/sys/fs/cgroup/
└── kubepods.slice/ # All pod cgroups
├── kubepods-burstable.slice/ # Burstable QoS pods
│ └── kubepods-burstable-pod<uid>.slice/
│ └── cri-containerd-<cid>.scope/ # Container cgroup
├── kubepods-besteffort.slice/ # BestEffort QoS pods
└── kubepods-guaranteed.slice/ # Guaranteed QoS pods
This structure enables per-QoS-class resource distribution, per-pod accounting, and reserving node resources for system daemons outside kubepods.slice.
systemd cgroup Integration
Modern Linux systems use systemd as init, and systemd manages cgroups via slices, scopes, and services:

- Slices (*.slice) are hierarchy nodes that group other units and carry resource settings
- Scopes (*.scope) wrap processes started outside systemd (containers, user sessions)
- Services (*.service) are processes that systemd itself starts and supervises
Docker can use the systemd-cgroup driver (recommended for systemd-based hosts):
// /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
This delegates cgroup management to systemd, ensuring consistency between systemd's view and Docker's view of resource usage.
Debugging Resource Issues
When containers misbehave, check cgroup stats:
# Check if CPU throttled
cat /sys/fs/cgroup/.../cpu.stat | grep throttled
# Check memory pressure
cat /sys/fs/cgroup/.../memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# Check for OOM events
cat /sys/fs/cgroup/.../memory.events | grep oom
# Real-time monitoring with bpftrace or systemd-cgtop
systemd-cgtop
Pressure Stall Information (PSI) in cgroup v2 provides standardized metrics for resource contention. Monitor cpu.pressure, memory.pressure, and io.pressure for early warning of resource exhaustion before OOM or throttling becomes severe.
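PSI lines share a fixed format, so extracting a single average is straightforward. A sketch (the function name is ours) that pulls avg10 from the some line of any *.pressure file's contents:

```shell
#!/bin/bash
# Extract the 10-second "some" pressure average from PSI file contents
# (cpu.pressure / memory.pressure / io.pressure share this format).
# Function name is illustrative.
psi_some_avg10() {
  # $1: contents of a *.pressure file
  printf '%s\n' "$1" | awk '
    $1 == "some" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }'
}

pressure="some avg10=1.25 avg60=0.80 avg300=0.40 total=123456
full avg10=0.00 avg60=0.00 avg300=0.00 total=0"
psi_some_avg10 "$pressure"   # -> 1.25
```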
We've explored Linux control groups comprehensively. Let's consolidate the key takeaways:

- Namespaces control what a process can see; cgroups control how much it can consume
- cgroup v2 replaces v1's per-controller hierarchies with a single unified tree and standardized files (cpu.max, memory.max, io.max, pids.max)
- Controllers enforce limits, weights, and accounting; PSI metrics give early warning of contention
- Container runtimes translate flags like --cpus and --memory directly into writes to cgroup files
What's next:
We now understand the two foundational container primitives: namespaces for isolation and cgroups for resource control. The next page focuses on resource limiting in practice—how to calculate appropriate limits, common patterns (requests vs limits), overcommitment strategies, and real-world tuning based on workload characteristics. We'll connect the theoretical cgroup knowledge to practical container sizing decisions.
You now understand Linux control groups—the resource management primitive that complements namespaces for containerization. You know how to create cgroups, configure controllers, set limits, and interpret statistics. Next, we'll apply this knowledge to practical resource limiting scenarios.