Understanding how cgroups work is only half the battle. The harder question is: What limits should I actually set? Too low, and your application is constantly throttled or OOM-killed. Too high, and you waste resources or risk one container starving others.
Effective resource limiting requires understanding your workload's characteristics, the difference between requests and limits, overcommitment tradeoffs, and monitoring feedback loops. This page bridges cgroup mechanics with practical container sizing decisions.
By the end of this page, you will understand how to analyze workload resource requirements, the requests vs limits model used by Kubernetes, CPU and memory limit calculation strategies, overcommitment ratios, QoS classes, and how to use monitoring data to iteratively tune resource configurations.
Before setting limits, you must understand your workload's resource consumption patterns. Different workload types have dramatically different profiles.
CPU-Bound vs I/O-Bound Workloads
Steady-State vs Bursty Workloads
| Workload Type | CPU Pattern | Memory Pattern | Sizing Strategy |
|---|---|---|---|
| Web API Server | Bursty, request-driven | Stable (connection pools) | Size for P99 latency targets |
| Background Worker | Steady, queue-driven | Varies with batch size | Size for throughput; limit to prevent starvation |
| Database | Bursty, query-driven | Stable (buffer pool) | High memory; CPU for concurrent queries |
| ML Inference | High, steady per request | High (model in memory) | Fixed high CPU/memory; horizontal scale |
| Cache (Redis) | Low (network-bound) | Very high (dataset) | Minimal CPU; memory is primary constraint |
| Build/Compile | Very high, sustained | Moderate (compiler state) | Maximum CPU; memory for parallel compiles |
Measuring Workload Requirements
Before setting production limits, measure actual usage under realistic load:
# Monitor container resource usage with docker stats
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# For Kubernetes, use kubectl top
kubectl top pods --containers
# For detailed cgroup metrics
cat /sys/fs/cgroup/<path>/cpu.stat
cat /sys/fs/cgroup/<path>/memory.current
cat /sys/fs/cgroup/<path>/memory.stat
# Use Prometheus with cAdvisor for time-series data
# Key metrics:
# - container_cpu_usage_seconds_total
# - container_memory_working_set_bytes
# - container_cpu_cfs_throttled_seconds_total
Collect data across:
Startup is often the most resource-intensive phase—applications loading data, warming caches, or compiling JIT code may need temporarily higher limits.
Running without limits initially (in a safe environment) while monitoring provides the baseline data needed for informed limit setting. Many organizations run new workloads with monitoring but no hard limits for a burn-in period, then set limits based on observed P99 usage plus headroom.
Kubernetes popularized the requests/limits model, which maps directly to cgroup configurations but provides a higher-level abstraction for resource management.
Requests: Guaranteed Resources
A pod's requests specify the minimum resources it needs to run properly:
Limits: Maximum Resources
A pod's limits specify the maximum resources it can consume:
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        cpu: "500m"      # 0.5 CPU cores guaranteed
        memory: "256Mi"  # 256 MB guaranteed
      limits:
        cpu: "2"         # Can burst to 2 cores when available
        memory: "512Mi"  # Hard limit; OOM if exceeded

How Requests/Limits Map to cgroups
| Kubernetes | cgroup v2 File | Effect |
|---|---|---|
| requests.cpu | cpu.weight | Proportional share scaled from the request |
| limits.cpu | cpu.max | "(limit × 100000) 100000" (quota and period in µs) |
| requests.memory | (scheduling only) | Not enforced by cgroup |
| limits.memory | memory.max | Hard limit in bytes |
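To make this mapping concrete, here is a small Python sketch of the millicore-to-cgroup conversion. The helper names are hypothetical; the shares-to-weight formula follows the common cgroup v1-to-v2 conversion used by recent runtimes, and the 100 ms CFS period is the kernel default.

```python
def millicores_to_cpu_max(limit_millicores: int, period_us: int = 100_000) -> str:
    # cpu.max holds "quota period": quota µs of CPU time per period µs
    quota = limit_millicores * period_us // 1000
    return f"{quota} {period_us}"

def millicores_to_cpu_weight(request_millicores: int) -> int:
    # v1 shares = millicores * 1024 / 1000, then mapped onto the v2
    # weight range [1, 10000] (2..262144 shares -> 1..10000 weight)
    shares = max(2, request_millicores * 1024 // 1000)
    return int(1 + ((shares - 2) * 9999) // 262142)

print(millicores_to_cpu_max(2000))    # limits.cpu: "2"   -> "200000 100000"
print(millicores_to_cpu_weight(500))  # requests.cpu: "500m" -> 20
```

Running the helpers against the manifest above shows why a 2-core limit appears as `200000 100000` in `cpu.max`: the container may consume at most 200 ms of CPU time in every 100 ms period (i.e., two full cores).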
The Gap Between Requests and Limits
The ratio of limits to requests determines overcommitment potential:
Example:
requests:
cpu: "100m"
limits:
cpu: "1000m" # 10x overcommit ratio
This pod needs 100m to function but can use up to 1000m if available. On a 4-CPU node, you could schedule 40 such pods by requests, but they could collectively demand 40 CPUs—10x the node's capacity if all burst simultaneously.
CPU overcommit is relatively safe—excess demand causes throttling but not crashes. Memory overcommit is dangerous—if all pods simultaneously need their limits, there's not enough physical memory, triggering OOM kills. Conservative memory limits are essential for stability.
CPU limits require balancing predictability against utilization. Several strategies exist, each with tradeoffs.
Strategy 1: No CPU Limits (Requests Only)
Some organizations (notably Google) recommend setting CPU requests but not limits:
resources:
requests:
cpu: "500m"
# No limits.cpu set
Advantages:
Disadvantages:
Strategy 2: Limits Equal to Requests (Guaranteed)
resources:
requests:
cpu: "1"
limits:
cpu: "1" # Same as request
Advantages:
Disadvantages:
Best for: Latency-sensitive applications where predictability matters more than efficiency.
Strategy 3: Limits Higher Than Requests (Burstable)
resources:
requests:
cpu: "500m"
limits:
cpu: "2" # 4x burst capacity
Advantages:
Disadvantages:
Best for: Batch workloads, bursty applications, cost-optimized environments.
Calculating CPU Limits
Use monitoring data to calculate appropriate limits:
# Collect CPU usage over time (e.g., from Prometheus)
cpu_usage_samples = get_cpu_usage_over_7_days()
# Calculate percentiles
p50 = percentile(cpu_usage_samples, 50)
p99 = percentile(cpu_usage_samples, 99)
max_spike = max(cpu_usage_samples)
# Set request at P50-P75 (typical usage)
request = p50 * 1.2 # 20% headroom
# Set limit at P99 + headroom
limit = p99 * 1.5 # 50% headroom for P99
# Sanity check: limit should handle max observed + margin
if limit < max_spike * 1.1:
limit = max_spike * 1.1
Debugging CPU Throttling
Throttling indicates the limit is too low:
# Check throttling in cgroup stats
cat /sys/fs/cgroup/.../cpu.stat
# nr_throttled 12345 # Number of throttle events
# throttled_usec 78901 # Total time throttled
# In Kubernetes, check container_cpu_cfs_throttled_seconds_total
# If throttle ratio (throttled/usage) > 5-10%, increase limit
Throttling causes latency spikes. A request taking 10ms of CPU spread across a throttled period might actually take 50ms wall-clock time.
CFS bandwidth enforcement uses a 100ms period by default. A pod with limit=100m (10% of one CPU) gets 10ms of CPU time every 100ms. A burst that needs 15ms of CPU runs for 10ms, exhausts its quota, sits throttled for the rest of the period (up to 90ms), and only finishes in the next one. For latency-sensitive workloads, consider requests=limits or no limits.
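That arithmetic can be sketched as a simple model. It assumes a single-threaded burst and ignores scheduler noise; it only illustrates how quota exhaustion stretches wall-clock latency.

```python
def wall_clock_ms(cpu_needed_ms: float, quota_ms: float, period_ms: float = 100) -> float:
    """Estimated wall-clock time for a CPU burst under CFS bandwidth control."""
    elapsed = 0.0
    remaining = cpu_needed_ms
    # Each period we may run quota_ms, then sit throttled until the next period
    while remaining > quota_ms:
        remaining -= quota_ms
        elapsed += period_ms
    return elapsed + remaining

print(wall_clock_ms(15, 10))  # 15 ms of CPU at limit=100m -> 105.0 ms wall clock
```

The same 15ms of work completes in 15ms wall clock with no limit, but takes roughly 105ms at limit=100m: that 7x latency inflation is what the throttling metrics surface.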
Memory limits are less forgiving than CPU limits. Exceeding a CPU limit causes throttling (recoverable). Exceeding a memory limit causes OOM (often fatal). Memory sizing requires careful analysis.
Understanding Memory Types
What counts toward the memory limit?
In cgroup v2's memory.current:
memory.current = RSS + page cache + kernel memory used by cgroup
Kubernetes bases its metrics on memory.current minus inactive file cache (the "working set"), since that cache can be reclaimed under pressure.
Calculating Memory Limits
Memory limits should accommodate:
# Example calculation for a Java web service
MB = 1024 ** 2  # bytes per MiB
jvm_heap_max = 384 * MB # -Xmx setting
jvm_metaspace = 64 * MB # Typical metaspace
jvm_overhead = 50 * MB # JVM internal structures
app_static = 20 * MB # Static data
per_request = 0.5 * MB # Memory per concurrent request
max_concurrent = 50 # Max concurrent requests
fragmentation = 1.15 # 15% fragmentation overhead
working_memory = (
jvm_heap_max +
jvm_metaspace +
jvm_overhead +
app_static +
(per_request * max_concurrent)
)
memory_limit = int(working_memory * fragmentation)
# ≈ 625 MB (543 MB working memory × 1.15)
For non-JVM applications, profile memory under load:
# Track memory over time
while true; do
cat /sys/fs/cgroup/.../memory.current >> memory_log.txt
sleep 5
done
# Analyze for max and P99
sort -n memory_log.txt | tail -1                                       # Max
sort -n memory_log.txt | awk '{v[NR]=$1} END {print v[int(NR*0.99)]}'  # P99
JVMs manage their own heap. A 512MB container with a JVM using -Xmx512m will OOM—the JVM heap is only part of JVM memory. Use -Xmx at 70-80% of container limit, or use -XX:MaxRAMPercentage to auto-size. Always set -XX:+UseContainerSupport (default in recent JDKs) so JVM respects cgroup limits.
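As a rough sizing sketch of the 70-80% guidance above (the 75% fraction is an assumption, not a universal rule), the heap setting can be derived from the container limit:

```python
def jvm_heap_mb(container_limit_mb: int, heap_fraction: float = 0.75) -> int:
    # Reserve the remainder for metaspace, thread stacks, code cache,
    # direct buffers, and other non-heap JVM memory
    return int(container_limit_mb * heap_fraction)

print(f"-Xmx{jvm_heap_mb(512)}m")  # -Xmx384m for a 512Mi limit
```

Equivalently, `-XX:MaxRAMPercentage=75.0` lets the JVM derive the same heap size from the cgroup limit at startup.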
Memory Requests: The Scheduling Guarantee
Memory requests affect both scheduling and runtime behavior: under node memory pressure, pods using more than their memory request are prime candidates for eviction:
# Conservative: request = limit (guaranteed, no eviction risk)
resources:
requests:
memory: "512Mi"
limits:
memory: "512Mi"
# Overcommitted: request < limit (risk of eviction under pressure)
resources:
requests:
memory: "256Mi" # Scheduling guarantee
limits:
memory: "512Mi" # Hard limit
Memory Limit Tuning Signals
Monitor these to tune limits:
| Signal | Indication | Action |
|---|---|---|
| OOM kills (memory.events oom_kill) | Limit too low | Increase limit or optimize app |
| High count of memory.high events | Approaching limit frequently | Increase limit or tune memory.high |
| Memory stable at limit | May be throttling reclaim | Check io.pressure, add headroom |
| Memory well below limit | Over-provisioned | Reduce limit to free cluster capacity |
Kubernetes classifies pods into QoS classes based on their resource configuration. QoS affects eviction priority during node resource pressure.
Guaranteed QoS
A pod is Guaranteed when every container has CPU and memory requests set, and each request equals its corresponding limit:
containers:
- name: app
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "512Mi" # Same as request
cpu: "500m" # Same as request
Characteristics:
Burstable QoS
A pod is Burstable when at least one container has a CPU or memory request or limit, but the pod does not meet the Guaranteed criteria:
containers:
- name: app
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi" # Different from request
cpu: "1" # Different from request
Characteristics:
BestEffort QoS
A pod is BestEffort when no container specifies any CPU or memory requests or limits:
containers:
- name: app
# No resources specified
image: myapp:latest
Characteristics:
| QoS Class | Eviction Priority | Resource Efficiency | Predictability | Use Case |
|---|---|---|---|---|
| Guaranteed | Last (most protected) | Lowest (full reservation) | Highest | Critical production |
| Burstable | Middle | Good (overcommit allowed) | Medium | General workloads |
| BestEffort | First (least protected) | Highest (no reservation) | Lowest | Batch, interruptible |
Eviction Order Under Pressure
When a node experiences memory pressure, the kubelet evicts pods in this order: BestEffort pods first, then Burstable pods exceeding their memory requests, and finally Guaranteed pods (and Burstable pods within their requests) as a last resort.
Within each class, pods using more memory relative to their requests are evicted first.
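That ranking can be sketched as a sort key (simplified; the real kubelet eviction logic also weighs pod priority, and the pod dicts below are hypothetical):

```python
QOS_RANK = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}

def eviction_sort_key(pod: dict) -> tuple:
    # Lower key = evicted sooner: BestEffort class first, then pods
    # whose usage most exceeds their request within each class
    overage = pod["memory_usage"] - pod["memory_request"]
    return (QOS_RANK[pod["qos"]], -overage)

pods = [
    {"name": "db",     "qos": "Guaranteed", "memory_usage": 500, "memory_request": 500},
    {"name": "web",    "qos": "Burstable",  "memory_usage": 400, "memory_request": 200},
    {"name": "batch",  "qos": "BestEffort", "memory_usage": 100, "memory_request": 0},
]
print([p["name"] for p in sorted(pods, key=eviction_sort_key)])  # ['batch', 'web', 'db']
```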
OOM Score Calculation
The Linux OOM killer uses oom_score_adj to prioritize kills. Kubernetes sets this based on QoS:
| QoS Class | oom_score_adj | Meaning |
|---|---|---|
| Guaranteed | -997 | Almost never killed |
| BestEffort | 1000 | Maximum kill priority |
| Burstable | 2 to 999 | Varies by memory usage vs request ratio |
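For Burstable pods, the adjustment scales with the memory request relative to node capacity. The sketch below approximates the kubelet's formula, clamped to the 2-999 range shown above; treat the exact clamping behavior as an assumption.

```python
def burstable_oom_score_adj(mem_request_bytes: int, node_mem_bytes: int) -> int:
    # Larger requests relative to the node -> lower score -> less likely killed
    adj = 1000 - (1000 * mem_request_bytes) // node_mem_bytes
    return min(max(adj, 2), 999)

# A pod requesting 256Mi on an 8GB node
print(burstable_oom_score_adj(256 * 2**20, 8 * 2**30))  # 969
```

A pod requesting nearly the whole node approaches the minimum of 2, so under OOM it is nearly as protected as a Guaranteed pod.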
Choose QoS class intentionally. Use Guaranteed for databases, caches, and critical APIs. Use Burstable for web servers and workers that can tolerate occasional eviction. Use BestEffort only for jobs that can be restarted without consequence.
Overcommitment allows scheduling more workloads than physical resources by betting that not all workloads peak simultaneously. It's a powerful cost optimization tool but requires careful management.
CPU Overcommitment
CPU is "compressible"—excess demand causes throttling, not crashes. CPU overcommitment is common and relatively safe:
Node: 4 CPUs
Pod A: request=1, limit=2
Pod B: request=1, limit=2
Pod C: request=1, limit=2
Pod D: request=1, limit=2
Total requests: 4 CPUs (100% scheduled)
Total limits: 8 CPUs (200% overcommit ratio)
If all pods burst simultaneously:
- Each gets proportional share based on requests
- Pod with request=1 gets 1 CPU even if limit=2
- Throttling occurs but no crashes
Memory Overcommitment
Memory is "incompressible"—excess demand causes OOM. Memory overcommitment is risky:
Node: 8 GB
Pod A: request=2GB, limit=4GB
Pod B: request=2GB, limit=4GB
Pod C: request=2GB, limit=4GB
Pod D: request=2GB, limit=4GB
Total requests: 8 GB (100% scheduled)
Total limits: 16 GB (200% overcommit ratio)
If all pods use their limits:
- Node has only 8 GB physical memory
- OOM killer invoked
- Pods crash, potential cascade failure
Memory overcommitment is the primary cause of production outages in containerized environments. A best practice is requests = limits for memory, eliminating memory overcommit risk. CPU can be overcommitted more aggressively.
Calculating Overcommit Ratios
Determine safe overcommit ratios based on workload diversity:
# If pods have independent load patterns (low correlation)
# Higher overcommit is safer
correlation = calculate_usage_correlation(pod_metrics)
if correlation < 0.3: # Low correlation
safe_cpu_overcommit = 3.0 # 300%
safe_memory_overcommit = 1.2 # 120%
elif correlation < 0.6: # Medium correlation
safe_cpu_overcommit = 2.0 # 200%
safe_memory_overcommit = 1.0 # None
else: # High correlation (e.g., all peak together)
safe_cpu_overcommit = 1.2 # 120%
safe_memory_overcommit = 1.0 # None (too risky)
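The calculate_usage_correlation helper above is left undefined; a minimal sketch using NumPy (assuming pod_metrics is a list of equally-sampled per-pod usage series) could be:

```python
import numpy as np

def calculate_usage_correlation(pod_metrics) -> float:
    """Mean pairwise Pearson correlation across pod usage time series."""
    corr = np.corrcoef(np.asarray(pod_metrics, dtype=float))  # rows = pods
    n = corr.shape[0]
    off_diag = corr[~np.eye(n, dtype=bool)]  # drop the self-correlation diagonal
    return float(off_diag.mean())

# Two pods whose load moves in lockstep (highly correlated -> overcommit is risky)
print(calculate_usage_correlation([[1, 2, 3, 4], [2, 4, 6, 8]]))
```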
Bin Packing Strategies
Bin packing is the scheduler's job—fitting pods onto nodes efficiently:
Kubernetes supports custom scheduling policies:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
plugins:
score:
enabled:
- name: NodeResourcesFit
weight: 1
pluginConfig:
- name: NodeResourcesFit
args:
scoringStrategy:
type: MostAllocated # Tight packing
resources:
- name: cpu
weight: 1
- name: memory
weight: 1
Vertical Pod Autoscaler (VPA) for Right-Sizing
VPA automatically adjusts resource requests/limits based on observed usage:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: my-app-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
updatePolicy:
updateMode: "Auto" # Automatically adjust resources
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 4Gi
VPA collects usage metrics, computes recommendations, and (in Auto mode) evicts pods to apply new resource values. Use "Off" mode initially to get recommendations without changes.
Resource limits should not be set and forgotten. Continuous monitoring and iterative tuning ensures optimal performance and cost efficiency.
Key Metrics to Monitor
# CPU throttling (indicates limit too low)
rate(container_cpu_cfs_throttled_seconds_total[5m])
/ rate(container_cpu_usage_seconds_total[5m]) > 0.1
# Alert if throttling > 10% of usage
# Memory approaching limit (OOM risk)
(container_memory_working_set_bytes
/ container_spec_memory_limit_bytes) > 0.9
# Alert if using > 90% of limit
# OOM kills
increase(kube_pod_container_status_restarts_total[1h]) > 3
# Combined with OOM termination reason
# Resource waste (over-provisioning)
1 - (avg_over_time(container_memory_working_set_bytes[7d])
/ container_spec_memory_limit_bytes) > 0.5
# Alert if using < 50% of limit consistently
Interpreting Pressure Stall Information (PSI)
Cgroup v2's PSI provides direct pressure metrics without needing to correlate multiple signals:
cat /sys/fs/cgroup/.../cpu.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
cat /sys/fs/cgroup/.../memory.pressure
# some avg10=5.23 avg60=2.11 avg300=0.89 total=123456
# full avg10=0.50 avg60=0.20 avg300=0.08 total=12345
PSI thresholds for alerts:
some > 10%: Resource becoming constrained; investigate
some > 25%: Resource is a bottleneck; increase limits
full > 1%: Severe constraint; immediate attention needed
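A small parser for these pressure files makes such thresholds easy to automate. The sample text mirrors the output shown above; the alerting check is a sketch.

```python
def parse_psi(text: str) -> dict:
    """Parse a cgroup v2 *.pressure file into nested dicts."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # "some" or "full", then key=value pairs
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = (
    "some avg10=5.23 avg60=2.11 avg300=0.89 total=123456\n"
    "full avg10=0.50 avg60=0.20 avg300=0.08 total=12345\n"
)
psi = parse_psi(sample)
if psi["some"]["avg10"] > 25 or psi["full"]["avg10"] > 1:
    print("memory pressure: limit is a bottleneck")
```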
#!/usr/bin/env python3
"""Analyze container resource usage and generate limit recommendations.

Fetches data from Prometheus and calculates suggested requests/limits."""

from datetime import datetime, timedelta

import numpy as np
from prometheus_api_client import PrometheusConnect

def analyze_container_resources(prom: PrometheusConnect, container: str, days: int = 7):
    end = datetime.now()
    start = end - timedelta(days=days)

    # Query CPU usage (cores) over the period
    cpu_query = f'rate(container_cpu_usage_seconds_total{{container="{container}"}}[5m])'
    cpu_data = prom.custom_query_range(
        cpu_query, start_time=start, end_time=end, step="5m"
    )
    cpu_values = [float(v[1]) for v in cpu_data[0]['values']]

    # Query memory working set (bytes) over the period
    memory_query = f'container_memory_working_set_bytes{{container="{container}"}}'
    memory_data = prom.custom_query_range(
        memory_query, start_time=start, end_time=end, step="5m"
    )
    memory_values = [float(v[1]) for v in memory_data[0]['values']]

    # Request at typical usage; limit at P99 plus headroom
    return {
        'cpu': {
            'request': f"{int(np.percentile(cpu_values, 75) * 1000)}m",
            'limit': f"{int(np.percentile(cpu_values, 99) * 1500)}m",  # P99 * 1.5, in millicores
            'p50': np.percentile(cpu_values, 50),
            'p99': np.percentile(cpu_values, 99),
            'max': max(cpu_values),
        },
        'memory': {
            'request': f"{int(np.percentile(memory_values, 90) / 1024**2)}Mi",
            'limit': f"{int(np.percentile(memory_values, 99) * 1.25 / 1024**2)}Mi",
            'p50': np.percentile(memory_values, 50) / 1024**2,
            'p99': np.percentile(memory_values, 99) / 1024**2,
            'max': max(memory_values) / 1024**2,
        },
    }

# Example usage
prom = PrometheusConnect(url="http://prometheus:9090")
recs = analyze_container_resources(prom, "my-app", days=14)
print(f"CPU: request={recs['cpu']['request']}, limit={recs['cpu']['limit']}")
print(f"Memory: request={recs['memory']['request']}, limit={recs['memory']['limit']}")

We've covered the practical aspects of container resource limiting. Let's consolidate the key takeaways:
What's next:
We've explored namespaces for isolation, cgroups for resource control, and practical resource limiting strategies. The final page of this module covers container isolation from a holistic security perspective—how namespaces, cgroups, capabilities, seccomp, and LSMs work together to create secure container boundaries, and the limitations that require additional isolation layers like gVisor or Kata Containers.
You now understand practical resource limiting—how to analyze workloads, calculate appropriate requests and limits, choose QoS classes, and iteratively tune resource configurations. Next, we'll explore container isolation security in depth.