In the world of Kubernetes, resource management is the invisible foundation upon which all stability, performance, and cost efficiency rest. A cluster without proper resource management is like a city without traffic laws—everything might work fine when traffic is light, but chaos erupts under load.
Every container running in Kubernetes competes for the same finite pool of compute resources: CPU cycles, memory, storage bandwidth, and network capacity. Without explicit resource management, this competition becomes a free-for-all where a single misbehaving application can starve its neighbors, trigger cascading failures, or cause the scheduler to make poor placement decisions.
Resource management in Kubernetes encompasses several interlocking mechanisms: per-container requests and limits, Quality of Service (QoS) classes, namespace-level ResourceQuotas and LimitRanges, and node-level capacity reservation.
By the end of this page, you'll understand how to configure CPU and memory resources, predict scheduling behavior, implement QoS strategies, enforce organizational resource policies, and avoid the common pitfalls that lead to cluster instability and cost overruns.
Before diving into configuration, we must understand how Kubernetes models compute resources and how these models differ from traditional virtualization.
CPU Resources: Compressible and Shareable
CPU is a compressible resource in Kubernetes terminology. When CPU demand exceeds supply, the kernel's CFS (Completely Fair Scheduler) throttles containers proportionally. No container crashes—they simply run slower. This compressibility makes CPU more forgiving but also means CPU limits can introduce unexpected latency through throttling.
CPU Units:
- `250m` means 0.25 CPU cores
- `0.1` CPU (`100m`) for lightweight sidecars

Memory Resources: Incompressible and Fatal
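As a quick sanity check, millicore values convert to cores by dividing by 1000. A minimal shell sketch:

```shell
# Convert a millicore value (e.g. a "250m" request) to fractional cores
millicores=250
awk -v m="$millicores" 'BEGIN { printf "%.2f cores\n", m / 1000 }'
```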
Memory is incompressible. When a container tries to exceed its memory limit, the kernel's OOM (Out of Memory) killer terminates it—there's no graceful degradation. This makes memory limits more dangerous than CPU limits; incorrect memory limits cause container restarts and potential data loss.
Memory Units:
- Binary suffixes `Ki`, `Mi`, `Gi` (e.g., `1Mi` = 2^20 bytes), the convention used throughout this page
- Decimal suffixes `K`, `M`, `G` (e.g., `1M` = 10^6 bytes)
| Characteristic | CPU | Memory |
|---|---|---|
| Compressibility | Compressible (throttled) | Incompressible (OOM killed) |
| Failure mode | Increased latency | Container restart |
| Overcommit risk | Performance degradation | Cascading failures |
| Monitoring signal | Throttle metrics | OOM events, restarts |
| Limit enforcement | CFS bandwidth control | Kernel OOM killer |
| Burstable | Yes (can exceed request) | Yes, but dangerous |
Setting memory limits too close to actual usage is a recipe for instability. Applications often have transient memory spikes (GC pauses, request bursts, background tasks) that exceed steady-state consumption. Leave headroom—typically 20-30% above observed peak usage—to absorb these spikes without OOM kills.
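To make the headroom rule concrete, here is a small shell sketch that pads an observed peak by 25%; the 800Mi figure is hypothetical:

```shell
# Size a memory limit from observed peak usage plus ~25% headroom
peak_mib=800        # hypothetical observed peak
headroom_pct=25
limit_mib=$(( peak_mib * (100 + headroom_pct) / 100 ))
echo "suggested memory limit: ${limit_mib}Mi"
```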
The distinction between requests and limits is the most critical concept in Kubernetes resource management. Misunderstanding this distinction leads to scheduling failures, cluster instability, and resource waste.
Requests: What the Scheduler Sees
Resource requests tell the Kubernetes scheduler how much CPU and memory a container needs to run. The scheduler uses requests—not limits—to make placement decisions. A pod is only scheduled to a node if the node has sufficient allocatable resources to satisfy the pod's total requests.
Key implications:
- The scheduler bases placement on requests, not actual usage: a pod requesting 4Gi while using 500Mi still reserves the full 4Gi of schedulable capacity
- Under-requesting lets nodes become overloaded once pods burst; over-requesting strands capacity and inflates cost
- A pod whose requests no node can satisfy remains Pending
Limits: What the Kernel Enforces
Resource limits define the maximum resources a container can consume. Unlike requests, limits are enforced by Linux kernel mechanisms (cgroups), not the Kubernetes scheduler.
Limit enforcement:
- CPU limits are enforced by CFS bandwidth control: the container is throttled, not killed
- Memory limits are enforced by the kernel OOM killer: exceeding the limit terminates the container
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
  namespace: production
spec:
  containers:
  - name: application
    image: myapp:v1.2.3
    resources:
      # Requests: Scheduler guarantees, must always be available
      requests:
        memory: "512Mi"   # Container needs at least 512Mi to start
        cpu: "250m"       # Container needs 0.25 CPU cores minimum
      # Limits: Hard ceiling, kernel-enforced
      limits:
        memory: "1Gi"     # Container killed if it exceeds 1Gi
        cpu: "1000m"      # Container throttled if it exceeds 1 CPU
  - name: sidecar-logger
    image: fluentbit:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"

# Total pod requests: 576Mi memory, 300m CPU
# Total pod limits:   1152Mi memory, 1100m CPU
# Scheduler reserves: 576Mi memory, 300m CPU
```

Kubernetes assigns every pod a Quality of Service (QoS) class based on its resource specification. QoS classes determine eviction priority when nodes experience memory pressure—pods with lower QoS are evicted before pods with higher QoS.
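The scheduler sums requests and limits across all containers in the pod; the totals noted in the manifest's comments check out with simple arithmetic:

```shell
# Cross-check the pod totals: application container + sidecar
echo "requests: $(( 512 + 64 ))Mi memory, $(( 250 + 50 ))m CPU"
echo "limits:   $(( 1024 + 128 ))Mi memory, $(( 1000 + 100 ))m CPU"
```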
Understanding QoS is essential because:
- It determines which pods survive node memory pressure and which are evicted first
- You cannot set it directly; it is derived from the requests and limits you specify
- A critical workload accidentally left as BestEffort (no resources set) becomes the first casualty under pressure
The Three QoS Classes:
1. Guaranteed (Highest Priority)
2. Burstable (Medium Priority)
3. BestEffort (Lowest Priority)
| QoS Class | Criteria | Eviction Priority | Use Case |
|---|---|---|---|
| Guaranteed | All containers: requests = limits for CPU & memory | Last (most protected) | Databases, critical services, latency-sensitive apps |
| Burstable | At least one request or limit set; requests ≠ limits | Middle | Web servers, API services, most production workloads |
| BestEffort | No requests or limits on any container | First (least protected) | Batch jobs, CI/CD runners, dev environments |
```yaml
# Guaranteed QoS - Highest protection
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: database
    image: postgres:15
    resources:
      requests:
        memory: "4Gi"
        cpu: "2000m"
      limits:
        memory: "4Gi"    # Same as request
        cpu: "2000m"     # Same as request
---
# Burstable QoS - Can burst above requests
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: webserver
    image: nginx:1.24
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"  # Higher than request
        cpu: "500m"      # Higher than request
---
# BestEffort QoS - No resource guarantees
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
  - name: batch-job
    image: batch-processor:latest
    # No resources section = BestEffort
```

For production clusters, aim for most pods to be Burstable with appropriate requests, and reserve Guaranteed for truly critical services (databases, message queues, coordination services). BestEffort should be restricted to development namespaces or preemptible batch workloads.
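The classification rules can be sketched as a toy shell function. This is deliberately simplified to a single request/limit pair: real classification requires the requests-equal-limits condition to hold for both CPU and memory across every container, and it runs after API defaulting (a container that sets only limits has its requests defaulted to match, making it Guaranteed).

```shell
# Toy QoS classifier for one container's request/limit pair
# (pass empty strings for unset values)
qos_class() {
  local request=$1 limit=$2
  if [ -z "$request" ] && [ -z "$limit" ]; then
    echo "BestEffort"     # nothing set
  elif [ -n "$request" ] && [ "$request" = "$limit" ]; then
    echo "Guaranteed"     # request == limit
  else
    echo "Burstable"      # something set, but not equal
  fi
}

qos_class "4Gi" "4Gi"       # Guaranteed
qos_class "256Mi" "512Mi"   # Burstable
qos_class "" ""             # BestEffort
```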
Resource Quotas enable cluster administrators to constrain aggregate resource consumption per namespace. In multi-tenant clusters, quotas prevent any single team from monopolizing cluster resources, ensure fair resource distribution, and provide cost accountability.
What Resource Quotas Control:
Compute Resources
- `requests.cpu`, `limits.cpu`: Total CPU across all pods
- `requests.memory`, `limits.memory`: Total memory across all pods
- `requests.nvidia.com/gpu`: Extended resources (GPUs, etc.)

Storage Resources
- `requests.storage`: Total PersistentVolumeClaim size
- `persistentvolumeclaims`: Number of PVCs
- `<storage-class>.storageclass.storage.k8s.io/requests.storage`: Per-storage-class limits

Object Counts
- `pods`, `services`, `secrets`, `configmaps`: Limit number of objects
- `count/<resource>.<group>`: Generic object count (e.g., `count/deployments.apps`)

Enforcement Behavior:
- Creation requests that would push usage past a quota are rejected at admission time
- Quotas constrain new objects only; tightening a quota does not delete or evict existing resources
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: team-payments
spec:
  hard:
    # Compute quotas
    requests.cpu: "100"          # 100 CPU cores total
    requests.memory: "200Gi"     # 200 GiB memory requests
    limits.cpu: "200"            # 200 CPU cores limit total
    limits.memory: "400Gi"       # 400 GiB memory limits
    # Pod count limits
    pods: "500"                  # Max 500 pods in namespace
    # Storage limits
    requests.storage: "1Ti"      # 1 TiB total PVC storage
    persistentvolumeclaims: "50" # Max 50 PVCs
    # Object count limits
    services: "100"
    secrets: "200"
    configmaps: "200"
---
# Scoped quota - only applies to pods with a specific priority class
apiVersion: v1
kind: ResourceQuota
metadata:
  name: guaranteed-quota
  namespace: team-payments
spec:
  hard:
    pods: "20"                   # Only 20 high-priority pods allowed
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high-priority"]
---
# Quota for specific priority classes
apiVersion: v1
kind: ResourceQuota
metadata:
  name: low-priority-quota
  namespace: team-analytics
spec:
  hard:
    cpu: "50"
    memory: "100Gi"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["batch", "preemptible"]
```

If a ResourceQuota exists that tracks compute resources (CPU/memory requests or limits), all pods in that namespace MUST specify those resources; otherwise, pod creation fails. Use LimitRange to set defaults so users don't need to specify resources on every container.
LimitRange resources provide three critical functions:
1. Defaults: inject `default` and `defaultRequest` values into containers that omit them
2. Bounds: reject containers outside configured `min`/`max` ranges
3. Ratios: cap the limit-to-request ratio to prevent extreme overcommit

LimitRanges are particularly valuable in multi-tenant environments where you want to:
- Guarantee every container has requests set, so ResourceQuotas remain enforceable
- Block accidental extremes (a `64Gi` typo, or a container with no memory limit at all)
- Standardize sensible defaults without requiring every team to specify resources

LimitRange Types: `Container` (per-container constraints), `Pod` (aggregate across a pod's containers), and `PersistentVolumeClaim` (storage size bounds).
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-checkout
spec:
  limits:
  # Container-level limits (most common)
  - type: Container
    # Default values applied when container doesn't specify limits
    default:
      cpu: "500m"
      memory: "512Mi"
    # Default requests when not specified
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    # Minimum allowed (rejects pods below this)
    min:
      cpu: "50m"
      memory: "64Mi"
    # Maximum allowed (rejects pods above this)
    max:
      cpu: "4"
      memory: "8Gi"
    # Maximum limit/request ratio (prevents extreme overcommit)
    maxLimitRequestRatio:
      cpu: "10"      # Limit can be at most 10x request
      memory: "4"    # Limit can be at most 4x request
  # Pod-level limits (aggregate across all containers)
  - type: Pod
    max:
      cpu: "8"       # Total pod CPU cannot exceed 8 cores
      memory: "16Gi" # Total pod memory cannot exceed 16Gi
  # PVC storage limits
  - type: PersistentVolumeClaim
    min:
      storage: "1Gi"   # No tiny PVCs
    max:
      storage: "100Gi" # Cap individual PVC size
```

How Defaults Are Applied:
When a pod is created, the admission controller checks each container:
1. If the container doesn't specify `resources.limits`, apply the `default` values
2. If the container doesn't specify `resources.requests`, apply the `defaultRequest` values
3. If `defaultRequest` is not set but `default` is, request = limit (Guaranteed QoS)
4. Validate the final values against `min`, `max`, and `maxLimitRequestRatio`
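Step 4's ratio check is plain arithmetic. A sketch using the memory ratio of 4 from the example LimitRange, with hypothetical request/limit values:

```shell
# Validate limit/request ratio for memory (maxLimitRequestRatio: 4)
request_mib=128
limit_mib=512
max_ratio=4
if [ "$limit_mib" -le $(( request_mib * max_ratio )) ]; then
  echo "accepted"
else
  echo "rejected: memory limit exceeds ${max_ratio}x request"
fi
```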
A common source of confusion in Kubernetes resource management is the difference between node capacity and node allocatable. Understanding this distinction is critical for capacity planning.
Node Capacity: Total hardware resources on the node
Node Allocatable: Resources available for scheduling pods
The difference accounts for:
- `system-reserved`: resources set aside for OS daemons (sshd, journald, etc.)
- `kube-reserved`: resources set aside for Kubernetes components (kubelet, container runtime)
- Eviction thresholds: a buffer the kubelet keeps free so it can evict pods before the node itself runs out of memory
Formula:
Allocatable = Capacity - SystemReserved - KubeReserved - EvictionThreshold
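Plugging in the reservation values used in the example below (1Gi system, 1Gi kube, 500Mi eviction on a 16Gi node):

```shell
# Allocatable = Capacity - SystemReserved - KubeReserved - EvictionThreshold
capacity_mib=16384          # 16Gi node
system_reserved_mib=1024    # 1Gi
kube_reserved_mib=1024      # 1Gi
eviction_threshold_mib=500  # 500Mi
allocatable_mib=$(( capacity_mib - system_reserved_mib - kube_reserved_mib - eviction_threshold_mib ))
echo "allocatable: ${allocatable_mib}Mi"   # ~13.5Gi
```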
```yaml
# Kubelet configuration for resource reservation
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

# Reserve resources for system processes
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"

# Reserve resources for Kubernetes components
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"

# Eviction thresholds (when to start evicting pods)
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "1Gi"
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s"
  nodefs.available: "1m30s"

# Example calculation for a 16Gi memory node:
# Capacity:          16Gi
# - SystemReserved:   1Gi
# - KubeReserved:     1Gi
# - EvictionHard:   500Mi
# = Allocatable:   ~13.5Gi
```

View a node's allocatable resources with `kubectl describe node <node-name>` and look for the 'Allocatable' section. The difference between 'Capacity' and 'Allocatable' shows exactly what's reserved for system and Kubernetes components.
| Node Size | Capacity | System + Kube Reserved | Eviction Buffer | Allocatable | % Available |
|---|---|---|---|---|---|
| Small (4 CPU, 16Gi) | 16Gi | 2Gi | 500Mi | ~13.5Gi | 84% |
| Medium (8 CPU, 32Gi) | 32Gi | 2.5Gi | 1Gi | ~28.5Gi | 89% |
| Large (16 CPU, 64Gi) | 64Gi | 3Gi | 1.5Gi | ~59.5Gi | 93% |
| XLarge (32 CPU, 128Gi) | 128Gi | 4Gi | 2Gi | ~122Gi | 95% |
CPU limits in Kubernetes are enforced via the Linux CFS (Completely Fair Scheduler) bandwidth control mechanism. While this seems straightforward, the implementation details cause significant performance issues that many teams overlook.
How CFS Bandwidth Control Works:
The CFS scheduler enforces CPU limits using a quota/period system:
- Time is divided into fixed accounting periods (100ms by default, configurable via the kubelet's `--cpu-cfs-quota-period`)
- Each period, the container receives a runtime quota equal to its CPU limit times the period; a `500m` limit grants 50ms of CPU time per 100ms period

The throttling mechanism:
- When the container's threads consume the full quota before the period ends, they are descheduled until the next period begins
- Throttled time surfaces as request latency, even on nodes with plenty of idle CPU
The Burst Problem:
Even if your container's average CPU usage is well below limits, it can be throttled if usage is bursty. Consider a web server that processes requests in bursts—each request handler might spike CPU for 10-20ms. If multiple requests arrive in the same 100ms period, the quota exhausts and requests queue, adding latency.
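The arithmetic behind the burst problem, for a hypothetical `500m` limit:

```shell
# CFS quota for a 500m limit over the default 100ms period
limit_millicores=500
period_ms=100
quota_ms=$(( limit_millicores * period_ms / 1000 ))
echo "quota: ${quota_ms}ms of CPU per ${period_ms}ms period"

# Three concurrent ~20ms request handlers need 60ms of CPU in one period
demand_ms=60
[ "$demand_ms" -gt "$quota_ms" ] && echo "throttled: work queues until the next period"
```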
```bash
#!/bin/bash
# Monitor CFS throttling for a specific container

# Find the container ID for your pod
CONTAINER_ID=$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's/containerd:\/\///')

# For the containerd runtime, the cgroup v1 path is typically:
CGROUP_PATH="/sys/fs/cgroup/cpu,cpuacct/kubepods/pod<pod-uid>/<container-id>"

# Key metrics to monitor:
# nr_periods: Total number of CFS periods elapsed
cat $CGROUP_PATH/cpu.stat | grep nr_periods

# nr_throttled: Number of periods where the container was throttled
cat $CGROUP_PATH/cpu.stat | grep nr_throttled

# throttled_time: Total time the container was throttled (nanoseconds)
cat $CGROUP_PATH/cpu.stat | grep throttled_time

# Calculate throttle percentage:
# throttle_pct = (nr_throttled / nr_periods) * 100

# Using cgroup v2 (modern systems):
cat /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice/cpu.stat

# Prometheus metrics (if using node_exporter or cAdvisor):
# container_cpu_cfs_throttled_periods_total
# container_cpu_cfs_periods_total
# Throttle ratio = throttled_periods / total_periods
```

Many teams at Google, Netflix, and other high-performance organizations do NOT set CPU limits, only requests. Their reasoning: CPU throttling causes unpredictable latency spikes, and CPU overcommit on well-monitored clusters is manageable. Request-only allocation ensures scheduling fairness without throttling penalties. This is controversial—running without limits requires mature monitoring and smaller node pools to limit the blast radius.
Unlike CPU, memory is incompressible—you either have it or you don't. Memory overcommit (where total memory limits exceed node capacity) is inherently risky but often necessary for cost efficiency. Understanding how Kubernetes handles memory pressure is critical.
Memory Pressure Handling:
When a node experiences memory pressure, several mechanisms activate:
1. Eviction (Kubelet-driven): the kubelet proactively evicts pods, BestEffort first and then Burstable pods exceeding their requests, before node memory is fully exhausted
2. OOM Kill (Kernel-driven): if a container breaches its memory limit, or eviction cannot keep pace, the kernel OOM killer terminates processes according to their `oom_score_adj`
3. System OOM: if node memory runs out entirely, the kernel may kill any process, including system daemons; this is the scenario eviction thresholds exist to prevent
| QoS Class | OOM Score Adj | Kill Priority | Rationale |
|---|---|---|---|
| Guaranteed | -997 | Lowest | Resources were reserved; killing these pods would break that guarantee |
| Burstable (high request) | Small positive value | Medium-Low | Closer to request = more protected |
| Burstable (low request) | Large positive value (up to ~999) | Medium-High | Far below request = less protection |
| BestEffort | 1000 | Highest | No guarantees made; first to sacrifice |
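For Burstable pods, the kubelet derives the adjustment from the memory request relative to node capacity, approximately 1000 minus (1000 × request ÷ capacity), clamped to stay between the Guaranteed and BestEffort extremes. A shell sketch with hypothetical numbers:

```shell
# Approximate Burstable oom_score_adj: larger request => lower score => safer
request_mib=512
node_capacity_mib=16384
adj=$(( 1000 - 1000 * request_mib / node_capacity_mib ))
# Clamp into the Burstable band (above Guaranteed's -997, below BestEffort's 1000)
[ "$adj" -lt 2 ] && adj=2
[ "$adj" -gt 999 ] && adj=999
echo "oom_score_adj: $adj"
```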
```yaml
# Pattern 1: Prevent memory overcommit per namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: prevent-overcommit
  namespace: production
spec:
  limits:
  - type: Container
    # Force requests = limits for memory (Guaranteed QoS)
    maxLimitRequestRatio:
      memory: "1"   # Limit can only be 1x request
---
# Pattern 2: Reserve memory buffer on nodes (kubelet configuration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "1Gi"   # Start hard eviction at 1Gi free
evictionSoft:
  memory.available: "2Gi"   # Start soft eviction at 2Gi free
evictionSoftGracePeriod:
  memory.available: "2m"
---
# Pattern 3: Pod Disruption Budget for critical workloads
# Ensure eviction doesn't take down too many replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-service
```

Memory leaks are the most common cause of OOM kills. Before setting tight memory limits, ensure your application has no memory leaks. Use heap profilers, monitor memory growth over time, and test with realistic workloads. A limit set against a leaking application is just scheduled failure.
Beyond CPU and memory, Kubernetes supports extended resources for specialized hardware like GPUs, TPUs, FPGAs, or any custom resource you define.
How Extended Resources Work:
1. A device plugin (or a manual node status patch) advertises the resource on a node, for example `nvidia.com/gpu: 4`
2. Pods request the resource in container specs; the scheduler places them only on nodes with enough unallocated units
3. The kubelet, via the device plugin, attaches the specific devices to the container at startup
Common Extended Resources:
- `nvidia.com/gpu`: NVIDIA GPUs (requires the NVIDIA device plugin)
- `amd.com/gpu`: AMD GPUs
- `intel.com/fpga`: Intel FPGAs
- `google.com/tpu`: Google TPUs (GKE only)

Extended Resource Characteristics:
- Requested in whole integers only; no fractional GPUs without MIG or time-slicing
- Specified under `limits` (requests, if omitted, default to the same value)
- Never overcommitted; each unit is allocated exclusively to one container
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-job
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        # Request 2 NVIDIA GPUs
        nvidia.com/gpu: "2"
        # Also need regular resources
        memory: "32Gi"
        cpu: "8"
      requests:
        memory: "32Gi"
        cpu: "8"
    volumeMounts:
    - mountPath: /data
      name: training-data
  # Tolerate GPU node taints (common pattern)
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  # Prefer GPU node pool
  nodeSelector:
    cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
  volumes:
  - name: training-data
    persistentVolumeClaim:
      claimName: training-dataset
```

Modern GPUs (NVIDIA A100, H100) support Multi-Instance GPU (MIG) to partition a single GPU into multiple isolated instances. This enables fractional GPU sharing with hardware isolation. Time-sharing solutions like NVIDIA GPU Operator's time-slicing also allow oversubscription without hardware partitioning, useful for development workloads.
Effective resource management is the cornerstone of stable, cost-efficient Kubernetes operations. The key principles from this guide:
- Set requests on every container; they drive scheduling decisions and QoS classification
- Set memory limits with 20-30% headroom above observed peaks, and treat CPU limits with caution because CFS throttling adds latency
- Use ResourceQuotas and LimitRanges together to keep multi-tenant namespaces fair and enforceable
- Plan capacity against node allocatable, not raw node capacity
- Treat throttle metrics and OOM events as first-class health signals
What's Next:
With resource management foundations in place, the next page explores Auto-Scaling with Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). You'll learn how to automatically adjust resource allocation and replica counts based on real-time demand, eliminating manual capacity management while maintaining performance SLAs.
You now have a comprehensive understanding of Kubernetes resource management—from basic CPU/memory configuration through QoS classes, quotas, limit ranges, and advanced topics like CPU throttling and extended resources. Apply these principles to build stable, efficient, and cost-effective Kubernetes deployments.