Resource exhaustion failures are fundamentally different from network or service failures. When a network partitions or a service crashes, the failure is typically abrupt and detectable—systems either work or they don't. But when resources are exhausted, the failure is gradual and insidious.
A system running low on memory doesn't suddenly stop. It slows down as garbage collection consumes more CPU. It starts swapping to disk. Response times drift from milliseconds to seconds. Eventually, the OOM killer terminates processes, but by then, the damage is done—users have experienced degraded service for minutes or hours.
Resource exhaustion creates a particularly dangerous failure pattern: the system is sick but not dead. Health checks may pass. The service responds to pings. Yet actual user requests time out or fail. This zombie state can persist far longer than a clean failure because automated recovery mechanisms don't recognize the problem.
By the end of this page, you will understand how to inject and analyze resource exhaustion scenarios: CPU saturation, memory pressure, disk exhaustion, disk I/O saturation, and file descriptor exhaustion. You'll learn to observe system behavior as resources deplete, identify the tipping points where degradation becomes failure, and implement protective mechanisms.
Before injecting resource exhaustion, we must understand what resources our systems consume and what limits constrain them. In containerized environments, resources operate at multiple levels:
Host-Level Resources: CPU cores, physical memory, disk capacity and I/O bandwidth, network bandwidth, and kernel-wide limits such as the system file descriptor table.
Container-Level Resources: cgroup CPU and memory limits, ephemeral storage, and per-container process and file descriptor limits.
Application-Level Resources: heap size, thread pools, connection pools, and open file handles.
| Resource | Early Warning Signs | Degradation Behavior | Failure Mode |
|---|---|---|---|
| CPU | Increased latency, queue buildup | Slow processing, timeouts | Request dropping, cascading failures |
| Memory | Increased GC, swapping | Very slow response, thrashing | OOM kill, sudden crash |
| Disk Space | Disk alerts, write failures | Read-only filesystem | Database corruption, crash |
| Disk I/O | Increased latency for I/O ops | Database slowness, log delays | Complete I/O stall |
| File Descriptors | Connection errors | Cannot accept new connections | Service becomes unreachable |
| Thread Pool | Request queuing | Timeout errors | Complete inability to handle requests |
Before conducting resource exhaustion experiments, document the configured limits for your target system. Container memory limits, connection pool sizes, and thread counts all affect how the system behaves under resource pressure. Observing behavior near these limits is exactly what we're testing.
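A small shell sketch like the following can capture those limits in one snapshot before an experiment; the cgroup v2 paths and the my-service process name are assumptions to adapt to your environment:

```bash
#!/usr/bin/env bash
# Snapshot resource limits before an experiment (Linux; 'my-service' is a placeholder).

echo "== Host level =="
nproc                              # available CPU cores
free -h                            # total and available memory
df -h /                            # root filesystem capacity
cat /proc/sys/fs/file-max          # system-wide file descriptor limit

echo "== Container level (cgroup v2; run inside the container) =="
cat /sys/fs/cgroup/memory.max 2>/dev/null   # memory limit, or 'max' if unlimited
cat /sys/fs/cgroup/cpu.max 2>/dev/null      # CPU quota and period

echo "== Process level =="
PID=$(pgrep -f my-service | head -n 1)
grep -E 'open files|processes' /proc/$PID/limits
```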
CPU saturation occurs when computational demand exceeds available processing capacity. Unlike memory exhaustion (which can cause abrupt failure), CPU saturation causes gradual degradation—work takes longer, queues build up, and latency increases proportionally to the degree of saturation.
How CPU Saturation Manifests:
| Saturation Level | Observable Behavior | System State |
|---|---|---|
| 50-70% | Normal operation, headroom for spikes | Healthy |
| 70-85% | Slight latency increase, less spike tolerance | Yellow alert |
| 85-95% | Noticeable latency, queue buildup | Degraded |
| 95-100% | Severe latency, timeouts begin | Critical |
| 100% sustained | Timeouts, cascading failures, potential crashes | Failure |
Importantly, CPU saturation interacts with other resources. As the CPU works harder, garbage collection runs more frequently (consuming more CPU), heat increases (potentially causing throttling), and context switching overhead grows. The relationship between CPU usage and performance is non-linear.
```bash
# Using stress-ng for CPU saturation (Linux)

# Saturate all CPUs for 5 minutes
stress-ng --cpu 0 --timeout 5m

# Saturate with specific number of workers (4 CPUs)
stress-ng --cpu 4 --cpu-load 100 --timeout 5m

# Gradually increase CPU load
for load in 25 50 75 90 100; do
  echo "Setting CPU load to $load%"
  stress-ng --cpu 0 --cpu-load $load --timeout 30s
done

# Target specific CPU methods (tests different CPU characteristics)
stress-ng --cpu 4 --cpu-method matrixprod --timeout 5m   # Matrix multiplication
stress-ng --cpu 4 --cpu-method fft --timeout 5m          # FFT (floating point)

# Kubernetes: Using Chaos Mesh CPU stress
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: all
  selector:
    labelSelectors:
      app: my-service
  stressors:
    cpu:
      workers: 4
      load: 90        # 90% CPU load
  duration: "10m"
EOF

# Docker: CPU stress using container
docker run --rm -it --cpus="2" \
  alexeiled/stress-ng --cpu 2 --cpu-load 100 --timeout 5m
```

What to Observe During CPU Saturation:
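The exact signals depend on your stack, but a minimal host-side sketch (assuming a Linux kernel with pressure stall information, PSI, enabled) for watching saturation while the stressor runs:

```bash
# Load average vs. core count: sustained load above the core count means queued work
uptime
nproc

# Run queue length (the 'r' column) and per-second CPU breakdown
vmstat 1 5

# Pressure stall information: share of time runnable tasks were stalled waiting for CPU
cat /proc/pressure/cpu

# Top CPU consumers at this moment
top -b -n 1 | head -n 20
```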
In Kubernetes with CPU limits set, a container that exceeds its CPU limit is throttled—its processes are paused for portions of each scheduling period. This causes unpredictable latency spikes even when the host has CPU available. Observe container throttling metrics (container_cpu_cfs_throttled_seconds_total, container_cpu_cfs_throttled_periods_total) alongside system CPU metrics.
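A quick way to check whether throttling, rather than raw host CPU, is the bottleneck is to look at the CFS throttling counters; a hedged sketch, assuming Prometheus scrapes cAdvisor and using placeholder pod and deployment names:

```bash
# PromQL (e.g. in Grafana): fraction of CFS periods in which the container was throttled
#   rate(container_cpu_cfs_throttled_periods_total{pod=~"my-service.*"}[5m])
#     / rate(container_cpu_cfs_periods_total{pod=~"my-service.*"}[5m])

# Or read the counters straight from the container's cgroup (cgroup v2 layout shown):
# nr_throttled = number of throttled periods, throttled_usec = total time throttled
kubectl exec deploy/my-service -- cat /sys/fs/cgroup/cpu.stat
```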
Memory pressure occurs when available memory decreases to the point where the system must work to reclaim it. In garbage-collected languages (Java, Go, Python), this triggers more frequent and longer GC cycles. In all systems, extreme memory pressure can trigger swapping to disk or, ultimately, the OOM killer.
The Memory Pressure Spectrum:
```
[Normal] → [Minor GC] → [Major GC] → [Swapping] → [Thrashing] → [OOM Kill]
   ↑           ↑            ↑            ↑             ↑            ↑
Plenty      GC runs     Stop-the-     Severe        System       Process
of room   more often    world GC     slowness      unusable    terminated
```
The transition from 'increased GC' to 'OOM kill' can happen rapidly. A system that's been running with 80% memory usage for weeks might suddenly crash because a traffic spike needed that last 20%.
```bash
# Using stress-ng for memory pressure (Linux)

# Allocate 2GB of memory for 5 minutes
stress-ng --vm 1 --vm-bytes 2G --timeout 5m

# Allocate percentage of available memory
stress-ng --vm 1 --vm-bytes 80% --timeout 5m

# Allocate with continuous memory operations (more realistic)
stress-ng --vm 2 --vm-bytes 1G --vm-keep --timeout 5m

# Gradually increase memory pressure
for percent in 50 60 70 80 90; do
  echo "Allocating ${percent}% of memory"
  timeout 30s stress-ng --vm 1 --vm-bytes ${percent}%
  sleep 5
done

# Kubernetes: Using Chaos Mesh memory stress
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: my-service
  stressors:
    memory:
      workers: 1
      size: "256MB"        # Allocate 256MB per worker
      oomScoreAdj: -1000   # Protect stress process from OOM
  duration: "5m"
EOF

# Docker: Memory stress in container
docker run --rm -it --memory="512m" \
  alexeiled/stress-ng --vm 1 --vm-bytes 450M --timeout 5m

# Simulate memory leak (Python example)
python3 -c "
import time

data = []
while True:
    data.append('X' * 1024 * 1024)   # Add 1MB per iteration
    time.sleep(1)
    print(f'Allocated {len(data)}MB')
"
```

| Metric | Normal Range | Warning Level | Critical Level |
|---|---|---|---|
| Memory utilization | < 70% | 70-85% | > 85% |
| GC pause time (Java) | < 100ms | 100-500ms | > 500ms |
| GC frequency | Normal for app | 2x baseline | 5x+ baseline |
| Swap usage | 0% | Any swap | > 10% swap |
| OOM events | 0 | N/A | Any OOM |
| Container memory % | < 80% | 80-95% | > 95% |
When a container exceeds its memory limit, Kubernetes kills and restarts it. Unlike graceful shutdown, this is abrupt—in-flight requests are lost, connections are dropped, and there's no cleanup. Memory pressure testing reveals whether your application gracefully sheds load before hitting the limit or crashes suddenly.
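After a memory-pressure experiment, it is worth confirming whether any restarts were OOM kills rather than ordinary crashes; a sketch assuming the workload is labeled app=my-service:

```bash
# Restart count and last termination reason per pod (OOMKilled means the limit was hit)
kubectl get pods -l app=my-service \
  -o custom-columns='POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason'

# The same information in human-readable form
kubectl describe pod -l app=my-service | grep -A 3 'Last State'

# On the node, the kernel log records OOM killer activity
dmesg -T | grep -i 'killed process'
```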
Disk space exhaustion occurs when file storage is filled to capacity. This can happen gradually (logs accumulating, data growing) or suddenly (large file creation, backup copies, a log explosion during an incident). Disk exhaustion affects different components differently:
Components Affected by Disk Exhaustion:
| Component | Effect When Disk is Full | Typical Consequence |
|---|---|---|
| Application logs | Can't write logs | Blind debugging during incidents |
| Database | Can't write data | Transaction failures, corruption |
| Temp files | Can't create temp files | Processing failures |
| Container overlay | Container can't write | Pod eviction |
| System journals | System logs lost | Unable to diagnose issues |
| Docker/containerd | Can't pull images, create containers | Deployment failures |
Disk exhaustion is particularly dangerous because many systems assume writes will succeed. A database that can't write its transaction log may corrupt data. An application that can't write to temp storage may crash in unexpected ways.
```bash
# Fill disk space using dd
# WARNING: Use with caution, creates large files

# Create a 10GB file to fill disk
dd if=/dev/zero of=/tmp/fill_disk bs=1G count=10

# Create file of specific size to reach target usage
CURRENT_USAGE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
DISK_SIZE_MB=$(df -m / | awk 'NR==2 {print $2}')
TARGET=90
FILL_MB=$(( (TARGET - CURRENT_USAGE) * DISK_SIZE_MB / 100 ))
dd if=/dev/zero of=/tmp/fill_disk bs=1M count=$FILL_MB

# Using stress-ng for disk stress
stress-ng --hdd 2 --hdd-bytes 5G --timeout 5m

# Fill ephemeral storage in Kubernetes (run in container)
dd if=/dev/zero of=/app/fill_ephemeral bs=1M count=500

# Kubernetes: Disk fill chaos using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: disk-fill
spec:
  action: fill
  mode: one
  selector:
    labelSelectors:
      app: my-service
  volumePath: /var/log
  containerName: my-container
  size: "500MB"
  duration: "5m"
EOF

# Clean up disk fill
rm /tmp/fill_disk

# Monitor disk usage during experiment
watch -n 1 'df -h; echo "---"; ls -lah /tmp/fill_*'

# Container-based disk filling
docker run --rm -v /tmp:/fill alpine \
  sh -c 'dd if=/dev/zero of=/fill/large_file bs=1M count=5000'
```

Observations During Disk Exhaustion:
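What you can observe here depends on the components involved; a minimal host-side sketch (Linux, GNU coreutils) to run while the disk fills:

```bash
# Capacity and inode usage (inode exhaustion also reports 'No space left on device')
df -h
df -i

# Kernel and service logs for write failures or read-only remounts
dmesg -T | grep -iE 'no space|read-only|I/O error'
journalctl -p err --since "10 minutes ago" | grep -iE 'enospc|no space'

# Which directories on this filesystem are growing fastest
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 10
```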
Many organizations reserve disk space (e.g., 10-20% or a fixed amount) for emergency operations. Without reserved space, you can't SSH into a system, write logs, or even delete files to free space. Test whether your disk exhaustion procedures work when the disk is actually full.
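On ext4 filesystems, part of this practice is the reserved-block percentage, which can be inspected and tuned with tune2fs; a sketch where /dev/sda1 is a placeholder device:

```bash
# Show reserved blocks (ext4 reserves 5% for root by default)
tune2fs -l /dev/sda1 | grep -i 'reserved block count'

# Set the reservation to 5% of the filesystem (requires root)
tune2fs -m 5 /dev/sda1
```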
Disk I/O saturation occurs when disk read/write bandwidth is fully consumed, creating contention for storage operations. Unlike disk space exhaustion (which is about capacity), I/O saturation is about throughput. Even with plenty of free space, a disk can only handle a limited number of operations per second (IOPS) or bytes per second (bandwidth).
I/O Saturation Characteristics:
I/O saturation manifests as increased latency for all disk operations. When I/O queues fill up, every read and write waits behind a growing backlog, and latency rises across the board.
SSD vs. HDD Behavior:
| Characteristic | HDD Under Saturation | SSD Under Saturation |
|---|---|---|
| Random IOPS | Collapses quickly | Maintains better |
| Sequential throughput | Better sustained | May throttle (heat) |
| Latency increase | Dramatic (seeks) | More gradual |
| Write amplification | Not applicable | Can worsen under load |
| Recovery | Quick once load drops | May need thermal cooldown |
```bash
# Using stress-ng for disk I/O stress

# Sequential write stress
stress-ng --hdd 4 --hdd-bytes 1G --timeout 5m

# Random I/O stress (more punishing)
stress-ng --hdd 4 --hdd-bytes 1G --hdd-opts wr-rnd,rd-rnd --timeout 5m

# Using fio for precise I/O workload generation
# Random 4K reads (database-like workload)
fio --name=random-read \
    --ioengine=libaio \
    --rw=randread \
    --bs=4k \
    --size=1G \
    --numjobs=4 \
    --runtime=300 \
    --iodepth=32 \
    --filename=/tmp/fio_test

# Random 4K writes (log/WAL-like workload)
fio --name=random-write \
    --ioengine=libaio \
    --rw=randwrite \
    --bs=4k \
    --size=1G \
    --numjobs=4 \
    --runtime=300 \
    --fsync=1 \
    --filename=/tmp/fio_test

# Mixed read/write workload
fio --name=mixed-io \
    --ioengine=libaio \
    --rw=randrw \
    --rwmixread=70 \
    --bs=4k \
    --size=2G \
    --numjobs=8 \
    --runtime=300 \
    --filename=/tmp/fio_test

# Kubernetes: IO chaos using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /var/lib/postgresql/data
  delay: "100ms"   # Add 100ms latency to all I/O
  percent: 100
  duration: "10m"
EOF

# Monitor I/O during experiment
iostat -xz 1
```

| Metric | Tool | Warning Threshold | Critical Threshold |
|---|---|---|---|
| I/O utilization % | iostat | > 70% | > 90% |
| Avg queue length (avgqu-sz) | iostat | > 2 | > 10 |
| Await (ms) | iostat | > 10ms | > 50ms |
| Read/Write IOPS | iostat | Approaching limit | At disk limit |
| I/O wait CPU % | top/vmstat | > 20% | > 50% |
| Disk bandwidth | iotop | Approaching limit | At disk limit |
Cloud storage volumes (EBS, Persistent Disks, Azure Disks) have explicit IOPS and throughput limits based on volume type and size. These limits are often lower than physical disk capabilities. Test I/O saturation at the cloud volume's configured limits, not the theoretical maximum.
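Before saturating a cloud volume, look up its provisioned limits; a hedged example using the AWS CLI, with a placeholder volume ID:

```bash
# Provisioned type, size, IOPS, and throughput for an EBS volume (ID is a placeholder)
aws ec2 describe-volumes \
  --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].{Type:VolumeType,SizeGiB:Size,Iops:Iops,ThroughputMiBps:Throughput}' \
  --output table
```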
File descriptors (FDs) are handles to open files, sockets, pipes, and other I/O resources. Every network connection, open file, and IPC channel consumes a file descriptor. Systems have limits on total FDs (system-wide) and per-process FDs. When these limits are reached, new connections and file opens fail.
FD Exhaustion Symptoms: 'Too many open files' (EMFILE) errors, failed accept() calls, inability to open new files or write logs, and connection failures even while CPU and memory look healthy.
FD exhaustion is particularly common in services that hold many concurrent connections, open files without closing them, or leak sockets on error paths.
```bash
# Check current FD limits
ulimit -n                      # Per-process limit
cat /proc/sys/fs/file-max      # System-wide limit

# Check current FD usage
ls /proc/$(pgrep -f my-service)/fd | wc -l   # FDs used by process
cat /proc/sys/fs/file-nr                     # System-wide: allocated, free, max

# Consume FDs by opening sockets (using Python)
python3 -c "
import socket
import time

sockets = []
host = 'localhost'
port = 8080

while True:
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, port))
        sockets.append(s)
        print(f'Opened {len(sockets)} connections')
    except Exception as e:
        print(f'Failed at {len(sockets)}: {e}')
        time.sleep(60)
        break
"

# Consume FDs by opening files
python3 -c "
import tempfile

files = []
while True:
    try:
        f = tempfile.TemporaryFile()
        files.append(f)
        if len(files) % 100 == 0:
            print(f'Opened {len(files)} files')
    except Exception as e:
        print(f'Failed at {len(files)}: {e}')
        break
"

# Temporarily reduce FD limit for testing
ulimit -n 256      # Very low limit for testing
./my-service       # Run with reduced limit

# Using stress-ng
stress-ng --handle 1000 --timeout 5m   # Create/destroy handles

# Monitor FD usage
watch -n 1 'ls /proc/$(pgrep -f my-service)/fd | wc -l'

# Detailed FD inventory
ls -l /proc/$(pgrep -f my-service)/fd
```

Detecting and Preventing FD Exhaustion:
| Prevention Strategy | Implementation | Benefit |
|---|---|---|
| Increase FD limits | ulimit -n, systemd limits | More headroom |
| Connection pooling | Database pools, HTTP client pools | Reuse connections |
| Connection timeouts | Idle connection pruning | Release unused FDs |
| File handle cleanup | Explicit close, try-with-resources | Prevent leaks |
| FD monitoring | Prometheus metrics on FD usage | Early warning |
| Graceful rejection | Reject requests when FDs near limit | Prevent total failure |
The 'lsof' command shows all open files and network connections for a process. During FD exhaustion testing, use 'lsof -p <pid>' to see exactly what's consuming FDs. Look for FD leaks—resources that should have been closed but weren't.
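A quick way to see where the descriptors are going is to group the lsof output by type; a small sketch, with my-service as a placeholder process name:

```bash
PID=$(pgrep -f my-service | head -n 1)   # 'my-service' is a placeholder

# Total open FDs, then a breakdown by type (IPv4/IPv6 sockets, REG files, FIFOs, ...)
ls /proc/$PID/fd | wc -l
lsof -p $PID | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn

# Sockets stuck in CLOSE_WAIT usually point at connections that were never closed
lsof -p $PID -a -i | grep -c CLOSE_WAIT
```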
In production, resource exhaustion rarely occurs in isolation. A memory leak leads to increased GC, which consumes CPU. CPU saturation leads to slower request processing, which leads to more concurrent connections, which consumes more FDs. Disk I/O saturation leads to blocked threads, which leads to thread pool exhaustion.
Common Resource Exhaustion Cascades:
| Initial Exhaustion | Secondary Effect | Tertiary Effect | Ultimate Failure |
|---|---|---|---|
| Memory pressure | Increased GC | CPU saturation | Request timeouts |
| CPU saturation | Slow processing | Queue buildup | Memory exhaustion |
| Disk I/O saturation | Blocked threads | Thread exhaustion | New requests rejected |
| FD exhaustion | Cannot connect | Requests fail | Circuit breakers trip |
| Thread exhaustion | Work queues grow | Memory exhaustion | OOM kill |
| Network bandwidth | Retry storms | CPU exhaustion | Complete unavailability |
```bash
# Compound resource exhaustion experiment
# Apply multiple stressors simultaneously

# CPU + Memory compound stress
stress-ng --cpu 4 --cpu-load 80 \
          --vm 2 --vm-bytes 1G \
          --timeout 10m &

# Memory + Disk I/O compound stress
stress-ng --vm 2 --vm-bytes 500M \
          --hdd 2 --hdd-bytes 2G \
          --timeout 10m &

# All major resources simultaneously
stress-ng --cpu 2 --cpu-load 70 \
          --vm 1 --vm-bytes 500M \
          --hdd 1 --hdd-bytes 1G \
          --timeout 10m &

# Kubernetes: Combined stress chaos
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: compound-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: my-service
  stressors:
    cpu:
      workers: 2
      load: 80
    memory:
      workers: 1
      size: "512MB"
  duration: "10m"
EOF

# Monitor all resources during experiment
vmstat 1 | tee vmstat.log &
iostat -xz 1 | tee iostat.log &
watch -n 1 "ps aux | grep my-service"
```

Before testing compound resource exhaustion, ensure you understand single-resource behavior. Compound experiments produce complex interactions that are hard to analyze if you don't have baseline data for each resource type individually. Build up complexity gradually.
Resource exhaustion testing reveals how your system degrades under finite resource constraints. Unlike binary failures (up or down), resource exhaustion creates a spectrum of degradation that's often harder to detect and handle correctly.
What's Next:
With network, service, and resource failures covered, we'll now examine the most subtle failure type: Clock Skew. Time-related failures exploit hidden assumptions about clock consistency that pervade distributed systems, causing failures that are difficult to diagnose and often impossible to reproduce in a lab environment.
You now understand how to inject and analyze resource exhaustion scenarios—CPU saturation, memory pressure, disk exhaustion, I/O saturation, and file descriptor limits. These techniques reveal gradual degradation patterns that are often invisible until they cause outages. Next, we'll explore clock skew injection.