Resource exhaustion failures are fundamentally different from network or service failures. When a network partitions or a service crashes, the failure is typically abrupt and detectable—systems either work or they don't. But when resources are exhausted, the failure is gradual and insidious.
A system running low on memory doesn't suddenly stop. It slows down as garbage collection consumes more CPU. It starts swapping to disk. Response times drift from milliseconds to seconds. Eventually, the OOM killer terminates processes, but by then, the damage is done—users have experienced degraded service for minutes or hours.
Resource exhaustion creates a particularly dangerous failure pattern: the system is sick but not dead. Health checks may pass. The service responds to pings. Yet actual user requests time out or fail. This zombie state can persist far longer than a clean failure because automated recovery mechanisms don't recognize the problem.
By the end of this page, you will understand how to inject and analyze resource exhaustion scenarios: CPU saturation, memory pressure, disk exhaustion, disk I/O saturation, and file descriptor exhaustion. You'll learn to observe system behavior as resources deplete, identify the tipping points where degradation becomes failure, and implement protective mechanisms.
Before injecting resource exhaustion, we must understand what resources our systems consume and what limits constrain them. In containerized environments, resources operate at multiple levels:
Host-Level Resources: CPU cores, physical memory, disk capacity and I/O bandwidth, network bandwidth, and kernel-wide limits such as the system file descriptor table.
Container-Level Resources: cgroup CPU and memory limits, ephemeral storage, and per-container process and file descriptor limits.
Application-Level Resources: heap size, thread pools, connection pools, and open file handles.
| Resource | Early Warning Signs | Degradation Behavior | Failure Mode |
|---|---|---|---|
| CPU | Increased latency, queue buildup | Slow processing, timeouts | Request dropping, cascading failures |
| Memory | Increased GC, swapping | Very slow response, thrashing | OOM kill, sudden crash |
| Disk Space | Disk alerts, write failures | Read-only filesystem | Database corruption, crash |
| Disk I/O | Increased latency for I/O ops | Database slowness, log delays | Complete I/O stall |
| File Descriptors | Connection errors | Cannot accept new connections | Service becomes unreachable |
| Thread Pool | Request queuing | Timeout errors | Complete inability to handle requests |
Before conducting resource exhaustion experiments, document the configured limits for your target system. Container memory limits, connection pool sizes, and thread counts all affect how the system behaves under resource pressure. Observing behavior near these limits is exactly what we're testing.
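A small shell sketch like the following can capture those limits in one snapshot before an experiment; the cgroup v2 paths and the my-service process name are assumptions to adapt to your environment:

```bash
#!/usr/bin/env bash
# Snapshot resource limits before an experiment (Linux; 'my-service' is a placeholder).

echo "== Host level =="
nproc                              # available CPU cores
free -h                            # total and available memory
df -h /                            # root filesystem capacity
cat /proc/sys/fs/file-max          # system-wide file descriptor limit

echo "== Container level (cgroup v2; run inside the container) =="
cat /sys/fs/cgroup/memory.max 2>/dev/null   # memory limit, or 'max' if unlimited
cat /sys/fs/cgroup/cpu.max 2>/dev/null      # CPU quota and period

echo "== Process level =="
PID=$(pgrep -f my-service | head -n 1)
grep -E 'open files|processes' /proc/$PID/limits
```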
CPU saturation occurs when computational demand exceeds available processing capacity. Unlike memory exhaustion (which can cause abrupt failure), CPU saturation causes gradual degradation—work takes longer, queues build up, and latency increases proportionally to the degree of saturation.
How CPU Saturation Manifests:
| Saturation Level | Observable Behavior | System State |
|---|---|---|
| 50-70% | Normal operation, headroom for spikes | Healthy |
| 70-85% | Slight latency increase, less spike tolerance | Yellow alert |
| 85-95% | Noticeable latency, queue buildup | Degraded |
| 95-100% | Severe latency, timeouts begin | Critical |
| 100% sustained | Timeouts, cascading failures, potential crashes | Failure |
Importantly, CPU saturation interacts with other resources. As the CPU works harder, garbage collection runs more frequently (consuming more CPU), heat increases (potentially causing throttling), and context switching overhead grows. The relationship between CPU usage and performance is non-linear.
```bash
# Using stress-ng for CPU saturation (Linux)

# Saturate all CPUs for 5 minutes
stress-ng --cpu 0 --timeout 5m

# Saturate with specific number of workers (4 CPUs)
stress-ng --cpu 4 --cpu-load 100 --timeout 5m

# Gradually increase CPU load
for load in 25 50 75 90 100; do
  echo "Setting CPU load to $load%"
  stress-ng --cpu 0 --cpu-load $load --timeout 30s
done

# Target specific CPU methods (tests different CPU characteristics)
stress-ng --cpu 4 --cpu-method matrixprod --timeout 5m   # Matrix multiplication
stress-ng --cpu 4 --cpu-method fft --timeout 5m          # FFT (floating point)

# Kubernetes: Using Chaos Mesh CPU stress
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: all
  selector:
    labelSelectors:
      app: my-service
  stressors:
    cpu:
      workers: 4
      load: 90        # 90% CPU load
  duration: "10m"
EOF

# Docker: CPU stress using container
docker run --rm -it --cpus="2" \
  alexeiled/stress-ng --cpu 2 --cpu-load 100 --timeout 5m
```

What to Observe During CPU Saturation:
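The exact signals depend on your stack, but a minimal host-side sketch (assuming a Linux kernel with pressure stall information, PSI, enabled) for watching saturation while the stressor runs:

```bash
# Load average vs. core count: sustained load above the core count means queued work
uptime
nproc

# Run queue length (the 'r' column) and per-second CPU breakdown
vmstat 1 5

# Pressure stall information: share of time runnable tasks were stalled waiting for CPU
cat /proc/pressure/cpu

# Top CPU consumers at this moment
top -b -n 1 | head -n 20
```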
In Kubernetes with CPU limits set, a container that exceeds its CPU limit is throttled—its processes are paused for portions of each scheduling period. This causes unpredictable latency spikes even when the host has CPU available. Observe container throttling metrics (container_cpu_cfs_throttled_seconds_total, container_cpu_cfs_throttled_periods_total) alongside system CPU metrics.
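A quick way to check whether throttling, rather than raw host CPU, is the bottleneck is to look at the CFS throttling counters; a hedged sketch, assuming Prometheus scrapes cAdvisor and using placeholder pod and deployment names:

```bash
# PromQL (e.g. in Grafana): fraction of CFS periods in which the container was throttled
#   rate(container_cpu_cfs_throttled_periods_total{pod=~"my-service.*"}[5m])
#     / rate(container_cpu_cfs_periods_total{pod=~"my-service.*"}[5m])

# Or read the counters straight from the container's cgroup (cgroup v2 layout shown):
# nr_throttled = number of throttled periods, throttled_usec = total time throttled
kubectl exec deploy/my-service -- cat /sys/fs/cgroup/cpu.stat
```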
Memory pressure occurs when available memory decreases to the point where the system must work to reclaim it. In garbage-collected languages (Java, Go, Python), this triggers more frequent and longer GC cycles. In all systems, extreme memory pressure can trigger swapping to disk or, ultimately, the OOM killer.
The Memory Pressure Spectrum:
```
[Normal] → [Minor GC] → [Major GC] → [Swapping] → [Thrashing] → [OOM Kill]
   ↑           ↑            ↑            ↑             ↑            ↑
Plenty      GC runs     Stop-the-     Severe        System       Process
of room   more often    world GC     slowness      unusable    terminated
```
The transition from 'increased GC' to 'OOM kill' can happen rapidly. A system that's been running with 80% memory usage for weeks might suddenly crash because a traffic spike needed that last 20%.
```bash
# Using stress-ng for memory pressure (Linux)

# Allocate 2GB of memory for 5 minutes
stress-ng --vm 1 --vm-bytes 2G --timeout 5m

# Allocate percentage of available memory
stress-ng --vm 1 --vm-bytes 80% --timeout 5m

# Allocate with continuous memory operations (more realistic)
stress-ng --vm 2 --vm-bytes 1G --vm-keep --timeout 5m

# Gradually increase memory pressure
for percent in 50 60 70 80 90; do
  echo "Allocating ${percent}% of memory"
  timeout 30s stress-ng --vm 1 --vm-bytes ${percent}%
  sleep 5
done

# Kubernetes: Using Chaos Mesh memory stress
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: my-service
  stressors:
    memory:
      workers: 1
      size: "256MB"        # Allocate 256MB per worker
      oomScoreAdj: -1000   # Protect stress process from OOM
  duration: "5m"
EOF

# Docker: Memory stress in container
docker run --rm -it --memory="512m" \
  alexeiled/stress-ng --vm 1 --vm-bytes 450M --timeout 5m

# Simulate memory leak (Python example)
python3 -c "
import time

data = []
while True:
    data.append('X' * 1024 * 1024)   # Add 1MB per iteration
    time.sleep(1)
    print(f'Allocated {len(data)}MB')
"
```

| Metric | Normal Range | Warning Level | Critical Level |
|---|---|---|---|
| Memory utilization | < 70% | 70-85% | > 85% |
| GC pause time (Java) | < 100ms | 100-500ms | > 500ms |
| GC frequency | Normal for app | 2x baseline | 5x+ baseline |
| Swap usage | 0% | Any swap | > 10% swap |
| OOM events | 0 | N/A | Any OOM |
| Container memory % | < 80% | 80-95% | > 95% |
When a container exceeds its memory limit, Kubernetes kills and restarts it. Unlike graceful shutdown, this is abrupt—in-flight requests are lost, connections are dropped, and there's no cleanup. Memory pressure testing reveals whether your application gracefully sheds load before hitting the limit or crashes suddenly.
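After a memory-pressure experiment, it is worth confirming whether any restarts were OOM kills rather than ordinary crashes; a sketch assuming the workload is labeled app=my-service:

```bash
# Restart count and last termination reason per pod (OOMKilled means the limit was hit)
kubectl get pods -l app=my-service \
  -o custom-columns='POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,LAST_REASON:.status.containerStatuses[0].lastState.terminated.reason'

# The same information in human-readable form
kubectl describe pod -l app=my-service | grep -A 3 'Last State'

# On the node, the kernel log records OOM killer activity
dmesg -T | grep -i 'killed process'
```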
Disk space exhaustion occurs when file storage is filled to capacity. This can happen gradually (logs accumulating, data growing) or suddenly (large file creation, backup copies, a log explosion during an incident). Disk exhaustion affects different components differently:
Components Affected by Disk Exhaustion:
| Component | Effect When Disk is Full | Typical Consequence |
|---|---|---|
| Application logs | Can't write logs | Blind debugging during incidents |
| Database | Can't write data | Transaction failures, corruption |
| Temp files | Can't create temp files | Processing failures |
| Container overlay | Container can't write | Pod eviction |
| System journals | System logs lost | Unable to diagnose issues |
| Docker/containerd | Can't pull images, create containers | Deployment failures |
Disk exhaustion is particularly dangerous because many systems assume writes will succeed. A database that can't write its transaction log may corrupt data. An application that can't write to temp storage may crash in unexpected ways.
```bash
# Fill disk space using dd
# WARNING: Use with caution, creates large files

# Create a 10GB file to fill disk
dd if=/dev/zero of=/tmp/fill_disk bs=1G count=10

# Create file of specific size to reach target usage
CURRENT_USAGE=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
DISK_SIZE_MB=$(df -m / | awk 'NR==2 {print $2}')
TARGET=90
FILL_MB=$(( (TARGET - CURRENT_USAGE) * DISK_SIZE_MB / 100 ))
dd if=/dev/zero of=/tmp/fill_disk bs=1M count=$FILL_MB

# Using stress-ng for disk stress
stress-ng --hdd 2 --hdd-bytes 5G --timeout 5m

# Fill ephemeral storage in Kubernetes (run in container)
dd if=/dev/zero of=/app/fill_ephemeral bs=1M count=500

# Kubernetes: Disk fill chaos using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: disk-fill
spec:
  action: fill
  mode: one
  selector:
    labelSelectors:
      app: my-service
  volumePath: /var/log
  containerName: my-container
  size: "500MB"
  duration: "5m"
EOF

# Clean up disk fill
rm /tmp/fill_disk

# Monitor disk usage during experiment
watch -n 1 'df -h; echo "---"; ls -lah /tmp/fill_*'

# Container-based disk filling
docker run --rm -v /tmp:/fill alpine \
  sh -c 'dd if=/dev/zero of=/fill/large_file bs=1M count=5000'
```

Observations During Disk Exhaustion:
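What you can observe here depends on the components involved; a minimal host-side sketch (Linux, GNU coreutils) to run while the disk fills:

```bash
# Capacity and inode usage (inode exhaustion also reports 'No space left on device')
df -h
df -i

# Kernel and service logs for write failures or read-only remounts
dmesg -T | grep -iE 'no space|read-only|I/O error'
journalctl -p err --since "10 minutes ago" | grep -iE 'enospc|no space'

# Which directories on this filesystem are growing fastest
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 10
```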
Many organizations reserve disk space (e.g., 10-20% or a fixed amount) for emergency operations. Without reserved space, you can't SSH into a system, write logs, or even delete files to free space. Test whether your disk exhaustion procedures work when the disk is actually full.
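On ext4 filesystems, part of this practice is the reserved-block percentage, which can be inspected and tuned with tune2fs; a sketch where /dev/sda1 is a placeholder device:

```bash
# Show reserved blocks (ext4 reserves 5% for root by default)
tune2fs -l /dev/sda1 | grep -i 'reserved block count'

# Set the reservation to 5% of the filesystem (requires root)
tune2fs -m 5 /dev/sda1
```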
Disk I/O saturation occurs when disk read/write bandwidth is fully consumed, creating contention for storage operations. Unlike disk space exhaustion (which is about capacity), I/O saturation is about throughput. Even with plenty of free space, a disk can only handle a limited number of operations per second (IOPS) or bytes per second (bandwidth).
I/O Saturation Characteristics:
I/O saturation manifests as increased latency for all disk operations. When I/O queues fill up, every read and write waits behind a growing backlog, and latency rises across the board.
SSD vs. HDD Behavior:
| Characteristic | HDD Under Saturation | SSD Under Saturation |
|---|---|---|
| Random IOPS | Collapses quickly | Maintains better |
| Sequential throughput | Better sustained | May throttle (heat) |
| Latency increase | Dramatic (seeks) | More gradual |
| Write amplification | Not applicable | Can worsen under load |
| Recovery | Quick once load drops | May need thermal cooldown |
```bash
# Using stress-ng for disk I/O stress

# Sequential write stress
stress-ng --hdd 4 --hdd-bytes 1G --timeout 5m

# Random I/O stress (more punishing)
stress-ng --hdd 4 --hdd-bytes 1G --hdd-opts wr-rnd,rd-rnd --timeout 5m

# Using fio for precise I/O workload generation
# Random 4K reads (database-like workload)
fio --name=random-read \
    --ioengine=libaio \
    --rw=randread \
    --bs=4k \
    --size=1G \
    --numjobs=4 \
    --runtime=300 \
    --iodepth=32 \
    --filename=/tmp/fio_test

# Random 4K writes (log/WAL-like workload)
fio --name=random-write \
    --ioengine=libaio \
    --rw=randwrite \
    --bs=4k \
    --size=1G \
    --numjobs=4 \
    --runtime=300 \
    --fsync=1 \
    --filename=/tmp/fio_test

# Mixed read/write workload
fio --name=mixed-io \
    --ioengine=libaio \
    --rw=randrw \
    --rwmixread=70 \
    --bs=4k \
    --size=2G \
    --numjobs=8 \
    --runtime=300 \
    --filename=/tmp/fio_test

# Kubernetes: IO chaos using Chaos Mesh
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency
spec:
  action: latency
  mode: one
  selector:
    labelSelectors:
      app: database
  volumePath: /var/lib/postgresql/data
  delay: "100ms"   # Add 100ms latency to all I/O
  percent: 100
  duration: "10m"
EOF

# Monitor I/O during experiment
iostat -xz 1
```

| Metric | Tool | Warning Threshold | Critical Threshold |
|---|---|---|---|
| I/O utilization % | iostat | > 70% | > 90% |
| Avg queue length (avgqu-sz) | iostat | > 2 | > 10 |
| Await (ms) | iostat | > 10ms | > 50ms |
| Read/Write IOPS | iostat | Approaching limit | At disk limit |
| I/O wait CPU % | top/vmstat | > 20% | > 50% |
| Disk bandwidth | iotop | Approaching limit | At disk limit |
Cloud storage volumes (EBS, Persistent Disks, Azure Disks) have explicit IOPS and throughput limits based on volume type and size. These limits are often lower than physical disk capabilities. Test I/O saturation at the cloud volume's configured limits, not the theoretical maximum.
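Before saturating a cloud volume, look up its provisioned limits; a hedged example using the AWS CLI, with a placeholder volume ID:

```bash
# Provisioned type, size, IOPS, and throughput for an EBS volume (ID is a placeholder)
aws ec2 describe-volumes \
  --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].{Type:VolumeType,SizeGiB:Size,Iops:Iops,ThroughputMiBps:Throughput}' \
  --output table
```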
File descriptors (FDs) are handles to open files, sockets, pipes, and other I/O resources. Every network connection, open file, and IPC channel consumes a file descriptor. Systems have limits on total FDs (system-wide) and per-process FDs. When these limits are reached, new connections and file opens fail.
FD Exhaustion Symptoms: 'Too many open files' (EMFILE) errors, failed accept() calls, inability to open new files or write logs, and connection failures even while CPU and memory look healthy.
FD exhaustion is particularly common in services that hold many concurrent connections, open files without closing them, or leak sockets on error paths.
```bash
# Check current FD limits
ulimit -n                      # Per-process limit
cat /proc/sys/fs/file-max      # System-wide limit

# Check current FD usage
ls /proc/$(pgrep -f my-service)/fd | wc -l   # FDs used by process
cat /proc/sys/fs/file-nr                     # System-wide: allocated, free, max

# Consume FDs by opening sockets (using Python)
python3 -c "
import socket
import time

sockets = []
host = 'localhost'
port = 8080

while True:
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, port))
        sockets.append(s)
        print(f'Opened {len(sockets)} connections')
    except Exception as e:
        print(f'Failed at {len(sockets)}: {e}')
        time.sleep(60)
        break
"

# Consume FDs by opening files
python3 -c "
import tempfile

files = []
while True:
    try:
        f = tempfile.TemporaryFile()
        files.append(f)
        if len(files) % 100 == 0:
            print(f'Opened {len(files)} files')
    except Exception as e:
        print(f'Failed at {len(files)}: {e}')
        break
"

# Temporarily reduce FD limit for testing
ulimit -n 256      # Very low limit for testing
./my-service       # Run with reduced limit

# Using stress-ng
stress-ng --handle 1000 --timeout 5m   # Create/destroy handles

# Monitor FD usage
watch -n 1 'ls /proc/$(pgrep -f my-service)/fd | wc -l'

# Detailed FD inventory
ls -l /proc/$(pgrep -f my-service)/fd
```

Detecting and Preventing FD Exhaustion:
| Prevention Strategy | Implementation | Benefit |
|---|---|---|
| Increase FD limits | ulimit -n, systemd limits | More headroom |
| Connection pooling | Database pools, HTTP client pools | Reuse connections |
| Connection timeouts | Idle connection pruning | Release unused FDs |
| File handle cleanup | Explicit close, try-with-resources | Prevent leaks |
| FD monitoring | Prometheus metrics on FD usage | Early warning |
| Graceful rejection | Reject requests when FDs near limit | Prevent total failure |
The 'lsof' command shows all open files and network connections for a process. During FD exhaustion testing, use 'lsof -p <pid>' to see exactly what's consuming FDs. Look for FD leaks—resources that should have been closed but weren't.
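A quick way to see where the descriptors are going is to group the lsof output by type; a small sketch, with my-service as a placeholder process name:

```bash
PID=$(pgrep -f my-service | head -n 1)   # 'my-service' is a placeholder

# Total open FDs, then a breakdown by type (IPv4/IPv6 sockets, REG files, FIFOs, ...)
ls /proc/$PID/fd | wc -l
lsof -p $PID | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn

# Sockets stuck in CLOSE_WAIT usually point at connections that were never closed
lsof -p $PID -a -i | grep -c CLOSE_WAIT
```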
In production, resource exhaustion rarely occurs in isolation. A memory leak leads to increased GC, which consumes CPU. CPU saturation leads to slower request processing, which leads to more concurrent connections, which consumes more FDs. Disk I/O saturation leads to blocked threads, which leads to thread pool exhaustion.
Common Resource Exhaustion Cascades:
| Initial Exhaustion | Secondary Effect | Tertiary Effect | Ultimate Failure |
|---|---|---|---|
| Memory pressure | Increased GC | CPU saturation | Request timeouts |
| CPU saturation | Slow processing | Queue buildup | Memory exhaustion |
| Disk I/O saturation | Blocked threads | Thread exhaustion | New requests rejected |
| FD exhaustion | Cannot connect | Requests fail | Circuit breakers trip |
| Thread exhaustion | Work queues grow | Memory exhaustion | OOM kill |
| Network bandwidth | Retry storms | CPU exhaustion | Complete unavailability |
```bash
# Compound resource exhaustion experiment
# Apply multiple stressors simultaneously

# CPU + Memory compound stress
stress-ng --cpu 4 --cpu-load 80 \
          --vm 2 --vm-bytes 1G \
          --timeout 10m &

# Memory + Disk I/O compound stress
stress-ng --vm 2 --vm-bytes 500M \
          --hdd 2 --hdd-bytes 2G \
          --timeout 10m &

# All major resources simultaneously
stress-ng --cpu 2 --cpu-load 70 \
          --vm 1 --vm-bytes 500M \
          --hdd 1 --hdd-bytes 1G \
          --timeout 10m &

# Kubernetes: Combined stress chaos
cat <<EOF | kubectl apply -f -
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: compound-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: my-service
  stressors:
    cpu:
      workers: 2
      load: 80
    memory:
      workers: 1
      size: "512MB"
  duration: "10m"
EOF

# Monitor all resources during experiment
vmstat 1 | tee vmstat.log &
iostat -xz 1 | tee iostat.log &
watch -n 1 "ps aux | grep my-service"
```

Before testing compound resource exhaustion, ensure you understand single-resource behavior. Compound experiments produce complex interactions that are hard to analyze if you don't have baseline data for each resource type individually. Build up complexity gradually.
Resource exhaustion testing reveals how your system degrades under finite resource constraints. Unlike binary failures (up or down), resource exhaustion creates a spectrum of degradation that's often harder to detect and handle correctly.
What's Next:
With network, service, and resource failures covered, we'll now examine the most subtle failure type: Clock Skew. Time-related failures exploit hidden assumptions about clock consistency that pervade distributed systems, causing failures that are difficult to diagnose and often impossible to reproduce in a lab environment.
You now understand how to inject and analyze resource exhaustion scenarios—CPU saturation, memory pressure, disk exhaustion, I/O saturation, and file descriptor limits. These techniques reveal gradual degradation patterns that are often invisible until they cause outages. Next, we'll explore clock skew injection.