Here is a paradox that confounds system administrators and misleads monitoring systems: during thrashing, when the system is overwhelmed and barely functional, CPU utilization drops dramatically. The processors sit idle while the system fails to make progress.
This counter-intuitive behavior makes thrashing particularly dangerous because standard monitoring tools give the wrong signal. Alert systems that trigger on high CPU usage will never fire. Automated scaling systems see low CPU and don't add resources. Human operators see idle CPUs and assume there's no problem—or worse, assume the system is under-loaded and add more work.
This page explains why CPU utilization drops during thrashing and how to interpret CPU metrics correctly.
By the end of this page, you will understand why thrashing causes CPU utilization to drop, how to distinguish thrashing-induced idle time from genuine under-utilization, the relationship between I/O wait and CPU idle time, and how to build monitoring systems that detect thrashing despite misleading CPU metrics.
CPU utilization is traditionally measured as the fraction of time the CPU spends executing code (user or kernel) versus sitting idle. In a well-tuned system, high utilization indicates the CPU is doing useful work.
During thrashing, CPU utilization collapses because all processes are blocked waiting for page fault service. No process is ready to run, so the CPU has nothing to execute.
Traditional monitoring interprets low CPU utilization as "system has capacity." During thrashing, the signal is exactly reversed: low CPU utilization means the system is so overloaded that every process is blocked waiting on paging I/O.
The same metric means opposite things depending on context. Without additional information, it's impossible to distinguish the two states from CPU utilization alone.
The Timeline of CPU Behavior During Thrashing:
Time →
┌───────────────────────────────────────────────────────────────┐
│ NORMAL OPERATION │
│ │
│ Process A: [RUNNING]────[RUNNING]────[RUNNING]────[RUNNING] │
│ Process B: [RUNNING]────[RUNNING]────[RUNNING] │
│ Process C: [RUNNING]────[RUNNING]──── │
│ CPU: ████████████████████████████████████████████ (~90% used) │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ THRASHING │
│ │
│ Process A: [RUN][WAIT──────────────][RUN][WAIT────────────] │
│ Process B: [R][WAIT──────────────][R][WAIT──────────] │
│ Process C: [R][WAIT──────────────][R][WAIT────────] │
│ CPU: ██░░░░░░░░░░░░██░░░░░░░░░░░░██░░░░░░░░░░ (~10% used) │
│ ↑ ↑ ↑ │
│ Brief run periods separated by long I/O waits │
└───────────────────────────────────────────────────────────────┘
Legend: [RUNNING] = Executing on CPU
[WAIT] = Blocked waiting for page fault I/O
████ = CPU busy
░░░░ = CPU idle
In the thrashing scenario, each process runs briefly (touching a few pages), immediately faults when it accesses a page not in memory, and then blocks waiting for the disk to load the page. With all processes blocked, the CPU sits idle—even though the system is overloaded.
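The arithmetic behind the diagram is simple: if each run burst lasts t_run before faulting, and each fault blocks the process for t_fault, the CPU can be busy at most t_run / (t_run + t_fault) of the time. A minimal sketch, with illustrative (not measured) burst and service times:

```python
def cpu_utilization(run_burst_ms: float, fault_service_ms: float) -> float:
    """Upper bound on CPU busy fraction when every run burst ends in a
    page fault that blocks for fault_service_ms (single-process view)."""
    return run_burst_ms / (run_burst_ms + fault_service_ms)

# Normal: long bursts between faults -> CPU stays busy
print(f"{cpu_utilization(100.0, 0.1):.0%}")   # ~100%

# Thrashing: 0.5 ms of work per 8 ms disk read -> ~6%
print(f"{cpu_utilization(0.5, 8.0):.0%}")
```

With many processes the blocked intervals overlap, but disk bandwidth is shared, so the aggregate picture stays the same: utilization is capped by the ratio of compute time to fault-service time.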
To understand the CPU utilization paradox, we need to examine how CPU time is categorized. Operating systems track CPU time in multiple categories:
| Category | Description | During Normal Operation | During Thrashing |
|---|---|---|---|
| User (%us) | Time in user-space code | 40-70% | 5-15% |
| System (%sy) | Time in kernel code | 5-20% | 15-30% |
| Nice (%ni) | Low-priority user processes | 0-10% | Near 0% |
| Idle (%id) | No runnable processes | 10-40% | 10-30% |
| I/O Wait (%wa) | Waiting for I/O completion | 1-10% | 40-70% |
| IRQ (%hi) | Hardware interrupt handling | 0-2% | 1-5% |
| SoftIRQ (%si) | Software interrupt handling | 0-2% | 2-5% |
| Steal (%st) | Time stolen by hypervisor | 0-10% | Varies |
The key to identifying thrashing is I/O Wait (%wa). This measures time the CPU sits idle while processes are blocked waiting for I/O to complete. During thrashing, %wa climbs from single digits to 40-70%, because nearly every process is blocked on page fault reads.
High I/O Wait combined with high page fault rate is the signature of thrashing.
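As a sketch of where %wa comes from, the kernel's cumulative tick counters in /proc/stat can be turned into percentages. The sample line below is fabricated to match the "top" output later in this page:

```python
def iowait_pct(cpu_line: str) -> float:
    """Percentage of ticks spent in iowait. Field order in /proc/stat:
    user nice system idle iowait irq softirq steal ..."""
    vals = [int(x) for x in cpu_line.split()[1:]]
    return vals[4] * 100.0 / sum(vals)

# Hypothetical snapshot: 1000 total ticks, 568 of them in iowait
sample = "cpu 81 0 224 123 568 4 0 0"
print(f"iowait: {iowait_pct(sample):.1f}%")   # iowait: 56.8%
```

Real monitoring takes two snapshots and differences them, since the counters are cumulative since boot; the single-snapshot version here is just to show the field layout.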
Interpreting top/htop Output:
┌── NORMAL OPERATION ──────────────────────────────────────────┐
│ top - 14:32:05 up 5 days, 3:12, 4 users, load average: 2.5 │
│ Tasks: 245 total, 3 running, 242 sleeping, 0 stopped │
│ %Cpu(s): 65.2 us, 12.3 sy, 0.0 ni, 20.5 id, 1.2 wa, 0.8 hi │
│ │ │ │ │ │ │
│ └─ User └─ System | └─ Low I/O Wait ← GOOD │
│ └─ Some idle is normal │
└──────────────────────────────────────────────────────────────┘
┌── THRASHING ─────────────────────────────────────────────────┐
│ top - 14:35:22 up 5 days, 3:15, 4 users, load average: 28.7 │
│ Tasks: 245 total, 0 running, 245 sleeping, 0 stopped │
│ %Cpu(s): 8.1 us, 22.4 sy, 0.0 ni, 12.3 id, 56.8 wa, 0.4 hi │
│ │ │ │ │
│ └─ Low! └─ Higher kernel time └─ VERY HIGH! ← BAD│
│ (page fault handling) │
│ │
│ Note: 0 running processes despite high load average! │
└──────────────────────────────────────────────────────────────┘
The thrashing signature: low user time (%us), elevated system time (%sy) from page fault handling, very high I/O wait (%wa), and zero running processes despite a high load average.
Understanding what the scheduler sees during thrashing explains both the CPU utilization drop and why standard scheduler responses make things worse.
Linux load average includes processes in uninterruptible sleep (D state)—typically waiting for disk I/O. During thrashing, the load average soars, because every blocked process still counts toward it, even as CPU utilization collapses.
This divergence between load average and CPU utilization is a strong thrashing indicator. In normal operation, they correlate; in thrashing, they diverge dramatically.
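One way to quantify the divergence is the ratio of per-core load to CPU busy fraction. This is an illustrative heuristic, not a standard metric, and the sample values are taken from the top screenshots above assuming an 8-core machine:

```python
def load_cpu_divergence(load_1min: float, cpu_busy_pct: float,
                        n_cores: int) -> float:
    """Ratio of per-core load to CPU busy fraction. Near 1 when load is
    CPU-bound; much greater than 1 when tasks sit blocked in D state."""
    busy = max(cpu_busy_pct / 100.0, 0.01)   # guard against division by zero
    return (load_1min / n_cores) / busy

print(load_cpu_divergence(2.5, 78.0, 8))    # normal: ~0.4
print(load_cpu_divergence(28.7, 30.0, 8))   # thrashing: ~12
```

Values near 1 mean the load is explained by CPU contention; values far above 1 mean most of the "load" is processes waiting on I/O.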
The Scheduler's Dilemma:
┌────────────────────────────────────────────────────────────┐
│ SCHEDULER'S STATE │
├────────────────────────────────────────────────────────────┤
│ │
│ Run Queue: [ ] ← Empty! No runnable tasks │
│ │
│ Wait Queue: [A][B][C][D][E][F][G][H][I][J][K][L][M]... │
│ ↑ │
│ All waiting for page fault I/O │
│ │
│ Decision Point: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ "Run queue empty, load average high... │ │
│ │ Must need more processes!" │ │
│ │ │ │
│ │ Action: Admit new process from job queue │ │
│ │ │ │
│ │ Result: New process takes frames from others, │ │
│ │ those processes now fault more, │ │
│ │ situation worsens. │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
The scheduler's optimization target (CPU utilization) leads it to exactly the wrong action during thrashing.
Both genuine underload and thrashing produce low CPU utilization. Distinguishing them is critical for choosing the correct response (add work vs. reduce work).
| Metric | Genuine Underload | Thrashing |
|---|---|---|
| CPU Utilization | Low (10-30%) | Low (10-30%) |
| CPU I/O Wait | Low (< 5%) | High (30-70%) |
| Load Average | Low (< 1.0) | High (>> CPU cores) |
| Page Fault Rate | Low (< 100/s) | Very high (> 1000/s) |
| Disk I/O | Low or moderate | Saturated |
| Memory Pressure | Low (plenty free) | Severe (near 100%) |
| Response Time | Fast | Very slow |
| Runnable Processes | Few—light workload | Zero—all blocked |
| Process State | Mostly sleeping | Mostly D state (I/O wait) |
A fast diagnostic for Linux systems:
```bash
# Check for thrashing indicators
vmstat 1 3
```
Look at:

- si / so — swap-in and swap-out rates (pages/s)
- wa — I/O wait percentage
- r — runnable processes, b — processes blocked on I/O

If si/so are high, wa is high, r is 0, and b is high → thrashing. If si/so are 0, wa is low, r is 0, and b is 0 → genuine underload.
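That decision rule can be written down directly. A sketch, with thresholds chosen for illustration rather than taken from any standard:

```python
def diagnose_vmstat(si: int, so: int, wa: int, r: int, b: int) -> str:
    """Map vmstat columns (swap-in, swap-out, iowait %, runnable,
    blocked) to a diagnosis. Thresholds are illustrative."""
    if (si > 100 or so > 100) and wa > 30 and r == 0 and b > 3:
        return "thrashing"
    if si == 0 and so == 0 and wa < 5 and r == 0 and b == 0:
        return "underload"
    return "indeterminate"

print(diagnose_vmstat(si=2400, so=1800, wa=57, r=0, b=14))  # thrashing
print(diagnose_vmstat(si=0, so=0, wa=1, r=0, b=0))          # underload
```

Anything that matches neither pattern deserves a closer look with the fuller metric set in the table above.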
The CPU utilization drop directly causes the throughput collapse. When the CPU can't execute processes, no work gets done—regardless of how many processes are waiting.
Throughput vs. Multiprogramming Curve:
Throughput
▲
│ ┌── Optimal Point
│ │
Max ┤ ╱──────────╲
│ ╱ ╲
│ ╱ Plateau ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲ ← Thrashing Zone
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲───────
└────────┬─────────┬───────────┬──────► Multiprogramming
│ │ │ Level
CPU Memory Thrashing
Bound Bound Collapse
────────────────────────────────────────────────────
CPU Utilization
▲
│
100% ┤ ╱────────────────╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲
│ ╲
│ ╲
│ ╲─────────
└────────────────────────────────────► Multiprogramming
Level
Both curves show the same phenomenon: above the optimal point, adding more work reduces both CPU utilization and throughput. The counterintuitive result is that the system does less total work with more processes.
The transition from "optimal" to "thrashing" is often abrupt:
A small increase in load can cause a 100x drop in throughput. This non-linear collapse is why thrashing is so dangerous.
Real-World Example:
| Load Level | Processes | CPU% | Throughput (req/s) | P99 Latency |
|---|---|---|---|---|
| Light | 4 | 40% | 400 | 50ms |
| Normal | 8 | 75% | 700 | 100ms |
| Heavy | 12 | 90% | 800 | 200ms |
| Overload | 16 | 85% | 500 | 1s |
| Thrashing | 20 | 40% | 50 | 30s |
| Severe | 25 | 15% | 5 | 5min |
Note how CPU and throughput move together initially, then CPU drops while work continues to increase. At 25 processes, CPU is 15% ("underloaded") but throughput is 5 req/s (catastrophic).
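A toy model reproduces the shape of this table: give N processes an equal share of a fixed frame pool, and let the miss rate rise once a process's share falls below its working set. Every parameter below is invented for illustration; the model also ignores the core-count ceiling, since its only point is the cliff once per-process frames drop below the working set.

```python
def throughput(n_procs: int, frames: int = 100, working_set: int = 12,
               cpu_ms_per_req: float = 10.0, fault_ms: float = 8.0,
               accesses_per_req: int = 50) -> float:
    """Requests/sec across all processes under a toy paging model."""
    share = frames / n_procs
    # Fraction of page accesses that fault once share < working set
    miss = max(0.0, 1.0 - share / working_set)
    # Fault service time inflates the effective time per request
    ms_per_req = cpu_ms_per_req + accesses_per_req * miss * fault_ms
    return n_procs * 1000.0 / ms_per_req

# Throughput rises with n, peaks, then collapses once frames run short
for n in (4, 8, 12, 20):
    print(n, round(throughput(n)))
```

Below the memory limit, adding processes adds throughput linearly; past it, each new process steals frames from the others and the fault term swamps the compute term.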
Given that CPU utilization is misleading during thrashing, how should we monitor systems to detect this condition? The key is combining multiple metrics.
```python
#!/usr/bin/env python3
"""Thrashing Detection Monitor

Combines multiple metrics to detect thrashing conditions."""

import os
import time
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SystemMetrics:
    cpu_user: float
    cpu_system: float
    cpu_iowait: float
    cpu_idle: float
    load_1min: float
    major_faults_per_sec: float
    available_memory_pct: float
    num_runnable: int
    num_blocked: int


class ThrashingDetector:
    """
    Detects thrashing by combining multiple system metrics.
    Single metrics can be misleading; combined analysis is robust.
    """

    def __init__(self,
                 iowait_threshold: float = 30.0,
                 fault_rate_threshold: float = 500.0,
                 memory_threshold: float = 10.0):  # Available memory < 10%
        self.iowait_threshold = iowait_threshold
        self.fault_rate_threshold = fault_rate_threshold
        self.memory_threshold = memory_threshold
        self.last_major_faults = self._get_major_faults()
        self.last_check = time.time()

    def _get_major_faults(self) -> int:
        with open('/proc/vmstat', 'r') as f:
            for line in f:
                if line.startswith('pgmajfault'):
                    return int(line.split()[1])
        return 0

    def _get_cpu_stats(self) -> Tuple[float, float, float, float]:
        """Returns user, system, iowait, idle percentages"""
        with open('/proc/stat', 'r') as f:
            cpu_line = f.readline()
        values = [int(x) for x in cpu_line.split()[1:]]
        total = sum(values)
        user = (values[0] + values[1]) * 100 / total   # user + nice ticks
        system = values[2] * 100 / total
        idle = values[3] * 100 / total
        iowait = values[4] * 100 / total
        return user, system, iowait, idle

    def _get_load_average(self) -> float:
        with open('/proc/loadavg', 'r') as f:
            return float(f.read().split()[0])

    def _get_memory_available_pct(self) -> float:
        with open('/proc/meminfo', 'r') as f:
            total = available = 0
            for line in f:
                if line.startswith('MemTotal:'):
                    total = int(line.split()[1])
                elif line.startswith('MemAvailable:'):
                    available = int(line.split()[1])
                    break
        return (available / total * 100) if total > 0 else 0

    def _get_process_states(self) -> Tuple[int, int]:
        """Returns (runnable, blocked) process counts"""
        runnable = blocked = 0
        for pid in os.listdir('/proc'):
            if not pid.isdigit():
                continue
            try:
                with open(f'/proc/{pid}/stat', 'r') as f:
                    state = f.read().split()[2]
                if state == 'R':
                    runnable += 1
                elif state == 'D':  # Uninterruptible sleep (I/O wait)
                    blocked += 1
            except (FileNotFoundError, PermissionError):
                continue
        return runnable, blocked

    def collect_metrics(self) -> SystemMetrics:
        """Collect all relevant metrics"""
        user, system, iowait, idle = self._get_cpu_stats()

        current_faults = self._get_major_faults()
        current_time = time.time()
        fault_rate = (current_faults - self.last_major_faults) / \
                     (current_time - self.last_check)
        self.last_major_faults = current_faults
        self.last_check = current_time

        runnable, blocked = self._get_process_states()

        return SystemMetrics(
            cpu_user=user,
            cpu_system=system,
            cpu_iowait=iowait,
            cpu_idle=idle,
            load_1min=self._get_load_average(),
            major_faults_per_sec=fault_rate,
            available_memory_pct=self._get_memory_available_pct(),
            num_runnable=runnable,
            num_blocked=blocked
        )

    def analyze(self, metrics: SystemMetrics) -> dict:
        """
        Analyze metrics to detect thrashing.
        Returns analysis with confidence level.
        """
        indicators = []

        # Indicator 1: High I/O wait
        if metrics.cpu_iowait > self.iowait_threshold:
            indicators.append(f"High I/O wait: {metrics.cpu_iowait:.1f}%")

        # Indicator 2: High page fault rate
        if metrics.major_faults_per_sec > self.fault_rate_threshold:
            indicators.append(
                f"High fault rate: {metrics.major_faults_per_sec:.0f}/s")

        # Indicator 3: Low available memory
        if metrics.available_memory_pct < self.memory_threshold:
            indicators.append(
                f"Low memory: {metrics.available_memory_pct:.1f}% avail")

        # Indicator 4: Zero runnable but many blocked
        if metrics.num_runnable == 0 and metrics.num_blocked > 3:
            indicators.append(
                f"Process stall: {metrics.num_blocked} blocked, 0 runnable")

        # Indicator 5: High load with low CPU user
        if metrics.load_1min > 4.0 and metrics.cpu_user < 20.0:
            indicators.append(
                f"Load/CPU mismatch: load={metrics.load_1min:.1f}, "
                f"cpu={metrics.cpu_user:.1f}%")

        # Calculate thrashing probability
        confidence = len(indicators) / 5.0  # 5 possible indicators

        if confidence >= 0.6:
            status = "THRASHING"
        elif confidence >= 0.4:
            status = "POSSIBLE_THRASHING"
        elif confidence >= 0.2:
            status = "WARNING"
        else:
            status = "NORMAL"

        return {
            'status': status,
            'confidence': confidence,
            'indicators': indicators,
            'recommendation': self._get_recommendation(status)
        }

    def _get_recommendation(self, status: str) -> str:
        recommendations = {
            'THRASHING': "URGENT: Reduce multiprogramming. "
                         "Kill or suspend processes.",
            'POSSIBLE_THRASHING': "Monitor closely. Prepare to reduce load.",
            'WARNING': "Elevated memory pressure. "
                       "Investigate memory-heavy processes.",
            'NORMAL': "System healthy. Continue monitoring."
        }
        return recommendations[status]


if __name__ == "__main__":
    detector = ThrashingDetector()
    print("Thrashing Detection Monitor")
    print("=" * 60)
    while True:
        time.sleep(1)
        metrics = detector.collect_metrics()
        analysis = detector.analyze(metrics)
        print(f"[{time.strftime('%H:%M:%S')}] Status: {analysis['status']} "
              f"(confidence: {analysis['confidence']:.0%})")
        if analysis['indicators']:
            print(f"  Indicators: {', '.join(analysis['indicators'])}")
        if analysis['status'] != 'NORMAL':
            print(f"  Recommendation: {analysis['recommendation']}")
```

Configure alerting systems to watch for:

- I/O wait sustained above ~30%
- Major page fault rate above a few hundred per second
- Available memory below ~10%
- Zero runnable processes while many sit in D state
- Load average far higher than CPU utilization can explain
Any two of these in combination strongly indicate thrashing.
When thrashing is detected via CPU utilization drop and elevated I/O wait, specific corrective actions can restore system functionality.
Use ps aux --sort=-%mem or similar to find memory hogs:
```bash
#!/bin/bash
# Emergency thrashing response script
# WARNING: Use with caution in production

echo "=== Thrashing Emergency Response ==="

# 1. Identify processes sorted by memory usage
echo "Top memory consumers:"
ps aux --sort=-%mem | head -10

# 2. Identify processes in D state (blocked on I/O)
echo -e "\nProcesses blocked on I/O (D state):"
ps aux | awk '$8 ~ /D/ {print $0}' | head -10

# 3. Current memory pressure
echo -e "\nMemory status:"
free -h

# 4. If confirmed thrashing, options:

# Option A: Suspend heavy non-critical processes
# find_non_critical_processes | while read pid; do
#     kill -STOP $pid
#     echo "Suspended PID $pid"
# done

# Option B: Clear page cache (Linux)
# sync; echo 1 > /proc/sys/vm/drop_caches

# Option C: Reduce swappiness to prefer keeping application pages
# echo 10 > /proc/sys/vm/swappiness

# Option D: OOM killer adjustment - lower score = less likely killed
# echo -500 > /proc/$(pgrep -f "critical_process")/oom_score_adj
```

Emergency response stops the bleeding but doesn't prevent recurrence. Long-term solutions include adding memory, capping per-process memory with cgroup limits, admission control on the multiprogramming level, and capacity planning based on working set sizes.
The CPU utilization drop during thrashing is counterintuitive but fully explainable. Armed with this knowledge, you can build monitoring systems that detect thrashing early and respond appropriately.
What's Next:
Having explored the symptoms of thrashing—high page fault rates and paradoxical CPU drops—the next page completes our analysis with detection methods. We'll examine systematic approaches for identifying thrashing, including automated detection algorithms and prevention strategies.
You now understand why CPU utilization drops during thrashing, how to distinguish it from genuine underload, and how to build monitoring that detects thrashing despite misleading CPU metrics. The final page covers systematic detection methods.