Here is a paradox that confounds system administrators and misleads monitoring systems: during thrashing, when the system is overwhelmed and barely functional, CPU utilization drops dramatically. The processors sit idle while the system fails to make progress.
This counter-intuitive behavior makes thrashing particularly dangerous because standard monitoring tools give the wrong signal. Alert systems that trigger on high CPU usage will never fire. Automated scaling systems see low CPU and don't add resources. Human operators see idle CPUs and assume there's no problem—or worse, assume the system is under-loaded and add more work.
This page explains why CPU utilization drops during thrashing and how to interpret CPU metrics correctly.
By the end of this page, you will understand why thrashing causes CPU utilization to drop, how to distinguish thrashing-induced idle time from genuine under-utilization, the relationship between I/O wait and CPU idle time, and how to build monitoring systems that detect thrashing despite misleading CPU metrics.
CPU utilization is traditionally measured as the fraction of time the CPU spends executing code (user or kernel) versus sitting idle. In a well-tuned system, high utilization indicates the CPU is doing useful work.
During thrashing, CPU utilization collapses because all processes are blocked waiting for page fault service. No process is ready to run, so the CPU has nothing to execute.
Traditional monitoring interprets low CPU utilization as "system has capacity." During thrashing, the signal is exactly reversed: low CPU utilization means the system is so overloaded that every process is blocked waiting on paging I/O.
The same metric means opposite things depending on context. Without additional information, it's impossible to distinguish the two states from CPU utilization alone.
The Timeline of CPU Behavior During Thrashing:
Time →
┌───────────────────────────────────────────────────────────────┐
│ NORMAL OPERATION │
│ │
│ Process A: [RUNNING]────[RUNNING]────[RUNNING]────[RUNNING] │
│ Process B: [RUNNING]────[RUNNING]────[RUNNING] │
│ Process C: [RUNNING]────[RUNNING]──── │
│ CPU: ████████████████████████████████████████████ (~90% used) │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│ THRASHING │
│ │
│ Process A: [RUN][WAIT──────────────][RUN][WAIT────────────] │
│ Process B: [R][WAIT──────────────][R][WAIT──────────] │
│ Process C: [R][WAIT──────────────][R][WAIT────────] │
│ CPU: ██░░░░░░░░░░░░██░░░░░░░░░░░░██░░░░░░░░░░ (~10% used) │
│ ↑ ↑ ↑ │
│ Brief run periods separated by long I/O waits │
└───────────────────────────────────────────────────────────────┘
Legend: [RUNNING] = Executing on CPU
[WAIT] = Blocked waiting for page fault I/O
████ = CPU busy
░░░░ = CPU idle
In the thrashing scenario, each process runs briefly (touching a few pages), immediately faults when it accesses a page not in memory, and then blocks waiting for the disk to load the page. With all processes blocked, the CPU sits idle—even though the system is overloaded.
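The arithmetic behind the diagram is simple: if each run burst lasts t_run before faulting, and each fault blocks the process for t_fault, the CPU can be busy at most t_run / (t_run + t_fault) of the time. A minimal sketch, with illustrative (not measured) burst and service times:

```python
def cpu_utilization(run_burst_ms: float, fault_service_ms: float) -> float:
    """Upper bound on CPU busy fraction when every run burst ends in a
    page fault that blocks for fault_service_ms (single-process view)."""
    return run_burst_ms / (run_burst_ms + fault_service_ms)

# Normal: long bursts between faults -> CPU stays busy
print(f"{cpu_utilization(100.0, 0.1):.0%}")   # ~100%

# Thrashing: 0.5 ms of work per 8 ms disk read -> ~6%
print(f"{cpu_utilization(0.5, 8.0):.0%}")
```

With many processes the blocked intervals overlap, but disk bandwidth is shared, so the aggregate picture stays the same: utilization is capped by the ratio of compute time to fault-service time.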
To understand the CPU utilization paradox, we need to examine how CPU time is categorized. Operating systems track CPU time in multiple categories:
| Category | Description | During Normal Operation | During Thrashing |
|---|---|---|---|
| User (%us) | Time in user-space code | 40-70% | 5-15% |
| System (%sy) | Time in kernel code | 5-20% | 15-30% |
| Nice (%ni) | Low-priority user processes | 0-10% | Near 0% |
| Idle (%id) | No runnable processes | 10-40% | 10-30% |
| I/O Wait (%wa) | Waiting for I/O completion | 1-10% | 40-70% |
| IRQ (%hi) | Hardware interrupt handling | 0-2% | 1-5% |
| SoftIRQ (%si) | Software interrupt handling | 0-2% | 2-5% |
| Steal (%st) | Time stolen by hypervisor | 0-10% | Varies |
The key to identifying thrashing is I/O Wait (%wa). This measures time the CPU sits idle while processes are blocked waiting for I/O to complete. During thrashing, %wa climbs from single digits to 40-70%, because nearly every process is blocked on page fault reads.
High I/O Wait combined with high page fault rate is the signature of thrashing.
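As a sketch of where %wa comes from, the kernel's cumulative tick counters in /proc/stat can be turned into percentages. The sample line below is fabricated to match the "top" output later in this page:

```python
def iowait_pct(cpu_line: str) -> float:
    """Percentage of ticks spent in iowait. Field order in /proc/stat:
    user nice system idle iowait irq softirq steal ..."""
    vals = [int(x) for x in cpu_line.split()[1:]]
    return vals[4] * 100.0 / sum(vals)

# Hypothetical snapshot: 1000 total ticks, 568 of them in iowait
sample = "cpu 81 0 224 123 568 4 0 0"
print(f"iowait: {iowait_pct(sample):.1f}%")   # iowait: 56.8%
```

Real monitoring takes two snapshots and differences them, since the counters are cumulative since boot; the single-snapshot version here is just to show the field layout.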
Interpreting top/htop Output:
┌── NORMAL OPERATION ──────────────────────────────────────────┐
│ top - 14:32:05 up 5 days, 3:12, 4 users, load average: 2.5 │
│ Tasks: 245 total, 3 running, 242 sleeping, 0 stopped │
│ %Cpu(s): 65.2 us, 12.3 sy, 0.0 ni, 20.5 id, 1.2 wa, 0.8 hi │
│ │ │ │ │ │ │
│ └─ User └─ System | └─ Low I/O Wait ← GOOD │
│ └─ Some idle is normal │
└──────────────────────────────────────────────────────────────┘
┌── THRASHING ─────────────────────────────────────────────────┐
│ top - 14:35:22 up 5 days, 3:15, 4 users, load average: 28.7 │
│ Tasks: 245 total, 0 running, 245 sleeping, 0 stopped │
│ %Cpu(s): 8.1 us, 22.4 sy, 0.0 ni, 12.3 id, 56.8 wa, 0.4 hi │
│ │ │ │ │
│ └─ Low! └─ Higher kernel time └─ VERY HIGH! ← BAD│
│ (page fault handling) │
│ │
│ Note: 0 running processes despite high load average! │
└──────────────────────────────────────────────────────────────┘
The thrashing signature: low user time (%us), elevated system time (%sy) from page fault handling, very high I/O wait (%wa), and zero running processes despite a high load average.
Understanding what the scheduler sees during thrashing explains both the CPU utilization drop and why standard scheduler responses make things worse.
Linux load average includes processes in uninterruptible sleep (D state)—typically waiting for disk I/O. During thrashing, the load average soars, because every blocked process still counts toward it, even as CPU utilization collapses.
This divergence between load average and CPU utilization is a strong thrashing indicator. In normal operation, they correlate; in thrashing, they diverge dramatically.
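One way to quantify the divergence is the ratio of per-core load to CPU busy fraction. This is an illustrative heuristic, not a standard metric, and the sample values are taken from the top screenshots above assuming an 8-core machine:

```python
def load_cpu_divergence(load_1min: float, cpu_busy_pct: float,
                        n_cores: int) -> float:
    """Ratio of per-core load to CPU busy fraction. Near 1 when load is
    CPU-bound; much greater than 1 when tasks sit blocked in D state."""
    busy = max(cpu_busy_pct / 100.0, 0.01)   # guard against division by zero
    return (load_1min / n_cores) / busy

print(load_cpu_divergence(2.5, 78.0, 8))    # normal: ~0.4
print(load_cpu_divergence(28.7, 30.0, 8))   # thrashing: ~12
```

Values near 1 mean the load is explained by CPU contention; values far above 1 mean most of the "load" is processes waiting on I/O.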
The Scheduler's Dilemma:
┌────────────────────────────────────────────────────────────┐
│ SCHEDULER'S STATE │
├────────────────────────────────────────────────────────────┤
│ │
│ Run Queue: [ ] ← Empty! No runnable tasks │
│ │
│ Wait Queue: [A][B][C][D][E][F][G][H][I][J][K][L][M]... │
│ ↑ │
│ All waiting for page fault I/O │
│ │
│ Decision Point: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ "Run queue empty, load average high... │ │
│ │ Must need more processes!" │ │
│ │ │ │
│ │ Action: Admit new process from job queue │ │
│ │ │ │
│ │ Result: New process takes frames from others, │ │
│ │ those processes now fault more, │ │
│ │ situation worsens. │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
The scheduler's optimization target (CPU utilization) leads it to exactly the wrong action during thrashing.
Both genuine underload and thrashing produce low CPU utilization. Distinguishing them is critical for choosing the correct response (add work vs. reduce work).
| Metric | Genuine Underload | Thrashing |
|---|---|---|
| CPU Utilization | Low (10-30%) | Low (10-30%) |
| CPU I/O Wait | Low (< 5%) | High (30-70%) |
| Load Average | Low (< 1.0) | High (>> CPU cores) |
| Page Fault Rate | Low (< 100/s) | Very high (> 1000/s) |
| Disk I/O | Low or moderate | Saturated |
| Memory Pressure | Low (plenty free) | Severe (near 100%) |
| Response Time | Fast | Very slow |
| Runnable Processes | Few—light workload | Zero—all blocked |
| Process State | Mostly sleeping | Mostly D state (I/O wait) |
A fast diagnostic for Linux systems:
```bash
# Check for thrashing indicators
vmstat 1 3
```
Look at:

- si / so — swap-in and swap-out rates (pages/s)
- wa — I/O wait percentage
- r — runnable processes, b — processes blocked on I/O

If si/so are high, wa is high, r is 0, and b is high → thrashing. If si/so are 0, wa is low, r is 0, and b is 0 → genuine underload.
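That decision rule can be written down directly. A sketch, with thresholds chosen for illustration rather than taken from any standard:

```python
def diagnose_vmstat(si: int, so: int, wa: int, r: int, b: int) -> str:
    """Map vmstat columns (swap-in, swap-out, iowait %, runnable,
    blocked) to a diagnosis. Thresholds are illustrative."""
    if (si > 100 or so > 100) and wa > 30 and r == 0 and b > 3:
        return "thrashing"
    if si == 0 and so == 0 and wa < 5 and r == 0 and b == 0:
        return "underload"
    return "indeterminate"

print(diagnose_vmstat(si=2400, so=1800, wa=57, r=0, b=14))  # thrashing
print(diagnose_vmstat(si=0, so=0, wa=1, r=0, b=0))          # underload
```

Anything that matches neither pattern deserves a closer look with the fuller metric set in the table above.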
The CPU utilization drop directly causes the throughput collapse. When the CPU can't execute processes, no work gets done—regardless of how many processes are waiting.
Throughput vs. Multiprogramming Curve:
Throughput
▲
│ ┌── Optimal Point
│ │
Max ┤ ╱──────────╲
│ ╱ ╲
│ ╱ Plateau ╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲ ← Thrashing Zone
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲───────
└────────┬─────────┬───────────┬──────► Multiprogramming
│ │ │ Level
CPU Memory Thrashing
Bound Bound Collapse
────────────────────────────────────────────────────
CPU Utilization
▲
│
100% ┤ ╱────────────────╲
│ ╱ ╲
│ ╱ ╲
│ ╱ ╲
│╱ ╲
│ ╲
│ ╲
│ ╲─────────
└────────────────────────────────────► Multiprogramming
Level
Both curves show the same phenomenon: above the optimal point, adding more work reduces both CPU utilization and throughput. The counterintuitive result is that the system does less total work with more processes.
The transition from "optimal" to "thrashing" is often abrupt:
A small increase in load can cause a 100x drop in throughput. This non-linear collapse is why thrashing is so dangerous.
Real-World Example:
| Load Level | Processes | CPU% | Throughput (req/s) | P99 Latency |
|---|---|---|---|---|
| Light | 4 | 40% | 400 | 50ms |
| Normal | 8 | 75% | 700 | 100ms |
| Heavy | 12 | 90% | 800 | 200ms |
| Overload | 16 | 85% | 500 | 1s |
| Thrashing | 20 | 40% | 50 | 30s |
| Severe | 25 | 15% | 5 | 5min |
Note how CPU and throughput move together initially, then CPU drops while work continues to increase. At 25 processes, CPU is 15% ("underloaded") but throughput is 5 req/s (catastrophic).
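A toy model reproduces the shape of this table: give N processes an equal share of a fixed frame pool, and let the miss rate rise once a process's share falls below its working set. Every parameter below is invented for illustration; the model also ignores the core-count ceiling, since its only point is the cliff once per-process frames drop below the working set.

```python
def throughput(n_procs: int, frames: int = 100, working_set: int = 12,
               cpu_ms_per_req: float = 10.0, fault_ms: float = 8.0,
               accesses_per_req: int = 50) -> float:
    """Requests/sec across all processes under a toy paging model."""
    share = frames / n_procs
    # Fraction of page accesses that fault once share < working set
    miss = max(0.0, 1.0 - share / working_set)
    # Fault service time inflates the effective time per request
    ms_per_req = cpu_ms_per_req + accesses_per_req * miss * fault_ms
    return n_procs * 1000.0 / ms_per_req

# Throughput rises with n, peaks, then collapses once frames run short
for n in (4, 8, 12, 20):
    print(n, round(throughput(n)))
```

Below the memory limit, adding processes adds throughput linearly; past it, each new process steals frames from the others and the fault term swamps the compute term.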
Given that CPU utilization is misleading during thrashing, how should we monitor systems to detect this condition? The key is combining multiple metrics.
```python
#!/usr/bin/env python3
"""Thrashing Detection Monitor

Combines multiple metrics to detect thrashing conditions."""

import os
import time
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SystemMetrics:
    cpu_user: float
    cpu_system: float
    cpu_iowait: float
    cpu_idle: float
    load_1min: float
    major_faults_per_sec: float
    available_memory_pct: float
    num_runnable: int
    num_blocked: int


class ThrashingDetector:
    """
    Detects thrashing by combining multiple system metrics.
    Single metrics can be misleading; combined analysis is robust.
    """

    def __init__(self,
                 iowait_threshold: float = 30.0,
                 fault_rate_threshold: float = 500.0,
                 memory_threshold: float = 10.0):  # Available memory < 10%
        self.iowait_threshold = iowait_threshold
        self.fault_rate_threshold = fault_rate_threshold
        self.memory_threshold = memory_threshold
        self.last_major_faults = self._get_major_faults()
        self.last_check = time.time()

    def _get_major_faults(self) -> int:
        with open('/proc/vmstat', 'r') as f:
            for line in f:
                if line.startswith('pgmajfault'):
                    return int(line.split()[1])
        return 0

    def _get_cpu_stats(self) -> Tuple[float, float, float, float]:
        """Returns user, system, iowait, idle percentages"""
        with open('/proc/stat', 'r') as f:
            cpu_line = f.readline()
        values = [int(x) for x in cpu_line.split()[1:]]
        total = sum(values)
        user = (values[0] + values[1]) * 100 / total   # user + nice ticks
        system = values[2] * 100 / total
        idle = values[3] * 100 / total
        iowait = values[4] * 100 / total
        return user, system, iowait, idle

    def _get_load_average(self) -> float:
        with open('/proc/loadavg', 'r') as f:
            return float(f.read().split()[0])

    def _get_memory_available_pct(self) -> float:
        with open('/proc/meminfo', 'r') as f:
            total = available = 0
            for line in f:
                if line.startswith('MemTotal:'):
                    total = int(line.split()[1])
                elif line.startswith('MemAvailable:'):
                    available = int(line.split()[1])
                    break
        return (available / total * 100) if total > 0 else 0

    def _get_process_states(self) -> Tuple[int, int]:
        """Returns (runnable, blocked) process counts"""
        runnable = blocked = 0
        for pid in os.listdir('/proc'):
            if not pid.isdigit():
                continue
            try:
                with open(f'/proc/{pid}/stat', 'r') as f:
                    state = f.read().split()[2]
                if state == 'R':
                    runnable += 1
                elif state == 'D':  # Uninterruptible sleep (I/O wait)
                    blocked += 1
            except (FileNotFoundError, PermissionError):
                continue
        return runnable, blocked

    def collect_metrics(self) -> SystemMetrics:
        """Collect all relevant metrics"""
        user, system, iowait, idle = self._get_cpu_stats()

        current_faults = self._get_major_faults()
        current_time = time.time()
        fault_rate = (current_faults - self.last_major_faults) / \
                     (current_time - self.last_check)
        self.last_major_faults = current_faults
        self.last_check = current_time

        runnable, blocked = self._get_process_states()

        return SystemMetrics(
            cpu_user=user,
            cpu_system=system,
            cpu_iowait=iowait,
            cpu_idle=idle,
            load_1min=self._get_load_average(),
            major_faults_per_sec=fault_rate,
            available_memory_pct=self._get_memory_available_pct(),
            num_runnable=runnable,
            num_blocked=blocked
        )

    def analyze(self, metrics: SystemMetrics) -> dict:
        """
        Analyze metrics to detect thrashing.
        Returns analysis with confidence level.
        """
        indicators = []

        # Indicator 1: High I/O wait
        if metrics.cpu_iowait > self.iowait_threshold:
            indicators.append(f"High I/O wait: {metrics.cpu_iowait:.1f}%")

        # Indicator 2: High page fault rate
        if metrics.major_faults_per_sec > self.fault_rate_threshold:
            indicators.append(
                f"High fault rate: {metrics.major_faults_per_sec:.0f}/s")

        # Indicator 3: Low available memory
        if metrics.available_memory_pct < self.memory_threshold:
            indicators.append(
                f"Low memory: {metrics.available_memory_pct:.1f}% avail")

        # Indicator 4: Zero runnable but many blocked
        if metrics.num_runnable == 0 and metrics.num_blocked > 3:
            indicators.append(
                f"Process stall: {metrics.num_blocked} blocked, 0 runnable")

        # Indicator 5: High load with low CPU user
        if metrics.load_1min > 4.0 and metrics.cpu_user < 20.0:
            indicators.append(
                f"Load/CPU mismatch: load={metrics.load_1min:.1f}, "
                f"cpu={metrics.cpu_user:.1f}%")

        # Calculate thrashing probability
        confidence = len(indicators) / 5.0  # 5 possible indicators

        if confidence >= 0.6:
            status = "THRASHING"
        elif confidence >= 0.4:
            status = "POSSIBLE_THRASHING"
        elif confidence >= 0.2:
            status = "WARNING"
        else:
            status = "NORMAL"

        return {
            'status': status,
            'confidence': confidence,
            'indicators': indicators,
            'recommendation': self._get_recommendation(status)
        }

    def _get_recommendation(self, status: str) -> str:
        recommendations = {
            'THRASHING': "URGENT: Reduce multiprogramming. "
                         "Kill or suspend processes.",
            'POSSIBLE_THRASHING': "Monitor closely. Prepare to reduce load.",
            'WARNING': "Elevated memory pressure. "
                       "Investigate memory-heavy processes.",
            'NORMAL': "System healthy. Continue monitoring."
        }
        return recommendations[status]


if __name__ == "__main__":
    detector = ThrashingDetector()
    print("Thrashing Detection Monitor")
    print("=" * 60)
    while True:
        time.sleep(1)
        metrics = detector.collect_metrics()
        analysis = detector.analyze(metrics)
        print(f"[{time.strftime('%H:%M:%S')}] Status: {analysis['status']} "
              f"(confidence: {analysis['confidence']:.0%})")
        if analysis['indicators']:
            print(f"  Indicators: {', '.join(analysis['indicators'])}")
        if analysis['status'] != 'NORMAL':
            print(f"  Recommendation: {analysis['recommendation']}")
```

Configure alerting systems to watch for:

- I/O wait sustained above ~30%
- Major page fault rate above a few hundred per second
- Available memory below ~10%
- Zero runnable processes while many sit in D state
- Load average far higher than CPU utilization can explain
Any two of these in combination strongly indicate thrashing.
When thrashing is detected via CPU utilization drop and elevated I/O wait, specific corrective actions can restore system functionality.
Use ps aux --sort=-%mem or similar to find memory hogs:
```bash
#!/bin/bash
# Emergency thrashing response script
# WARNING: Use with caution in production

echo "=== Thrashing Emergency Response ==="

# 1. Identify processes sorted by memory usage
echo "Top memory consumers:"
ps aux --sort=-%mem | head -10

# 2. Identify processes in D state (blocked on I/O)
echo -e "\nProcesses blocked on I/O (D state):"
ps aux | awk '$8 ~ /D/ {print $0}' | head -10

# 3. Current memory pressure
echo -e "\nMemory status:"
free -h

# 4. If confirmed thrashing, options:

# Option A: Suspend heavy non-critical processes
# find_non_critical_processes | while read pid; do
#     kill -STOP $pid
#     echo "Suspended PID $pid"
# done

# Option B: Clear page cache (Linux)
# sync; echo 1 > /proc/sys/vm/drop_caches

# Option C: Reduce swappiness to prefer keeping application pages
# echo 10 > /proc/sys/vm/swappiness

# Option D: OOM killer adjustment - lower score = less likely killed
# echo -500 > /proc/$(pgrep -f "critical_process")/oom_score_adj
```

Emergency response stops the bleeding but doesn't prevent recurrence. Long-term solutions include adding memory, capping per-process memory with cgroup limits, admission control on the multiprogramming level, and capacity planning based on working set sizes.
The CPU utilization drop during thrashing is counterintuitive but fully explainable. Armed with this knowledge, you can build monitoring systems that detect thrashing early and respond appropriately.
What's Next:
Having explored the symptoms of thrashing—high page fault rates and paradoxical CPU drops—the next page completes our analysis with detection methods. We'll examine systematic approaches for identifying thrashing, including automated detection algorithms and prevention strategies.
You now understand why CPU utilization drops during thrashing, how to distinguish it from genuine underload, and how to build monitoring that detects thrashing despite misleading CPU metrics. The final page covers systematic detection methods.