Throughout this module, we've examined the definition, causes, and symptoms of thrashing. Now we synthesize this knowledge into systematic detection methods—algorithms and approaches that can identify thrashing reliably, automatically, and early enough to take corrective action.
Effective thrashing detection is the bridge between understanding and action. A system that can detect thrashing can prevent it; a system that cannot detect it will inevitably suffer catastrophic performance collapse.
By the end of this page, you will understand multiple approaches to thrashing detection, including working set analysis, page fault frequency monitoring, memory pressure scoring, and machine learning-based prediction. You'll know how to implement these methods and when to use each approach.
Thrashing detection faces a fundamental challenge: we need to detect the problem before it becomes catastrophic, but the clearest signals appear after significant degradation has occurred.
The Detection Spectrum:
| Detection Point | Signal Clarity | Intervention Window | False Positive Risk |
|---|---|---|---|
| Predictive (before onset) | Weak, ambiguous | Large (proactive) | High |
| Early (onset beginning) | Moderate | Medium (responsive) | Moderate |
| Symptomatic (clear thrashing) | Strong, clear | Small (reactive) | Low |
| Crisis (system frozen) | Obvious | None (too late) | None |
Earlier detection provides more time for corrective action but risks false positives (reducing load unnecessarily). Later detection is more accurate but may leave insufficient time to prevent collapse. The art of thrashing detection lies in finding the right balance for your system's requirements.
Detection Requirements:
An effective thrashing detection system should identify the problem reliably, operate automatically with low overhead, fire early enough to leave time for corrective action, and keep false positives rare enough that operators trust its alerts.
The most theoretically sound approach to thrashing detection is working set analysis. If we can measure each process's working set and compare the sum to available memory, we can predict thrashing before it occurs.
The working set W(t, Δ) of a process at time t is the set of pages referenced in the time interval [t−Δ, t], where Δ is the working set window; its size is the working set size, WSS(t, Δ).
Thrashing occurs when: ∑ WSS(i, Δ) > m
where m is the number of available physical frames. Detection therefore reduces to estimating WSS for every process.
Working Set Estimation Methods:
| Method | How It Works | Accuracy | Overhead |
|---|---|---|---|
| Reference bit sampling | Periodically check/clear reference bits | Moderate | Low |
| Page table scanning | Walk page tables, count accessed pages | High | Moderate |
| Hardware counters | Use performance counters | High | Low |
| Fault tracking | Infer from page fault patterns | Moderate | Very low |
| RSS monitoring | Use Resident Set Size as proxy | Approximate | Very low |
Reference Bit Sampling Algorithm:
Algorithm: Working Set Estimation via Reference Bits
Inputs:
- Δ = working set window (e.g., 10 seconds)
- τ = sampling interval (e.g., 1 second)
- history[process][page] = number of intervals page was referenced
Every τ seconds:
For each process P:
For each page pg in P's page table:
If reference_bit(pg) == 1:
history[P][pg] = Δ/τ # Reset to max age
clear reference_bit(pg)
Else:
history[P][pg] -= 1 # Age the entry
If history[P][pg] <= 0:
Remove pg from working set estimate
WSS[P] = count of pages with history > 0
Thrashing Detection:
If ∑ WSS[P] > available_memory * threshold:
ALERT: Thrashing likely
This algorithm approximates working sets by observing which pages are accessed during each sampling interval. Pages not accessed for Δ seconds are dropped from the working set estimate.
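As a user-space illustration of the aging logic above (real reference bits live in hardware page tables and are visible only to the kernel), the algorithm can be simulated with a dictionary of per-page age counters; the Δ and τ values are the example parameters from the pseudocode:

```python
# Simulation of the reference-bit aging algorithm. A dict of per-page
# age counters stands in for the hardware reference bits.

DELTA = 10             # working set window, seconds (example value)
TAU = 1                # sampling interval, seconds (example value)
MAX_AGE = DELTA // TAU

def sample(history: dict, referenced: set) -> int:
    """One sampling pass: refresh pages seen this interval, age the rest.
    Returns the current working-set-size estimate."""
    for page in referenced:
        history[page] = MAX_AGE      # reference bit was set: reset age
    for page in list(history):
        if page not in referenced:
            history[page] -= 1       # reference bit clear: age the entry
            if history[page] <= 0:
                del history[page]    # aged out of the window
    return len(history)              # WSS = pages still in the window

history = {}
sample(history, {1, 2, 3})           # pages 1-3 touched
wss = sample(history, {2, 3, 4})     # page 1 starts aging, page 4 joins
print(wss)  # 4 — page 1 is still within the window
```

Summing the per-process estimates and comparing against available frames gives the thrashing check from the pseudocode.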
Pure working set tracking is often too expensive for production systems. A practical compromise is to track Resident Set Size (RSS) as a proxy. While RSS isn't exactly the working set, it correlates well enough for detection purposes:
If ∑ RSS(i) approaches physical memory, thrashing risk is high.
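A minimal sketch of the RSS heuristic, assuming Linux's /proc/[pid]/statm layout (the second field is resident pages); `rss_pressure`, `read_all_rss`, and the 90% threshold are illustrative choices, not a standard API:

```python
import glob
import os

def rss_pressure(rss_bytes, mem_total, threshold=0.9):
    """True if summed RSS approaches physical memory (threshold is tunable)."""
    return sum(rss_bytes) > threshold * mem_total

def read_all_rss():
    """Collect the RSS of every process from /proc/[pid]/statm (Linux only).
    The second statm field is resident pages; multiply by the page size."""
    page_size = os.sysconf('SC_PAGE_SIZE')
    rss = []
    for statm in glob.glob('/proc/[0-9]*/statm'):
        try:
            with open(statm) as f:
                rss.append(int(f.read().split()[1]) * page_size)
        except (FileNotFoundError, ProcessLookupError):
            pass  # process exited between glob and read
    return rss

# Synthetic example: three processes on a 4 GiB machine
GiB = 1 << 30
print(rss_pressure([1.5 * GiB, 1.2 * GiB, 1.1 * GiB], 4 * GiB))  # True (3.8 > 3.6 GiB)
```

The overhead is negligible, which is why RSS-based checks are common as a first line of defense.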
Page Fault Frequency (PFF) provides a simpler, more practical detection approach than full working set tracking. Instead of measuring working sets directly, we observe the symptom of inadequate allocation: elevated fault rates.
PFF Detection Algorithm:
Algorithm: PFF-Based Thrashing Detection
Parameters:
- upper_threshold = maximum acceptable fault rate per process
- lower_threshold = minimum useful fault rate
- global_threshold = system-wide fault rate threshold
- sample_interval = measurement period
Every sample_interval:
total_faults = 0
processes_over_limit = 0
For each process P:
fault_rate[P] = faults(P) / sample_interval
total_faults += fault_rate[P]
If fault_rate[P] > upper_threshold:
processes_over_limit += 1
mark P as "needs more frames"
If fault_rate[P] < lower_threshold:
mark P as "can release frames"
# Global thrashing detection
If total_faults > global_threshold:
If all processes marked "needs more frames":
ALERT: THRASHING - reduce multiprogramming
Else:
ACTION: Rebalance frames from "can release" to "needs more"
PFF detection reveals an important condition: irrecoverable thrashing. This occurs when the system-wide fault rate exceeds the global threshold while every process is above the upper per-process threshold, so no process has frames to spare.
In this state, rebalancing is impossible—the only solution is to reduce the number of processes.
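The PFF pseudocode condenses into a short sketch; `pff_detect` and its threshold values are illustrative names for this page, not a kernel interface:

```python
def pff_detect(fault_rates, upper, lower, global_limit):
    """Classify processes by fault rate and pick a system-level action.
    fault_rates maps process name -> page faults/sec over the sample interval."""
    needs_more = [p for p, r in fault_rates.items() if r > upper]
    can_release = [p for p, r in fault_rates.items() if r < lower]
    total = sum(fault_rates.values())

    if total > global_limit:
        if len(needs_more) == len(fault_rates):
            # Irrecoverable: every process is starved, nothing to rebalance
            action = "REDUCE_MULTIPROGRAMMING"
        else:
            action = "REBALANCE_FRAMES"  # move frames from donors to starved
    else:
        action = "NONE"
    return {"needs_more": needs_more, "can_release": can_release, "action": action}

# Every process over the limit, total above the global threshold:
state = pff_detect({"db": 900, "web": 800, "batch": 700},
                   upper=500, lower=50, global_limit=2000)
print(state["action"])  # REDUCE_MULTIPROGRAMMING
```

When at least one process sits below the lower threshold, the same function instead recommends rebalancing rather than suspending work.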
Modern operating systems expose memory pressure indicators that provide early warning of potential thrashing. These metrics aggregate multiple signals into actionable information.
| OS | Mechanism | Location/API | What It Reports |
|---|---|---|---|
| Linux | PSI (Pressure Stall Info) | /proc/pressure/memory | % time processes stalled on memory |
| Linux | Memory cgroups | /sys/fs/cgroup/memory.* | Per-group pressure, OOM events |
| Linux | Watermarks | /proc/zoneinfo | Pages in each zone vs. thresholds |
| Windows | Memory Manager | Performance Counters | Available bytes, cache faults/sec |
| macOS | VM pressure | Mach APIs | Warning, critical, warning-recovery |
| FreeBSD | vm.stats | sysctl | Various VM statistics |
Linux 4.20+ provides PSI, a powerful mechanism for thrashing detection:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
High 'full' values indicate thrashing—all processes are blocked waiting for memory.
#!/usr/bin/env python3
"""
PSI-Based Thrashing Detection for Linux 4.20+
Uses Pressure Stall Information for accurate thrashing detection
"""
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PSIMetrics:
    """Pressure Stall Information metrics"""
    some_avg10: float   # Some tasks stalled, 10s avg
    some_avg60: float   # Some tasks stalled, 60s avg
    some_avg300: float  # Some tasks stalled, 5min avg
    full_avg10: float   # All tasks stalled, 10s avg
    full_avg60: float   # All tasks stalled, 60s avg
    full_avg300: float  # All tasks stalled, 5min avg
    some_total: int     # Total some stall time (microseconds)
    full_total: int     # Total full stall time (microseconds)

def read_psi_memory() -> Optional[PSIMetrics]:
    """Read PSI memory metrics from /proc/pressure/memory"""
    try:
        with open('/proc/pressure/memory', 'r') as f:
            lines = f.readlines()
        some_line = dict(x.split('=') for x in lines[0].split()[1:])
        full_line = dict(x.split('=') for x in lines[1].split()[1:])
        return PSIMetrics(
            some_avg10=float(some_line['avg10']),
            some_avg60=float(some_line['avg60']),
            some_avg300=float(some_line['avg300']),
            full_avg10=float(full_line['avg10']),
            full_avg60=float(full_line['avg60']),
            full_avg300=float(full_line['avg300']),
            some_total=int(some_line['total']),
            full_total=int(full_line['total']),
        )
    except FileNotFoundError:
        print("PSI not available (requires Linux 4.20+)")
        return None

class PSIThrashingDetector:
    """
    Detects thrashing using Linux PSI metrics.

    PSI 'full' metric indicates time when ALL tasks are stalled -
    this is the most direct measure of thrashing available.
    """
    def __init__(self,
                 warning_threshold: float = 10.0,    # 10% full stall
                 critical_threshold: float = 25.0,   # 25% full stall
                 thrashing_threshold: float = 50.0): # 50% full stall
        self.warning_threshold = warning_threshold
        self.critical_threshold = critical_threshold
        self.thrashing_threshold = thrashing_threshold

    def analyze(self, psi: PSIMetrics) -> dict:
        """Analyze PSI metrics and return status"""
        # Use 10s average for quick response
        full_pressure = psi.full_avg10
        some_pressure = psi.some_avg10

        # Determine severity
        if full_pressure >= self.thrashing_threshold:
            status = "THRASHING"
            urgency = "IMMEDIATE"
        elif full_pressure >= self.critical_threshold:
            status = "CRITICAL"
            urgency = "HIGH"
        elif full_pressure >= self.warning_threshold:
            status = "WARNING"
            urgency = "MEDIUM"
        else:
            status = "NORMAL"
            urgency = "NONE"

        # Trending analysis
        if psi.full_avg10 > psi.full_avg60 > psi.full_avg300:
            trend = "WORSENING"
        elif psi.full_avg10 < psi.full_avg60 < psi.full_avg300:
            trend = "IMPROVING"
        else:
            trend = "STABLE"

        return {
            'status': status,
            'urgency': urgency,
            'full_pressure': full_pressure,
            'some_pressure': some_pressure,
            'trend': trend,
            'recommendation': self._get_recommendation(status, trend)
        }

    def _get_recommendation(self, status: str, trend: str) -> str:
        if status == "THRASHING":
            return "IMMEDIATE ACTION: Suspend or kill processes to reduce memory pressure"
        elif status == "CRITICAL":
            if trend == "WORSENING":
                return "Prepare load shedding - thrashing imminent"
            return "Investigate memory-heavy processes"
        elif status == "WARNING":
            return "Monitor closely, consider preemptive action"
        return "System healthy"

def main():
    detector = PSIThrashingDetector()
    print("PSI-Based Thrashing Monitor")
    print("=" * 60)
    while True:
        psi = read_psi_memory()
        if psi is None:
            print("Cannot read PSI metrics")
            break
        analysis = detector.analyze(psi)
        print(f"[{time.strftime('%H:%M:%S')}]")
        print(f"  Status: {analysis['status']} | Urgency: {analysis['urgency']}")
        print(f"  Full stall: {analysis['full_pressure']:.1f}% | "
              f"Some stall: {analysis['some_pressure']:.1f}%")
        print(f"  Trend: {analysis['trend']}")
        if analysis['status'] != "NORMAL":
            print(f"  >> {analysis['recommendation']}")
        time.sleep(1)

if __name__ == "__main__":
    main()

The most robust thrashing detection combines multiple indicators into a composite score. This approach reduces false positives by requiring multiple metrics to agree before alerting.
The Composite Detection Framework:
Algorithm: Composite Thrashing Detection
Metrics (each normalized to 0-100 scale):
M1 = Memory pressure (PSI full, or MemAvailable/MemTotal inverted)
M2 = Page fault rate (relative to baseline)
M3 = I/O wait percentage
M4 = Load average divergence from CPU utilization
M5 = Swap activity rate
Weights (tunable based on system):
W1 = 0.25 (memory pressure is primary indicator)
W2 = 0.25 (fault rate is strong signal)
W3 = 0.20 (I/O wait is reliable)
W4 = 0.15 (load divergence is confirming)
W5 = 0.15 (swap activity is confirming)
Composite Score:
S = W1*M1 + W2*M2 + W3*M3 + W4*M4 + W5*M5
Thresholds:
S < 30: NORMAL
30 ≤ S < 50: WARNING
50 ≤ S < 70: CRITICAL
S ≥ 70: THRASHING
The weights and thresholds above are starting points. For your specific system, collect baseline metrics during normal operation, tune thresholds against known incidents, and validate the false-positive rate before automating responses. Different workloads require different tuning: a batch processing system exhibits very different patterns from a web server.
Metric Normalization:
To combine different metrics, we normalize each to a 0-100 scale:
| Metric | Raw Range | Normalization Formula |
|---|---|---|
| Memory (PSI full) | 0-100% | Use directly |
| Fault rate | 0-∞ f/s | min(100, rate/baseline × 50) |
| I/O wait | 0-100% | Use directly |
| Load divergence | 0-∞ | min(100, (load/cores - cpu%) × 2) |
| Swap rate | 0-∞ MB/s | min(100, rate × 10) |
Normalization ensures each metric contributes proportionally to the composite score.
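Putting the normalization table and the weighted sum together might look like the sketch below; `fault_baseline` and the percent-scale interpretation of load divergence are assumptions, since the raw formulas leave units open:

```python
def normalize(psi_full, fault_rate, io_wait, load_pct, cpu_pct, swap_mb_s,
              fault_baseline=100.0):
    """Map raw metrics onto 0-100 using the table's formulas.
    fault_baseline (faults/s) and expressing load per core as a percent
    of capacity are assumed conventions for this sketch."""
    return [
        psi_full,                                     # M1: already 0-100
        min(100, fault_rate / fault_baseline * 50),   # M2: fault rate
        io_wait,                                      # M3: already 0-100
        min(100, max(0, load_pct - cpu_pct) * 2),     # M4: load divergence
        min(100, swap_mb_s * 10),                     # M5: swap activity
    ]

def composite_score(metrics, weights=(0.25, 0.25, 0.20, 0.15, 0.15)):
    """Weighted sum plus the threshold bands from the algorithm."""
    s = sum(w * m for w, m in zip(weights, metrics))
    if s < 30:
        status = "NORMAL"
    elif s < 50:
        status = "WARNING"
    elif s < 70:
        status = "CRITICAL"
    else:
        status = "THRASHING"
    return s, status

# A heavily loaded box: high PSI, 4x-baseline faults, saturated I/O and swap
score, status = composite_score(normalize(60, 400, 80, 300, 40, 15))
print(round(score), status)  # 86 THRASHING
```

Because M2, M4, and M5 saturate at 100, extreme readings on one metric cannot dominate the score by themselves, which is the point of the composite design.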
For complex systems with variable workloads, machine learning approaches can learn thrashing patterns from historical data and detect anomalies that rule-based systems might miss.
ML Approaches for Thrashing Detection:
| Approach | Method | Best For |
|---|---|---|
| Anomaly Detection | Isolation Forest, One-Class SVM | Detecting unusual patterns |
| Classification | Random Forest, Gradient Boosting | Binary thrashing/not-thrashing |
| Time Series | LSTM, Prophet | Predicting future thrashing |
| Clustering | K-Means, DBSCAN | Identifying distinct system states |
| Regression | Linear, XGBoost | Predicting performance degradation |
Key features for ML-based thrashing detection:
Raw metrics: CPU user/sys/wait, memory available, fault rate, swap rate, load average
Derived features: Rate of change of each metric, ratios (e.g., load/CPU), rolling statistics (min, max, variance over windows)
Temporal features: Time of day, day of week, seasonality indicators
Context features: Number of processes, container count, recent deployments
Practical ML Implementation:
# Simplified thrashing prediction using Isolation Forest
from sklearn.ensemble import IsolationForest
import numpy as np

class ThrashingPredictor:
    def __init__(self, contamination=0.05):
        # Contamination = expected fraction of anomalies
        self.model = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_estimators=100
        )
        self.trained = False

    def train(self, normal_operation_data: np.ndarray):
        """
        Train on data from normal operation.
        Data shape: (n_samples, n_features)
        Features: [cpu_user, cpu_wait, mem_pct, fault_rate, swap_rate]
        """
        self.model.fit(normal_operation_data)
        self.trained = True

    def detect(self, current_metrics: np.ndarray) -> dict:
        """
        Detect if current metrics indicate thrashing.
        Returns prediction and anomaly score.
        """
        if not self.trained:
            return {'error': 'Model not trained'}

        # -1 = anomaly (potential thrashing)
        # +1 = normal
        prediction = self.model.predict(current_metrics.reshape(1, -1))[0]
        score = self.model.score_samples(current_metrics.reshape(1, -1))[0]

        return {
            'is_anomaly': prediction == -1,
            'anomaly_score': -score,  # Higher = more anomalous
            'status': 'POTENTIAL_THRASHING' if prediction == -1 else 'NORMAL'
        }
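To see the approach end to end, here is a self-contained run of the same Isolation Forest technique against synthetic data; the "normal operation" distribution and the thrashing sample are invented for illustration, and a real deployment would train on recorded production metrics:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic normal operation: [cpu_user, cpu_wait, mem_pct, fault_rate, swap_rate]
normal = rng.normal(loc=[40, 5, 60, 50, 0],
                    scale=[10, 2, 10, 20, 0.5],
                    size=(500, 5))
model = IsolationForest(contamination=0.05, random_state=42).fit(normal)

# A thrashing-like sample: CPU idle, high I/O wait, extreme fault/swap rates
thrashing_sample = np.array([[5, 80, 95, 5000, 200]])
print(model.predict(thrashing_sample)[0])  # -1 flags an anomaly
```

Retraining periodically matters: as the workload's "normal" drifts, a stale model will either miss real anomalies or flood operators with false ones.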
Detection approaches fall into two categories: real-time (detecting current thrashing) and predictive (anticipating future thrashing). Each has distinct use cases.
| Aspect | Real-Time Detection | Predictive Detection |
|---|---|---|
| Timing | Detects current state | Predicts future state |
| Data | Current metrics | Historical patterns + current trend |
| Algorithm | Threshold/rule-based | Time series analysis, ML |
| Lead time | None (reactive) | Minutes to hours |
| Accuracy | High (clear signals) | Moderate (prediction uncertainty) |
| Use case | Immediate response | Proactive capacity planning |
| Action | Emergency load shedding | Scale up, shed load gracefully |
The best systems use both approaches: predictive detection drives graceful, proactive responses such as scaling up or shedding load early, while real-time detection acts as a backstop that triggers emergency action when prediction misses.
This layered approach maximizes system resilience.
Predictive Detection Example:
Algorithm: Trend-Based Thrashing Prediction
Inputs:
- memory_usage[] = last 60 samples of memory utilization
- fault_rate[] = last 60 samples of page fault rate
- horizon = prediction time (e.g., 5 minutes)
Process:
1. Fit linear regression to recent memory_usage
trend_mem = slope of regression
2. Fit linear regression to recent fault_rate
trend_fault = slope of regression
3. Project forward:
projected_mem = current_mem + trend_mem × horizon
projected_fault = current_fault + trend_fault × horizon
4. Risk assessment:
If projected_mem > 90% AND projected_fault > baseline × 5:
ALERT: Thrashing predicted in ~{horizon} minutes
Recommended action: Preemptive load reduction
This simple linear extrapolation provides early warning for gradual memory pressure buildup.
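The prediction steps map directly onto `numpy.polyfit`; the 90% memory and 5× baseline cutoffs come from the pseudocode, while the sample series below is invented:

```python
import numpy as np

def predict_thrashing(mem_usage, fault_rate, horizon, baseline_faults):
    """Fit a line to each series and project `horizon` samples ahead.
    Cutoffs (90% memory, 5x baseline faults) follow the pseudocode above."""
    t = np.arange(len(mem_usage))
    mem_slope = np.polyfit(t, mem_usage, 1)[0]      # slope of linear fit
    fault_slope = np.polyfit(t, fault_rate, 1)[0]
    projected_mem = mem_usage[-1] + mem_slope * horizon
    projected_fault = fault_rate[-1] + fault_slope * horizon
    at_risk = bool(projected_mem > 90 and projected_fault > baseline_faults * 5)
    return projected_mem, projected_fault, at_risk

# Invented sample data: memory climbing 0.5%/sample, faults +10/s per sample
mem = [60 + 0.5 * i for i in range(60)]
faults = [100 + 10 * i for i in range(60)]
pm, pf, risk = predict_thrashing(mem, faults, horizon=30, baseline_faults=100)
print(risk)  # True — projected ~104.5% memory, ~990 faults/s
```

Fitting over a window rather than differencing the last two samples makes the projection robust to single-sample noise, at the cost of reacting more slowly to sudden shifts.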
Detection is only valuable when coupled with automated response. This section outlines how to integrate thrashing detection into operational systems.
Automated Response Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ DETECTION → RESPONSE PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Metrics │───►│ Detection│───►│ Decision │ │
│ │ Collector│ │ Engine │ │ Engine │ │
│ └──────────┘ └──────────┘ └────┬─────┘ │
│ │ │
│ ┌──────────────┼──────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Alert │ │ Scale │ │ Shed │ │
│ │ (Warn) │ │ (Add) │ │ (Drop) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Decision Logic: │
│ - WARNING: Alert on-call, monitor closely │
│ - CRITICAL + WORSENING: Scale up if possible │
│ - THRASHING: Immediate load shedding │
│ │
└─────────────────────────────────────────────────────────────┘
Response actions operate on different time scales: alerts fire within seconds, load shedding takes effect in seconds to minutes, and scaling up new capacity takes minutes or longer.
Match response to detection urgency. Don't wait for slow scaling when immediate shedding is needed.
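The decision logic in the pipeline diagram reduces to a small dispatch function; the tier names (`SHED`, `SCALE`, `ALERT`) are illustrative labels, not a standard interface:

```python
def decide(status: str, trend: str) -> str:
    """Map detector output to a response tier per the pipeline's decision logic."""
    if status == "THRASHING":
        return "SHED"    # immediate load shedding
    if status == "CRITICAL" and trend == "WORSENING":
        return "SCALE"   # add capacity before collapse
    if status in ("CRITICAL", "WARNING"):
        return "ALERT"   # notify on-call, keep monitoring
    return "NONE"

print(decide("THRASHING", "STABLE"))    # SHED
print(decide("CRITICAL", "WORSENING")) # SCALE
print(decide("WARNING", "IMPROVING"))  # ALERT
```

Keeping the mapping explicit and ordered from most to least severe makes it easy to audit which conditions can trigger destructive actions like shedding.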
You have completed Module 4: Thrashing. You now understand:
- Definition: What thrashing is and why it matters
- Cause: Insufficient frames relative to working set demand
- Symptoms: High page fault rates and paradoxical CPU drops
- Detection: Working set analysis, PFF, PSI, composite algorithms, and ML approaches
With this knowledge, you can design, monitor, and operate systems that resist thrashing and recover gracefully when memory pressure rises.