Throughout this module, we've examined the definition, causes, and symptoms of thrashing. Now we synthesize this knowledge into systematic detection methods—algorithms and approaches that can identify thrashing reliably, automatically, and early enough to take corrective action.
Effective thrashing detection is the bridge between understanding and action. A system that can detect thrashing can prevent it; a system that cannot detect it will inevitably suffer catastrophic performance collapse.
By the end of this page, you will understand multiple approaches to thrashing detection, including working set analysis, page fault frequency monitoring, memory pressure scoring, and machine learning-based prediction. You'll know how to implement these methods and when to use each approach.
Thrashing detection faces a fundamental challenge: we need to detect the problem before it becomes catastrophic, but the clearest signals appear after significant degradation has occurred.
The Detection Spectrum:
| Detection Point | Signal Clarity | Intervention Window | False Positive Risk |
|---|---|---|---|
| Predictive (before onset) | Weak, ambiguous | Large (proactive) | High |
| Early (onset beginning) | Moderate | Medium (responsive) | Moderate |
| Symptomatic (clear thrashing) | Strong, clear | Small (reactive) | Low |
| Crisis (system frozen) | Obvious | None (too late) | None |
Earlier detection provides more time for corrective action but risks false positives (reducing load unnecessarily). Later detection is more accurate but may leave insufficient time to prevent collapse. The art of thrashing detection lies in finding the right balance for your system's requirements.
Detection Requirements:
An effective thrashing detection system should identify the problem reliably, operate automatically with low overhead, fire early enough to leave time for corrective action, and keep false positives rare enough that operators trust its alerts.
The most theoretically sound approach to thrashing detection is working set analysis. If we can measure each process's working set and compare the sum to available memory, we can predict thrashing before it occurs.
The working set W(t, Δ) of a process at time t is the set of pages referenced in the time interval [t−Δ, t], where Δ is the working set window; its size is the working set size, WSS(t, Δ).
Thrashing occurs when: ∑ WSS(i, Δ) > m
where m is the number of available physical frames. Detection therefore reduces to estimating WSS for every process.
Working Set Estimation Methods:
| Method | How It Works | Accuracy | Overhead |
|---|---|---|---|
| Reference bit sampling | Periodically check/clear reference bits | Moderate | Low |
| Page table scanning | Walk page tables, count accessed pages | High | Moderate |
| Hardware counters | Use performance counters | High | Low |
| Fault tracking | Infer from page fault patterns | Moderate | Very low |
| RSS monitoring | Use Resident Set Size as proxy | Approximate | Very low |
Reference Bit Sampling Algorithm:
Algorithm: Working Set Estimation via Reference Bits
Inputs:
- Δ = working set window (e.g., 10 seconds)
- τ = sampling interval (e.g., 1 second)
- history[process][page] = number of intervals page was referenced
Every τ seconds:
For each process P:
For each page pg in P's page table:
If reference_bit(pg) == 1:
history[P][pg] = Δ/τ # Reset to max age
clear reference_bit(pg)
Else:
history[P][pg] -= 1 # Age the entry
If history[P][pg] <= 0:
Remove pg from working set estimate
WSS[P] = count of pages with history > 0
Thrashing Detection:
If ∑ WSS[P] > available_memory * threshold:
ALERT: Thrashing likely
This algorithm approximates working sets by observing which pages are accessed during each sampling interval. Pages not accessed for Δ seconds are dropped from the working set estimate.
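As a user-space illustration of the aging logic above (real reference bits live in hardware page tables and are visible only to the kernel), the algorithm can be simulated with a dictionary of per-page age counters; the Δ and τ values are the example parameters from the pseudocode:

```python
# Simulation of the reference-bit aging algorithm. A dict of per-page
# age counters stands in for the hardware reference bits.

DELTA = 10             # working set window, seconds (example value)
TAU = 1                # sampling interval, seconds (example value)
MAX_AGE = DELTA // TAU

def sample(history: dict, referenced: set) -> int:
    """One sampling pass: refresh pages seen this interval, age the rest.
    Returns the current working-set-size estimate."""
    for page in referenced:
        history[page] = MAX_AGE      # reference bit was set: reset age
    for page in list(history):
        if page not in referenced:
            history[page] -= 1       # reference bit clear: age the entry
            if history[page] <= 0:
                del history[page]    # aged out of the window
    return len(history)              # WSS = pages still in the window

history = {}
sample(history, {1, 2, 3})           # pages 1-3 touched
wss = sample(history, {2, 3, 4})     # page 1 starts aging, page 4 joins
print(wss)  # 4 — page 1 is still within the window
```

Summing the per-process estimates and comparing against available frames gives the thrashing check from the pseudocode.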
Pure working set tracking is often too expensive for production systems. A practical compromise is to track Resident Set Size (RSS) as a proxy. While RSS isn't exactly the working set, it correlates well enough for detection purposes:
If ∑ RSS(i) approaches physical memory, thrashing risk is high.
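A minimal sketch of the RSS heuristic, assuming Linux's /proc/[pid]/statm layout (the second field is resident pages); `rss_pressure`, `read_all_rss`, and the 90% threshold are illustrative choices, not a standard API:

```python
import glob
import os

def rss_pressure(rss_bytes, mem_total, threshold=0.9):
    """True if summed RSS approaches physical memory (threshold is tunable)."""
    return sum(rss_bytes) > threshold * mem_total

def read_all_rss():
    """Collect the RSS of every process from /proc/[pid]/statm (Linux only).
    The second statm field is resident pages; multiply by the page size."""
    page_size = os.sysconf('SC_PAGE_SIZE')
    rss = []
    for statm in glob.glob('/proc/[0-9]*/statm'):
        try:
            with open(statm) as f:
                rss.append(int(f.read().split()[1]) * page_size)
        except (FileNotFoundError, ProcessLookupError):
            pass  # process exited between glob and read
    return rss

# Synthetic example: three processes on a 4 GiB machine
GiB = 1 << 30
print(rss_pressure([1.5 * GiB, 1.2 * GiB, 1.1 * GiB], 4 * GiB))  # True (3.8 > 3.6 GiB)
```

The overhead is negligible, which is why RSS-based checks are common as a first line of defense.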
Page Fault Frequency (PFF) provides a simpler, more practical detection approach than full working set tracking. Instead of measuring working sets directly, we observe the symptom of inadequate allocation: elevated fault rates.
PFF Detection Algorithm:
Algorithm: PFF-Based Thrashing Detection
Parameters:
- upper_threshold = maximum acceptable fault rate per process
- lower_threshold = minimum useful fault rate
- global_threshold = system-wide fault rate threshold
- sample_interval = measurement period
Every sample_interval:
total_faults = 0
processes_over_limit = 0
For each process P:
fault_rate[P] = faults(P) / sample_interval
total_faults += fault_rate[P]
If fault_rate[P] > upper_threshold:
processes_over_limit += 1
mark P as "needs more frames"
If fault_rate[P] < lower_threshold:
mark P as "can release frames"
# Global thrashing detection
If total_faults > global_threshold:
If all processes marked "needs more frames":
ALERT: THRASHING - reduce multiprogramming
Else:
ACTION: Rebalance frames from "can release" to "needs more"
PFF detection reveals an important condition: irrecoverable thrashing. This occurs when the system-wide fault rate exceeds the global threshold while every process is above the upper per-process threshold, so no process has frames to spare.
In this state, rebalancing is impossible—the only solution is to reduce the number of processes.
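The PFF pseudocode condenses into a short sketch; `pff_detect` and its threshold values are illustrative names for this page, not a kernel interface:

```python
def pff_detect(fault_rates, upper, lower, global_limit):
    """Classify processes by fault rate and pick a system-level action.
    fault_rates maps process name -> page faults/sec over the sample interval."""
    needs_more = [p for p, r in fault_rates.items() if r > upper]
    can_release = [p for p, r in fault_rates.items() if r < lower]
    total = sum(fault_rates.values())

    if total > global_limit:
        if len(needs_more) == len(fault_rates):
            # Irrecoverable: every process is starved, nothing to rebalance
            action = "REDUCE_MULTIPROGRAMMING"
        else:
            action = "REBALANCE_FRAMES"  # move frames from donors to starved
    else:
        action = "NONE"
    return {"needs_more": needs_more, "can_release": can_release, "action": action}

# Every process over the limit, total above the global threshold:
state = pff_detect({"db": 900, "web": 800, "batch": 700},
                   upper=500, lower=50, global_limit=2000)
print(state["action"])  # REDUCE_MULTIPROGRAMMING
```

When at least one process sits below the lower threshold, the same function instead recommends rebalancing rather than suspending work.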
Modern operating systems expose memory pressure indicators that provide early warning of potential thrashing. These metrics aggregate multiple signals into actionable information.
| OS | Mechanism | Location/API | What It Reports |
|---|---|---|---|
| Linux | PSI (Pressure Stall Info) | /proc/pressure/memory | % time processes stalled on memory |
| Linux | Memory cgroups | /sys/fs/cgroup/memory.* | Per-group pressure, OOM events |
| Linux | Watermarks | /proc/zoneinfo | Pages in each zone vs. thresholds |
| Windows | Memory Manager | Performance Counters | Available bytes, cache faults/sec |
| macOS | VM pressure | Mach APIs | Warning, critical, warning-recovery |
| FreeBSD | vm.stats | sysctl | Various VM statistics |
Linux 4.20+ provides PSI, a powerful mechanism for thrashing detection:
$ cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
High 'full' values indicate thrashing—all processes are blocked waiting for memory.
#!/usr/bin/env python3
"""
PSI-Based Thrashing Detection for Linux 4.20+
Uses Pressure Stall Information for accurate thrashing detection
"""
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class PSIMetrics:
    """Pressure Stall Information metrics"""
    some_avg10: float   # Some tasks stalled, 10s avg
    some_avg60: float   # Some tasks stalled, 60s avg
    some_avg300: float  # Some tasks stalled, 5min avg
    full_avg10: float   # All tasks stalled, 10s avg
    full_avg60: float   # All tasks stalled, 60s avg
    full_avg300: float  # All tasks stalled, 5min avg
    some_total: int     # Total some stall time (microseconds)
    full_total: int     # Total full stall time (microseconds)

def read_psi_memory() -> Optional[PSIMetrics]:
    """Read PSI memory metrics from /proc/pressure/memory"""
    try:
        with open('/proc/pressure/memory', 'r') as f:
            lines = f.readlines()
        some_line = dict(x.split('=') for x in lines[0].split()[1:])
        full_line = dict(x.split('=') for x in lines[1].split()[1:])
        return PSIMetrics(
            some_avg10=float(some_line['avg10']),
            some_avg60=float(some_line['avg60']),
            some_avg300=float(some_line['avg300']),
            full_avg10=float(full_line['avg10']),
            full_avg60=float(full_line['avg60']),
            full_avg300=float(full_line['avg300']),
            some_total=int(some_line['total']),
            full_total=int(full_line['total']),
        )
    except FileNotFoundError:
        print("PSI not available (requires Linux 4.20+)")
        return None

class PSIThrashingDetector:
    """
    Detects thrashing using Linux PSI metrics.

    PSI 'full' metric indicates time when ALL tasks are stalled -
    this is the most direct measure of thrashing available.
    """
    def __init__(self,
                 warning_threshold: float = 10.0,    # 10% full stall
                 critical_threshold: float = 25.0,   # 25% full stall
                 thrashing_threshold: float = 50.0): # 50% full stall
        self.warning_threshold = warning_threshold
        self.critical_threshold = critical_threshold
        self.thrashing_threshold = thrashing_threshold

    def analyze(self, psi: PSIMetrics) -> dict:
        """Analyze PSI metrics and return status"""
        # Use 10s average for quick response
        full_pressure = psi.full_avg10
        some_pressure = psi.some_avg10

        # Determine severity
        if full_pressure >= self.thrashing_threshold:
            status = "THRASHING"
            urgency = "IMMEDIATE"
        elif full_pressure >= self.critical_threshold:
            status = "CRITICAL"
            urgency = "HIGH"
        elif full_pressure >= self.warning_threshold:
            status = "WARNING"
            urgency = "MEDIUM"
        else:
            status = "NORMAL"
            urgency = "NONE"

        # Trending analysis
        if psi.full_avg10 > psi.full_avg60 > psi.full_avg300:
            trend = "WORSENING"
        elif psi.full_avg10 < psi.full_avg60 < psi.full_avg300:
            trend = "IMPROVING"
        else:
            trend = "STABLE"

        return {
            'status': status,
            'urgency': urgency,
            'full_pressure': full_pressure,
            'some_pressure': some_pressure,
            'trend': trend,
            'recommendation': self._get_recommendation(status, trend)
        }

    def _get_recommendation(self, status: str, trend: str) -> str:
        if status == "THRASHING":
            return "IMMEDIATE ACTION: Suspend or kill processes to reduce memory pressure"
        elif status == "CRITICAL":
            if trend == "WORSENING":
                return "Prepare load shedding - thrashing imminent"
            return "Investigate memory-heavy processes"
        elif status == "WARNING":
            return "Monitor closely, consider preemptive action"
        return "System healthy"

def main():
    detector = PSIThrashingDetector()
    print("PSI-Based Thrashing Monitor")
    print("=" * 60)
    while True:
        psi = read_psi_memory()
        if psi is None:
            print("Cannot read PSI metrics")
            break
        analysis = detector.analyze(psi)
        print(f"[{time.strftime('%H:%M:%S')}]")
        print(f"  Status: {analysis['status']} | Urgency: {analysis['urgency']}")
        print(f"  Full stall: {analysis['full_pressure']:.1f}% | "
              f"Some stall: {analysis['some_pressure']:.1f}%")
        print(f"  Trend: {analysis['trend']}")
        if analysis['status'] != "NORMAL":
            print(f"  >> {analysis['recommendation']}")
        time.sleep(1)

if __name__ == "__main__":
    main()

The most robust thrashing detection combines multiple indicators into a composite score. This approach reduces false positives by requiring multiple metrics to agree before alerting.
The Composite Detection Framework:
Algorithm: Composite Thrashing Detection
Metrics (each normalized to 0-100 scale):
M1 = Memory pressure (PSI full, or MemAvailable/MemTotal inverted)
M2 = Page fault rate (relative to baseline)
M3 = I/O wait percentage
M4 = Load average divergence from CPU utilization
M5 = Swap activity rate
Weights (tunable based on system):
W1 = 0.25 (memory pressure is primary indicator)
W2 = 0.25 (fault rate is strong signal)
W3 = 0.20 (I/O wait is reliable)
W4 = 0.15 (load divergence is confirming)
W5 = 0.15 (swap activity is confirming)
Composite Score:
S = W1*M1 + W2*M2 + W3*M3 + W4*M4 + W5*M5
Thresholds:
S < 30: NORMAL
30 ≤ S < 50: WARNING
50 ≤ S < 70: CRITICAL
S ≥ 70: THRASHING
The weights and thresholds above are starting points. For your specific system, collect baseline metrics during normal operation, tune thresholds against known incidents, and validate the false-positive rate before automating responses. Different workloads require different tuning: a batch processing system exhibits very different patterns from a web server.
Metric Normalization:
To combine different metrics, we normalize each to a 0-100 scale:
| Metric | Raw Range | Normalization Formula |
|---|---|---|
| Memory (PSI full) | 0-100% | Use directly |
| Fault rate | 0-∞ f/s | min(100, rate/baseline × 50) |
| I/O wait | 0-100% | Use directly |
| Load divergence | 0-∞ | min(100, (load/cores - cpu%) × 2) |
| Swap rate | 0-∞ MB/s | min(100, rate × 10) |
Normalization ensures each metric contributes proportionally to the composite score.
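Putting the normalization table and the weighted sum together might look like the sketch below; `fault_baseline` and the percent-scale interpretation of load divergence are assumptions, since the raw formulas leave units open:

```python
def normalize(psi_full, fault_rate, io_wait, load_pct, cpu_pct, swap_mb_s,
              fault_baseline=100.0):
    """Map raw metrics onto 0-100 using the table's formulas.
    fault_baseline (faults/s) and expressing load per core as a percent
    of capacity are assumed conventions for this sketch."""
    return [
        psi_full,                                     # M1: already 0-100
        min(100, fault_rate / fault_baseline * 50),   # M2: fault rate
        io_wait,                                      # M3: already 0-100
        min(100, max(0, load_pct - cpu_pct) * 2),     # M4: load divergence
        min(100, swap_mb_s * 10),                     # M5: swap activity
    ]

def composite_score(metrics, weights=(0.25, 0.25, 0.20, 0.15, 0.15)):
    """Weighted sum plus the threshold bands from the algorithm."""
    s = sum(w * m for w, m in zip(weights, metrics))
    if s < 30:
        status = "NORMAL"
    elif s < 50:
        status = "WARNING"
    elif s < 70:
        status = "CRITICAL"
    else:
        status = "THRASHING"
    return s, status

# A heavily loaded box: high PSI, 4x-baseline faults, saturated I/O and swap
score, status = composite_score(normalize(60, 400, 80, 300, 40, 15))
print(round(score), status)  # 86 THRASHING
```

Because M2, M4, and M5 saturate at 100, extreme readings on one metric cannot dominate the score by themselves, which is the point of the composite design.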
For complex systems with variable workloads, machine learning approaches can learn thrashing patterns from historical data and detect anomalies that rule-based systems might miss.
ML Approaches for Thrashing Detection:
| Approach | Method | Best For |
|---|---|---|
| Anomaly Detection | Isolation Forest, One-Class SVM | Detecting unusual patterns |
| Classification | Random Forest, Gradient Boosting | Binary thrashing/not-thrashing |
| Time Series | LSTM, Prophet | Predicting future thrashing |
| Clustering | K-Means, DBSCAN | Identifying distinct system states |
| Regression | Linear, XGBoost | Predicting performance degradation |
Key features for ML-based thrashing detection:
Raw metrics: CPU user/sys/wait, memory available, fault rate, swap rate, load average
Derived features: Rate of change of each metric, ratios (e.g., load/CPU), rolling statistics (min, max, variance over windows)
Temporal features: Time of day, day of week, seasonality indicators
Context features: Number of processes, container count, recent deployments
Practical ML Implementation:
# Simplified thrashing prediction using Isolation Forest
from sklearn.ensemble import IsolationForest
import numpy as np

class ThrashingPredictor:
    def __init__(self, contamination=0.05):
        # Contamination = expected fraction of anomalies
        self.model = IsolationForest(
            contamination=contamination,
            random_state=42,
            n_estimators=100
        )
        self.trained = False

    def train(self, normal_operation_data: np.ndarray):
        """
        Train on data from normal operation.
        Data shape: (n_samples, n_features)
        Features: [cpu_user, cpu_wait, mem_pct, fault_rate, swap_rate]
        """
        self.model.fit(normal_operation_data)
        self.trained = True

    def detect(self, current_metrics: np.ndarray) -> dict:
        """
        Detect if current metrics indicate thrashing.
        Returns prediction and anomaly score.
        """
        if not self.trained:
            return {'error': 'Model not trained'}

        # -1 = anomaly (potential thrashing)
        # +1 = normal
        prediction = self.model.predict(current_metrics.reshape(1, -1))[0]
        score = self.model.score_samples(current_metrics.reshape(1, -1))[0]

        return {
            'is_anomaly': prediction == -1,
            'anomaly_score': -score,  # Higher = more anomalous
            'status': 'POTENTIAL_THRASHING' if prediction == -1 else 'NORMAL'
        }
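To see the approach end to end, here is a self-contained run of the same Isolation Forest technique against synthetic data; the "normal operation" distribution and the thrashing sample are invented for illustration, and a real deployment would train on recorded production metrics:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic normal operation: [cpu_user, cpu_wait, mem_pct, fault_rate, swap_rate]
normal = rng.normal(loc=[40, 5, 60, 50, 0],
                    scale=[10, 2, 10, 20, 0.5],
                    size=(500, 5))
model = IsolationForest(contamination=0.05, random_state=42).fit(normal)

# A thrashing-like sample: CPU idle, high I/O wait, extreme fault/swap rates
thrashing_sample = np.array([[5, 80, 95, 5000, 200]])
print(model.predict(thrashing_sample)[0])  # -1 flags an anomaly
```

Retraining periodically matters: as the workload's "normal" drifts, a stale model will either miss real anomalies or flood operators with false ones.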
Detection approaches fall into two categories: real-time (detecting current thrashing) and predictive (anticipating future thrashing). Each has distinct use cases.
| Aspect | Real-Time Detection | Predictive Detection |
|---|---|---|
| Timing | Detects current state | Predicts future state |
| Data | Current metrics | Historical patterns + current trend |
| Algorithm | Threshold/rule-based | Time series analysis, ML |
| Lead time | None (reactive) | Minutes to hours |
| Accuracy | High (clear signals) | Moderate (prediction uncertainty) |
| Use case | Immediate response | Proactive capacity planning |
| Action | Emergency load shedding | Scale up, shed load gracefully |
The best systems use both approaches: predictive detection drives graceful, proactive responses such as scaling up or shedding load early, while real-time detection acts as a backstop that triggers emergency action when prediction misses.
This layered approach maximizes system resilience.
Predictive Detection Example:
Algorithm: Trend-Based Thrashing Prediction
Inputs:
- memory_usage[] = last 60 samples of memory utilization
- fault_rate[] = last 60 samples of page fault rate
- horizon = prediction time (e.g., 5 minutes)
Process:
1. Fit linear regression to recent memory_usage
trend_mem = slope of regression
2. Fit linear regression to recent fault_rate
trend_fault = slope of regression
3. Project forward:
projected_mem = current_mem + trend_mem × horizon
projected_fault = current_fault + trend_fault × horizon
4. Risk assessment:
If projected_mem > 90% AND projected_fault > baseline × 5:
ALERT: Thrashing predicted in ~{horizon} minutes
Recommended action: Preemptive load reduction
This simple linear extrapolation provides early warning for gradual memory pressure buildup.
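The prediction steps map directly onto `numpy.polyfit`; the 90% memory and 5× baseline cutoffs come from the pseudocode, while the sample series below is invented:

```python
import numpy as np

def predict_thrashing(mem_usage, fault_rate, horizon, baseline_faults):
    """Fit a line to each series and project `horizon` samples ahead.
    Cutoffs (90% memory, 5x baseline faults) follow the pseudocode above."""
    t = np.arange(len(mem_usage))
    mem_slope = np.polyfit(t, mem_usage, 1)[0]      # slope of linear fit
    fault_slope = np.polyfit(t, fault_rate, 1)[0]
    projected_mem = mem_usage[-1] + mem_slope * horizon
    projected_fault = fault_rate[-1] + fault_slope * horizon
    at_risk = bool(projected_mem > 90 and projected_fault > baseline_faults * 5)
    return projected_mem, projected_fault, at_risk

# Invented sample data: memory climbing 0.5%/sample, faults +10/s per sample
mem = [60 + 0.5 * i for i in range(60)]
faults = [100 + 10 * i for i in range(60)]
pm, pf, risk = predict_thrashing(mem, faults, horizon=30, baseline_faults=100)
print(risk)  # True — projected ~104.5% memory, ~990 faults/s
```

Fitting over a window rather than differencing the last two samples makes the projection robust to single-sample noise, at the cost of reacting more slowly to sudden shifts.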
Detection is only valuable when coupled with automated response. This section outlines how to integrate thrashing detection into operational systems.
Automated Response Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ DETECTION → RESPONSE PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Metrics │───►│ Detection│───►│ Decision │ │
│ │ Collector│ │ Engine │ │ Engine │ │
│ └──────────┘ └──────────┘ └────┬─────┘ │
│ │ │
│ ┌──────────────┼──────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Alert │ │ Scale │ │ Shed │ │
│ │ (Warn) │ │ (Add) │ │ (Drop) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Decision Logic: │
│ - WARNING: Alert on-call, monitor closely │
│ - CRITICAL + WORSENING: Scale up if possible │
│ - THRASHING: Immediate load shedding │
│ │
└─────────────────────────────────────────────────────────────┘
Response actions operate on different time scales: alerts fire within seconds, load shedding takes effect in seconds to minutes, and scaling up new capacity takes minutes or longer.
Match response to detection urgency. Don't wait for slow scaling when immediate shedding is needed.
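The decision logic in the pipeline diagram reduces to a small dispatch function; the tier names (`SHED`, `SCALE`, `ALERT`) are illustrative labels, not a standard interface:

```python
def decide(status: str, trend: str) -> str:
    """Map detector output to a response tier per the pipeline's decision logic."""
    if status == "THRASHING":
        return "SHED"    # immediate load shedding
    if status == "CRITICAL" and trend == "WORSENING":
        return "SCALE"   # add capacity before collapse
    if status in ("CRITICAL", "WARNING"):
        return "ALERT"   # notify on-call, keep monitoring
    return "NONE"

print(decide("THRASHING", "STABLE"))    # SHED
print(decide("CRITICAL", "WORSENING")) # SCALE
print(decide("WARNING", "IMPROVING"))  # ALERT
```

Keeping the mapping explicit and ordered from most to least severe makes it easy to audit which conditions can trigger destructive actions like shedding.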
You have completed Module 4: Thrashing. You now understand:
- Definition: What thrashing is and why it matters
- Cause: Insufficient frames relative to working set demand
- Symptoms: High page fault rates and paradoxical CPU drops
- Detection: Working set analysis, PFF, PSI, composite algorithms, and ML approaches
With this knowledge, you can design, monitor, and operate systems that resist thrashing and recover gracefully when memory pressure rises.