If insufficient frames are the cause of thrashing, high page fault rate is its most visible symptom. Understanding page fault rates is essential for diagnosing thrashing—it transforms an invisible problem into a measurable quantity that operators and automated systems can monitor and respond to.
This page provides a deep analysis of page fault rates: what they mean, how to measure them, what constitutes "high," and how they relate to system performance. By the end, you will understand how page fault rates are measured and interpreted, what distinguishes normal paging from thrashing, the mathematical relationship between fault rates and performance, and how to use fault rate data for proactive system management.
A page fault rate measures how frequently page faults occur over time or per reference. There are several ways to express this metric:
1. Absolute Fault Rate (Faults per Second):
Fault Rate = Total Page Faults / Time Interval
Units: faults/second (f/s)
Example: 5,000 faults/second
2. Reference Fault Rate (Faults per Reference):
Fault Rate = Page Faults / Memory References
Units: faults/reference (dimensionless)
Example: 0.001 (1 fault per 1000 references)
3. Per-Process Fault Rate:
Fault Rate(i) = Faults by Process i / Time
Useful for identifying which process is thrashing
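The three metrics above can be expressed as small helper functions. This is a minimal illustration; the counter values passed in are hypothetical samples, not real kernel readings:

```python
def absolute_fault_rate(total_faults, interval_s):
    """Faults per second over a measurement interval."""
    return total_faults / interval_s

def reference_fault_rate(page_faults, memory_references):
    """Dimensionless: faults per memory reference."""
    return page_faults / memory_references

def per_process_fault_rate(faults_by_pid, interval_s):
    """Map each PID to its own faults-per-second figure."""
    return {pid: f / interval_s for pid, f in faults_by_pid.items()}

# Hypothetical counter samples over a 1-second interval
print(absolute_fault_rate(5000, 1.0))          # 5000.0 faults/second
print(reference_fault_rate(1000, 1_000_000))   # 0.001 = 1 fault per 1000 references
print(per_process_fault_rate({101: 250, 102: 80}, 1.0))
```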
| Metric | Formula | Typical Values | Primary Use |
|---|---|---|---|
| System Fault Rate | Total faults / second | 100-10,000 f/s | System-wide health |
| Process Fault Rate | Process faults / second | 1-1,000 f/s | Per-process diagnosis |
| Reference Rate | Faults / memory accesses | 0.0001-0.01 | Theoretical analysis |
| Major Faults | Disk I/O faults / second | 10-1,000 f/s | I/O subsystem load |
| Minor Faults | Non-I/O faults / second | 100-100,000 f/s | Page table manipulation |
Major faults require disk I/O—the page must be loaded from storage. These are expensive (milliseconds).
Minor faults don't require disk I/O—the page is already in memory (shared, copy-on-write, or mapped differently). These are cheap (microseconds).
For thrashing analysis, major faults are the critical metric. A system with a high minor fault rate may be fine; a system with a high major fault rate is likely thrashing.
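To see why major faults are the critical metric, compare the wall-clock time each class consumes per second. The service costs below are the order-of-magnitude figures from above (roughly 10 ms per major fault on HDD, about 1 µs per minor fault) and are illustrative:

```python
MAJOR_COST_S = 0.010     # ~10 ms: disk I/O required (HDD, illustrative)
MINOR_COST_S = 0.000001  # ~1 us: in-memory page table fix-up only

def fault_time_fraction(major_per_s, minor_per_s):
    """Fraction of each wall-clock second spent servicing faults."""
    return major_per_s * MAJOR_COST_S + minor_per_s * MINOR_COST_S

# 100,000 minor faults/s consumes only ~10% of each second...
print(fault_time_fraction(0, 100_000))
# ...while just 100 major faults/s consumes the entire second.
print(fault_time_fraction(100, 0))
```

A thousandfold difference in per-fault cost means a modest major fault rate hurts more than an enormous minor fault rate.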
Measurement Sources:
Operating systems provide multiple ways to observe page fault rates:
| Operating System | Command/Tool | What It Shows |
|---|---|---|
| Linux | vmstat 1 | si/so (swap in/out), pgpgin/pgpgout |
| Linux | sar -B 1 | pgfault/s, majflt/s, pgscank/s |
| Linux | /proc/vmstat | Raw kernel counters |
| Linux | perf stat | Hardware-level page fault events |
| Windows | Performance Monitor | Pages/sec, Page Faults/sec |
| Windows | typeperf | Counter logging |
| macOS | vm_stat | Pageins, pageouts, faults |
| All | Application profilers | Per-process breakdown |
The question "what page fault rate is too high?" doesn't have a simple numerical answer. "High" is relative to system capacity, workload characteristics, and storage speed. However, we can establish meaningful thresholds and guidelines.
A page fault rate is "too high" when page fault service time dominates process execution time. The exact threshold depends on storage speed:
Calculating the Thrashing Threshold:
We can calculate when page faulting dominates execution:
Let:
T_fault = Average page fault service time
F = Fault rate (faults per second)
In each second of wall-clock time, F × T_fault seconds go to servicing faults, leaving 1 − F × T_fault seconds for useful execution. Faulting dominates when:
F × T_fault > 1 − F × T_fault
Simplifying, thrashing begins when:
F > 1 / (2 × T_fault)
For HDD (T_fault = 10ms = 0.01s):
F > 1 / (2 × 0.01) = 50 faults/second
For SSD (T_fault = 0.1ms = 0.0001s):
F > 1 / (2 × 0.0001) = 5,000 faults/second
These are order-of-magnitude estimates; actual thresholds depend on workload characteristics.
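The threshold formula is easy to evaluate for different storage tiers. The HDD and SSD service times are the figures used above; the NVMe figure (~10 µs) is an assumption for illustration:

```python
def thrashing_threshold(t_fault_s):
    """Fault rate above which fault service consumes more than half
    of wall-clock time: F > 1 / (2 * T_fault)."""
    return 1.0 / (2.0 * t_fault_s)

# Order-of-magnitude service times; the NVMe value is assumed
for name, t_fault in [("HDD", 0.01), ("SSD", 0.0001), ("NVMe", 0.00001)]:
    print(f"{name}: thrashing above ~{thrashing_threshold(t_fault):,.0f} faults/s")
```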
| Level | HDD (f/s) | SSD (f/s) | NVMe (f/s) | System Impact |
|---|---|---|---|---|
| Normal | < 50 | < 500 | < 2,000 | Negligible performance impact |
| Elevated | 50-200 | 500-2,000 | 2,000-10,000 | Noticeable slowdown |
| High | 200-500 | 2,000-5,000 | 10,000-50,000 | Significant degradation |
| Critical | 500-1,000 | 5,000-10,000 | 50,000-100,000 | Severe performance issues |
| Thrashing | > 1,000 | > 10,000 | > 100,000 | System nearly unusable |
Context-Dependent Interpretation:
Raw fault rates must be interpreted in context:
Workload Phase: Application startup typically has high fault rates as code and data load—this is expected
Process Count: 500 faults/second across 50 processes (10 f/s/process) differs from 500 faults/second from one process
Read vs. Write: Read faults can often be serviced from cache; write faults require actual disk I/O
Sequential vs. Random: Sequential page faults can be prefetched; random access patterns cannot
Baseline Comparison: Compare current rates to normal operation, not just absolute thresholds
Page fault rates are not static—they vary over time in characteristic patterns. Understanding these dynamics is essential for distinguishing normal behavior from thrashing.
Thrashing exhibits distinctly different patterns:
Fault Rate Time Series Analysis:
Faults/s
▲
5000 ┤ ╱╲ ╱╲ ╱╲
│ ╱ ╲╱ ╲╱ ╲ Thrashing:
4000 ┤ ╱ ╲ Sustained high,
│ ╱ no recovery
3000 ┤ ╱
│ ╱
2000 ┤ ╱
│ ╱
1000 ┤ ╱╲ ╱
│ ╱ ╲ ╱ Normal: Spike
500 ┤╱ ╲─────╱ then recovery
│ ╲__╱
0 ┼────┬────┬────┬────┬────┬────┬────┬────► Time
t1 t2 t3 t4 t5 t6 t7 t8
│ │ │ │
Startup Recovery Thrashing
Spike Begins
The key diagnostic is whether the fault rate returns to baseline or continues escalating.
Statistical Indicators:
Beyond raw fault rate, statistical measures help identify thrashing:
| Indicator | Normal Operation | Thrashing |
|---|---|---|
| Mean fault rate | Low, stable | High, increasing |
| Variance | Low | High (chaotic) |
| Trend | Flat or decreasing | Increasing |
| Autocorrelation | High (predictable) | Low (chaotic) |
| Distribution | Concentrated | Spread/bimodal |
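These indicators can be computed over a sliding window of fault-rate samples. The sketch below is a crude heuristic, not a calibrated detector; the sample series and thresholds are illustrative:

```python
import statistics

def classify(samples, high_mean=1000.0, trend_eps=10.0):
    """Crude heuristic over a window of faults/s samples: thrashing shows
    a high mean that is still rising; a startup spike recovers toward
    baseline.  Thresholds are illustrative, not calibrated."""
    mean = statistics.fmean(samples)
    half = len(samples) // 2
    # Trend: average of the second half of the window minus the first half
    trend = statistics.fmean(samples[half:]) - statistics.fmean(samples[:half])
    if mean > high_mean and trend > trend_eps:
        return "thrashing"
    if trend > trend_eps:
        return "elevated"
    return "normal"

# Startup spike that recovers vs. sustained escalation (hypothetical samples)
print(classify([40, 55, 900, 120, 60, 45, 50, 42]))               # "normal"
print(classify([800, 1200, 1600, 2100, 2600, 3200, 3900, 4700]))  # "thrashing"
```

The key design choice mirrors the time-series diagnostic above: a spike that recovers has a negative trend and is classified as normal, no matter how high it peaked.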
Page Fault Frequency (PFF) is both a measurement technique and a control strategy. By monitoring fault frequency per process, we can detect and respond to thrashing before it becomes catastrophic.
The PFF strategy uses two thresholds:
Upper Threshold (U): If fault rate > U, the process needs more frames
Lower Threshold (L): If fault rate < L, the process has excess frames
The OS adjusts frame allocation based on which threshold is crossed:
PFF in Action:
Consider a system with PFF thresholds U=200 f/s and L=50 f/s:
Process Current Frames Fault Rate Action
─────── ────────────── ────────── ──────
A 100 250 Needs more frames (>200)
B 150 80 OK (50 < 80 < 200)
C 200 30 Can give up frames (<50)
D 120 180 OK (50 < 180 < 200)
E 130 500 Urgent: needs many more frames
Rebalancing:
- Take frames from C (fault rate 30, below 50)
- Give to A and E (fault rates above 200)
This continuous rebalancing keeps all processes within acceptable fault rate bounds.
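One rebalancing pass can be sketched directly from the table (U=200, L=50). This is a simplification: the frame-transfer policy here moves a fixed chunk from each donor to the hungriest takers, whereas a real kernel would size transfers more carefully:

```python
def pff_rebalance(procs, upper=200.0, lower=50.0, chunk=10):
    """One PFF pass: procs maps name -> {'frames': int, 'rate': float}.
    Moves `chunk` frames from each process below `lower` to the
    highest-rate processes above `upper`.  Returns False if no
    rebalancing is possible (no takers, or no donors: the globally
    overcommitted case, where the OS must suspend processes instead)."""
    donors = [p for p, s in procs.items() if s["rate"] < lower]
    takers = sorted((p for p, s in procs.items() if s["rate"] > upper),
                    key=lambda p: procs[p]["rate"], reverse=True)
    if not takers or not donors:
        return False
    for donor, taker in zip(donors, takers):
        procs[donor]["frames"] -= chunk
        procs[taker]["frames"] += chunk
    return True

procs = {
    "A": {"frames": 100, "rate": 250},
    "B": {"frames": 150, "rate": 80},
    "C": {"frames": 200, "rate": 30},
    "D": {"frames": 120, "rate": 180},
    "E": {"frames": 130, "rate": 500},
}
pff_rebalance(procs)
print(procs["C"]["frames"], procs["E"]["frames"])  # C donates to E, the hungriest
```

Note that the `False` return captures the global-overcommitment case discussed below: when every process is above the upper threshold and none is below the lower one, there is nothing to redistribute.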
PFF Threshold Selection:
Choosing appropriate thresholds is critical:
| Factor | Impact on Thresholds |
|---|---|
| Storage speed | Faster storage → higher thresholds acceptable |
| Process importance | Critical processes → tighter thresholds |
| Memory pressure | Low pressure → wider thresholds (less rebalancing) |
| Workload stability | Stable workloads → can use narrower thresholds |
| Overhead tolerance | Low tolerance → wider thresholds (less rebalancing) |
The example above (U = 200 f/s, L = 50 f/s) illustrates one plausible choice for an HDD-backed system; faster storage tolerates proportionally higher thresholds.
PFF cannot solve thrashing when all processes exceed the upper threshold and no process is below the lower threshold. In this case, there are no frames to redistribute—the system is globally overcommitted. PFF detects this condition: if rebalancing is impossible, the system must either reduce multiprogramming (suspend processes) or acquire more memory.
Effective thrashing detection requires robust monitoring infrastructure. This section provides practical guidance on measuring page fault rates in production systems.
```bash
# Real-time page fault monitoring with vmstat
# Columns: si=swap in, so=swap out, bi=blocks in, bo=blocks out
vmstat 1

# Detailed page fault statistics with sar
sar -B 1 5
# Key metrics:
#   pgfault/s - Total page faults per second
#   majflt/s  - Major faults (requiring I/O)
#   pgscank/s - Pages scanned by kswapd (memory pressure)

# Per-process page faults
ps -eo pid,comm,majflt,minflt | sort -k3 -rn | head -20

# Continuous monitoring of a specific process
pidstat -r 1 -p $(pgrep -f "your_process")

# System-wide fault counters
cat /proc/vmstat | grep -E "pgfault|pgmajfault|pswpin|pswpout"
```

A simple alerting loop can be built on the same counters:

```bash
#!/bin/bash
# Alerting on high major fault rates (example)
THRESHOLD=1000
while true; do
    FAULTS=$(cat /proc/vmstat | grep pgmajfault | awk '{print $2}')
    sleep 1
    NEW_FAULTS=$(cat /proc/vmstat | grep pgmajfault | awk '{print $2}')
    RATE=$((NEW_FAULTS - FAULTS))
    if [ "$RATE" -gt "$THRESHOLD" ]; then
        echo "ALERT: Major fault rate $RATE/s exceeds threshold"
    fi
done
```

In production environments, feed these counters into your monitoring system and alert on sustained deviations from the established baseline rather than on absolute thresholds alone.
High page fault rates create feedback loops that intensify thrashing. Understanding these loops explains why thrashing is self-reinforcing and why it escalates so rapidly.
Loop 1: The I/O Queue Feedback
┌────────────────────────────────────────────────────────────┐
│ I/O QUEUE FEEDBACK LOOP │
│ │
│ High Fault Rate ──────► Disk I/O Queue Grows │
│ ▲ │ │
│ │ ▼ │
│ Fault Service Queue Wait Time Increases │
│ Time Increases │ │
│ ▲ ▼ │
│ │ Processes Block Longer │
│ │ │ │
│ └────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
As more faults occur, the disk I/O queue grows. Each fault takes longer to service because it must wait in queue. Longer service time means processes are blocked longer, and when they resume, they generate more faults because other pages have been evicted while waiting.
Loop 2: The Working Set Destruction Loop
┌────────────────────────────────────────────────────────────┐
│ WORKING SET DESTRUCTION LOOP │
│ │
│ Process A Faults ──────► A's Pages Loaded │
│ ▲ │ │
│ │ ▼ │
│ A's Working Set B's Pages Evicted │
│ Destroyed │ │
│ ▲ ▼ │
│ │ B Faults When Runs │
│ │ │ │
│ B's Pages Evicted ◄───────────┘ │
│ to Load A's │ │
│ ▲ ▼ │
│ └─────────────── B's Pages Loaded │
└────────────────────────────────────────────────────────────┘
With global replacement, processes destroy each other's working sets. Loading pages for process A evicts pages from process B. When B runs, it faults and evicts A's pages. Neither process can establish a stable working set—they continuously victimize each other.
Loop 3: The Scheduler Feedback Loop (covered on the previous page, but it bears repeating): as thrashing drives CPU utilization down, the scheduler responds by admitting more processes to raise utilization, which increases memory demand and deepens the overcommitment.
All three loops operate simultaneously, creating a catastrophic cascade.
Breaking the Loops:
To escape thrashing, we must break at least one feedback loop:
| Loop | Break Point | Action |
|---|---|---|
| I/O Queue | Reduce fault rate | Reduce demand, add frames |
| Working Set | Prevent mutual eviction | Local replacement, suspend processes |
| Scheduler | Prevent process admission | Reduce multiprogramming |
We can quantify the exact performance impact of high page fault rates using the Effective Access Time (EAT) model.
EAT = (1 - p) × ma + p × pft
Where:
p = probability that a memory reference causes a page fault
ma = memory access time (≈100 ns)
pft = page fault service time (≈10 ms for HDD)
For an HDD system:
EAT = (1 − p) × 100 ns + p × 10,000,000 ns
EAT = 100 ns + p × 9,999,900 ns
| Page Fault Probability | EAT | Slowdown vs. No Faults | Practical Interpretation |
|---|---|---|---|
| 0 (no faults) | 100 ns | 1x (baseline) | Fully resident working set |
| 0.00001 (1 in 100,000) | 200 ns | 2x | Normal operation |
| 0.0001 (1 in 10,000) | 1.1 μs | 11x | Slightly elevated paging |
| 0.001 (1 in 1,000) | 10 μs | 100x | Noticeable slowdown |
| 0.01 (1 in 100) | 100 μs | 1,000x | Significant degradation |
| 0.1 (1 in 10) | 1 ms | 10,000x | System nearly unusable |
| 0.5 (1 in 2) | 5 ms | 50,000x | Complete thrashing |
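The table values follow directly from the EAT formula and can be checked numerically (HDD figures from above: ma = 100 ns, pft = 10 ms; the table rounds to one or two significant digits):

```python
MA_NS = 100            # memory access time (~100 ns)
PFT_NS = 10_000_000    # HDD page fault service time (10 ms)

def eat_ns(p):
    """Effective access time in nanoseconds for fault probability p."""
    return (1 - p) * MA_NS + p * PFT_NS

def slowdown(p):
    """Slowdown relative to a fully resident working set."""
    return eat_ns(p) / MA_NS

for p in (0.0, 0.00001, 0.0001, 0.001):
    print(f"p={p}: EAT ~ {eat_ns(p):,.0f} ns, slowdown ~ {slowdown(p):,.1f}x")
```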
Throughput Degradation:
System throughput degrades inversely with EAT:
Throughput (relative to baseline) = ma / EAT = 100 ns / EAT
Throughput
(relative) Normal Thrashing Zone
▲ ┌────────────┐ ┌─────────────────┐
│ │
1.0 ┤──────• │
│ ╲ │
0.8 ┤ ╲ │
│ ╲ │
0.6 ┤ ╲ │
│ ╲ │
0.4 ┤ ╲ │
│ ╲ │
0.2 ┤ ╲ │
│ ╲ │
0 ┼─────┬────┬────┬╲─────┬────► Fault Rate
10 100 1000 10000
(faults/second for HDD)
The curve shows rapid throughput collapse as fault rate increases beyond the sustainable threshold.
These calculations assume a single process. With multiple processes thrashing, the feedback loops described earlier compound the damage: faults queue behind one another at the disk, and processes evict each other's working sets. The actual slowdown in a thrashing system can therefore exceed theoretical single-process predictions by an order of magnitude.
High page fault rate is the primary observable symptom of thrashing. Understanding how to measure, interpret, and respond to fault rates is essential for maintaining healthy systems.
What's Next:
High page fault rates cause high I/O wait, but there's another symptom that makes thrashing particularly insidious: CPU utilization drop. The next page examines this paradoxical metric and why it makes thrashing difficult to diagnose with traditional monitoring.
You now understand how to measure and interpret page fault rates, what constitutes thrashing-level fault rates, and the feedback loops that make high fault rates self-reinforcing.