If insufficient frames are the cause of thrashing, high page fault rate is its most visible symptom. Understanding page fault rates is essential for diagnosing thrashing—it transforms an invisible problem into a measurable quantity that operators and automated systems can monitor and respond to.
This page provides a deep analysis of page fault rates: what they mean, how to measure them, what constitutes "high," and how they relate to system performance. By the end, you will understand how page fault rates are measured and interpreted, what distinguishes normal paging from thrashing, the mathematical relationship between fault rates and performance, and how to use fault rate data for proactive system management.
A page fault rate measures how frequently page faults occur over time or per reference. There are several ways to express this metric:
1. Absolute Fault Rate (Faults per Second):
Fault Rate = Total Page Faults / Time Interval
Units: faults/second (f/s)
Example: 5,000 faults/second
2. Reference Fault Rate (Faults per Reference):
Fault Rate = Page Faults / Memory References
Units: faults/reference (dimensionless)
Example: 0.001 (1 fault per 1000 references)
3. Per-Process Fault Rate:
Fault Rate(i) = Faults by Process i / Time
Useful for identifying which process is thrashing
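The three metrics above can be expressed as small helper functions. This is a minimal illustration; the counter values passed in are hypothetical samples, not real kernel readings:

```python
def absolute_fault_rate(total_faults, interval_s):
    """Faults per second over a measurement interval."""
    return total_faults / interval_s

def reference_fault_rate(page_faults, memory_references):
    """Dimensionless: faults per memory reference."""
    return page_faults / memory_references

def per_process_fault_rate(faults_by_pid, interval_s):
    """Map each PID to its own faults-per-second figure."""
    return {pid: f / interval_s for pid, f in faults_by_pid.items()}

# Hypothetical counter samples over a 1-second interval
print(absolute_fault_rate(5000, 1.0))          # 5000.0 faults/second
print(reference_fault_rate(1000, 1_000_000))   # 0.001 = 1 fault per 1000 references
print(per_process_fault_rate({101: 250, 102: 80}, 1.0))
```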
| Metric | Formula | Typical Values | Primary Use |
|---|---|---|---|
| System Fault Rate | Total faults / second | 100-10,000 f/s | System-wide health |
| Process Fault Rate | Process faults / second | 1-1,000 f/s | Per-process diagnosis |
| Reference Rate | Faults / memory accesses | 0.0001-0.01 | Theoretical analysis |
| Major Faults | Disk I/O faults / second | 10-1,000 f/s | I/O subsystem load |
| Minor Faults | Non-I/O faults / second | 100-100,000 f/s | Page table manipulation |
Major faults require disk I/O—the page must be loaded from storage. These are expensive (milliseconds).
Minor faults don't require disk I/O—the page is already in memory (shared, copy-on-write, or mapped differently). These are cheap (microseconds).
For thrashing analysis, major faults are the critical metric. A system with a high minor fault rate may be fine; a system with a high major fault rate is likely thrashing.
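To see why major faults are the critical metric, compare the wall-clock time each class consumes per second. The service costs below are the order-of-magnitude figures from above (roughly 10 ms per major fault on HDD, about 1 µs per minor fault) and are illustrative:

```python
MAJOR_COST_S = 0.010     # ~10 ms: disk I/O required (HDD, illustrative)
MINOR_COST_S = 0.000001  # ~1 us: in-memory page table fix-up only

def fault_time_fraction(major_per_s, minor_per_s):
    """Fraction of each wall-clock second spent servicing faults."""
    return major_per_s * MAJOR_COST_S + minor_per_s * MINOR_COST_S

# 100,000 minor faults/s consumes only ~10% of each second...
print(fault_time_fraction(0, 100_000))
# ...while just 100 major faults/s consumes the entire second.
print(fault_time_fraction(100, 0))
```

A thousandfold difference in per-fault cost means a modest major fault rate hurts more than an enormous minor fault rate.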
Measurement Sources:
Operating systems provide multiple ways to observe page fault rates:
| Operating System | Command/Tool | What It Shows |
|---|---|---|
| Linux | vmstat 1 | si/so (swap in/out), pgpgin/pgpgout |
| Linux | sar -B 1 | pgfault/s, majflt/s, pgscank/s |
| Linux | /proc/vmstat | Raw kernel counters |
| Linux | perf stat | Hardware-level page fault events |
| Windows | Performance Monitor | Pages/sec, Page Faults/sec |
| Windows | typeperf | Counter logging |
| macOS | vm_stat | Pageins, pageouts, faults |
| All | Application profilers | Per-process breakdown |
The question "what page fault rate is too high?" doesn't have a simple numerical answer. "High" is relative to system capacity, workload characteristics, and storage speed. However, we can establish meaningful thresholds and guidelines.
A page fault rate is "too high" when page fault service time dominates process execution time. The exact threshold depends on storage speed:
Calculating the Thrashing Threshold:
We can calculate when page faulting dominates execution:
Let:
T_fault = Average page fault service time
F = Fault rate (faults per second)
In each second of wall-clock time, F × T_fault seconds go to servicing faults, leaving 1 − F × T_fault seconds for useful execution. Faulting dominates when:
F × T_fault > 1 − F × T_fault
Simplifying, thrashing begins when:
F > 1 / (2 × T_fault)
For HDD (T_fault = 10ms = 0.01s):
F > 1 / (2 × 0.01) = 50 faults/second
For SSD (T_fault = 0.1ms = 0.0001s):
F > 1 / (2 × 0.0001) = 5,000 faults/second
These are order-of-magnitude estimates; actual thresholds depend on workload characteristics.
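The threshold formula is easy to evaluate for different storage tiers. The HDD and SSD service times are the figures used above; the NVMe figure (~10 µs) is an assumption for illustration:

```python
def thrashing_threshold(t_fault_s):
    """Fault rate above which fault service consumes more than half
    of wall-clock time: F > 1 / (2 * T_fault)."""
    return 1.0 / (2.0 * t_fault_s)

# Order-of-magnitude service times; the NVMe value is assumed
for name, t_fault in [("HDD", 0.01), ("SSD", 0.0001), ("NVMe", 0.00001)]:
    print(f"{name}: thrashing above ~{thrashing_threshold(t_fault):,.0f} faults/s")
```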
| Level | HDD (f/s) | SSD (f/s) | NVMe (f/s) | System Impact |
|---|---|---|---|---|
| Normal | < 50 | < 500 | < 2,000 | Negligible performance impact |
| Elevated | 50-200 | 500-2,000 | 2,000-10,000 | Noticeable slowdown |
| High | 200-500 | 2,000-5,000 | 10,000-50,000 | Significant degradation |
| Critical | 500-1,000 | 5,000-10,000 | 50,000-100,000 | Severe performance issues |
| Thrashing | > 1,000 | > 10,000 | > 100,000 | System nearly unusable |
Context-Dependent Interpretation:
Raw fault rates must be interpreted in context:
Workload Phase: Application startup typically has high fault rates as code and data load—this is expected
Process Count: 500 faults/second across 50 processes (10 f/s/process) differs from 500 faults/second from one process
Read vs. Write: Read faults can often be serviced from cache; write faults require actual disk I/O
Sequential vs. Random: Sequential page faults can be prefetched; random access patterns cannot
Baseline Comparison: Compare current rates to normal operation, not just absolute thresholds
Page fault rates are not static—they vary over time in characteristic patterns. Understanding these dynamics is essential for distinguishing normal behavior from thrashing.
Thrashing exhibits distinctly different patterns:
Fault Rate Time Series Analysis:
Faults/s
▲
5000 ┤ ╱╲ ╱╲ ╱╲
│ ╱ ╲╱ ╲╱ ╲ Thrashing:
4000 ┤ ╱ ╲ Sustained high,
│ ╱ no recovery
3000 ┤ ╱
│ ╱
2000 ┤ ╱
│ ╱
1000 ┤ ╱╲ ╱
│ ╱ ╲ ╱ Normal: Spike
500 ┤╱ ╲─────╱ then recovery
│ ╲__╱
0 ┼────┬────┬────┬────┬────┬────┬────┬────► Time
t1 t2 t3 t4 t5 t6 t7 t8
│ │ │ │
Startup Recovery Thrashing
Spike Begins
The key diagnostic is whether the fault rate returns to baseline or continues escalating.
Statistical Indicators:
Beyond raw fault rate, statistical measures help identify thrashing:
| Indicator | Normal Operation | Thrashing |
|---|---|---|
| Mean fault rate | Low, stable | High, increasing |
| Variance | Low | High (chaotic) |
| Trend | Flat or decreasing | Increasing |
| Autocorrelation | High (predictable) | Low (chaotic) |
| Distribution | Concentrated | Spread/bimodal |
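These indicators can be computed over a sliding window of fault-rate samples. The sketch below is a crude heuristic, not a calibrated detector; the sample series and thresholds are illustrative:

```python
import statistics

def classify(samples, high_mean=1000.0, trend_eps=10.0):
    """Crude heuristic over a window of faults/s samples: thrashing shows
    a high mean that is still rising; a startup spike recovers toward
    baseline.  Thresholds are illustrative, not calibrated."""
    mean = statistics.fmean(samples)
    half = len(samples) // 2
    # Trend: average of the second half of the window minus the first half
    trend = statistics.fmean(samples[half:]) - statistics.fmean(samples[:half])
    if mean > high_mean and trend > trend_eps:
        return "thrashing"
    if trend > trend_eps:
        return "elevated"
    return "normal"

# Startup spike that recovers vs. sustained escalation (hypothetical samples)
print(classify([40, 55, 900, 120, 60, 45, 50, 42]))               # "normal"
print(classify([800, 1200, 1600, 2100, 2600, 3200, 3900, 4700]))  # "thrashing"
```

The key design choice mirrors the time-series diagnostic above: a spike that recovers has a negative trend and is classified as normal, no matter how high it peaked.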
Page Fault Frequency (PFF) is both a measurement technique and a control strategy. By monitoring fault frequency per process, we can detect and respond to thrashing before it becomes catastrophic.
The PFF strategy uses two thresholds:
Upper Threshold (U): If fault rate > U, the process needs more frames
Lower Threshold (L): If fault rate < L, the process has excess frames
The OS adjusts frame allocation based on which threshold is crossed:
PFF in Action:
Consider a system with PFF thresholds U=200 f/s and L=50 f/s:
Process Current Frames Fault Rate Action
─────── ────────────── ────────── ──────
A 100 250 Needs more frames (>200)
B 150 80 OK (50 < 80 < 200)
C 200 30 Can give up frames (<50)
D 120 180 OK (50 < 180 < 200)
E 130 500 Urgent: needs many more frames
Rebalancing:
- Take frames from C (fault rate 30, below 50)
- Give to A and E (fault rates above 200)
This continuous rebalancing keeps all processes within acceptable fault rate bounds.
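One rebalancing pass can be sketched directly from the table (U=200, L=50). This is a simplification: the frame-transfer policy here moves a fixed chunk from each donor to the hungriest takers, whereas a real kernel would size transfers more carefully:

```python
def pff_rebalance(procs, upper=200.0, lower=50.0, chunk=10):
    """One PFF pass: procs maps name -> {'frames': int, 'rate': float}.
    Moves `chunk` frames from each process below `lower` to the
    highest-rate processes above `upper`.  Returns False if no
    rebalancing is possible (no takers, or no donors: the globally
    overcommitted case, where the OS must suspend processes instead)."""
    donors = [p for p, s in procs.items() if s["rate"] < lower]
    takers = sorted((p for p, s in procs.items() if s["rate"] > upper),
                    key=lambda p: procs[p]["rate"], reverse=True)
    if not takers or not donors:
        return False
    for donor, taker in zip(donors, takers):
        procs[donor]["frames"] -= chunk
        procs[taker]["frames"] += chunk
    return True

procs = {
    "A": {"frames": 100, "rate": 250},
    "B": {"frames": 150, "rate": 80},
    "C": {"frames": 200, "rate": 30},
    "D": {"frames": 120, "rate": 180},
    "E": {"frames": 130, "rate": 500},
}
pff_rebalance(procs)
print(procs["C"]["frames"], procs["E"]["frames"])  # C donates to E, the hungriest
```

Note that the `False` return captures the global-overcommitment case discussed below: when every process is above the upper threshold and none is below the lower one, there is nothing to redistribute.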
PFF Threshold Selection:
Choosing appropriate thresholds is critical:
| Factor | Impact on Thresholds |
|---|---|
| Storage speed | Faster storage → higher thresholds acceptable |
| Process importance | Critical processes → tighter thresholds |
| Memory pressure | Low pressure → wider thresholds (less rebalancing) |
| Workload stability | Stable workloads → can use narrower thresholds |
| Overhead tolerance | Low tolerance → wider thresholds (less rebalancing) |
The example above (U = 200 f/s, L = 50 f/s) illustrates one plausible choice for an HDD-backed system; faster storage tolerates proportionally higher thresholds.
PFF cannot solve thrashing when all processes exceed the upper threshold and no process is below the lower threshold. In this case, there are no frames to redistribute—the system is globally overcommitted. PFF detects this condition: if rebalancing is impossible, the system must either reduce multiprogramming (suspend processes) or acquire more memory.
Effective thrashing detection requires robust monitoring infrastructure. This section provides practical guidance on measuring page fault rates in production systems.
```bash
# Real-time page fault monitoring with vmstat
# Columns: si=swap in, so=swap out, bi=blocks in, bo=blocks out
vmstat 1

# Detailed page fault statistics with sar
sar -B 1 5
# Key metrics:
#   pgfault/s - Total page faults per second
#   majflt/s  - Major faults (requiring I/O)
#   pgscank/s - Pages scanned by kswapd (memory pressure)

# Per-process page faults
ps -eo pid,comm,majflt,minflt | sort -k3 -rn | head -20

# Continuous monitoring of a specific process
pidstat -r 1 -p $(pgrep -f "your_process")

# System-wide fault counters
cat /proc/vmstat | grep -E "pgfault|pgmajfault|pswpin|pswpout"
```

A simple alerting loop can be built on the same counters:

```bash
#!/bin/bash
# Alerting on high major fault rates (example)
THRESHOLD=1000
while true; do
    FAULTS=$(cat /proc/vmstat | grep pgmajfault | awk '{print $2}')
    sleep 1
    NEW_FAULTS=$(cat /proc/vmstat | grep pgmajfault | awk '{print $2}')
    RATE=$((NEW_FAULTS - FAULTS))
    if [ "$RATE" -gt "$THRESHOLD" ]; then
        echo "ALERT: Major fault rate $RATE/s exceeds threshold"
    fi
done
```

In production environments, feed these counters into your monitoring system and alert on sustained deviations from the established baseline rather than on absolute thresholds alone.
High page fault rates create feedback loops that intensify thrashing. Understanding these loops explains why thrashing is self-reinforcing and why it escalates so rapidly.
Loop 1: The I/O Queue Feedback
┌────────────────────────────────────────────────────────────┐
│ I/O QUEUE FEEDBACK LOOP │
│ │
│ High Fault Rate ──────► Disk I/O Queue Grows │
│ ▲ │ │
│ │ ▼ │
│ Fault Service Queue Wait Time Increases │
│ Time Increases │ │
│ ▲ ▼ │
│ │ Processes Block Longer │
│ │ │ │
│ └────────────────────────┘ │
└────────────────────────────────────────────────────────────┘
As more faults occur, the disk I/O queue grows. Each fault takes longer to service because it must wait in queue. Longer service time means processes are blocked longer, and when they resume, they generate more faults because other pages have been evicted while waiting.
Loop 2: The Working Set Destruction Loop
┌────────────────────────────────────────────────────────────┐
│ WORKING SET DESTRUCTION LOOP │
│ │
│ Process A Faults ──────► A's Pages Loaded │
│ ▲ │ │
│ │ ▼ │
│ A's Working Set B's Pages Evicted │
│ Destroyed │ │
│ ▲ ▼ │
│ │ B Faults When Runs │
│ │ │ │
│ B's Pages Evicted ◄───────────┘ │
│ to Load A's │ │
│ ▲ ▼ │
│ └─────────────── B's Pages Loaded │
└────────────────────────────────────────────────────────────┘
With global replacement, processes destroy each other's working sets. Loading pages for process A evicts pages from process B. When B runs, it faults and evicts A's pages. Neither process can establish a stable working set—they continuously victimize each other.
Loop 3: The Scheduler Feedback Loop (covered on the previous page, but it bears repeating): as thrashing drives CPU utilization down, the scheduler responds by admitting more processes to raise utilization, which increases memory demand and deepens the overcommitment.
All three loops operate simultaneously, creating a catastrophic cascade.
Breaking the Loops:
To escape thrashing, we must break at least one feedback loop:
| Loop | Break Point | Action |
|---|---|---|
| I/O Queue | Reduce fault rate | Reduce demand, add frames |
| Working Set | Prevent mutual eviction | Local replacement, suspend processes |
| Scheduler | Prevent process admission | Reduce multiprogramming |
We can quantify the exact performance impact of high page fault rates using the Effective Access Time (EAT) model.
EAT = (1 - p) × ma + p × pft
Where:
p = probability that a memory reference causes a page fault
ma = memory access time (≈100 ns)
pft = page fault service time (≈10 ms for HDD)
For an HDD system:
EAT = (1 − p) × 100 ns + p × 10,000,000 ns
EAT = 100 ns + p × 9,999,900 ns
| Page Fault Probability | EAT | Slowdown vs. No Faults | Practical Interpretation |
|---|---|---|---|
| 0 (no faults) | 100 ns | 1x (baseline) | Fully resident working set |
| 0.00001 (1 in 100,000) | 200 ns | 2x | Normal operation |
| 0.0001 (1 in 10,000) | 1.1 μs | 11x | Slightly elevated paging |
| 0.001 (1 in 1,000) | 10 μs | 100x | Noticeable slowdown |
| 0.01 (1 in 100) | 100 μs | 1,000x | Significant degradation |
| 0.1 (1 in 10) | 1 ms | 10,000x | System nearly unusable |
| 0.5 (1 in 2) | 5 ms | 50,000x | Complete thrashing |
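The table values follow directly from the EAT formula and can be checked numerically (HDD figures from above: ma = 100 ns, pft = 10 ms; the table rounds to one or two significant digits):

```python
MA_NS = 100            # memory access time (~100 ns)
PFT_NS = 10_000_000    # HDD page fault service time (10 ms)

def eat_ns(p):
    """Effective access time in nanoseconds for fault probability p."""
    return (1 - p) * MA_NS + p * PFT_NS

def slowdown(p):
    """Slowdown relative to a fully resident working set."""
    return eat_ns(p) / MA_NS

for p in (0.0, 0.00001, 0.0001, 0.001):
    print(f"p={p}: EAT ~ {eat_ns(p):,.0f} ns, slowdown ~ {slowdown(p):,.1f}x")
```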
Throughput Degradation:
System throughput degrades inversely with EAT:
Throughput (relative to baseline) = ma / EAT = 100 ns / EAT
Throughput
(relative) Normal Thrashing Zone
▲ ┌────────────┐ ┌─────────────────┐
│ │
1.0 ┤──────• │
│ ╲ │
0.8 ┤ ╲ │
│ ╲ │
0.6 ┤ ╲ │
│ ╲ │
0.4 ┤ ╲ │
│ ╲ │
0.2 ┤ ╲ │
│ ╲ │
0 ┼─────┬────┬────┬╲─────┬────► Fault Rate
10 100 1000 10000
(faults/second for HDD)
The curve shows rapid throughput collapse as fault rate increases beyond the sustainable threshold.
These calculations assume a single process. With multiple processes thrashing, the feedback loops described earlier compound the damage: faults queue behind one another at the disk, and processes evict each other's working sets. The actual slowdown in a thrashing system can therefore exceed theoretical single-process predictions by an order of magnitude.
High page fault rate is the primary observable symptom of thrashing. Understanding how to measure, interpret, and respond to fault rates is essential for maintaining healthy systems.
What's Next:
High page fault rates cause high I/O wait, but there's another symptom that makes thrashing particularly insidious: CPU utilization drop. The next page examines this paradoxical metric and why it makes thrashing difficult to diagnose with traditional monitoring.
You now understand how to measure and interpret page fault rates, what constitutes thrashing-level fault rates, and the feedback loops that make high fault rates self-reinforcing.