Swapping is a necessary evil: it enables systems to handle memory demands that exceed physical capacity, but it extracts a severe performance tax. The difference between RAM access (nanoseconds) and disk access (milliseconds) spans six orders of magnitude—a gulf that can transform a responsive system into an unresponsive one.
For systems architects and engineers, understanding swap performance isn't optional. It is often the difference between a system that degrades gracefully under memory pressure and one that grinds to a halt.
This page equips you with the knowledge to measure swap impact, recognize pathological patterns, and tune systems for optimal behavior under memory pressure.
By the end of this page, you will be able to measure and interpret swap metrics, understand the devastating effects of thrashing, apply tuning strategies for different workloads, and make informed decisions about swap configuration in production systems.
To understand swap performance, we must first grasp the staggering disparity between memory and disk access times. This disparity is the fundamental reason swapping is a last resort.
| Storage Type | Typical Access Time | Relative to RAM | Operations/Second |
|---|---|---|---|
| L1 Cache | ~1 ns | 0.01x (faster) | 1,000,000,000 |
| L3 Cache | ~10 ns | 0.1x (faster) | 100,000,000 |
| RAM (DDR4) | ~100 ns | 1x (baseline) | 10,000,000 |
| NVMe SSD (4K random) | ~100 μs | 1,000x slower | 10,000 |
| SATA SSD (4K random) | ~250 μs | 2,500x slower | 4,000 |
| HDD (4K random) | ~10 ms | 100,000x slower | 100 |
What these numbers mean in practice:
Imagine a RAM access as a 1-second task (walking to the refrigerator). At that scale, an NVMe SSD access takes about 17 minutes, a SATA SSD access about 42 minutes, and an HDD access roughly 28 hours.
This is why swapping to HDD causes systems to "freeze"—from the CPU's perspective, each page fault is an eternity.
Impact on application throughput:
Consider a web server where each request touches roughly 100,000 memory locations, taking 10ms when everything is in RAM:
Without swap:
Request handling: 10ms per request
Capacity: 100 requests/second
With 0.1% of accesses faulting to swap (NVMe SSD, ~100μs per fault):
RAM portion: ~10ms (99.9% of accesses at full speed)
Swap portion: 100 faults × 100μs = 10ms
Total: ~20ms per request
Capacity: ~50 requests/second (50% reduction)
With 0.1% of accesses faulting to swap (HDD, ~10ms per fault):
Swap portion: 100 faults × 10ms = 1000ms
Total: ~1010ms per request
Capacity: ~1 request/second (99% reduction)
Even modest swap usage devastates throughput. This is not linear degradation—it's catastrophic.
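If you want to plug in your own numbers, the back-of-the-envelope model above is easy to script. The sketch below (shell, with awk doing the arithmetic) assumes the same illustrative request profile: 100,000 memory references per request, 100 ns per RAM access, and per-fault latencies taken from the table earlier; adjust the constants to match your workload.

#!/bin/bash
# Back-of-the-envelope request latency under swap (illustrative model:
# 100,000 memory references per request, 100 ns per RAM access; fault rate
# and per-fault latency are passed to the helper below).

refs=100000   # assumed memory references per request
ram_ns=100    # assumed RAM access time in nanoseconds

estimate() {
    local fault_rate=$1   # fraction of references that fault to swap
    local fault_us=$2     # swap device latency per fault, in microseconds
    local label=$3
    awk -v r="$refs" -v ram="$ram_ns" -v f="$fault_rate" -v lat="$fault_us" -v l="$label" \
        'BEGIN {
            ram_ms  = r * (1 - f) * ram / 1e6    # nanoseconds -> milliseconds
            swap_ms = r * f * lat / 1e3          # microseconds -> milliseconds
            total   = ram_ms + swap_ms
            printf "%-10s %9.1f ms/request  ~%5.1f requests/sec\n", l, total, 1000 / total
        }'
}

estimate 0     0     "no swap"
estimate 0.001 100   "NVMe swap"   # 0.1% of references fault, ~100 us each
estimate 0.001 10000 "HDD swap"    # 0.1% of references fault, ~10 ms each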
Swap access doesn't just slow down average performance—it creates massive latency outliers. A server might handle 99% of requests in 10ms, but the 1% that hit swap take 100ms or 1000ms. These tail latencies cascade through distributed systems, causing timeouts and failures that affect overall system reliability.
Effective swap management requires accurate measurement. Modern systems provide rich instrumentation for swap activity and memory pressure.
#!/bin/bash

# ============================================
# BASIC SWAP USAGE
# ============================================

# Quick overview of memory and swap usage
free -h
#                total        used        free      shared  buff/cache   available
# Mem:            31Gi        12Gi       8.0Gi       256Mi        11Gi        18Gi
# Swap:            4Gi       256Mi       3.7Gi

# Detailed swap information
swapon --show
# NAME      TYPE      SIZE USED PRIO
# /dev/sda2 partition 4G   256M -2

# Per-process swap usage
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/status ]; then
        name=$(cat /proc/$pid/comm 2>/dev/null)
        swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk '{print $2}')
        if [ ! -z "$swap" ] && [ "$swap" != "0" ]; then
            echo "$swap kB: $pid $name"
        fi
    fi
done | sort -rn | head -20

# ============================================
# SWAP ACTIVITY (RATES)
# ============================================

# Real-time swap I/O rates (pages/sec)
vmstat 1
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
#  1  0 262144 8388608 1024 11534336     0    0     4    12   50  100  5  2 93  0  0
# si = swap-in rate, so = swap-out rate (pages/sec)

# More detailed I/O including the swap device
iostat -x 1 /dev/sda2   # Monitor the swap partition specifically

# ============================================
# ADVANCED: MEMORY PRESSURE
# ============================================

# Pressure Stall Information (PSI) - Linux 4.20+
cat /proc/pressure/memory
# some avg10=0.05 avg60=0.02 avg300=0.01 total=12345678
# full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567

# "some": At least one task stalled on memory
# "full": All tasks stalled simultaneously
# Higher percentages = worse memory pressure

# ============================================
# SYSTEM CALL TRACING
# ============================================

# Trace major page faults for a specific process
perf record -e major-faults -p <pid> sleep 10
perf report

# BPF-based tracing of page faults
sudo bpftrace -e 'kprobe:handle_mm_fault { @[comm] = count(); }'

Key metrics to monitor:
| Metric | What It Means | Healthy Value | Concerning Value |
|---|---|---|---|
| Swap used | Total swap consumed | Varies | Consistently > 50% of swap |
| si (swap in) | Pages read from swap/sec | 0 | Sustained > 0 |
| so (swap out) | Pages written to swap/sec | 0-100 occasional | Sustained > 100 |
| wa (wait) | CPU time waiting for I/O | < 5% | > 20% |
| PSI some avg10 | % time any task memory-stalled | < 5% | > 25% |
| PSI full avg10 | % time all tasks memory-stalled | 0% | > 5% |
The difference between swap used and swap activity:
Swap usage (how much data is in swap) is different from swap activity (how often swap is accessed).
High usage, low activity: Inactive data was swapped long ago. It sits in swap, rarely touched. This is often acceptable—those pages aren't needed.
Low usage, high activity: A small amount of data is being thrashed between RAM and swap repeatedly. This is disastrous—constant I/O for limited benefit.
High usage, high activity: The system is in serious trouble. Lots of data in swap, and it's all being actively accessed. Thrashing scenario.
Don't alert on swap usage percentage alone. Alert on swap-in rates (si > 0 sustained) because that indicates active performance impact. A server with 80% swap used but si=0 is likely fine; a server with 10% swap used but si=1000 is struggling.
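A minimal check along those lines might look like the following sketch. The thresholds are illustrative, and it assumes vmstat is installed and the kernel exposes PSI (Linux 4.20+).

#!/bin/bash
# Minimal swap-activity check (sketch; thresholds are illustrative).
# Alerts on sustained swap-in activity and PSI memory pressure, not on
# swap usage percentage.

SI_THRESHOLD=100        # sustained pages/sec swapped in considered unhealthy
PSI_SOME_THRESHOLD=25   # percent of time any task is stalled on memory

# Average si over ten 1-second samples (si is column 7 of vmstat output;
# the first since-boot sample is skipped).
si_avg=$(vmstat 1 11 | tail -n 10 | awk '{ sum += $7 } END { printf "%d", sum / NR }')

# PSI "some" avg10 value (Linux 4.20+)
psi_some=$(awk -F'avg10=' '/^some/ { split($2, a, " "); print a[1] }' /proc/pressure/memory)

echo "swap-in avg: ${si_avg} pages/sec, PSI some avg10: ${psi_some}%"

if [ "$si_avg" -gt "$SI_THRESHOLD" ]; then
    echo "ALERT: sustained swap-in activity (si=${si_avg} pages/sec)"
fi

if awk -v v="$psi_some" -v t="$PSI_SOME_THRESHOLD" 'BEGIN { exit !(v > t) }'; then
    echo "ALERT: memory pressure is high (PSI some avg10=${psi_some}%)"
fi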
Thrashing is the most severe swap-related performance problem. It occurs when the combined working sets of running processes exceed available physical memory, causing continuous page faulting that consumes nearly all system resources.
The thrashing dynamic:
Working sets exceed RAM — Processes need more pages in memory than physical frames available.
Page faults become constant — Every time a process runs, it faults on pages that were evicted to make room for other processes.
Eviction accelerates — To handle the current fault, another page is evicted. But that page is also part of some process's working set.
CPU becomes I/O bound — Most CPU time is spent waiting for disk I/O rather than executing useful work.
Positive feedback loop — The slower the system runs, the longer processes keep pages, the more pressure builds, the slower it gets.
Detecting thrashing:
Thrashing manifests through multiple symptoms:
# Classic thrashing signatures:
# 1. CPU wait time dominates
vmstat 1
# id wa
# 5 90 # 90% waiting for I/O, 5% idle = thrashing
# 2. Swap I/O is constant and high
vmstat 1
# si so
# 5000 4000 # Thousands of pages per second = thrashing
# 3. Load average far exceeds CPU count
uptime
# load average: 48.50, 45.23, 40.10 # On a 4-CPU system = thrashing
# 4. PSI shows total stalls
cat /proc/pressure/memory
# full avg10=75.00 # 75% of time ALL tasks are stalled = thrashing
Why thrashing is catastrophic:
In a thrashing state:
To recover from thrashing: (1) Kill memory-heavy processes (if you can access the system), (2) Reduce the number of running processes to fit working sets in RAM, or (3) Add more RAM (long-term). Prevention is far better than cure—monitor memory pressure and shed load before thrashing begins.
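If you still have shell access on a struggling host, a quick triage sketch like the one below (complementing the per-process swap loop shown earlier) helps identify what to stop first; the commands are standard, but review them before running in production.

#!/bin/bash
# Emergency triage sketch for a host under severe memory pressure
# (standard commands only; review before using in production).

# Largest resident-memory consumers: candidates to stop, restart, or move
ps -eo pid,rss,comm --sort=-rss | head -n 11

# Largest swap consumers (kB of VmSwap per process)
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    swap=$(awk '/VmSwap/ { print $2 }' /proc/$pid/status 2>/dev/null)
    if [ -n "$swap" ] && [ "$swap" -gt 0 ]; then
        echo "$swap kB: $pid $(cat /proc/$pid/comm 2>/dev/null)"
    fi
done | sort -rn | head -n 10

# After identifying a culprit, stop it gracefully before resorting to SIGKILL:
# kill <pid>        # SIGTERM first
# kill -9 <pid>     # last resort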
Understanding thrashing requires understanding the working set model, introduced by Peter Denning in 1968. This model describes the set of pages a process actively uses during a time window.
Formal definition:
The working set W(t, Δ) of a process at time t is the set of pages referenced during the time interval (t - Δ, t]. The parameter Δ is the "working set window."
In practice, Δ is chosen to capture the pages actively needed for current execution phase. Typical values correspond to millions of memory references.
The working set principle:
A process should be allowed to run only if its entire working set can be held in memory.
This principle prevents thrashing: if memory can hold all working sets, page faults are rare (only for pages outside the working set). If memory cannot hold all working sets, some processes should be suspended entirely (swapped out) rather than allowed to run and cause thrashing.
Working set size vs. address space size:
A critical insight is that working set size is usually much smaller than address space size:
| Application | Address Space | Typical Working Set | Ratio |
|---|---|---|---|
| Web Browser | 4+ GB | 200-500 MB | 10:1 |
| IDE | 2 GB | 100-300 MB | 10:1 |
| Database Server | 64 GB | 2-8 GB | 10:1 |
| Video Editor | 16 GB | 1-4 GB | 5:1 |
| Scientific Simulation | 256 GB | 10-50 GB | 10:1 |
This explains why virtual memory works: processes don't need all their pages simultaneously. The art is keeping the right pages (the working set) in memory while allowing other pages to be swapped or discarded.
Linux uses the Resident Set Size (RSS) as a working set proxy, tracks accessed bits on page table entries, and uses LRU list positions to approximate page hotness. Windows has explicit working set management with working set trimming when memory is scarce.
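On Linux you can approximate a process's working set yourself using the accessed-bit mechanism just described: clear the referenced bits, wait one working-set window, and read back how much memory was touched. The sketch below assumes root access and a kernel with /proc/<pid>/smaps_rollup; note that clearing referenced bits also resets state the kernel's own page aging relies on, so treat it as a diagnostic tool.

#!/bin/bash
# Approximate a process's working set W(t, delta) on Linux (requires root).
# Clears the hardware "accessed" bits, waits one working-set window, then
# reads back how much of the address space was referenced in that window.

pid=$1
delta=${2:-10}   # working-set window in seconds (default: 10)

echo 1 > /proc/"$pid"/clear_refs    # clear referenced/accessed bits for all mappings
sleep "$delta"

# Pages referenced since the clear, i.e., the working set over the window
grep Referenced /proc/"$pid"/smaps_rollup
# Referenced:        245760 kB     <- example: ~240 MB working set

# Compare with the total resident set size
grep VmRSS /proc/"$pid"/status

Comparing the Referenced figure against VmRSS and VmSize in /proc/<pid>/status gives a concrete feel for the working-set-to-address-space ratios in the table above.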
The choice of swap storage device significantly impacts swap performance. With the storage industry's evolution from HDDs to NVMe SSDs, swap behavior has transformed dramatically.
| Storage Type | 4K Random IOPS | Page Fault Latency | Suitable For |
|---|---|---|---|
| HDD (7200 RPM) | ~100 | ~10ms | Archive only; avoid if possible |
| SATA SSD (consumer) | ~25,000 | ~250μs | Desktop/laptop OK; servers marginal |
| SATA SSD (enterprise) | ~75,000 | ~100μs | Acceptable for light server swap |
| NVMe SSD (consumer) | ~500,000 | ~50μs | Good for most workloads |
| NVMe SSD (enterprise) | ~1,000,000 | ~20μs | Excellent; swap is nearly transparent |
| Intel Optane | ~550,000 (low latency) | ~10μs | Best; approaches RAM latency |
Why NVMe changes the swap equation:
With NVMe SSDs, swap becomes much more viable:
Latency gap narrows — 20μs NVMe vs. 10ms HDD is a 500x improvement. While still slower than RAM, it's not catastrophically so.
Parallelism — NVMe drives support deep queue depths (64+ outstanding I/Os). Multiple page faults can be serviced concurrently.
Sustained throughput — NVMe can sustain 3-7 GB/s, enough to swap in hundreds of thousands of pages per second.
Consistent performance — Unlike HDDs, which slow dramatically under random access, NVMe performs similarly for sequential and random I/O.
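To see where a particular swap device falls in the table above, you can measure its 4K random-read behavior directly. The sketch below uses fio against a scratch file on the same storage rather than the raw swap partition (writing to a live swap partition would corrupt it); the file path, size, and runtime are illustrative.

#!/bin/bash
# Measure 4K random-read latency/IOPS of the storage that backs (or would back) swap.
# Uses fio against a scratch file, not the raw swap partition.

fio --name=swapdev-randread \
    --filename=/var/tmp/fio-testfile \
    --size=1G \
    --rw=randread \
    --bs=4k \
    --direct=1 \
    --iodepth=64 \
    --ioengine=libaio \
    --runtime=30 --time_based \
    --group_reporting

# Key outputs: IOPS and the "clat" percentiles; compare against the table
# above (~100 us per 4K read on NVMe, ~10 ms on HDD).
rm -f /var/tmp/fio-testfile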
SSD wear concerns:
SSD write endurance is finite. Flash cells endure a limited number of write cycles, which manufacturers express as a drive-level TBW (Terabytes Written) rating:
Heavy swap activity can accelerate wear:
Assume: 1GB swap written per hour (moderate pressure)
Daily: 24 GB
Yearly: ~9 TB
Consumer 500GB SSD (300 TBW): ~33 years (fine)
But at 100GB/day sustained: 36 TB/year = 8 years (marginal for consumer)
Enterprise SSDs with higher endurance are recommended for swap-heavy workloads.
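You can ground this estimate in real data: the pswpout counter in /proc/vmstat records how many pages have been swapped out since boot. The sketch below converts that to a daily write rate and compares it against a drive's TBW rating (it assumes 4 KiB pages; the default 300 TBW figure is just the example used above).

#!/bin/bash
# Rough estimate of how fast swap traffic consumes SSD write endurance.
# Assumes 4 KiB pages; pass the drive's rated endurance in TBW as an
# argument (default 300 TBW, matching the example above).

tbw=${1:-300}

awk -v tbw="$tbw" -v up="$(cut -d' ' -f1 /proc/uptime)" '
    /^pswpout/ {
        gb_written  = $2 * 4096 / 1e9              # pages swapped out since boot -> GB
        gb_per_day  = gb_written / (up / 86400)    # average daily swap write volume
        tb_per_year = gb_per_day * 365 / 1000
        printf "Swap writes since boot: %.1f GB (%.1f GB/day average)\n", gb_written, gb_per_day
        if (tb_per_year > 0)
            printf "At this rate: %.2f TB/year -> a %s TBW drive lasts ~%.0f years on swap alone\n",
                   tb_per_year, tbw, tbw / tb_per_year
        else
            print "No swap write-out recorded since boot."
    }' /proc/vmstat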
With fast NVMe storage, some workloads that previously required 'disable swap for performance' can now tolerate modest swap usage. However, for latency-critical applications (databases, trading systems), the advice remains: add RAM and minimize swap usage, regardless of storage speed.
Operating systems provide numerous parameters to tune swap behavior. The right settings depend heavily on workload characteristics.
#!/bin/bash

# ============================================
# SWAPPINESS: Balance between file cache and anonymous memory reclaim
# ============================================

# Range: 0-200 (default: 60)
# Lower values: prefer evicting file cache over anonymous pages
# Higher values: more willing to swap anonymous pages

# Desktop/general purpose:
sysctl -w vm.swappiness=60

# Database server (protect anonymous memory):
sysctl -w vm.swappiness=10

# Strongly discourage swapping (use with abundant RAM):
sysctl -w vm.swappiness=0
# Note: 0 doesn't disable swap; it just strongly avoids it

# ============================================
# VFS CACHE PRESSURE: Willingness to reclaim filesystem caches
# ============================================

# Range: 0-10000 (default: 100)
# Lower: hold onto dentry/inode caches (good for many small files)
# Higher: aggressively reclaim caches

# File server with many files:
sysctl -w vm.vfs_cache_pressure=50

# Memory-constrained system:
sysctl -w vm.vfs_cache_pressure=200

# ============================================
# WATERMARKS: When to start/stop reclaim
# ============================================

# Increase the buffer between watermarks (more proactive reclaim):
# Default: 10 (0.1% of RAM per zone between watermarks)
sysctl -w vm.watermark_scale_factor=150   # 1.5% gap
# This makes kswapd wake earlier, avoiding direct reclaim

# ============================================
# OVERCOMMIT SETTINGS
# ============================================

# 0 = heuristic overcommit (default)
# 1 = always overcommit (dangerous)
# 2 = strict (commit limit = swap + RAM * overcommit_ratio/100)
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80   # Commit limit = swap + 80% of RAM

# ============================================
# PAGE CLUSTERING (Read-ahead)
# ============================================

# Default: 3 (2^3 = 8 pages read-ahead)
# Higher: better for sequential swap access
# Lower: better for random swap access

# For sequential workloads:
sysctl -w vm.page-cluster=4   # 16 pages

# For random access workloads:
sysctl -w vm.page-cluster=0   # 1 page (disable read-ahead)

# ============================================
# MAKE PERSISTENT
# ============================================

# Add to /etc/sysctl.conf for persistence:
cat >> /etc/sysctl.conf << EOF
vm.swappiness=10
vm.vfs_cache_pressure=50
vm.watermark_scale_factor=150
vm.overcommit_memory=2
vm.overcommit_ratio=80
EOF

sysctl -p   # Apply changes

Tuning can help at the margins, but if workloads consistently require more memory than available, no amount of tuning will compensate. The ultimate fix for persistent memory pressure is adding RAM or reducing workload.
Modern systems use memory compression techniques to mitigate swap performance penalties. By compressing pages before writing to disk (or instead of writing to disk), these techniques reduce I/O and can significantly improve responsiveness.
#!/bin/bash

# ============================================
# ZSWAP SETUP (Compressed swap cache)
# ============================================

# Enable zswap at boot: add to the kernel command line
# zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=25

# Or enable at runtime:
echo 1 > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 25 > /sys/module/zswap/parameters/max_pool_percent

# Check status:
grep -r . /sys/kernel/debug/zswap/

# ============================================
# ZRAM SETUP (RAM-based compressed swap)
# ============================================

# Load the module
modprobe zram num_devices=1

# Set the compression algorithm first (lz4 is fast);
# it cannot be changed after the disk size is set
echo lz4 > /sys/block/zram0/comp_algorithm

# Set size (e.g., 4GB of uncompressed capacity)
echo 4G > /sys/block/zram0/disksize

# Initialize as swap
mkswap /dev/zram0

# Enable with high priority (used before disk swap)
swapon -p 100 /dev/zram0

# Verify
swapon --show
# NAME       TYPE      SIZE USED PRIO
# /dev/zram0 partition 4G   256M 100
# /dev/sda2  partition 8G   0M   -2

# Monitor compression stats
cat /sys/block/zram0/mm_stat
# orig_data_size compr_data_size mem_used_total ...
# 536870912      234567890       256789012      ...
# Compression ratio: orig/compr ≈ 2.3x

Performance impact of compression:
| Scenario | Without Compression | With zswap | With zram |
|---|---|---|---|
| Swap read latency | 100-250μs (NVMe/SATA SSD) | 10-50μs (decompress) | 2-10μs (decompress) |
| Swap write latency | 50-100μs (NVMe) | 5-20μs (compress) | 2-10μs (compress) |
| Effective swap capacity | 1x | 1.5-3x (pool acts as buffer) | 2-3x |
| Disk I/O reduction | 0% | 50-70% (many pages never hit disk) | 100% (no disk involved) |
| CPU overhead | None | Low-moderate | Low-moderate |
When to use each:
zswap fits systems that already have disk-backed swap: it acts as a compressed cache in front of the swap device, cutting swap I/O substantially while still spilling to disk when the pool fills. zram fits systems where you want to avoid disk swap entirely (no swap device, very slow storage, or SSD-wear concerns): compressed pages stay purely in RAM.
In typical workloads, anonymous pages (heap, stack) compress well—often 2-3x. On a 16GB system, an 8GB zram device holds its 8GB of swapped pages in roughly 3-4GB of RAM, yielding around 20GB of effective capacity. This transforms marginal memory situations into comfortable ones, often eliminating visible swap impact entirely.
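To check what compression is actually achieving on a given machine, the zram statistics shown earlier can be turned into a ratio directly. The sketch below assumes /dev/zram0 is configured as in the setup above and reports the net RAM gained.

#!/bin/bash
# Quick check of what zram compression is buying you (assumes /dev/zram0 is
# configured as in the setup above).
# mm_stat fields: orig_data_size compr_data_size mem_used_total ...

awk '{
    orig = $1 / 2^20; compr = $2 / 2^20; used = $3 / 2^20   # bytes -> MiB
    if (compr == 0) { print "zram device is empty"; exit }
    printf "Stored %.0f MiB of pages in %.0f MiB of RAM (compression ratio %.1fx)\n",
           orig, used, orig / compr
    printf "Net RAM gained: ~%.0f MiB\n", orig - used
}' /sys/block/zram0/mm_stat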
Swap performance is a critical aspect of system behavior that can mean the difference between a responsive system and one that grinds to a halt. Understanding the metrics, recognizing pathological patterns, and applying appropriate tuning transforms swap from a mysterious performance killer into a manageable aspect of system administration.
Module complete:
You have now completed the comprehensive study of swapping in operating systems. From the fundamental concept of swap space through the mechanics of swap operations, the evolution from standard swapping to paged memory management, and finally the critical performance considerations—you are equipped to understand, diagnose, and optimize memory management in production systems.
Congratulations! You have mastered the swapping module: swap space organization, swap in/out mechanics, standard vs. paged swapping, and performance optimization. You now understand how operating systems extend physical memory to disk and the profound performance implications of this fundamental technique.