Swapping is a necessary evil: it enables systems to handle memory demands that exceed physical capacity, but it extracts a severe performance tax. The difference between RAM access (nanoseconds) and disk access (milliseconds) spans six orders of magnitude—a gulf that can transform a responsive system into an unresponsive one.
For systems architects and engineers, understanding swap performance isn't optional. It is often the difference between a system that degrades gracefully under memory pressure and one that grinds to a halt.
This page equips you with the knowledge to measure swap impact, recognize pathological patterns, and tune systems for optimal behavior under memory pressure.
By the end of this page, you will be able to measure and interpret swap metrics, understand the devastating effects of thrashing, apply tuning strategies for different workloads, and make informed decisions about swap configuration in production systems.
To understand swap performance, we must first grasp the staggering disparity between memory and disk access times. This disparity is the fundamental reason swapping is a last resort.
| Storage Type | Typical Access Time | Relative to RAM | Operations/Second |
|---|---|---|---|
| L1 Cache | ~1 ns | 0.01x (faster) | 1,000,000,000 |
| L3 Cache | ~10 ns | 0.1x (faster) | 100,000,000 |
| RAM (DDR4) | ~100 ns | 1x (baseline) | 10,000,000 |
| NVMe SSD (4K random) | ~100 μs | 1,000x slower | 10,000 |
| SATA SSD (4K random) | ~250 μs | 2,500x slower | 4,000 |
| HDD (4K random) | ~10 ms | 100,000x slower | 100 |
What these numbers mean in practice:
Imagine a RAM access as a 1-second task (walking to the refrigerator). At that scale, an NVMe SSD access takes about 17 minutes, a SATA SSD access about 42 minutes, and an HDD access roughly 28 hours.
This is why swapping to HDD causes systems to "freeze"—from the CPU's perspective, each page fault is an eternity.
Impact on application throughput:
Consider a web server where each request touches roughly 100,000 memory locations, taking 10ms when everything is in RAM:
Without swap:
Request handling: 10ms per request
Capacity: 100 requests/second
With 0.1% of accesses faulting to swap (NVMe SSD, ~100μs per fault):
RAM portion: ~10ms (99.9% of accesses at full speed)
Swap portion: 100 faults × 100μs = 10ms
Total: ~20ms per request
Capacity: ~50 requests/second (50% reduction)
With 0.1% of accesses faulting to swap (HDD, ~10ms per fault):
Swap portion: 100 faults × 10ms = 1000ms
Total: ~1010ms per request
Capacity: ~1 request/second (99% reduction)
Even modest swap usage devastates throughput. This is not linear degradation—it's catastrophic.
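If you want to plug in your own numbers, the back-of-the-envelope model above is easy to script. The sketch below (shell, with awk doing the arithmetic) assumes the same illustrative request profile: 100,000 memory references per request, 100 ns per RAM access, and per-fault latencies taken from the table earlier; adjust the constants to match your workload.

#!/bin/bash
# Back-of-the-envelope request latency under swap (illustrative model:
# 100,000 memory references per request, 100 ns per RAM access; fault rate
# and per-fault latency are passed to the helper below).

refs=100000   # assumed memory references per request
ram_ns=100    # assumed RAM access time in nanoseconds

estimate() {
    local fault_rate=$1   # fraction of references that fault to swap
    local fault_us=$2     # swap device latency per fault, in microseconds
    local label=$3
    awk -v r="$refs" -v ram="$ram_ns" -v f="$fault_rate" -v lat="$fault_us" -v l="$label" \
        'BEGIN {
            ram_ms  = r * (1 - f) * ram / 1e6    # nanoseconds -> milliseconds
            swap_ms = r * f * lat / 1e3          # microseconds -> milliseconds
            total   = ram_ms + swap_ms
            printf "%-10s %9.1f ms/request  ~%5.1f requests/sec\n", l, total, 1000 / total
        }'
}

estimate 0     0     "no swap"
estimate 0.001 100   "NVMe swap"   # 0.1% of references fault, ~100 us each
estimate 0.001 10000 "HDD swap"    # 0.1% of references fault, ~10 ms each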
Swap access doesn't just slow down average performance—it creates massive latency outliers. A server might handle 99% of requests in 10ms, but the 1% that hit swap take 100ms or 1000ms. These tail latencies cascade through distributed systems, causing timeouts and failures that affect overall system reliability.
Effective swap management requires accurate measurement. Modern systems provide rich instrumentation for swap activity and memory pressure.
#!/bin/bash

# ============================================
# BASIC SWAP USAGE
# ============================================

# Quick overview of memory and swap usage
free -h
#                total        used        free      shared  buff/cache   available
# Mem:            31Gi        12Gi       8.0Gi       256Mi        11Gi        18Gi
# Swap:            4Gi       256Mi       3.7Gi

# Detailed swap information
swapon --show
# NAME      TYPE      SIZE USED PRIO
# /dev/sda2 partition 4G   256M -2

# Per-process swap usage
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/status ]; then
        name=$(cat /proc/$pid/comm 2>/dev/null)
        swap=$(grep VmSwap /proc/$pid/status 2>/dev/null | awk '{print $2}')
        if [ ! -z "$swap" ] && [ "$swap" != "0" ]; then
            echo "$swap kB: $pid $name"
        fi
    fi
done | sort -rn | head -20

# ============================================
# SWAP ACTIVITY (RATES)
# ============================================

# Real-time swap I/O rates (pages/sec)
vmstat 1
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
#  r  b   swpd   free   buff   cache    si   so    bi    bo   in   cs us sy id wa st
#  1  0 262144 8388608 1024 11534336     0    0     4    12   50  100  5  2 93  0  0
# si = swap-in rate, so = swap-out rate (pages/sec)

# More detailed I/O including the swap device
iostat -x 1 /dev/sda2   # Monitor the swap partition specifically

# ============================================
# ADVANCED: MEMORY PRESSURE
# ============================================

# Pressure Stall Information (PSI) - Linux 4.20+
cat /proc/pressure/memory
# some avg10=0.05 avg60=0.02 avg300=0.01 total=12345678
# full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567

# "some": At least one task stalled on memory
# "full": All tasks stalled simultaneously
# Higher percentages = worse memory pressure

# ============================================
# SYSTEM CALL TRACING
# ============================================

# Trace major page faults for a specific process
perf record -e major-faults -p <pid> sleep 10
perf report

# BPF-based tracing of page faults
sudo bpftrace -e 'kprobe:handle_mm_fault { @[comm] = count(); }'

Key metrics to monitor:
| Metric | What It Means | Healthy Value | Concerning Value |
|---|---|---|---|
| Swap used | Total swap consumed | Varies | Consistently > 50% of swap |
| si (swap in) | Pages read from swap/sec | 0 | Sustained > 0 |
| so (swap out) | Pages written to swap/sec | 0-100 occasional | Sustained > 100 |
| wa (wait) | CPU time waiting for I/O | < 5% | > 20% |
| PSI some avg10 | % time any task memory-stalled | < 5% | > 25% |
| PSI full avg10 | % time all tasks memory-stalled | 0% | > 5% |
The difference between swap used and swap activity:
Swap usage (how much data is in swap) is different from swap activity (how often swap is accessed).
High usage, low activity: Inactive data was swapped long ago. It sits in swap, rarely touched. This is often acceptable—those pages aren't needed.
Low usage, high activity: A small amount of data is being thrashed between RAM and swap repeatedly. This is disastrous—constant I/O for limited benefit.
High usage, high activity: The system is in serious trouble. Lots of data in swap, and it's all being actively accessed. Thrashing scenario.
Don't alert on swap usage percentage alone. Alert on swap-in rates (si > 0 sustained) because that indicates active performance impact. A server with 80% swap used but si=0 is likely fine; a server with 10% swap used but si=1000 is struggling.
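A minimal check along those lines might look like the following sketch. The thresholds are illustrative, and it assumes vmstat is installed and the kernel exposes PSI (Linux 4.20+).

#!/bin/bash
# Minimal swap-activity check (sketch; thresholds are illustrative).
# Alerts on sustained swap-in activity and PSI memory pressure, not on
# swap usage percentage.

SI_THRESHOLD=100        # sustained pages/sec swapped in considered unhealthy
PSI_SOME_THRESHOLD=25   # percent of time any task is stalled on memory

# Average si over ten 1-second samples (si is column 7 of vmstat output;
# the first since-boot sample is skipped).
si_avg=$(vmstat 1 11 | tail -n 10 | awk '{ sum += $7 } END { printf "%d", sum / NR }')

# PSI "some" avg10 value (Linux 4.20+)
psi_some=$(awk -F'avg10=' '/^some/ { split($2, a, " "); print a[1] }' /proc/pressure/memory)

echo "swap-in avg: ${si_avg} pages/sec, PSI some avg10: ${psi_some}%"

if [ "$si_avg" -gt "$SI_THRESHOLD" ]; then
    echo "ALERT: sustained swap-in activity (si=${si_avg} pages/sec)"
fi

if awk -v v="$psi_some" -v t="$PSI_SOME_THRESHOLD" 'BEGIN { exit !(v > t) }'; then
    echo "ALERT: memory pressure is high (PSI some avg10=${psi_some}%)"
fi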
Thrashing is the most severe swap-related performance problem. It occurs when the combined working sets of running processes exceed available physical memory, causing continuous page faulting that consumes nearly all system resources.
The thrashing dynamic:
Working sets exceed RAM — Processes need more pages in memory than physical frames available.
Page faults become constant — Every time a process runs, it faults on pages that were evicted to make room for other processes.
Eviction accelerates — To handle the current fault, another page is evicted. But that page is also part of some process's working set.
CPU becomes I/O bound — Most CPU time is spent waiting for disk I/O rather than executing useful work.
Positive feedback loop — The slower the system runs, the longer processes keep pages, the more pressure builds, the slower it gets.
Detecting thrashing:
Thrashing manifests through multiple symptoms:
# Classic thrashing signatures:
# 1. CPU wait time dominates
vmstat 1
# id wa
# 5 90 # 90% waiting for I/O, 5% idle = thrashing
# 2. Swap I/O is constant and high
vmstat 1
# si so
# 5000 4000 # Thousands of pages per second = thrashing
# 3. Load average far exceeds CPU count
uptime
# load average: 48.50, 45.23, 40.10 # On a 4-CPU system = thrashing
# 4. PSI shows total stalls
cat /proc/pressure/memory
# full avg10=75.00 # 75% of time ALL tasks are stalled = thrashing
Why thrashing is catastrophic:
In a thrashing state:
To recover from thrashing: (1) Kill memory-heavy processes (if you can access the system), (2) Reduce the number of running processes to fit working sets in RAM, or (3) Add more RAM (long-term). Prevention is far better than cure—monitor memory pressure and shed load before thrashing begins.
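If you still have shell access on a struggling host, a quick triage sketch like the one below (complementing the per-process swap loop shown earlier) helps identify what to stop first; the commands are standard, but review them before running in production.

#!/bin/bash
# Emergency triage sketch for a host under severe memory pressure
# (standard commands only; review before using in production).

# Largest resident-memory consumers: candidates to stop, restart, or move
ps -eo pid,rss,comm --sort=-rss | head -n 11

# Largest swap consumers (kB of VmSwap per process)
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    swap=$(awk '/VmSwap/ { print $2 }' /proc/$pid/status 2>/dev/null)
    if [ -n "$swap" ] && [ "$swap" -gt 0 ]; then
        echo "$swap kB: $pid $(cat /proc/$pid/comm 2>/dev/null)"
    fi
done | sort -rn | head -n 10

# After identifying a culprit, stop it gracefully before resorting to SIGKILL:
# kill <pid>        # SIGTERM first
# kill -9 <pid>     # last resort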
Understanding thrashing requires understanding the working set model, introduced by Peter Denning in 1968. This model describes the set of pages a process actively uses during a time window.
Formal definition:
The working set W(t, Δ) of a process at time t is the set of pages referenced during the time interval (t - Δ, t]. The parameter Δ is the "working set window."
In practice, Δ is chosen to capture the pages actively needed for current execution phase. Typical values correspond to millions of memory references.
The working set principle:
A process should be allowed to run only if its entire working set can be held in memory.
This principle prevents thrashing: if memory can hold all working sets, page faults are rare (only for pages outside the working set). If memory cannot hold all working sets, some processes should be suspended entirely (swapped out) rather than allowed to run and cause thrashing.
Working set size vs. address space size:
A critical insight is that working set size is usually much smaller than address space size:
| Application | Address Space | Typical Working Set | Ratio |
|---|---|---|---|
| Web Browser | 4+ GB | 200-500 MB | 10:1 |
| IDE | 2 GB | 100-300 MB | 10:1 |
| Database Server | 64 GB | 2-8 GB | 10:1 |
| Video Editor | 16 GB | 1-4 GB | 5:1 |
| Scientific Simulation | 256 GB | 10-50 GB | 10:1 |
This explains why virtual memory works: processes don't need all their pages simultaneously. The art is keeping the right pages (the working set) in memory while allowing other pages to be swapped or discarded.
Linux uses the Resident Set Size (RSS) as a working set proxy, tracks accessed bits on page table entries, and uses LRU list positions to approximate page hotness. Windows has explicit working set management with working set trimming when memory is scarce.
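On Linux you can approximate a process's working set yourself using the accessed-bit mechanism just described: clear the referenced bits, wait one working-set window, and read back how much memory was touched. The sketch below assumes root access and a kernel with /proc/<pid>/smaps_rollup; note that clearing referenced bits also resets state the kernel's own page aging relies on, so treat it as a diagnostic tool.

#!/bin/bash
# Approximate a process's working set W(t, delta) on Linux (requires root).
# Clears the hardware "accessed" bits, waits one working-set window, then
# reads back how much of the address space was referenced in that window.

pid=$1
delta=${2:-10}   # working-set window in seconds (default: 10)

echo 1 > /proc/"$pid"/clear_refs    # clear referenced/accessed bits for all mappings
sleep "$delta"

# Pages referenced since the clear, i.e., the working set over the window
grep Referenced /proc/"$pid"/smaps_rollup
# Referenced:        245760 kB     <- example: ~240 MB working set

# Compare with the total resident set size
grep VmRSS /proc/"$pid"/status

Comparing the Referenced figure against VmRSS and VmSize in /proc/<pid>/status gives a concrete feel for the working-set-to-address-space ratios in the table above.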
The choice of swap storage device significantly impacts swap performance. With the storage industry's evolution from HDDs to NVMe SSDs, swap behavior has transformed dramatically.
| Storage Type | 4K Random IOPS | Page Fault Latency | Suitable For |
|---|---|---|---|
| HDD (7200 RPM) | ~100 | ~10ms | Archive only; avoid if possible |
| SATA SSD (consumer) | ~25,000 | ~250μs | Desktop/laptop OK; servers marginal |
| SATA SSD (enterprise) | ~75,000 | ~100μs | Acceptable for light server swap |
| NVMe SSD (consumer) | ~500,000 | ~50μs | Good for most workloads |
| NVMe SSD (enterprise) | ~1,000,000 | ~20μs | Excellent; swap is nearly transparent |
| Intel Optane | ~550,000 (low latency) | ~10μs | Best; approaches RAM latency |
Why NVMe changes the swap equation:
With NVMe SSDs, swap becomes much more viable:
Latency gap narrows — 20μs NVMe vs. 10ms HDD is a 500x improvement. While still slower than RAM, it's not catastrophically so.
Parallelism — NVMe drives support deep queue depths (64+ outstanding I/Os). Multiple page faults can be serviced concurrently.
Sustained throughput — NVMe can sustain 3-7 GB/s, enough to swap in hundreds of thousands of pages per second.
Consistent performance — Unlike HDDs, which slow dramatically under random access, NVMe performs similarly for sequential and random I/O.
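To see where a particular swap device falls in the table above, you can measure its 4K random-read behavior directly. The sketch below uses fio against a scratch file on the same storage rather than the raw swap partition (writing to a live swap partition would corrupt it); the file path, size, and runtime are illustrative.

#!/bin/bash
# Measure 4K random-read latency/IOPS of the storage that backs (or would back) swap.
# Uses fio against a scratch file, not the raw swap partition.

fio --name=swapdev-randread \
    --filename=/var/tmp/fio-testfile \
    --size=1G \
    --rw=randread \
    --bs=4k \
    --direct=1 \
    --iodepth=64 \
    --ioengine=libaio \
    --runtime=30 --time_based \
    --group_reporting

# Key outputs: IOPS and the "clat" percentiles; compare against the table
# above (~100 us per 4K read on NVMe, ~10 ms on HDD).
rm -f /var/tmp/fio-testfile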
SSD wear concerns:
SSD write endurance is finite. Flash cells endure a limited number of write cycles, which manufacturers express as a drive-level TBW (Terabytes Written) rating:
Heavy swap activity can accelerate wear:
Assume: 1GB swap written per hour (moderate pressure)
Daily: 24 GB
Yearly: ~9 TB
Consumer 500GB SSD (300 TBW): ~33 years (fine)
But at 100GB/day sustained: 36 TB/year = 8 years (marginal for consumer)
Enterprise SSDs with higher endurance are recommended for swap-heavy workloads.
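You can ground this estimate in real data: the pswpout counter in /proc/vmstat records how many pages have been swapped out since boot. The sketch below converts that to a daily write rate and compares it against a drive's TBW rating (it assumes 4 KiB pages; the default 300 TBW figure is just the example used above).

#!/bin/bash
# Rough estimate of how fast swap traffic consumes SSD write endurance.
# Assumes 4 KiB pages; pass the drive's rated endurance in TBW as an
# argument (default 300 TBW, matching the example above).

tbw=${1:-300}

awk -v tbw="$tbw" -v up="$(cut -d' ' -f1 /proc/uptime)" '
    /^pswpout/ {
        gb_written  = $2 * 4096 / 1e9              # pages swapped out since boot -> GB
        gb_per_day  = gb_written / (up / 86400)    # average daily swap write volume
        tb_per_year = gb_per_day * 365 / 1000
        printf "Swap writes since boot: %.1f GB (%.1f GB/day average)\n", gb_written, gb_per_day
        if (tb_per_year > 0)
            printf "At this rate: %.2f TB/year -> a %s TBW drive lasts ~%.0f years on swap alone\n",
                   tb_per_year, tbw, tbw / tb_per_year
        else
            print "No swap write-out recorded since boot."
    }' /proc/vmstat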
With fast NVMe storage, some workloads that previously required 'disable swap for performance' can now tolerate modest swap usage. However, for latency-critical applications (databases, trading systems), the advice remains: add RAM and minimize swap usage, regardless of storage speed.
Operating systems provide numerous parameters to tune swap behavior. The right settings depend heavily on workload characteristics.
#!/bin/bash

# ============================================
# SWAPPINESS: Balance between file cache and anonymous memory reclaim
# ============================================

# Range: 0-200 (default: 60)
# Lower values: prefer evicting file cache over anonymous pages
# Higher values: more willing to swap anonymous pages

# Desktop/general purpose:
sysctl -w vm.swappiness=60

# Database server (protect anonymous memory):
sysctl -w vm.swappiness=10

# Strongly discourage swapping (use with abundant RAM):
sysctl -w vm.swappiness=0
# Note: 0 doesn't disable swap; it just strongly avoids it

# ============================================
# VFS CACHE PRESSURE: Willingness to reclaim filesystem caches
# ============================================

# Range: 0-10000 (default: 100)
# Lower: hold onto dentry/inode caches (good for many small files)
# Higher: aggressively reclaim caches

# File server with many files:
sysctl -w vm.vfs_cache_pressure=50

# Memory-constrained system:
sysctl -w vm.vfs_cache_pressure=200

# ============================================
# WATERMARKS: When to start/stop reclaim
# ============================================

# Increase the buffer between watermarks (more proactive reclaim):
# Default: 10 (0.1% of RAM per zone between watermarks)
sysctl -w vm.watermark_scale_factor=150   # 1.5% gap
# This makes kswapd wake earlier, avoiding direct reclaim

# ============================================
# OVERCOMMIT SETTINGS
# ============================================

# 0 = heuristic overcommit (default)
# 1 = always overcommit (dangerous)
# 2 = strict (commit limit = swap + RAM * overcommit_ratio/100)
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80   # Commit limit = swap + 80% of RAM

# ============================================
# PAGE CLUSTERING (Read-ahead)
# ============================================

# Default: 3 (2^3 = 8 pages read-ahead)
# Higher: better for sequential swap access
# Lower: better for random swap access

# For sequential workloads:
sysctl -w vm.page-cluster=4   # 16 pages

# For random access workloads:
sysctl -w vm.page-cluster=0   # 1 page (disable read-ahead)

# ============================================
# MAKE PERSISTENT
# ============================================

# Add to /etc/sysctl.conf for persistence:
cat >> /etc/sysctl.conf << EOF
vm.swappiness=10
vm.vfs_cache_pressure=50
vm.watermark_scale_factor=150
vm.overcommit_memory=2
vm.overcommit_ratio=80
EOF

sysctl -p   # Apply changes

Tuning can help at the margins, but if workloads consistently require more memory than available, no amount of tuning will compensate. The ultimate fix for persistent memory pressure is adding RAM or reducing workload.
Modern systems use memory compression techniques to mitigate swap performance penalties. By compressing pages before writing to disk (or instead of writing to disk), these techniques reduce I/O and can significantly improve responsiveness.
#!/bin/bash

# ============================================
# ZSWAP SETUP (Compressed swap cache)
# ============================================

# Enable zswap at boot: add to the kernel command line
# zswap.enabled=1 zswap.compressor=lz4 zswap.max_pool_percent=25

# Or enable at runtime:
echo 1 > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 25 > /sys/module/zswap/parameters/max_pool_percent

# Check status:
grep -r . /sys/kernel/debug/zswap/

# ============================================
# ZRAM SETUP (RAM-based compressed swap)
# ============================================

# Load the module
modprobe zram num_devices=1

# Set the compression algorithm first (lz4 is fast);
# it cannot be changed after the disk size is set
echo lz4 > /sys/block/zram0/comp_algorithm

# Set size (e.g., 4GB of uncompressed capacity)
echo 4G > /sys/block/zram0/disksize

# Initialize as swap
mkswap /dev/zram0

# Enable with high priority (used before disk swap)
swapon -p 100 /dev/zram0

# Verify
swapon --show
# NAME       TYPE      SIZE USED PRIO
# /dev/zram0 partition 4G   256M 100
# /dev/sda2  partition 8G   0M   -2

# Monitor compression stats
cat /sys/block/zram0/mm_stat
# orig_data_size compr_data_size mem_used_total ...
# 536870912      234567890       256789012      ...
# Compression ratio: orig/compr ≈ 2.3x

Performance impact of compression:
| Scenario | Without Compression | With zswap | With zram |
|---|---|---|---|
| Swap read latency | 100-250μs (NVMe/SATA SSD) | 10-50μs (decompress) | 2-10μs (decompress) |
| Swap write latency | 50-100μs (NVMe) | 5-20μs (compress) | 2-10μs (compress) |
| Effective swap capacity | 1x | 1.5-3x (pool acts as buffer) | 2-3x |
| Disk I/O reduction | 0% | 50-70% (many pages never hit disk) | 100% (no disk involved) |
| CPU overhead | None | Low-moderate | Low-moderate |
When to use each:
zswap fits systems that already have disk-backed swap: it acts as a compressed cache in front of the swap device, cutting swap I/O substantially while still spilling to disk when the pool fills. zram fits systems where you want to avoid disk swap entirely (no swap device, very slow storage, or SSD-wear concerns): compressed pages stay purely in RAM.
In typical workloads, anonymous pages (heap, stack) compress well—often 2-3x. On a 16GB system, an 8GB zram device holds its 8GB of swapped pages in roughly 3-4GB of RAM, yielding around 20GB of effective capacity. This transforms marginal memory situations into comfortable ones, often eliminating visible swap impact entirely.
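To check what compression is actually achieving on a given machine, the zram statistics shown earlier can be turned into a ratio directly. The sketch below assumes /dev/zram0 is configured as in the setup above and reports the net RAM gained.

#!/bin/bash
# Quick check of what zram compression is buying you (assumes /dev/zram0 is
# configured as in the setup above).
# mm_stat fields: orig_data_size compr_data_size mem_used_total ...

awk '{
    orig = $1 / 2^20; compr = $2 / 2^20; used = $3 / 2^20   # bytes -> MiB
    if (compr == 0) { print "zram device is empty"; exit }
    printf "Stored %.0f MiB of pages in %.0f MiB of RAM (compression ratio %.1fx)\n",
           orig, used, orig / compr
    printf "Net RAM gained: ~%.0f MiB\n", orig - used
}' /sys/block/zram0/mm_stat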
Swap performance is a critical aspect of system behavior that can mean the difference between a responsive system and one that grinds to a halt. Understanding the metrics, recognizing pathological patterns, and applying appropriate tuning transforms swap from a mysterious performance killer into a manageable aspect of system administration.
Module complete:
You have now completed the comprehensive study of swapping in operating systems. From the fundamental concept of swap space through the mechanics of swap operations, the evolution from standard swapping to paged memory management, and finally the critical performance considerations—you are equipped to understand, diagnose, and optimize memory management in production systems.
Congratulations! You have mastered the swapping module: swap space organization, swap in/out mechanics, standard vs. paged swapping, and performance optimization. You now understand how operating systems extend physical memory to disk and the profound performance implications of this fundamental technique.