Standard swapping, as we've seen, treats processes as monolithic units: entirely in memory or entirely on disk. Paging, by contrast, divides memory into fixed-size pages that can be individually managed. When we combine these concepts, we get page-level swapping—the technique used by virtually all modern operating systems.
In page-level swapping (often called demand paging when combined with lazy loading), individual pages can reside in memory or on swap space. A process runs with some pages in RAM and others on disk, loading pages on demand as they're accessed. This approach dramatically improves memory utilization, reduces swap I/O, and enables optimizations like copy-on-write and memory-mapped files.
This page explores how swapping and paging integrate, the mechanisms that make page-level swapping work, and the sophisticated optimizations that modern systems employ.
By the end of this page, you will understand how paging transforms swapping from a coarse process-level operation into fine-grained page management. You'll learn about demand paging, copy-on-write, lazy allocation, anonymous vs. file-backed pages, and the sophisticated heuristics modern kernels use to decide what to keep in memory.
Demand paging is the cornerstone of modern memory management. Instead of loading an entire process into memory at start-up, pages are loaded on demand—only when first accessed. This lazy approach has profound implications for system efficiency.
How demand paging changes program startup:
Consider starting a large application like a web browser:
Without demand paging (standard swapping era): the loader reads the entire executable, its libraries, and its initial data into memory before the first instruction runs. Startup time scales with total program size, and rarely-used features occupy RAM from the moment the program starts.
With demand paging: the kernel merely sets up page tables and begins execution. Code and data pages fault in as they're first touched, so startup I/O covers only the pages the program actually needs.
The difference is transformative. Most programs never execute all their code paths in a typical session—demand paging ensures unused code never consumes memory or I/O.
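You can observe demand faulting from user space. The sketch below (Linux-assumed, purely illustrative) maps a large anonymous region and counts minor page faults as it touches each page; the fault counter comes from getrusage(2).

```c
/* Demonstrates demand faulting of anonymous memory: frames are
 * allocated only as pages are touched. Assumes Linux/glibc. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;   /* faults served without disk I/O */
}

int main(void) {
    const size_t len = 64 * 1024 * 1024;       /* 64 MB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    long before = minor_faults();
    /* The mapping exists, but no frames are wired in yet. */

    for (size_t i = 0; i < len; i += 4096)     /* touch one byte per page */
        p[i] = 1;

    long after = minor_faults();
    printf("minor faults while touching %zu pages: %ld\n",
           len / 4096, after - before);        /* roughly one per page */
    return 0;
}
```

Each touched page costs roughly one minor fault as the kernel wires in a frame on demand; the untouched remainder never consumes memory or I/O.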
A critical distinction in page-based swapping is between anonymous pages and file-backed pages. Anonymous pages (heap, stack, and private writable data) have no file behind them; swap space is their only possible backing store. File-backed pages (program code, shared libraries, memory-mapped files) are copies of on-disk file contents. This classification determines how pages are handled during memory reclaim.
Why this distinction matters for swapping:
Consider the memory reclaim algorithm's decisions:
| Page Type | Clean or Dirty | Reclaim Action | I/O Cost |
|---|---|---|---|
| File-backed | Clean | Discard immediately | None |
| File-backed | Dirty | Write to file, then discard | File write |
| Anonymous | Clean (in swap) | Discard (swap copy valid) | None |
| Anonymous | Dirty | Write to swap | Swap write |
File-backed clean pages are the cheapest to reclaim—no I/O needed at all. This is why systems often favor caching executables and libraries: they can be evicted instantly if memory is needed.
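The table's decision logic is simple enough to capture in code. Below is a hedged C sketch; the page_t struct and reclaim_action helper are invented for illustration and are not kernel API.

```c
#include <stdio.h>

enum backing { FILE_BACKED, ANONYMOUS };

typedef struct {
    enum backing backing;
    int dirty;    /* modified since last write-back? */
    int in_swap;  /* anonymous page with a still-valid swap copy? */
} page_t;

/* The table's decision logic: what reclaim does with each page type. */
static const char *reclaim_action(const page_t *p) {
    if (p->backing == FILE_BACKED)
        return p->dirty ? "write to file, then discard"
                        : "discard immediately (file copy is valid)";
    if (!p->dirty && p->in_swap)
        return "discard (swap copy is valid)";
    return "write to swap, then discard";
}

int main(void) {
    page_t clean_file = { FILE_BACKED, 0, 0 };
    page_t dirty_anon = { ANONYMOUS,  1, 0 };
    printf("clean file page: %s\n", reclaim_action(&clean_file));
    printf("dirty anon page: %s\n", reclaim_action(&dirty_anon));
    return 0;
}
```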
Linux's approach:
Linux maintains separate LRU (Least Recently Used) lists for anonymous and file-backed pages. The vm.swappiness parameter (0-200 since Linux 5.8; 0-100 before) controls the balance: higher values make the kernel more willing to swap anonymous pages, while lower values make it prefer evicting file-backed cache.
```bash
# View current swappiness
cat /proc/sys/vm/swappiness

# Reduce swappiness for database workloads (prefer file cache eviction)
echo 10 > /proc/sys/vm/swappiness
```
Databases like PostgreSQL and MySQL manage their own buffer pools and caches. They prefer low swappiness because: (1) database caches are more valuable than kernel file cache for their workload, (2) unpredictable swap latency causes query timeouts, and (3) they'd rather the kernel evict file-backed pages (which they'll re-read efficiently) than swap their carefully-managed anonymous memory.
Copy-on-Write (CoW) is a critical optimization enabled by paged memory management. It allows processes to share physical pages until one process writes, at which point a private copy is made. This technique deeply integrates with swap management.
How Copy-on-Write works:
Fork creates shared mappings — When a process forks, the child doesn't copy parent memory immediately. Instead, both processes' page tables point to the same physical frames.
Pages marked read-only — Even for originally writable memory, CoW pages are marked read-only in both page tables.
Write triggers fault — When either process tries to write, a protection fault occurs.
Fault handler makes copy — The kernel allocates a new frame, copies the original content, and maps the new frame writable for the writer. If only one sharer remains on the original frame, it can be re-marked writable too (see the sketch after this list).
Original frame may have other sharers — Reference counting tracks whether a frame is still shared or can be reclaimed.
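To make steps 3-5 concrete, here is a toy user-space simulation of the CoW write-fault path. The frame_t, pte_t, and cow_write_fault names are illustrative; a real kernel splits this logic across page-table and reverse-mapping layers.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    int refcount;                   /* how many PTEs map this frame */
    unsigned char data[PAGE_SIZE];
} frame_t;

typedef struct {
    frame_t *frame;
    int writable;                   /* cleared while the frame is CoW-shared */
} pte_t;

/* Step 4: on a write fault, copy only if the frame is still shared. */
void cow_write_fault(pte_t *pte) {
    frame_t *old = pte->frame;
    if (old->refcount == 1) {
        pte->writable = 1;          /* sole owner: just re-enable writes */
        return;
    }
    frame_t *copy = malloc(sizeof *copy);
    memcpy(copy->data, old->data, PAGE_SIZE);
    copy->refcount = 1;
    old->refcount--;                /* step 5: remaining sharers keep old */
    pte->frame = copy;
    pte->writable = 1;
}

int main(void) {
    frame_t *shared = calloc(1, sizeof *shared);
    shared->refcount = 2;                      /* "fork": two sharers */
    pte_t a = { shared, 0 }, b = { shared, 0 };

    cow_write_fault(&a);   /* A writes: gets a private copy, refcount 2 -> 1 */
    cow_write_fault(&b);   /* B is now the sole owner: no copy needed */

    free(a.frame);
    free(shared);
    return 0;
}
```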
CoW and swap interaction:
Copy-on-write pages present special challenges for swap management:
Shared swap slots — when a CoW-shared page is swapped out, every sharer's page table entry points to the same swap slot, and the slot carries a reference count. Consider this sequence:
```
Page P is CoW-shared by processes A and B
Memory pressure: P is swapped out to slot S
  - A's PTE: swap entry pointing to slot S
  - B's PTE: swap entry pointing to slot S
  - Slot S reference count: 2

Process A accesses the page:
  - Page fault, swap-in from slot S
  - New frame F allocated, content loaded
  - A's PTE: points to F, marked read-only (still CoW-shared)
  - B's PTE: still a swap entry for slot S
  - Slot S reference count: 1 (only B references it now)
  - Frame F stays linked to slot S via the swap cache while both copies are valid

Process A writes to the page:
  - CoW fault: the first write after swap-in triggers a copy
  - A gets an exclusive, writable frame

Process B accesses the page:
  - Swap-in from slot S
  - B gets a frame with the original content
  - Slot S freed (reference count drops to 0)
```
This intricate dance ensures that CoW semantics are preserved even when pages transit through swap.
Without CoW, fork() would need to copy gigabytes of process memory instantly—making it impractically slow for large processes. CoW makes fork() nearly instantaneous, copying only the page tables (KB) rather than memory contents (GB). Many forked processes immediately exec() a new program, never needing most parent pages.
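A small runnable demonstration of these semantics: after fork(), parent and child share pages copy-on-write, so the child's writes never disturb the parent's data.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* 16 MB buffer, shared copy-on-write with the child after fork(). */
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);
    memset(buf, 'P', len);

    pid_t pid = fork();           /* near-instant: only page tables copied */
    if (pid == 0) {
        memset(buf, 'C', len);    /* write faults: child gets private copies */
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* CoW preserved isolation: the parent's data is unchanged. */
    printf("parent still sees: %c\n", buf[0]);   /* prints 'P' */
    free(buf);
    return 0;
}
```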
Beyond loading pages from files or swap on demand, modern systems apply laziness to initial memory allocation as well. When a process allocates memory (e.g., via malloc() for large allocations or mmap()), the kernel typically doesn't allocate physical frames immediately. Instead, it uses zero-fill-on-demand (ZFOD).
```c
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Before allocation: check /proc/%d/status\n", getpid());
    sleep(5);

    // Allocate 1GB - no physical memory used yet!
    char *ptr = malloc(1024 * 1024 * 1024);
    printf("After malloc: still no physical memory committed\n");
    sleep(5);

    // Touch first page - now 1 page (4KB) is allocated
    ptr[0] = 'A';
    printf("After touch: only 1 page allocated\n");
    sleep(5);

    // Touch every page - now all 1GB is allocated
    for (long i = 0; i < 1024 * 1024 * 1024; i += 4096) {
        ptr[i] = 'B';
    }
    printf("After full touch: all pages allocated\n");
    sleep(5);

    return 0;
}

/*
 * Monitor with: watch -n 1 'cat /proc/<pid>/status | grep -E "VmRSS|VmSize"'
 *
 * VmSize: Virtual memory size (includes uncommitted)
 * VmRSS: Resident Set Size (actually in RAM)
 *
 * You'll see VmSize jump at malloc(), but VmRSS grows only as pages are touched.
 */
```

The zero page optimization:
Linux and other systems maintain a special zero page—a single physical frame filled with zeros, mapped read-only. When a process reads from never-written anonymous memory, it gets the zero page; a private frame is allocated only when the process first writes, via another CoW-style fault.
This creates an interesting swap scenario: pages that have never been written don't need swap space at all. They're regenerated as zero pages when needed.
Swap and lazy allocation:
When memory pressure hits:
Never-touched pages — No swap needed; they're just zero-fill-on-demand. The page table entry is simply cleared.
Touched but unchanged pages — If a page was touched but remains all zeros, some systems detect this (page content checks) and treat it as never-touched.
Modified anonymous pages — These must go to swap; their content can't be regenerated.
Lazy allocation enables overcommit: a system with 16GB RAM might allocate 64GB to processes. If all processes try to use their allocations simultaneously, the OOM killer strikes. Linux's /proc/sys/vm/overcommit_memory controls this: 0 = heuristic overcommit, 1 = always overcommit, 2 = strict (don't overcommit beyond swap + percentage of RAM).
With page-based swapping, the system must decide which pages to evict when memory is needed. This is the page replacement problem, and the choice of algorithm significantly impacts performance.
The optimal algorithm (Bélády's MIN):
The theoretically optimal page replacement algorithm evicts the page that will be accessed furthest in the future. Unfortunately, this requires knowing future memory accesses—impossible in practice. However, it serves as a benchmark against which other algorithms are measured.
LRU (Least Recently Used):
The most intuitive practical approximation of optimal is LRU: evict the page that was accessed longest ago. The intuition is that recent access patterns predict future access: pages not used recently are unlikely to be used soon.
True LRU requires updating a timestamp on every memory access—too expensive. Real systems use approximations.
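The classic approximation is the clock (second-chance) algorithm, which needs only a per-frame referenced bit that hardware sets on access. A minimal sketch, with illustrative names:

```c
#include <stdio.h>
#include <stddef.h>

typedef struct {
    int page;        /* which page occupies this frame */
    int referenced;  /* set by the MMU on access, cleared by the sweep */
} frame_t;

/* Sweep the clock hand; evict the first frame not referenced since
 * the previous sweep, giving referenced frames a second chance. */
static size_t clock_evict(frame_t *frames, size_t n, size_t *hand) {
    for (;;) {
        size_t cur = *hand;
        *hand = (*hand + 1) % n;     /* advance the hand */
        if (!frames[cur].referenced)
            return cur;              /* victim found */
        frames[cur].referenced = 0;  /* second chance */
    }
}

int main(void) {
    frame_t frames[4] = { {10, 1}, {11, 0}, {12, 1}, {13, 1} };
    size_t hand = 0;
    size_t victim = clock_evict(frames, 4, &hand);
    /* Frame 0 was referenced and got a second chance; frame 1 is evicted. */
    printf("evict frame %zu (page %d)\n", victim, frames[victim].page);
    return 0;
}
```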
Linux's multi-generational LRU (MGLRU):
Introduced in Linux 6.1, MGLRU improves upon the traditional active/inactive lists by using multiple generations to more accurately track page age:
Pages age through generations over time. Access resets a page to the youngest generation. This provides finer-grained age tracking than binary active/inactive, better approximating true LRU.
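A toy model of the idea (not the actual MGLRU implementation, which tracks generations in per-node, per-cgroup structures):

```c
/* Illustrative only: pages live in one of NGEN generations. Access
 * moves a page to the youngest; aging ticks push it older; reclaim
 * scans the oldest generation first. */
#define NGEN 4

typedef struct { int gen; } mg_page_t;

void on_access(mg_page_t *p)     { p->gen = 0; }               /* youngest */
void on_aging_tick(mg_page_t *p) { if (p->gen < NGEN - 1) p->gen++; }
int  eviction_priority(const mg_page_t *p) { return p->gen; }  /* oldest first */
```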
Working set considerations:
The ideal behavior is to keep each process's working set (pages actively in use) in memory. When memory can hold all working sets, page faults are rare. When working sets exceed memory, thrashing occurs: constant page faults as processes fight for frames.
Linux's page reclaim considers working set size when deciding how aggressively to reclaim from a process. Processes with rapidly shrinking access patterns are good eviction targets; those with stable, large working sets are protected.
Modern kernels track 'refaults'—when a recently-evicted page is immediately accessed again. High refault rates indicate the working set doesn't fit in available memory. Linux uses this signal to adjust the balance between file and anonymous page eviction, and to trigger OOM earlier when thrashing is detected.
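A simplified sketch of refault detection, loosely modeled on the kernel's shadow-entry idea: when a page is evicted, a timestamp (an eviction counter) is left behind in its place, and on refault the gap estimates how much more memory would have been needed. Names and details below are illustrative.

```c
#include <stdbool.h>

/* A global eviction counter stands in for the kernel's per-node
   "eviction clock". */
static unsigned long evictions;

/* On eviction, remember the current counter value as a shadow entry
   stored where the page used to be. */
unsigned long note_eviction(void) {
    return ++evictions;
}

/* On refault, the number of evictions since the shadow was written
   approximates the refault distance. If it's smaller than the LRU
   size, the page would have survived with slightly more memory:
   the working set doesn't fit. */
bool is_thrashing_refault(unsigned long shadow, unsigned long lru_size) {
    return (evictions - shadow) < lru_size;
}
```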
Memory-mapped files (mmap()) create a direct mapping between a file and virtual address space. This technique leverages paged memory management and interacts with swap in specific ways.
```c
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    // Open a file
    int fd = open("data.bin", O_RDWR);

    // Map it into memory
    char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return 1;

    // Now addr[0..4095] accesses file content directly
    printf("First byte: %c\n", addr[0]);

    // Writes go to the file (via page cache)
    addr[0] = 'X';

    // Unmap when done
    munmap(addr, 4096);
    close(fd);
    return 0;
}
```

How mmap interacts with swap:
| mmap Type | Dirty Behavior | Swap Usage | Eviction |
|---|---|---|---|
| MAP_SHARED (file) | Writes go to file | No swap | Write to file, discard |
| MAP_PRIVATE (file) | Writes create CoW copy | Swap for modified pages | Clean: discard. Dirty: swap |
| MAP_ANONYMOUS | No file backing | Always swap | Write to swap |
| MAP_ANONYMOUS + MAP_SHARED | Shared anonymous (IPC) | Swap | Shared swap slots |
Key insights:
Shared file mappings never use swap — They write back to the file. The file is the backing store.
Private file mappings have split behavior — Unmodified pages can be discarded and re-read from the file. Modified (CoW) pages have become anonymous and must use swap (demonstrated in the sketch after this list).
Anonymous mappings always need swap — There's no file backing. Conceptually they behave like a private temporary file, but without filesystem overhead.
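The sketch below demonstrates the MAP_PRIVATE split in practice: a write to a private file mapping takes a CoW fault and becomes anonymous, so it never reaches the file. It reuses the hypothetical data.bin file (at least one page long) from the earlier example.

```c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) return 1;

    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED) return 1;

    char original = priv[0];
    priv[0] = '!';               /* CoW fault: this page is now anonymous */

    /* Re-read the file: the write above is invisible to it. */
    char on_disk;
    pread(fd, &on_disk, 1, 0);
    printf("mapping sees '%c', file still has '%c' (was '%c')\n",
           priv[0], on_disk, original);

    munmap(priv, 4096);
    close(fd);
    return 0;
}
```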
The page cache connection:
File-backed memory mappings go through the page cache—the kernel's cache of file contents. The same physical frames back both mmap'd regions and regular read()/write() access to files. This means the two access paths stay coherent: a write through a shared mapping is visible to subsequent read() calls (and vice versa) without extra copies, and evicting a cached file page also removes it from every mapping.
Many databases (SQLite, MongoDB, LMDB) use mmap for storage. Benefits: automatic caching via page cache, zero-copy access, kernel handles paging/swapping, and persistence via file backing. Drawbacks: less control over eviction, potential swap interference, and I/O error handling complexity.
Modern kernels employ sophisticated page reclaim subsystems that orchestrate swapping and cache eviction. Understanding these internals helps diagnose memory pressure issues and tune system behavior.
Linux memory reclaim overview:
Linux's memory reclaim operates at multiple levels:
```
Allocation request
        ↓
Free pages above watermark? ──Yes──→ Satisfy from free list
        ↓ No
Wake kswapd (asynchronous background reclaim)
        ↓
Allocation still failing?
        ↓ Yes
Direct reclaim (synchronous, in the allocating process)
        ↓
Scan LRU lists
        ↓
Evict/swap pages
        ↓
Still short? Raise scan priority and try harder
        ↓
OOM killer (last resort)
```
Key components:
kswapd — Per-NUMA-node kernel thread that wakes when free memory falls below watermarks. It runs in the background, reclaiming pages to maintain a buffer of free memory.
Direct reclaim — When allocation fails despite kswapd's efforts, the allocating process itself performs synchronous reclaim. This adds latency to the allocation.
Watermarks — Three per-zone thresholds control behavior: high (kswapd stops reclaiming once free memory rises above it), low (falling below it wakes kswapd), and min (falling below it forces allocators into direct reclaim; only the most critical allocations may dip into the remaining reserve). See the sketch after this list.
Priorities — Reclaim scans LRU lists with increasing intensity. Early scans are gentle; later scans become more aggressive as desperation increases.
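The watermark relationships reduce to a few comparisons. A hedged sketch, with illustrative names:

```c
/* Illustrative only: per-zone watermark checks as described above. */
typedef struct {
    long free, min, low, high;   /* page counts */
} zone_t;

int should_wake_kswapd(const zone_t *z)   { return z->free < z->low;  }
int kswapd_done(const zone_t *z)          { return z->free >= z->high; }
int needs_direct_reclaim(const zone_t *z) { return z->free < z->min;  }
```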
```bash
#!/bin/bash
# Memory reclaim inspection and tuning

# View watermarks per zone
cat /proc/zoneinfo | grep -E "Node|min|low|high|free"

# View kswapd activity
vmstat 1  # Watch si/so (swap in/out) and bi/bo (block I/O)

# Check memory pressure indicators
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# Higher values indicate stalled processes

# Tuning parameters
# vm.swappiness - balance anonymous vs file reclaim
sysctl vm.swappiness

# vm.vfs_cache_pressure - willingness to reclaim dentries/inodes
sysctl vm.vfs_cache_pressure

# vm.watermark_scale_factor - distance between watermarks
sysctl vm.watermark_scale_factor

# vm.dirty_ratio / dirty_background_ratio - when to writeback dirty pages
sysctl vm.dirty_ratio
sysctl vm.dirty_background_ratio

# Per-cgroup memory.swap limits (cgroups v2)
cat /sys/fs/cgroup/my_group/memory.swap.max
```

When kswapd can't keep up and direct reclaim kicks in, latency spikes occur. A process that expected a quick malloc() may stall for hundreds of milliseconds while reclaiming pages. This is why servers often pre-allocate memory and monitor free memory carefully—to avoid direct reclaim entirely.
Page-based swapping transforms memory management from coarse process-level operations to fine-grained page management. This approach enables efficient memory utilization, sophisticated sharing, and graceful degradation under pressure.
What's next:
We've explored how swapping integrates with paging. The final topic—performance considerations—examines the practical impact of swapping on system performance: measuring swap activity, identifying pathological patterns, and tuning systems to balance memory utilization against responsiveness.
You now understand how paging transforms swapping: demand paging, anonymous vs. file-backed pages, CoW integration, lazy allocation, page replacement, and kernel reclaim internals. Next, we'll focus on the performance implications of these mechanisms.