Standard swapping, as we've seen, treats processes as monolithic units: entirely in memory or entirely on disk. Paging, by contrast, divides memory into fixed-size pages that can be individually managed. When we combine these concepts, we get page-level swapping—the technique used by virtually all modern operating systems.
In page-level swapping (often called demand paging when combined with lazy loading), individual pages can reside in memory or on swap space. A process runs with some pages in RAM and others on disk, loading pages on demand as they're accessed. This approach dramatically improves memory utilization, reduces swap I/O, and enables optimizations like copy-on-write and memory-mapped files.
This page explores how swapping and paging integrate, the mechanisms that make page-level swapping work, and the sophisticated optimizations that modern systems employ.
By the end of this page, you will understand how paging transforms swapping from a coarse process-level operation into fine-grained page management. You'll learn about demand paging, copy-on-write, lazy allocation, anonymous vs. file-backed pages, and the sophisticated heuristics modern kernels use to decide what to keep in memory.
Demand paging is the cornerstone of modern memory management. Instead of loading an entire process into memory at start-up, pages are loaded on demand—only when first accessed. This lazy approach has profound implications for system efficiency.
How demand paging changes program startup:
Consider starting a large application like a web browser:
Without demand paging (standard swapping era): the loader reads the entire executable, its libraries, and its initial data into memory before the first instruction runs. Startup time scales with total program size, and rarely-used features occupy RAM from the moment the program starts.
With demand paging: the kernel merely sets up page tables and begins execution. Code and data pages fault in as they're first touched, so startup I/O covers only the pages the program actually needs.
The difference is transformative. Most programs never execute all their code paths in a typical session—demand paging ensures unused code never consumes memory or I/O.
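You can observe demand faulting from user space. The sketch below (Linux-assumed, purely illustrative) maps a large anonymous region and counts minor page faults as it touches each page; the fault counter comes from getrusage(2).

```c
/* Demonstrates demand faulting of anonymous memory: frames are
 * allocated only as pages are touched. Assumes Linux/glibc. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;   /* faults served without disk I/O */
}

int main(void) {
    const size_t len = 64 * 1024 * 1024;       /* 64 MB */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return 1;

    long before = minor_faults();
    /* The mapping exists, but no frames are wired in yet. */

    for (size_t i = 0; i < len; i += 4096)     /* touch one byte per page */
        p[i] = 1;

    long after = minor_faults();
    printf("minor faults while touching %zu pages: %ld\n",
           len / 4096, after - before);        /* roughly one per page */
    return 0;
}
```

Each touched page costs roughly one minor fault as the kernel wires in a frame on demand; the untouched remainder never consumes memory or I/O.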
A critical distinction in page-based swapping is between anonymous pages and file-backed pages. Anonymous pages (heap, stack, and private writable data) have no file behind them; swap space is their only possible backing store. File-backed pages (program code, shared libraries, memory-mapped files) are copies of on-disk file contents. This classification determines how pages are handled during memory reclaim.
Why this distinction matters for swapping:
Consider the memory reclaim algorithm's decisions:
| Page Type | Clean or Dirty | Reclaim Action | I/O Cost |
|---|---|---|---|
| File-backed | Clean | Discard immediately | None |
| File-backed | Dirty | Write to file, then discard | File write |
| Anonymous | Clean (in swap) | Discard (swap copy valid) | None |
| Anonymous | Dirty | Write to swap | Swap write |
File-backed clean pages are the cheapest to reclaim—no I/O needed at all. This is why systems often favor caching executables and libraries: they can be evicted instantly if memory is needed.
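The table's decision logic is simple enough to capture in code. Below is a hedged C sketch; the page_t struct and reclaim_action helper are invented for illustration and are not kernel API.

```c
#include <stdio.h>

enum backing { FILE_BACKED, ANONYMOUS };

typedef struct {
    enum backing backing;
    int dirty;    /* modified since last write-back? */
    int in_swap;  /* anonymous page with a still-valid swap copy? */
} page_t;

/* The table's decision logic: what reclaim does with each page type. */
static const char *reclaim_action(const page_t *p) {
    if (p->backing == FILE_BACKED)
        return p->dirty ? "write to file, then discard"
                        : "discard immediately (file copy is valid)";
    if (!p->dirty && p->in_swap)
        return "discard (swap copy is valid)";
    return "write to swap, then discard";
}

int main(void) {
    page_t clean_file = { FILE_BACKED, 0, 0 };
    page_t dirty_anon = { ANONYMOUS,  1, 0 };
    printf("clean file page: %s\n", reclaim_action(&clean_file));
    printf("dirty anon page: %s\n", reclaim_action(&dirty_anon));
    return 0;
}
```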
Linux's approach:
Linux maintains separate LRU (Least Recently Used) lists for anonymous and file-backed pages. The vm.swappiness parameter (0-200 since Linux 5.8; 0-100 before) controls the balance: higher values make the kernel more willing to swap anonymous pages, while lower values make it prefer evicting file-backed cache.
```bash
# View current swappiness
cat /proc/sys/vm/swappiness

# Reduce swappiness for database workloads (prefer file cache eviction)
echo 10 > /proc/sys/vm/swappiness
```
Databases like PostgreSQL and MySQL manage their own buffer pools and caches. They prefer low swappiness because: (1) database caches are more valuable than kernel file cache for their workload, (2) unpredictable swap latency causes query timeouts, and (3) they'd rather the kernel evict file-backed pages (which they'll re-read efficiently) than swap their carefully-managed anonymous memory.
Copy-on-Write (CoW) is a critical optimization enabled by paged memory management. It allows processes to share physical pages until one process writes, at which point a private copy is made. This technique deeply integrates with swap management.
How Copy-on-Write works:
Fork creates shared mappings — When a process forks, the child doesn't copy parent memory immediately. Instead, both processes' page tables point to the same physical frames.
Pages marked read-only — Even for originally writable memory, CoW pages are marked read-only in both page tables.
Write triggers fault — When either process tries to write, a protection fault occurs.
Fault handler makes copy — The kernel allocates a new frame, copies the original content, and maps the new frame writable for the writer. If only one sharer remains on the original frame, it can be re-marked writable too (see the sketch after this list).
Original frame may have other sharers — Reference counting tracks whether a frame is still shared or can be reclaimed.
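To make steps 3-5 concrete, here is a toy user-space simulation of the CoW write-fault path. The frame_t, pte_t, and cow_write_fault names are illustrative; a real kernel splits this logic across page-table and reverse-mapping layers.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    int refcount;                   /* how many PTEs map this frame */
    unsigned char data[PAGE_SIZE];
} frame_t;

typedef struct {
    frame_t *frame;
    int writable;                   /* cleared while the frame is CoW-shared */
} pte_t;

/* Step 4: on a write fault, copy only if the frame is still shared. */
void cow_write_fault(pte_t *pte) {
    frame_t *old = pte->frame;
    if (old->refcount == 1) {
        pte->writable = 1;          /* sole owner: just re-enable writes */
        return;
    }
    frame_t *copy = malloc(sizeof *copy);
    memcpy(copy->data, old->data, PAGE_SIZE);
    copy->refcount = 1;
    old->refcount--;                /* step 5: remaining sharers keep old */
    pte->frame = copy;
    pte->writable = 1;
}

int main(void) {
    frame_t *shared = calloc(1, sizeof *shared);
    shared->refcount = 2;                      /* "fork": two sharers */
    pte_t a = { shared, 0 }, b = { shared, 0 };

    cow_write_fault(&a);   /* A writes: gets a private copy, refcount 2 -> 1 */
    cow_write_fault(&b);   /* B is now the sole owner: no copy needed */

    free(a.frame);
    free(shared);
    return 0;
}
```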
CoW and swap interaction:
Copy-on-write pages present special challenges for swap management:
Shared swap slots — when a CoW-shared page is swapped out, every sharer's page table entry points to the same swap slot, and the slot carries a reference count. Consider this sequence:
```
Page P is CoW-shared by processes A and B
Memory pressure: P is swapped out to slot S
  - A's PTE: swap entry pointing to slot S
  - B's PTE: swap entry pointing to slot S
  - Slot S reference count: 2

Process A accesses the page:
  - Page fault, swap-in from slot S
  - New frame F allocated, content loaded
  - A's PTE: points to F, marked read-only (still CoW-shared)
  - B's PTE: still a swap entry for slot S
  - Slot S reference count: 1 (only B references it now)
  - Frame F stays linked to slot S via the swap cache while both copies are valid

Process A writes to the page:
  - CoW fault: the first write after swap-in triggers a copy
  - A gets an exclusive, writable frame

Process B accesses the page:
  - Swap-in from slot S
  - B gets a frame with the original content
  - Slot S freed (reference count drops to 0)
```
This intricate dance ensures that CoW semantics are preserved even when pages transit through swap.
Without CoW, fork() would need to copy gigabytes of process memory instantly—making it impractically slow for large processes. CoW makes fork() nearly instantaneous, copying only the page tables (KB) rather than memory contents (GB). Many forked processes immediately exec() a new program, never needing most parent pages.
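A small runnable demonstration of these semantics: after fork(), parent and child share pages copy-on-write, so the child's writes never disturb the parent's data.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* 16 MB buffer, shared copy-on-write with the child after fork(). */
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);
    memset(buf, 'P', len);

    pid_t pid = fork();           /* near-instant: only page tables copied */
    if (pid == 0) {
        memset(buf, 'C', len);    /* write faults: child gets private copies */
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    /* CoW preserved isolation: the parent's data is unchanged. */
    printf("parent still sees: %c\n", buf[0]);   /* prints 'P' */
    free(buf);
    return 0;
}
```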
Beyond loading pages from files or swap on demand, modern systems apply laziness to initial memory allocation as well. When a process allocates memory (e.g., via malloc() for large allocations or mmap()), the kernel typically doesn't allocate physical frames immediately. Instead, it uses zero-fill-on-demand (ZFOD).
```c
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("Before allocation: check /proc/%d/status\n", getpid());
    sleep(5);

    // Allocate 1GB - no physical memory used yet!
    char *ptr = malloc(1024 * 1024 * 1024);
    printf("After malloc: still no physical memory committed\n");
    sleep(5);

    // Touch first page - now 1 page (4KB) is allocated
    ptr[0] = 'A';
    printf("After touch: only 1 page allocated\n");
    sleep(5);

    // Touch every page - now all 1GB is allocated
    for (long i = 0; i < 1024 * 1024 * 1024; i += 4096) {
        ptr[i] = 'B';
    }
    printf("After full touch: all pages allocated\n");
    sleep(5);

    return 0;
}

/*
 * Monitor with: watch -n 1 'cat /proc/<pid>/status | grep -E "VmRSS|VmSize"'
 *
 * VmSize: Virtual memory size (includes uncommitted)
 * VmRSS: Resident Set Size (actually in RAM)
 *
 * You'll see VmSize jump at malloc(), but VmRSS grows only as pages are touched.
 */
```

The zero page optimization:
Linux and other systems maintain a special zero page—a single physical frame filled with zeros, mapped read-only. When a process reads from never-written anonymous memory, it gets the zero page; a private frame is allocated only when the process first writes, via another CoW-style fault.
This creates an interesting swap scenario: pages that have never been written don't need swap space at all. They're regenerated as zero pages when needed.
Swap and lazy allocation:
When memory pressure hits:
Never-touched pages — No swap needed; they're just zero-fill-on-demand. The page table entry is simply cleared.
Touched but unchanged pages — If a page was touched but remains all zeros, some systems detect this (page content checks) and treat it as never-touched.
Modified anonymous pages — These must go to swap; their content can't be regenerated.
Lazy allocation enables overcommit: a system with 16GB RAM might allocate 64GB to processes. If all processes try to use their allocations simultaneously, the OOM killer strikes. Linux's /proc/sys/vm/overcommit_memory controls this: 0 = heuristic overcommit, 1 = always overcommit, 2 = strict (don't overcommit beyond swap + percentage of RAM).
With page-based swapping, the system must decide which pages to evict when memory is needed. This is the page replacement problem, and the choice of algorithm significantly impacts performance.
The optimal algorithm (Bélády's MIN):
The theoretically optimal page replacement algorithm evicts the page that will be accessed furthest in the future. Unfortunately, this requires knowing future memory accesses—impossible in practice. However, it serves as a benchmark against which other algorithms are measured.
LRU (Least Recently Used):
The most intuitive practical approximation of optimal is LRU: evict the page that was accessed longest ago. The intuition is that recent access patterns predict future access: pages not used recently are unlikely to be used soon.
True LRU requires updating a timestamp on every memory access—too expensive. Real systems use approximations.
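The classic approximation is the clock (second-chance) algorithm, which needs only a per-frame referenced bit that hardware sets on access. A minimal sketch, with illustrative names:

```c
#include <stdio.h>
#include <stddef.h>

typedef struct {
    int page;        /* which page occupies this frame */
    int referenced;  /* set by the MMU on access, cleared by the sweep */
} frame_t;

/* Sweep the clock hand; evict the first frame not referenced since
 * the previous sweep, giving referenced frames a second chance. */
static size_t clock_evict(frame_t *frames, size_t n, size_t *hand) {
    for (;;) {
        size_t cur = *hand;
        *hand = (*hand + 1) % n;     /* advance the hand */
        if (!frames[cur].referenced)
            return cur;              /* victim found */
        frames[cur].referenced = 0;  /* second chance */
    }
}

int main(void) {
    frame_t frames[4] = { {10, 1}, {11, 0}, {12, 1}, {13, 1} };
    size_t hand = 0;
    size_t victim = clock_evict(frames, 4, &hand);
    /* Frame 0 was referenced and got a second chance; frame 1 is evicted. */
    printf("evict frame %zu (page %d)\n", victim, frames[victim].page);
    return 0;
}
```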
Linux's multi-generational LRU (MGLRU):
Introduced in Linux 6.1, MGLRU improves upon the traditional active/inactive lists by using multiple generations to more accurately track page age:
Pages age through generations over time. Access resets a page to the youngest generation. This provides finer-grained age tracking than binary active/inactive, better approximating true LRU.
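A toy model of the idea (not the actual MGLRU implementation, which tracks generations in per-node, per-cgroup structures):

```c
/* Illustrative only: pages live in one of NGEN generations. Access
 * moves a page to the youngest; aging ticks push it older; reclaim
 * scans the oldest generation first. */
#define NGEN 4

typedef struct { int gen; } mg_page_t;

void on_access(mg_page_t *p)     { p->gen = 0; }               /* youngest */
void on_aging_tick(mg_page_t *p) { if (p->gen < NGEN - 1) p->gen++; }
int  eviction_priority(const mg_page_t *p) { return p->gen; }  /* oldest first */
```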
Working set considerations:
The ideal behavior is to keep each process's working set (pages actively in use) in memory. When memory can hold all working sets, page faults are rare. When working sets exceed memory, thrashing occurs: constant page faults as processes fight for frames.
Linux's page reclaim considers working set size when deciding how aggressively to reclaim from a process. Processes with rapidly shrinking access patterns are good eviction targets; those with stable, large working sets are protected.
Modern kernels track 'refaults'—when a recently-evicted page is immediately accessed again. High refault rates indicate the working set doesn't fit in available memory. Linux uses this signal to adjust the balance between file and anonymous page eviction, and to trigger OOM earlier when thrashing is detected.
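A simplified sketch of refault detection, loosely modeled on the kernel's shadow-entry idea: when a page is evicted, a timestamp (an eviction counter) is left behind in its place, and on refault the gap estimates how much more memory would have been needed. Names and details below are illustrative.

```c
#include <stdbool.h>

/* A global eviction counter stands in for the kernel's per-node
   "eviction clock". */
static unsigned long evictions;

/* On eviction, remember the current counter value as a shadow entry
   stored where the page used to be. */
unsigned long note_eviction(void) {
    return ++evictions;
}

/* On refault, the number of evictions since the shadow was written
   approximates the refault distance. If it's smaller than the LRU
   size, the page would have survived with slightly more memory:
   the working set doesn't fit. */
bool is_thrashing_refault(unsigned long shadow, unsigned long lru_size) {
    return (evictions - shadow) < lru_size;
}
```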
Memory-mapped files (mmap()) create a direct mapping between a file and virtual address space. This technique leverages paged memory management and interacts with swap in specific ways.
```c
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    // Open a file
    int fd = open("data.bin", O_RDWR);

    // Map it into memory
    char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return 1;

    // Now addr[0..4095] accesses file content directly
    printf("First byte: %c\n", addr[0]);

    // Writes go to the file (via page cache)
    addr[0] = 'X';

    // Unmap when done
    munmap(addr, 4096);
    close(fd);
    return 0;
}
```

How mmap interacts with swap:
| mmap Type | Dirty Behavior | Swap Usage | Eviction |
|---|---|---|---|
| MAP_SHARED (file) | Writes go to file | No swap | Write to file, discard |
| MAP_PRIVATE (file) | Writes create CoW copy | Swap for modified pages | Clean: discard. Dirty: swap |
| MAP_ANONYMOUS | No file backing | Always swap | Write to swap |
| MAP_ANONYMOUS + MAP_SHARED | Shared anonymous (IPC) | Swap | Shared swap slots |
Key insights:
Shared file mappings never use swap — They write back to the file. The file is the backing store.
Private file mappings have split behavior — Unmodified pages can be discarded and re-read from the file. Modified (CoW) pages have become anonymous and must use swap (demonstrated in the sketch after this list).
Anonymous mappings always need swap — There's no file backing. Conceptually they behave like a private temporary file, but without filesystem overhead.
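The sketch below demonstrates the MAP_PRIVATE split in practice: a write to a private file mapping takes a CoW fault and becomes anonymous, so it never reaches the file. It reuses the hypothetical data.bin file (at least one page long) from the earlier example.

```c
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) return 1;

    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED) return 1;

    char original = priv[0];
    priv[0] = '!';               /* CoW fault: this page is now anonymous */

    /* Re-read the file: the write above is invisible to it. */
    char on_disk;
    pread(fd, &on_disk, 1, 0);
    printf("mapping sees '%c', file still has '%c' (was '%c')\n",
           priv[0], on_disk, original);

    munmap(priv, 4096);
    close(fd);
    return 0;
}
```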
The page cache connection:
File-backed memory mappings go through the page cache—the kernel's cache of file contents. The same physical frames back both mmap'd regions and regular read()/write() access to files. This means the two access paths stay coherent: a write through a shared mapping is visible to subsequent read() calls (and vice versa) without extra copies, and evicting a cached file page also removes it from every mapping.
Many databases (SQLite, MongoDB, LMDB) use mmap for storage. Benefits: automatic caching via page cache, zero-copy access, kernel handles paging/swapping, and persistence via file backing. Drawbacks: less control over eviction, potential swap interference, and I/O error handling complexity.
Modern kernels employ sophisticated page reclaim subsystems that orchestrate swapping and cache eviction. Understanding these internals helps diagnose memory pressure issues and tune system behavior.
Linux memory reclaim overview:
Linux's memory reclaim operates at multiple levels:
```
Allocation request
        ↓
Free pages above watermark? ──Yes──→ Satisfy from free list
        ↓ No
Wake kswapd (asynchronous background reclaim)
        ↓
Allocation still failing?
        ↓ Yes
Direct reclaim (synchronous, in the allocating process)
        ↓
Scan LRU lists
        ↓
Evict/swap pages
        ↓
Still short? Raise scan priority and try harder
        ↓
OOM killer (last resort)
```
Key components:
kswapd — Per-NUMA-node kernel thread that wakes when free memory falls below watermarks. It runs in the background, reclaiming pages to maintain a buffer of free memory.
Direct reclaim — When allocation fails despite kswapd's efforts, the allocating process itself performs synchronous reclaim. This adds latency to the allocation.
Watermarks — Three per-zone thresholds control behavior: high (kswapd stops reclaiming once free memory rises above it), low (falling below it wakes kswapd), and min (falling below it forces allocators into direct reclaim; only the most critical allocations may dip into the remaining reserve). See the sketch after this list.
Priorities — Reclaim scans LRU lists with increasing intensity. Early scans are gentle; later scans become more aggressive as desperation increases.
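The watermark relationships reduce to a few comparisons. A hedged sketch, with illustrative names:

```c
/* Illustrative only: per-zone watermark checks as described above. */
typedef struct {
    long free, min, low, high;   /* page counts */
} zone_t;

int should_wake_kswapd(const zone_t *z)   { return z->free < z->low;  }
int kswapd_done(const zone_t *z)          { return z->free >= z->high; }
int needs_direct_reclaim(const zone_t *z) { return z->free < z->min;  }
```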
```bash
#!/bin/bash
# Memory reclaim inspection and tuning

# View watermarks per zone
cat /proc/zoneinfo | grep -E "Node|min|low|high|free"

# View kswapd activity
vmstat 1  # Watch si/so (swap in/out) and bi/bo (block I/O)

# Check memory pressure indicators
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# Higher values indicate stalled processes

# Tuning parameters
# vm.swappiness - balance anonymous vs file reclaim
sysctl vm.swappiness

# vm.vfs_cache_pressure - willingness to reclaim dentries/inodes
sysctl vm.vfs_cache_pressure

# vm.watermark_scale_factor - distance between watermarks
sysctl vm.watermark_scale_factor

# vm.dirty_ratio / dirty_background_ratio - when to writeback dirty pages
sysctl vm.dirty_ratio
sysctl vm.dirty_background_ratio

# Per-cgroup memory.swap limits (cgroups v2)
cat /sys/fs/cgroup/my_group/memory.swap.max
```

When kswapd can't keep up and direct reclaim kicks in, latency spikes occur. A process that expected a quick malloc() may stall for hundreds of milliseconds while reclaiming pages. This is why servers often pre-allocate memory and monitor free memory carefully—to avoid direct reclaim entirely.
Page-based swapping transforms memory management from coarse process-level operations to fine-grained page management. This approach enables efficient memory utilization, sophisticated sharing, and graceful degradation under pressure.
What's next:
We've explored how swapping integrates with paging. The final topic—performance considerations—examines the practical impact of swapping on system performance: measuring swap activity, identifying pathological patterns, and tuning systems to balance memory utilization against responsiveness.
You now understand how paging transforms swapping: demand paging, anonymous vs. file-backed pages, CoW integration, lazy allocation, page replacement, and kernel reclaim internals. Next, we'll focus on the performance implications of these mechanisms.