Swap space, as we learned, is the reservoir of disk storage that extends physical memory. But knowing that swap exists is only half the story. The critical question is: How does data actually move between RAM and swap?
The answer lies in two complementary operations: swap out (moving data from RAM to disk) and swap in (bringing data back from disk to RAM). These operations are orchestrated by the operating system's memory manager, working in concert with the process scheduler and I/O subsystem to maintain the illusion that processes have unlimited memory.
This page takes you deep into the mechanics of these operations—when they trigger, what steps they involve, how the kernel maintains consistency, and why processes remain oblivious to the fact that their memory is temporarily on disk.
By the end of this page, you will understand the complete lifecycle of a swapped page—from eviction decisions through disk I/O to page fault retrieval. You'll see the data structures, synchronization, and optimizations that make swapping practical in production systems.
Swap out is the process of writing a memory page from RAM to swap space, then reclaiming the physical frame for other use. This operation is triggered when the system needs more free memory than is currently available.
The swap out operation is not instantaneous—it involves disk I/O, which is orders of magnitude slower than memory access. Therefore, the operating system must carefully select which pages to swap out, minimizing the likelihood that those pages will be needed again soon.
In practice, swap-out usually runs proactively in a background reclaim thread (kswapd in Linux), avoiding allocation failures by replenishing free frames before they are exhausted.

The page selection problem:
Not all pages are equally suitable for swapping. The kernel maintains a classification:
Anonymous pages — Pages not backed by a file (heap, stack). These must be written to swap if evicted, as there's no other backing store.
File-backed pages — Pages mapped from files (executables, shared libraries, mmap'd files). These can often be discarded rather than swapped—the file on disk is already the backing store. If dirty (modified), they're written back to the file, not to swap.
Kernel pages — Pages used by the kernel itself. Most are not swappable; some can be reclaimed if they cache data that can be reconstructed.
Locked/pinned pages — Pages explicitly marked non-swappable by the process (via mlock()) or kernel. These must stay in RAM.
Understanding this distinction is crucial: file-backed pages often don't use swap at all. When memory is tight, the kernel prefers evicting clean file-backed pages (which can be re-read from the original file) over anonymous pages (which require swap I/O). This is why applications with large working sets of file data often perform better than those with equivalent anonymous allocations.
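To make the distinction concrete, here is a minimal user-space sketch (an illustration, not part of the kernel mechanics described below) that creates both kinds of mapping with mmap. The file path is just an example; any readable file works.

// Illustrative sketch: an anonymous mapping vs. a file-backed mapping.
// If evicted, the first can only go to swap; clean pages of the second
// can simply be dropped and re-read from the file later.
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 4096 * 1024;  // 4 MB of anonymous memory

    // Anonymous memory: no backing file, so eviction requires swap space.
    char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // File-backed memory: the file itself is the backing store.
    int fd = open("/etc/hosts", O_RDONLY);  // example path only
    char *filemap = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    if (anon == MAP_FAILED || filemap == MAP_FAILED)
        return 1;

    anon[0] = 1;          // dirty an anonymous page
    char c = filemap[0];  // fault in a clean file-backed page
    (void)c;

    munmap(anon, len);
    munmap(filemap, 4096);
    close(fd);
    return 0;
}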
Let's trace the complete swap out operation for an anonymous page. This sequence illustrates the careful choreography required to safely evict a page while maintaining system consistency.
Reverse mapping (rmap):
Step 4—unmapping from page tables—is particularly complex. A single physical page may be mapped into multiple processes (via shared memory or copy-on-write). The kernel must find and update all page table entries pointing to this frame.
Linux maintains a reverse mapping (rmap) structure for each page, tracking which processes have the page mapped and at what virtual addresses. When swapping out, the kernel walks this structure to clear each mapping's PTE, record the swap entry in its place, and invalidate any stale TLB entries.
This ensures that if any process tries to access the page after eviction, a page fault occurs and triggers swap-in.
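The walk can be pictured with the following self-contained sketch. The types and helper names are invented for illustration; the real kernel walks anon_vma and address_space structures and uses architecture-specific PTE and TLB helpers.

#include <stdint.h>

// Illustrative model of the reverse-mapping walk during swap-out.
typedef uint64_t pte_demo_t;            // simplified PTE: bit 0 = present

struct mapping_demo {                   // one user of the physical page
    pte_demo_t *pte;                    // location of that user's PTE
};

struct rmap_demo {                      // reverse map: all users of one page
    struct mapping_demo *maps;
    int nr_maps;
};

// Replace every PTE that points at the page with the swap entry, so the
// next access by any of those processes faults and triggers swap-in.
static void unmap_all_demo(struct rmap_demo *rmap, pte_demo_t swap_entry) {
    for (int i = 0; i < rmap->nr_maps; i++) {
        pte_demo_t *pte = rmap->maps[i].pte;
        *pte = swap_entry & ~(pte_demo_t)1;  // store swap entry, present bit = 0
        // A real kernel would also flush the TLB entry for this address here.
    }
}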
// Conceptual illustration of swap out logic
int try_to_swap_out(struct page *page)
{
    // Step 1: Lock the page
    if (!trylock_page(page))
        return SWAP_AGAIN;  // Busy, try later

    // Step 2: Check if page is still a candidate
    if (page_mapped(page) == 0) {
        // No mappings - can free directly
        unlock_page(page);
        return SWAP_SUCCESS;
    }

    // Step 3: Allocate swap slot
    swp_entry_t entry = get_swap_page();
    if (!entry.val) {
        unlock_page(page);
        return SWAP_FAIL;  // Swap full
    }

    // Step 4: Unmap from all page tables
    // Walk reverse mappings and clear PTEs
    int success = try_to_unmap(page, TTU_IGNORE_MLOCK);
    if (!success) {
        put_swap_page(entry);  // Release swap slot
        unlock_page(page);
        return SWAP_FAIL;
    }

    // Step 5: Write page content to swap
    add_to_swap_cache(page, entry);
    set_page_dirty(page);  // Ensure writeback

    // Step 6: Initiate I/O
    int err = swap_writepage(page, &wbc);
    if (err) {
        // Handle I/O error
        remove_from_swap_cache(page);
        put_swap_page(entry);
        unlock_page(page);
        return SWAP_FAIL;
    }

    // Step 7: Wait for completion (or return for async)
    wait_on_page_writeback(page);

    // Step 8: Success - page is now on swap
    // Frame can be reclaimed after this
    unlock_page(page);
    return SWAP_SUCCESS;
}

Swap in is the reverse operation: reading a page from swap space back into RAM. This operation is triggered when a process attempts to access a page that has been swapped out, causing a page fault.
The swap-in path is more latency-sensitive than swap-out. When a process faults on a swapped page, it typically cannot continue until the page is loaded—the process is blocked waiting for disk I/O. This makes swap-in performance critical to user experience.
The page fault path:
When a process accesses a virtual address, the MMU (Memory Management Unit) translates it using the page table. If the PTE's present bit is clear, a page fault exception occurs.
The kernel's page fault handler examines the faulted address: if it lies within a valid mapping (VMA) and the non-present PTE contains a swap entry rather than being empty, the fault is resolved by swapping the page back in instead of signaling an error.
The swap entry in the PTE encodes which swap area (device or file) holds the page and the slot offset within that area.
This information is sufficient to locate and read the page from disk.
A page table entry is typically 8 bytes (64-bit system). When a page is present in RAM, the PTE contains the physical frame number and flags. When swapped, the same 8 bytes store a swap entry identifier instead. The present bit (bit 0) distinguishes these cases: 0 = not present (check for swap entry), 1 = present (use frame number).
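As a rough illustration of how those bits can be reused, here is a hypothetical 64-bit layout. The exact field positions differ across architectures and kernel versions; only the idea (present bit clear, area and slot packed into the remaining bits) matches the text above.

#include <stdint.h>

// Hypothetical swap-entry layout in a 64-bit non-present PTE:
// bit 0       present bit (0 = not present, so the MMU faults on access)
// bits 1..6   swap area index (which swap device or file)
// bits 7..57  slot offset within that swap area
typedef uint64_t pte_demo_t;

static pte_demo_t make_swap_pte(unsigned area, uint64_t slot) {
    return ((slot & 0x7FFFFFFFFFFFFULL) << 7) | ((uint64_t)(area & 0x3F) << 1);
}

static unsigned swap_area_of(pte_demo_t pte) { return (pte >> 1) & 0x3F; }
static uint64_t swap_slot_of(pte_demo_t pte) { return pte >> 7; }

// Usage: make_swap_pte(1, 12345) yields a PTE with bit 0 clear,
// swap_area_of() == 1 and swap_slot_of() == 12345.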
Let's trace the complete swap-in operation, from page fault to process resumption.
The swap cache:
The swap cache is a critical optimization. It maintains a mapping from swap entries to in-memory pages:
During swap-out, a page is added to the swap cache before the disk write completes. If a process faults on that page during writeback, the in-memory copy is returned immediately.
After swap-in, the page may remain in the swap cache briefly. If memory pressure causes re-eviction, the disk copy is still valid—no new write is needed.
Copy-on-write optimization: If multiple processes share a swapped page, the swap cache ensures only one disk read occurs. Subsequent faults find the page in cache.
The swap cache is indexed by (swap_area, slot_number) and allows O(1) lookup of whether a given swap slot has a corresponding in-memory page.
// Conceptual illustration of swap in logic
int do_swap_page(struct vm_fault *vmf)
{
    // Step 1: Extract swap entry from PTE
    pte_t pte = vmf->orig_pte;
    swp_entry_t entry = pte_to_swp_entry(pte);

    // Step 2: Check swap cache
    struct page *page = lookup_swap_cache(entry);
    if (page) {
        // Cache hit - page already in memory
        lock_page(page);
        goto have_page;
    }

    // Step 3: Allocate a fresh frame
    page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page)
        return VM_FAULT_OOM;

    // Step 4: Add to swap cache (prevent races)
    if (add_to_swap_cache(page, entry) < 0) {
        // Another thread beat us - use their page
        put_page(page);
        page = lookup_swap_cache(entry);
        if (!page)
            return VM_FAULT_OOM;
        lock_page(page);
        goto have_page;
    }

    // Step 5: Read from swap
    lock_page(page);
    int err = swap_readpage(page);
    if (err) {
        delete_from_swap_cache(page);
        unlock_page(page);
        put_page(page);
        return VM_FAULT_ERROR;
    }

    // Step 6: Wait for I/O
    wait_on_page_locked(page);

have_page:
    // Step 7: Update page table
    pte = mk_pte(page, vmf->vma->vm_page_prot);
    set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);

    // Step 8: Decrement swap reference count
    swap_free(entry);

    // Step 9: Done - process can resume
    unlock_page(page);
    return VM_FAULT_NOPAGE;
}

Swapping occurs in a highly concurrent environment. Multiple processors may simultaneously fault on the same swapped page, evict pages under memory pressure, and allocate or release swap slots.
The kernel uses several synchronization mechanisms to ensure correctness.
Race condition example:
Consider a scenario where CPU A is swapping out page P while CPU B (running a process that maps P) accesses P. If A has already unmapped P and begun the disk write when B faults, B's fault handler finds P still present in the swap cache and simply remaps the in-memory copy. Without the swap cache, CPU B would have to wait for A's write to complete and then read the page back from disk, paying for two I/O operations. The swap cache eliminates this inefficiency.
Another race involves re-eviction. A page that was recently swapped in can become an eviction candidate again while its copy in the swap area is still valid; if the page has not been modified since, the kernel can drop it without writing it out a second time. This "swap backing" means a page might simultaneously exist in RAM and have a valid swap slot. The kernel tracks this to avoid unnecessary I/O.
Swap synchronization has overhead. Under extreme memory pressure with many CPUs, contention for page locks and swap data structures can become a bottleneck. This is one reason why swap-heavy workloads scale poorly—beyond the I/O cost, there's CPU time spent waiting for locks.
Disk I/O has high latency but reasonable throughput once started. Reading one 4KB page takes nearly as long as reading 64KB due to seek time dominance (on HDDs) or command overhead (on SSDs). Modern kernels exploit this through read-ahead and clustering.
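To put rough, illustrative numbers on it: on an HDD with about 8 ms of combined seek and rotational latency and about 150 MB/s of sequential throughput, a 4 KB read costs roughly 8.03 ms while a 64 KB read costs roughly 8.4 ms. Positioning dominates, so transferring sixteen times as much data adds only a few percent to the total time.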
How clustering helps swap-out:
Linux maintains LRU (Least Recently Used) lists of pages. When memory pressure triggers reclaim, pages at the tail of the LRU list are eviction candidates. These pages are often from the same process and may be virtually adjacent.
The kernel scans the eviction candidates and groups pages that can be assigned contiguous slots in the swap area, so their contents can be written out in a single, larger request.
Instead of 8 separate 4KB writes, the kernel issues one 32KB write. This is particularly beneficial for HDDs, where each operation incurs seek overhead.
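A sketch of the grouping step, assuming the eviction candidates have already been assigned swap slots and sorted by slot number (the structure and function names are invented for the example, not kernel APIs):

#include <stdint.h>

struct page;  // opaque here, as in the conceptual kernel code above

struct swap_cluster {
    uint64_t     first_slot;   // first swap slot of the contiguous run
    int          npages;       // number of 4 KB pages in the run
    struct page *pages[64];
};

// victims[i] was assigned swap slot slots[i]; entries are sorted by slot
// and n >= 1. Collect the leading run of consecutive slots so the caller
// can issue one write of npages * 4 KB instead of npages separate writes.
static int build_cluster(struct page **victims, const uint64_t *slots,
                         int n, struct swap_cluster *out) {
    out->first_slot = slots[0];
    out->pages[0] = victims[0];
    out->npages = 1;
    for (int i = 1; i < n && out->npages < 64; i++) {
        if (slots[i] != slots[i - 1] + 1)
            break;  // contiguous run ended; stop the cluster here
        out->pages[out->npages++] = victims[i];
    }
    return out->npages;
}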
Read-ahead tunables (Linux):
# View current swap read-ahead (in pages)
cat /proc/sys/vm/page-cluster
# Default is 3, meaning 2^3 = 8 pages read-ahead
# Increase for sequential workloads
echo 4 > /proc/sys/vm/page-cluster # 16 pages
# Decrease for random access patterns
echo 1 > /proc/sys/vm/page-cluster # 2 pages
Choosing the right value depends on workload characteristics. Too high wastes I/O bandwidth on unused pages; too low causes many small reads.
On SSDs, seek time is negligible, so the case for clustering is weaker. However, issuing larger I/O requests still reduces command overhead and can better utilize SSD parallelism. Read-ahead remains valuable for reducing the number of page faults, even if I/O latency per fault is lower.
A remarkable property of virtual memory and swapping is transparency: processes are completely unaware that their pages may reside on disk. From the application's perspective, all memory access works identically—pointers work, data is consistent, and the process never sees "swap" directly.
How is this illusion maintained?
What the process experiences:
Imagine a process executing:
int value = *ptr; // ptr points to a swapped page
From the process's perspective, nothing unusual happens: it executes the load instruction and, a moment later, value holds the correct data.
What the process doesn't see is everything that happened in between: the hardware raised a page fault, the kernel allocated a frame, read the page back from swap, updated the page table, and restarted the instruction.
The latency increase is visible (if you measure carefully), but the semantics are identical to a present-page access.
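One way to see that only the latency changes is to time a single access from user space. This is an illustrative sketch, not part of the lesson's kernel code; whether the access actually faults on a swap entry depends on memory pressure when it runs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 64 * 1024 * 1024;
    volatile char *buf = malloc(n);
    if (!buf)
        return 1;
    buf[0] = 42;  // make the page exist; it may later be swapped out

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    char value = buf[0];              // may or may not hit a swap-in fault
    clock_gettime(CLOCK_MONOTONIC, &end);

    // A resident page typically costs tens of nanoseconds; a swapped-out
    // page can cost microseconds (SSD) to milliseconds (HDD). The result
    // in 'value' is identical either way.
    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("value=%d, access took %lld ns\n", value, ns);

    free((void *)buf);
    return 0;
}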
Signals and swap:
One subtle interaction: if a process registers a signal handler, and the handler's code page is swapped out when a signal arrives, the kernel must swap in the handler page before delivering the signal. This is handled transparently, but the signal delivery latency may increase.
For real-time applications where latency predictability matters, swapping is problematic. A page fault can introduce milliseconds of delay. Such systems often use mlock() to pin critical pages in RAM, or disable swap entirely to ensure deterministic behavior.
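For example, a latency-sensitive program can pin all of its current and future pages at startup with mlockall. This is a minimal sketch; it omits setup such as raising RLIMIT_MEMLOCK or granting CAP_IPC_LOCK.

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    // Pin all current and future pages of this process in RAM so that no
    // access can ever stall on a swap-in.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    // ... latency-critical work happens here, free of swap-in faults ...

    munlockall();
    return 0;
}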
The swap in and swap out operations are the fundamental mechanisms by which the operating system extends physical memory to disk. Let's consolidate the key insights:
Swap out selects a victim page (preferring cold anonymous pages), unmaps it from every page table via the reverse mapping, writes its contents to a swap slot, and reclaims the frame.
Swap in is driven by page faults: the swap entry stored in the non-present PTE identifies the swap area and slot, the page is read into a fresh frame, and the PTE is restored.
The swap cache bridges the two paths, preventing duplicate I/O when a page is accessed during writeback or re-evicted while its disk copy is still valid.
Clustering and read-ahead amortize disk overhead by turning many small transfers into fewer, larger ones, tunable via /proc/sys/vm/page-cluster.
All of this is transparent to processes: only the latency of an access changes, never its semantics.
What's next:
Having understood the mechanics of swap in/swap out, we now turn to standard swapping—the historical approach of swapping entire processes rather than individual pages. Understanding this older technique illuminates why modern paging-based swapping evolved and why some systems still fall back to process-level swapping under extreme pressure.
You now understand the complete lifecycle of swapped pages—from eviction decisions through disk I/O to page fault retrieval. Next, we'll explore standard swapping as a historical and fallback mechanism.