Swap space, as we learned, is the reservoir of disk storage that extends physical memory. But knowing that swap exists is only half the story. The critical question is: How does data actually move between RAM and swap?
The answer lies in two complementary operations: swap out (moving data from RAM to disk) and swap in (bringing data back from disk to RAM). These operations are orchestrated by the operating system's memory manager, working in concert with the process scheduler and I/O subsystem to maintain the illusion that processes have unlimited memory.
This page takes you deep into the mechanics of these operations—when they trigger, what steps they involve, how the kernel maintains consistency, and why processes remain oblivious to the fact that their memory is temporarily on disk.
By the end of this page, you will understand the complete lifecycle of a swapped page—from eviction decisions through disk I/O to page fault retrieval. You'll see the data structures, synchronization, and optimizations that make swapping practical in production systems.
Swap out is the process of writing a memory page from RAM to swap space, then reclaiming the physical frame for other use. This operation is triggered when the system needs more free memory than is currently available.
The swap out operation is not instantaneous—it involves disk I/O, which is orders of magnitude slower than memory access. Therefore, the operating system must carefully select which pages to swap out, minimizing the likelihood that those pages will be needed again soon.
In practice, swap-out usually runs proactively in a background reclaim thread (kswapd in Linux), avoiding allocation failures by replenishing free frames before they are exhausted.

The page selection problem:
Not all pages are equally suitable for swapping. The kernel maintains a classification:
Anonymous pages — Pages not backed by a file (heap, stack). These must be written to swap if evicted, as there's no other backing store.
File-backed pages — Pages mapped from files (executables, shared libraries, mmap'd files). These can often be discarded rather than swapped—the file on disk is already the backing store. If dirty (modified), they're written back to the file, not to swap.
Kernel pages — Pages used by the kernel itself. Most are not swappable; some can be reclaimed if they cache data that can be reconstructed.
Locked/pinned pages — Pages explicitly marked non-swappable by the process (via mlock()) or kernel. These must stay in RAM.
Understanding this distinction is crucial: file-backed pages often don't use swap at all. When memory is tight, the kernel prefers evicting clean file-backed pages (which can be re-read from the original file) over anonymous pages (which require swap I/O). This is why applications with large working sets of file data often perform better than those with equivalent anonymous allocations.
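To make the distinction concrete, here is a minimal user-space sketch (an illustration, not part of the kernel mechanics described below) that creates both kinds of mapping with mmap. The file path is just an example; any readable file works.

// Illustrative sketch: an anonymous mapping vs. a file-backed mapping.
// If evicted, the first can only go to swap; clean pages of the second
// can simply be dropped and re-read from the file later.
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t len = 4096 * 1024;  // 4 MB of anonymous memory

    // Anonymous memory: no backing file, so eviction requires swap space.
    char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // File-backed memory: the file itself is the backing store.
    int fd = open("/etc/hosts", O_RDONLY);  // example path only
    char *filemap = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

    if (anon == MAP_FAILED || filemap == MAP_FAILED)
        return 1;

    anon[0] = 1;          // dirty an anonymous page
    char c = filemap[0];  // fault in a clean file-backed page
    (void)c;

    munmap(anon, len);
    munmap(filemap, 4096);
    close(fd);
    return 0;
}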
Let's trace the complete swap out operation for an anonymous page. This sequence illustrates the careful choreography required to safely evict a page while maintaining system consistency.
Reverse mapping (rmap):
Step 4—unmapping from page tables—is particularly complex. A single physical page may be mapped into multiple processes (via shared memory or copy-on-write). The kernel must find and update all page table entries pointing to this frame.
Linux maintains a reverse mapping (rmap) structure for each page, tracking which processes have the page mapped and at what virtual addresses. When swapping out, the kernel walks this structure to clear each mapping's PTE, record the swap entry in its place, and invalidate any stale TLB entries.
This ensures that if any process tries to access the page after eviction, a page fault occurs and triggers swap-in.
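The walk can be pictured with the following self-contained sketch. The types and helper names are invented for illustration; the real kernel walks anon_vma and address_space structures and uses architecture-specific PTE and TLB helpers.

#include <stdint.h>

// Illustrative model of the reverse-mapping walk during swap-out.
typedef uint64_t pte_demo_t;            // simplified PTE: bit 0 = present

struct mapping_demo {                   // one user of the physical page
    pte_demo_t *pte;                    // location of that user's PTE
};

struct rmap_demo {                      // reverse map: all users of one page
    struct mapping_demo *maps;
    int nr_maps;
};

// Replace every PTE that points at the page with the swap entry, so the
// next access by any of those processes faults and triggers swap-in.
static void unmap_all_demo(struct rmap_demo *rmap, pte_demo_t swap_entry) {
    for (int i = 0; i < rmap->nr_maps; i++) {
        pte_demo_t *pte = rmap->maps[i].pte;
        *pte = swap_entry & ~(pte_demo_t)1;  // store swap entry, present bit = 0
        // A real kernel would also flush the TLB entry for this address here.
    }
}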
// Conceptual illustration of swap out logic
int try_to_swap_out(struct page *page)
{
    // Step 1: Lock the page
    if (!trylock_page(page))
        return SWAP_AGAIN;  // Busy, try later

    // Step 2: Check if page is still a candidate
    if (page_mapped(page) == 0) {
        // No mappings - can free directly
        unlock_page(page);
        return SWAP_SUCCESS;
    }

    // Step 3: Allocate swap slot
    swp_entry_t entry = get_swap_page();
    if (!entry.val) {
        unlock_page(page);
        return SWAP_FAIL;  // Swap full
    }

    // Step 4: Unmap from all page tables
    // Walk reverse mappings and clear PTEs
    int success = try_to_unmap(page, TTU_IGNORE_MLOCK);
    if (!success) {
        put_swap_page(entry);  // Release swap slot
        unlock_page(page);
        return SWAP_FAIL;
    }

    // Step 5: Write page content to swap
    add_to_swap_cache(page, entry);
    set_page_dirty(page);  // Ensure writeback

    // Step 6: Initiate I/O
    int err = swap_writepage(page, &wbc);
    if (err) {
        // Handle I/O error
        remove_from_swap_cache(page);
        put_swap_page(entry);
        unlock_page(page);
        return SWAP_FAIL;
    }

    // Step 7: Wait for completion (or return for async)
    wait_on_page_writeback(page);

    // Step 8: Success - page is now on swap
    // Frame can be reclaimed after this
    unlock_page(page);
    return SWAP_SUCCESS;
}

Swap in is the reverse operation: reading a page from swap space back into RAM. This operation is triggered when a process attempts to access a page that has been swapped out, causing a page fault.
The swap-in path is more latency-sensitive than swap-out. When a process faults on a swapped page, it typically cannot continue until the page is loaded—the process is blocked waiting for disk I/O. This makes swap-in performance critical to user experience.
The page fault path:
When a process accesses a virtual address, the MMU (Memory Management Unit) translates it using the page table. If the PTE's present bit is clear, a page fault exception occurs.
The kernel's page fault handler examines the faulted address: if it lies within a valid mapping (VMA) and the non-present PTE contains a swap entry rather than being empty, the fault is resolved by swapping the page back in instead of signaling an error.
The swap entry in the PTE encodes which swap area (device or file) holds the page and the slot offset within that area.
This information is sufficient to locate and read the page from disk.
A page table entry is typically 8 bytes (64-bit system). When a page is present in RAM, the PTE contains the physical frame number and flags. When swapped, the same 8 bytes store a swap entry identifier instead. The present bit (bit 0) distinguishes these cases: 0 = not present (check for swap entry), 1 = present (use frame number).
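As a rough illustration of how those bits can be reused, here is a hypothetical 64-bit layout. The exact field positions differ across architectures and kernel versions; only the idea (present bit clear, area and slot packed into the remaining bits) matches the text above.

#include <stdint.h>

// Hypothetical swap-entry layout in a 64-bit non-present PTE:
// bit 0       present bit (0 = not present, so the MMU faults on access)
// bits 1..6   swap area index (which swap device or file)
// bits 7..57  slot offset within that swap area
typedef uint64_t pte_demo_t;

static pte_demo_t make_swap_pte(unsigned area, uint64_t slot) {
    return ((slot & 0x7FFFFFFFFFFFFULL) << 7) | ((uint64_t)(area & 0x3F) << 1);
}

static unsigned swap_area_of(pte_demo_t pte) { return (pte >> 1) & 0x3F; }
static uint64_t swap_slot_of(pte_demo_t pte) { return pte >> 7; }

// Usage: make_swap_pte(1, 12345) yields a PTE with bit 0 clear,
// swap_area_of() == 1 and swap_slot_of() == 12345.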
Let's trace the complete swap-in operation, from page fault to process resumption.
The swap cache:
The swap cache is a critical optimization. It maintains a mapping from swap entries to in-memory pages:
During swap-out, a page is added to the swap cache before the disk write completes. If a process faults on that page during writeback, the in-memory copy is returned immediately.
After swap-in, the page may remain in the swap cache briefly. If memory pressure causes re-eviction, the disk copy is still valid—no new write is needed.
Copy-on-write optimization: If multiple processes share a swapped page, the swap cache ensures only one disk read occurs. Subsequent faults find the page in cache.
The swap cache is indexed by (swap_area, slot_number) and allows O(1) lookup of whether a given swap slot has a corresponding in-memory page.
// Conceptual illustration of swap in logic
int do_swap_page(struct vm_fault *vmf)
{
    // Step 1: Extract swap entry from PTE
    pte_t pte = vmf->orig_pte;
    swp_entry_t entry = pte_to_swp_entry(pte);

    // Step 2: Check swap cache
    struct page *page = lookup_swap_cache(entry);
    if (page) {
        // Cache hit - page already in memory
        lock_page(page);
        goto have_page;
    }

    // Step 3: Allocate a fresh frame
    page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page)
        return VM_FAULT_OOM;

    // Step 4: Add to swap cache (prevent races)
    if (add_to_swap_cache(page, entry) < 0) {
        // Another thread beat us - use their page
        put_page(page);
        page = lookup_swap_cache(entry);
        if (!page)
            return VM_FAULT_OOM;
        lock_page(page);
        goto have_page;
    }

    // Step 5: Read from swap
    lock_page(page);
    int err = swap_readpage(page);
    if (err) {
        delete_from_swap_cache(page);
        unlock_page(page);
        put_page(page);
        return VM_FAULT_ERROR;
    }

    // Step 6: Wait for I/O
    wait_on_page_locked(page);

have_page:
    // Step 7: Update page table
    pte = mk_pte(page, vmf->vma->vm_page_prot);
    set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);

    // Step 8: Decrement swap reference count
    swap_free(entry);

    // Step 9: Done - process can resume
    unlock_page(page);
    return VM_FAULT_NOPAGE;
}

Swapping occurs in a highly concurrent environment. Multiple processors may simultaneously fault on the same swapped page, evict pages under memory pressure, and allocate or release swap slots.
The kernel uses several synchronization mechanisms to ensure correctness.
Race condition example:
Consider a scenario where CPU A is swapping out page P while CPU B (running a process that maps P) accesses P. If A has already unmapped P and begun the disk write when B faults, B's fault handler finds P still present in the swap cache and simply remaps the in-memory copy. Without the swap cache, CPU B would have to wait for A's write to complete and then read the page back from disk, paying for two I/O operations. The swap cache eliminates this inefficiency.
Another race involves re-eviction. A page that was recently swapped in can become an eviction candidate again while its copy in the swap area is still valid; if the page has not been modified since, the kernel can drop it without writing it out a second time. This "swap backing" means a page might simultaneously exist in RAM and have a valid swap slot. The kernel tracks this to avoid unnecessary I/O.
Swap synchronization has overhead. Under extreme memory pressure with many CPUs, contention for page locks and swap data structures can become a bottleneck. This is one reason why swap-heavy workloads scale poorly—beyond the I/O cost, there's CPU time spent waiting for locks.
Disk I/O has high latency but reasonable throughput once started. Reading one 4KB page takes nearly as long as reading 64KB due to seek time dominance (on HDDs) or command overhead (on SSDs). Modern kernels exploit this through read-ahead and clustering.
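To put rough, illustrative numbers on it: on an HDD with about 8 ms of combined seek and rotational latency and about 150 MB/s of sequential throughput, a 4 KB read costs roughly 8.03 ms while a 64 KB read costs roughly 8.4 ms. Positioning dominates, so transferring sixteen times as much data adds only a few percent to the total time.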
How clustering helps swap-out:
Linux maintains LRU (Least Recently Used) lists of pages. When memory pressure triggers reclaim, pages at the tail of the LRU list are eviction candidates. These pages are often from the same process and may be virtually adjacent.
The kernel scans the eviction candidates and groups pages that can be assigned contiguous slots in the swap area, so their contents can be written out in a single, larger request.
Instead of 8 separate 4KB writes, the kernel issues one 32KB write. This is particularly beneficial for HDDs, where each operation incurs seek overhead.
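A sketch of the grouping step, assuming the eviction candidates have already been assigned swap slots and sorted by slot number (the structure and function names are invented for the example, not kernel APIs):

#include <stdint.h>

struct page;  // opaque here, as in the conceptual kernel code above

struct swap_cluster {
    uint64_t     first_slot;   // first swap slot of the contiguous run
    int          npages;       // number of 4 KB pages in the run
    struct page *pages[64];
};

// victims[i] was assigned swap slot slots[i]; entries are sorted by slot
// and n >= 1. Collect the leading run of consecutive slots so the caller
// can issue one write of npages * 4 KB instead of npages separate writes.
static int build_cluster(struct page **victims, const uint64_t *slots,
                         int n, struct swap_cluster *out) {
    out->first_slot = slots[0];
    out->pages[0] = victims[0];
    out->npages = 1;
    for (int i = 1; i < n && out->npages < 64; i++) {
        if (slots[i] != slots[i - 1] + 1)
            break;  // contiguous run ended; stop the cluster here
        out->pages[out->npages++] = victims[i];
    }
    return out->npages;
}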
Read-ahead tunables (Linux):
# View current swap read-ahead (in pages)
cat /proc/sys/vm/page-cluster
# Default is 3, meaning 2^3 = 8 pages read-ahead
# Increase for sequential workloads
echo 4 > /proc/sys/vm/page-cluster # 16 pages
# Decrease for random access patterns
echo 1 > /proc/sys/vm/page-cluster # 2 pages
Choosing the right value depends on workload characteristics. Too high wastes I/O bandwidth on unused pages; too low causes many small reads.
On SSDs, seek time is negligible, so the case for clustering is weaker. However, issuing larger I/O requests still reduces command overhead and can better utilize SSD parallelism. Read-ahead remains valuable for reducing the number of page faults, even if I/O latency per fault is lower.
A remarkable property of virtual memory and swapping is transparency: processes are completely unaware that their pages may reside on disk. From the application's perspective, all memory access works identically—pointers work, data is consistent, and the process never sees "swap" directly.
How is this illusion maintained?
What the process experiences:
Imagine a process executing:
int value = *ptr; // ptr points to a swapped page
From the process's perspective, nothing unusual happens: it executes the load instruction and, a moment later, value holds the correct data.
What the process doesn't see is everything that happened in between: the hardware raised a page fault, the kernel allocated a frame, read the page back from swap, updated the page table, and restarted the instruction.
The latency increase is visible (if you measure carefully), but the semantics are identical to a present-page access.
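One way to see that only the latency changes is to time a single access from user space. This is an illustrative sketch, not part of the lesson's kernel code; whether the access actually faults on a swap entry depends on memory pressure when it runs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t n = 64 * 1024 * 1024;
    volatile char *buf = malloc(n);
    if (!buf)
        return 1;
    buf[0] = 42;  // make the page exist; it may later be swapped out

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    char value = buf[0];              // may or may not hit a swap-in fault
    clock_gettime(CLOCK_MONOTONIC, &end);

    // A resident page typically costs tens of nanoseconds; a swapped-out
    // page can cost microseconds (SSD) to milliseconds (HDD). The result
    // in 'value' is identical either way.
    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("value=%d, access took %lld ns\n", value, ns);

    free((void *)buf);
    return 0;
}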
Signals and swap:
One subtle interaction: if a process registers a signal handler, and the handler's code page is swapped out when a signal arrives, the kernel must swap in the handler page before delivering the signal. This is handled transparently, but the signal delivery latency may increase.
For real-time applications where latency predictability matters, swapping is problematic. A page fault can introduce milliseconds of delay. Such systems often use mlock() to pin critical pages in RAM, or disable swap entirely to ensure deterministic behavior.
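For example, a latency-sensitive program can pin all of its current and future pages at startup with mlockall. This is a minimal sketch; it omits setup such as raising RLIMIT_MEMLOCK or granting CAP_IPC_LOCK.

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    // Pin all current and future pages of this process in RAM so that no
    // access can ever stall on a swap-in.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    // ... latency-critical work happens here, free of swap-in faults ...

    munlockall();
    return 0;
}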
The swap in and swap out operations are the fundamental mechanisms by which the operating system extends physical memory to disk. Let's consolidate the key insights:
Swap out selects a victim page (preferring cold anonymous pages), unmaps it from every page table via the reverse mapping, writes its contents to a swap slot, and reclaims the frame.
Swap in is driven by page faults: the swap entry stored in the non-present PTE identifies the swap area and slot, the page is read into a fresh frame, and the PTE is restored.
The swap cache bridges the two paths, preventing duplicate I/O when a page is accessed during writeback or re-evicted while its disk copy is still valid.
Clustering and read-ahead amortize disk overhead by turning many small transfers into fewer, larger ones, tunable via /proc/sys/vm/page-cluster.
All of this is transparent to processes: only the latency of an access changes, never its semantics.
What's next:
Having understood the mechanics of swap in/swap out, we now turn to standard swapping—the historical approach of swapping entire processes rather than individual pages. Understanding this older technique illuminates why modern paging-based swapping evolved and why some systems still fall back to process-level swapping under extreme pressure.
You now understand the complete lifecycle of swapped pages—from eviction decisions through disk I/O to page fault retrieval. Next, we'll explore standard swapping as a historical and fallback mechanism.