When a page fault occurs, the CPU has told the OS: "This virtual address doesn't map to any physical frame." Now the OS faces a critical question: Where is the data for this page?
This seemingly simple question has surprisingly complex answers. The page might be:

- a brand-new anonymous page that has never existed and only needs to be zero-filled,
- an anonymous page that was previously resident but has been evicted to swap space,
- a file-backed page whose content must be read from the mapped file,
- or a page that is already in memory (in the swap cache or page cache) and merely needs to be mapped.
Finding the page requires consulting multiple OS data structures, understanding the process's memory mappings, and potentially navigating swap space or filesystem metadata. This page explores every aspect of how the OS locates page content, from the high-level policy decisions to the low-level data structure lookups.
By the end of this page, you will understand: (1) How Virtual Memory Areas (VMAs) describe a process's address space, (2) The difference between anonymous and file-backed pages, (3) How swap space is organized and entries are located, (4) The page table entry's role in tracking swapped pages, (5) The complete lookup process from fault address to disk location.
The first step in finding a page is determining whether the faulting address is valid for the process at all. The OS maintains data structures that describe which regions of the virtual address space are legitimate.
The VMA Concept:
A Virtual Memory Area (VMA) represents a contiguous region of the virtual address space with uniform properties: the same access permissions (read/write/execute), the same backing (anonymous, or a specific file at a specific offset), and the same sharing mode (private or shared).
Linux's mm_struct:
In Linux, each process has an mm_struct containing the process's VMAs (kept both in a linked list and in a red-black tree for fast lookup), a pointer to the root of the page tables, and accounting information such as the resident set size.
When a page fault occurs, the first action is to search the VMA list for an entry containing the faulting address.
```c
// Simplified Linux VMA structure
// Actual structure has many more fields

struct vm_area_struct {
    // Address range
    unsigned long vm_start;          // Start address (inclusive)
    unsigned long vm_end;            // End address (exclusive)

    // Linkage
    struct vm_area_struct *vm_next;  // Next VMA in list
    struct vm_area_struct *vm_prev;  // Previous VMA in list
    struct rb_node vm_rb;            // Red-black tree node for fast lookup

    // Memory descriptor (owning process)
    struct mm_struct *vm_mm;

    // Page protection
    pgprot_t vm_page_prot;           // Access permissions
    unsigned long vm_flags;          // Flags (VM_READ, VM_WRITE, VM_EXEC, etc.)

    // Backing storage
    struct file *vm_file;            // File being mapped (NULL for anonymous)
    unsigned long vm_pgoff;          // Offset into file in PAGE_SIZE units

    // Operations
    const struct vm_operations_struct *vm_ops;  // Callbacks for fault handling

    // For anonymous pages: link to anon_vma for reverse mapping
    struct anon_vma *anon_vma;
};

// Common VM flags
#define VM_READ       0x00000001   // Can read
#define VM_WRITE      0x00000002   // Can write
#define VM_EXEC       0x00000004   // Can execute
#define VM_SHARED     0x00000008   // Shared mapping
#define VM_MAYREAD    0x00000010   // May be read
#define VM_MAYWRITE   0x00000020   // May be written
#define VM_MAYEXEC    0x00000040   // May be executed
#define VM_GROWSDOWN  0x00000100   // Stack: grows downward
#define VM_DENYWRITE  0x00000800   // Deny write to file
#define VM_LOCKED     0x00002000   // Pages are locked in memory

// Find the first VMA whose vm_end lies above the address
// (the caller still checks vm_start, e.g. for stack expansion)
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) {
    struct vm_area_struct *vma;

    // Use red-black tree for O(log n) lookup
    vma = rb_tree_lookup(&mm->mm_rb, addr);

    if (vma && addr < vma->vm_end)
        return vma;

    return NULL;  // No VMA at or above this address
}
```

Linux organizes VMAs in both a linked list (for sequential iteration) and a red-black tree (for fast lookup). The tree structure is essential for performance: a process can have hundreds of VMAs, and page faults happen frequently, so O(log n) lookup instead of O(n) is critical.
When a page fault occurs, the OS searches for the VMA containing the faulting address. The outcome of this search determines the next steps:
Case 1: No VMA Found
If no VMA contains the address, the access is invalid. This typically results in the kernel delivering a SIGSEGV signal to the process (the familiar segmentation fault), which terminates the process unless it has installed a handler.
However, there's a special case: stack expansion. If the address is just below a stack VMA (marked with VM_GROWSDOWN), the OS may expand the stack to include the new address.
Case 2: VMA Found, Permission Denied
The VMA exists, but the access type doesn't match its permissions: for example, a write to a read-only mapping or an instruction fetch from a non-executable region. The outcome is either a SIGSEGV, or, for private writable mappings that are merely write-protected at the moment, a copy-on-write fault that the kernel resolves transparently.
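In both failure cases the process receives SIGSEGV, carrying the faulting address. The small user-space sketch below (not part of the kernel code; the chosen address 0x1000 is arbitrary and assumed to be unmapped) installs a SA_SIGINFO handler and prints si_addr, which is the same address the kernel extracted from CR2.

```c
// User-space view of a failed VMA lookup: the kernel finds no valid VMA
// (or the permissions forbid the access) and delivers SIGSEGV.
// si_addr carries the faulting virtual address.
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char msg[64];
    int n = snprintf(msg, sizeof msg, "SIGSEGV at address %p\n", info->si_addr);
    write(STDERR_FILENO, msg, (size_t)n);  // write() is async-signal-safe
    _exit(1);                              // returning would just re-fault
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    // A low address that almost certainly lies inside no VMA of this process
    volatile int *bad = (volatile int *)0x1000;
    *bad = 42;  // triggers the invalid-access path described above
    return 0;
}
```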
Case 3: VMA Found, Access Permitted
The address is valid and the access is permitted. Now the OS must determine where to get the page content.
```c
// Page fault handler: VMA lookup and validation phase

static int __do_page_fault(struct mm_struct *mm, unsigned long address,
                           unsigned int flags, struct pt_regs *regs) {
    struct vm_area_struct *vma;
    int fault_type = 0;

    // Step 1: Find VMA containing the faulting address
    vma = find_vma(mm, address);

    if (!vma) {
        // No VMA contains this address
        return VM_FAULT_SIGSEGV;  // Bad address
    }

    if (vma->vm_start > address) {
        // Address is below VMA - maybe stack expansion?
        if (!(vma->vm_flags & VM_GROWSDOWN))
            return VM_FAULT_SIGSEGV;  // Can't grow
        if (expand_stack(vma, address))
            return VM_FAULT_SIGSEGV;  // Expansion failed
    }

    // Step 2: Check permissions
    if (flags & FAULT_FLAG_WRITE) {
        if (!(vma->vm_flags & VM_WRITE)) {
            // Write to non-writable page
            if (!(vma->vm_flags & VM_MAYWRITE))
                return VM_FAULT_SIGSEGV;  // Definitely not writable
            // Might be copy-on-write - handled later
            fault_type |= FAULT_TYPE_COW;
        }
    }

    if (flags & FAULT_FLAG_INSTRUCTION) {
        if (!(vma->vm_flags & VM_EXEC))
            return VM_FAULT_SIGSEGV;  // Execute on non-exec page
    }

    // Step 3: VMA is valid, access is potentially OK
    // Now determine where to get the page content
    return handle_mm_fault(mm, vma, address, flags);
}
```

Once the VMA is located and permissions validated, the OS must determine the source of the page content. Pages fall into two fundamental categories:
Anonymous Pages:
Anonymous pages are not backed by any file. They include the heap, the stack, zero-initialized data (BSS), and memory obtained via anonymous mmap.
Characteristics of anonymous pages: they start life zero-filled, they have no file to fall back on (so evicting them requires writing to swap), and their contents disappear when the process exits.
File-Backed Pages:
File-backed pages are mapped from files on disk. They include program text (code), shared libraries (.so files), and files mapped explicitly with mmap, whether shared or private.
Characteristics of file-backed pages: their initial content is read from the file, clean pages can simply be discarded (and re-read later), dirty pages are written back to the file (shared mappings) or to swap (private mappings), and the file itself outlives the process.
| Aspect | Anonymous Pages | File-Backed Pages |
|---|---|---|
| Initial content | Zero-filled | Read from file |
| Eviction (clean) | Write to swap | Discard (can re-read from file) |
| Eviction (dirty) | Write to swap | Write back (shared) or swap (private) |
| VMA has file? | vm_file = NULL | vm_file points to file |
| Examples | Heap, stack, BSS | Text, mmap files, .so libs |
| Persistence | None (process lifetime) | File outlives process |
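You can observe these two categories directly on a running Linux system: /proc/self/maps lists every VMA of the calling process, showing the address range, permissions, file offset, and the backing file (or pseudo-names like [heap] and [stack] for anonymous regions). A minimal sketch:

```c
// Dump this process's VMAs. File-backed regions show a pathname;
// anonymous regions show no file (or pseudo-names like [heap], [stack]).
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);  // e.g. "7f3a... r-xp 00000000 ... /usr/lib/libc.so.6"

    fclose(f);
    return 0;
}
```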
```c
// Determine page source based on VMA

enum page_source {
    SOURCE_ZERO_FILL,  // New anonymous page - fill with zeros
    SOURCE_SWAP,       // Previously evicted anonymous page
    SOURCE_FILE,       // File-backed page - read from file
    SOURCE_FILE_COW,   // Private file mapping - might need COW
};

enum page_source determine_page_source(struct vm_area_struct *vma,
                                        unsigned long address, pte_t *pte) {
    // Check if page was previously present (now swapped out)
    if (!pte_none(*pte) && !pte_present(*pte)) {
        // PTE has a swap entry - page was evicted to swap
        return SOURCE_SWAP;
    }

    // Check if VMA is file-backed
    if (vma->vm_file != NULL) {
        // File-backed VMA
        if (vma->vm_flags & VM_SHARED) {
            // Shared mapping - reads/writes go to file
            return SOURCE_FILE;
        } else {
            // Private mapping - might need COW on write
            return SOURCE_FILE_COW;
        }
    }

    // Anonymous VMA, first access
    return SOURCE_ZERO_FILL;
}

// Linux handles this through vm_operations_struct
// Each VMA type has its own fault handler

static const struct vm_operations_struct generic_file_vm_ops = {
    .fault        = filemap_fault,         // Handle file-backed page faults
    .map_pages    = filemap_map_pages,     // Pre-map surrounding pages
    .page_mkwrite = filemap_page_mkwrite,  // Handle COW
};

// Shared-memory (tmpfs) mappings use a different handler;
// plain anonymous VMAs have no vm_ops and are handled by the core fault code
static const struct vm_operations_struct shmem_vm_ops = {
    .fault = shmem_fault,
};
```

When a process requests memory (malloc/brk), the OS doesn't immediately allocate physical pages. It just creates a VMA. Physical allocation happens on first access: a page fault triggers, the OS allocates a zero-filled page, and maps it. This "zero-fill-on-demand" means only actually-used pages consume physical memory.
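Zero-fill-on-demand is easy to observe from user space. The sketch below (assuming Linux, where /proc/self/status reports resident memory in its VmRSS field) maps a large anonymous region and shows that resident memory only grows once the pages are actually touched:

```c
// Observe zero-fill-on-demand: an untouched anonymous mapping costs
// (almost) no physical memory; touching pages faults them in one by one.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long vm_rss_kb(void) {  // parse VmRSS from /proc/self/status
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    if (f) fclose(f);
    return kb;
}

int main(void) {
    size_t len = 64UL << 20;  // 64 MiB
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("RSS after mmap:   %ld kB\n", vm_rss_kb());  // barely changed
    memset(p, 1, len);                                   // fault in every page
    printf("RSS after memset: %ld kB\n", vm_rss_kb());  // grows by roughly 64 MiB

    munmap(p, len);
    return 0;
}
```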
When anonymous pages are evicted from memory, they're written to swap space—a region of disk dedicated to holding evicted pages. Understanding swap organization is crucial for understanding how pages are located.
Swap Space Types:
Swap Partition: A dedicated disk partition formatted for swap. Fastest option as there's no filesystem overhead.
Swap File: A regular file used as swap. More flexible (can be resized) but slightly slower due to filesystem indirection.
Swap Space Structure:
Swap space is divided into slots (or pages), each exactly one page in size. A swap entry identifies which slot contains a particular page:
```c
// Swap entry encoding
// When a page is swapped out, the PTE holds a swap entry instead of a PFN

// Swap entry format (varies by architecture, this is conceptual)
// +-------+-------------+
// | Type  | Swap Offset |
// +-------+-------------+
//   5 bits   (varies)

typedef struct {
    unsigned long val;
} swp_entry_t;

// Multiple swap devices can be configured
#define MAX_SWAPFILES 32

// Swap info for each swap device
struct swap_info_struct {
    unsigned long flags;        // SWP_USED, SWP_WRITEOK, etc.
    struct file *swap_file;     // File or NULL for partition
    struct block_device *bdev;  // Block device
    unsigned long max;          // Max slots in this swap area
    unsigned char *swap_map;    // Count of users per slot
    unsigned long inuse_pages;  // Number of slots in use
    unsigned long lowest_bit;   // Hint for free slot search
    unsigned long highest_bit;
};

// Global array of swap devices
static struct swap_info_struct *swap_info[MAX_SWAPFILES];

// Encode a swap entry
static inline swp_entry_t make_swap_entry(int type, unsigned long offset) {
    swp_entry_t entry;
    entry.val = (type << SWAP_TYPE_SHIFT) | offset;
    return entry;
}

// Decode swap entry back to type and offset
static inline int swp_type(swp_entry_t entry) {
    return (entry.val >> SWAP_TYPE_SHIFT) & SWAP_TYPE_MASK;
}

static inline unsigned long swp_offset(swp_entry_t entry) {
    return entry.val & SWAP_OFFSET_MASK;
}

// Look up where to read a swapped page from
struct page *lookup_swap_cache(swp_entry_t entry) {
    // First check if page is in swap cache (recently swapped in/out)
    return find_get_page(swap_address_space(entry), swp_offset(entry));
}

sector_t get_swap_page_sector(swp_entry_t entry) {
    int type = swp_type(entry);
    unsigned long offset = swp_offset(entry);
    struct swap_info_struct *sis = swap_info[type];

    // Convert slot offset to 512-byte disk sectors
    return (sector_t)(offset * (PAGE_SIZE / 512));
}
```

When a page is swapped out, the PTE's present bit is cleared, but the rest of the entry isn't zeroed; it's filled with a swap entry. The OS interprets non-present PTEs to determine whether they're empty (never faulted) or contain a swap entry (previously present, now swapped). Architecture-specific bits help distinguish these cases.
When a page fault occurs on a previously-resident anonymous page, the page table entry contains the swap entry that tells the OS exactly where to find the page.
The Lookup Process:
1. Read the PTE: The faulting virtual address determines which PTE to examine.
2. Check for a swap entry: If pte_present() is false but pte_none() is also false, the PTE contains a swap entry.
3. Decode the swap entry: Extract the swap type and offset.
4. Check the swap cache: The page may already be in memory (in the swap cache).
5. Read from disk: If not cached, issue I/O to read from the swap device at the calculated sector.
The Swap Cache Optimization:
The swap cache keeps recently swapped pages in memory, indexed by their swap entry, even while they are mapped into processes. Benefits: a page faulted again soon after eviction (or by another process sharing it) can be reused without disk I/O, and a clean page whose copy in swap is still valid can be evicted again without being rewritten to disk.
```c
// Reading a page from swap

int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
                 unsigned long address, pte_t *pte, pte_t orig_pte,
                 unsigned int flags) {
    swp_entry_t entry;
    struct page *page;
    int ret = 0;

    // Step 1: Extract swap entry from PTE
    entry = pte_to_swp_entry(orig_pte);

    // Step 2: Try to find page in swap cache first
    page = lookup_swap_cache(entry);

    if (!page) {
        // Page not in swap cache - must read from disk

        // Step 3: Allocate a new page for the data
        page = alloc_page(GFP_HIGHUSER_MOVABLE);
        if (!page)
            return VM_FAULT_OOM;  // Out of memory

        // Step 4: Initiate disk read
        // This will block until I/O completes (or be handled async)
        ret = swap_readpage(page, entry);
        if (ret) {
            put_page(page);
            return VM_FAULT_SIGBUS;  // I/O error
        }

        // Step 5: Add to swap cache
        // Allows sharing with other processes and quick re-eviction
        add_to_swap_cache(page, entry);
    }

    // Step 6: Lock the page while we set up the mapping
    lock_page(page);

    // Step 7: Map the page into the process's address space
    set_pte_at(mm, address, pte, mk_pte(page, vma->vm_page_prot));

    // Step 8: Update accounting
    mm->_rss++;  // Resident set size increased

    // Step 9: Optionally free the swap slot
    // (Deferred until page is actually dirty again)
    swap_free(entry);

    unlock_page(page);
    // Simplified: a real kernel reports VM_FAULT_MAJOR when disk I/O was
    // needed and treats the swap-cache hit as a minor fault
    return VM_FAULT_MINOR;
}

// Low-level swap read
int swap_readpage(struct page *page, swp_entry_t entry) {
    struct swap_info_struct *sis = swap_info[swp_type(entry)];
    sector_t sector = swp_offset(entry) * (PAGE_SIZE >> 9);

    // Set up block I/O
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    bio->bi_bdev = sis->bdev;
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    // Submit and wait for I/O
    submit_bio_wait(bio);

    int error = bio->bi_status;
    bio_put(bio);
    return error;
}
```

Reading from swap is the most expensive operation in page fault handling. Even on modern SSDs, a 4KB read takes ~50-100 microseconds, which is tens of thousands of CPU cycles. On spinning disks, it can take 5-10 milliseconds. This is why avoiding swap (having enough RAM) is critical for performance.
File-backed pages are located differently from anonymous pages. Instead of swap entries, the OS uses the VMA's file mapping information.
The File Mapping Relationship:
A file-backed VMA contains:
- vm_file: the file being mapped
- vm_pgoff: the offset into the file (in pages) where this VMA starts

By combining the faulting address with this information, the OS can calculate exactly which file offset is needed:
file_offset = vm_pgoff + (address - vm_start) / PAGE_SIZE
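As a concrete sketch of this arithmetic (the addresses and offsets below are made up for illustration, but the formula is the one above):

```c
// Map a faulting address inside a file-backed VMA to a page offset in the file.
#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void) {
    unsigned long vm_start   = 0x7f0000400000UL;  // VMA start (hypothetical)
    unsigned long vm_pgoff   = 16;                // VMA begins 16 pages into the file
    unsigned long fault_addr = 0x7f0000403a10UL;  // faulting address inside the VMA

    unsigned long file_pgoff = vm_pgoff + (fault_addr - vm_start) / PAGE_SIZE;
    unsigned long file_byte  = file_pgoff * PAGE_SIZE;

    // Prints: page offset 19, i.e. the 4 KB block starting at byte 77824 of the file
    printf("page offset into file: %lu (byte offset %lu)\n", file_pgoff, file_byte);
    return 0;
}
```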
The Page Cache:
The kernel maintains a page cache—a cache of file pages in memory. Before reading from disk, the OS checks if the page is already cached:
Pages in the cache are indexed by a (file, offset) pair, so the same calculation that identifies the needed file offset also identifies the cache entry to look for.
```c
// Handling a file-backed page fault

int filemap_fault(struct vm_fault *vmf) {
    struct file *file = vmf->vma->vm_file;
    struct address_space *mapping = file->f_mapping;
    struct page *page;
    pgoff_t offset;

    // Step 1: Calculate file offset for faulting address
    offset = vmf->pgoff;  // Already calculated by caller
    // Equivalent to: (fault_addr - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff

retry:
    // Step 2: Look for page in page cache
    page = find_get_page(mapping, offset);

    if (!page) {
        // Page not in cache - must read from file

        // Step 3: Allocate a new page
        page = page_cache_alloc(mapping);
        if (!page)
            return VM_FAULT_OOM;

        // Step 4: Add to page cache (before reading)
        int err = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
        if (err) {
            put_page(page);
            if (err == -EEXIST) {
                // Race: another thread added it, retry
                goto retry;
            }
            return VM_FAULT_SIGBUS;
        }

        // Step 5: Read from file
        // This calls into the filesystem to read the page
        err = mapping->a_ops->readpage(file, page);
        if (err)
            return VM_FAULT_SIGBUS;

        // Wait for read to complete
        lock_page(page);
        if (!PageUptodate(page)) {
            unlock_page(page);
            return VM_FAULT_SIGBUS;
        }
    }

    // Step 6: Page is now in memory (cached or just read)
    vmf->page = page;
    return VM_FAULT_LOCKED;  // Return with page locked
}

// The address_space operations for file mappings
const struct address_space_operations ext4_aops = {
    .readpage    = ext4_readpage,     // Read single page
    .readahead   = ext4_readahead,    // Read multiple pages ahead
    .writepage   = ext4_writepage,    // Write dirty page back
    .write_begin = ext4_write_begin,
    .write_end   = ext4_write_end,
};
```

| Result | Action | Performance |
|---|---|---|
| Page in cache, uptodate | Use immediately | ~microseconds |
| Page in cache, being read | Wait for I/O completion | Depends on remaining I/O |
| Page not in cache | Allocate, add to cache, read from file | ~milliseconds (disk I/O) |
When reading a file-backed page, the kernel often reads additional pages speculatively (readahead). If the access pattern is sequential, these prefetched pages will already be in cache when next faulted. This dramatically improves performance for streaming access patterns like reading large files.
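Applications can also hint this prefetching from user space. The sketch below (the input path is hypothetical) maps a file and uses madvise with MADV_SEQUENTIAL and MADV_WILLNEED, the standard Linux interfaces for telling the kernel the access pattern is sequential and that readahead into the page cache is welcome:

```c
// Hint readahead for a mapped file: MADV_SEQUENTIAL favours aggressive
// readahead; MADV_WILLNEED asks the kernel to start reading pages into
// the page cache right away.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/bigfile.dat";  // hypothetical input file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);  // sequential access pattern
    madvise(p, st.st_size, MADV_WILLNEED);    // prefetch into the page cache

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)  // touch each page in order;
        sum += p[i];                               // most faults hit cached pages

    printf("checksum: %ld\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```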
Let's put it all together. When a page fault occurs, the OS follows a systematic algorithm to locate the page content:
Step-by-Step Algorithm:
1. Extract the faulting address from CR2 (or the architecture's equivalent register).
2. Find the VMA containing the address; if none exists and stack expansion does not apply, deliver SIGSEGV.
3. Validate permissions against the access type (read, write, or instruction fetch).
4. Examine the PTE: is it empty (never faulted in), or does it hold a swap entry?
5. Determine the source based on the VMA and PTE: zero-fill for a first-touch anonymous page, swap for a previously evicted anonymous page, or the mapped file for a file-backed page (see the sketch after this list).
6. Perform the read (if needed), checking the swap cache or page cache first.
7. Map the page into the page table.
8. Return to user mode and restart the faulting instruction.
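The decision logic of steps 4 and 5 can be condensed into a small, self-contained model. The sketch below uses simplified stand-in types (not the kernel's real structures) purely to show how the VMA and PTE together select the page source:

```c
// Self-contained model of "where does the page content come from?".
// Types and values are simplified stand-ins, not the kernel's structures.
#include <stdbool.h>
#include <stdio.h>

struct vma { unsigned long start, end; bool file_backed, shared; };
struct pte { bool present; bool has_swap_entry; };  // non-present + swap entry = swapped out

enum source { BAD_ADDRESS, ALREADY_MAPPED, ZERO_FILL, SWAP, FILE_READ, FILE_COW };

enum source locate(const struct vma *v, const struct pte *p, unsigned long addr) {
    if (!v || addr < v->start || addr >= v->end)
        return BAD_ADDRESS;                    // no VMA: SIGSEGV
    if (p->present)
        return ALREADY_MAPPED;                 // e.g. permission/COW fault, no content lookup
    if (p->has_swap_entry)
        return SWAP;                           // anonymous page, previously evicted
    if (v->file_backed)
        return v->shared ? FILE_READ : FILE_COW;
    return ZERO_FILL;                          // anonymous page, first touch
}

int main(void) {
    struct vma heap = { 0x10000, 0x20000, false, false };
    struct pte fresh = { false, false }, swapped = { false, true };

    printf("%d\n", locate(&heap, &fresh,   0x10abc));  // ZERO_FILL (2)
    printf("%d\n", locate(&heap, &swapped, 0x10abc));  // SWAP (3)
    printf("%d\n", locate(&heap, &fresh,   0x30000));  // BAD_ADDRESS (0)
    return 0;
}
```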
Most page faults are either minor faults (page already in memory, just needs mapping) or zero-fill (new anonymous page). These are handled entirely in memory without any disk I/O. Only major faults—reading from swap or file—incur the expensive disk access.
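You can see this split directly with getrusage, which reports ru_minflt (minor faults) and ru_majflt (major faults, i.e. faults that required I/O) for the calling process. A minimal sketch:

```c
// Report how many page faults this process has taken so far, split into
// minor (handled in memory) and major (required a read from swap or a file).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    // Cause some minor faults: first touch of freshly allocated anonymous memory.
    size_t len = 16UL << 20;  // 16 MiB
    char *buf = malloc(len);
    memset(buf, 0, len);      // zero-fill-on-demand faults, no disk I/O

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld\n", ru.ru_minflt);  // in-memory only
    printf("major faults: %ld\n", ru.ru_majflt);  // needed disk I/O

    free(buf);
    return 0;
}
```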
Real-world page fault handling involves numerous edge cases and complications that production kernels must handle:
Race Conditions: two threads can fault on the same page simultaneously, and the PTE can change between the fault and the handler's work, so the handler must re-check the PTE under the page table lock.

Special Page Types: huge pages and the shared zero page, among others, take different paths through the fault handler.

Error Conditions: allocation can fail under memory pressure (possibly invoking the OOM killer), and disk reads can fail, turning the fault into a SIGBUS.

Exotic Scenarios: on NUMA machines the page may reside on a remote node and be worth migrating, and mechanisms such as swap readahead add further variations.
```c
// Handling race conditions in page fault

int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                     unsigned long address, pmd_t *pmd, pte_t *pte,
                     pte_t entry, unsigned int flags) {
    spinlock_t *ptl;

    // Take the page table lock to prevent races
    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);

    // Re-read the PTE - it might have changed!
    if (unlikely(!pte_same(*pte, entry))) {
        // Another thread handled this fault already
        spin_unlock(ptl);
        return 0;  // Retry the access
    }

    // ... handle the fault ...

    spin_unlock(ptl);
    return result;
}

// Handling out-of-memory during fault
int do_fault(struct vm_fault *vmf) {
    struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page) {
        // Out of memory!
        // Try to reclaim some pages and retry
        if (should_reclaim_retry(GFP_HIGHUSER_MOVABLE))
            return VM_FAULT_RETRY;

        // Really out of memory - invoke OOM killer or return error
        return VM_FAULT_OOM;
    }
    // ... continue with page we got ...
}

// NUMA-aware page placement
int do_numa_page_fault(struct vm_fault *vmf) {
    struct page *page = vmf->page;
    int page_nid = page_to_nid(page);  // NUMA node of page
    int cpu_nid = numa_node_id();      // Current CPU's node

    if (page_nid != cpu_nid) {
        // Page is on a remote NUMA node
        // Consider migrating it for better performance
        if (should_migrate_page(page, cpu_nid)) {
            migrate_page_to_node(page, cpu_nid);
        }
    }
    // ... continue ...
}
```

Production kernels like Linux handle dozens of edge cases that aren't shown in simplified examples. The actual mm/memory.c file in Linux is thousands of lines of carefully crafted code handling races, errors, special cases, and performance optimizations. Understanding the fundamental algorithm is essential, but production code is considerably more complex.
Finding a page's location on disk is a multi-step process involving several OS data structures. The VMA tells us about the region's properties, the PTE tells us about the page's current state, and the combination determines where to get the content. Let's consolidate:

- Evicted anonymous pages are found through the swap entry stored in their non-present PTE, which encodes a swap device and a slot offset.
- File-backed pages are found through the VMA's vm_file and vm_pgoff, after first checking the page cache.
- Pages that were never present need no disk lookup at all: anonymous ones are zero-filled, file-backed ones are read from the file at the computed offset.
What's Next:
With the page located (in swap or in a file), the OS must bring it into physical memory. The next page explores Load into Frame—how the OS allocates a physical frame, initiates disk I/O to read the page content, and updates the page table to reflect the new mapping.
You now understand how the OS locates page content on secondary storage. Whether a page is anonymous (swap-backed) or file-backed, the OS has systematic ways to determine where the data resides. Next, we'll explore how that data is actually loaded into physical memory.