When a page fault occurs, the CPU has told the OS: "This virtual address doesn't map to any physical frame." Now the OS faces a critical question: Where is the data for this page?
This seemingly simple question has surprisingly complex answers. The page might be:

- a brand-new anonymous page that has never existed and only needs to be zero-filled,
- an anonymous page that was previously resident but has been evicted to swap space,
- a file-backed page whose content must be read from the mapped file,
- or a page that is already in memory (in the swap cache or page cache) and merely needs to be mapped.
Finding the page requires consulting multiple OS data structures, understanding the process's memory mappings, and potentially navigating swap space or filesystem metadata. This page explores every aspect of how the OS locates page content, from the high-level policy decisions to the low-level data structure lookups.
By the end of this page, you will understand: (1) How Virtual Memory Areas (VMAs) describe a process's address space, (2) The difference between anonymous and file-backed pages, (3) How swap space is organized and entries are located, (4) The page table entry's role in tracking swapped pages, (5) The complete lookup process from fault address to disk location.
The first step in finding a page is determining whether the faulting address is valid for the process at all. The OS maintains data structures that describe which regions of the virtual address space are legitimate.
The VMA Concept:
A Virtual Memory Area (VMA) represents a contiguous region of the virtual address space with uniform properties: the same access permissions (read/write/execute), the same backing (anonymous, or a specific file at a specific offset), and the same sharing mode (private or shared).
Linux's mm_struct:
In Linux, each process has an mm_struct containing the process's VMAs (kept both in a linked list and in a red-black tree for fast lookup), a pointer to the root of the page tables, and accounting information such as the resident set size.
When a page fault occurs, the first action is to search the VMA list for an entry containing the faulting address.
```c
// Simplified Linux VMA structure
// Actual structure has many more fields

struct vm_area_struct {
    // Address range
    unsigned long vm_start;          // Start address (inclusive)
    unsigned long vm_end;            // End address (exclusive)

    // Linkage
    struct vm_area_struct *vm_next;  // Next VMA in list
    struct vm_area_struct *vm_prev;  // Previous VMA in list
    struct rb_node vm_rb;            // Red-black tree node for fast lookup

    // Memory descriptor (owning process)
    struct mm_struct *vm_mm;

    // Page protection
    pgprot_t vm_page_prot;           // Access permissions
    unsigned long vm_flags;          // Flags (VM_READ, VM_WRITE, VM_EXEC, etc.)

    // Backing storage
    struct file *vm_file;            // File being mapped (NULL for anonymous)
    unsigned long vm_pgoff;          // Offset into file in PAGE_SIZE units

    // Operations
    const struct vm_operations_struct *vm_ops;  // Callbacks for fault handling

    // For anonymous pages: link to anon_vma for reverse mapping
    struct anon_vma *anon_vma;
};

// Common VM flags
#define VM_READ       0x00000001   // Can read
#define VM_WRITE      0x00000002   // Can write
#define VM_EXEC       0x00000004   // Can execute
#define VM_SHARED     0x00000008   // Shared mapping
#define VM_MAYREAD    0x00000010   // May be read
#define VM_MAYWRITE   0x00000020   // May be written
#define VM_MAYEXEC    0x00000040   // May be executed
#define VM_GROWSDOWN  0x00000100   // Stack: grows downward
#define VM_DENYWRITE  0x00000800   // Deny write to file
#define VM_LOCKED     0x00002000   // Pages are locked in memory

// Find the first VMA whose vm_end lies above the address
// (the caller still checks vm_start, e.g. for stack expansion)
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) {
    struct vm_area_struct *vma;

    // Use red-black tree for O(log n) lookup
    vma = rb_tree_lookup(&mm->mm_rb, addr);

    if (vma && addr < vma->vm_end)
        return vma;

    return NULL;  // No VMA at or above this address
}
```

Linux organizes VMAs in both a linked list (for sequential iteration) and a red-black tree (for fast lookup). The tree structure is essential for performance: a process can have hundreds of VMAs, and page faults happen frequently, so O(log n) lookup instead of O(n) is critical.
When a page fault occurs, the OS searches for the VMA containing the faulting address. The outcome of this search determines the next steps:
Case 1: No VMA Found
If no VMA contains the address, the access is invalid. This typically results in the kernel delivering a SIGSEGV signal to the process (the familiar segmentation fault), which terminates the process unless it has installed a handler.
However, there's a special case: stack expansion. If the address is just below a stack VMA (marked with VM_GROWSDOWN), the OS may expand the stack to include the new address.
Case 2: VMA Found, Permission Denied
The VMA exists, but the access type doesn't match its permissions: for example, a write to a read-only mapping or an instruction fetch from a non-executable region. The outcome is either a SIGSEGV, or, for private writable mappings that are merely write-protected at the moment, a copy-on-write fault that the kernel resolves transparently.
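In both failure cases the process receives SIGSEGV, carrying the faulting address. The small user-space sketch below (not part of the kernel code; the chosen address 0x1000 is arbitrary and assumed to be unmapped) installs a SA_SIGINFO handler and prints si_addr, which is the same address the kernel extracted from CR2.

```c
// User-space view of a failed VMA lookup: the kernel finds no valid VMA
// (or the permissions forbid the access) and delivers SIGSEGV.
// si_addr carries the faulting virtual address.
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char msg[64];
    int n = snprintf(msg, sizeof msg, "SIGSEGV at address %p\n", info->si_addr);
    write(STDERR_FILENO, msg, (size_t)n);  // write() is async-signal-safe
    _exit(1);                              // returning would just re-fault
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    // A low address that almost certainly lies inside no VMA of this process
    volatile int *bad = (volatile int *)0x1000;
    *bad = 42;  // triggers the invalid-access path described above
    return 0;
}
```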
Case 3: VMA Found, Access Permitted
The address is valid and the access is permitted. Now the OS must determine where to get the page content.
```c
// Page fault handler: VMA lookup and validation phase

static int __do_page_fault(struct mm_struct *mm, unsigned long address,
                           unsigned int flags, struct pt_regs *regs) {
    struct vm_area_struct *vma;
    int fault_type = 0;

    // Step 1: Find VMA containing the faulting address
    vma = find_vma(mm, address);

    if (!vma) {
        // No VMA contains this address
        return VM_FAULT_SIGSEGV;  // Bad address
    }

    if (vma->vm_start > address) {
        // Address is below VMA - maybe stack expansion?
        if (!(vma->vm_flags & VM_GROWSDOWN))
            return VM_FAULT_SIGSEGV;  // Can't grow
        if (expand_stack(vma, address))
            return VM_FAULT_SIGSEGV;  // Expansion failed
    }

    // Step 2: Check permissions
    if (flags & FAULT_FLAG_WRITE) {
        if (!(vma->vm_flags & VM_WRITE)) {
            // Write to non-writable page
            if (!(vma->vm_flags & VM_MAYWRITE))
                return VM_FAULT_SIGSEGV;  // Definitely not writable
            // Might be copy-on-write - handled later
            fault_type |= FAULT_TYPE_COW;
        }
    }

    if (flags & FAULT_FLAG_INSTRUCTION) {
        if (!(vma->vm_flags & VM_EXEC))
            return VM_FAULT_SIGSEGV;  // Execute on non-exec page
    }

    // Step 3: VMA is valid, access is potentially OK
    // Now determine where to get the page content
    return handle_mm_fault(mm, vma, address, flags);
}
```

Once the VMA is located and permissions validated, the OS must determine the source of the page content. Pages fall into two fundamental categories:
Anonymous Pages:
Anonymous pages are not backed by any file. They include the heap, the stack, zero-initialized data (BSS), and memory obtained via anonymous mmap.
Characteristics of anonymous pages: they start life zero-filled, they have no file to fall back on (so evicting them requires writing to swap), and their contents disappear when the process exits.
File-Backed Pages:
File-backed pages are mapped from files on disk. They include program text (code), shared libraries (.so files), and files mapped explicitly with mmap, whether shared or private.
Characteristics of file-backed pages: their initial content is read from the file, clean pages can simply be discarded (and re-read later), dirty pages are written back to the file (shared mappings) or to swap (private mappings), and the file itself outlives the process.
| Aspect | Anonymous Pages | File-Backed Pages |
|---|---|---|
| Initial content | Zero-filled | Read from file |
| Eviction (clean) | Write to swap | Discard (can re-read from file) |
| Eviction (dirty) | Write to swap | Write back (shared) or swap (private) |
| VMA has file? | vm_file = NULL | vm_file points to file |
| Examples | Heap, stack, BSS | Text, mmap files, .so libs |
| Persistence | None (process lifetime) | File outlives process |
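You can observe these two categories directly on a running Linux system: /proc/self/maps lists every VMA of the calling process, showing the address range, permissions, file offset, and the backing file (or pseudo-names like [heap] and [stack] for anonymous regions). A minimal sketch:

```c
// Dump this process's VMAs. File-backed regions show a pathname;
// anonymous regions show no file (or pseudo-names like [heap], [stack]).
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);  // e.g. "7f3a... r-xp 00000000 ... /usr/lib/libc.so.6"

    fclose(f);
    return 0;
}
```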
```c
// Determine page source based on VMA

enum page_source {
    SOURCE_ZERO_FILL,  // New anonymous page - fill with zeros
    SOURCE_SWAP,       // Previously evicted anonymous page
    SOURCE_FILE,       // File-backed page - read from file
    SOURCE_FILE_COW,   // Private file mapping - might need COW
};

enum page_source determine_page_source(struct vm_area_struct *vma,
                                        unsigned long address, pte_t *pte) {
    // Check if page was previously present (now swapped out)
    if (!pte_none(*pte) && !pte_present(*pte)) {
        // PTE has a swap entry - page was evicted to swap
        return SOURCE_SWAP;
    }

    // Check if VMA is file-backed
    if (vma->vm_file != NULL) {
        // File-backed VMA
        if (vma->vm_flags & VM_SHARED) {
            // Shared mapping - reads/writes go to file
            return SOURCE_FILE;
        } else {
            // Private mapping - might need COW on write
            return SOURCE_FILE_COW;
        }
    }

    // Anonymous VMA, first access
    return SOURCE_ZERO_FILL;
}

// Linux handles this through vm_operations_struct
// Each VMA type has its own fault handler

static const struct vm_operations_struct generic_file_vm_ops = {
    .fault        = filemap_fault,         // Handle file-backed page faults
    .map_pages    = filemap_map_pages,     // Pre-map surrounding pages
    .page_mkwrite = filemap_page_mkwrite,  // Handle COW
};

// Shared-memory (tmpfs) mappings use a different handler;
// plain anonymous VMAs have no vm_ops and are handled by the core fault code
static const struct vm_operations_struct shmem_vm_ops = {
    .fault = shmem_fault,
};
```

When a process requests memory (malloc/brk), the OS doesn't immediately allocate physical pages. It just creates a VMA. Physical allocation happens on first access: a page fault triggers, the OS allocates a zero-filled page, and maps it. This "zero-fill-on-demand" means only actually-used pages consume physical memory.
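Zero-fill-on-demand is easy to observe from user space. The sketch below (assuming Linux, where /proc/self/status reports resident memory in its VmRSS field) maps a large anonymous region and shows that resident memory only grows once the pages are actually touched:

```c
// Observe zero-fill-on-demand: an untouched anonymous mapping costs
// (almost) no physical memory; touching pages faults them in one by one.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long vm_rss_kb(void) {  // parse VmRSS from /proc/self/status
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof line, f))
        if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
            break;
    if (f) fclose(f);
    return kb;
}

int main(void) {
    size_t len = 64UL << 20;  // 64 MiB
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("RSS after mmap:   %ld kB\n", vm_rss_kb());  // barely changed
    memset(p, 1, len);                                   // fault in every page
    printf("RSS after memset: %ld kB\n", vm_rss_kb());  // grows by roughly 64 MiB

    munmap(p, len);
    return 0;
}
```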
When anonymous pages are evicted from memory, they're written to swap space—a region of disk dedicated to holding evicted pages. Understanding swap organization is crucial for understanding how pages are located.
Swap Space Types:
Swap Partition: A dedicated disk partition formatted for swap. Fastest option as there's no filesystem overhead.
Swap File: A regular file used as swap. More flexible (can be resized) but slightly slower due to filesystem indirection.
Swap Space Structure:
Swap space is divided into slots (or pages), each exactly one page in size. A swap entry identifies which slot contains a particular page:
```c
// Swap entry encoding
// When a page is swapped out, the PTE holds a swap entry instead of a PFN

// Swap entry format (varies by architecture, this is conceptual)
// +-------+-------------+
// | Type  | Swap Offset |
// +-------+-------------+
//   5 bits   (varies)

typedef struct {
    unsigned long val;
} swp_entry_t;

// Multiple swap devices can be configured
#define MAX_SWAPFILES 32

// Swap info for each swap device
struct swap_info_struct {
    unsigned long flags;        // SWP_USED, SWP_WRITEOK, etc.
    struct file *swap_file;     // File or NULL for partition
    struct block_device *bdev;  // Block device
    unsigned long max;          // Max slots in this swap area
    unsigned char *swap_map;    // Count of users per slot
    unsigned long inuse_pages;  // Number of slots in use
    unsigned long lowest_bit;   // Hint for free slot search
    unsigned long highest_bit;
};

// Global array of swap devices
static struct swap_info_struct *swap_info[MAX_SWAPFILES];

// Encode a swap entry
static inline swp_entry_t make_swap_entry(int type, unsigned long offset) {
    swp_entry_t entry;
    entry.val = (type << SWAP_TYPE_SHIFT) | offset;
    return entry;
}

// Decode swap entry back to type and offset
static inline int swp_type(swp_entry_t entry) {
    return (entry.val >> SWAP_TYPE_SHIFT) & SWAP_TYPE_MASK;
}

static inline unsigned long swp_offset(swp_entry_t entry) {
    return entry.val & SWAP_OFFSET_MASK;
}

// Look up where to read a swapped page from
struct page *lookup_swap_cache(swp_entry_t entry) {
    // First check if page is in swap cache (recently swapped in/out)
    return find_get_page(swap_address_space(entry), swp_offset(entry));
}

sector_t get_swap_page_sector(swp_entry_t entry) {
    int type = swp_type(entry);
    unsigned long offset = swp_offset(entry);
    struct swap_info_struct *sis = swap_info[type];

    // Convert slot offset to 512-byte disk sectors
    return (sector_t)(offset * (PAGE_SIZE / 512));
}
```

When a page is swapped out, the PTE's present bit is cleared, but the rest of the entry isn't zeroed; it's filled with a swap entry. The OS interprets non-present PTEs to determine whether they're empty (never faulted) or contain a swap entry (previously present, now swapped). Architecture-specific bits help distinguish these cases.
When a page fault occurs on a previously-resident anonymous page, the page table entry contains the swap entry that tells the OS exactly where to find the page.
The Lookup Process:
1. Read the PTE: The faulting virtual address determines which PTE to examine.
2. Check for a swap entry: If pte_present() is false but pte_none() is also false, the PTE contains a swap entry.
3. Decode the swap entry: Extract the swap type and offset.
4. Check the swap cache: The page may already be in memory (in the swap cache).
5. Read from disk: If not cached, issue I/O to read from the swap device at the calculated sector.
The Swap Cache Optimization:
The swap cache keeps recently swapped pages in memory, indexed by their swap entry, even while they are mapped into processes. Benefits: a page faulted again soon after eviction (or by another process sharing it) can be reused without disk I/O, and a clean page whose copy in swap is still valid can be evicted again without being rewritten to disk.
```c
// Reading a page from swap

int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
                 unsigned long address, pte_t *pte, pte_t orig_pte,
                 unsigned int flags) {
    swp_entry_t entry;
    struct page *page;
    int ret = 0;

    // Step 1: Extract swap entry from PTE
    entry = pte_to_swp_entry(orig_pte);

    // Step 2: Try to find page in swap cache first
    page = lookup_swap_cache(entry);

    if (!page) {
        // Page not in swap cache - must read from disk

        // Step 3: Allocate a new page for the data
        page = alloc_page(GFP_HIGHUSER_MOVABLE);
        if (!page)
            return VM_FAULT_OOM;  // Out of memory

        // Step 4: Initiate disk read
        // This will block until I/O completes (or be handled async)
        ret = swap_readpage(page, entry);
        if (ret) {
            put_page(page);
            return VM_FAULT_SIGBUS;  // I/O error
        }

        // Step 5: Add to swap cache
        // Allows sharing with other processes and quick re-eviction
        add_to_swap_cache(page, entry);
    }

    // Step 6: Lock the page while we set up the mapping
    lock_page(page);

    // Step 7: Map the page into the process's address space
    set_pte_at(mm, address, pte, mk_pte(page, vma->vm_page_prot));

    // Step 8: Update accounting
    mm->_rss++;  // Resident set size increased

    // Step 9: Optionally free the swap slot
    // (Deferred until page is actually dirty again)
    swap_free(entry);

    unlock_page(page);
    // Simplified: a real kernel reports VM_FAULT_MAJOR when disk I/O was
    // needed and treats the swap-cache hit as a minor fault
    return VM_FAULT_MINOR;
}

// Low-level swap read
int swap_readpage(struct page *page, swp_entry_t entry) {
    struct swap_info_struct *sis = swap_info[swp_type(entry)];
    sector_t sector = swp_offset(entry) * (PAGE_SIZE >> 9);

    // Set up block I/O
    struct bio *bio = bio_alloc(GFP_KERNEL, 1);
    bio->bi_bdev = sis->bdev;
    bio->bi_iter.bi_sector = sector;
    bio_add_page(bio, page, PAGE_SIZE, 0);

    // Submit and wait for I/O
    submit_bio_wait(bio);

    int error = bio->bi_status;
    bio_put(bio);
    return error;
}
```

Reading from swap is the most expensive operation in page fault handling. Even on modern SSDs, a 4KB read takes ~50-100 microseconds, which is tens of thousands of CPU cycles. On spinning disks, it can take 5-10 milliseconds. This is why avoiding swap (having enough RAM) is critical for performance.
File-backed pages are located differently from anonymous pages. Instead of swap entries, the OS uses the VMA's file mapping information.
The File Mapping Relationship:
A file-backed VMA contains:
- vm_file: the file being mapped
- vm_pgoff: the offset into the file (in pages) where this VMA starts

By combining the faulting address with this information, the OS can calculate exactly which file offset is needed:
file_offset = vm_pgoff + (address - vm_start) / PAGE_SIZE
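As a concrete sketch of this arithmetic (the addresses and offsets below are made up for illustration, but the formula is the one above):

```c
// Map a faulting address inside a file-backed VMA to a page offset in the file.
#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void) {
    unsigned long vm_start   = 0x7f0000400000UL;  // VMA start (hypothetical)
    unsigned long vm_pgoff   = 16;                // VMA begins 16 pages into the file
    unsigned long fault_addr = 0x7f0000403a10UL;  // faulting address inside the VMA

    unsigned long file_pgoff = vm_pgoff + (fault_addr - vm_start) / PAGE_SIZE;
    unsigned long file_byte  = file_pgoff * PAGE_SIZE;

    // Prints: page offset 19, i.e. the 4 KB block starting at byte 77824 of the file
    printf("page offset into file: %lu (byte offset %lu)\n", file_pgoff, file_byte);
    return 0;
}
```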
The Page Cache:
The kernel maintains a page cache—a cache of file pages in memory. Before reading from disk, the OS checks if the page is already cached:
Pages in the cache are indexed by a (file, offset) pair, so the same calculation that identifies the needed file offset also identifies the cache entry to look for.
```c
// Handling a file-backed page fault

int filemap_fault(struct vm_fault *vmf) {
    struct file *file = vmf->vma->vm_file;
    struct address_space *mapping = file->f_mapping;
    struct page *page;
    pgoff_t offset;

    // Step 1: Calculate file offset for faulting address
    offset = vmf->pgoff;  // Already calculated by caller
    // Equivalent to: (fault_addr - vma->vm_start) / PAGE_SIZE + vma->vm_pgoff

retry:
    // Step 2: Look for page in page cache
    page = find_get_page(mapping, offset);

    if (!page) {
        // Page not in cache - must read from file

        // Step 3: Allocate a new page
        page = page_cache_alloc(mapping);
        if (!page)
            return VM_FAULT_OOM;

        // Step 4: Add to page cache (before reading)
        int err = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
        if (err) {
            put_page(page);
            if (err == -EEXIST) {
                // Race: another thread added it, retry
                goto retry;
            }
            return VM_FAULT_SIGBUS;
        }

        // Step 5: Read from file
        // This calls into the filesystem to read the page
        err = mapping->a_ops->readpage(file, page);
        if (err)
            return VM_FAULT_SIGBUS;

        // Wait for read to complete
        lock_page(page);
        if (!PageUptodate(page)) {
            unlock_page(page);
            return VM_FAULT_SIGBUS;
        }
    }

    // Step 6: Page is now in memory (cached or just read)
    vmf->page = page;
    return VM_FAULT_LOCKED;  // Return with page locked
}

// The address_space operations for file mappings
const struct address_space_operations ext4_aops = {
    .readpage    = ext4_readpage,     // Read single page
    .readahead   = ext4_readahead,    // Read multiple pages ahead
    .writepage   = ext4_writepage,    // Write dirty page back
    .write_begin = ext4_write_begin,
    .write_end   = ext4_write_end,
};
```

| Result | Action | Performance |
|---|---|---|
| Page in cache, uptodate | Use immediately | ~microseconds |
| Page in cache, being read | Wait for I/O completion | Depends on remaining I/O |
| Page not in cache | Allocate, add to cache, read from file | ~milliseconds (disk I/O) |
When reading a file-backed page, the kernel often reads additional pages speculatively (readahead). If the access pattern is sequential, these prefetched pages will already be in cache when next faulted. This dramatically improves performance for streaming access patterns like reading large files.
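Applications can also hint this prefetching from user space. The sketch below (the input path is hypothetical) maps a file and uses madvise with MADV_SEQUENTIAL and MADV_WILLNEED, the standard Linux interfaces for telling the kernel the access pattern is sequential and that readahead into the page cache is welcome:

```c
// Hint readahead for a mapped file: MADV_SEQUENTIAL favours aggressive
// readahead; MADV_WILLNEED asks the kernel to start reading pages into
// the page cache right away.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "/tmp/bigfile.dat";  // hypothetical input file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    madvise(p, st.st_size, MADV_SEQUENTIAL);  // sequential access pattern
    madvise(p, st.st_size, MADV_WILLNEED);    // prefetch into the page cache

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)  // touch each page in order;
        sum += p[i];                               // most faults hit cached pages

    printf("checksum: %ld\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```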
Let's put it all together. When a page fault occurs, the OS follows a systematic algorithm to locate the page content:
Step-by-Step Algorithm:
1. Extract the faulting address from CR2 (or the architecture's equivalent register).
2. Find the VMA containing the address; if none exists and stack expansion does not apply, deliver SIGSEGV.
3. Validate permissions against the access type (read, write, or instruction fetch).
4. Examine the PTE: is it empty (never faulted in), or does it hold a swap entry?
5. Determine the source based on the VMA and PTE: zero-fill for a first-touch anonymous page, swap for a previously evicted anonymous page, or the mapped file for a file-backed page (see the sketch after this list).
6. Perform the read (if needed), checking the swap cache or page cache first.
7. Map the page into the page table.
8. Return to user mode and restart the faulting instruction.
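The decision logic of steps 4 and 5 can be condensed into a small, self-contained model. The sketch below uses simplified stand-in types (not the kernel's real structures) purely to show how the VMA and PTE together select the page source:

```c
// Self-contained model of "where does the page content come from?".
// Types and values are simplified stand-ins, not the kernel's structures.
#include <stdbool.h>
#include <stdio.h>

struct vma { unsigned long start, end; bool file_backed, shared; };
struct pte { bool present; bool has_swap_entry; };  // non-present + swap entry = swapped out

enum source { BAD_ADDRESS, ALREADY_MAPPED, ZERO_FILL, SWAP, FILE_READ, FILE_COW };

enum source locate(const struct vma *v, const struct pte *p, unsigned long addr) {
    if (!v || addr < v->start || addr >= v->end)
        return BAD_ADDRESS;                    // no VMA: SIGSEGV
    if (p->present)
        return ALREADY_MAPPED;                 // e.g. permission/COW fault, no content lookup
    if (p->has_swap_entry)
        return SWAP;                           // anonymous page, previously evicted
    if (v->file_backed)
        return v->shared ? FILE_READ : FILE_COW;
    return ZERO_FILL;                          // anonymous page, first touch
}

int main(void) {
    struct vma heap = { 0x10000, 0x20000, false, false };
    struct pte fresh = { false, false }, swapped = { false, true };

    printf("%d\n", locate(&heap, &fresh,   0x10abc));  // ZERO_FILL (2)
    printf("%d\n", locate(&heap, &swapped, 0x10abc));  // SWAP (3)
    printf("%d\n", locate(&heap, &fresh,   0x30000));  // BAD_ADDRESS (0)
    return 0;
}
```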
Most page faults are either minor faults (page already in memory, just needs mapping) or zero-fill (new anonymous page). These are handled entirely in memory without any disk I/O. Only major faults—reading from swap or file—incur the expensive disk access.
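You can see this split directly with getrusage, which reports ru_minflt (minor faults) and ru_majflt (major faults, i.e. faults that required I/O) for the calling process. A minimal sketch:

```c
// Report how many page faults this process has taken so far, split into
// minor (handled in memory) and major (required a read from swap or a file).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void) {
    // Cause some minor faults: first touch of freshly allocated anonymous memory.
    size_t len = 16UL << 20;  // 16 MiB
    char *buf = malloc(len);
    memset(buf, 0, len);      // zero-fill-on-demand faults, no disk I/O

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld\n", ru.ru_minflt);  // in-memory only
    printf("major faults: %ld\n", ru.ru_majflt);  // needed disk I/O

    free(buf);
    return 0;
}
```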
Real-world page fault handling involves numerous edge cases and complications that production kernels must handle:
Race Conditions: two threads can fault on the same page simultaneously, and the PTE can change between the fault and the handler's work, so the handler must re-check the PTE under the page table lock.

Special Page Types: huge pages and the shared zero page, among others, take different paths through the fault handler.

Error Conditions: allocation can fail under memory pressure (possibly invoking the OOM killer), and disk reads can fail, turning the fault into a SIGBUS.

Exotic Scenarios: on NUMA machines the page may reside on a remote node and be worth migrating, and mechanisms such as swap readahead add further variations.
```c
// Handling race conditions in page fault

int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                     unsigned long address, pmd_t *pmd, pte_t *pte,
                     pte_t entry, unsigned int flags) {
    spinlock_t *ptl;

    // Take the page table lock to prevent races
    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);

    // Re-read the PTE - it might have changed!
    if (unlikely(!pte_same(*pte, entry))) {
        // Another thread handled this fault already
        spin_unlock(ptl);
        return 0;  // Retry the access
    }

    // ... handle the fault ...

    spin_unlock(ptl);
    return result;
}

// Handling out-of-memory during fault
int do_fault(struct vm_fault *vmf) {
    struct page *page = alloc_page(GFP_HIGHUSER_MOVABLE);
    if (!page) {
        // Out of memory!
        // Try to reclaim some pages and retry
        if (should_reclaim_retry(GFP_HIGHUSER_MOVABLE))
            return VM_FAULT_RETRY;

        // Really out of memory - invoke OOM killer or return error
        return VM_FAULT_OOM;
    }
    // ... continue with page we got ...
}

// NUMA-aware page placement
int do_numa_page_fault(struct vm_fault *vmf) {
    struct page *page = vmf->page;
    int page_nid = page_to_nid(page);  // NUMA node of page
    int cpu_nid = numa_node_id();      // Current CPU's node

    if (page_nid != cpu_nid) {
        // Page is on a remote NUMA node
        // Consider migrating it for better performance
        if (should_migrate_page(page, cpu_nid)) {
            migrate_page_to_node(page, cpu_nid);
        }
    }
    // ... continue ...
}
```

Production kernels like Linux handle dozens of edge cases that aren't shown in simplified examples. The actual mm/memory.c file in Linux is thousands of lines of carefully crafted code handling races, errors, special cases, and performance optimizations. Understanding the fundamental algorithm is essential, but production code is considerably more complex.
Finding a page's location on disk is a multi-step process involving several OS data structures. The VMA tells us about the region's properties, the PTE tells us about the page's current state, and the combination determines where to get the content. Let's consolidate:

- Evicted anonymous pages are found through the swap entry stored in their non-present PTE, which encodes a swap device and a slot offset.
- File-backed pages are found through the VMA's vm_file and vm_pgoff, after first checking the page cache.
- Pages that were never present need no disk lookup at all: anonymous ones are zero-filled, file-backed ones are read from the file at the computed offset.
What's Next:
With the page located (in swap or in a file), the OS must bring it into physical memory. The next page explores Load into Frame—how the OS allocates a physical frame, initiates disk I/O to read the page content, and updates the page table to reflect the new mapping.
You now understand how the OS locates page content on secondary storage. Whether a page is anonymous (swap-backed) or file-backed, the OS has systematic ways to determine where the data resides. Next, we'll explore how that data is actually loaded into physical memory.